Design CMOS circuitry to unleash the potential of nonvolatile computing-in-memory in terms of energy-efficiency, speed, and scale for AI edge processors
We have integrated one-megabit resistive memory with innovative control and readout circuits on the same chip, using a 65 nm CMOS process, to build a nonvolatile computing-in-memory macro that achieves high energy efficiency and low latency for Boolean logic and multiply-and-accumulate operations.
Developing artificial intelligence (AI) processors for edge platforms is particularly challenging due to the need for low run-time energy consumption, ultra-low standby current, short computing latency, and cost effectiveness. The data-centric nature of AI algorithms further increases the difficulty in meeting these requirements when using the conventional von Neumann architecture, which is highly constrained in terms of latency and power consumption due to the movement of data between digital processors and memory as well as between the various levels of the memory hierarchy.
Non-volatile computing-in-memory (nvCIM) based on memristive devices could potentially circumvent the von Neumann bottleneck by performing the computation directly within the memristor array. This emerging computing paradigm, inspired by the structure of the human brain, can improve energy efficiency and computing speed by reducing the movement of data between the processor and memory, simplifying the memory hierarchy, and increasing computing parallelism.
A number of pioneering works have demonstrated the functionality of nvCIM using in-lab small-capacity memristor crossbar arrays and off-chip external testers for control, input, and readout. Developing a CMOS-integrated nvCIM macro (i.e., a fully functional circuit block integrating the memristive array with peripheral circuitry) is non-trivial. On-chip integration matters because it reduces latency and energy consumption by eliminating the large parasitic load induced by the interconnections between discrete components. It is also an indispensable step toward embedding nvCIM in AI processors and paving the way for mass manufacturing. Thus, advancements in fully CMOS-integrated macros could be decisive in the development of CIM technology. Nonetheless, critical challenges remain, pertaining to the rigorous constraints on energy and area efficiency associated with on-chip integration.
In a recently published work, we presented an nvCIM macro in which one-megabit resistive memory (ReRAM) was integrated with control and readout circuits on the same chip. It was fabricated using a 65 nm CMOS process with foundry-developed ReRAM technology. In CIM mode, it performs Boolean logic operations (AND, OR, and XOR) as well as multiply-and-accumulate (MAC) operations with negligible area overhead. The key enablers are summarized below. First, we adopted a one-transistor-one-resistor (1T1R) single-level cell (SLC) array to suppress sneak current and ensure sufficient read margin. Second, we used highly area-efficient digital word-line (WL) drivers for inputs to match the small WL pitch of the memory array. Third, we developed a multi-level readout scheme based on sense amplifiers with high area and energy efficiency for the readout of analogue computing signals. We also introduced several circuit-level techniques to enhance tolerance for device non-idealities (e.g., small resistance ratio, leakage current, and process variation), which have proven essential in dealing with the considerable process variation associated with large-scale nvCIMs.
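One way to picture how a single readout path can serve several Boolean operations: if two word lines are activated at once, the bit-line current grows with the number of selected cells in the low-resistance state, and a multi-level readout can distinguish a count of 0, 1, or 2. The Python sketch below is a purely behavioural model under that assumption, not a description of the macro's actual circuit. The function name `cim_boolean` and the mapping of a stored 1 to the low-resistance state are illustrative choices.

```python
def cim_boolean(bit_a: int, bit_b: int) -> dict:
    """Behavioural model: derive AND/OR/XOR from how many of two cells conduct."""
    # Assumption: a stored 1 is a low-resistance cell that contributes
    # one unit of bit-line current when its word line is activated.
    count = bit_a + bit_b
    return {
        "AND": int(count == 2),  # both cells conduct
        "OR": int(count >= 1),   # at least one cell conducts
        "XOR": int(count == 1),  # exactly one cell conducts
    }


# Print the full truth table for two stored bits.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, cim_boolean(a, b))
```

In this picture, AND and OR each need one reference level, while XOR needs two, which is why a readout able to resolve more than two levels is useful.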
Based on this nvCIM macro, we developed an inference system and a computational flow scheme to process deep neural networks based on a split binary-input ternary-weight model. The resulting nvCIM can be used to perform MAC operations for convolutional as well as fully-connected neural networks. In experiments, the proposed system achieved short computing latency (≤14.8 ns), high energy efficiency (≥16.95 TOPS/W), and high inference accuracy (98.8% on the MNIST dataset). Our results also revealed a trade-off between latency, energy efficiency, and the accuracy of nvCIM, which could be adjusted to satisfy the requirements of various applications.
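To make the binary-input ternary-weight idea concrete, the Python sketch below shows one common way such a model can be mapped onto single-level cells: each ternary weight in {-1, 0, +1} is split into a positive and a negative binary part, each part is accumulated separately, and the two partial sums are subtracted. This particular splitting is an assumption made for illustration and may differ from the scheme used in the macro. The helper names `split_ternary`, `binary_mac`, and `ternary_mac` are hypothetical.

```python
def split_ternary(weights):
    """Split ternary weights {-1, 0, +1} into two binary weight vectors."""
    pos = [1 if w == 1 else 0 for w in weights]
    neg = [1 if w == -1 else 0 for w in weights]
    return pos, neg


def binary_mac(inputs, cells):
    """Accumulate binary inputs against binary cells (models one bit-line sum)."""
    return sum(x & c for x, c in zip(inputs, cells))


def ternary_mac(inputs, weights):
    """MAC with binary inputs and ternary weights via two binary partial sums."""
    pos, neg = split_ternary(weights)
    # Assumption: the positive and negative parts are read out separately
    # and combined by subtraction outside the array.
    return binary_mac(inputs, pos) - binary_mac(inputs, neg)


inputs = [1, 0, 1, 1]     # binary activations
weights = [1, -1, 1, -1]  # ternary weights
print(ternary_mac(inputs, weights))  # 2 - 1 = 1
```

The appeal of a split of this kind is that every stored value stays binary, so it fits a single-level cell array, while the network still benefits from signed weights.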
This work experimentally demonstrates the advantages of CMOS-integrated nvCIM macros in terms of energy efficiency and computing latency. The proposed scheme also expands the range of computation tasks that can be performed by the CIM (i.e., Boolean logic and MAC operations) without imposing additional hardware costs. The proposed circuit techniques can effectively enhance the robustness of nvCIM macros against device non-idealities and process variation, which further advances nvCIM toward large-capacity computing structures as well as mass production. The concepts and techniques reported here are readily applicable to the development of CIM macros based on other types of memory. We expect this work will help to unleash the potential of nvCIM macros in AI edge computation.
For more information, please look for our recent publication in Nature Electronics, CMOS-integrated memristive non-volatile computing-in-memory for AI edge processors.