Chaos and Order


Introduction

Nearly every computer we have built shares one structure. The place that computes (the processor) and the place that stores data (memory) are separate, and data shuttles endlessly between them. This is the von Neumann architecture.

That separation has become a heavy cost in the AI era. For a single matrix multiply, we move many weights from memory to the processor and send results back. Computation itself is cheap, but this round trip eats up most of the time and power. This is the von Neumann bottleneck, the memory wall.

In-memory computing, or compute-in-memory (CIM), offers a provocative answer. "Do not move data to where you compute; compute right where the data already sits, inside the memory." This article unpacks that principle step by step, especially the idea of using the physics of memory cells themselves as the compute engine.

Rather than diving deep into physics, this article aims to build intuition for "why this idea arose, how it works, where it is used, and what is hard." Small numeric examples and ASCII diagrams let you follow the core principle even without knowing electrical circuits.

0. The Intuition in One Sentence

Before diving in, let us nail down one sentence that runs through this whole article.

> "The cost of pulling data out of memory and moving it to the compute unit is far higher than the computation done on that data. So do not move the data; compute right where the data sits."

That one sentence is the whole of in-memory computing. Everything else is concrete methodology for "how to compute right there." Terms like crossbar array, Ohm's law, Kirchhoff's law, ReRAM, and ADC appear later, but all of them are attempts to physically realize that one sentence.

Why has this idea become important now? Because as transistors shrank, computation kept getting cheaper, but the cost of moving data did not fall as much. The result was a reversal where "movement is more expensive than computation," and in workloads handling huge data like AI, this reversal became a decisive bottleneck. CIM answers that reversal head-on.

1. Revisiting the von Neumann Bottleneck

Picture one layer of a typical neural network inference. The core is multiplying an input vector by a weight matrix (a bundle of MAC, multiply-accumulate, operations).

standard flow

-----------------------------------

1. read the weight matrix from memory

2. read the input vector from memory

3. move to the processor, multiply and add

4. write results back to memory

-> data crosses the bus many times

The expensive steps are 1, 2, and 4: data movement. Step 1 dominates, because the weight matrix is far larger than the input and output vectors. As the model grows, there are more weights to move and the bus gets busier. The compute unit often sits idle waiting for data.
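To make the imbalance concrete, the sketch below counts how many values cross the bus and how many multiply-accumulates run for one matrix-vector layer. The function name `movement_vs_compute` and the 1024 x 1024 layer size are illustrative assumptions, and the model deliberately ignores caches and weight reuse.

```python
def movement_vs_compute(rows: int, cols: int) -> dict:
    """Count values moved and MACs for one matrix-vector multiply.

    Assumes a plain von Neumann flow with no caching or weight reuse:
    every weight and input is read once, every output is written once.
    """
    weights_read = rows * cols      # step 1: the whole weight matrix
    inputs_read = rows              # step 2: the input vector
    outputs_written = cols          # step 4: the result vector
    macs = rows * cols              # step 3: one multiply-accumulate per weight
    moved = weights_read + inputs_read + outputs_written
    return {
        "values_moved": moved,
        "macs": macs,
        "macs_per_value_moved": macs / moved,
        "weight_share_of_traffic": weights_read / moved,
    }


stats = movement_vs_compute(rows=1024, cols=1024)
print(stats)
```

Under these assumptions each fetched weight feeds exactly one MAC, and nearly all of the traffic is weights, which is the part CIM proposes to leave in place.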

CIM's core insight is this. The weights sit still in memory. So why drag them out at all? What if we only flow the input through and finish the multiply-accumulate inside the memory? Then the entire cost of moving weights disappears.

2. The Crossbar Array — Solving Matrix Multiply with Physics

The most elegant form of CIM is the crossbar array. Horizontal lines (wordlines) and vertical lines (bitlines) cross like a grid, with a resistive device at each intersection.

          col0    col1    col2
            |       |       |
    row0 ---+G00----+G01----+G02---
            |       |       |
    row1 ---+G10----+G11----+G12---
            |       |       |

    (each crossing G is a conductance = stored weight)

You only need two laws of physics here.

- **Ohm's law**: apply voltage V across a device of conductance G and a current I = V x G flows. That is multiplication.

- **Kirchhoff's current law**: when several currents gather on one vertical line, they add automatically. That is addition.

Store the weights as each device's conductance G, apply the input vector as voltages V on the horizontal lines, and the total current on each vertical line is the dot product (the MAC result) of the input and that column's weights. A matrix-vector multiply finishes in a single electrical operation. Instead of clocking countless digital multipliers, the physics produces the answer in one analog step, as fast as the currents settle.

apply input voltage on horizontal lines

at each crossing, I = V x G (Ohm's law: multiplication)

sum currents on vertical lines (Kirchhoff: addition)

vertical-line current = matrix-vector product result

This single operation is the source of the energy efficiency CIM promises. The weights never move, and the multiplications and additions happen physically at the same time.
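The same statement can be written as one formula. The index notation is an assumption added here to match the diagram: $V_i$ is the voltage on row $i$, $G_{ij}$ is the conductance at the crossing of row $i$ and column $j$, and $I_j$ is the current collected on column $j$.

```latex
I_j = \sum_{i} V_i \, G_{ij}
```

Each term is Ohm's law at one crossing, and the sum is Kirchhoff's current law on one column, so the vector of column currents is the matrix-vector product.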

3. Analog vs Digital In-Memory

CIM splits into two broad branches.

| Aspect | Analog CIM | Digital CIM |
| --- | --- | --- |
| Compute method | multiply-add via current/voltage | digital logic near memory |
| Energy efficiency | very high (potential) | high |
| Precision | vulnerable to noise | easy to preserve |
| Conversion cost | needs ADC/DAC | low conversion burden |
| Maturity | largely research stage | closer to commercial |

Analog CIM computes directly with electrical quantities, like the crossbar above. Its theoretical efficiency is highest, but the result is an analog current, so it needs an ADC (analog-to-digital converter), and that conversion consumes a lot of area and power. It is also sensitive to device variation and noise.

Digital CIM places small digital compute logic right beside the memory cells (typically SRAM bitcells), keeping digital precision without moving data far. Its efficiency is lower than analog's, but easier precision and control put it closer to commercialization.

4. Memory Devices — SRAM, ReRAM, PCM

Which memory technology CIM is built on is another big fork.

- **SRAM-based**: fits existing CMOS processes well, fast, reliable. But large cells mean low density, and data is lost when power is off (volatile). Often paired with digital CIM.

- **ReRAM (resistive RAM)**: non-volatile memory storing values as a device's resistance state. Small and dense, it fits crossbar analog CIM well. Challenges include device-to-device variation, write endurance, and resistance drift.

- **PCM (phase-change memory)**: stores values via a material's crystalline/amorphous state. It can represent multi-level values, favorable for analog weight storage, but drift over time and write energy are challenges.

The appeal of non-volatile devices (ReRAM, PCM) is that once weights are written, they persist even with power off. No need to reload weights at inference time, which fits CIM's ideal of "etch the weights permanently into memory and just flow the input."

5. Precision and Noise — The Core Trade-off

Analog CIM's biggest enemy is noise and inaccuracy. Digital multiply always gives exactly 6 for 2 times 3, but in the analog world the current does not land exactly.

The sources of the problem:

- **Device variation**: even with the same intended weight, conductance differs slightly per device.

- **Drift**: stored values change subtly over time.

- **Conversion noise**: the ADC introduces quantization error converting analog current to digital.

- **Crosstalk / IR drop**: wire resistance and leakage currents pull the column current away from its ideal value.
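The first item, device variation, is easy to imitate in a few lines. The sketch below perturbs each conductance with Gaussian relative noise and measures how far one column current strays from its ideal value. The noise model, the sigma values, and the helper names are assumptions chosen for illustration; drift, ADC error, and IR drop are not modeled.

```python
import random


def column_current(voltages, conductances):
    """Ideal column current: sum of V_i * G_i (Ohm's law plus Kirchhoff's current law)."""
    return sum(v * g for v, g in zip(voltages, conductances))


def perturbed(conductances, rel_sigma, rng):
    """Apply Gaussian relative variation to each device (an assumed noise model)."""
    return [g * (1.0 + rng.gauss(0.0, rel_sigma)) for g in conductances]


rng = random.Random(42)          # fixed seed so the run is repeatable
voltages = [4.0, 5.0]
conductances = [2.0, 3.0]        # intended weights for one column
ideal = column_current(voltages, conductances)

for rel_sigma in (0.0, 0.02, 0.10):
    trials = [
        column_current(voltages, perturbed(conductances, rel_sigma, rng))
        for _ in range(1000)
    ]
    mean_abs_error = sum(abs(t - ideal) for t in trials) / len(trials)
    print(f"sigma={rel_sigma:.2f}  ideal={ideal}  mean |error|={mean_abs_error:.3f}")
```

With zero variation the column current matches the ideal value exactly; as the assumed variation grows, the same stored weights yield a spread of results.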

Interestingly, neural network inference is fairly tolerant of such inaccuracy. Inference often works fine at lower precision (quantization) than training. So CIM research focuses on finding "the point where model accuracy survives despite noise." Common approaches model the noise and harden against it during training (noise-aware training), or process only critical layers digitally in a hybrid scheme.

the precision - efficiency tug of war

-----------------------------------

high bit precision -> accurate but costs efficiency/area

low bit precision -> efficient but risks accuracy

key to CIM design: find the lowest precision the model tolerates
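A small experiment illustrates the tug of war. The sketch below applies uniform symmetric quantization to a handful of weights and compares the dot product at 8, 4, and 2 bits. The weights, inputs, and the `quantize` helper are made up for this example and do not come from any real model or toolchain.

```python
def quantize(weights, bits):
    """Uniform symmetric quantization to `bits`-bit signed levels, then back to floats."""
    levels = 2 ** (bits - 1) - 1                  # e.g. 127 for 8 bits
    scale = max(abs(w) for w in weights) / levels
    if scale == 0:
        return list(weights)                      # all-zero weights need no rounding
    return [round(w / scale) * scale for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.82, -0.41, 0.13, -0.97, 0.55, 0.04]   # made-up example weights
inputs = [1.0, 0.5, -2.0, 0.25, 1.5, -1.0]         # made-up example inputs
exact = dot(weights, inputs)

for bits in (8, 4, 2):
    approx = dot(quantize(weights, bits), inputs)
    print(f"{bits}-bit weights: result={approx:.4f}  error={abs(approx - exact):.4f}")
```

For a real model the question is not the error of one dot product but whether end-to-end accuracy survives, which has to be measured on the model itself.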

6. The Energy Efficiency Advantage — Why It Matters

CIM draws attention not for raw speed but for energy efficiency. In an era where data-center power is becoming the real ceiling on AI scaling, "the same computation at far less power" is a powerful value.

The sources of the advantage:

- **Eliminating weight movement**: the weights stay where they are stored, so the energy of moving them vanishes. As noted, moving data costs far more than computing on it.

- **Parallelism**: a crossbar finishes a whole matrix-vector multiply in one operation. It is inherently massively parallel.

- **Leveraging non-volatility**: weights survive power-off and need no reloading, which cuts standby and wake-up cost.

Of course this advantage is potential under ideal conditions. Subtract ADC/DAC overhead, peripheral circuits, and precision-calibration cost, and the real gain shrinks. So CIM's practicality must be judged by "whole-system efficiency," not just "core-operation efficiency."

6.5. What It Means at the Edge — Battery and Always-On

The place where CIM's energy efficiency makes the most decisive difference is not the data center but the device in your hand. On battery-powered edge devices, a sliver of power is usage time and the very possibility of the product.

Picture concrete scenarios.

- **Always-on voice detection**: smart speakers and earbuds constantly run a small neural network to listen for "a specific keyword." If this small inference drains the battery, the product is unviable. CIM is ideal for such ultra-low-power always-on inference.

- **Wearable health sensors**: inferring on heart-rate and motion data inside the device to detect anomalies. Not sending data to the cloud is good for privacy too, and low power means long life.

- **Sensor nodes**: countless sensors on an industrial site each do small inference. Power budgets are extremely tight, so efficiency decides whether deployment is even possible.

| Cloud inference | Edge CIM inference |
| --- | --- |
| send data to a server | inference completes inside the device |
| network/latency/privacy burden | low power, low latency, privacy protected |
| large models possible | optimal for small, efficient models |

At the edge, the demand for "small, repetitive inference at extremely low power" overlaps exactly with CIM's strengths. If CIM is a complement in the data center, at the edge it has the potential to be a game changer.

7. Application to AI Inference

CIM fits best in inference, especially edge inference.

- Inference has fixed weights, naturally suiting CIM's model of etching them once and reusing them repeatedly.

- Edge devices (sensors, wearables, always-on voice detection) have extremely tight power budgets, where CIM's efficiency makes a decisive difference.

- Small, repetitive inference like always-on keyword spotting is exactly where CIM's strengths overwhelm its losses.

Training, by contrast, must update weights constantly and needs higher precision, which suits CIM less given the limited write endurance of the devices. And if the model changes quickly, the benefit of etching weights into non-volatile devices weakens.

8. Commercialization Challenges

The homework standing between CIM and the market is clear.

- **Device reliability**: ReRAM/PCM variation, drift, and write endurance must be tamed to production grade.

- **ADC overhead**: the cost of converting analog results to digital eats overall efficiency. Designs that reduce this conversion are key.

- **Software stack**: without a mature compiler and toolchain comparable to CUDA for GPUs, developers struggle to deploy models.

- **Accuracy guarantees**: calibration and training techniques that robustly preserve model accuracy despite noise are needed.

- **Process integration**: how smoothly it integrates with existing semiconductor processes drives cost.

8.5. Building Models That Withstand Noise — Noise-Aware Training

The key technique for solving CIM's accuracy problem from the software side is "noise-aware training." The idea is simple. If the hardware will add noise at inference time, then at training time deliberately imitate that noise and harden the model against it.

| Normal training | Noise-aware training |
| --- | --- |
| train with clean computation | inject noise during training |
| risk of accuracy drop on real CIM | learn weights robust to noise |
| | accuracy holds on real CIM |

Concretely, during training you deliberately mix in noise with a distribution the CIM hardware would likely produce, into weights or activations. Then the model learns to "get the answer right even with this much jitter." It is a kind of vaccination. A model trained this way suffers less accuracy drop from noise when placed on real analog CIM.

This technique matters because it is a good example of complementing hardware imperfection with software collaboration. CIM's future rests not on hardware alone but on co-design where hardware, models, and compilers handle noise together.
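The mechanics of noise injection fit in a short sketch. The example below trains a tiny linear model with stochastic gradient descent while the forward pass sees weights perturbed by additive Gaussian noise. The noise model, the toy data, and the function name `train_noise_aware` are all assumptions for illustration. A linear toy like this shows where the noise enters the loop and does not demonstrate the robustness gains reported for deep networks.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_noise_aware(samples, n_features, sigma, epochs=200, lr=0.05, seed=0):
    """SGD on a linear model whose forward pass sees noisy weights.

    Assumed noise model: additive Gaussian noise with standard deviation
    `sigma` on every weight, redrawn at each step. Real hardware noise
    would have to be measured and may look quite different.
    """
    rng = random.Random(seed)
    weights = [0.0] * n_features
    for _ in range(epochs):
        for x, target in samples:
            # Forward pass with perturbed weights, imitating the analog hardware.
            noisy = [w + rng.gauss(0.0, sigma) for w in weights]
            error = dot(noisy, x) - target
            # Gradient of 0.5 * error**2, applied to the clean stored weights.
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights


# Toy data generated from the rule y = 2*x0 - 1*x1 (illustrative only).
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
weights = train_noise_aware(samples, n_features=2, sigma=0.05)
print(weights)
```

Setting `sigma=0.0` turns this back into ordinary training, which makes it easy to compare the two runs side by side.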

9. 2026 Research Trends and the Big Picture

As of 2026, CIM is actively researched in both academia and industry, and some digital CIM and SRAM-based accelerators are approaching commercial products. Alongside other efforts to route around the memory wall (photonic interconnect and optical tensor-core research, Lightmatter and DARPA-related projects), it pursues the same goal of "move data less" through different physics.

Placed on the full accelerator landscape, the picture sharpens. NVIDIA is commonly estimated to hold about 75 to 80 percent of the accelerator market, cementing the mainstream with Blackwell and the next-generation Vera Rubin, while Google TPU, inference-specialized ASICs, and wafer-scale designs like Cerebras each offer their own answer. Within this, CIM stands as one branch that "eliminates data movement at the most fundamental level." As spending on inference starts to overtake spending on training and power becomes the ceiling, the value of CIM's extreme energy savings will only grow.

10. Relationship to GPUs and Digital Accelerators

CIM is not trying to replace the GPU. GPUs and digital accelerators are overwhelming in generality, precision, and mature ecosystem, and will remain the center of training and varied workloads.

CIM's place is closer to complementary. It fills the area where digital accelerators are structurally disadvantaged, namely specific inference, especially at the extremely power-constrained edge. Future systems are likely to go heterogeneous rather than have one technology do everything, with digital cores, CIM blocks, and other specialized accelerators dividing roles on one chip or one board.

11. The Crossbar in More Depth — Signs and Multiple Bits

The crossbar we saw earlier was a simplified picture representing positive weights with positive conductance. Real neural network weights include negatives, but conductance cannot be negative. How is this solved?

A common method is to use two devices as a pair. The weight is expressed as the difference of "positive conductance minus negative conductance." One device carries the positive contribution, the other the negative, and the difference of the two currents becomes the signed weight.

representing signed weights

-----------------------------------

weight w = G_plus - G_minus

G_plus : positive-contribution device

G_minus : negative-contribution device

difference of the two column currents = signed MAC

Another challenge is multi-bit precision. A single device can represent only a limited number of conductance levels, making it hard to hold a high-bit weight in one device. Here you split bits across multiple devices (bit slicing) or use devices that can represent multiple levels (PCM's multi-level storage, etc.). The higher the precision, the more devices or circuits you need, so the precision-versus-area-and-efficiency trade-off operates here too.

Looking at such detail shows that behind the crossbar's elegance of "matrix multiply in one shot with physics" hides sophisticated engineering handling signs, precision, and noise.
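Both tricks can be checked with integers. The sketch below represents a signed weight as a pair of non-negative conductances, and splits an 8-bit unsigned weight into four 2-bit slices that are recombined by weighting each slice with its place value. Function names, slice widths, and the sample numbers are illustrative assumptions; real designs also combine the two schemes and add noise handling.

```python
def split_signed(weights):
    """Map each signed weight w to a pair (g_plus, g_minus) with w = g_plus - g_minus."""
    g_plus = [max(w, 0) for w in weights]
    g_minus = [max(-w, 0) for w in weights]
    return g_plus, g_minus


def column_current(voltages, conductances):
    """Ideal column current: sum of V_i * G_i."""
    return sum(v * g for v, g in zip(voltages, conductances))


def signed_mac(voltages, weights):
    """Signed dot product as the difference of two non-negative column currents."""
    g_plus, g_minus = split_signed(weights)
    return column_current(voltages, g_plus) - column_current(voltages, g_minus)


def bit_sliced_mac(voltages, weights, bits_per_device=2, n_slices=4):
    """Dot product with unsigned integer weights spread over low-precision devices.

    Each slice is one column of devices holding `bits_per_device` bits.
    The per-slice currents are recombined digitally by place value.
    """
    base = 2 ** bits_per_device
    total = 0
    for s in range(n_slices):
        place = base ** s
        slice_weights = [(w // place) % base for w in weights]
        total += column_current(voltages, slice_weights) * place
    return total


print(signed_mac([4, 5], [2, -3]))          # 4*2 + 5*(-3) = -7
print(bit_sliced_mac([4, 5], [200, 13]))    # 4*200 + 5*13 = 865
```

The cost is visible in the code: the signed scheme doubles the device count, and the sliced scheme needs one column per slice plus digital recombination.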

12. Facing ADC Overhead Head-On

Analog CIM's most realistic stumbling block is the ADC (analog-to-digital converter). No matter how efficiently the crossbar produces a MAC result as current, that analog current must be turned into a digital number the next layer can use. The ADC handles this conversion, and it consumes a lot of area and power.

crossbar compute (cheap)

analog current result

ADC conversion (expensive!) <- efficiency leaks here

digital result

The severity is in the ratio. Even if the core operation (crossbar) is highly efficient, if the area and power the ADC takes are large, overall system efficiency is cut accordingly. In some designs the ADC takes up a substantial share of total power.

So one major thread of CIM research is "reducing ADC burden." You can have several columns share one ADC (time multiplexing), tune the algorithm so a low-resolution ADC suffices, or design circuits that make the conversion itself more efficient. For CIM's promise to become reality, taming this conversion cost is the key.
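A back-of-the-envelope model shows why the ratio matters and what sharing buys. In the sketch below every number is a placeholder in arbitrary units, not a measurement of any chip, and the helper names are hypothetical. It assumes one conversion per column per matrix-vector multiply.

```python
def adc_energy_share(n_columns, e_crossbar, e_adc_per_conversion):
    """Fraction of one matrix-vector multiply's energy spent in ADC conversions."""
    e_adc_total = n_columns * e_adc_per_conversion
    return e_adc_total / (e_crossbar + e_adc_total)


def adc_count(n_columns, columns_per_adc):
    """Number of ADCs needed when several columns share one converter."""
    return -(-n_columns // columns_per_adc)       # ceiling division


def readout_time(columns_per_adc, t_conversion):
    """Readout time per multiply: columns sharing an ADC are converted one after another."""
    return columns_per_adc * t_conversion


# Placeholder values, chosen only to illustrate the arithmetic.
share = adc_energy_share(n_columns=128, e_crossbar=100.0, e_adc_per_conversion=2.0)
print(f"ADC share of energy: {share:.0%}")
print("ADCs needed:", adc_count(n_columns=128, columns_per_adc=8))
print("readout time:", readout_time(columns_per_adc=8, t_conversion=1.0))
```

In this simple model, sharing reduces the number of converters (and so area) at the price of a longer readout, while the energy share falls only if each conversion gets cheaper, for example through lower resolution.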

13. SRAM-Based Digital CIM — The Most Realistic Path

The CIM closest to commercialization is, surprisingly, the least flashy form: SRAM-based digital CIM. It avoids the uncertainty of new devices (ReRAM, PCM) and takes only the core benefit of "do not move data far" on top of proven SRAM and CMOS processes.

The method is this. SRAM bitcells store the weights, and small digital multiply-add logic sits right beside those bitcells (or near the bitlines). Instead of data taking a long trip from memory to the processor, computation finishes after moving only a short distance inside the memory.

| Standard SRAM | Digital CIM SRAM |
| --- | --- |
| storage only | storage + nearby compute |
| sends data to the processor | computes inside the memory |
| large movement cost | small movement cost |

The appeal of this method is balance. It is not analog's extreme efficiency, but it greatly reduces data-movement cost while keeping digital precision and reliability. It can be made on existing processes without the risk of new device technology, so from a production standpoint the risk is low. The CIM that reaches the market first is therefore likely to be this form.

14. Alongside Other Ways Around the Memory Wall

CIM is one of many attempts to route around the memory wall. Let us place it alongside its peers that solve the same problem with different physics.

| Approach | Core idea | Targeted gain |
| --- | --- | --- |
| HBM | stack memory beside the chip | bandwidth increase |
| Wafer-scale | grow the chip to confine data | reduce communication/movement |
| Chiplet/CoWoS | package dies close together | shorten distance |
| Photonic | transmit data with light | reduce movement energy |
| In-memory (CIM) | compute inside memory | eliminate movement itself |

The big picture this table shows is clear. In an era where data movement has become more expensive than computation itself, all roads lead to "move data less." HBM and chiplets shorten distance, photonics lowers movement energy, wafer-scale confines data inside the chip, and CIM eliminates movement altogether. CIM sits at the most radical end of this spectrum, because it is the most fundamental answer: "do not move the data; compute right where it sits."

Implications for Developers

Few developers will handle CIM chips directly soon, but understanding the trend matters.

- **Quantization and robustness**: as CIM approaches, the ability to build models that work well at low precision and are robust to noise grows ever more important.

- **Workload awareness**: knowing whether your inference is memory-bandwidth bound and whether data movement is a large share of power is the starting point for hardware choice.

- **Heterogeneous thinking**: future systems are likely heterogeneous accelerator combinations. A sense for designing "which operation runs where" becomes an asset.

- **Software-hardware co-design**: CIM is not a hardware-only problem. Training models to withstand noise and using quantization aggressively, so that software makes up for the hardware's limits, is a collaborative skill that grows ever more important.

15. Following a Crossbar MAC Through a Tiny Example

Let us concretize the abstract explanation with a small numeric example. We follow conceptually how a crossbar solves the product of a 2x2 weight matrix and an input vector. (Real circuits handle signs and precision in more complex ways, but we simplify for intuition.)

Say weights are conductances and inputs are voltages.

    weight matrix (stored as conductance)
             col0   col1
    row0     G=2    G=1
    row1     G=3    G=0

    input vector (applied as voltage)
    row0 -> V=4
    row1 -> V=5

    current at each crossing I = V x G
    col0: (4 x 2) + (5 x 3) = 8 + 15 = 23
    col1: (4 x 1) + (5 x 0) = 4 + 0 = 4

    column currents = result vector [23, 4]

Savor what just happened. Four multiplications and two additions finished without a digital multiplier, in a single operation of just applying voltage. Ohm's law replaced multiplication, Kirchhoff's law replaced addition. The larger the matrix, the more work fits into this "single operation": an N x M crossbar performs N x M multiplications at once. This is the crossbar's inherent parallelism and the source of its efficiency.

Of course, in reality these clean numbers wobble with noise, negative weights need device pairs, and high precision needs multiple devices. But the core principle is all captured in this tiny example.
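The tiny example can be replayed in software. The sketch below models an ideal crossbar as a function that returns one summed current per column. The name `crossbar_mvm` is an assumption for illustration, and the model ignores noise, signs, and precision limits.

```python
def crossbar_mvm(voltages, conductances):
    """Column currents of an ideal crossbar.

    conductances[i][j] is the device at row i, column j.
    Each crossing contributes V_i * G_ij (Ohm's law); each column
    sums its contributions (Kirchhoff's current law).
    """
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(n_cols)
    ]


G = [[2, 1],
     [3, 0]]
V = [4, 5]
print(crossbar_mvm(V, G))   # [23, 4]
```

The software version still loops over every crossing one at a time; the point of the hardware is that all of those loop iterations happen at once.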

16. Frequently Asked Questions

**Q. Does in-memory computing replace the GPU?**

No. The GPU is overwhelming in generality, precision, and ecosystem, so it remains the center of training and varied workloads. CIM is a complement that shines in specific inference, especially at the low-power edge.

**Q. If the analog approach is weak to noise, why research it?**

Because neural network inference tolerates some inaccuracy. It often works fine at low precision (quantization), so designing to withstand noise leaves room to exploit analog's extreme efficiency.

**Q. Are non-volatile devices (ReRAM, PCM) the key?**

They have many advantages. Once weights are etched they persist with power off, fitting inference well. But challenges like variation, drift, and endurance remain, so in the short term SRAM-based digital CIM is more realistic.

**Q. As a developer, what should I prepare now?**

The ability to build models robust at low precision, and the habit of measuring where your workload's bottleneck is (compute or data movement). It is the most practical preparation for the heterogeneous-accelerator era.

17. The Core at a Glance

- In the von Neumann structure, data movement is orders of magnitude more expensive than computation. This is the memory wall.

- CIM fundamentally reduces this cost by computing inside memory without moving data.

- The crossbar array solves matrix multiply in one shot with Ohm's law (multiply) and Kirchhoff's law (add).

- Analog is highly efficient but has noise and ADC costs; digital's strength is precision.

- Rather than replacing the GPU, CIM complements digital accelerators in areas like low-power edge inference.

Closing

In-memory computing challenges one of computing's oldest assumptions: that computation and storage must be separate. Using the physics of memory cells as the compute engine is elegant, and the promise of eliminating data movement is attractive in an era where power is the ceiling.

Following this article, we met one pattern repeatedly. Every core decision, whether solving matrix multiply with a crossbar, placing logic beside SRAM, or etching weights into non-volatile devices, ultimately converges on a single goal: "move data less." The details of the technology are complex, but the direction is remarkably simple.

At the same time, it is not an easy road. The barriers of analog noise, device imperfection, and an immature software ecosystem are real. CIM replacing the GPU will not happen, but it has every chance to settle in as a complement offering efficiency that digital accelerators struggle to reach in specific inference domains. Among the many attempts to route around the memory wall, CIM is the elegant answer that touches the most fundamental root of the problem.

A technology's value often comes not from "is it the flashiest" but from "is it the most fundamental." The question CIM poses, whether computation and storage must be separated, is deep in itself, and whichever direction the answer goes, it broadens how we think about computing.

References

- Computer architecture research search (arXiv): [https://arxiv.org/list/cs.AR/recent](https://arxiv.org/list/cs.AR/recent)

- Emerging technologies / ML hardware research (arXiv): [https://arxiv.org/list/cs.ET/recent](https://arxiv.org/list/cs.ET/recent)

- NVIDIA Blackwell platform: [https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)

- Google Cloud TPU: [https://cloud.google.com/tpu](https://cloud.google.com/tpu)

- Lightmatter (photonic computing): [https://lightmatter.co](https://lightmatter.co)

- SemiAnalysis (semiconductor industry analysis): [https://www.semianalysis.com](https://www.semianalysis.com)