By Ron Wilson, Editor-in-Chief, Altera Corporation
The pursuit of memory bandwidth has become a dominant theme in system design. SoC designers, whether they use ASIC or FPGA technologies, must plan, architect, and implement with memory at the center of their thinking. System designers must, with a clear understanding of memory traffic patterns, provision the ports the chip designers created. Even memory vendors, facing the end of the DDR roadmap, are working to comprehend system behaviors in order to find a new way forward.
The search for bandwidth was a common theme running through the papers at the Hot Chips conference at Stanford University in August, with architectures illustrating many approaches to dealing with the challenge. From those papers, and from the experiences of designers working in the field, a trend appears to be emerging that may show at least the outline of the next stage in the evolution of memory system architecture.
The Memory Wall
The fundamental issue is obvious: with their gigahertz clock frequencies and multiple cores, modern SoCs can issue more memory requests per second than a single channel of DDR DRAM can service. Stated this way, the problem would appear to have an obvious solution. But there is important fine structure beneath this surface that both complicates the problem and induces a wonderful variety of responses.
For further reading:
For discussion of Altera's Hybrid Memory Cube (HMC) interface, see the Hybrid Memory Cube page.
For more thorough discussion of DRAM controllers, read this system design article: DRAM Controllers for System Designers
For information on Altera's advanced multi-client DRAM controller, see Hard Memory Controllers page.
As SoC developers shifted their focus from faster clocks to more cores, they changed the nature of the memory problem. Instead of one CPU demanding ever more megabytes per second (MBps), now we face many different processors—often many different kinds of processors—all clamoring for access at once. The dominant patterns of memory access have changed as well. Traditional scientific and commercial data-processing tasks involved strong locality of reference or, at worst, slowly spooled a very large data set through a relatively compact algorithm. Provisioned with modest local SRAM or cache, a single CPU on such a task could be quite undemanding of main memory.
DRAM chip designers exploited this simplicity in order to deliver greater density and energy efficiency. Consequently, DRAMs deliver their best bit rate when asked for large blocks of data in a predictable order—one that allows bank interleaving. If the SoC deviates from this regular pattern, the effective bandwidth of the memory system can drop by an order of magnitude.
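To make the row-buffer effect concrete, here is a minimal C sketch of how a controller might decode a physical address into column, bank, and row fields. The field widths are assumptions chosen purely for illustration, not those of any particular device: consecutive addresses change only the column bits and keep hitting an already-open row (or interleave neatly across banks), while scattered addresses keep changing the row bits and force a precharge and activate nearly every time.

```c
/* Illustrative address decode for a hypothetical DRAM with 8 KB rows and
 * 4 banks. Not any specific device; the point is which bits change when. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t column;  /* low bits: walk along one open row            */
    uint32_t bank;    /* middle bits: allow bank interleaving         */
    uint32_t row;     /* high bits: changing these costs a precharge  */
                      /* plus activate on that bank                   */
} dram_addr;

static dram_addr decode(uint64_t phys)
{
    dram_addr a;
    a.column = (uint32_t)(phys & 0x1FFF);        /* 13 column bits (8 KB row) */
    a.bank   = (uint32_t)((phys >> 13) & 0x3);   /* 2 bank bits               */
    a.row    = (uint32_t)(phys >> 15);           /* remaining bits            */
    return a;
}

int main(void)
{
    uint64_t sequential[] = { 0x1000, 0x1040, 0x1080, 0x10C0 };    /* streaming       */
    uint64_t scattered[]  = { 0x1000, 0x93040, 0x5080, 0x1AC0C0 }; /* pointer-chasing */

    for (int i = 0; i < 4; i++) {
        dram_addr s = decode(sequential[i]);
        dram_addr r = decode(scattered[i]);
        printf("seq: row %u bank %u col %u | scattered: row %u bank %u col %u\n",
               s.row, s.bank, s.column, r.row, r.bank, r.column);
    }
    return 0;
}
```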
New Pattern of Access
Unfortunately, the evolution of SoCs has undermined the assumptions of the DRAM designers. Multi-threading and emerging trends in software design are changing the way individual cores access memory. Multicore processing and the growing importance of hardware acceleration mean that often, many hardware clients will contend for main memory. Both of these trends complicate the simple locality of reference upon which DRAM bandwidth depends.
Multi-threading means, among other things, that when a memory request misses its caches the CPU doesn’t wait: it begins executing a different thread, whose instruction and data areas are likely to be in entirely different physical memory regions than those of the previous thread. Careful multi-way cache design can help with this problem, but in the end, consecutive DRAM requests are still more likely to reference unrelated areas of memory, even if the individual threads carefully optimized their memory organization. Similarly, multiple cores contending for the same DRAM channel will tend to scramble the order of DRAM accesses.
Changes in the software world have also had their effects. Table look-ups and linked-list processing can scatter memory accesses randomly across huge data structures. As packet processing and big-data algorithms have moved these tasks from control code into high-volume data-processing flows, system designers have had to think more specifically about how to handle them efficiently. Virtualization further shuffles memory traffic by locating many virtual machines on the same physical core.
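As a small illustration of that shift, the C sketch below walks the same values two ways: streaming through a flat array, which presents the DRAM controller with consecutive addresses, and chasing a linked list, which presents whatever addresses the allocator happened to hand out. The second pattern is exactly the poor-locality traffic described above.

```c
/* Two walks over the same payload. The array walk is the access pattern
 * DRAM likes; the list walk is the one that defeats bank interleaving. */
#include <stddef.h>

typedef struct node {
    struct node *next;
    int payload;
} node;

long sum_array(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)      /* consecutive addresses: row hits */
        sum += data[i];
    return sum;
}

long sum_list(const node *head)
{
    long sum = 0;
    for (const node *p = head; p != NULL; p = p->next)
        sum += p->payload;              /* the next node may be anywhere   */
    return sum;
}
```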
None of these problems is new in kind—they are just growing in scale. So chip and system designers have a number of proven approaches to dealing with both the need for increased raw bandwidth and the degradation of DRAM access efficiency. Among these tools are software optimization, caching, and deploying multiple DRAM channels.
Most embedded-system designers are trained to think first about software optimization. In a single-thread system, software has huge leverage over memory-channel utilization and energy consumption. But in multithreaded, multicore systems, software developers have little influence over the actual sequence of events at the DRAM controller. Beyond some rules of thumb, there is little they can do to shape run-time traffic patterns. And the DRAM controller itself may be using reordering and fairness algorithms of which the programmers are not aware.
Caching can be far more effective—if the caches are large enough to significantly reduce DRAM traffic. For example, relatively small L1 instruction caches, working with a moderate-sized L2, may completely contain the code hotspots for all the threads in an embedded multicore implementation, substantially reducing instruction-fetch traffic to main memory. Similarly, fitting a relatively small amount of data into L2 or local SRAM may eliminate repeated loads of a filter kernel in a signal-processing application. To have a major impact, a cache doesn’t have to substantially reduce the total number of DRAM requests—it merely has to protect the dominant source of requests from disruption by other tasks, so programmers can optimize the dominant task.
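A minimal sketch of that small-working-set point, with an arbitrary tap count chosen for illustration: once the coefficient array of this FIR filter is resident in L1 or local SRAM, only the signal itself has to stream from DRAM, whatever else is running on the chip.

```c
/* FIR filter whose coefficients (64 floats, 256 bytes) trivially fit in
 * cache or local SRAM. The tap count is an illustrative assumption. */
#define TAPS 64

void fir(const float *in, float *out, long n, const float coeff[TAPS])
{
    for (long i = TAPS; i < n; i++) {
        float acc = 0.0f;
        for (int k = 0; k < TAPS; k++)
            acc += coeff[k] * in[i - k];   /* coeff[] stays cache-resident */
        out[i] = acc;
    }
}
```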
When chip designers can’t be certain about the mix of tasks that will run on their SoC, there is a tendency to provide as much cache as economics allows: L1 caches for all CPU cores and accelerators, large shared L2s, and, increasingly, huge on-die L3s. The Hot Chips conference provided numerous examples of aggressive cache provisioning, ranging from tablet-level application processors to enormous server SoCs.
At the low end, AMD’s Kabini SoC (Figure 1) is an interesting study. The chip, described by AMD Senior Fellow Dan Bouvier, includes four Jaguar CPU cores sharing a 2 megabyte (MB) L2 cache, while each Jaguar has its own 32 kilobyte (KB) instruction and data caches—not an unconventional arrangement. More surprising is the chip’s graphics processor, which has its own L1 instruction cache and 128 KB L2 in addition to the usual color cache and Z-buffer for the rendering engine.
Near the other extreme is IBM’s POWER8 microprocessor (Figure 2), described by IBM Chief Nest Architect Jeff Stuecheli. The 650 mm², 22 nm chip comprises 12 POWER-architecture CPU cores, each with 32 KB of instruction and 64 KB of data cache. Each core also has its own 512 KB of SRAM L2 cache, and the 12 L2s share an enormous 96 MB of embedded-DRAM L3. Stuecheli said the three levels of coherent cache could sustain 230 gigabytes per second (GBps) of aggregate memory bandwidth. Interestingly, the chip also contains a small transactional memory.
Somewhere between these two SoCs lies the multi-die module for Microsoft’s XBOX One (Figure 3), described at the conference by Microsoft’s John Snell. The module contains an SoC die with formidable memory resources. The SoC carries eight AMD Jaguar cores, grouped into two clusters of four. Each core has 32 KB of L1 instruction cache and 32 KB of L1 data cache, and each cluster of four CPU cores shares a 2 MB L2. In addition, there are four 8 MB shared SRAMs on the die, by themselves providing a minimum of 109 GBps of bandwidth to the CPUs.
Getting to DRAM
But there is another message to be read from the XBOX One SoC. No matter how much on-die cache you have, there is no substitute for massive DRAM bandwidth. The SoC die includes a four-channel DDR3 DRAM controller, providing 68 GBps peak bandwidth to the 8 GB of DRAM mounted in the module.
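The peak number is easy to reconstruct with back-of-the-envelope arithmetic. The sketch below assumes four 64-bit channels of DDR3 at 2,133 megatransfers per second; the transfer rate is an assumption, chosen because it is consistent with the 68 GBps figure.

```c
/* Peak DRAM bandwidth = channels x bytes per transfer x transfers per second.
 * The DDR3-2133 rate is an assumption consistent with the quoted 68 GBps. */
#include <stdio.h>

int main(void)
{
    double channels       = 4.0;
    double bytes_per_xfer = 64.0 / 8.0;   /* 64-bit channel = 8 bytes  */
    double transfers_sec  = 2133e6;       /* assumed DDR3-2133         */

    double peak = channels * bytes_per_xfer * transfers_sec;
    printf("peak = %.1f GBps\n", peak / 1e9);   /* about 68.3 GBps */
    return 0;
}
```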
The idea of multiple DRAM channels is not limited to game systems. Packet-processing SoCs began providing multiple, fully independent DRAM controllers several years ago. But the strategy has challenges. It imposes even more complexity on memory optimization, as system designers must decide which data structures to map to which channel or controller. It also, of course, creates the opportunity to give a particularly demanding task its own private DRAM controller, which in some embedded applications can be valuable. But multiple DRAM channels can also quickly eat through the available pin count and I/O power budget.
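What mapping data structures to channels can look like is sketched below: each large buffer is pinned to one of several independent controllers so that the most demanding client keeps a channel to itself. The channel IDs and the channel_alloc() routine are hypothetical stand-ins for whatever mechanism a real platform provides, such as linker sections, address windows, or a NUMA-style allocator.

```c
/* Hypothetical channel-aware allocation. On real hardware channel_alloc()
 * would hand back memory from the physical range served by that controller;
 * here it is a stub so the sketch compiles. */
#include <stdlib.h>

enum dram_channel { CH0, CH1, CH2, CH3 };

static void *channel_alloc(enum dram_channel ch, size_t bytes)
{
    (void)ch;               /* stub: a real platform would honor the channel */
    return malloc(bytes);
}

int main(void)
{
    /* Give the most demanding client its own channel so other traffic
     * cannot disturb its carefully optimized access pattern. */
    void *video_frames  = channel_alloc(CH0, (size_t)64 << 20);
    void *packet_tables = channel_alloc(CH1, (size_t)16 << 20);
    void *general_heap  = channel_alloc(CH2, (size_t)128 << 20);

    free(video_frames);
    free(packet_tables);
    free(general_heap);
    return 0;
}
```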
Pin count can become an issue even in FPGA designs, where the designer has great flexibility to reorganize logic and select a larger package. Altera’s Advanced System Development Kit (Figure 4), a board intended for prototyping and delivering bandwidth-intensive designs in areas such as HD video processing, layer-7 packet inspection, or scientific computing, is a useful case study.
Mark Hoopes, an Altera specialist in broadcast applications, explained that the board had to be provisioned for huge memory bandwidth without detailed knowledge of the specific designs that users might implement in the two large FPGAs. So to plan the board, Hoopes examined the memory use patterns of existing Altera® video intellectual property (IP), and interviewed external design teams about their needs.
The results were sobering. “When you look at individual functions, memory requirements look reasonable,” Hoopes says. “But when you combine functions, requirements climb through the roof.” In one case, an application developer requested both a full 256-bit-wide DDR3 interface and four channels of QDR II SRAM on each FPGA. Even with 1,932-pin packages, that was insupportable. So the designers ended up with the four SRAM banks and a 192-bit DDR3 interface.
Hoopes makes an interesting point about the value of multiple memory controllers to an SoC. He says that often IP developers are very skilled at memory optimization at the subsystem level, and may even provide their own optimized DRAM controller. It can make sense to dedicate a DRAM channel to a subsystem just to keep other IP blocks from ruining the subsystem designers’ optimizations.
Into the Future
There is another interesting block on the development board: attached to each FPGA is a MoSys Bandwidth Engine. This chip contains 72 MB of DRAM, organized into 256 banks to emulate SRAM timing and tuned for poor-locality uses such as table storage. Uniquely, the chip uses a high-speed serial interface instead of the usual DDR or QDR parallel interface. “The interface was one reason we included these parts,” Hoopes remarks. “We had unused transceivers available on the FPGAs.” But as it turns out, the MoSys parts are a harbinger.
Three ideas—each of which we have seen already—are converging to define the next step in memory architecture. These ideas are large embedded memory arrays, high-speed serial interfaces using fault-tolerant protocols, and transactional memory.
The first two ideas are nicely illustrated in both the MoSys chip and the IBM POWER8 architecture. In the POWER8 design, the CPU SoC communicates with DRAM through a second chip: the Centaur Memory Buffer. One POWER8 can attach up to eight Centaurs, each through a dedicated 9.6 gigabits per second (Gbps) serial channel. Each Centaur contains 16 MB of memory—used as both cache and scheduling buffer—and four DDR4 DRAM interfaces, along with a very smart controller. IBM puts the Centaur chips on the DRAM DIMM, eliminating up to eight DDR4 connector crossings from the system. Thus the design clusters significant memory and intelligence at the end of a fast serial link protected by a retry protocol.
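A minimal sketch of the kind of retry protocol that protects such a link, assuming a simple frame carrying a sequence number and a CRC; the frame layout and buffer sizes are invented for illustration and are not IBM's actual link protocol. The key property is that a corrupted frame is detected and replayed from the sender's buffer rather than surfacing as lost or bad data.

```c
/* Minimal sketch of a retried serial-link frame: sequence number + CRC.
 * The layout and the 8-entry replay buffer are illustrative assumptions. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t  seq;          /* sequence number, used to order replays  */
    uint8_t  payload[16];  /* one command or data beat                */
    uint32_t crc;          /* covers everything before this field     */
} frame;

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1u));
    }
    return ~crc;
}

static frame replay_buf[8];    /* sender keeps frames until they are ACKed */

/* Sender: stamp, checksum, remember for possible replay, then transmit. */
void send_frame(uint8_t seq, const uint8_t data[16])
{
    frame f;
    memset(&f, 0, sizeof f);
    f.seq = seq;
    memcpy(f.payload, data, sizeof f.payload);
    f.crc = crc32(&f, offsetof(frame, crc));
    replay_buf[seq % 8] = f;
    /* ...serialize f onto the link here... */
}

/* Receiver: a bad CRC triggers a replay request, not silent data loss. */
int frame_ok(const frame *f)
{
    return f->crc == crc32(f, offsetof(frame, crc));
}
```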
Another Hot Chips example comes from MoSys, which described its next-generation Bandwidth Engine 2 at the conference. Following the pattern, the Bandwidth Engine 2 connects to the processing subsystem through up to 16 lanes of serial I/O running at up to 15 Gbps each. And the chip contains four partitions of memory, each comprising 64 banks of 32K 72-bit words: as in the first generation, 72 MB altogether. The dynamic nature of the individual bit cells is hidden by the many banks, by an intelligent, reordering controller, and by a substantial on-chip SRAM cache.
Going a step beyond the announced features of the Centaur chip, the Bandwidth Engine 2 also provides transactional functions on the die. Various versions of the chip offer an onboard arithmetic logic unit (ALU), so statistics collection, metering, and atomic arithmetic and indexing operations can be performed in the memory, without actually moving data over the external serial links. The internal ALU has obvious uses for semaphores and linked lists as well. The additional hardware has, however, made the chips somewhat application-specific: MoSys vice president of technology Michael Miller described four versions of the Bandwidth Engine 2, each with a different feature set.
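A hypothetical command encoding, sketched in C below, shows why that matters: a host-side counter update costs two crossings of the serial link plus host arithmetic, while the in-memory version is a single command with no data returned. The opcode names and fields are invented for illustration; they are not the Bandwidth Engine 2 command set.

```c
/* Hypothetical command encoding contrasting a host-side counter update
 * with a device-side atomic add performed by the memory's own ALU. */
#include <stdint.h>

typedef enum { MEM_READ, MEM_WRITE, MEM_ATOMIC_ADD, MEM_ATOMIC_CAS } mem_op;

typedef struct {
    mem_op   op;
    uint64_t address;   /* word address inside the memory device      */
    uint64_t operand;   /* increment for ADD, compare value for CAS   */
} mem_cmd;

/* Conventional path: read the counter, add one on the host, write it back.
 * Two link crossings, and the data itself travels both ways. */
void bump_counter_on_host(uint64_t addr, void (*issue)(mem_cmd), uint64_t old)
{
    issue((mem_cmd){ MEM_READ,  addr, 0 });
    issue((mem_cmd){ MEM_WRITE, addr, old + 1 });
}

/* In-memory path: one command; the device's ALU does the read-modify-write. */
void bump_counter_in_memory(uint64_t addr, void (*issue)(mem_cmd))
{
    issue((mem_cmd){ MEM_ATOMIC_ADD, addr, 1 });
}
```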
The next chapter may be written not by CPU architects, but by a cost-conscious commodity DRAM vendor. Micron Technology has solidified the specification, built prototypes, and begun announcing interface partners for its Hybrid Memory Cube (HMC). The HMC is a set of DRAM dice stacked onto a logic die, with a set of high-speed serial lanes reaching out to the rest of the system. Micron has not discussed the functions of the logic die publicly, but presumably they will include DRAM control and buffering to emulate SRAM functionality, plus perhaps application-specific transactional functions.
The idea of logic embedded in the memory subsystem can have interesting implications. Local DRAM controllers with access to lots of logic gates and huge amounts of cache can virtualize away essentially all of the DRAM chip characteristics that degrade memory bandwidth. IBM’s zEC12 mainframe architecture, also described at Hot Chips, applies the RAID 5 protocol from the disk drive world to the DRAM DIMMs it controls, in effect turning banks of DRAM into a multibanked, parallel redundant memory system. The same principle could be used to integrate large banks of NAND flash memory into the memory system, providing a RAID-managed hierarchy of storage that functioned as a virtual gigantic SRAM.
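The RAID-5 idea translates directly: stripe data across several channels, keep one parity word per stripe, and rebuild a failed channel by XORing the survivors. The sketch below assumes a five-channel stripe (four data plus parity), which is an illustrative geometry, not the zEC12's actual organization.

```c
/* XOR parity across an assumed 4+1 stripe of memory channels. Losing any
 * one channel, the missing word is the XOR of the four that survive. */
#include <stdint.h>

#define DATA_CHANNELS 4

/* Parity word for one stripe of data words. */
uint64_t stripe_parity(const uint64_t data[DATA_CHANNELS])
{
    uint64_t p = 0;
    for (int i = 0; i < DATA_CHANNELS; i++)
        p ^= data[i];
    return p;
}

/* Rebuild the word lost on a failed channel from the survivors plus parity. */
uint64_t rebuild_missing(const uint64_t surviving[DATA_CHANNELS])
{
    uint64_t missing = 0;
    for (int i = 0; i < DATA_CHANNELS; i++)
        missing ^= surviving[i];        /* XOR of survivors = lost word */
    return missing;
}
```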
SoCs are unquestionably getting more demanding. In response, serial links, local memory, and especially local intelligence will entirely change the way we think about memory architecture.