By Ron Wilson, Editor-in-Chief, Altera Corporation
In 2003, amidst the recessionary hangover from the dot-com crash, Altera began its third decade. It was a year of endings: the tragic loss of the space shuttle Columbia, the last contact with the spacecraft Pioneer 10, the last VW Beetle rolling off the assembly line. And it was a year of beginnings: the Iraq War, the start of a new bull market in U.S. stocks, the first supersonic flight of the privately developed SpaceShipOne, the first manned spaceflight by China.
In the world of system design, the elements had quietly aligned for the next stage in the evolution of FPGAs. The most aggressive CMOS technology had given FPGAs the logic density and speed to implement a CPU core and its peripherals in a single chip. Altera had released Nios, a RISC CPU core optimized for FPGAs, and partners had developed FPGA implementations of other popular CPU cores as well. Avalon, a multi-master bus architecture tuned for programmable logic, standardized the interconnect between CPUs and subsystems on the chip. And SoPC Builder, a tool that automated the assembly of intellectual property (IP) blocks into an FPGA-based SoC, had reached the market.
This groundwork enabled an entirely new way of thinking about programmable logic. Designers would continue to build glue logic in CPLDs. Seekers of high performance would continue to implement ever faster, more powerful accelerators and subsystems for packet-switching, signal-processing, and related applications. But in addition, Altera’s third decade would be the dawn of the FPGA as a system on chip.
The CPU-Centric Phase
Early in the decade, SoCs tended to follow a simple pattern, based on the board-level computers they were replacing. An SoC typically comprised a single CPU core, a local cache or tightly coupled SRAM, a DRAM controller, an on-chip version of a microprocessor bus, and whatever peripheral controllers the application required (Figure 1). The picture might also include a DMA controller or an application accelerator for some frequent but taxing task, such as data movement, cryptographic computations, or Fast Fourier Transforms (FFTs).
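From the software side, such an SoC is simply a memory map. The sketch below shows a hypothetical map and a DMA driver routine for a system of this shape; every address, register name, and bit assignment here is invented for illustration and does not correspond to any actual Altera product.

```c
/* Hypothetical memory map for a CPU-centric FPGA SoC.
 * All addresses and register layouts are illustrative only. */
#include <stdint.h>

#define SDRAM_BASE   0x00000000u  /* external DRAM via on-chip controller */
#define SRAM_BASE    0x10000000u  /* tightly coupled on-chip RAM          */
#define UART_BASE    0x20000000u  /* peripheral controller on the bus     */
#define DMA_BASE     0x20001000u  /* optional DMA engine                  */
#define FFT_ACC_BASE 0x20002000u  /* optional application accelerator     */

/* Register view of the hypothetical DMA controller. */
typedef struct {
    volatile uint32_t src;     /* source address            */
    volatile uint32_t dst;     /* destination address       */
    volatile uint32_t length;  /* bytes to move             */
    volatile uint32_t control; /* bit 0: start, bit 1: busy */
} dma_regs_t;

static dma_regs_t *const dma = (dma_regs_t *)DMA_BASE;

/* Offload a bulk copy from DRAM to on-chip SRAM instead of spending CPU cycles. */
static void dma_copy(uint32_t src, uint32_t dst, uint32_t len)
{
    dma->src     = src;
    dma->dst     = dst;
    dma->length  = len;
    dma->control = 1u;             /* set the start bit           */
    while (dma->control & 2u) { }  /* spin until busy bit clears  */
}
```

Because the whole map lives behind one multi-master bus, adding or dropping a peripheral is a matter of regenerating the system rather than respinning a board.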
Implementing the SoC in an FPGA offered some valuable benefits. Designers could select just the hardware blocks they needed in the CPU core. Numerical accelerators could use the fast digital signal processing (DSP) blocks in Altera FPGAs to achieve arithmetic performance well beyond what the combination of a microprocessor and a DSP chip could reach. And a designer could implement custom accelerators using the programmable logic, DSP blocks, and RAM blocks embedded in the FPGA fabric. These accelerators could be designed either as units on the microprocessor bus or as independent flow-through processors, creating a data plane separate from the microprocessor’s control plane.
An important advantage of this increased integration, according to Altera product planning manager Bernhard Friebe, was energy efficiency. Hard functions such as RAM and DSP blocks in the FPGA were at least as energy-efficient as an equivalent ASIC or off-the-shelf function. Functions implemented in the programmable logic would generally—but not always—consume more power than their standard-product equivalents. But during this period I/O dominated energy consumption in many systems. And moving data through the FPGA fabric was not only vastly faster, but far more efficient than moving it across chip boundaries. By confining high-bandwidth data transfers inside the FPGA, system designers could often achieve very substantial net energy savings at the system level.
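The scale of that saving is easy to sketch. The per-bit energy figures below are illustrative order-of-magnitude assumptions for the period, not measured Altera data; the point is the ratio between them, not the absolute values.

```c
/* Back-of-envelope comparison: moving a 1 MB buffer off-chip vs. on-chip.
 * Energy-per-bit figures are illustrative assumptions only. */
#include <stdio.h>

int main(void)
{
    const double bits       = 1.0e6 * 8.0; /* 1 MB buffer                       */
    const double pj_offchip = 20.0;        /* assumed pJ/bit: I/O pad and trace */
    const double pj_onchip  = 0.5;         /* assumed pJ/bit: FPGA fabric wire  */

    printf("off-chip transfer: %.1f uJ\n", bits * pj_offchip * 1e-6); /* 160.0 */
    printf("on-chip transfer:  %.1f uJ\n", bits * pj_onchip  * 1e-6); /*   4.0 */
    /* A ~40x gap: keeping high-bandwidth traffic inside the FPGA can
       dwarf the per-gate inefficiency of programmable logic. */
    return 0;
}
```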
With the hardware and IP to support CPU-centric SoCs already in place, Altera focused on the tool flow. It was quickly apparent that the tool needs of SoC developers were different from those of traditional logic designers. Traditionally, designers of interfaces or datapath components would express every detail of their design in VHDL or Verilog, and then follow each element through the steps of logic verification, mapping to the FPGA resources, and timing closure.
But SoC designers focused at a more abstract level. Was the hardware fast enough and the on-chip RAM large enough? Were the bus and memory bandwidths adequate? Did the bus interfaces interoperate? With heavy IP reuse, the focus of design effort shifted from the overall SoC logic to writing software, and to creating one or two new blocks to drop into a design assembled from existing IP. In other words, SoC developers were thinking like system designers, not like chip designers.
One result of this shift in emphasis was Incremental Compilation, first introduced by Altera in 2005. Often, design effort would focus on one or two blocks in an SoC while the majority of the hardware remained unchanged. Altera’s Incremental Compilation feature allowed designers to rework one portion of a design, subject to fixed location and pin constraints, without having to run the entire design back through the tool chain. It not only saved compilation time but also removed the risk of disturbing the portions of the hardware that were already working.
SoC designs also introduced a shift in the use of FPGA I/O pins. When FPGAs served as bus bridges or accelerators, data tended to flow through the chip in bursts or streams, usually from one standard bus into another. Typically there would be only a few clock domains, mostly defined by the busses.
CPU-centric SoCs presented new requirements. There would often be a standard external bus, such as PCI or USB. But now the FPGA would be the host of that bus, not simply a client on it. There would also, almost certainly, be a DRAM port, drawing FPGAs onto the challenging trajectory of DDR SDRAM interface technology. And there would likely be a number of serial or parallel connections between on-chip peripheral controllers and their external devices. This diversity could mean more pins, more signaling and voltage variety in the I/O, and more clock domains. These changes were reflected in the increasing complexity of FPGA I/O cells and clock networks.
Core and Multicore
The treadmill of semiconductor process improvements continued to run, grinding out ever higher transistor densities. But during Altera’s third decade, the mill became less and less able to produce higher circuit speeds. Accordingly, CPU manufacturers refocused: from ever-higher core clock frequencies to placing two, then four, then still more CPU cores on one die: multicore architectures. SoC designers quickly followed, both in ASIC designs and in FPGAs.
Multicore thinking had two significant threads in FPGA use. One thread simply replicated CPU cores. By this time it was relatively easy to compile several processor cores into one FPGA. It was less simple, though, to figure out how to connect them. Here, programmable logic offered an embarrassment of riches, as architects could implement virtually anything from arrays of tightly-coupled cores to shared L2 cache architectures to independent CPUs on the multimaster Avalon® bus.
The second line of multicore thinking led down a different path: heterogeneous systems. The same bus, IP, and tools that made multiple instances of one CPU core feasible made combinations of a CPU core and multiple, peer-level accelerators just as possible (Figure 2). And this, in turn, led to an entirely different way of thinking about multicore design: a software-centric approach.
Planning a homogeneous multicore system can be (to vastly oversimplify) fairly straightforward. Figure out how many times faster than single-CPU speed you need to go. Put in that many CPUs, plus maybe an extra or two to account for inefficiencies. Choose an interconnect architecture based on the level of memory sharing you expect between threads. Divide your software threads among the CPUs, simulate the system, and repeat until it works within specs. The process remains firmly hardware-centric: select an architecture, implement it, and then divide up the code to fit the hardware.
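Reduced to code, the sizing step in that recipe is a one-liner. In the toy sketch below, the 20% margin for parallelization overhead is an arbitrary illustrative assumption, not a rule from any Altera methodology.

```c
/* Toy sizing step for a homogeneous multicore system: pick a core count
 * from a target speedup plus a margin for parallelization inefficiency. */
#include <math.h>
#include <stdio.h>

static int cores_needed(double target_speedup, double margin)
{
    /* Round up: fractional cores are not an option. */
    return (int)ceil(target_speedup * (1.0 + margin));
}

int main(void)
{
    /* Need 3.5x single-CPU performance; allow 20% for overhead. */
    printf("cores: %d\n", cores_needed(3.5, 0.20)); /* prints: cores: 5 */
    return 0;
}
```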
But the ability to create your own accelerators opens up a whole new methodology. It goes like this. Profile your code to find the hot spots. For the nastiest code segments, create custom accelerators that will save both CPU cycles and energy. Simulate the system, then return to the profiling step and repeat until the performance requirements are met. This approach starts with working software on one CPU core and generates a constellation of hardware accelerators customized to the actual system software. For the first time, the system becomes a reflection of the software requirements, rather than a Procrustean bed into which the software must be forced.
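The first step of that loop needs nothing exotic. The sketch below uses a crude timing harness to expose a hot spot on the single-CPU baseline; checksum() is a made-up stand-in for whatever routine a real profiler (gprof-style sampling, for instance) would actually flag.

```c
/* Step 1 of the software-centric flow: find the hot spot.
 * checksum() is an illustrative stand-in for the code a profiler flags. */
#include <stdio.h>
#include <time.h>

static unsigned checksum(const unsigned char *buf, unsigned len)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < len; i++)
        sum = (sum << 1) ^ buf[i];  /* cheap per-byte mix: a natural accelerator target */
    return sum;
}

int main(void)
{
    static unsigned char buf[1 << 20];  /* 1 MB of work */
    clock_t t0 = clock();
    volatile unsigned s = checksum(buf, sizeof buf);
    clock_t t1 = clock();
    (void)s;
    printf("checksum took %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}
```

If this routine dominates the profile, it becomes the candidate for a custom accelerator, and the loop begins again with the remaining software.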
In 2006, Altera introduced two innovations that supported this heterogeneous multicore design style. One was a compiler that would transform a block of executable ANSI C code into an accelerator optimized to work with a Nios® CPU core in an Altera FPGA. This C-to-Hardware Acceleration (C2H) compiler tool automated one of the most time-consuming and error-prone steps in software-centric design: generation of the accelerators.
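The input to such a flow was ordinary ANSI C. A function like the sketch below, a tight pointer-walking loop with fixed control flow and no library calls, is the kind of code a C2H-class compiler could map to hardware; the function itself is illustrative, not an actual C2H example. In the real flow, to my understanding, pointer accesses became bus-master ports on the generated accelerator while software continued to call the function through its normal C interface.

```c
/* The sort of self-contained ANSI C loop a C2H-class compiler could turn
 * into a hardware accelerator. Illustrative code, not a real C2H example. */
void vector_mac(const short *a, const short *b, long *acc, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += (long)a[i] * b[i];  /* maps naturally onto FPGA DSP blocks */
    *acc = sum;
}
```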
The second innovation was less obvious. Compare the power consumption of a fast single-core processor to that of an equivalent cluster of slower-clocked cores and accelerators: dynamic power should go down sharply, because lower clock rates permit lower supply voltages and the accelerators do more work per cycle. But leakage, a growing problem throughout the decade, increases with the total number of transistors, regardless of circuit activity. So leakage currents could take away much of the energy efficiency that multicore design provided.
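The trade-off follows from the standard first-order CMOS power model (a textbook approximation, not anything Altera-specific):

$$P_{\text{total}} \;=\; \underbrace{\alpha\,C\,V^{2}\,f}_{\text{dynamic}} \;+\; \underbrace{N\,I_{\text{leak}}\,V}_{\text{leakage}}$$

Replicating slower cores and accelerators reduces the clock frequency $f$ and usually the supply voltage $V$, so the dynamic term falls steeply; but it raises the transistor count $N$, so the leakage term grows even when the added logic sits idle.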
Altera responded to this problem with a second innovation: Programmable Power. This combination of hardware and software-tool features selects slower, low-leakage circuits for non-critical timing paths, minimizing leakage current in the FPGA while still delivering timing closure. As a result, designs could recapture the big energy gains that heterogeneous multicore design had to offer, despite the higher leakage of deep-submicron processes.
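Conceptually, this resembles the slack-based dual-threshold assignment long used in ASIC flows. The sketch below renders that general idea in code; it is a schematic illustration of the technique, not Altera’s actual algorithm or data structures.

```c
/* Conceptual sketch of slack-based low-leakage selection, in the spirit of
 * Programmable Power: default every block to the slow, low-leakage variant,
 * and promote to the fast, leaky variant only where timing demands it.
 * Not Altera's actual algorithm. */
typedef struct {
    double slack_ns;   /* worst slack through this block in low-power mode */
    int    high_speed; /* 1 = fast/leaky, 0 = slow/low-leakage             */
} logic_block_t;

void assign_power_modes(logic_block_t *blocks, int n)
{
    for (int i = 0; i < n; i++) {
        /* Negative slack means the path misses timing in low-power mode,
           so spend leakage only on these critical blocks. */
        blocks[i].high_speed = (blocks[i].slack_ns < 0.0);
    }
}
```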
Consensus and Hardening
A final phase marked the closing years of Altera’s third decade: the growth of consensus on IP selection. Gradually, the system design community is tightening its focus on specific solutions to some of its most pressing problems. In particular, C has become nearly ubiquitous among embedded-system developers, ARM® cores are coming to dominate embedded computing, and a relatively small set of interface standards is coming to dominate specific uses, such as high-speed system busses, backplane connections, and inter-chip connections. That focus is allowing Altera to innovate in its support of these solutions.
One example is in the way programmers express parallelizable chunks of code. C, while sufficient to define a sequential procedure that implements a task, cannot express the opportunities for parallelism that a skilled programmer can find. But a C derivative called OpenCL™ can. In 2011, Altera introduced a set of tools that allowed programmers to write parallel algorithms in the increasingly popular OpenCL and translate them, without specialized knowledge of FPGA design, into parallel hardware in the FPGA and control code on a conventional CPU.
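For flavor, here is a minimal OpenCL C kernel of the kind such tools consume; it is a generic vector add, not taken from Altera’s examples. An FPGA-targeted OpenCL compiler turns a kernel like this into a pipelined hardware datapath, while host code on a conventional CPU queues the work.

```c
/* Minimal OpenCL C kernel: a generic vector add, illustrative only.
 * Each work-item computes one output element; an FPGA compiler can
 * turn the work-items into a deep hardware pipeline. */
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c)
{
    int i = get_global_id(0);  /* this work-item's index */
    c[i] = a[i] + b[i];
}
```

The crucial point is what the source expresses: every work-item is independent, so the compiler is free to replicate and pipeline the datapath as deeply as the fabric allows, something plain sequential C cannot convey.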
The growing consensus around the use of ARM Cortex™-A-class CPU cores in multicore SoCs enabled a second innovation. As long as every design team wanted a different CPU, FPGA vendors had to meet these needs with soft cores implemented in the programmable logic. But that flexibility had its costs: logic-element consumption, power consumption, and lower speed.
Altera responded to a specific trend: the use of the Cortex-A9 in a growing number of embedded and wireless applications. In 2012, the company began sampling an FPGA with an on-die Hard Processor Subsystem: a dual-core Cortex-A9 cluster with its own caches, local RAM, optimized memory controller, and selected peripheral controllers, all in ASIC-style cell-based hardware. The chip architects took particular care to optimize the interconnect between the subsystem and the programmable logic fabric for implementing heterogeneous multicore systems.
This increasing convergence between multicore processor systems and FPGAs led to one more major innovation. In 2013, Altera announced that its next generation of high-end FPGAs would be fabricated not by a traditional foundry partner, but by Intel Corp., using a 14 nm Tri-Gate process whose heritage was Intel’s own CPUs and SoCs. This shift from the ASIC-oriented foundry market to the foundry arm of a CPU specialist in effect put Altera’s FPGAs on a separate power-performance trajectory, optimizing the semiconductor process characteristics that are vital to processing elements, local RAM, and high-speed interconnects, rather than optimizing across the much wider space that a broad-market ASIC foundry must serve.
Altera believes that the result of this choice will be a discontinuity in the performance and energy consumption patterns that have dominated the FPGA industry for years. It is a promising way to start a new decade.