By Ron Wilson, Editor-in-Chief, Altera Corporation
As Altera enters its fourth decade, there seems to be little agreement among experts about the future of programmable logic. Some argue that spiraling SoC development cost and shrinking product cycles are making the two obvious alternatives to FPGAs - ASICs and ASSPs - untenable, opening the door to a much wider market for programmable devices.
Experience suggests that it is pointless to guess at the future - especially if there is a risk of someone remembering what you said. And it is equally futile to ask vendor product planning teams to divulge their secret roadmaps. So for this final celebratory article we have taken a different approach. We have asked two luminaries from academia - Jason Cong, Chancellor’s Professor and director of the VLSI Architecture, Synthesis, and Technology Laboratory at the University of California, Los Angeles, and Jonathan Rose, successful FPGA-industry entrepreneur and professor of electrical and computer engineering at the University of Toronto - to join us in a thought experiment.
Many engineers have noticed a pattern in the evolution of PLDs. Because the devices are so flexible, they often pioneer the implementation of new applications. If an application becomes a dominant source of revenue, the devices begin to sprout new dedicated hardware blocks for that application, but with the potential for use in other areas as well.
For example, during the massive Internet build-out of the late 1990s dot-com bubble, network-equipment vendors consumed huge numbers of FPGAs for packet processing. Soon after, in 2001, high-speed serial transceivers - ideal for the internal data connections in switches and routers - appeared on FPGAs. A bit later, rapid growth in signal-processing applications - such as radar and early software-defined radio - led to the inclusion of DSP-specific multiply-accumulators woven into the logic fabric of FPGAs. And early use of FPGAs as SoCs led to the first experiments with embedding hard CPU cores in the chips.
So we posed to our luminaries this question: what new applications will arise to drive FPGA sales? And how will these applications influence the architecture of the chips?
The Evolving Smart Phone
Rose at once suggested that one of the most likely emerging markets for FPGAs might be one that already exists: mobile communications. “The future may be like the past,” he says. “It’s about enabling the cellular network.”
Today, FPGAs are used primarily at two points in the cellular network, both inside base stations: to provide connectivity and signal processing for remote radio heads at the top of the towers, and to implement the mobile backhaul network adapters that connect the base station to the provider’s wide-area network. Both of these roles are likely to grow and become more demanding as 4G base stations proliferate and the industry moves toward a more software-defined 5G future. But Rose also sees opportunities beyond these existing applications, in a place where most experts aren’t looking.
Rather surprisingly, he points not to the base station but to the smart-phone handset. Small, low-power PLDs have already been used in smart phones - in the Samsung Galaxy S4, for example, Rose offered. But he has something different in mind.
“Think of the smart phone as a hub on the Internet of Things,” Rose suggests: “as a data collector that you can program to do virtually anything.” This perspective opens all sorts of possibilities for implementing control networks over short-range wireless links, providing local processor acceleration, and so forth. At least some of these new local-processing tasks would benefit from the energy savings of FPGA acceleration.
One of the ideas that particularly interests Rose is vision processing, as shown in Figure 1. Using images from the handset’s cameras, its libraries, the cloud, and nearby surveillance cameras, a smart phone could construct a virtual, enhanced image of the user’s surroundings containing far more information, organized in very different ways, than what the unaided eye would report: in other words, enhanced perception. The phone could identify and label objects and people, zoom in on details, remove extraneous clutter, even get feedback on its actions by watching the user’s face. Drawing on images from the cloud and from nearby surveillance cameras, the handset could look around corners and through walls, and function as a virtual microscope or telescope.
All of this takes a lot of processing power - memory, multipliers, and bandwidth - and the ability to move quickly between algorithms. That’s where the FPGAs come in. Their adaptability to problems in visual computing - essentially, Rose observes, the inverse of graphics processing - is well established, as is their greater energy efficiency relative to GPUs.
There are obvious objections, the first being cost. Rose suggests that if shrinking process geometries and the potential for very high volumes don’t solve this problem, the answer may lie in architectural changes, or in an embeddable programmable-logic intellectual property (IP) core. And that raises the second half of our question: what impact might this application have on FPGA architectures?
“The architectural implications are not obvious,” Rose admits. For one thing, there are many successful approaches to implementing vision-processing algorithms, including coherent clusters of accelerators, deep pipelines, and tessellated arrays of processing elements. All implement well in FPGA fabric, but each suggests different optimizations. It is possible, though, to make some useful generalizations.
“These applications need very great bandwidth,” Rose states. That means not just high I/O bandwidth to get image data into and out of the processing array. It also implies an on-chip memory hierarchy optimized to the movement of data through the system. And it brings up a less obvious issue: it is getting harder and harder - in all kinds of chips, not just FPGAs - to move enough data around fast enough on the chip. “When you have these high-bandwidth data flows, networks-on-chip (NoCs) become very compelling,” Rose observes.
There is another issue Rose emphasizes. “Processing power may not be the main concern. In many potential FPGA applications the greatest challenge designers face is how to express the system design in a way that is both natural for the application and implementable in a design flow.” Register transfer level (RTL), after all, is not widely used outside IC design circles. And it describes a particular style of logic design, not the algorithms being implemented. C, a historical programming language originally intended for a particular brand of minicomputer, is similarly less than ideal. The problem remains unsolved except in some specific application domains.
At the opposite end of the Internet from Rose’s vision-processing smart phone lies the unearthly world of data centers (Figure 2).
In this land of darkened warehouses, discussions of storage capacity use Greek prefixes you have to surreptitiously look up, and the purpose of air conditioning is to keep copper from melting. This is the natural habitat of Jason Cong’s research into FPGAs as general-purpose computing devices.
In this world, Cong explains, FPGAs are not independent devices: they work intimately with general-purpose CPUs. Cong describes an accelerator-rich environment in which rarely-executed code - generally, code that controls program flow or handles exceptions - executes on small multicore clusters of CPUs. Intensively executed code - the inner loops that handle the data - executes on optimized accelerators, sharply reducing energy consumption and, if the task is not constrained by storage bandwidth, boosting performance. “Everything is eventually about energy,” Cong emphasizes.
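To make that partition concrete, here is a minimal sketch - an illustrative example only, not a description of Cong’s actual framework. The control and exception paths, which run rarely, stay on the CPU; the data-intensive inner loop is the kind of code that would be carved out as an FPGA accelerator.

    /* Illustrative partition of a hypothetical image-filter task.
     * Rarely executed control and exception code stays on the CPU;
     * the data-intensive inner loop is the accelerator candidate.
     */
    #include <stddef.h>

    /* Inner loop: executed millions of times per frame -> accelerator candidate. */
    static void convolve_row(const float *in, const float *coeff,
                             float *out, size_t width, size_t taps)
    {
        for (size_t x = 0; x + taps <= width; ++x) {
            float acc = 0.0f;
            for (size_t k = 0; k < taps; ++k)
                acc += in[x + k] * coeff[k];   /* maps onto multiply-accumulate blocks */
            out[x] = acc;
        }
    }

    /* Control code: runs once per frame, handles setup and errors -> stays on the CPU. */
    int filter_frame(const float *frame, const float *coeff, float *result,
                     size_t width, size_t height, size_t taps)
    {
        if (!frame || !coeff || !result || taps == 0 || taps > width)
            return -1;                          /* exception path, rarely taken */
        for (size_t y = 0; y < height; ++y)
            convolve_row(frame + y * width, coeff, result + y * width, width, taps);
        return 0;
    }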
The accelerators could be either designed ahead of time for well-defined tasks like search ranking or Fourier transform calculation, or they could be synthesized from the code to be accelerated: probably, for a number of reasons, at compile time or job-initialization time rather than run time. Once created, the accelerators can be statically assigned to particular FPGAs, or they can be created and released dynamically. In either case they are sharable resources.
In many ways the dynamic systems are the most interesting. Software executes on a virtual machine in which hardware accelerators are transparent. The accelerators come and go beneath the surface, changing only execution time and energy consumption. A hierarchy of resource managers, from automatic map-reduce administration to operating-system virtualization services to hardware resource managers, labor to continuously tune the system to the task load.
The implications for the FPGA are significant, Cong says, at several different levels. At the top level sits a global acceleration manager that virtualizes the FPGA resources, composing and tracking the accelerators, memory resources, and interconnect resources at run time, Cong explains. This manager could be dedicated hardware or an unchanging region of the FPGA.
Beneath that level there are further implications for the chip, Cong continues. One, obviously, is the need for rapid partial reconfiguration, including both generation of accelerators and reconfiguration of the memory hierarchy. The need for flexibility in memory structures seems particularly demanding: at various times the same physical memory may need to be shared and coherent or private, and organized as a cache, a local scratchpad, or a streaming buffer, for example.
This flexibility is within the range of today’s FPGAs, but the need for on-the-fly reconfiguration presents challenges. For example, it would not be feasible to place, route, and close timing on an accelerator at run time with today’s technology, so, absent a revolutionary change in design flows, some sort of pre-timed blocks would be necessary. But how to interconnect the blocks?
“A coarse-grained NoC might be a solution,” Cong offers. “Or you could use a crossbar switch. Some of our results suggest that a very sparsely populated crossbar is sufficient.” Either approach would be a significant departure from the wire-segment-level routing in today’s FPGAs.
Another significant issue for the FPGA hardware is the connection between the CPU cores and the accelerators. Clearly this path demands a lot - low latency, high bandwidth, usually memory coherency, error protection, support of privileged CPU modes such as ARM’s TrustZone, and energy efficiency.
The obvious solution, perhaps, is to combine the multicore CPU cluster and the FPGA fabric on one die, as in Altera’s SoC products. But Cong points out that you can only get a modest number of relatively small CPU cores onto an FPGA die before the area becomes impractically large. Alternatively, as costs come down and reliability increases, the CPUs and fabric could be joined in a 2.5D or, eventually, 3D module. But practical considerations sometimes limit the scope of integrated solutions. In these cases it is vital to support an inter-chip connection, such as Intel’s QuickPath Interconnect (QPI), that can meet the requirements. Limitations in this channel will not only limit the potential performance of the accelerators, but will limit the granularity of tasks that can be accelerated successfully.
At the finest level of detail, the needs of the accelerators may influence the FPGA logic-cell design. In many applications there is a need for floating-point arithmetic, which may influence either logic cell structure or the nature of embedded multiply-accumulate blocks. But needs can grow in other directions as well. Cong points, for example, to the growing body of research on artificial neural networks, a technology whose application domain is still largely unmapped. “Artificial neural networks may benefit from changes in the design of the logic elements,” Cong suggests.
Finally, as with smart phones, there is the question of design flow. In the high-performance computing world, there is little interest in RTL, C-derived hardware descriptions, synthesis, or timing closure. The experts in computing are algorithm and software experts, sometimes working with legacy code that may date back to Fortran, and sometimes writing in modern, loosely functional languages such as Python.
As a group, programmers of high-performance computers seem skeptical about heterogeneous architectures. In a panel on the future of large computers at SuperComputing 2013, there was reluctant recognition that heterogeneous multiprocessing might be a necessary future. Yutong Lu of China’s National University of Defense Technology argued that the community would need a new programming model that could both exploit and hide the underlying system architecture. Pete Beckman of Argonne National Laboratory went further. “No one wants to program heterogeneous computers,” he said.
The solution might lie, some software experts suggest, in functional languages. These languages can in principle keep the program logic at high, machine-independent levels, while isolating accelerator descriptions and hardware-dependent code deep in the library hierarchy. But Cong suggests a different way of isolating machine dependencies in the software hierarchy.
“I am not sure if functional language is the future,” he says. “We actually use a coordination language called Concurrent Collections (CnC) in our research, where each computational step is described in a traditional language like C or C++, but the dependency and interaction of these steps are described separately in CnC. Different steps might be executed on different computing engines.”
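A loose sketch of that separation follows, with hypothetical names and only the flavor of CnC’s graph notation, not its exact syntax. The computational step is ordinary C; the dependences among steps live in a separate, declarative description that a runtime could use to schedule each step instance on a CPU core or an FPGA accelerator without the step code changing.

    /* The dependences are described apart from the step code, shown here as a
     * comment loosely in the style of a CnC graph file (illustrative only):
     *
     *   <row_tag>;                                      // tags: which rows to process
     *   [raw_row];  [filtered_row];                     // item collections: the data
     *   <row_tag> :: (filter_step);                     // tags prescribe step instances
     *   [raw_row] -> (filter_step) -> [filtered_row];   // data dependences
     */

    /* The step itself: pure computation on the items it is handed. */
    void filter_step(const float *raw_row, float *filtered_row, int width)
    {
        if (width <= 0)
            return;                              /* nothing to do for an empty row */
        filtered_row[0] = raw_row[0];
        filtered_row[width - 1] = raw_row[width - 1];
        for (int x = 1; x < width - 1; ++x)      /* simple 3-tap smoothing filter */
            filtered_row[x] = 0.25f * raw_row[x - 1]
                            + 0.50f * raw_row[x]
                            + 0.25f * raw_row[x + 1];
    }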
Today’s commercial OpenCL™ flows take a similar approach, separating control flow from computational kernels - both programmed in a C dialect - and then allowing kernels to be compiled or synthesized for accelerators. While all these programming models have their similarities, none is at all similar to the traditional FPGA design flow.
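As a generic illustration of that separation - not tied to any particular vendor’s flow - an OpenCL kernel holds only the data-parallel work, written in OpenCL’s C dialect. The host program, running on the CPU, keeps the control flow and launches the kernel over an index space with a call such as clEnqueueNDRangeKernel(); an FPGA flow synthesizes the same kernel source into a hardware pipeline rather than compiling it for a GPU.

    /* Generic OpenCL C kernel: only the data-parallel computation lives here.
     * The host program keeps all control flow and enqueues this kernel over an
     * N-element index space; an FPGA OpenCL flow turns it into a hardware pipeline.
     */
    __kernel void saxpy(const float alpha,
                        __global const float *x,
                        __global const float *y,
                        __global float *out)
    {
        size_t i = get_global_id(0);   /* this work-item's position in the index space */
        out[i] = alpha * x[i] + y[i];
    }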
Imagining the Future
Rose and Cong have very different research interests, but it still seems possible to find some shared points in their visions. First, the role of FPGAs in the future will transcend the past. FPGAs will no longer be used merely as logic blocks, or subsystems, or even SoCs; in these views they are becoming computing systems, parts of heterogeneous multiprocessing clusters. And they will rely increasingly on dynamic reconfigurability.
Although there may be some optimization for specific data types, that role may not require major changes to the fundamental logic fabric, nor are there any obvious next big blocks to be moved into fixed hardware. But reconfigurability does seem to argue in favor of some sort of global interconnect structure, be it NoC or crossbar, hard-wired or created from the fabric.
The other great issue appears to be the design flow. The potential new users of FPGAs are not inclined to learn traditional logic design, even with a higher-level entry point. They need to express their designs in languages natural to themselves, and generate heterogeneous systems based on that code, without the benefit of specialized hardware designers.
Here, then, are two visions of the future, each of which implies a manifest of changes. Changes, of course, require investment and risk. Are the changes worth the gain?
Each vendor in the FPGA world will answer that question on a case-by-case, return-on-investment basis. But there is a larger question as well, from society’s point of view. From Altera’s vantage point, on the brink of its fourth decade, the larger question is worth asking, too.
“There is a quotation from Martin Luther King posted on the door of my office,” Rose says. “‘… the arc of the moral universe is long, but it bends toward justice.’
“I believe the Internet has contributed to that bend toward justice. And that could not have happened without FPGAs.”
Enhanced vision, unimaginable computing tasks: we undertake our contributions without knowing the full story of where they will lead. But the tale of our first thirty years gives reason enough for another ten. Please wish us well.