Concerns about energy and power efficiency are becoming as important as performance in markets where the two traditionally have been far apart in priority, setting the stage for significant shifts both in chip architectures and in how those ICs are designed in the first place.
This shift can be seen in a growing number of applications and vertical segments. It includes mobile devices, where batteries are expected to last longer between charges, and in data centers, where the cost of powering and cooling server racks continues to rise. Unlike in the past, though, there is no single best way to achieve these goals, and approaches that were considered too expensive or difficult to implement in the past are now being re-examined. Put simply, all options are on the table, and new ones are being developed.
“In the mobile space, there’s no end in sight to the performance needed for the next generation,” said Steven Woo, fellow and distinguished inventor at Rambus. “The low power space is about maintaining the signal integrity, getting to the performance levels needed, all while improving the power efficiency. We’re seeing a lot of concepts, which formerly were used in one type of memory or one space, now are being used across lots of different types of markets. Utilizing multiple channels is a good example. While low power engineers have been coming at it primarily from the standpoint of trying to save power and staying within a certain power envelope, they are questioning what concepts can be borrowed for other types of memories.”
For example, error correction is becoming more sophisticated, and there’s a greater need for that type of technique. “All up and down through links, and the memory devices themselves as process geometries are getting smaller,” Woo said. “That means interference types of effects are challenging the reliability of the devices. So there’s more going on inside memory devices, and links to try and compensate for the more challenging reliability requirements.”
In the case of a laptop that contains a CPU and a GPU, as power is consumed, the temperature rises, and once it goes beyond a certain threshold the performance is throttled back. “Power, thermal, and performance are very tightly tied. If designs are not taking care of their power efficiency, their GPUs or the CPUs will have to run slower,” said Preeti Gupta, director of product management at Ansys.
This has a direct impact on architecture decisions. “Architecturally, there are various approaches that design teams have taken in the past, whether it is power gating early on, deciding how many different supplies are needed, when the power supplies can be shut off, dynamic frequency scaling, and others. These techniques are well-known, and design teams are increasingly applying them,” she said.
The impact of power is everywhere in a design. “We view power as a number,” said Rob Aitken, fellow and director of technology at Arm Research Group. “We look at it as how many watts a core, chip, or package consumes. The days of creating technology to be as fast as possible no matter the power consumption are long gone. However, this means power is often relegated to secondary status as a constraint on design, rather than something to be optimized. But to get to the highest levels of performance, we have to think about power delivery as the flip side of performance. Physically, computation requires charging and discharging capacitances, and doing that faster requires a source for that charge and the ability to deliver it where it’s needed and when it’s needed. That, in turn, requires understanding the dynamic behavior of the power delivery network, its inductive properties, its capacitive stores and their distance to logic, the frequency properties of the waves traveling in the network, and more. A circuit’s power network is like a manufacturer’s supply chain. If charge delivery is delayed, everything slows down. These physical limitations are why innovative techniques like backside power delivery are gaining traction, and we’re looking for ways to maximize power delivery and logic performance simultaneously.”
Energy efficiency in the data center
Inside the data center, power has long been a concern because of the cost of powering and cooling racks of servers. Increasingly, the focus is shifting to energy efficiency, as well. Large systems companies such as Google, Amazon and Facebook have been designing their own chips for some time to improve efficiency, tightly integrating software and adding custom accelerators to make sure that nearly every compute cycle is optimized for a specific workload.
“Broken down, it’s power versus performance, or megahertz per milliwatt,” said Piyush Sancheti, senior director of system architecture at Synopsys. “That’s the constraint they’re shooting for. It’s no longer a case where you just go for performance and worry about power at the tail end of the process. It is a built-in constraint. It starts from architecture, and goes all the way down into design, implementation, and sign-off.”
Fig. 1: Low-power design flow. Source: Synopsys
Sancheti points to several reasons why this is changing. “Data centers are one of the biggest consumers of energy in the electronics ecosystem, so there are constraints at the macro level. They’re trying to keep their overall energy consumption below a certain limit. While there are green initiatives in play, at the chip level it’s essentially a constraint for scaling. For them, it’s not just about performance, but also how big those chips can get. There, energy efficiency plays a role in the ability to cool the chip, so there are heating, cooling, and package considerations. Also, power integrity is driven by the power consumption, but all of these issues have a foot in the performance, as well.”
Thermal is another consideration. The faster the chip runs, the more energy it will consume and the hotter it will get. At advanced nodes, because there are more transistors packed into a given area than at previous nodes, that creates a cooling challenge.
In the past, this was dealt with by adding margin into a design. But guard-banding increases power consumption and reduces performance because signals need to travel farther, and it takes more energy to drive those signals over longer distances and through increasingly skinny wires.
“Those guard-bands are no longer a viable option since we are running at voltages approaching threshold voltages, so we can’t guard-band for power while optimizing for performance,” Sancheti noted. “For those reasons, we’re beginning to see just about every single HPC data center user look at power and performance as cohorts, not optimized for one versus the other.”
While demand for performance continues to rise, the need for energy and power efficiency is an equally important consideration.
“Data centers are crunching data like never before, automobiles are equipped with multiple levels of electronics, and the Internet of Things is making homes and appliances even smarter,” said Mohammed Fahad, principal technical marketing engineer at Siemens EDA. “All of this has made power a prominent, first-class metric over the past few years. A decade ago, some power methodologies did not even exist in CAD flows at various customers. Over the past 10 years we have witnessed a growing need for robust power methodologies that are seamlessly integrated and effortlessly interoperable with the rest of the tool-chain, to not only help to estimate power but also to optimize it.”
All of these changes are having an impact on memory architectures, as well. Scott Durrant, product marketing manager at Synopsys, noted that for data centers, the energy consumed and the heat that systems generate are reaching the limits of what can be dissipated with air-cooled systems. “As a result, data center managers are looking for solutions that enable them to reduce the amount of energy they consume, and therefore the amount of heat they must remove from the building.”
Reduced data movement can help improve efficiency, whether that is within a chip, a package, between servers in a rack, or between servers and storage.
“Minimizing the amount of data movement that has to take place is becoming very important to them,” Durrant said. “One of the ways they’re addressing that is to have localized or distributed compute capabilities. Data processing units are an example of one type of chip that is being used to implement compute inside the network, or inside the storage, so that you don’t have to move everything to the compute server to process the data there, and then move it back. This is being implemented both at the macro scale and the micro scale, and we’re seeing this capability implemented as application-specific or optimized devices that operate on data where it resides, or as it’s passing through, rather than having to move it specifically for these types of operations.”
The use of advanced packaging, being driven by the need to minimize the amount of data movement that is required, adds another option. “If you can pack a bunch of processing capability inside a single advanced package, then you are moving data millimeters instead of meters. That makes a big difference in the amount of energy that’s required for that data movement,” he explained.
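The scale of the savings Durrant describes can be sketched with a back-of-the-envelope calculation. The per-bit energy figures below are illustrative assumptions chosen only to show the effect of distance; real values depend heavily on process, interface, and signaling technology.

```python
# Back-of-the-envelope comparison of data-movement energy over distance.
# The per-bit-per-mm figures are illustrative assumptions, not measured values.

PJ_PER_BIT_PER_MM_IN_PACKAGE = 0.1   # assumed: short, dense in-package links
PJ_PER_BIT_PER_MM_OFF_PACKAGE = 2.0  # assumed: board-level traces

def transfer_energy_joules(bytes_moved, distance_mm, pj_per_bit_per_mm):
    """Energy to move a payload a given distance at a fixed cost per bit-mm."""
    bits = bytes_moved * 8
    return bits * distance_mm * pj_per_bit_per_mm * 1e-12

# Moving 1 GB a few millimeters inside a package vs. tens of centimeters
# across a board:
in_package = transfer_energy_joules(1e9, 5, PJ_PER_BIT_PER_MM_IN_PACKAGE)
across_board = transfer_energy_joules(1e9, 300, PJ_PER_BIT_PER_MM_OFF_PACKAGE)
print(f"in-package:   {in_package:.4f} J")
print(f"across board: {across_board:.4f} J")
```

Even with these rough numbers, the in-package transfer costs orders of magnitude less energy, which is the case Durrant makes for packing processing into a single advanced package.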
With any of these techniques, the ultimate goal is to continue to improve data processing performance while keeping energy consumption at its current level.
Power vs. energy
To design efficient systems, designers are now paying attention to both power and energy. Power is the rate at which energy is consumed, while energy is the total consumed to perform a specific task, but the two definitions often get confused.
“When somebody mentions power, I always ask if they mean power or energy,” said George Wall, product marketing director, IP Group at Cadence. “Energy is power x time. Energy is usually what the customer gets told by the battery or the power meter. Energy is a combination of both the power and the performance of the system. So both power and performance impact the energy. A low-power, low-performance solution actually might be less energy-efficient than a higher-performance yet higher-power solution. Let’s say you can perform a task and the power is 1 milliwatt. If the next solution requires 20% more power, so 1.2 milliwatts, but it can do the task twice as fast, your energy is going to be better with the higher performance, higher power solution in that case.”
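Wall's arithmetic is easy to verify. The sketch below works through his numbers, assuming a task that takes 10 seconds on the lower-power solution (the duration is a hypothetical chosen for illustration; only the ratios matter).

```python
# Energy = power x time: a lower-power solution is not always the
# more energy-efficient one.

def energy_mj(power_mw, time_s):
    """Energy in millijoules for a task run at constant power."""
    return power_mw * time_s

# Solution A: 1 mW, finishes the task in 10 s (assumed duration).
# Solution B: 20% more power (1.2 mW), but twice as fast (5 s).
a = energy_mj(1.0, 10.0)   # 10 mJ
b = energy_mj(1.2, 5.0)    # 6 mJ
assert b < a  # the higher-power, higher-performance option wins on energy
```

The higher-power solution uses 40% less energy for the same task, which is exactly the tradeoff Wall describes.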
This becomes a complex balancing act in different applications. “The energy requirements vary with the duration of operation of a particular application. For example, the energy required to run a ‘need-for-speed’ game for one minute on a smartphone could be the same as making a voice call for an hour,” said Siemens EDA’s Fahad. “This means scenario-specific power and energy estimations are needed. This helps the user with an insight to design or re-design the architecture/algorithm so that the same operation can be managed within the allowed power/energy budgets.”
It becomes more complicated in more advanced designs. For example, a significant aspect in overall efficiency is glitch power, which is causing new anxieties for chip designers. “Sometimes the percentage of glitch power versus the total power could be as high as 50%,” Fahad said. “This is nothing but the sheer misuse of the energy available to operate the system. Here, power estimation tools can help, as they generate various metrics pertaining to power and energy consumption of the design. Leaving the power unaddressed until the last minute may lead to downstream issues like thermal self-heating, battery power overrun, packaging issues, etc.”
While the fundamental issues remain the same, the number of options is growing. “If you can do the same task in less time than your previous solution, even with a slightly higher energy level, the overall dynamic energy consumption is going to be lower,” said Prakash Madhvapathy, product marketing director for Cadence’s IP Group. “Higher-performing DSPs complete a task in a fraction of the cycles compared to the lower-performing DSPs. When they do that, the total dynamic energy consumption actually goes down quite a bit.”
Determining the best approach is heavily reliant on the workload. In the audio and vision world, for example, different workloads are now on the table, including traditional DSP and signal processing algorithms, along with AI and ML, which are taking center stage.
“Most of the algorithms are now moving away from pure, traditional DSP type toward more AI and ML,” Madhvapathy explained. “The data types used there are different. They’re not the same as what you would find in previously used DSP algorithms. This means we have to adapt to those kinds of workloads and those kinds of operator types. When we do that, we have to be careful not to increase the energy consumption so much that we give up too much for the extra performance.”
Static energy consumption is of particular concern. “There is more area in today’s technologies, and the lower the technology node, the higher the static energy consumption,” he said. “Static energy is nothing but wasted energy. When you’re not doing anything, and the device is on but you’re not using it, it’s still consuming static energy. You have to balance both static energy and dynamic energy. For applications like keyword spotting in devices such as earbuds, where the battery is extremely small, this has to be implemented in an energy-frugal way. We make sure that with the modicum of new instructions that we add, we’re able to achieve the higher performance for that particular task.”
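The balance Madhvapathy describes can be made concrete: leakage accrues for the whole time the device is on, so a design that burns more dynamic power but finishes sooner can come out ahead on total energy. The leakage and dynamic power figures below are illustrative assumptions, not characterized values.

```python
# Total energy = static (leakage) + dynamic energy. Leakage is paid for
# the full duration the device is powered, so finishing faster reduces
# the static share. All power figures are illustrative assumptions.

def total_energy_mj(leakage_mw, dynamic_mw, time_s):
    # Leakage and dynamic power are both drawn for the task's duration.
    return (leakage_mw + dynamic_mw) * time_s

# A slow, low-power DSP vs. a faster one that burns more dynamic power
# but finishes in half the time, paying half the leakage:
slow = total_energy_mj(leakage_mw=0.5, dynamic_mw=1.0, time_s=10.0)  # 15 mJ
fast = total_energy_mj(leakage_mw=0.5, dynamic_mw=1.8, time_s=5.0)   # 11.5 mJ
assert fast < slow
```

This is why, at advanced nodes where leakage is higher, racing through a task and then powering down can beat running slowly at low dynamic power.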
This gets very complicated as new architectural features are added, such as machine learning. “You have networks with weights sitting in main memory, and you must fetch that into your DSP and then operate on that,” he said. “Even if you design a DSP to be very low energy in running the algorithm, if you consume too much energy pulling that from main memory over your buses, you’re going to give up all of the energy savings. You’ve got to find out where in the system the energy consumption is dominant, and work on that area. It requires architectural features besides just the technology part of it to make sure that we maintain a very low energy profile overall,” Madhvapathy noted.
Tradeoffs can include such considerations as memory bandwidth required versus the performance required versus the power required, Wall noted. “Much like a lot of architectural tradeoffs have historically been done in terms of performance versus area, they have to do similar tradeoffs — performance versus power, and power for the system, not just one little piece of the system. It’s the power of the whole system accessing main memory versus accessing more localized resources. These are all factors that play into the overall power of the system.”
Much of this needs to be addressed early in the design to be effective. “Instead of looking at only small vectors representing one scenario or another small scenario, the paradigm shift now is that companies are looking at real use cases and trying to figure out that for these long real use cases, am I really saving power,” said Ansys’ Gupta. “And when we say ‘real use cases,’ we’re talking about video frames of data, a boot-up sequence, or post-silicon measurements. Why can’t we simulate that upfront to then enable the early design decisions?”
The simulations are enabled through hardware emulation, which is used for these long runs.
“Software simulators run out of steam in running very long scenarios, and hence, the hardware emulator boxes can generate these realistic application scenarios,” she explained. “That’s only one part though. The other part is how to take these long scenarios, which is hundreds of gigabytes or terabytes of data, consume it in a reasonable fashion in terms of power and thermal analysis, such that these architectural decisions can be made early on. The objective of the semiconductor design industry should be how such simulations can be done up front. We all talk about Shift Left and early visibility to what the design will do. That should enable the design decisions.”
These issues also bring many new questions and concerns to the surface for engineering teams.
“‘I have X number of power domains in my design. I pay a heavy cost for having these separate power domains during physical design and physical implementation. Do I really need that many? Am I going overboard? Or am I at risk of thermal throttling challenges? What is the right number of power domains?’ All of this gets enabled by earlier feedback,” Gupta noted.
If the data taken into account involves post-silicon measurements, how does that inform the current design being worked on?
This is where software simulation-based approaches come into play, she said. “Say you have an RTL design description very early on. You push it through a hardware emulator, and you get a real use case within hours. You take that real use case, and then those billions of cycles are analyzed for thermal effects. Then, a fast thermal solver is needed, and a fast way of bringing in that activity to create thermal maps to see, ‘Maybe I need to move this part of the logic to that part in order to distribute heat better.’ Or, ‘I need to put sensors in these different locations so that I know where to throttle back the performance.’”
Optimizations are key
As designs become increasingly heterogeneous, every component needs to be looked at.
“Since DRAMs are all based on the same basic cell technology, how do you optimize everything else that’s around it?” asked Rambus’ Woo. “How do you optimize for a low power environment? How do you optimize for a high-performance environment? If you had asked me 20 years ago, I would have said, ‘We’re not so worried about power. Power is not what I would call a first-class design constraint, so you can optimize things very differently.’ But now we’re seeing that power actually has become a first-class design constraint. Immediately, the first thought then is, ‘If power is going to be a big constraint for me, then maybe I ought to look at what the low-power experts have been doing, because that’s what they’ve been optimizing around from the get-go.’”
Developing high-end devices is very complicated and very expensive, Durrant noted, so modeling ahead of time is becoming more important all the time.
Sancheti agreed. “If you look at the HPC data center market historically, the time horizon on their time-to-market was measured in years. These days, because it’s such a competitive environment, they’re now beginning to think about designing chips like the classic SoC companies did in the past. All of this adds another layer of complexity. Today’s designs and chips are hungrier, faster, and have to contend with shrinking schedules. Because of this, all aspects of the flow — starting from software into emulation into actual architecture and front-end design — need to have a notion of power and performance.”
This means when engineering teams are looking at power analysis and optimizing the design, they are thinking about where the workloads come from, and what scenarios need to be run for power analysis and optimization. But there is no single best answer to how to achieve this.
Software plays an important role here, according to Sancheti, because from a system perspective, the real workload or scenarios are determined by the software applications, and how the power management software interacts with the rest of the hardware.
At the same time, that optimization has to happen early enough in the design flow for it to make a difference. “Power traditionally has been estimated at the gate level, but by that time it becomes nearly impossible to contain it,” Fahad said. “This has led designers to look for approaches that help to estimate and optimize power early on at the RTL level, all of which calls for a Shift Left methodology. Although power estimation at the RTL level makes logic redesign quick and easy, it nonetheless suffers from some degree of inaccuracy due to the lack of implementation-level details, like parasitic data, during the initial stages of designing. In fact, designers are sometimes more interested in trending the power numbers at the early stages of the design than measuring them very accurately. As the design matures along the cycle, and more information starts to fall into place, iterative power analysis makes it possible to close power more efficiently without impacting the project schedule. Apart from being accurate, power estimations have to be realistic, as well, and this is where we get the increased traction on software-driven power analysis using real-world scenarios running on emulators.”
Others agree. “Emulation has turned into a bridge for this,” Sancheti said. “With fast emulation systems, you can actually bring in real scenarios or workloads that the software is going to execute, and then you want to look at both the performance as well as power in the context of those workloads. It’s the same thing on architecture definition. When you’re looking at the throughput of the system, it cannot be just the throughput. Again, it has to be in the context of what throughput can be sustained for a certain power envelope or energy-efficiency targets for that particular architecture.”