12 June 2020
Optical transceiver revenue is set to more than double from $7.7bn in 2019 to about $17.7bn in 2025, a compound annual growth rate (CAGR) of 15%, reckons Yole Développement in its report ‘Optical Transceivers for Datacom & Telecom 2020’.
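As a quick check of the arithmetic behind these figures, the sketch below computes the CAGR implied by the two revenue values and the 2025 value implied by a 15% rate. The function names are illustrative, and the six-year span assumes that 2019 is the base year and 2025 the end year.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.15 means 15%)."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def project(start_value, rate, years):
    """Value after compounding `rate` once per year for `years` years."""
    return start_value * (1.0 + rate) ** years


years = 2025 - 2019                            # assumed compounding periods
implied_rate = cagr(7.7, 17.7, years)          # revenue in $bn, from the report
projected_2025 = project(7.7, 0.15, years)

print(f"Implied CAGR: {implied_rate:.1%}")
print(f"2025 revenue at 15% CAGR: ${projected_2025:.1f}bn")
```

Both directions agree to within rounding, which suggests that the quoted 15% and $17.7bn are rounded values of the same projection.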
“This growth will be driven by high-volume adoption of expensive high data rates including 400G and 800G modules by big cloud service operators,” says Martin Vallo PhD, technology & market analyst Solid-State Lighting technologies, in Yole’s Photonics, Sensing & Display division. “Therefore, such players invest more and more in new data centers and, on top of that, telecom operators have also increased their investments into the 5G networks that use wireless optical transceivers,” he adds.
High demand from data-center and telecom operators has been confirmed as follows:
- Datacom transceiver module revenue growth at about 20% CAGR will be driven by the adoption of expensive higher-data-rate optical modules, migrating from core/spine networks down to inter-rack connections.
- Telecom transceiver module revenue growth at a 5% CAGR will be driven by coherent technologies for data-center interconnect (DCI) optical transport solutions and 5G optical transceivers deployment in Asia.
The sharp difference in revenue growth is caused by lower sales expectations in 2020 due to the COVID-19 pandemic, which is also expected to limit total revenue to only a moderate rise in 2020. Indeed, COVID-19 is affecting telecommunications globally and hence sales of optical transceiver modules. However, demand from data-center operators for optical modules is very strong in China, pushed by the local government. Its strategy is focused mainly on 5G deployment and the development of cloud data centers.
“The state of the art of fiber-optic communication technologies has advanced dramatically over the past 25 years,” notes Pars Mukish, business unit manager, Solid-State Lighting (SSL) & Display at Yole. “The highest capacity of commercial fiber-optic links available in the 1990s was only 2.5-10Gb/s, while today they can carry up to 800Gb/s. The last decade of developments has enabled higher-efficiency digital communication systems and solved problems with degraded signals.”
Network traffic has been growing at an enormous pace over the decades and across all network architectures, from long-haul and mobile access to intra-data-center networks. This growth has been driven by streaming ultra-high-definition (UHD) videos (which need ever higher data throughput) and now newly emerging digital applications and services requiring fast access to the digital networks. It appears that the success of and demand for existing applications is continuously driving the scale and capacity of the underlying network infrastructure (including optical transceivers) to points where further applications are enabled, renewing the cycle.
Optical transceivers are widely used in server network cards, switches, routers and wireless base-station equipment in a variety of network architectures and applications. Distances covered start from less than 50m for server and storage interconnections in data centers and enterprise networks to more than 800km in telecom networks.
The evolution of multiple technologies has enabled transmission speeds of 400G and beyond in long-haul and metro networks. Today’s trend of migration to 400G speeds stems from cloud operators’ demand to interconnect data centers. Furthermore, the exponential increase in digital communication network capacity and the growing number of optical ports is impacting optical module technology hugely. The new form factors are increasingly universal and designed to reduce their size and thus power consumption. Inside modules, the optics and integrated circuits are getting closer together.
Silicon photonics hence may represent a key enabling technology for the further development of optical interconnect solutions needed to address growing traffic. This technology will play an important role in 500m–80km distance applications, reckons Yole. The industry is working on the heterogeneous integration of indium phosphide (InP) lasers directly onto silicon chips. The advantage is scalable integration and the elimination of cost and complexity of the optical package. Reduced efficiency and lower optical power at high temperature are typical challenges for these lasers.
“Besides increasing speed by integrating amplifiers, the higher data throughput is also achieved by integrating state-of-the-art digital signal processing chips, providing different multi-level modulation techniques such as PAM4 or QAM,” notes Eric Mounier PhD, Fellow Analyst at Yole. “Another technique to increase data rates is parallelization or multiplexing, which enables increasing capacity using parallel fibers or different wavelengths onto a single fiber,” he adds.
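To illustrate why multi-level modulation raises throughput, the sketch below maps a bit stream onto PAM4 symbols. PAM4 uses four amplitude levels, so each symbol carries two bits, twice as many as two-level signaling at the same symbol rate. The level values (-3, -1, 1, 3) and the Gray-coded bit mapping are common textbook choices assumed here for illustration; they are not taken from the Yole report.

```python
# Gray-coded mapping (assumed): neighbouring levels differ in exactly one bit.
PAM4_LEVELS = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}


def pam4_encode(bits):
    """Map pairs of bits to one of four amplitude levels."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


bits = [0, 0, 0, 1, 1, 1, 1, 0]
symbols = pam4_encode(bits)
print(symbols)                          # [-3, -1, 1, 3]
print(len(bits), "bits ->", len(symbols), "symbols")
```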
Progress in the integration of optical component technologies has led to dramatic reductions in the complexity and cost of optical transceivers. The massive growth in bandwidth has yielded a 10-100-fold decrease in cost per transmitted bit, Yole concludes.
Maximizing Edge AI Performance
Inference of convolutional neural network models is algorithmically straightforward, but to get the fastest performance for your application there are a few pitfalls to keep in mind when deploying. A number of factors make efficient inference difficult, which we will first step through before diving into specific solutions to address and resolve each. By the end of this article, you will be armed with four tools to use before building your system.
Why accelerate convolutional layers?
Broadly speaking, convolutions are all about sliding a function over something else. In the context of image data, we slide a window over pixels with three channels (RGB) and apply the same function on each window.
Fig. 1: Convolving a window over an image.
In a convolutional layer of a CNN, the function performed in every window is actually an element-wise multiplication with a matrix (necessarily of equal size) of fixed values called a filter. A set of multiple filters is also known as a convolutional kernel. The number of filters in this kernel will ultimately be the number of channels that the layer will output.
Fig. 2: In a convolutional layer, the actual function we are convolving is a series of element-wise matrix multiplications with different filters. Note: Each mathematical operation is actually a fused multiply and add (FMA) operation, also known as a ‘tensor op’.
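The sliding-window operation described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than how an inference engine implements it: it assumes a single-channel image (instead of RGB), a stride of 1 and no padding, and the image and filter values are made up.

```python
def convolve2d(image, filt):
    """Slide `filt` over `image`; each output value is the sum of the
    element-wise products inside the current window."""
    fh, fw = len(filt), len(filt[0])
    out_h = len(image) - fh + 1
    out_w = len(image[0]) - fw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(fh):
                for j in range(fw):
                    acc += image[r + i][c + j] * filt[i][j]  # one multiply-accumulate
            row.append(acc)
        output.append(row)
    return output


def conv_layer(image, filters):
    """Apply every filter to the image: one output channel per filter."""
    return [convolve2d(image, f) for f in filters]


image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [1, 0, 1, 3],
]
filters = [
    [[1, 0], [0, 1]],  # sums the main diagonal of each window
    [[1, 1], [1, 1]],  # sums every pixel in each window
]
channels = conv_layer(image, filters)
print(len(channels), "output channels")
print(channels[0])
```

Two filters produce two output channels, and every output value costs one multiply-accumulate per filter element, which is why the operation count grows so quickly with image size and filter count.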
Use fast matrix multiplication algorithms
The first and biggest challenge with CNN inference is that each layer requires a massive number of matrix multiplications, as mentioned above. The number of operations scales with the size of the image, as well as the number of filters in each layer. While there’s no way to avoid these computations, specialized inference solutions have hardware for fast matrix multiplication algorithms such as the Winograd transformation. On common 3×3 convolutional kernels, such transformations can reduce the number of multiplications needed by 2.25x! Therefore, the first and most general optimization you can make is to ensure that your deployment solution is able to leverage the advantages that fast matrix multiplication algorithms like Winograd can provide. For example, dedicated SoCs like Flex Logix’s InferX X1 have circuitry built in that can dynamically perform the transformations necessary for Winograd multiplication.
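To show where the 2.25x figure comes from, the sketch below implements the one-dimensional Winograd scheme usually written F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. Nesting it in two dimensions gives a 2×2 output tile of a 3×3 filter in 16 multiplications instead of 36. This is a plain-Python illustration of the arithmetic only; it says nothing about how the InferX X1 or any other hardware implements the transform.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter computed directly: 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """The same two outputs via Winograd F(2,3): 4 multiplications.
    The filter-side terms depend only on g, so they can be precomputed once."""
    g_sum = (g[0] + g[1] + g[2]) / 2
    g_alt = (g[0] - g[1] + g[2]) / 2
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * g_sum
    m3 = (d[2] - d[1]) * g_alt
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]


d = [1.0, 2.0, 3.0, 4.0]  # four input values (made up)
g = [1.0, 2.0, 3.0]       # three filter taps (made up)
direct = direct_f23(d, g)
fast = winograd_f23(d, g)

# Multiplications for one 2x2 output tile of a 3x3 filter:
direct_mults_2d = (2 * 2) * (3 * 3)  # 36
winograd_mults_2d = 4 * 4            # 16
reduction = direct_mults_2d / winograd_mults_2d

print(direct, fast, reduction)
```

The saving in multiplications is paid for with extra additions and with the input and output transforms, which is one reason dedicated hardware support helps.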
Quantize to lower precision data types
Just as the number of multiplications can vary dramatically between layers, so too does the amount of data that needs to be passed between layers. This data is known as activations. Neural networks are inherently approximations, and once a function has been trained in FP32 or FP16, the extra precision that these data types provide is unnecessary for inference. The process of changing the data type of a CNN is known as quantization. In common frameworks like PyTorch and TensorFlow Lite, quantization to INT8 can be accomplished after training with a tiny fraction of the data required for training, and only a few extra lines of code. Quantizing for inference can yield an immediate 2x improvement in latency, even compared with inference in FP16!
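The core idea of INT8 quantization can be sketched without any framework. The example below uses symmetric per-tensor quantization, which is one simple scheme among several; production tools typically add calibration data, zero points or per-channel scales, and this is not the PyTorch or TensorFlow Lite API. The activation values are made up.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


activations = [0.02, -1.27, 0.635, 0.9, -0.4]
q, scale = quantize_int8(activations)
restored = dequantize(q, scale)
max_error = max(abs(a - r) for a, r in zip(activations, restored))

print(q)
print(f"scale={scale:.4f}, max round-trip error={max_error:.4f}")
```

Each value now fits in one byte instead of two or four, and the round-trip error is bounded by half the scale, which is the precision being traded away.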
Choose hardware with flexibility
Next up, as inference proceeds through a CNN, each layer does a different convolution from the previous layer. Whether it’s changing the window size of the kernel or using a different number of filters, the operations that mold and shape the activations end up having different ratios of memory access to computation. An early layer may have many more computations relative to the amount of memory it requires, whereas a middle layer will be operating on very large activation data but performing only a fraction of the computations. Inherently, then, an architecture that can adapt to these changing memory access and computation patterns will have an advantage over one that does not. For example, the InferX X1 leverages Flex Logix’s eFPGA technology to dynamically reconfigure between layers to maintain an optimal datapath throughout inference. So, when looking to deploy, choose an architecture that can adapt.
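A rough way to see how the ratio of computation to memory traffic changes between layers is to count multiply-accumulates (MACs) and bytes for each one. The sketch below does this for two hypothetical layer shapes; the shapes, the stride of 1 with same-size output, the one byte per value (INT8) and the assumption that every value is moved exactly once are all simplifications chosen for illustration.

```python
def conv_layer_profile(height, width, in_ch, out_ch, kernel, bytes_per_value=1):
    """MACs and data volume for a convolution whose output has the same
    height and width as its input (stride 1 with padding is assumed)."""
    macs = height * width * in_ch * out_ch * kernel * kernel
    weight_bytes = in_ch * out_ch * kernel * kernel * bytes_per_value
    activation_bytes = height * width * (in_ch + out_ch) * bytes_per_value
    total_bytes = weight_bytes + activation_bytes
    return {"macs": macs, "bytes": total_bytes, "macs_per_byte": macs / total_bytes}


# Hypothetical layer shapes, not taken from any specific model.
early = conv_layer_profile(224, 224, in_ch=3, out_ch=64, kernel=7)
middle = conv_layer_profile(56, 56, in_ch=64, out_ch=256, kernel=1)

for name, layer in (("early", early), ("middle", middle)):
    print(f"{name}: {layer['macs']:,} MACs, {layer['bytes']:,} bytes, "
          f"{layer['macs_per_byte']:.1f} MACs per byte")
```

With these shapes the two layers differ by almost a factor of three in MACs per byte, so a datapath tuned for one is poorly matched to the other.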
Use a batch size of 1 for real-time applications
Lastly, when training models, in a process known as backpropagation, much information is generated to update the weights of the model based on each piece of training data. One way to cut down the amount of memory bandwidth required is to ‘batch’ the data and sum up the different changes to these weights over that set of data. In the context of inference, batching and calculating multiple inferences in parallel, layer by layer, can also improve throughput, but at the cost of latency. For example, in real-time applications, you will have to wait for enough data to come in before starting, and with some hardware, instead of using all the processing elements on a single job, you end up splitting the resources to process multiple inferences in parallel. If the fastest possible inference is a concern for your application, remember to infer on a batch size of 1.
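The trade-off between throughput and latency can be made concrete with a toy timing model. The sketch below assumes frames arrive at a fixed interval and that processing a batch costs a fixed overhead plus a per-item time; all four numbers are hypothetical and chosen only to show the shape of the trade-off.

```python
def batch_timing(batch_size, arrival_interval_ms, fixed_ms, per_item_ms):
    """Worst-case latency for the first frame in a batch, and peak throughput."""
    wait_ms = (batch_size - 1) * arrival_interval_ms  # waiting for the batch to fill
    compute_ms = fixed_ms + batch_size * per_item_ms  # processing the whole batch
    return {
        "worst_case_latency_ms": wait_ms + compute_ms,
        "throughput_per_s": 1000.0 * batch_size / compute_ms,
    }


# Hypothetical numbers: a 30 fps camera (about 33 ms between frames),
# 4 ms fixed overhead per batch and 2 ms of compute per frame.
single = batch_timing(1, arrival_interval_ms=33.0, fixed_ms=4.0, per_item_ms=2.0)
batched = batch_timing(8, arrival_interval_ms=33.0, fixed_ms=4.0, per_item_ms=2.0)

print("batch=1:", single)
print("batch=8:", batched)
```

In this model a batch of 8 more than doubles peak throughput, but the first frame in each batch waits for seven more frames to arrive, so its latency grows from 6 ms to 251 ms.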
Faster inference for real-time applications opens up new design possibilities and can ultimately save you and your customers not just time, but also money. As this article highlights, now you have a template you can apply to improve inference performance in your end application, whether that be for medical imaging, factory automation, ADAS, or something else entirely! Just remember these four key tools: 1) make sure you’re taking advantage of fast matrix multiplication algorithms, 2) quantize to INT8, 3) deploy on flexible hardware, and 4) use batch=1 for real-time applications. Leveraging these tools will ensure you get the fastest inference possible for your applications.
Vinay Mehta is the inference technical marketing manager at Flex Logix.
Safeguarding Data Over PCIe & CXL In Data Centers
As more devices enter the market and drive exponential growth of data in the cloud, cloud computing is going through a significant overhaul. The increasing presence of “hyperscale” cloud providers for big data and analytics, 5G for rapid IoT connectivity, and the wide use of AI for natural data processing and for extracting insights are compounding both the amount of connected data and the data vulnerability.
To keep up with the rapid data growth, designers are driving innovation in interface and storage technologies to support increased capacity and performance, as well as more acceleration and new compute architectures. High-speed interfaces like PCI Express (PCIe) 5.0/6.0 and Compute Express Link (CXL) 2.0 are proliferating:
- Faster data rates for cloud-based computing systems are setting the stage for PCIe 5.0 and PCIe 6.0, which are replacing PCIe 4.0 interfaces
- Storage/SSDs are moving to PCIe 5.0/6.0 interfaces
- Data centers that typically deal with many bandwidth-hungry devices and vast shared memory pools are moving to CXL 2.0 interfaces
How can system architects protect cloud data that contains confidential, sensitive, or critical information that can be corrupted, replaced, modified, or stolen by malicious actors? I/O interconnects need to implement security from the start of the design. With limited security, attackers might aim to profit from secrets learned, interfere with the operations of a targeted company, or obstruct a government agency. The types of hacks differ in nature and continue to evolve, like attacks from malicious peripherals delivered over PCIe links, or root access attacks to access memory of other processes to capture secrets and/or alter code execution.
In addition, industry is faced with increasing laws and regulations, such as:
- GDPR (General Data Protection Regulation) in Europe that imposes steep fines on corporations if private user data is compromised
- Health Insurance Portability and Accountability Act (HIPAA) in the US that stipulates how Personally Identifiable Information (PII) maintained by the healthcare and healthcare insurance industries should be protected from fraud and theft
- Payment Card Industry Data Security Standard, and many others
As the attacks become more sophisticated, the security standards have to continuously adapt to better protect sensitive data and communications and ultimately protect our connected world. To this end, the PCI-SIG and CXL standards organizations added security requirements like Integrity and Data Encryption to PCIe 5.0 and CXL 2.0 specifications in late 2020. Security is expected to continue to be adopted for the next generation PCIe 6.0 and CXL 3.0 interconnects as well.
PCIe and CXL security system components
Security for PCIe and CXL interfaces has two main components: 1) Authentication & Key Management, and 2) Integrity and Data Encryption (IDE), as depicted in Figure 1.
Authentication & key management
Authentication and key management include functions like authentication, attestation, measurement, identification, and key exchange, all running in a trusted execution environment / secure module.
The main reference standard for authentication and key management is the Security Protocol and Data Model (SPDM) that is managed by the Distributed Management Task Force (DMTF). SPDM defines messages, data objects and sequences for performing message exchanges between devices over various transport and physical media and enables efficient access to security capabilities and operations. The message exchanges it describes include authentication of hardware and measurement of firmware identities.
The PCI-SIG introduced two Engineering Change Notices (ECNs) for authentication and key management:
- Component Measurement and Authentication (CMA) defines how SPDM is applied to PCIe/CXL systems
- Data Object Exchange (DOE) supports data object transport over different interconnects
Integrity and Data Encryption (IDE)
IDE provides confidentiality, integrity and replay protection for Transaction Layer Packets (TLPs) for PCIe and Flow Control UnITs (FLITs) for CXL, ensuring that data on the wire is secure from observation, tampering, deletion, insertion and replay of packets. IDE is based on the AES-GCM cryptographic algorithm and receives keys from the Authentication & Key Management security component.
Reference standards:
- PCI-SIG: PCIe IDE ECN
- CXL 2.0: IDE for CXL.cache/mem protocols. CXL.io protocol refers to PCIe IDE ECN.
Fig. 1: PCIe & CXL security system level view.
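The integrity and replay-protection goals of IDE can be illustrated with a small sketch. Note the substitutions: Python’s standard library has no AES-GCM, so HMAC-SHA256 stands in for the authentication tag, and encryption (confidentiality) is omitted entirely. The packet layout, the 8-byte counter and the placeholder key are assumptions made for illustration and do not reflect the IDE wire format or its key exchange.

```python
import hashlib
import hmac

TAG_LEN = 32  # length of an HMAC-SHA256 tag in bytes


def protect(key, counter, payload):
    """Return counter || payload || tag, where the tag covers counter and payload."""
    header = counter.to_bytes(8, "big")
    tag = hmac.new(key, header + payload, hashlib.sha256).digest()
    return header + payload + tag


class Receiver:
    """Accepts packets only if the tag verifies and the counter is the next expected."""

    def __init__(self, key):
        self.key = key
        self.next_counter = 0

    def accept(self, packet):
        """Return the payload, or None if the packet was altered, replayed,
        inserted or arrived after a deleted packet."""
        header, payload, tag = packet[:8], packet[8:-TAG_LEN], packet[-TAG_LEN:]
        expected = hmac.new(self.key, header + payload, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            return None  # tampering detected
        if int.from_bytes(header, "big") != self.next_counter:
            return None  # replay, insertion or deletion detected
        self.next_counter += 1
        return payload


key = b"\x01" * 32  # placeholder; real keys come from authentication & key management
receiver = Receiver(key)
first = protect(key, 0, b"read 0x1000")
tampered = first[:8] + b"read 0x2000" + first[-TAG_LEN:]

results = [receiver.accept(tampered), receiver.accept(first), receiver.accept(first)]
print(results)  # [None, b'read 0x1000', None]
```

The modified packet fails the tag check, the genuine packet is accepted once, and sending it a second time is rejected because its counter has already been consumed.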
PCIe & CXL IDE IP solutions
When looking for PCIe and CXL solutions with security, the tradeoffs to consider are performance, latency, and area. All of this needs to be in compliance with the latest standards, of course, and backed by experts.
Things to look for include:
- Full-duplex throughput for the receive and transmit directions
- Integration with flexible data bus widths and the same clock configurations as the controllers
- Encryption, decryption, and authentication for TLPs for PCIe and FLITs for CXL, based on the AES-GCM cryptographic algorithm with 256-bit key size
- Configurable widths for cipher and hash algorithms for area and latency optimized solutions
- Inflight key refresh for seamless changes of keys in the system
- Low-latency in-order bypass mode for non-protected traffic
Fig. 2: PCIe IDE Security Module block diagram & integration with PCIe Controller.
Figure 3 depicts a CXL 2.0 IDE security module with pre-verification.
Fig. 3: DesignWare CXL IDE Security Module block diagram & integration with DesignWare CXL Controller.
With the tremendous data growth in our connected world, security is essential to protect private and sensitive information in data as it transfers across systems, including over high-performance interconnects such as PCIe and CXL.
Synopsys recently announced the industry’s first security modules for protecting data in high-performance computing SoCs that use the PCIe 5.0 or CXL 2.0 protocols. The DesignWare IDE Security Module IP for PCIe 5.0 or CXL 2.0 is already being deployed with hyperscaler cloud providers. The robust IDE Security Modules are pre-validated with controller IP for PCIe or CXL, making it faster and easier for designers to protect against data tampering and physical attacks on links while complying with the latest versions of the interconnect protocols. Synopsys’ security IP solutions help prevent a wide range of evolving threats in connected devices such as theft, tampering, side-channel attacks, malware and data breaches.
Dana Neustadter is a senior manager of product marketing for security IP at Synopsys. She holds an M.Eng. and a B.Sc. in electrical engineering from Technical University Cluj-Napoca.
New Uses For AI
AI is being embedded into an increasing number of technologies that are commonly found inside most chips, and initial results show dramatic improvements in both power and performance.
Unlike high-profile AI implementations, such as self-driving cars or natural language processing, much of this work flies well under the radar for most people. It generally takes the path of least disruption, building on or improving technology that already exists. But in addition to having a significant impact, these developments provide design teams with a baseline for understanding what AI can and cannot do well, how it behaves over time and under different environmental and operating conditions, and how it interacts with other systems.
Until recently, the bulk of AI/machine learning has been confined to the data center or specialized mil/aero applications. It has since begun migrating to the edge, which itself is just beginning to take form, driven by a rising volume of data and the need to process that data closer to the source.
Optimizing the movement of data is an obvious target across all of these markets. So much data is being generated that it is overwhelming traditional von Neumann approaches. Rather than scrap proven architectures, companies are looking at ways to reduce the flow of data back and forth between memories and processors. In-memory and near-memory compute are two such solutions that have gained attention, but adding AI into those approaches can have a significant incremental impact.
Samsung’s announcement that it is adding machine learning into the high-bandwidth memory (HBM) stack is a case in point.
“The most difficult part was how to make this as a drop-in replacement for existing DRAM without impacting any of the computing ecosystem,” said Nam Sung Kim, senior vice president of Samsung’s Memory Business Unit. “We still use existing machine learning algorithms, but this technology is about running them more efficiently. Sometimes it wasn’t feasible to run the machine learning model in the past because it required too much memory bandwidth. But with the computing unit inside the memory, now we can explore a lot more bandwidth.”
Kim said this approach allowed a 70% reduction in total system energy without any additional optimization. What makes this so valuable is that it adds a level of “intelligence” into how data is moved. That, in turn, can be paired with other technology improvements to achieve even greater power/performance efficiency. Kim estimates this can be an order of magnitude, but other technologies could push this even higher.
Fig. 1: Processing in memory software stack. Source: Samsung
“As an industry, we have to look in a few different places,” said Steven Woo, fellow and distinguished inventor at Rambus. “One of them is architectures. We have to think about what are the right ways to construct chips so they’re really targeted more toward the actual algorithms. We’ve been seeing that happen for the last four or five years. People have implemented some really neat architectures, things like systolic arrays and more targeted implementations. There are some other ones, too. We certainly know that memory systems are very, very important in the overall energy consumption. One of the things that has to happen is we have to work on making memory accesses more energy-efficient. Utilizing the PHY more effectively is an important piece. SoCs themselves are spending 25% to 40% of their power budget just on PHYs, and then the act of moving data back and forth between an SoC and a PHY, about two thirds of power being used is really just in the movement of the data. And that’s just for HBM2. For GDDR, even more of the power is spent in moving the data because it’s a higher data rate. For an equivalent bandwidth, it’s taking more power just because it’s a much higher speed signal.”
Fig. 2: Breakdown of data movement costs. Source: Rambus
Another place where this kind of approach is being utilized is network configuration and optimization. Unlike in the past, when a computer or smart phone could tap into any of a number of standards-based protocols and networks, the edge is focused on application-specific optimizations and unique implementations. Every component in the data flow needs to be optimized, sometimes across different systems that are connected together.
This is causing headaches for users, who have to integrate edge systems, as well as for vendors looking to sell a horizontal technology that can work across many vertical markets. And it is opening the door for more intelligent devices and components that can configure themselves on a network or in a package — as well as for configurable devices that can adapt to changes in algorithms used for those markets.
“It’s going to start out as software-defined hardware, but it’s going to evolve into a self-healing, self-orchestrating device that can be AI-enabled,” said Kartik Srinivasan, director of data center marketing at Xilinx. “It can say, ‘I’m going to do this level of processing for specific traffic flows,’ and do a multitude of offloads depending upon what AI is needed.”
AI/ML is proving to be very good at understanding how to prioritize and partition data based upon patterns of behavior and probabilities for where it can be best utilized. Not all data needs to be acted upon immediately, and much of it can be trashed locally.
“We’re starting to view machine learning as an optimization problem,” said Anoop Saha, senior manager for strategy and business development at Siemens EDA. “Machine learning historically has been used for pattern recognition, whether it’s supervised or unsupervised learning or reinforcement learning. The idea is that you recognize some pattern from the data that you have, and then use that to classify things to make predictions or do a cat-versus-dog identification. There are other use cases, though, such as a smart NIC card, where you didn’t find the network topology identifying how you maximize your SDN (software defined networking) network. These are not pure pattern-recognition problems, and they are very interesting for the broader industry. People are starting to use this for a variety of tasks.”
While the implementations are highly specific, general concepts are starting to come into focus across multiple markets. “It differs somewhat depending on the market segment that you’re in,” said Geoff Tate, CEO of Flex Logix. “We’re working at what we’re calling the enterprise edge for medical imaging and things like that. Our customers need high throughput, high accuracy, low cost, and low power. So you really have to have an architecture that’s better than GPUs, and we benchmarked ours at 3 to 10 times better. We do that with finer granularity, and rather than having a big matrix multiplier, we have our one-dimensional tensor processors. Those are modular, so we can combine them in different ways to do different convolution and matrix applications. That also requires a programmable interconnect, which we’ve developed. And the last thing we do is have our compute very close to memory, which minimizes latency and power. All of the computation takes place in SRAM, and then the DRAM is used for storing weights.”
AI on the edge
This modular and programmable kind of approach is often hidden in many of these designs, but the emphasis on flexibility in design and implementation is critical. More sensors, a flood of data, and a slowdown in the benefits of scaling, have forced chipmakers to pivot to more complex architectures that can drive down latency and power while boosting performance.
This is particularly true on the edge, where some of the devices are based on batteries, and in on-premises and near-premises data centers where speed is the critical factor. Solutions tend to be highly customized, heterogeneous, and often involve multiple chips in a package. So instead of a hyperscale cloud, where everything is located in one or more giant data centers, there are layers of processing based upon how quickly data needs to be acted upon and how much data needs to be processed.
The result is a massively complex data partitioning problem, because now that data has to be intelligently parsed between different servers and even between different systems. “We definitely see that trend, especially with more edge nodes on the way,” said Sandeep Krishnegowda, senior director of marketing and applications for memory solutions at Infineon. “When there’s more data coming in, you have to partition what you’re trying to accelerate. You don’t want to just send raw bits of information all the way to the cloud. It needs to be meaningful data. At the same time, you want real-time controller on the edge to actually make the inference decisions right there. All of this definitely has highlighted changes to architecture, making it more efficient at managing your traffic. But most importantly, a lot of this comes back to data and how you manage the data. And invariably a lot of that goes back to your memory and the subsystem of memory architectures.”
In addition, this becomes a routing problem because everything is connected and data is flowing back and forth.
“If you’re doing a data center chip, you’re designing at the reticle limit,” said Frank Schirrmeister, senior group director for solution marketing at Cadence. “You have an accelerator in there, different thermal aspects, and 3D-IC issues. When you move down to the wearable, you’re still dealing with equally relevant thermal power levels, and in a car you have an AI component. So this is going in all directions, and it needs a holistic approach. You need to optimize the low-power/thermal/energy activities regardless of where you are at the edge, and people will need to adapt systems for their workloads. Then it comes down to how you put these things together.”
That adds yet another level of complexity. “Initially it was, ‘I need the highest density SRAM I can get so that I can fit as many activations and weights on chip as possible,’” said Ron Lowman, strategic marketing manager for IP at Synopsys. “Other companies were saying they needed it to be as low power as possible. We had those types of solutions before, but we saw a lot of new requests specifically around AI. And then they moved to the next step where they’d say, ‘I need some customizations beyond the highest density or lowest leakage,’ because they’re combining them with specialized processing components such as memory and compute-type technologies. So there are building blocks, like primitive math blocks, DSP processors, RISC processors, and then a special neural network engine. All of those components make up the processing solution, which includes scalar, vector, and matrix multiplication, and memory architectures that are connected to it. When we first did these processors, it was assumed that you would have some sort of external memory interface, most likely LPDDR or DDR, and so a lot of systems were built that way around those assumptions. But there are unique architectures out there with high-bandwidth memories, and that changes how loads and stores are taken from those external memory interfaces and the sizes of those. Then the customer adds their special sauce. That will continue to grow as more niches are found.”
Those niches will increase the demand for more types of hardware, but they also will drive demand for continued expansion of these base-level technologies that can be form-fitted to a particular use case.
“Our FPGAs are littered with memory across the entire device, so you can localize memory directly to the accelerator, which can be a deep learning processing unit,” said Jayson Bethurem, product line manager at Xilinx. “And because the architecture is not fixed, it can be adapted to different characterizations, and classification topologies, with CNNs and other things like that. That’s where most of the application growth is, and we see people wanting to classify something before they react to it.”
AI’s limits in end devices
AI itself is not a fixed technology. Different pieces of an AI solution are in motion as the technology adapts and optimizes, so processing results typically come in the form of distributions and probabilities of acceptability.
That makes it particularly difficult to define the precision and reliability of AI, because the metrics for each implementation and use case are different, and it’s one reason why the chip industry is treading carefully with this technology. For example, consider AI/ML in a car with assisted driving. The data inputs and decisions need to be made in real time, but the AI system needs to be able to weight the value of that data, which may be different from how another vehicle weights that data. Assuming the two vehicles don’t ever interact, that’s not a problem. But if they’re sharing information, the result can be very different.
“That’s somewhat of an open problem,” said Rob Aitken, fellow and director of technology for Arm’s Research and Development Group. “If you have a system with a given accuracy and another with a different accuracy, then cumulatively their accuracy depends on how independent they are from each other. But it also depends on what mechanism you use to combine the two. This seems to be reasonably well understood in things like image recognition, but it’s harder when you’re looking at an automotive application where you’ve got some radar data and some camera data. They’re effectively independent of one another, but their accuracies are dependent on external factors that you would have to know, in addition to everything else. So the radar may say, ‘This is a cat,’ but the camera says there’s nothing there. If it’s dark, then the radar is probably right. If it’s raining, maybe the radar is wrong, too. These external bits can come into play very, very quickly and start to overwhelm any rule of thumb.”
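The point that combined accuracy depends on the combination mechanism can be illustrated with two idealized detectors. The sketch below assumes the radar and camera are statistically independent and have fixed detection and false-alarm rates; as the quote above notes, real sensors depend on external conditions such as darkness or rain, so this is only the best-case arithmetic, and all the rates are hypothetical.

```python
def fuse_or(p_a, p_b):
    """Probability that at least one of two independent detectors fires."""
    return 1.0 - (1.0 - p_a) * (1.0 - p_b)


def fuse_and(p_a, p_b):
    """Probability that both independent detectors fire."""
    return p_a * p_b


# Hypothetical per-sensor rates.
radar_hit, camera_hit = 0.90, 0.80  # chance of detecting a real object
radar_fa, camera_fa = 0.05, 0.10    # chance of reporting an object that is not there

or_rule = (fuse_or(radar_hit, camera_hit), fuse_or(radar_fa, camera_fa))
and_rule = (fuse_and(radar_hit, camera_hit), fuse_and(radar_fa, camera_fa))

print(f"OR rule:  detection {or_rule[0]:.3f}, false alarm {or_rule[1]:.3f}")
print(f"AND rule: detection {and_rule[0]:.3f}, false alarm {and_rule[1]:.3f}")
```

Acting when either sensor fires catches more real objects but raises false alarms, while requiring both to agree does the opposite, so the same two sensors give very different system behavior depending on the rule.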
All of those interactions need to be understood in detail. “A lot of designs in automotive are highly configurable, and they’re configurable even on the fly based on the data they’re getting from sensors,” said Simon Rance, head of marketing at ClioSoft. “The data is going from those sensors back to processors. The sheer amount of data that’s running from the vehicle to the data center and back to the vehicle, all of that has to be traced. If something goes wrong, they’ve got to trace it and figure out what the root cause is. That’s where there’s a need to be filled.”
Another problem is knowing what is relevant data and what is not. “When you’re shifting AI to the edge, you shift something like a model, which means that you already know what is the relevant part of the information and what is not,” said Dirk Mayer, department head for distributed data processing and control in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “Even if you just do something like a low-pass filtering or high-pass filtering or averaging, you have something in mind that tells you, ‘Okay, this is relevant if you apply a low-pass filter, or you just need data up to 100 Hz or so.’”
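Mayer's point that even simple averaging encodes a judgment about relevance can be illustrated with a moving-average filter, a basic low-pass filter. This is a minimal sketch; the window length is the assumption that decides which variations are kept and which are discarded before the data leaves the edge device.

```python
def moving_average(samples, window):
    """Return the mean of each full window of consecutive samples."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averaged = []
    running = 0.0
    for i, value in enumerate(samples):
        running += value
        if i >= window:
            # Drop the sample that just left the window.
            running -= samples[i - window]
        if i >= window - 1:
            averaged.append(running / window)
    return averaged


# A signal that flips on every sample is flattened completely...
print(moving_average([0, 10, 0, 10, 0, 10, 0, 10], 2))
# ...while a slow ramp passes through almost unchanged.
print(moving_average([1, 2, 3, 4], 2))
```

Choosing the window is the "something in mind" step: the filter designer has already decided that the fast alternation is noise and the slow trend is signal.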
The challenge is being able to leverage that across multiple implementations of AI. “Even if you look at something basic, like a milling machine, the process is the same but the machines may be totally different,” Mayer said. “The process materials are different, the materials being milled are different, the process speed is different, and so on. It’s quite hard to invent artificial intelligence that adapts itself from one machine to another. You always need a retraining stage and time to collect new data. This will be a very interesting research area to invent something like building blocks for AI, where the algorithm is widely accepted in the industry and you can move it from this machine to that machine and it’s pre-trained. So you add domain expertise, some basic process parameters, and you can parameterize your algorithm so that it learns faster.”
That is not where the chip industry is today, however. AI and its sub-groups, machine learning and deep learning, add unique capabilities to an industry that was built on volume and mass reproducibility. While AI has been proven to be effective for certain things, such as optimizing data traffic and partitioning based upon use patterns, it has a long way to go before it can make much bigger decisions with predictable outcomes.
The early results of power reduction and performance improvements are encouraging. But they also need to be set in the context of a much broader set of systems, the rapid evolution of multiple market segments, and different approaches such as heterogeneous integration, domain-specific designs, and the limitations of data sharing across the supply chain.
SoC Integration Complexity: Size Doesn’t (Always) Matter
It’s common when talking about complexity in systems-on-chip (SoCs) to haul out monster examples: application processors, giant AI chips, and the like. Breaking with that tradition, consider an internet of things (IoT) design, which can still challenge engineers with plenty of complexity in architecture and integration. This complexity springs from two drivers: very low power consumption, even using harvested MEMS power instead of a battery, and quick turnaround to build out a huge family of products based on a common SoC platform while keeping tight control on development and unit costs.
Fig. 1: Block diagram of a low-power TI CC26xx processor. (Sources: The Linley Group, “Low-Power Design Using NoC Technology”; TI)
These always-on IoT chips need several blocks: a real-time clock to wake the system periodically so it can sense, compute, communicate, and go back to sleep; a microcontroller (MCU) for control, processing, and security features; and local memory and flash to store software. I/O is required for provisioning, debugging, and interfacing to multiple external sensors and actuators. A wireless interface such as Bluetooth Low Energy is also necessary; the initial target is warehouse applications, where relatively short-range links are sufficient.
This is already a complex SoC, and the designer hasn’t even started to think about adding more features. For a product built around this chip to run for years on a coin cell battery or a solar panel, almost all of this functionality has to be powered down most of the time. Most devices will have to be in switchable power domains and quite likely switchable voltage domains for dynamic voltage and frequency scaling (DVFS) support. A power manager is needed to control this power and voltage switching, which will have to be built/generated for this SoC. That power state controller will add control and status registers (CSRs) to ultimately connect with the embedded software stack.
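The relationship between switchable power domains and the CSRs that software sees can be sketched as a bitmask model. The domain names, bit positions, and the `PowerManager` class are hypothetical and do not describe the actual TI CC26xx register layout.

```python
from enum import IntFlag


class Domain(IntFlag):
    """Hypothetical power domains, one status bit each."""
    RTC = 1 << 0        # always-on wake-up timer
    MCU = 1 << 1
    RADIO = 1 << 2
    SENSOR_IO = 1 << 3
    FLASH = 1 << 4


ALWAYS_ON = Domain.RTC


class PowerManager:
    """Toy model of a power state controller exposing one status CSR."""

    def __init__(self):
        self.status_csr = int(ALWAYS_ON)

    def request(self, domains):
        """Power up the given domains."""
        self.status_csr |= int(domains)

    def release(self, domains):
        """Power down the given domains, never the always-on logic."""
        self.status_csr &= (~int(domains)) | int(ALWAYS_ON)

    def sleep(self):
        """Switch off everything except the always-on logic."""
        self.status_csr = int(ALWAYS_ON)

    def is_on(self, domain):
        return bool(self.status_csr & int(domain))


pm = PowerManager()
pm.request(Domain.MCU | Domain.RADIO)   # wake up to compute and transmit
pm.release(Domain.RADIO)                # radio off as soon as the packet is sent
pm.sleep()                              # back to RTC-only
print(f"{pm.status_csr:#07b}")
```

Embedded software would read and write a register like `status_csr` through the memory map, which is why the register definitions and the hardware have to stay in step.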
Fig. 2: There are ten power domains in the TI CC26xx SoC. The processor has two voltage domains in addition to always-on logic (marked with *). (Sources: The Linley Group, “Low-Power Design Using NoC Technology”; TI)
Running through this SoC is the interconnect, the on-chip communications backbone connecting all these devices, interfaces, and CSRs. Remember that interconnects consume power, too, even when idle, through clock toggling and leakage. Because they connect everything, conventional buses are either all on or all off, which isn’t great when trying to eke out extra years of battery life. Designers also need fine-grained power management within the interconnect, another capability lacking in old bus technology.
How can a design team achieve extremely low power consumption in IoT chips like these? By dumping the power-hungry bus and switching to a network-on-chip (NoC) interconnect!
Real-world production chip implementations have shown that switching to a NoC lowers overall power consumption by two to nine times compared to buses and crossbars. NoCs consume less power for two primary reasons: they occupy less die area than buses and crossbars, and they support three levels of clock gating (local, unit-level, and root), which enables sophisticated implementation of multiple power domains. For the TI IoT chips, the engineering team implemented multiple overlapping power and clock domains to meet their use cases using the least amount of power possible while limiting current draw to just 0.55 mA in idle mode. Using a NoC to reduce active and standby power allowed the team to create IoT chips that can run for over a year using a standard CR2032 coin battery.
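How coin-cell lifetime depends on the duty-cycled average current can be estimated with simple arithmetic. The sketch below uses assumed values throughout: a nominal capacity of about 225 mAh for a CR2032, and invented active current, sleep current, and duty cycle that are not TI figures. It also ignores self-discharge and pulse-load effects.

```python
HOURS_PER_DAY = 24


def average_current_ua(active_ua, sleep_ua, duty_cycle):
    """Time-weighted average current in microamps."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return duty_cycle * active_ua + (1.0 - duty_cycle) * sleep_ua


def battery_life_days(capacity_mah, avg_current_ua):
    """Idealized lifetime: capacity divided by average current."""
    hours = capacity_mah * 1000.0 / avg_current_ua
    return hours / HOURS_PER_DAY


# All values below are illustrative assumptions.
CAPACITY_MAH = 225.0   # nominal CR2032 capacity
ACTIVE_UA = 5000.0     # current while sensing, computing, transmitting
SLEEP_UA = 2.0         # current with only always-on logic powered
LEAK_UA = 75.0         # extra draw if one domain fails to power down
DUTY = 0.004           # fraction of time spent active

healthy_days = battery_life_days(
    CAPACITY_MAH, average_current_ua(ACTIVE_UA, SLEEP_UA, DUTY))
leaky_days = battery_life_days(
    CAPACITY_MAH, average_current_ua(ACTIVE_UA, SLEEP_UA + LEAK_UA, DUTY))

print(f"healthy: {healthy_days:.0f} days, leaky: {leaky_days:.0f} days")
```

Under these assumptions the healthy design lasts a little over a year, while a few tens of microamps of unintended sleep current cut that to roughly three months. This is why time spent powered down dominates the budget.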
Low power is not enough to create successful IoT chips. These markets are fickle, demanding low cost while requirements for wireless connectivity standards, sensors, displays, and actuator interfaces change constantly. Now engineers must think about variants, or derivatives, based on the initial IoT platform architecture. These can range from a narrowband internet of things (NB-IoT) wireless option for agricultural and logistics markets to an audio interface alarm and AI-based anomaly detection. It makes perfect strategic sense to create multiple derivative chips from a common architectural SoC platform, but how will this affect implementation if someone made the mistake of choosing a bus? Conventional bus structures have a disproportionate influence on the floorplan. Change a little functionality, and the floorplan may have to change considerably, resulting in a de facto “re-spin” of the chip architecture, defeating the purpose of having a platform strategy. Can an engineer anticipate all of this while still working on the baseline product? Is there a way to build more floorplan reusability into that first implementation?
A platform strategy for low-power SoCs isn’t just about the interconnect IP. As the engineer tweaks and enhances each design by adding, removing, or reconfiguring IPs, and optimizing interconnect structure and power management, the software interface to the hardware will change, too. Getting that interface exactly right is critical. A mistake here might make the device non-operational, but at least someone would figure that out quickly. More damaging to the bottom line would be a small bug that leaves a power domain on when it should have shut off: an expected one-year battery life drops to three months. A foolproof memory map can’t afford to depend on manual updates and verification. It must be generated automatically. IP-XACT-based IP deployment technology provides state-of-the-art capabilities to maintain traceability and guarantee correctness of this type of design data throughout the product lifecycle.
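The idea of generating the memory map rather than editing it by hand can be sketched as follows. This is not IP-XACT and not any vendor's tool: the block list, base address, alignment, and function names are all hypothetical, and the point is only that addresses and the software header both derive from one description.

```python
def build_memory_map(blocks, base=0x4000_0000, align=0x1000):
    """Assign aligned, non-overlapping start addresses to (name, size) blocks."""
    layout = []
    addr = base
    for name, size in blocks:
        layout.append((name, addr, size))
        # Round the next start address up to the alignment boundary.
        addr = (addr + size + align - 1) // align * align
    return layout


def check_no_overlap(layout):
    """Raise ValueError if any two address ranges overlap."""
    ordered = sorted(layout, key=lambda entry: entry[1])
    for (name_a, start_a, size_a), (name_b, start_b, _) in zip(ordered, ordered[1:]):
        if start_a + size_a > start_b:
            raise ValueError(f"{name_a} overlaps {name_b}")


def emit_c_header(layout):
    """Render base-address macros for the embedded software stack."""
    return "\n".join(
        f"#define {name.upper()}_BASE 0x{start:08X}u" for name, start, _ in layout
    )


# Hypothetical blocks for one derivative of the platform.
blocks = [("pwr_ctrl", 0x100), ("radio", 0x1800), ("sensor_io", 0x400)]
layout = build_memory_map(blocks)
check_no_overlap(layout)
print(emit_c_header(layout))
```

Adding or removing a block for a derivative chip then means changing the `blocks` list and regenerating, so the header can never silently drift from the hardware description.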
Even though these designs are small compared to mega-SoCs, there’s still plenty of complexity, and plenty of opportunity to get it wrong. At Arteris IP, we’re laser-focused on maximizing automation and optimization in SoC integration to make sure our users always get it “first time right.” Give us a call!
Kurt Shuler is vice president of marketing at Arteris IP. He is a member of the US Technical Advisory Group (TAG) to the ISO 26262/TC22/SC3/WG16 working group and helps create safety standards for semiconductors and semiconductor IP. He has extensive IP, semiconductor, and software marketing experience in the mobile, consumer, automotive, and enterprise segments, working for Intel, Texas Instruments, and four startups. Prior to his entry into technology, he flew as an air commando in the US Air Force Special Operations Forces. Shuler earned a B.S. in Aeronautical Engineering from the United States Air Force Academy and an M.B.A. from the MIT Sloan School of Management.