In February, several startups emerge from stealth, with one company working on AI inference architectures for the data center and another trying to make lenses thinner by patterning surfaces with tiny structures. Two new Chinese companies are trying to expand the country’s semiconductor design ecosystem with GPUs and interface IP. Plus, a maker of AI chips for ADAS draws another massive round this month as we take a look at 22 startups that collectively raised over $1.2B.
Semiconductors & design
GPU startup Moore Threads raised an undisclosed amount of pre-Series A funding (stated as billions of RMB; 1.0B RMB is about $155M) led by Shenzhen Capital Group, Sequoia Capital China, and GGV Capital and joined by China Merchants Capital, ByteDance, Xiaoma Zhixing, Ronghui Capital, Haisong Capital, Famous Investment, No. 1 Venture, Wuyuan Capital, Heertai, Minghao, and others. The company is focused on GPU R&D, design, and manufacturing and hopes to form the base of a complete GPU ecosystem in China. In addition to graphics, the company is targeting AI and high-performance computing tasks. Founded in 2020, Moore Threads is based in Beijing and Shanghai, China.
Storage startup Pliops drew $65.0M in venture funding led by Koch Disruptive Technologies. All current investors, including NVIDIA, State of Mind Ventures, Viola Ventures, Intel Capital, SoftBank Ventures Asia, Expon Capital, Western Digital, Xilinx, and Sweetwood Capital joined the round. Pliops makes a storage processor available as a hardware-enabled storage engine deployed in a PCIe card form factor or as a cloud service. The company says the technology can improve the cost, performance, and endurance of SSD storage and accelerate the processing of data-intensive applications in data centers. Pliops plans to scale its technology into new use cases, expand its product line, and double the size of the company by the end of 2021. Founded in 2017, Pliops is based in Ramat Gan, Israel, and has raised $105M to date.
IP startup AkroStar (Xinyaohui) closed an angel and pre-A round totaling CN¥400M (~$61.8M) led by GL Ventures, Sequoia China Capital, V Fund Management, and Gaorong Capital. Green Pine Capital Partners, state-owned Da Heng Qin Group, 5Y Capital, and Panorama Capital all joined the latest round, along with angel round investors ZhenFund and DaShu Finance. AkroStar’s focus is high-speed interface IP for 14/12nm and smaller processes. It also provides IP integration and customization services. Zeng Keqiang, the former Deputy General Manager of Synopsys China, is heading up the new company as CEO and Chairman. Founded in 2020, AkroStar is based in Zhuhai, China.
Wireless IC startup SPARK Microsystems raised CDN$17.5M (~$13.9M) in private equity funding led by Cycle Capital and joined by new investors ND Capital and Export Development Canada, as well as existing investor Real Ventures and private investors including Sanjay K. Jha (former GlobalFoundries CEO) and Paul Jacobs (former Qualcomm CEO). Recently, SPARK launched ultra-wideband (UWB) wireless transceiver ICs for short-range wireless connectivity applications such as gaming peripherals and AR/VR headsets, smart home devices, and battery-less IoT sensors. Proceeds will be used to fund high-volume manufacturing, sales ramp, and expanded R&D for next-generation products. Based in Montreal, Canada, and founded in 2016, SPARK has raised about $17.3M to date.
Signal conversion chipmaker Scalinx drew €10.5M (~$12.7M) in Series A funding led by NCI WaterStart Capital and Normandie Participations, joined by BNP Paribas Développement, CEN Innovation, and Unexo. Scalinx designs ASICs, ASSPs, and IP for wideband and low power signal conversion in applications such as wired and wireless communications, test and measurement equipment, and radar. Based in Paris, France, Scalinx was founded in 2015.
RF fabless startup Mobix Labs raised $10.0M in seed funding. The company focuses on making high frequency RF chips for mmWave 5G devices, including fully integrated, single-chip, single-die, mmWave beamformers, antenna solutions, and RF semiconductors. Based in Irvine, Calif., Mobix Labs was founded in 2020 and has total funding of $12.5M following another recent seed round.
Cambridge GaN Devices (CGD) raised $9.5M in a Series A round led by IQ Capital, Parkwalk Advisors, and BGF Ventures, with additional investment from Foresight Williams, Cambridge Enterprise, Martlet Capital, Cambridge Angels, and Cambridge Capital Group. The startup designs power transistors and ICs based on gallium nitride (GaN) to provide fast switching with low power loss. CGD is developing a range of GaN transistors customized for applications such as consumer and industrial switch-mode power supplies (SMPS), lighting, data centers, and automotive HEV/EV. Funds will be used to double staff and expand the product portfolio. A 2016 spin-out from the power device group at the University of Cambridge, CGD is based in Cambridge, UK.
AI hardware startup NeuReality emerged from stealth with an $8.0M seed round with investment from Cardumen Capital, OurCrowd, and Varana Capital. The company is working on purpose-built AI platforms, particularly for inference in the data center. NeuReality says its solution reduces the dependency on CPUs, NICs and PCI-switches and moves simple but critical data path functions from software to hardware. Current prototypes are based on Xilinx FPGAs. Founded in 2019, the company is based in Caesarea, Israel.
Wafer maker NexWafe raised €10.0M (~$12.2M) in Series B funding from Fraunhofer Venture, Saudi Aramco Energy Ventures, GAP Technology Holding, Lynwood Schweiz AG, and Bantina Invest Limited. NexWafe produces high-efficiency monocrystalline silicon wafers using in-line epitaxy. The wafers are designed as a drop-in replacement for Cz-Si wafers in solar cell production and the company says they provide higher efficiency and yield at a lower cost. The funding will allow NexWafe to begin pilot manufacturing activities. Spun out from Fraunhofer Institute for Solar Energy Systems ISE in 2015, it is based in Freiburg, Germany.
Adapdix received an undisclosed amount of venture funding from SoftBank’s Opportunity Fund. The company specializes in edge AI automation and control software for industrial equipment, with an initial focus on manufacturing customers in the semiconductor, electronics, and automotive industries. Adapdix is based in Pleasanton, Calif., and was founded in 2014. It has raised $10M to date.
Optics startup Metalenz launched from stealth with a $10.0M Series A round from 3M Ventures, Applied Ventures, Intel Capital, M Ventures, TDK Ventures, Tsingyuan Ventures, and Braemar Energy Ventures. Acting as a fabless company, Metalenz designs optical metasurfaces, or meta-optics, that use patterned sub-wavelength structures to combine the functions of several refractive lenses into a single, thin, and flat surface. Able to be fabricated using standard semiconductor processes, the company is currently applying the technology to pattern generators, 3D imaging lenses, and diffusers. The funds will be used to scale production and accelerate development. It plans to enter the end-user device market later this year. A spin out from Harvard Labs founded in 2017, Metalenz is based in Boston, Mass., and has raised $17.4M to date.
Autonomy & ADAS
AI chipmaker Horizon Robotics continued its Series C round with a $350.0M investment led by Great Wall Motors and joined by BYD Company, Changjiang Automobile Electronic, Dongfeng Motor Group, Sunny Optical Technology, Changzhou Xingyu Automotive Lighting Systems Company, CMC-SDIC Capital, CICC Capital, and Shougang Fund. The company focuses on AI chips for ADAS and autonomous driving applications and currently provides L2 and L3 solutions, with plans to release an inference chip focused on L3/L4 driving in the first half of the year. Horizon Robotics is working on development and commercialization of L4/L5 chips. Based in Beijing, China and founded in 2015, the company has raised over $1.5B.
Autonomous trucking startup Plus drew $200.0M in a Series B round led by new investors Guotai Junan International, CPE Capital, and Wanxiang International Investment and joined by existing investors including Full Truck Alliance, SAIC Capital, GSR Ventures, Sequoia Capital, China Growth Capital, Lightspeed, and Mayfield Fund. The company’s autonomy system can be installed on an existing truck or as an upfit option on new trucks by truck manufacturers. It is also partnering with heavy truck maker FAW on the J7+, beginning mass production this year for the Chinese market. Based in Cupertino, Calif., and founded in 2016, Plus has raised $300M to date.
Connected vehicle company ECARX raised $200.0M in a Series A+ round led by China Venture Capital Fund. ECARX provides a smart cockpit product with 4G/5G connectivity, infotainment, and voice assistance. It will also be working with auto electronics company Visteon on an intelligent automotive solution based on Qualcomm’s platform. The company is planning an expansion into the international market and recently set up an R&D center in Sweden. Based in Hangzhou, China, ECARX was founded in 2016 by automaker Geely as an independently operating firm.
Self-driving company Pony.ai drew $100.0M in extended Series C funding led by Ontario Teachers’ Pension Plan and joined by 5Y Capital, Brunei Investment Agency, ClearVue Partners, CPE, Eight Roads Ventures, and Fidelity China Special Situations. The company uses a combination of lidar, radar, and cameras and has a fleet of over 100 L4 robo-taxis currently being tested in Guangzhou, Shanghai, and California, and it is also testing autonomous freight delivery. Founded in 2016 and based in Fremont, Calif., and Guangzhou, China, Pony.ai has raised over $1.1B in total, over $700M of that raised last year.
Automotive positioning startup Swift Navigation raised $50.0M in Series C funding led by existing investors Forest Baskett and Greg Papadopoulos of New Enterprise Associates and existing investor Eclipse Ventures, with new investors including EPIQ Capital Group and KDDI Open Innovation Fund. The company’s Global Navigation Satellite System (GNSS) platform includes a software positioning engine that integrates with the automotive sensor suite and pulls centimeter-accurate location corrections from the company’s GNSS service. Founded in 2012 and based in San Francisco, Calif., Swift Navigation has raised $97.6M in total.
Autonomy company Perrone Robotics raised $10.0M in Series A funding from CapStone Holdings. The company offers a general-purpose robotics operating system and autonomous vehicle retrofit kit for transit vans and tractor-trailers operating in geo-fenced areas for commercial, municipal, and governmental applications. A portion of the investment will be used to establish an advanced autonomous vehicle testing facility at the American Center for Mobility in Southeast Michigan. Based in Crozet, Va., Perrone Robotics was founded in 2001.
Next.e.GO Mobile raised a €30.0M (~$36.5M) Series B round with investment from Moore Strategic Ventures and individual investors John Snow, Alejandro Agag, and Edward Norton. The company is developing an electric four-seater passenger car that will begin production in June. Founded in 2015, Next.e.GO Mobile is based in Aachen, Germany.
Electric delivery startup OX received £1.2M (~$1.7M) in grants from Innovate UK and the Advanced Propulsion Centre. The company is developing a goods-transport-as-a-service ecosystem specifically targeted at emerging markets, which will utilize the company’s electric truck designed for rough and all-terrain environments. The truck can be shipped flat-packed to its destination to reduce transport costs. In April, OX will begin pilot operations in Rwanda. Based in Warwick, UK, OX was founded as a spin-out from the Norman Trust and Global Vehicle Trust nonprofits.
Enevate raised $81.0M in a Series E round for its fast-charging lithium-ion batteries for electric vehicles, which use a silicon-dominant anode. The company says its technology provides higher energy density than graphite anodes and complete charging in five minutes. Led by Fidelity Management and Research Company and joined by Mission Ventures and Infinite Potential Technologies, the funding will be used for hiring and to expand the company’s pre-production line, which is designed to guide EV and other battery customers toward larger-scale manufacturing of the silicon anode-based batteries. Based in Irvine, Calif., and founded in 2005, Enevate has raised $191M to date.
Automotive battery maker Nyobolt raised $10.0M in a Series A round led by IQ Capital and joined by Cambridge Enterprise. The startup is focused on high power, high energy density batteries that utilize niobium-based anode materials, which the company also says makes its batteries safer. Nyobolt will use the funds to expand globally, build new facilities, and hire. Formerly named CB2Tech, Nyobolt is based in Cambridge, UK, and was founded in 2020 from research at the University of Cambridge.
Battery startup E-magy raised €5.0M (~$6.1M) in venture funding for its silicon anode for lithium-ion batteries, led by SHIFT Invest and joined by existing investors including PDENH. The anode is made up of 5-micron silicon particles with a nano-scale sponge structure, which replaces graphite as the active material. E-magy says its material is compatible with existing manufacturing processes, provides an energy density increase of 40%, and increases charging speed. Funds will be used to increase production at its fully operational pilot production line, accelerate qualification programs, hire, and begin construction on a new manufacturing facility. Founded in 2014, E-magy is based in Broek op Langedijk, the Netherlands.
Other startups received funding that will drive semiconductor design in the future, as well. The California Sustainable Energy Entrepreneur Development (CalSEED) program awarded $2.7M in grants to six clean energy startups. Each will receive $450,000:
- Antora Energy is building a low-cost thermal battery for grid-scale energy storage that combines inexpensive thermal storage media with high-efficiency thermophotovoltaic energy conversion.
- EnZinc is building a zinc micro-sponge anode technology that allows zinc to be used in high performance rechargeable batteries the company says are a safer, more cost effective alternative to lead-acid and lithium-based batteries. EnZinc recently completed a 500-cycle test on its prototype anode.
- Icarus RT is developing a hybrid PV/thermal solar-plus-storage system that converts stored waste heat to usable power during peak-demand evening hours and at night. The company says it increases PV panel efficiency by up to 12%, extends panel lifetime, and shortens the payback period to 3 years for a 100 kW system.
- ReJoule has an EV battery diagnostics tool that measures critical health metrics, allowing users to quickly grade the health of large-format lithium-ion battery packs without the need for disassembly, reducing test time and labor costs. It will provide insight into how batteries degrade, helping to improve operational efficiency.
- SiLi-ion will advance development of a “drop-in” additive for lithium-ion battery manufacturers that enables up to a 40% increase in storage capacity compared to state-of-the-art devices.
- Takachar is building small-scale, low-cost, portable equipment to convert crop and forest waste biomass in remote areas into higher-value products such as solid fuel, fertilizer, and other specialty chemicals.
A number of companies using AI in their products also raised over $100M, with several aiming to streamline business processes.
- UiPath raised $750.0M in Series F for its robotic process automation software, which aims to automate business processes and repetitive software tasks.
- Plume Design raised $270.0M in Series E for its platform that helps communications service providers to deliver connected home services.
- Highspot raised $200.0M in Series E for its sales platform that manages marketing content and guides sales reps.
- Standard Cognition raised $150.0M in Series C for its cashier-less retail checkout system that uses computer vision to tell what a shopper is buying.
- TigerGraph raised $105.0M in Series C for its graph database platform for data analytics and machine learning.
- ScienceLogic raised $105.0M in Series E for its AI-powered IT, network, and cloud management software.
Maximizing Edge AI Performance
Inference of convolutional neural network models is algorithmically straightforward, but there are a few pitfalls to keep in mind to get the fastest performance when deploying. A number of factors make efficient inference difficult; we will first step through these before diving into specific solutions to address each. By the end of this article, you will be armed with four tools to use before building your system.
Why accelerate convolutional layers?
Broadly speaking, convolutions are all about sliding a function over something else. In the context of image data, we slide a window over pixels with three channels (RGB) and apply the same function on each window.
Fig. 1: Convolving a window over an image.
In a convolutional layer of a CNN, the function performed in every window is an element-wise multiplication with a matrix of fixed values (necessarily of equal size) called a filter, followed by a sum of the products. A set of multiple filters is also known as a convolutional kernel. The number of filters in this kernel will ultimately be the number of channels that the layer outputs.
Fig. 2: In a convolutional layer, the actual function we are convolving is a series of element-wise matrix multiplications with different filters. Note: Each mathematical operation is actually a fused multiply and add (FMA) operation, also known as a ‘tensor op’.
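The windowed operation in Figures 1 and 2 can be sketched in a few lines of NumPy. This is an illustrative, unoptimized reference (the function name and toy shapes are ours, not from any framework), but it shows exactly where the per-window multiply-and-sum work comes from:

```python
import numpy as np

def conv2d(image, filters, stride=1):
    """Naive convolutional layer: slide each filter over the image and, at every
    window position, do an element-wise multiply and sum (one FMA per value)."""
    h, w, c = image.shape              # input height, width, channels (RGB -> c = 3)
    n, kh, kw, kc = filters.shape      # number of filters, filter height/width/depth
    assert c == kc, "filter depth must match image channels"
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.zeros((out_h, out_w, n))  # one output channel per filter
    for i in range(out_h):
        for j in range(out_w):
            window = image[i*stride:i*stride+kh, j*stride:j*stride+kw, :]
            for f in range(n):
                out[i, j, f] = np.sum(window * filters[f])
    return out

rgb = np.random.rand(8, 8, 3)          # toy 8x8 RGB image
kernel = np.random.rand(4, 3, 3, 3)    # a kernel of four 3x3x3 filters
y = conv2d(rgb, kernel)
print(y.shape)                          # (6, 6, 4)
```

The triple loop makes the cost obvious: the operation count grows with the image size and with the number of filters, which is why dedicated hardware targets exactly this loop nest.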
Use fast matrix multiplication algorithms
The first and biggest challenge with CNN inference is that each layer requires a massive number of matrix multiplies, as mentioned above. The number of operations scales with the size of the image, as well as the number of filters in each layer. While there’s no way to avoid these computations, specialized inference solutions have hardware for fast matrix multiplication algorithms such as the Winograd transformation. On common 3×3 convolutional kernels, such transformations can reduce the number of operations needed by 2.25x! Therefore, the first and most general optimization you can make is to ensure that your deployment solution is able to leverage the advantages that fast matrix multiplication algorithms like Winograd can provide. For example, dedicated SoCs like Flex Logix’s InferX X1 have circuitry built in that can dynamically perform the transformations necessary for Winograd multiplication.
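To see where the savings come from, here is the classic 1D Winograd case F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of the direct method’s 6. The 2D variant F(2×2,3×3) applies the same idea in both dimensions, needing 16 multiplies instead of 36, which is the 2.25x reduction cited above. This is a textbook illustration, not Flex Logix’s implementation:

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter g over inputs d[0..3]
    using 4 multiplications instead of the direct method's 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    """Direct sliding-window computation of the same two outputs: 6 multiplies."""
    return [d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
            d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0]
assert winograd_f23(d, g) == direct_f23(d, g)  # both give [4.5, 6.0]
```

Note that the filter-side factors (g0 + g1 + g2, etc.) depend only on the weights, so they can be precomputed once per model, leaving only the data-side transforms to run at inference time.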
Quantize to lower precision data types
Just as the number of multiplications can vary dramatically between layers, so too does the amount of data that needs to be passed between them. This data is known as activations. Neural networks are inherently approximations, and once a network has been trained in FP32 or FP16, the extra precision those data types provide is unnecessary for inference. The process of changing the data type of a CNN is known as quantization. In common frameworks like PyTorch and TensorFlow Lite, quantization to INT8 can be accomplished after training with only a tiny fraction of the data required for training and a few extra lines of code. Quantizing for inference can deliver an immediate 2x improvement in latency over even FP16 inference!
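As a rough sketch of what post-training quantization does under the hood, the following NumPy code maps an FP32 tensor onto INT8 using a scale and zero point computed from the observed value range. This is a simplified illustration of the general affine scheme, not the exact algorithm of PyTorch or TensorFlow Lite:

```python
import numpy as np

def quantize_int8(x):
    """Affine (asymmetric) post-training quantization to INT8: map the observed
    range [min(x), max(x)] onto the 256 integer values in [-128, 127]."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-128 - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate FP32 values from the INT8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

acts = np.random.randn(1000).astype(np.float32)  # stand-in for FP32 activations
q, s, zp = quantize_int8(acts)
err = np.max(np.abs(dequantize(q, s, zp) - acts))
assert err <= 1.5 * s  # error stays on the order of one quantization step
```

Because every value now occupies one byte instead of four, both the memory footprint and the bandwidth needed to move activations between layers drop accordingly.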
Choose hardware with flexibility
Next, as inference proceeds through a CNN, each layer performs a different convolution from the previous one. Whether it’s changing the window size of the kernel or using a different number of filters, the operations that mold and shape the activations end up having different ratios of memory access to computation. An early layer may have many more computations relative to the amount of memory it requires, whereas a middle layer may operate on very large activation data but perform only a fraction of the computations. Inherently, then, an architecture that can adapt to these changing memory and computation access patterns will have an advantage over one that does not. For example, the InferX X1 leverages Flex Logix’s eFPGA technology to dynamically reconfigure between layers to maintain an optimal datapath throughout inference. So, when looking to deploy, choose an architecture that can adapt.
Lastly, when training models, a process known as backward propagation generates a large amount of information to update the weights of the model based on each piece of training data. One way to cut down the memory bandwidth required is to ‘batch’ the data and sum the changes to these weights over that set of data. In the context of inference, batching and calculating multiple inferences in parallel, layer by layer, can also improve throughput, but at the cost of latency. In real-time applications, you will have to wait for enough data to come in before starting, and with some hardware, instead of using all the processing elements on a single job, you end up splitting the resources to process multiple inferences in parallel. If the fastest possible inference is a concern for your application, remember to infer with a batch size of 1.
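The latency/throughput tradeoff can be made concrete with a toy model. All of the timing numbers below are invented for illustration and do not describe any specific hardware; the model simply assumes weights are loaded once per batch, while each sample adds a fixed compute cost and new samples arrive at a fixed rate:

```python
def batch_tradeoff(batch, arrival_ms=10.0, load_weights_ms=8.0, compute_ms=2.0):
    """Toy model (all timings are made-up assumptions): weights load once per
    batch, each sample then costs compute_ms, samples arrive every arrival_ms."""
    wait = (batch - 1) * arrival_ms            # first sample waits for batch to fill
    process = load_weights_ms + batch * compute_ms
    latency = wait + process                   # latency seen by the first sample
    throughput = 1000.0 * batch / process      # inferences per second, pipeline full
    return latency, throughput

lat1, thr1 = batch_tradeoff(batch=1)  # 10.0 ms latency, 100 inferences/s
lat4, thr4 = batch_tradeoff(batch=4)  # 46.0 ms latency, 250 inferences/s
assert lat1 < lat4 and thr4 > thr1    # batching buys throughput, costs latency
```

Even in this crude model, batch=4 more than doubles throughput while more than quadrupling the latency of the first sample, which is exactly why real-time applications should stay at batch=1.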
Faster inference for real-time applications opens up new design possibilities and can ultimately save you and your customers not just time, but also money. As this article highlights, now you have a template you can apply to improve inference performance in your end application, whether that be for medical imaging, factory automation, ADAS, or something else entirely! Just remember these four key tools: 1) make sure you’re taking advantage of fast matrix multiplication algorithms, 2) quantize to INT8, 3) deploy on flexible hardware, and 4) use batch=1 for real-time applications. Leveraging these tools will ensure you get the fastest inference possible for your applications.
Vinay Mehta is the inference technical marketing manager at Flex Logix.
Safeguarding Data Over PCIe & CXL In Data Centers
As more devices enter the market and drive exponential growth of data in the cloud, cloud computing is going through a significant overhaul. The increasing presence of “hyperscale” cloud providers for big data and analytics, 5G for rapid IoT connectivity, and the wide use of AI for natural data processing and for extracting insights are compounding both the amount of connected data and the data vulnerability.
To keep up with the rapid data growth, designers are driving innovation in interface and storage technologies to support increased capacity and performance, as well as more acceleration and new compute architectures. High-speed interfaces like PCI Express (PCIe) 5.0/6.0 and Compute Express Link (CXL) 2.0 are proliferating:
- Faster data rates for cloud-based computing systems are setting the stage for PCIe 5.0 and PCIe 6.0, which are replacing PCIe 4.0 interfaces
- Storage/SSDs are moving to PCIe 5.0/6.0 interfaces
- Data centers that typically deal with many bandwidth-hungry devices and vast shared memory pools are moving to CXL 2.0 interfaces
How can system architects protect cloud data that contains confidential, sensitive, or critical information that can be corrupted, replaced, modified, or stolen by malicious actors? I/O interconnects need to implement security from the start of the design. With limited security, attackers might aim to profit from secrets learned, interfere with the operations of a targeted company, or obstruct a government agency. The types of hacks differ in nature and continue to evolve, like attacks from malicious peripherals delivered over PCIe links, or root access attacks to access memory of other processes to capture secrets and/or alter code execution.
In addition, industry is faced with increasing laws and regulations, such as:
- GDPR (General Data Protection Regulation) in Europe, which imposes steep fines on corporations if private user data is compromised
- Health Insurance Portability and Accountability Act (HIPAA) in the US that stipulates how Personally Identifiable Information (PII) maintained by the healthcare and healthcare insurance industries should be protected from fraud and theft
- Payment Card Industry Data Security Standard, and many others
As the attacks become more sophisticated, the security standards have to continuously adapt to better protect sensitive data and communications and ultimately protect our connected world. To this end, the PCI-SIG and CXL standards organizations added security requirements like Integrity and Data Encryption to PCIe 5.0 and CXL 2.0 specifications in late 2020. Security is expected to continue to be adopted for the next generation PCIe 6.0 and CXL 3.0 interconnects as well.
PCIe and CXL security system components
Security for PCIe and CXL interfaces has two main components: 1) Authentication & Key Management, and 2) Integrity and Data Encryption (IDE), as depicted in Figure 1.
Authentication & key management
Authentication and key management include functions like authentication, attestation, measurement, identification, and key exchange, all running in a trusted execution environment / secure module.
The main reference standard for authentication and key management is the Security Protocol and Data Model (SPDM), which is managed by the Distributed Management Task Force (DMTF). SPDM defines messages, data objects, and sequences for performing message exchanges between devices over various transport and physical media, and it enables efficient access to security capabilities and operations. These message exchanges include authentication of hardware identities and measurement of firmware.
The PCI-SIG introduced two Engineering Change Notices (ECNs) for authentication and key management:
- Component Measurement and Authentication (CMA) defines how SPDM is applied to PCIe/CXL systems
- Data Object Exchange (DOE) supports data object transport over different interconnects
Integrity and Data Encryption (IDE)
IDE provides confidentiality, integrity, and replay protection for Transaction Layer Packets (TLPs) for PCIe and Flow Control Units (FLITs) for CXL, ensuring that data on the wire is secure from observation, tampering, deletion, insertion, and replay of packets. IDE is based on the AES-GCM cryptographic algorithm and receives keys from the Authentication & Key Management security component.
- Reference standards
- PCI-SIG: PCIe IDE ECN
- CXL 2.0: IDE for CXL.cache/mem protocols. CXL.io protocol refers to PCIe IDE ECN.
Fig. 1: PCIe & CXL security system level view.
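The replay-protection half of IDE can be illustrated conceptually: each protected stream carries a monotonically increasing packet counter, and the receiver rejects any packet whose counter does not advance as expected. The sketch below is a toy model of that idea only; it is not the PCIe/CXL wire format, and in a real design the counter is bound into the AES-GCM authentication rather than checked in plain software:

```python
class IdeReplayCheck:
    """Conceptual sketch of IDE-style replay protection (not the PCIe wire
    format): each protected stream carries a monotonically increasing packet
    counter, and the receiver rejects any packet that does not advance it."""
    def __init__(self):
        self.expected = 0  # next counter value the receiver will accept

    def accept(self, counter):
        if counter != self.expected:
            return False   # replayed, reordered, or inserted packet: reject
        self.expected += 1
        return True

rx = IdeReplayCheck()
assert rx.accept(0) and rx.accept(1) and rx.accept(2)
assert not rx.accept(1)   # replay of an old packet is rejected
assert not rx.accept(4)   # a gap (deleted packet) is detected
assert rx.accept(3)       # in-order traffic continues
```

Combined with AES-GCM’s encryption and authentication tag, this counter discipline is what lets IDE detect the deletion, insertion, and replay attacks listed above, not just eavesdropping and tampering.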
PCIe & CXL IDE IP solutions
When looking for PCIe and CXL solutions with security, the tradeoffs to consider are performance, latency, and area. All of this needs to be in compliance with the latest standards, of course, and backed by experts.
Things to look for include:
- Full-duplex throughput in both the receive and transmit directions
- Integration with flexible data bus widths and the same clock configurations as the controllers
- Encryption, decryption, and authentication for TLPs for PCIe and FLITs for CXL, based on the AES-GCM cryptographic algorithm with 256-bit key size
- Configurable widths for cipher and hash algorithms for area and latency optimized solutions
- Inflight key refresh for seamless changes of keys in the system
- Low-latency in-order bypass mode for non-protected traffic
Fig. 2: PCIe IDE Security Module block diagram & integration with PCIe Controller.
Figure 3 depicts a CXL 2.0 IDE security module with pre-verification.
Fig. 3: DesignWare CXL IDE Security Module block diagram & integration with DesignWare CXL Controller.
With the tremendous data growth in our connected world, security is essential to protect private and sensitive information in data as it transfers across systems, including over high-performance interconnects such as PCIe and CXL.
Synopsys recently announced the industry’s first security modules for protecting data in high-performance computing SoCs that use the PCIe 5.0 or CXL 2.0 protocols. The DesignWare IDE Security Module IP for PCIe 5.0 or CXL 2.0 are already being deployed with hyperscaler cloud providers. The robust IDE Security Modules are pre-validated with controller IP for PCIe or CXL, making it faster and easier for designers to protect against data tampering and physical attacks on links while complying with the latest versions of the interconnect protocols. Synopsys’ security IP solutions help prevent a wide range of evolving threats in connected devices such as theft, tampering, side channels attacks, malware and data breaches.
Dana Neustadter is a senior manager of product marketing for security IP at Synopsys. She holds a M. Eng. and B. Sc. in electrical engineering from Technical University Cluj-Napoca.
New Uses For AI
AI is being embedded into an increasing number of technologies that are commonly found inside most chips, and initial results show dramatic improvements in both power and performance.
Unlike high-profile AI implementations, such as self-driving cars or natural language processing, much of this work flies well under the radar for most people. It generally takes the path of least disruption, building on or improving technology that already exists. But in addition to having a significant impact, these developments provide design teams with a baseline for understanding what AI can and cannot do well, how it behaves over time and under different environmental and operating conditions, and how it interacts with other systems.
Until recently, the bulk of AI/machine learning has been confined to the data center or specialized mil/aero applications. It has since begun migrating to the edge, which itself is just beginning to take form, driven by a rising volume of data and the need to process that data closer to the source.
Optimizing the movement of data is an obvious target across all of these markets. So much data is being generated that it is overwhelming traditional von Neumann approaches. Rather than scrap proven architectures, companies are looking at ways to reduce the flow of data back and forth between memories and processors. In-memory and near-memory compute are two such solutions that have gained attention, but adding AI into those approaches can have a significant incremental impact.
Samsung’s announcement that it is adding machine learning into the high-bandwidth memory (HBM) stack is a case in point.
“The most difficult part was how to make this as a drop-in replacement for existing DRAM without impacting any of the computing ecosystem,” said Nam Sung Kim, senior vice president of Samsung’s Memory Business Unit. “We still use existing machine learning algorithms, but this technology is about running them more efficiently. Sometimes it wasn’t feasible to run the machine learning model in the past because it required too much memory bandwidth. But with the computing unit inside the memory, now we can explore a lot more bandwidth.”
Kim said this approach allowed a 70% reduction in total system energy without any additional optimization. What makes this so valuable is that it adds a level of “intelligence” into how data is moved. That, in turn, can be paired with other technology improvements to achieve even greater power/performance efficiency. Kim estimates this can be an order of magnitude, but other technologies could push this even higher.
Fig. 1: Processing in memory software stack. Source: Samsung
“As an industry, we have to look in a few different places,” said Steven Woo, fellow and distinguished inventor at Rambus. “One of them is architectures. We have to think about what are the right ways to construct chips so they’re really targeted more toward the actual algorithms. We’ve been seeing that happen for the last four or five years. People have implemented some really neat architectures — things like systolic arrays and more targeted implementations. There are some other ones, too. We certainly know that memory systems are very, very important in the overall energy consumption. One of the things that has to happen is we have to work on making memory accesses more energy-efficient. Utilizing the PHY more effectively is an important piece. SoCs themselves are spending 25% to 40% of their power budget just on PHYs, and then the act of moving data back and forth between an SoC and a PHY — about two thirds of power being used is really just in the movement of the data. And that’s just for HBM2. For GDDR, even more of the power is spent in moving the data because it’s a higher data rate. For an equivalent bandwidth, it’s taking more power just because it’s a much higher speed signal.”
Fig. 2: Breakdown of data movement costs. Source: Rambus
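Woo's figures imply a striking bottom line, which this small arithmetic sketch makes explicit: if PHYs take 25% to 40% of an SoC's power budget and roughly two thirds of that is pure data movement, then moving bits alone consumes a sixth to a quarter of total SoC power. The two-thirds fraction is applied across the whole range here purely for illustration.

```python
# Back-of-the-envelope model of the quoted figures: PHY share of SoC
# power times the fraction of PHY power spent purely on data movement.

def data_movement_share(phy_share: float,
                        movement_fraction: float = 2 / 3) -> float:
    """Fraction of total SoC power spent just moving data."""
    return phy_share * movement_fraction

low = data_movement_share(0.25)   # lower bound of quoted PHY share
high = data_movement_share(0.40)  # upper bound of quoted PHY share
print(f"data movement consumes roughly {low:.0%} to {high:.0%} of SoC power")
```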
Another place where this kind of approach is being utilized is network configuration and optimization. Unlike in the past, when a computer or smartphone could tap into any of a number of standards-based protocols and networks, the edge is focused on application-specific optimizations and unique implementations. Every component in the data flow needs to be optimized, sometimes across different systems that are connected together.
This is causing headaches for users, who have to integrate edge systems, as well as for vendors looking to sell a horizontal technology that can work across many vertical markets. And it is opening the door for more intelligent devices and components that can configure themselves on a network or in a package — as well as for configurable devices that can adapt to changes in algorithms used for those markets.
“It’s going to start out as software-defined hardware, but it’s going to evolve into a self-healing, self-orchestrating device that can be AI-enabled,” said Kartik Srinivasan, director of data center marketing at Xilinx. “It can say, ‘I’m going to do this level of processing for specific traffic flows,’ and do a multitude of offloads depending upon what AI is needed.”
AI/ML is proving to be very good at understanding how to prioritize and partition data based upon patterns of behavior and probabilities for where it can be best utilized. Not all data needs to be acted upon immediately, and much of it can be discarded locally.
“We’re starting to view machine learning as an optimization problem,” said Anoop Saha, senior manager for strategy and business development at Siemens EDA. “Machine learning historically has been used for pattern recognition, whether it’s supervised or unsupervised learning or reinforcement learning. The idea is that you recognize some pattern from the data that you have, and then use that to classify things to make predictions or do a cat-versus-dog identification. There are other use cases, though, such as a smart NIC card, where you need to find the network topology that maximizes your SDN (software defined networking) performance. These are not pure pattern-recognition problems, and they are very interesting for the broader industry. People are starting to use this for a variety of tasks.”
While the implementations are highly specific, general concepts are starting to come into focus across multiple markets. “It differs somewhat depending on the market segment that you’re in,” said Geoff Tate, CEO of Flex Logix. “We’re working at what we’re calling the enterprise edge for medical imaging and things like that. Our customers need high throughput, high accuracy, low cost, and low power. So you really have to have an architecture that’s better than GPUs, and we benchmarked ours at 3 to 10 times better. We do that with finer granularity, and rather than having a big matrix multiplier, we have our one-dimensional tensor processors. Those are modular, so we can combine them in different ways to do different convolution and matrix applications. That also requires a programmable interconnect, which we’ve developed. And the last thing we do is have our compute very close to memory, which minimizes latency and power. All of the computation takes place in SRAM, and then the DRAM is used for storing weights.”
AI on the edge
This modular and programmable kind of approach is often hidden in many of these designs, but the emphasis on flexibility in design and implementation is critical. More sensors, a flood of data, and a slowdown in the benefits of scaling have forced chipmakers to pivot to more complex architectures that can drive down latency and power while boosting performance.
This is particularly true on the edge, where some devices run on batteries, and in on-premises and near-premises data centers where speed is the critical factor. Solutions tend to be highly customized, heterogeneous, and often involve multiple chips in a package. So instead of a hyperscale cloud, where everything is located in one or more giant data centers, there are layers of processing based upon how quickly data needs to be acted upon and how much data needs to be processed.
The result is a massively complex data partitioning problem, because now that data has to be intelligently parsed between different servers and even between different systems. “We definitely see that trend, especially with more edge nodes on the way,” said Sandeep Krishnegowda, senior director of marketing and applications for memory solutions at Infineon. “When there’s more data coming in, you have to partition what you’re trying to accelerate. You don’t want to just send raw bits of information all the way to the cloud. It needs to be meaningful data. At the same time, you want a real-time controller at the edge to actually make the inference decisions right there. All of this definitely has highlighted changes to architecture, making it more efficient at managing your traffic. But most importantly, a lot of this comes back to data and how you manage the data. And invariably a lot of that goes back to your memory and the subsystem of memory architectures.”
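The edge-side partitioning Krishnegowda describes can be sketched as a simple filter: run inference locally on every sample and forward only meaningful events upstream. The threshold, the stand-in model, and the event format below are all hypothetical.

```python
# Sketch of edge data partitioning: infer locally, upload only events.
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: float
    label: str
    confidence: float

def local_inference(sample: float) -> tuple[str, float]:
    # Stand-in for an on-device model: flag unusually high readings.
    if sample > 0.8:
        return "anomaly", min(1.0, sample)
    return "normal", 1.0 - sample

def partition(samples, ts0=0.0, min_confidence=0.9):
    """Return (events_for_cloud, n_samples_handled_locally)."""
    events, local = [], 0
    for i, s in enumerate(samples):
        label, conf = local_inference(s)
        if label == "anomaly" and conf >= min_confidence:
            events.append(Event(ts0 + i, label, conf))
        else:
            local += 1  # raw bits never leave the edge node
    return events, local

events, local = partition([0.1, 0.95, 0.3, 0.99, 0.2])
print(len(events), "events uploaded,", local, "samples handled locally")
```

Only two of the five samples cross the network in this run; the rest are resolved at the edge, which is the traffic reduction the architecture changes are chasing.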
In addition, this becomes a routing problem because everything is connected and data is flowing back and forth.
“If you’re doing a data center chip, you’re designing at the reticle limit,” said Frank Schirrmeister, senior group director for solution marketing at Cadence. “You have an accelerator in there, different thermal aspects, and 3D-IC issues. When you move down to the wearable, you’re still dealing with equally relevant thermal power levels, and in a car you have an AI component. So this is going in all directions, and it needs a holistic approach. You need to optimize the low-power/thermal/energy activities regardless of where you are at the edge, and people will need to adapt systems for their workloads. Then it comes down to how you put these things together.”
That adds yet another level of complexity. “Initially it was, ‘I need the highest density SRAM I can get so that I can fit as many activations and weights on chip as possible,’” said Ron Lowman, strategic marketing manager for IP at Synopsys. “Other companies were saying they needed it to be as low power as possible. We had those types of solutions before, but we saw a lot of new requests specifically around AI. And then they moved to the next step where they’d say, ‘I need some customizations beyond the highest density or lowest leakage,’ because they’re combining them with specialized processing components such as memory and compute-type technologies. So there are building blocks, like primitive math blocks, DSP processors, RISC processors, and then a special neural network engine. All of those components make up the processing solution, which includes scalar, vector, and matrix multiplication, and memory architectures that are connected to it. When we first did these processors, it was assumed that you would have some sort of external memory interface, most likely LPDDR or DDR, and so a lot of systems were built that way around those assumptions. But there are unique architectures out there with high-bandwidth memories, and that changes how loads and stores are taken from those external memory interfaces and the sizes of those. Then the customer adds their special sauce. That will continue to grow as more niches are found.”
Those niches will increase the demand for more types of hardware, but they also will drive demand for continued expansion of these base-level technologies that can be form-fitted to a particular use case.
“Our FPGAs are littered with memory across the entire device, so you can localize memory directly to the accelerator, which can be a deep learning processing unit,” said Jayson Bethurem, product line manager at Xilinx. “And because the architecture is not fixed, it can be adapted to different characterizations and classification topologies, with CNNs and other things like that. That’s where most of the application growth is, and we see people wanting to classify something before they react to it.”
AI’s limits in end devices
AI itself is not a fixed technology. Different pieces of an AI solution are in motion as the technology adapts and optimizes, so processing results typically come in the form of distributions and probabilities of acceptability.
That makes it particularly difficult to define the precision and reliability of AI, because the metrics for each implementation and use case are different, and it’s one reason why the chip industry is treading carefully with this technology. For example, consider AI/ML in a car with assisted driving. The data inputs and decisions need to be made in real time, but the AI system needs to be able to weight the value of that data, which may be different from how another vehicle weights that data. Assuming the two vehicles don’t ever interact, that’s not a problem. But if they’re sharing information, the result can be very different.
“That’s somewhat of an open problem,” said Rob Aitken, fellow and director of technology for Arm’s Research and Development Group. “If you have a system with a given accuracy and another with a different accuracy, then cumulatively their accuracy depends on how independent they are from each other. But it also depends on what mechanism you use to combine the two. This seems to be reasonably well understood in things like image recognition, but it’s harder when you’re looking at an automotive application where you’ve got some radar data and some camera data. They’re effectively independent of one another, but their accuracies are dependent on external factors that you would have to know, in addition to everything else. So the radar may say, ‘This is a cat,’ but the camera says there’s nothing there. If it’s dark, then the radar is probably right. If it’s raining, maybe the radar is wrong, too. These external bits can come into play very, very quickly and start to overwhelm any rule of thumb.”
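Aitken's radar-versus-camera example can be modeled as Bayesian fusion where each sensor's accuracy depends on external conditions. The sketch below assumes the sensors err independently given the true state, and every accuracy number is made up for illustration.

```python
# Sketch: fuse two detectors whose accuracies vary with conditions.

def fuse(radar_says_object: bool, camera_says_object: bool,
         radar_acc: float, camera_acc: float,
         prior: float = 0.5) -> float:
    """Posterior probability an object is present, assuming the two
    sensors err independently given the true state."""
    def likelihood(says_object: bool, acc: float, present: bool) -> float:
        correct = (says_object == present)
        return acc if correct else 1.0 - acc

    p_present, p_absent = prior, 1.0 - prior
    for says, acc in ((radar_says_object, radar_acc),
                      (camera_says_object, camera_acc)):
        p_present *= likelihood(says, acc, True)
        p_absent *= likelihood(says, acc, False)
    return p_present / (p_present + p_absent)

# Radar reports an object, camera sees nothing.
# In the dark the camera is uninformative, so the radar dominates:
print(f"dark: {fuse(True, False, radar_acc=0.9, camera_acc=0.5):.2f}")
# In rain the radar degrades too, and the fused answer flips:
print(f"rain: {fuse(True, False, radar_acc=0.6, camera_acc=0.7):.2f}")
```

The same disagreement between sensors yields opposite conclusions under different conditions, which is exactly why Aitken says simple rules of thumb get overwhelmed.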
All of those interactions need to be understood in detail. “A lot of designs in automotive are highly configurable, and they’re configurable even on the fly based on the data they’re getting from sensors,” said Simon Rance, head of marketing at ClioSoft. “The data is going from those sensors back to processors. The sheer amount of data that’s running from the vehicle to the data center and back to the vehicle, all of that has to be traced. If something goes wrong, they’ve got to trace it and figure out what the root cause is. That’s where there’s a need to be filled.”
Another problem is knowing what is relevant data and what is not. “When you’re shifting AI to the edge, you shift something like a model, which means that you already know what is the relevant part of the information and what is not,” said Dirk Mayer, department head for distributed data processing and control in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “Even if you just do something like a low-pass filtering or high-pass filtering or averaging, you have something in mind that tells you, ‘Okay, this is relevant if you apply a low-pass filter, or you just need data up to 100 Hz or so.’”
The challenge is being able to leverage that across multiple implementations of AI. “Even if you look at something basic, like a milling machine, the process is the same but the machines may be totally different,” Mayer said. “The process materials are different, the materials being milled are different, the process speed is different, and so on. It’s quite hard to invent artificial intelligence that adapts itself from one machine to another. You always need a retraining stage and time to collect new data. This will be a very interesting research area to invent something like building blocks for AI, where the algorithm is widely accepted in the industry and you can move it from this machine to that machine and it’s pre-trained. So you add domain expertise, some basic process parameters, and you can parameterize your algorithm so that it learns faster.”
That is not where the chip industry is today, however. AI and its subsets, machine learning and deep learning, add unique capabilities to an industry that was built on volume and mass reproducibility. While AI has been proven effective for certain things, such as optimizing data traffic and partitioning based upon use patterns, it has a long way to go before it can make much bigger decisions with predictable outcomes.
The early results of power reduction and performance improvements are encouraging. But they also need to be set in the context of a much broader set of systems, the rapid evolution of multiple market segments, and different approaches such as heterogeneous integration, domain-specific designs, and the limitations of data sharing across the supply chain.
SoC Integration Complexity: Size Doesn’t (Always) Matter
It’s common when talking about complexity in systems-on-chip (SoCs) to haul out monster examples: application processors, giant AI chips, and the like. Breaking with that tradition, consider an internet of things (IoT) design, which can still challenge engineers with plenty of complexity in architecture and integration. This complexity springs from two drivers: very low power consumption, even using harvested MEMS power instead of a battery, and quick turnaround to build out a huge family of products based on a common SoC platform while keeping tight control on development and unit costs.
Fig. 1: Block diagram of a low-power TI CC26xx processor. (Sources: The Linley Group, “Low-Power Design Using NoC Technology”; TI)
For these types of always-on IoT chips, a real-time clock is needed to wake the system periodically to sense, compute, communicate, and then go back to sleep; a microcontroller (MCU) for control, processing, and security features; and local memory and flash to store software. I/O is required for provisioning, debugging, and interfacing to multiple external sensors/actuators. A wireless interface, such as Bluetooth Low Energy, is also necessary; the initial target is warehouse applications, where relatively short-range links are sufficient.
This is already a complex SoC, and the designer hasn’t even started to think about adding more features. For a product built around this chip to run for years on a coin cell battery or a solar panel, almost all of this functionality has to be powered down most of the time. Most devices will have to be in switchable power domains and quite likely switchable voltage domains for dynamic voltage and frequency scaling (DVFS) support. A power manager is needed to control this power and voltage switching, which will have to be built/generated for this SoC. That power state controller will add control and status registers (CSRs) to ultimately connect with the embedded software stack.
Fig. 2: There are ten power domains in the TI CC26xx SoC. The processor has two voltage domains in addition to always-on logic (marked with *). (Sources: The Linley Group, “Low-Power Design Using NoC Technology”; TI)
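The power state controller and its CSRs can be modeled in a few lines: each switchable domain gets a control bit and a status bit that firmware reads and writes. The domain names, register offsets, and instantaneous settling behavior below are all hypothetical simplifications.

```python
# Sketch of a power-state controller model with a CSR interface.

CTRL_REG = 0x00   # write 1 to bit i to request domain i on
STAT_REG = 0x04   # bit i reads 1 once domain i is powered and stable

DOMAINS = ["mcu", "radio", "sensor_if", "flash", "debug"]

class PowerController:
    def __init__(self):
        self.regs = {CTRL_REG: 0, STAT_REG: 0}

    def write(self, addr: int, value: int) -> None:
        self.regs[addr] = value
        if addr == CTRL_REG:
            # Model: domains settle instantly; real hardware would
            # sequence isolation, retention, and clock ungating here.
            self.regs[STAT_REG] = value

    def read(self, addr: int) -> int:
        return self.regs[addr]

    def power_on(self, domain: str) -> None:
        bit = 1 << DOMAINS.index(domain)
        self.write(CTRL_REG, self.read(CTRL_REG) | bit)

    def power_off(self, domain: str) -> None:
        bit = 1 << DOMAINS.index(domain)
        self.write(CTRL_REG, self.read(CTRL_REG) & ~bit)

pmc = PowerController()
pmc.power_on("radio")          # wake the BLE block for a beacon
assert pmc.read(STAT_REG) == 1 << DOMAINS.index("radio")
pmc.power_off("radio")         # back to sleep
assert pmc.read(STAT_REG) == 0
```

The point of the model is the interface, not the sequencing: these CSRs are exactly what the embedded software stack ends up programming, which is why the register map has to stay in sync with the hardware.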
Running through this SoC is the interconnect, the on-chip communications backbone connecting all of these devices, interfaces, and CSRs. Remember that interconnects consume power, too, through clock toggling and through leakage even while quiescent. Because they connect everything, conventional buses are either all on or all off, which isn’t great when trying to eke out extra years of battery life. Designers also need fine-grained power management within the interconnect, another capability lacking in old bus technology.
How can a design team achieve extremely low power consumption in IoT chips like these? By dumping the power-hungry bus and switching to a network-on-chip (NoC) interconnect!
Real-world production chip implementations have shown that switching to a NoC lowers overall power consumption by a factor of two to nine compared to buses and crossbars. NoCs draw less power primarily because they occupy less die area than buses and crossbars, and because they support multilevel clock gating (local, unit-level, and root), which enables sophisticated implementation of multiple power domains. For the TI IoT chips, the engineering team implemented multiple overlapping power and clock domains to meet their use cases with the least amount of power possible while limiting current draw to just 0.55mA in idle mode. Using a NoC to reduce active and standby power allowed the team to create IoT chips that can run for over a year on a standard CR2032 coin battery.
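A quick duty-cycle model shows how the 0.55mA idle figure squares with multi-year life on a coin cell: the node must spend nearly all of its time in a much deeper sleep state. Only the idle current comes from the text above; the CR2032 capacity, the sleep and transmit currents, and the duty cycle are illustrative assumptions.

```python
# Rough duty-cycle battery model for an always-on IoT node.

CR2032_MAH = 225.0  # typical coin-cell capacity (assumption)

def battery_life_years(currents_ma: dict[str, float],
                       duty: dict[str, float]) -> float:
    """Battery life given per-mode currents and time fractions."""
    avg_ma = sum(currents_ma[m] * duty[m] for m in currents_ma)
    hours = CR2032_MAH / avg_ma
    return hours / (24 * 365)

# mA per mode: only "idle" is from the text; the rest are assumed.
profile = {"sleep": 0.001, "idle": 0.55, "active_tx": 6.0}
# Fraction of time in each mode (assumed): asleep almost always.
duty = {"sleep": 0.996, "idle": 0.003, "active_tx": 0.001}

print(f"{battery_life_years(profile, duty):.1f} years")
```

With these assumptions the average draw is under 10µA and the node lasts roughly three years, which is why shutting down the interconnect along with everything else matters so much.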
Low power alone is not enough to create successful IoT chips. These markets are fickle, demanding low cost while requirements for wireless connectivity standards, sensors, displays, and actuator interfaces change constantly. Now engineers must think about variants, or derivatives, based on the initial IoT platform architecture. These can range from a narrowband internet of things (NB-IoT) wireless option for agricultural and logistics markets to an audio interface for alarms and AI-based anomaly detection. It makes perfect strategic sense to create multiple derivative chips from a common SoC platform, but how does this affect implementation if someone made the mistake of choosing a bus? Conventional bus structures have a disproportionate influence on the floorplan. Change the functionality a little, and the floorplan may have to change considerably, resulting in a de facto “re-spin” of the chip architecture and defeating the purpose of a platform strategy. Can an engineer anticipate all of this while still working on the baseline product? Is there a way to build more floorplan reusability into that first implementation?
A platform strategy for low-power SoCs isn’t just about the interconnect IP. As the engineer tweaks and enhances each design by adding, removing, or reconfiguring IPs and optimizing the interconnect structure and power management, the software interface to the hardware changes, too. Getting that interface exactly right is critical. A mistake here might make the device non-operational, but at least someone would discover that quickly. More damaging to the bottom line would be a small bug that leaves a power domain on when it should have been shut off, so that an expected one-year battery life drops to three months. A foolproof memory map can’t afford to depend on manual updates and verification; it must be generated automatically. IP-XACT-based IP deployment technology provides state-of-the-art capabilities to maintain traceability and guarantee the correctness of this type of design data throughout the product lifecycle.
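The generate-don't-edit idea can be sketched simply: derive the memory map from a single machine-readable IP list (the role IP-XACT plays in practice) and check it mechanically for overlaps, so a derivative that reuses a base address fails at build time instead of in the field. The block names and addresses below are made up.

```python
# Sketch: build a memory map from one IP list and machine-check it.

def build_memory_map(blocks):
    """blocks: list of (name, base_addr, size_bytes). Returns
    {name: (first_addr, last_addr)}; raises on any overlap."""
    ordered = sorted(blocks, key=lambda b: b[1])
    for (name_a, base_a, size_a), (name_b, base_b, _) in zip(ordered, ordered[1:]):
        if base_a + size_a > base_b:
            raise ValueError(f"{name_a} overlaps {name_b}")
    return {name: (base, base + size - 1) for name, base, size in ordered}

soc = [
    ("pmc_csrs",  0x4000_0000, 0x100),   # power controller registers
    ("ble_radio", 0x4001_0000, 0x1000),
    ("sensor_if", 0x4002_0000, 0x400),
]
mmap = build_memory_map(soc)
print({k: (hex(lo), hex(hi)) for k, (lo, hi) in mmap.items()})

# A derivative that accidentally lands inside the radio region fails loudly:
try:
    build_memory_map(soc + [("audio_if", 0x4001_0080, 0x200)])
except ValueError as e:
    print("caught:", e)
```

Real IP-XACT flows carry far more (registers, fields, access rights, versioning), but the principle is the same: one source of truth, checked by tools rather than by eyeball.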
Even though these designs are small compared to mega-SoCs, there’s still plenty of complexity, yet plenty of opportunity to get it wrong. At Arteris IP, we’re laser-focused on maximizing automation and optimization in SoC integration to make sure our users always get it “first time right.” Give us a call!
Kurt Shuler is vice president of marketing at Arteris IP. He is a member of the US Technical Advisory Group (TAG) to the ISO 26262/TC22/SC3/WG16 working group and helps create safety standards for semiconductors and semiconductor IP. He has extensive IP, semiconductor, and software marketing experience in the mobile, consumer, automotive, and enterprise segments, working for Intel, Texas Instruments, and four startups. Prior to his entry into technology, he flew as an air commando in the US Air Force Special Operations Forces. Shuler earned a B.S. in Aeronautical Engineering from the United States Air Force Academy and an M.B.A. from the MIT Sloan School of Management.