SambaNova says its latest chips can best Nvidia’s A100 silicon by a wide margin, at least when it comes to machine learning workloads.
The Palo Alto-based AI startup this week revealed its DataScale systems and Cardinal SN30 accelerator, which the company claims delivers 688 TFLOPS of BF16 performance, more than double the 312 TFLOPS of Nvidia’s A100.
However, in machine learning training workloads, SambaNova says the gap is even larger. The company claims its SN30-based DataScale systems are six times faster when training a 13-billion parameter GPT model than Nvidia’s DGX A100 servers, at least according to its internal benchmarks, so take them with a healthy dose of salt.
The SN30 is manufactured on a 7nm TSMC process node, which packs 86 billion transistors into a single die. The chip itself is a little unconventional compared to other high-performance accelerators on the market today, in that it’s not a GPU, CPU, or traditional FPGA.
SambaNova describes the chip as a Reconfigurable Dataflow Unit, or RDU. “Reconfigurability is key to the architecture. So unlike a GPU or CPU that have fixed elements to it, think of this as an array of compute and memory on chip,” Marshall Choy, SVP of Product at SambaNova Systems, told The Register.
In many ways, the RDU is reminiscent of an FPGA, though as Choy points out, nowhere near as fine-grained.
According to Choy, the closest comparison is to a coarse-grained reconfigurable architecture (CGRA), which typically lacks the gate-level control of an FPGA but benefits from lower power consumption and faster reconfiguration times.
“We think of our chip and our hardware as being software defined, because we’re actually reconfiguring with each input to configure to the needs of the operator being executed,” Choy said.
For example, while the chip lacks the large matrix math engines you might find in a dedicated AI accelerator, it can reconfigure itself to achieve the same results. This is done using SambaNova’s software stack, which extracts common parallel patterns, Choy explained.
Alleviating memory bottlenecks
The SN30’s configurability is only part of the equation, with memory being the other, Choy notes.
The chip features 640MB of SRAM cache, combined with a much larger terabyte of external DRAM per socket. Choy claims this approach – a relatively small cache backed by plenty of outside DRAM capacity – allows the company’s technology to more efficiently accommodate large natural language processing (NLP) models.
SambaNova’s argument seems to be that running these big models on off-the-shelf GPUs means packing lots of processors into a system and pooling their onboard memory to hold all that data as it’s being accessed, whereas fewer SN30 chips are needed because they can store the model in their large external DDR-connected DRAM.
For example, you might have an 800GB model that needs ten 80GB Nvidia GPUs to hold it all in memory, yet fewer than ten GPUs to actually perform the task – meaning you’re wasting money, energy, and space on that unneeded silicon. You could instead do it with a few SN30s and use their sizeable external DRAM to hold the model, or so SambaNova’s logic goes.
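As a rough sketch of that sizing argument – using the figures above (an 800GB model, 80GB of memory per GPU, and roughly a terabyte of DRAM per SN30 socket); the function name and exact numbers are illustrative, not SambaNova’s:

```python
import math

def devices_needed(model_gb: float, mem_per_device_gb: float) -> int:
    """Minimum number of accelerators whose pooled memory can hold the model."""
    return math.ceil(model_gb / mem_per_device_gb)

# Holding an 800GB model entirely in memory:
print(devices_needed(800, 80))    # ten 80GB GPUs
print(devices_needed(800, 1024))  # one SN30 socket with ~1TB of DRAM
```

The extra GPUs exist purely to contribute memory capacity, which is the waste SambaNova says it avoids.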
“If you look at NLP, for example, Nvidia and everybody else just does a quick calculation. We need X amount of memory, therefore we need this many GPUs,” Choy said. “What we’ve done is architected our system to provide 12.8 times more memory than an Nvidia-based [80GB-per-GPU] system.”
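That 12.8x figure appears to fall straight out of the per-socket capacities quoted above – a terabyte (1,024GB) of DRAM per SN30 socket versus 80GB of HBM per GPU. A quick back-of-the-envelope check (our arithmetic, not SambaNova’s published methodology):

```python
sn30_dram_gb = 1024  # "a terabyte of external DRAM per socket"
gpu_hbm_gb = 80      # Nvidia's 80GB-per-GPU configuration

print(sn30_dram_gb / gpu_hbm_gb)  # 12.8
```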
Thus, SambaNova appears to be balancing whatever performance hit it suffers from using that external DRAM against that memory’s massive and comparatively cheap capacity plus the performance of its chip architecture.
“We’re seeing cases where it might take 1,400 GPUs to get the job done. We’re throwing 64 sockets at it because we’ve got 12.8 times the memory,” Choy said.
We should note that SambaNova’s approach to this problem is by no means novel. Graphcore has employed a similar combination of on-chip SRAM and external memory in its intelligence processing units. Meanwhile, Nvidia’s Grace Hopper Superchips package the company’s Arm-compatible CPU and GH100 GPU silicon with 80GB of HBM and 512GB of LPDDR5.
An AI datacenter as-a-service
Unlike Nvidia, SambaNova isn’t selling individual chips for integration into OEM systems or as PCIe cards. The SN30 is only available as part of a full system and is designed to be used with the company’s software stack.
“The smallest unit of consumption would be a complete eight-socket system from us,” Choy said.
In fact, the systems are shipped complete in racks with integrated power delivery and networking. In this regard, DataScale is more comparable to Nvidia’s DGX servers, which are designed for rack-scale deployment using the chip giant’s proprietary switches.
Four DataScale systems can be fitted into a rack, and the company claims it can scale out to as many as 48 racks in large-scale deployments.
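The scale-out arithmetic implied by those numbers (ours, not the company’s): with eight sockets per system and four systems per rack, the 64-socket configuration Choy mentioned fits in two racks, and a maximum 48-rack deployment tops out at 1,536 sockets:

```python
sockets_per_system = 8  # the smallest unit of consumption
systems_per_rack = 4
max_racks = 48

sockets_per_rack = sockets_per_system * systems_per_rack
print(sockets_per_rack)              # 32 sockets per rack
print(sockets_per_rack * max_racks)  # 1,536 sockets at maximum scale
```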
Beyond hardware and software, the company is also offering fully trained foundation models for customers that don’t have the expertise or interest necessary to develop and train their own.
According to Choy, this is a frequent request from customers who’d prefer to focus on the data science and engineering associated with refining the datasets rather than training the models.
However, AI infrastructure and software remain prohibitively expensive for many customers, with single systems often costing hundreds of thousands of dollars apiece.
Recognizing this, SambaNova plans to offer its DataScale and SambaFlow software suite as a subscription service from the start.
Choy claims the approach will enable customers to achieve a return on investment faster and with less risk than purchasing AI infrastructure outright. ®