WEBINAR: FPGAs for Real-Time Machine Learning Inference

With AI applications proliferating, many designers are looking for ways to reduce server footprints in data centers and are turning to FPGA-based accelerator cards for the job. In a 20-minute session, Salvador Alvarez, Sr. Manager of Product Planning at Achronix, provides insight into the potential of FPGAs for real-time machine learning inference, illustrating how an automatic speech recognition (ASR) application might work with acceleration.

High-level requirements for ASR

Speech recognition is a computationally intensive task and an excellent fit for machine learning (ML). Language differences aside, speakers have different inflections and accents and vary in their use of vocabulary and grammar. Still, sophisticated ML models can produce accurate speech-to-text results using cloud-based resources. Popular models include connectionist temporal classification, listen-attend-spell, and recurrent neural network transducer.
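As a concrete, much-simplified illustration of how one of these models produces text, here is a minimal sketch of greedy CTC decoding, the final step of a connectionist temporal classification model. It assumes a per-frame posterior matrix with the blank symbol at index 0, and is not drawn from the webinar:

```python
import numpy as np

# Minimal greedy CTC decoding sketch (illustrative, not the webinar's code).
# Assumes `logits` is a (time_steps, vocab_size) posterior matrix where
# index 0 is the CTC blank symbol.
def ctc_greedy_decode(logits: np.ndarray, blank: int = 0) -> list[int]:
    best_path = np.argmax(logits, axis=1)       # best symbol per frame
    decoded, prev = [], blank
    for symbol in best_path:
        if symbol != prev and symbol != blank:  # collapse repeats, drop blanks
            decoded.append(int(symbol))
        prev = symbol
    return decoded

# Example: 6 frames over a 4-symbol vocabulary (blank plus 3 tokens)
rng = np.random.default_rng(0)
frames = rng.random((6, 4))
print(ctc_greedy_decode(frames))  # token indices; map to characters downstream
```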

A deterministic, low-latency response is essential. On fast 5G or fiber networks, transit time from an edge device to the cloud and back is low enough that speech processing becomes the dominant term in response time. Interactive systems add natural language processing and text-to-speech on top of ASR. Users expect a normal conversational flow and will tolerate only short delays.
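To see why processing dominates, consider a rough response-time budget. All of the figures below are illustrative assumptions, not numbers from the webinar:

```python
# Back-of-the-envelope response-time budget for cloud ASR.
# All figures below are illustrative assumptions, not Achronix data.
network_rtt_ms   = 20    # edge-to-cloud round trip on a fast 5G/fiber link
asr_inference_ms = 150   # speech-to-text processing
nlp_tts_ms       = 80    # natural language processing + text-to-speech

total_ms = network_rtt_ms + asr_inference_ms + nlp_tts_ms
print(f"Total response time: {total_ms} ms")
print(f"ASR share: {asr_inference_ms / total_ms:.0%}")  # processing dominates
```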

Accuracy is also a must, measured as a low word error rate. Correct interpretation depends on which words appear in the conversational vocabulary. Research into ASR continues, so the flexibility to adopt new algorithms that improve speed or accuracy is a must-have for an ASR system.
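Word error rate is the standard ASR accuracy metric: the word-level edit distance between a reference transcript and the system's hypothesis, divided by the reference length. A minimal sketch, not tied to any particular ASR system:

```python
# Word error rate (WER): edit distance between reference and hypothesis
# word sequences, normalized by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("turn the lights on", "turn a light on"))  # 0.5: two errors, four words
```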

While cloud-based resources offer the potential for more processing power than most edge devices, they are not infinitely scalable without tradeoffs. Capital expenditure (CapEx) costs and energy consumption can be substantial in scaled-up, high-throughput configurations that simultaneously take speech input from many users.

FPGA-based acceleration meets the challenge

Highly parallel multiply-accumulate workloads, typical of most ML algorithms, map poorly to CPUs, so acceleration is needed to hit performance, cost, and power consumption goals. Three primary ML acceleration vehicles exist: GPUs, ASICs, and FPGAs. GPUs offer flexibility but suffer efficiency challenges that can drive power consumption through the roof. ASICs deliver tuned performance for specific workloads but limit flexibility as new models come into play.
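A quick way to see the scale of the problem: a single dense layer already performs tens of millions of independent multiply-accumulates, as the sketch below counts (the layer sizes are illustrative):

```python
import numpy as np

# ML inference is dominated by multiply-accumulate (MAC) operations.
# A dense layer of shape (in_features, out_features) applied to a batch
# performs batch * in_features * out_features MACs, and every output
# element is independent, which is ideal for parallel hardware.
batch, in_features, out_features = 32, 1024, 1024
x = np.random.rand(batch, in_features).astype(np.float32)
w = np.random.rand(in_features, out_features).astype(np.float32)

y = x @ w  # each of the 32*1024 outputs is a 1024-term MAC reduction
macs = batch * in_features * out_features
print(f"{macs:,} MACs for one layer, one batch")  # 33,554,432
```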

FPGA-based acceleration checks all the boxes. By consolidating acceleration in one server with high-performance FPGA accelerator cards, server counts drop drastically while determinism and latency improve. Flexibility for algorithm changes is excellent, requiring only a new FPGA bitstream for new model implementations. Eliminating servers reduces up-front CapEx, helps with space and power consumption, and simplifies maintenance and OpEx.

A server plus an FPGA-based accelerator for real-time machine learning inference reduces costs and energy consumption by up to 90 percent

High-performance FPGAs like the Achronix Speedster7t family have four features suited to real-time ML inference: logic blocks provide multiply-accumulate resources, high-bandwidth memory keeps data and weighting coefficients flowing, and high-speed interfaces connect to the host server platform. FPGA logic also supports a range of computational precisions, maintaining ML inference accuracy while lowering ML training requirements.
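As a sketch of what reduced-precision support buys, here is minimal symmetric int8 weight quantization. This is illustrative only, assuming a simple max-scaling scheme, and is not the Speedster7t tool flow:

```python
import numpy as np

# FPGA fabric can run MACs at reduced precision (e.g., int8) to raise
# throughput per watt. A minimal symmetric-quantization sketch, purely
# illustrative; assumes weights are not all zero.
def quantize_int8(weights: np.ndarray):
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_restored = q.astype(np.float32) * scale
print("max quantization error:", np.max(np.abs(w - w_restored)))
```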

Overlays help non-FPGA designers

Some ML developers may be less familiar with FPGA design tactics. “An overlay can optimally configure the hardware on an FPGA to create a highly-efficient engine, yet leave it software programmable,” says Alvarez. He expands on how accelerator IP from Myrtle.ai can be configured into the FPGA, abstracting the user interface, raising the clock rate, and improving hardware utilization.
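A conceptual model of the overlay idea, with hypothetical names throughout: the "hardware" side exposes a fixed set of efficient primitives, while the "software" side is an instruction stream that can change models without touching the bitstream:

```python
import numpy as np

# Conceptual model of an overlay. The FPGA implements a fixed, highly
# efficient set of primitives; software drives it with a small instruction
# stream instead of a new bitstream per model. Names and structure here
# are hypothetical, for illustration only.
ENGINE = {
    "matmul": lambda x, w: x @ w,
    "relu":   lambda x, _: np.maximum(x, 0.0),
}

def run_program(program, x):
    # `program` is the "software" side: swap it to run a different model
    # without reconfiguring the hardware primitives above.
    for op, weights in program:
        x = ENGINE[op](x, weights)
    return x

w1 = np.random.rand(16, 8).astype(np.float32)
program = [("matmul", w1), ("relu", None)]
print(run_program(program, np.random.rand(1, 16).astype(np.float32)).shape)
```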

An overlay can help configure FPGAs for real-time machine learning inference

Alvarez wraps up this webinar on FPGAs for real-time machine learning with a case study describing how an accelerated ASR appliance might work. With the proper ML training, simultaneously transcribing thousands of voice streams with dynamic language allocation becomes possible. According to Achronix:

  • One server with a 250W PCIe Speedster7t-based accelerator card can replace 20 servers without acceleration
  • Each accelerated server delivers as many as 4000 streaming speech channels
  • Costs and energy consumption both drop by up to 90% by using an accelerated server
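A quick sanity check of those figures: only the 250W card and the 20:1 replacement ratio come from the webinar; the per-server power draw below is an assumption for illustration.

```python
# Sanity check of the consolidation claims above. The per-server power
# draw is an assumed figure; only the 250 W card and the 20:1 replacement
# ratio come from the webinar.
servers_replaced   = 20
server_power_w     = 500        # assumed typical 2-socket server draw
card_power_w       = 250
channels_per_accel = 4000

before_w = servers_replaced * server_power_w              # 10,000 W
after_w  = server_power_w + card_power_w                  # 750 W
print(f"Energy reduction: {1 - after_w / before_w:.0%}")  # ~92%, in line with "up to 90%"
print(f"Channels per accelerated server: {channels_per_accel}")
```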

Although the example in this webinar is specific to ASR, the principles apply to other machine learning applications where FPGA hardware and IP accelerate inference models. When time-to-market and flexibility matter and high performance is required, FPGAs for real-time machine learning inference are a great fit. Follow the link below to see the entire webinar, including the enlightening case study discussion.

Achronix Webinar: Unlocking the Full Potential of FPGAs for Real-Time Machine Learning Inference
