Groq’s breakthrough AI chip achieves blistering 800 tokens per second on Meta’s LLaMA 3

In a surprising benchmark result that could shake up the competitive landscape for AI inference, startup chip company Groq appears to have confirmed through a series of retweets that its system is serving Meta’s newly released LLaMA 3 large language model at more than 800 tokens per second.

“We’ve been testing against their API a bit and the service is definitely not as fast as the hardware demos have shown. Probably more a software problem—still excited for Groq to be more widely used,” Dan Jakaitis, an engineer who has been benchmarking LLaMA 3 performance, posted on X (formerly known as Twitter).

But according to an X post from OthersideAI cofounder and CEO Matt Shumer, along with posts from several other prominent users, the Groq system is delivering lightning-fast inference speeds of over 800 tokens per second with the LLaMA 3 model. If independently verified, this would represent a significant leap forward compared to existing cloud AI services. VentureBeat’s own early testing shows that the claim appears to be true. (You can test it for yourself right here.)
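
For readers who want to check a throughput claim like this themselves, the measurement is simple: count the tokens a service streams back and divide by the elapsed wall-clock time. The sketch below is a minimal illustration in Python, not Groq's or VentureBeat's methodology. The function name and the `clock` parameter are hypothetical, and `token_stream` stands in for whatever iterable of tokens a given streaming client returns.

```python
import time
from typing import Callable, Iterable


def measure_tokens_per_second(
    token_stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Drain token_stream and return the observed tokens per second.

    The timer starts before the first token is requested, so the result
    includes time-to-first-token as well as steady-state generation.
    """
    start = clock()
    count = 0
    for _ in token_stream:
        count += 1
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive to compute a rate")
    return count / elapsed
```

Because the timer covers the whole request, network latency and queueing count against the result. That may be one reason an API measurement can come in below a hardware demo figure.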

A novel processor architecture optimized for AI

Groq, a well-funded Silicon Valley startup, has been developing a novel processor architecture optimized for the matrix multiplication operations that are the computational heart of deep learning. The company’s Tensor Streaming Processor eschews the caches and complex control logic of conventional CPUs and GPUs in favor of a simplified, deterministic execution model tailored for AI workloads.

By avoiding the overheads and memory bottlenecks of general-purpose processors, Groq claims it can deliver much higher performance and efficiency for AI inference. The 800 tokens per second LLaMA 3 result, if it holds up, would lend credence to that claim.

Groq’s architecture is a significant departure from the designs used by Nvidia and other established chip makers. Instead of adapting general-purpose processors for AI, Groq has built its Tensor Streaming Processor to accelerate the specific computational patterns of deep learning.

This “clean sheet” approach allows the company to strip out extraneous circuitry and optimize the data flow for the highly repetitive, parallelizable workloads of AI inference. The result, Groq asserts, is a dramatic reduction in the latency, power consumption, and cost of running large neural networks compared to mainstream alternatives.

The need for fast and efficient AI inference

A rate of 800 tokens per second works out to 48,000 tokens per minute, or roughly 500 to 600 words of text per second, depending on how many tokens an average word requires. This is nearly an order of magnitude faster than the typical inference speeds of large language models served on conventional GPUs in the cloud today.
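
The arithmetic behind those figures can be sketched in a few lines of Python. The 800 tokens per second comes from the reported benchmark. The words-per-token ratio of 0.75 is an assumption, a common rule of thumb for English text rather than a figure from Groq or Meta, and the words-per-second estimate moves with it.

```python
TOKENS_PER_SECOND = 800   # throughput reported for LLaMA 3 on Groq
WORDS_PER_TOKEN = 0.75    # assumption: rough rule of thumb for English text


def tokens_per_minute(tokens_per_second: float) -> float:
    """Convert a per-second token rate to a per-minute rate."""
    return tokens_per_second * 60


def words_per_second(tokens_per_second: float, words_per_token: float) -> float:
    """Estimate the word rate from a token rate and a words-per-token ratio."""
    return tokens_per_second * words_per_token


def seconds_to_generate(n_tokens: int, tokens_per_second: float) -> float:
    """Time needed to generate n_tokens at a steady token rate."""
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    print(tokens_per_minute(TOKENS_PER_SECOND))
    print(words_per_second(TOKENS_PER_SECOND, WORDS_PER_TOKEN))
    print(seconds_to_generate(400, TOKENS_PER_SECOND))
```

The last function is the one that matters for interactive use: at a steady 800 tokens per second, a 400-token reply would take about half a second to generate.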

Fast and efficient AI inference is becoming increasingly important as language models grow to hundreds of billions of parameters in size. While training these massive models is hugely computationally intensive, deploying them cost-effectively requires hardware that can run them quickly without consuming enormous amounts of power. This is especially true for latency-sensitive applications like chatbots, virtual assistants, and interactive experiences.

The energy efficiency of AI inference is also coming under increasing scrutiny as the technology is deployed more widely. Data centers are already significant consumers of electricity, and the computational demands of large-scale AI threaten to dramatically increase that power draw. Hardware that can deliver the necessary inference performance while minimizing energy consumption will be key to making AI sustainable at scale. Groq’s Tensor Streaming Processor is designed with this efficiency imperative in mind, promising to significantly reduce the power cost of running large neural networks compared to general-purpose processors.

Challenging Nvidia’s dominance

Nvidia currently dominates the market for AI processors, with its A100 and H100 GPUs powering the vast majority of cloud AI services. But a crop of well-funded startups like Groq, Cerebras, SambaNova and Graphcore are challenging that dominance with new architectures purpose-built for AI.

Of these challengers, Groq has been one of the most vocal about targeting inference as well as training. CEO Jonathan Ross has boldly predicted that most AI startups will be using Groq’s low-precision tensor streaming processors for inference by the end of 2024.

Meta’s release of LLaMA 3, described as one of the most capable open source language models available, provides a high-profile opportunity for Groq to showcase its hardware’s inference capabilities. The model, which Meta claims is on par with the best closed-source offerings, is likely to be widely used for benchmarking and deployed in many AI applications.

If Groq’s hardware can run LLaMA 3 significantly faster and more efficiently than mainstream alternatives, it would bolster the startup’s claims and potentially accelerate the adoption of its technology. Groq recently launched a new business unit to make its chips more easily accessible to customers through a cloud service and partnerships.

The combination of powerful open models like LLaMA and highly efficient “AI-first” inference hardware like Groq’s could make advanced language AI more cost-effective and accessible to a wider range of businesses and developers. But Nvidia won’t cede its lead easily, and other challengers are also waiting in the wings.

What’s certain is that the race is on to build infrastructure that can keep up with the explosive progress in AI model development and scale the technology to meet the demands of a rapidly expanding range of applications. Near real-time AI inference at affordable cost could open up transformative possibilities in areas like e-commerce, education, finance, healthcare and more.

As one user reacted to Groq’s LLaMA 3 benchmark claim: “speed + low_cost + quality = it doesn’t make sense to use anything else [right now].” The coming months will reveal whether that bold equation plays out, but it’s clear that AI’s hardware foundations are anything but settled as a new wave of architectures challenges the status quo.