
SambaNova launches fastest deployment of DeepSeek-R1 model

Wed, 19th Feb 2025

SambaNova has announced what it calls the fastest deployment of the DeepSeek-R1 671B large language model, claiming significant gains in both speed and efficiency.

According to SambaNova, the deployment achieves an unprecedented 198 tokens per second per user, and the company expects to scale that capacity 100-fold by the end of the year. The hardware footprint has also shrunk dramatically, from 40 racks (320 GPUs) to a single rack of 16 chips.
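Those figures are easy to sanity-check. The sketch below, a minimal back-of-envelope calculation in Python using only the numbers quoted in this article, works out the implied reduction factors (the per-rack GPU count is derived, not independently stated):

```python
# Back-of-envelope check of the consolidation claim, using the article's figures.
gpu_racks, gpu_count = 40, 320   # reported GPU deployment
rdu_racks, rdu_count = 1, 16     # reported SambaNova deployment

gpus_per_rack = gpu_count / gpu_racks     # implied 8 GPUs per rack
rack_reduction = gpu_racks / rdu_racks    # 40x fewer racks
chip_reduction = gpu_count / rdu_count    # 20x fewer chips

print(f"{gpus_per_rack:.0f} GPUs/rack; {rack_reduction:.0f}x rack reduction; "
      f"{chip_reduction:.0f}x chip reduction")
```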

Rodrigo Liang, CEO of SambaNova, stated, "SambaNova is the first company to offer inference at scale with DeepSeek-R1, the full size model, fast and efficiently to developers." DeepSeek on SambaNova is currently hosted in US data centres, ensuring private and secure operations.
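For developers, access to a hosted model of this kind typically looks like a standard chat-completions call. The sketch below assumes an OpenAI-compatible endpoint; the base URL, model identifier, and environment variable are illustrative assumptions rather than details confirmed by this article, so consult SambaNova's documentation for the actual values.

```python
# Hypothetical sketch: querying DeepSeek-R1 on SambaNova's cloud via an
# OpenAI-compatible client. The endpoint URL, model name, and env var are
# assumptions for illustration only.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.sambanova.ai/v1",   # assumed endpoint
    api_key=os.environ["SAMBANOVA_API_KEY"],  # hypothetical credential
)

response = client.chat.completions.create(
    model="DeepSeek-R1",  # assumed model identifier
    messages=[{"role": "user", "content": "Summarise mixture-of-experts models."}],
)
print(response.choices[0].message.content)
```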

The company emphasises that DeepSeek-R1 has lowered AI training costs tenfold, but that widespread adoption was previously hindered by high inference costs and inefficiency. With SambaNova's deployment, it says, those barriers are removed, enabling real-time, cost-effective inference at scale.

Liang added, "Powered by the SN40L RDU chip, SambaNova is the fastest platform running DeepSeek at 198 tokens per second per user. This will increase to 5X faster than the latest GPU speed on a single rack — and by year end, we will offer 100X capacity for DeepSeek-R1."

Dr. Andrew Ng, a noted figure in AI, underlined the significance of the development: "Being able to run the full DeepSeek-R1 671B model — not a distilled version — at SambaNova's blazingly fast speed is a game changer for developers. Reasoning models like R1 need to generate a lot of reasoning tokens to come up with a superior output, which makes them take longer than traditional LLMs. This makes speeding them up especially important."
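Ng's point is simple to quantify: a reasoning model's end-to-end latency scales directly with how fast it can emit its intermediate tokens. The minimal illustration below assumes a 2,000-token reasoning trace and a 40 tokens-per-second baseline purely for comparison; only the 198 tokens-per-second figure comes from this article.

```python
# Illustrative latency arithmetic for a reasoning model.
# The 2,000-token trace and 40 tok/s baseline are assumed figures;
# 198 tok/s is the per-user speed reported in the article.
reasoning_tokens = 2_000

for label, tok_per_sec in [("reported SambaNova speed", 198),
                           ("assumed slower baseline", 40)]:
    seconds = reasoning_tokens / tok_per_sec
    print(f"{label}: {tok_per_sec} tok/s -> {seconds:.1f}s for "
          f"{reasoning_tokens} reasoning tokens")
```

At those assumed numbers, the same reasoning trace takes roughly 10 seconds instead of 50, which is the gap Ng describes between a usable and a sluggish developer experience.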

Independent verification of SambaNova's claims comes from Artificial Analysis, which benchmarked the cloud deployment. Co-Founder George Cameron remarked, "Artificial Analysis has independently benchmarked SambaNova's cloud deployment of the full 671 billion parameter DeepSeek-R1 Mixture of Experts model at over 195 output tokens/s, the fastest output speed we have ever measured for DeepSeek-R1. High output speeds are particularly important for reasoning models, as these models use reasoning output tokens to improve the quality of their responses. SambaNova's high output speeds will support the use of reasoning models in latency sensitive use cases."

The company says it has tackled the central challenge for DeepSeek-R1: inference at scale. GPU-based inference had been inefficient enough to constrain the model's application; SambaNova's solution cuts the hardware requirement to a single rack, compared with the 40 racks needed by GPUs.

Liang said, "DeepSeek-R1 is one of the most advanced frontier AI models available, but its full potential has been limited by the inefficiency of GPUs. That changes today. We're bringing the next major breakthrough — collapsing inference costs and reducing hardware requirements from 40 racks to just one — to offer DeepSeek-R1 at the fastest speeds, efficiently."

Robert Rizk, CEO of Blackbox AI, highlighted the benefits of the partnership: "More than 10 million users and engineering teams at Fortune 500 companies rely on Blackbox AI to transform how they write code and build products. Our partnership with SambaNova plays a critical role in accelerating our autonomous coding agent workflows. SambaNova's chip capabilities are unmatched for serving the full R1 671B model, which provides much better accuracy than any of the distilled versions. We couldn't ask for a better partner to work with to serve millions of users."

Sumti Jairath, Chief Architect at SambaNova, elaborated on the infrastructure's capabilities: "DeepSeek-R1 is the perfect match for SambaNova's three-tier memory architecture. With 671 billion parameters, R1 is the largest open source large language model released to date, which means it needs a lot of memory to run. GPUs are memory constrained, but SambaNova's unique dataflow architecture means we can run the model efficiently to achieve 20,000 tokens/s of total rack throughput in the near future — unprecedented efficiency when compared to GPUs due to their inherent memory and data communication bottlenecks."
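Jairath's two throughput figures relate in a straightforward way: total rack throughput divided by per-user decode speed bounds how many full-speed streams a rack can serve at once. Both numbers in the sketch below come from this article; only the division is ours.

```python
# Relating the article's two throughput figures.
rack_throughput = 20_000  # tokens/s, projected total per rack (near future)
per_user_speed = 198      # tokens/s, reported per-user decode speed

concurrent_streams = rack_throughput / per_user_speed
print(f"~{concurrent_streams:.0f} concurrent full-speed user streams per rack")
```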

SambaNova continues to expand its capabilities and anticipates that, by year end, it will have grown its global capacity for DeepSeek-R1 more than 100-fold, positioning its technology as a leading platform for enterprise reasoning models.
