Artificial intelligence startup Galileo released a comprehensive benchmark on Monday revealing that open-source language models are rapidly closing the performance gap with their proprietary counterparts. This shift could reshape the AI landscape, potentially democratizing advanced AI capabilities and accelerating innovation across industries.
The second annual Hallucination Index from Galileo evaluated 22 leading large language models on their tendency to generate inaccurate information. While closed-source models still lead overall, the margin has narrowed significantly in just eight months.
“The huge improvements in open-source models was absolutely incredible to see,” said Vikram Chatterji, co-founder and CEO of Galileo, in an interview with VentureBeat. “Back then [in October 2023] the first five or six were all closed source API models, mostly OpenAI models. Versus now, open source has been closing the gap.”
This trend could lower barriers to entry for startups and researchers while pressuring established players to innovate more rapidly or risk losing their edge.
The new AI royalty: Anthropic’s Claude 3.5 Sonnet dethrones OpenAI
Anthropic’s Claude 3.5 Sonnet topped the index as the best performing model across all tasks, outpacing offerings from OpenAI that dominated last year’s rankings. This shift indicates a changing of the guard in the AI arms race, with newer entrants challenging the established leaders.
“We were extremely impressed by Anthropic’s latest set of models,” Chatterji said. “Not only was Sonnet able to perform excellently across short, medium, and long context windows, scoring an average of 0.97, 1, and 1 respectively across tasks, but the model’s support of up to a 200k context window suggests it could support even larger datasets than we tested.”
The index also highlighted the importance of considering cost-effectiveness alongside raw performance. Google’s Gemini 1.5 Flash emerged as the most efficient option, delivering strong results at a fraction of the price of top models.
“The dollar per million prompt tokens cost for Flash was $0.35, but it was $3 for Sonnet,” Chatterji told VentureBeat. “When you look at the output, dollars per million response token cost, it’s about $1 for Flash, but it’s $15 for Sonnet. So now anyone who’s using Sonnet immediately has to have money in the bank, which is like, at least like 15 to 20x more, whereas literally Flash is not that much worse at all.”
This cost disparity could prove crucial for businesses looking to deploy AI at scale, potentially driving adoption of more efficient models even if they don’t top performance charts.
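To make the gap concrete, here is a rough back-of-the-envelope sketch in Python that uses only the per-million-token prices Chatterji quoted. The workload size (100 million prompt tokens and 10 million response tokens), the `PRICES` table layout, and the `workload_cost` helper are illustrative assumptions, not figures or tooling from Galileo's index. Actual provider pricing may differ or change.

```python
# Prices in USD per million tokens, as quoted in the article.
# Flash's response price is the "about $1" figure from the quote.
PRICES = {
    "Gemini 1.5 Flash": {"prompt": 0.35, "response": 1.00},
    "Claude 3.5 Sonnet": {"prompt": 3.00, "response": 15.00},
}


def workload_cost(model: str, prompt_tokens: int, response_tokens: int) -> float:
    """Return the USD cost of a workload for the given model (hypothetical helper)."""
    price = PRICES[model]
    total = prompt_tokens * price["prompt"] + response_tokens * price["response"]
    return total / 1_000_000


# Hypothetical monthly workload, chosen only for illustration.
PROMPT_TOKENS = 100_000_000
RESPONSE_TOKENS = 10_000_000

flash_cost = workload_cost("Gemini 1.5 Flash", PROMPT_TOKENS, RESPONSE_TOKENS)
sonnet_cost = workload_cost("Claude 3.5 Sonnet", PROMPT_TOKENS, RESPONSE_TOKENS)

# Per-token price ratios, independent of workload size.
prompt_ratio = PRICES["Claude 3.5 Sonnet"]["prompt"] / PRICES["Gemini 1.5 Flash"]["prompt"]
response_ratio = PRICES["Claude 3.5 Sonnet"]["response"] / PRICES["Gemini 1.5 Flash"]["response"]

print(f"Flash:  ${flash_cost:,.2f}")
print(f"Sonnet: ${sonnet_cost:,.2f}")
print(f"Prompt price ratio:   {prompt_ratio:.1f}x")
print(f"Response price ratio: {response_ratio:.1f}x")
print(f"Blended ratio for this workload: {sonnet_cost / flash_cost:.1f}x")
```

Under these assumptions the workload costs about $45 on Flash and about $450 on Sonnet, a tenfold difference. The blended multiple depends on the mix of prompt and response tokens: at the quoted prices it ranges from roughly 8.6x for prompt-heavy workloads to 15x for response-heavy ones.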
Global competition heats up: Alibaba’s open-source model makes waves
Alibaba’s Qwen2-72B-Instruct performed best among open-source models, scoring highly on short and medium-length inputs. This success signals a broader trend of non-U.S. companies making significant strides in AI development, challenging the notion of American dominance in the field.
Chatterji sees this as part of a larger democratization of AI technology. “What I see this unlocking—using Llama 3, using Qwen—teams across the world, across different economic strata, can just start building really incredible products,” he said.
He added that these models will likely be optimized for edge and mobile devices, leading to “incredible mobile apps and web apps and apps on the edge being built out with these open source models.”
The index introduces a new focus on how models handle different context lengths, from short snippets to long documents, reflecting the growing use of AI for tasks like summarizing lengthy reports or answering questions about extensive datasets. This approach provides a more nuanced view of model capabilities, essential for businesses considering AI deployment in various scenarios.
“We focused on breaking that down based on context length — small, medium, and large,” Chatterji told VentureBeat. “That and the other big piece here was cost versus performance. Because that’s very top of mind for people.”
The index also revealed that bigger isn’t always better when it comes to AI models. In some cases, smaller models outperformed their larger counterparts, suggesting that efficient design can sometimes trump sheer scale.
“The Gemini 1.5 Flash model was an absolute revelation for us because it outperformed larger models,” Chatterji said. “This suggests that if you have great model design efficiency, that can outweigh the scale.”
This finding could drive a shift in AI development, with companies focusing more on optimizing existing architectures rather than simply scaling up model size.
The AI crystal ball: Predicting the future of language models
Galileo’s findings could significantly impact enterprise AI adoption. As open-source models improve and become more cost-effective, companies may deploy powerful AI capabilities without relying on expensive proprietary services. This could lead to more widespread AI integration across industries, potentially boosting productivity and innovation.
The startup, which provides tools for monitoring and improving AI systems, is positioning itself as a key player in helping enterprises navigate the rapidly evolving landscape of language models. By offering regular, practical benchmarks, Galileo aims to become an essential resource for technical decision-makers.
“We want this to be something that our enterprise customers and our AI team users can just use as a powerful, ever-evolving resource for what’s the most efficient way to build out AI applications instead of just, you know, feeling through the dark and trying to figure it out,” Chatterji said.
As the AI arms race intensifies, with new models being released almost weekly, Galileo’s index offers a snapshot of an industry in flux. The company plans to update the benchmark quarterly, providing ongoing insight into the shifting balance between open-source and proprietary AI technologies.
Looking ahead, Chatterji anticipates further developments in the field. “We’re starting to see large models that are like operating systems for this very powerful reasoning,” he said. “And it’s going to become more and more generalizable over the course of the next maybe one to two years, as well as see the context lengths that they can support, especially on the open source side, will start increasing a lot more. Cost is going to go down quite a lot, just the laws of physics are going to kick in.”
He also predicts a rise in multimodal models and agent-based systems, which will require new evaluation frameworks and likely spur another round of innovation in the AI industry.
As businesses grapple with the rapid pace of AI advancement, tools like Galileo’s Hallucination Index will likely play an increasingly crucial role in informing decision-making and strategy. The democratization of AI capabilities, coupled with the growing importance of cost-efficiency, suggests a future where advanced AI is not just more powerful, but also more accessible to a wider range of organizations.
This evolving landscape presents both opportunities and challenges for businesses. While the availability of high-performing, cost-effective AI models could drive innovation and efficiency, it also requires careful consideration of which technologies to adopt and how to integrate them effectively.
As the line between open-source and proprietary AI continues to blur, companies will need to stay informed and agile, ready to adapt their strategies as the technology evolves. Galileo’s benchmark serves not just as a snapshot of the current state of AI, but as a roadmap for navigating the complex and rapidly changing world of artificial intelligence.