The GAIA benchmark: Next-gen AI faces off against real-world challenges

A new artificial intelligence benchmark called GAIA aims to evaluate whether chatbots like ChatGPT can demonstrate human-like reasoning and competence on everyday tasks. 

Created by researchers from Meta's FAIR and GenAI teams, Hugging Face and AutoGPT, the benchmark “proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency,” the researchers wrote in a paper published on arXiv.

The researchers said GAIA questions are “conceptually simple for humans yet challenging for most advanced AIs.” They tested the benchmark on human respondents and GPT-4, finding that humans scored 92 percent while GPT-4 with plugins scored only 15 percent.


“This notable performance disparity contrasts with the recent trend of LLMs [large language models] outperforming humans on tasks requiring professional skills in e.g. law or chemistry,” the paper states.

GAIA focuses on human-like competence, not expertise 

Rather than focusing on tasks difficult for humans, the researchers suggest benchmarks should target tasks that demonstrate an AI system has similar robustness to the average human.

The GAIA methodology led the researchers to devise 466 real-world questions with unambiguous answers. The answers to 300 of those questions are being held privately to power a public GAIA leaderboard, while the remaining 166 questions and their answers were released as a development set.
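Because every question has a single unambiguous answer, a submission can in principle be graded by simple answer matching. The following is a minimal sketch of that idea, not GAIA's official scorer: the function names, the normalization rules and the toy question IDs and answers are all assumptions made for illustration. Only the 466/300/166 split comes from the article.

```python
import re

# Split described in the article: 466 questions, 300 with private answers.
TOTAL_QUESTIONS = 466
HELD_OUT = 300
DEV_SET = TOTAL_QUESTIONS - HELD_OUT  # 166 questions released with answers


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and collapse whitespace.

    Hypothetical normalization; the real benchmark may use different rules.
    """
    text = answer.strip().lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def exact_match_accuracy(predictions: dict[str, str],
                         references: dict[str, str]) -> float:
    """Fraction of reference questions whose prediction matches exactly.

    A missing prediction counts as wrong.
    """
    if not references:
        return 0.0
    correct = sum(
        1
        for qid, ref in references.items()
        if normalize(predictions.get(qid, "")) == normalize(ref)
    )
    return correct / len(references)


# Toy data, invented for illustration only (not real GAIA questions).
references = {"q1": "blue", "q2": "12"}
predictions = {"q1": " Blue. ", "q2": "13"}

if __name__ == "__main__":
    print(f"dev set size: {DEV_SET}")
    print(f"accuracy: {exact_match_accuracy(predictions, references):.0%}")
```

Keeping the answers to most questions private means a leaderboard can run a scorer like this on its side, so submitters never see the reference answers they are graded against.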

“Solving GAIA would represent a milestone in AI research,” said lead author Grégoire Mialon of Meta AI. “We believe the successful resolution of GAIA would be an important milestone towards the next generation of AI systems.”


The human vs. AI performance gap

So far, the leading GAIA score belongs to GPT-4 with manually selected plugins, which the paper reports at 15 percent overall and no more than 30 percent even on the benchmark's easiest questions. The benchmark creators said a system that solves GAIA could be considered an artificial general intelligence within a reasonable timeframe.

“Tasks that are difficult for humans are not necessarily difficult for recent systems,” the paper states, critiquing the common practice of testing AIs on complex math, science and law exams. 

Instead, GAIA focuses on questions like, “Which city hosted the 2022 Eurovision Song Contest according to the official website?” and “How many images are there in the latest 2022 Lego Wikipedia article?”

“We posit that the advent of Artificial General Intelligence (AGI) hinges on a system’s capability to exhibit similar robustness as the average human does on such questions,” the researchers wrote.

GAIA could shape the future trajectory of AI 

The release of GAIA represents a new direction for AI research that could have broad implications. By focusing on human-like competence at everyday tasks rather than specialized expertise, GAIA pushes the field beyond narrower AI benchmarks.

If future systems can demonstrate human-level common sense, adaptability and reasoning as measured by GAIA, that would suggest they have achieved artificial general intelligence (AGI) in a practical sense. This could accelerate deployment of AI assistants, services and products.

However, the authors caution that today’s chatbots still have a long way to go to solve GAIA. Their performance shows current limitations in reasoning, tool use and handling diverse real-world situations.

As researchers rise to the GAIA challenge, their results will reveal progress in making AI systems more capable, general and trustworthy. But benchmarks like GAIA also lead to reflection on how to shape AI that benefits humanity.

Beyond driving technical advances, GAIA could help guide AI in a direction that emphasizes shared human values like empathy, creativity and ethical judgment.

The public GAIA benchmark leaderboard shows which LLM is currently performing best on this evaluation.