Meta AI releases OpenEQA to spur ‘embodied intelligence’ in artificial agents

Meta AI researchers today released OpenEQA, a new open-source benchmark dataset that aims to measure an artificial intelligence system’s capacity for “embodied question answering” — developing an understanding of the real world that allows it to answer natural language questions about an environment.

The dataset, which Meta is positioning as a key benchmark for the nascent field of “embodied AI,” contains over 1,600 questions about more than 180 different real-world environments such as homes and offices. These span seven question categories that test an AI’s abilities in areas like object and attribute recognition, spatial and functional reasoning, and commonsense knowledge.
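To make that structure concrete, the sketch below shows what a single OpenEQA-style question record might look like in Python. The field names, category label, episode identifier, and answers are illustrative assumptions, not the dataset’s published schema (the actual format is documented at open-eqa.github.io).

```python
# Hypothetical sketch of one OpenEQA-style question record.
# Field names and values are illustrative assumptions, not the real schema.
example_record = {
    "episode_id": "home-042",  # the scanned real-world environment the question refers to
    "question": "How many chairs are around the dining table?",
    "category": "spatial reasoning",  # one of the seven question categories
    "human_answers": ["four", "4 chairs"],  # multiple human-written reference answers
}
```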

“Against this backdrop, we propose that Embodied Question Answering (EQA) is both a useful end-application as well as a means to evaluate an agent’s understanding of the world,” the researchers wrote in a paper released today. “Simply put, EQA is the task of understanding an environment well enough to answer questions about it in natural language.”

Even the most advanced AI models, like GPT-4V, struggled to match human performance on OpenEQA, Meta AI’s new benchmark for embodied question answering, according to the researchers’ study. (Credit: open-eqa.github.io)

Bringing together robotics, computer vision and language AI

The OpenEQA project sits at the intersection of some of the hottest areas in AI: computer vision, natural language processing, knowledge representation and robotics. The ultimate vision is to develop artificial agents that can perceive and interact with the world, communicate naturally with humans, and draw upon knowledge to assist us in our daily lives.

The researchers see two main near-term applications for this “embodied intelligence.” One is AI assistants embedded in augmented reality glasses or headsets that could draw on video and other sensor data to give a user an effectively photographic memory, able to answer questions like, “Where did I leave my keys?” The other is mobile robots that could autonomously explore an environment to find information, for example searching a home to answer the question “Do I have any coffee left?”

Creating a challenging benchmark

To create the OpenEQA dataset, the Meta researchers first collected video data and 3D scans of real-world environments. They then showed the videos to humans and asked them to generate questions they might want to ask an AI assistant that had access to that visual data.

The resulting 1,636 questions thoroughly test a wide range of perception and reasoning capabilities. For example, to answer the question, “How many chairs are around the dining table?” an AI would need to recognize the objects in the scene, understand the spatial concept of “around,” and count the relevant objects. Other questions require the AI to have basic knowledge about the uses and attributes of objects.

Each question also includes answers generated by multiple humans, to account for the fact that questions can be answered in many different ways. To measure the performance of AI agents, the researchers used large language models to automatically score how similar the AI-generated answer is to the human answers.
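As a rough illustration of that scoring step, the sketch below asks an LLM judge to rate how closely a model’s answer matches the human references. It assumes the OpenAI Python client and a chat-completions call; the prompt wording, the 1-to-5 scale, the judge model, and the example inputs are illustrative assumptions, not the researchers’ exact scoring protocol.

```python
# Minimal sketch of LLM-based answer scoring, in the spirit of the approach
# described above. Prompt, scale, model, and inputs are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def score_answer(question: str, human_answers: list[str], model_answer: str) -> int:
    """Ask an LLM judge how well a model's answer matches human reference answers."""
    prompt = (
        "You are grading an answer to a question about a real-world environment.\n"
        f"Question: {question}\n"
        f"Human reference answers: {'; '.join(human_answers)}\n"
        f"Candidate answer: {model_answer}\n"
        "Reply with a single integer from 1 (completely wrong) to 5 (matches the references)."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip())


# Example usage with a question drawn from the article (answers are made up for illustration):
score = score_answer(
    "Do I have any coffee left?",
    human_answers=["yes", "yes, there is a bag of coffee in the cupboard"],
    model_answer="Yes, there is still some coffee in the cupboard.",
)
print(score)
```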