A new study by researchers at the Georgia Institute of Technology has found that large language models (LLMs) exhibit significant bias towards entities and concepts associated with Western culture, even when prompted in Arabic or trained solely on Arabic data.
The findings, published on arXiv, raise concerns about the cultural fairness and appropriateness of these powerful AI systems as they are deployed globally.
“We show that multilingual and Arabic monolingual [language models] exhibit bias towards entities associated with Western culture,” the researchers wrote in their paper, titled “Having Beer after Prayer? Measuring Cultural Bias in Large Language Models.”
The study sheds light on the challenges LLMs face in grasping cultural nuances and adapting to specific cultural contexts, despite advancements in their multilingual capabilities.
Potential harms of cultural bias in LLMs
The researchers’ findings raise concerns about the impact of cultural biases on users from non-Western cultures who interact with applications powered by LLMs. “Since LLMs are likely to have increasing impact through many new applications in the coming years, it is difficult to predict all the potential harms that might be caused by this type of cultural bias,” said Alan Ritter, one of the study’s authors, in an interview with VentureBeat.
Ritter pointed out that current LLM outputs perpetuate cultural stereotypes. “When prompted to generate fictional stories about individuals with Arab names, language models tend to associate Arab male names with poverty and traditionalism. For instance, GPT-4 is more likely to select adjectives such as ‘headstrong’, ‘poor’, or ‘modest.’ In contrast, adjectives such as ‘wealthy’, ‘popular’, and ‘unique’ are more common in stories generated about individuals with Western names,” he explained.
Moreover, the study found that current LLMs perform worse for individuals from non-Western cultures. “In the case of sentiment analysis, LLMs also make more false-negative predictions on sentences containing Arab entities, suggesting more false association of Arab entities with negative sentiment,” Ritter added.
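To make the sentiment-analysis comparison concrete, here is a minimal sketch of how a per-group rate of false negative-sentiment predictions might be computed. It is not the researchers' evaluation code. The function name, the label strings, and the toy records are all hypothetical and invented purely for illustration, and the metric definition (the share of non-negative sentences that a model labels as negative) is an assumption based on Ritter's description.

```python
from collections import defaultdict


def false_negative_sentiment_rate(records):
    """Return, per group, the share of non-negative sentences predicted negative.

    `records` is an iterable of (group, gold_label, predicted_label) tuples.
    The tuple layout and label names are assumptions made for this sketch.
    """
    wrong = defaultdict(int)   # non-negative sentences predicted as negative
    total = defaultdict(int)   # all non-negative sentences
    for group, gold, predicted in records:
        if gold != "negative":
            total[group] += 1
            if predicted == "negative":
                wrong[group] += 1
    return {group: wrong[group] / total[group] for group in total}


# Made-up toy predictions, NOT results from the study.
toy_records = [
    ("arab", "positive", "negative"),
    ("arab", "positive", "positive"),
    ("arab", "neutral", "negative"),
    ("arab", "negative", "negative"),
    ("western", "positive", "positive"),
    ("western", "positive", "negative"),
    ("western", "neutral", "neutral"),
    ("western", "positive", "positive"),
]

rates = false_negative_sentiment_rate(toy_records)
for group, rate in rates.items():
    print(f"{group}: {rate:.2f}")
```

A gap between the two rates on otherwise comparable sentences would be the kind of signal Ritter describes, although a real evaluation would need far more data and carefully matched sentences.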
Wei Xu, the lead researcher and author of the study, emphasized the potential consequences of these biases. “These cultural biases not only may harm users from non-Western cultures, but also impact the model’s accuracy in performing tasks and decrease users’ trust in the technology,” she said.
Introducing CAMeL: A novel benchmark for assessing cultural biases
To systematically assess cultural biases, the team introduced CAMeL (Cultural Appropriateness Measure Set for LMs), a novel benchmark dataset consisting of over 20,000 culturally relevant entities spanning eight categories including person names, food dishes, clothing items and religious sites. The entities were curated to enable the contrast of Arab and Western cultures.
“CAMeL provides a foundation for measuring cultural biases in LMs through both extrinsic and intrinsic evaluations,” the research team wrote in the paper. Using CAMeL, the researchers assessed the cross-cultural performance of 12 different language models, including GPT-4, on a range of tasks such as story generation, named entity recognition (NER), and sentiment analysis.
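One way to picture an entity-contrast evaluation of this kind is to fill the same prompt template with entities from each culture and compare how a model handles the resulting prompts. The sketch below only builds the prompts. The template, the entity lists, and the function name are hypothetical stand-ins, not items or code from CAMeL, and the step of sending prompts to a model is deliberately left out.

```python
def build_contrast_prompts(template, entities_by_culture):
    """Fill one template with every entity, tagging each prompt by culture.

    `template` must contain an `{entity}` placeholder. Both arguments are
    illustrative assumptions, not part of the CAMeL release.
    """
    prompts = []
    for culture, entities in entities_by_culture.items():
        for entity in entities:
            prompts.append(
                {
                    "culture": culture,
                    "entity": entity,
                    "prompt": template.format(entity=entity),
                }
            )
    return prompts


# Hypothetical example entities in a "food dishes" style category.
example_entities = {
    "arab": ["kunafa", "mansaf"],
    "western": ["lasagna", "apple pie"],
}

prompts = build_contrast_prompts(
    "For dinner tonight, my grandmother is making {entity}.",
    example_entities,
)
for item in prompts:
    print(item["culture"], "|", item["prompt"])
```

Because every prompt shares the same surrounding context, differences in a model's outputs can be attributed more plausibly to the entity itself than to the wording of the sentence.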
Ritter envisions that the CAMeL benchmark could be used to quickly test LLMs for cultural biases and identify gaps where more effort is needed by developers of models to reduce these problems. “One limitation is that CAMeL only tests Arab cultural biases, but we are planning to extend this to more cultures in the future,” he added.
The path forward: Building culturally-aware AI systems
To reduce bias for different cultures, Ritter suggests that LLM developers will need to hire data labelers from many different cultures during the fine-tuning process, in which LLMs are aligned with human preferences using labeled data. “This will be a complex and expensive process, but is very important to make sure people benefit equally from technological advances due to LLMs, and some cultures are not left behind,” he emphasized.
Xu highlighted an interesting finding from their paper, noting that one of the potential causes of cultural biases in LLMs is the heavy use of Wikipedia data in pre-training. “Although Wikipedia is created by editors all around the world, it happens that more Western cultural concepts are getting translated into non-Western languages rather than the other way around,” she explained. “Interesting technical approaches could involve better data mix in pre-training, better alignment with humans for cultural sensitivity, personalization, model unlearning, or relearning for cultural adaptation.”
Ritter also pointed out an additional challenge in adapting LLMs to cultures with less of a presence on the internet. “The amount of raw text available to pre-train language models may be limited. In this case, important cultural knowledge may be missing from the LLMs to begin with, and simply aligning them with the values of those cultures using standard methods may not completely solve the problem. Creative solutions are needed to come up with new ways to inject cultural knowledge into LLMs to make them more helpful for individuals in these cultures,” he said.
The findings underscore the need for a collaborative effort among researchers, AI developers, and policymakers to address the cultural challenges posed by LLMs. “We look at this as a new research opportunity for the cultural adaptation of LLMs in both training and deployment,” Xu said. “This is also a good opportunity for companies to think about localization of LLMs for different markets.”
By prioritizing cultural fairness and investing in the development of culturally aware AI systems, we can harness the power of these technologies to promote global understanding and foster more inclusive digital experiences for users worldwide. As Xu concluded, “We are excited to lay one of the first stones in these directions and look forward to seeing our dataset and similar datasets created using our proposed method to be routinely used in evaluating and training LLMs to ensure they have less favoritism towards one culture over the other.”