Evaluating Language Models in the Connections Game
A study on large language models' performance in word grouping challenges.
― 6 min read
Word games challenge our thinking and language skills. One such game is Connections, created by the New York Times. It asks players to group words into categories based on shared traits. The game has grown popular since its launch in June 2023, attracting casual players and word puzzle enthusiasts alike.
In this study, we look at how well large language models (LLMs), which are advanced artificial intelligence systems, perform in this game compared to human players. We collected data from 438 Connections games to compare the performance of LLMs with that of both novice (new) and expert (regular) human players.
What is the Connections Game?
Connections presents a grid of 16 words and challenges players to find four distinct groups of four words each. The words in each group must have something in common, such as their meaning or usage. Difficulty varies within a single puzzle: some categories are straightforward, while others rest on connections that are not immediately obvious, such as words that carry multiple meanings.
Players must think creatively and draw on different kinds of knowledge to succeed. Some words are tricky because they seem to fit one group but actually belong to another; these are called red herrings. For instance, the words “Likes”, “Followers”, “Shares”, and “Insult” might at first appear to form a social media category, but finding the true groupings requires deeper thinking.
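To make the setup concrete, here is a minimal sketch in Python of how a board and a guess check could be represented. The four categories and their words are invented for illustration; they are not from any real NYT puzzle.

```python
# A minimal sketch of a Connections board and a guess checker.
# All categories and words below are invented for illustration.
SOLUTION: dict[str, frozenset[str]] = {
    "Social media actions": frozenset({"Like", "Follow", "Share", "Post"}),
    "Words before 'dog'":   frozenset({"Hot", "Watch", "Under", "Corn"}),
    "Shades of blue":       frozenset({"Navy", "Teal", "Sky", "Royal"}),
    "Card games":           frozenset({"Bridge", "Hearts", "Spades", "War"}),
}

def board(solution: dict[str, frozenset[str]]) -> list[str]:
    """The 16 words shown to the player, with the grouping hidden."""
    return sorted(w for group in solution.values() for w in group)

def check_guess(guess: set[str], solution: dict[str, frozenset[str]]) -> str | None:
    """Return the category name if the guess exactly matches one of the
    hidden groups of four, otherwise None."""
    for name, words in solution.items():
        if guess == words:
            return name
    return None

print(check_guess({"Hot", "Watch", "Under", "Corn"}, SOLUTION))  # Words before 'dog'
```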
Evaluating LLMs
The aim of this research is to evaluate how well LLMs handle the abstract reasoning needed to play Connections. We tested state-of-the-art LLMs, including Gemini 1.5 Pro, Claude 3 Opus, Claude 3.5 Sonnet, GPT-4o, and Llama 3 70B, and measured their performance against the scores of human players.
Although these models are designed to process language effectively, we found that even the best-performing LLM, Claude 3.5 Sonnet, fully solved only 18% of the games. In contrast, expert human players solved many more games correctly. This shows that while LLMs handle many language tasks well, they still struggle with tasks that require human-like abstract reasoning.
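As an illustration of the kind of evaluation loop involved, the sketch below prompts a model to split 16 words into four groups and parses its reply. The `ask_model` callable, the prompt wording, and the parsing are our own simplifications, not the paper's exact protocol.

```python
import re
from typing import Callable

def play_game(ask_model: Callable[[str], str], words: list[str]) -> list[set[str]]:
    """Ask a model to split the 16 board words into four groups of four.

    `ask_model` is a placeholder for any text-in, text-out model call.
    """
    prompt = (
        "Group the following 16 words into 4 categories of 4 words each. "
        "Output one line per group, words separated by commas.\n"
        + ", ".join(words)
    )
    reply = ask_model(prompt)
    groups: list[set[str]] = []
    for line in reply.strip().splitlines():
        members = {w.strip() for w in re.split(r"[,;]", line) if w.strip()}
        # Keep only well-formed groups: four words, all taken from the board.
        if len(members) == 4 and members <= set(words):
            groups.append(members)
    return groups[:4]
```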
Types of Knowledge Needed to Play
Successful players need to use different kinds of knowledge to categorize words in Connections. We broke down the types of knowledge required into several categories:
Semantic Knowledge: This involves understanding the meanings of words and how they relate to each other. Players must know about synonyms, general terms and their specific instances (hypernyms and hyponyms), and words with multiple meanings.
Associative Knowledge: This entails recognizing connections between words that may not be directly related by their definitions. Players might need to group words based on common themes or connotations.
Encyclopedic Knowledge: Some words require knowledge beyond simple definitions; players must understand references to real-world entities, events, or concepts. For instance, knowing that “Jack Black” refers to an actor and “Jack Frost” is a character from folklore is crucial.
Multiword Expressions: Players often have to recognize that the words in a group each form part of a common phrase, for example words that can all precede or follow the same word. Spotting these requires familiarity with everyday language usage.
Linguistic Knowledge: This relates to the rules and patterns of language itself, such as grammar, sound patterns, or word formation.
Combined Knowledge: Some of the toughest categories require combining several of the above types, for example knowledge of both word form and meaning, making them particularly hard to sort; a simple encoding of this taxonomy follows below.
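For analysis, the taxonomy above can be encoded as labels attached to each puzzle category. The enum below is our own encoding of the list, not code from the paper, and the annotated category is hypothetical.

```python
from enum import Enum, auto

class KnowledgeType(Enum):
    SEMANTIC = auto()      # meanings: synonyms, general vs. specific terms, polysemy
    ASSOCIATIVE = auto()   # shared themes or connotations
    ENCYCLOPEDIC = auto()  # real-world entities, events, and concepts
    MULTIWORD = auto()     # words that form parts of common phrases
    LINGUISTIC = auto()    # grammar, sound patterns, word formation
    COMBINED = auto()      # mixes of the above, e.g. word form plus meaning

# Example annotation for a hypothetical category:
annotations = {"Words before 'dog'": KnowledgeType.MULTIWORD}
```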
Human vs. LLM Performance
To better understand the effectiveness of LLMs, we compared their performance with that of novice and expert human players. We recruited groups of volunteers to play the games, asking them to categorize the same words that were given to the LLMs.
Novice Players
Novice human players performed slightly better than Claude 3.5 Sonnet at solving Connections games. Their average unweighted clustering score was higher, meaning they grouped words more successfully than the model did.
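This summary does not spell out the metric, so the following is one plausible reading of an "unweighted clustering score": the fraction of the four hidden groups that a player reproduces exactly. Treat it as an assumption rather than the paper's exact definition.

```python
def unweighted_clustering_score(predicted: list[set[str]],
                                gold: list[set[str]]) -> float:
    """Fraction of gold groups reproduced exactly (0.0 to 1.0).

    A weighted variant could additionally weight each group by its
    difficulty tier; this sketch treats all four groups equally.
    """
    correct = sum(1 for group in gold if group in predicted)
    return correct / len(gold)

# One of the four groups matched exactly -> score 0.25
gold = [{"Hot", "Watch", "Under", "Corn"}, {"Navy", "Teal", "Sky", "Royal"},
        {"Bridge", "Hearts", "Spades", "War"}, {"Like", "Follow", "Share", "Post"}]
predicted = [{"Hot", "Watch", "Under", "Corn"}, {"Navy", "Teal", "Sky", "Like"},
             {"Bridge", "Hearts", "Spades", "Royal"}, {"Follow", "Share", "Post", "War"}]
print(unweighted_clustering_score(predicted, gold))  # 0.25
```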
Expert Players
Expert players significantly outperformed both novice humans and LLMs. They consistently achieved higher scores, demonstrating that deeper familiarity with the game and its challenges greatly enhances performance. For instance, expert players fully solved over 60% of the games, while Claude 3.5 Sonnet managed only 18%.
Challenges Faced by LLMs
Our analysis revealed that LLMs struggle with particular types of reasoning. They perform well when basic semantic knowledge suffices, but find it difficult to recognize multiword expressions and categories that require combined knowledge. This indicates that while they process the meanings of individual words efficiently, grasping broader context and deeper relations is harder for them.
The Role of Red Herrings
Connections includes red herrings that add an extra layer of difficulty. These are words that might seem to fit into a category but do not. For example, if a group of words seems to relate to Christmas but one word belongs to a different context, separating them requires careful thought.
Both LLMs and human players made more mistakes in categories where red herrings were present, which suggests that misdirection can significantly hinder performance. LLMs, in particular, often struggled to find the right connections when red herrings were included.
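As a toy illustration of why red herrings mislead (with entirely invented categories and words), a red herring can be modeled as a word that plausibly fits more than one candidate group:

```python
from collections import Counter

# Invented candidate groupings a solver might consider mid-game.
plausible = {
    "Shades of blue":        {"Navy", "Teal", "Sky", "Royal", "Baby"},
    "Military branches":     {"Navy", "Army", "Marines", "Air Force"},
    "Words before 'shower'": {"Baby", "Bridal", "April", "Cold"},
}

counts = Counter(word for words in plausible.values() for word in words)
red_herrings = sorted(w for w, n in counts.items() if n > 1)
print(red_herrings)  # ['Baby', 'Navy']: each fits two candidate groups
```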
Reasoning and Justifications
As part of our evaluation, we also looked at how well LLMs could explain their reasoning. Even when their groupings were correct, they sometimes gave incorrect or unclear justifications for their choices.
For example, an LLM might group words correctly but fail to articulate why they fit together in its explanation. This gap highlights the importance of understanding not just how to categorize words but also why those categorizations make sense.
Future Directions
To better prepare LLMs for tasks like Connections, we suggest more focused training. Strategies such as learning to spot red herrings, words that appear to fit a group but do not, and receiving real-time feedback on groupings could improve their performance.
Additionally, training on synthetic data that mimics the game could help bridge the gap between LLMs and human experts. Simulating the game environment and letting models practice in it could drive further improvements.
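As a sketch of what synthetic data generation could look like (our own construction, assuming a hand-built bank of labeled categories), one could sample four categories and four words from each to mint new boards:

```python
import random

# Hypothetical category bank: name -> more than four candidate words.
CATEGORY_BANK = {
    "Shades of blue":     ["Navy", "Teal", "Sky", "Royal", "Baby"],
    "Card games":         ["Bridge", "Hearts", "Spades", "War", "Rummy"],
    "Words before 'dog'": ["Hot", "Watch", "Under", "Corn", "Bull"],
    "___ board":          ["Key", "Surf", "Card", "Bill", "Chalk"],
}

def make_puzzle(bank: dict[str, list[str]], rng: random.Random) -> dict[str, list[str]]:
    """Sample four categories and four words from each; reject draws where
    the same word lands in two groups. A richer bank with overlapping
    candidate words would naturally produce red herrings."""
    while True:
        names = rng.sample(sorted(bank), 4)
        puzzle = {name: rng.sample(bank[name], 4) for name in names}
        words = [w for group in puzzle.values() for w in group]
        if len(set(words)) == 16:  # all 16 board words must be distinct
            return puzzle

print(make_puzzle(CATEGORY_BANK, random.Random(0)))
```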
Conclusion
In evaluating LLMs against human players using the New York Times Connections game, we find that while these models are powerful tools for processing language, their abstract reasoning capacities are still lacking. The depth of knowledge and the different reasoning types required to excel in the game showcase areas for improvement.
With more training and better data, it is possible that LLMs could enhance their abilities in abstract reasoning tasks. However, as of now, expert human players significantly outperform LLMs, showing that understanding and reasoning remain complex challenges for artificial intelligence.
Title: Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
Abstract: The New York Times Connections game has emerged as a popular and challenging pursuit for word puzzle enthusiasts. We collect 438 Connections games to evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our results show that even the best-performing LLM, Claude 3.5 Sonnet, which has otherwise shown impressive reasoning abilities on a wide variety of benchmarks, can only fully solve 18% of the games. Novice and expert players perform better than Claude 3.5 Sonnet, with expert human players significantly outperforming it. We create a taxonomy of the knowledge types required to successfully cluster and categorize words in the Connections game. We find that while LLMs perform relatively well on categorizing words based on semantic relations, they struggle with other types of knowledge such as Encyclopedic Knowledge, Multiword Expressions or knowledge that combines both Word Form and Meaning. Our results establish the New York Times Connections game as a challenging benchmark for evaluating abstract reasoning capabilities in AI systems.
Authors: Prisha Samadarshi, Mariam Mustafa, Anushka Kulkarni, Raven Rothkopf, Tuhin Chakrabarty, Smaranda Muresan
Last Update: 2024-10-13
Language: English
Source URL: https://arxiv.org/abs/2406.11012
Source PDF: https://arxiv.org/pdf/2406.11012
Licence: https://creativecommons.org/licenses/by/4.0/