Evaluating Language Models in the Connections Game
A study on large language models' performance in word grouping challenges.
― 6 min read
Word games challenge our thinking and language skills. One such game is Connections, created by the New York Times. It asks players to group words into categories based on shared traits. The game has grown popular since its launch in June 2023, attracting casual players and word puzzle enthusiasts alike.
In this study, we look at how well large language models (LLMs), which are advanced artificial intelligence systems, perform in this game compared to human players. We collected data from 438 Connections games to compare the performance of LLMs with that of both novice (new) and expert (regular) human players.
What is the Connections Game?
Connections presents a grid of 16 words and challenges players to find four distinct groups of four words each. The words in each group must have something in common, such as their meaning or usage. Difficulty varies within a single puzzle: some categories are straightforward, while others rest on connections that are not immediately obvious, such as words that carry multiple meanings.
Players must think creatively and draw on different kinds of knowledge to succeed. Some words are tricky because they seem to fit one group but actually belong to another; these are called red herrings. For instance, the words “Likes”, “Followers”, “Shares”, and “Insult” might at first appear to form a social media category, but finding the true groupings requires deeper thinking.
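To make the setup concrete, here is a minimal sketch in Python of how a board and a guess check could be represented. The four categories and their words are invented for illustration; they are not from any real NYT puzzle.

```python
# A minimal sketch of a Connections board and a guess checker.
# All categories and words below are invented for illustration.
SOLUTION: dict[str, frozenset[str]] = {
    "Social media actions": frozenset({"Like", "Follow", "Share", "Post"}),
    "Words before 'dog'":   frozenset({"Hot", "Watch", "Under", "Corn"}),
    "Shades of blue":       frozenset({"Navy", "Teal", "Sky", "Royal"}),
    "Card games":           frozenset({"Bridge", "Hearts", "Spades", "War"}),
}

def board(solution: dict[str, frozenset[str]]) -> list[str]:
    """The 16 words shown to the player, with the grouping hidden."""
    return sorted(w for group in solution.values() for w in group)

def check_guess(guess: set[str], solution: dict[str, frozenset[str]]) -> str | None:
    """Return the category name if the guess exactly matches one of the
    hidden groups of four, otherwise None."""
    for name, words in solution.items():
        if guess == words:
            return name
    return None

print(check_guess({"Hot", "Watch", "Under", "Corn"}, SOLUTION))  # Words before 'dog'
```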
Evaluating LLMs
The aim of this research is to evaluate how well LLMs handle the abstract reasoning needed to play Connections. We tested state-of-the-art LLMs, including Gemini 1.5 Pro, Claude 3 Opus, Claude 3.5 Sonnet, GPT-4o, and Llama 3 70B, and measured their performance against the scores of human players.
Although these models are designed to process language effectively, we found that even the best-performing LLM, Claude 3.5 Sonnet, fully solved only 18% of the games. In contrast, expert human players solved many more games correctly. This shows that while LLMs handle many language tasks well, they still struggle with tasks that require human-like abstract reasoning.
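As an illustration of the kind of evaluation loop involved, the sketch below prompts a model to split 16 words into four groups and parses its reply. The `ask_model` callable, the prompt wording, and the parsing are our own simplifications, not the paper's exact protocol.

```python
import re
from typing import Callable

def play_game(ask_model: Callable[[str], str], words: list[str]) -> list[set[str]]:
    """Ask a model to split the 16 board words into four groups of four.

    `ask_model` is a placeholder for any text-in, text-out model call.
    """
    prompt = (
        "Group the following 16 words into 4 categories of 4 words each. "
        "Output one line per group, words separated by commas.\n"
        + ", ".join(words)
    )
    reply = ask_model(prompt)
    groups: list[set[str]] = []
    for line in reply.strip().splitlines():
        members = {w.strip() for w in re.split(r"[,;]", line) if w.strip()}
        # Keep only well-formed groups: four words, all taken from the board.
        if len(members) == 4 and members <= set(words):
            groups.append(members)
    return groups[:4]
```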
Types of Knowledge Needed to Play
Successful players need to use different kinds of knowledge to categorize words in Connections. We broke down the types of knowledge required into several categories:
Semantic Knowledge: This involves understanding the meanings of words and how they relate to each other. Players must know about synonyms, general terms and their specific instances (hypernyms and hyponyms), and words with multiple meanings.
Associative Knowledge: This entails recognizing connections between words that may not be directly related by their definitions. Players might need to group words based on common themes or connotations.
Encyclopedic Knowledge: Some words require knowledge beyond simple definitions; players must understand references to real-world entities, events, or concepts. For instance, knowing that “Jack Black” refers to an actor and “Jack Frost” is a character from folklore is crucial.
Multiword Expressions: Players often have to recognize that the words in a group each form part of a common phrase, for example words that can all precede or follow the same word. Spotting these requires familiarity with everyday language usage.
Linguistic Knowledge: This relates to the rules and patterns of language itself, such as grammar, sound patterns, or word formation.
Combined Knowledge: Some of the toughest categories require combining several of the above types, for example knowledge of both word form and meaning, making them particularly hard to sort; a simple encoding of this taxonomy follows below.
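For analysis, the taxonomy above can be encoded as labels attached to each puzzle category. The enum below is our own encoding of the list, not code from the paper, and the annotated category is hypothetical.

```python
from enum import Enum, auto

class KnowledgeType(Enum):
    SEMANTIC = auto()      # meanings: synonyms, general vs. specific terms, polysemy
    ASSOCIATIVE = auto()   # shared themes or connotations
    ENCYCLOPEDIC = auto()  # real-world entities, events, and concepts
    MULTIWORD = auto()     # words that form parts of common phrases
    LINGUISTIC = auto()    # grammar, sound patterns, word formation
    COMBINED = auto()      # mixes of the above, e.g. word form plus meaning

# Example annotation for a hypothetical category:
annotations = {"Words before 'dog'": KnowledgeType.MULTIWORD}
```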
Human vs. LLM Performance
To better understand the effectiveness of LLMs, we compared their performance with that of novice and expert human players. We recruited groups of volunteers to play the games, asking them to categorize the same words that were given to the LLMs.
Novice Players
Novice human players performed slightly better than Claude 3.5 Sonnet at solving Connections games. Their average unweighted clustering score was higher, meaning they grouped words more successfully than the model did.
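This summary does not spell out the metric, so the following is one plausible reading of an "unweighted clustering score": the fraction of the four hidden groups that a player reproduces exactly. Treat it as an assumption rather than the paper's exact definition.

```python
def unweighted_clustering_score(predicted: list[set[str]],
                                gold: list[set[str]]) -> float:
    """Fraction of gold groups reproduced exactly (0.0 to 1.0).

    A weighted variant could additionally weight each group by its
    difficulty tier; this sketch treats all four groups equally.
    """
    correct = sum(1 for group in gold if group in predicted)
    return correct / len(gold)

# One of the four groups matched exactly -> score 0.25
gold = [{"Hot", "Watch", "Under", "Corn"}, {"Navy", "Teal", "Sky", "Royal"},
        {"Bridge", "Hearts", "Spades", "War"}, {"Like", "Follow", "Share", "Post"}]
predicted = [{"Hot", "Watch", "Under", "Corn"}, {"Navy", "Teal", "Sky", "Like"},
             {"Bridge", "Hearts", "Spades", "Royal"}, {"Follow", "Share", "Post", "War"}]
print(unweighted_clustering_score(predicted, gold))  # 0.25
```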
Expert Players
Expert players significantly outperformed both novice humans and LLMs. They consistently achieved higher scores, demonstrating that deeper familiarity with the game and its challenges greatly enhances performance. For instance, expert players fully solved over 60% of the games, while Claude 3.5 Sonnet managed only 18%.
Challenges Faced by LLMs
Our analysis revealed that LLMs struggle with particular types of reasoning. They perform well when basic semantic knowledge suffices, but find it difficult to recognize multiword expressions and categories that require combined knowledge. This indicates that while they process the meanings of individual words efficiently, grasping broader context and deeper relations is harder for them.
The Role of Red Herrings
Connections includes red herrings that add an extra layer of difficulty. These are words that might seem to fit into a category but do not. For example, if a group of words seems to relate to Christmas but one word belongs to a different context, separating them requires careful thought.
Both LLMs and human players made more mistakes in categories where red herrings were present, which suggests that misdirection can significantly hinder performance. LLMs, in particular, often struggled to find the right connections when red herrings were included.
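As a toy illustration of why red herrings mislead (with entirely invented categories and words), a red herring can be modeled as a word that plausibly fits more than one candidate group:

```python
from collections import Counter

# Invented candidate groupings a solver might consider mid-game.
plausible = {
    "Shades of blue":        {"Navy", "Teal", "Sky", "Royal", "Baby"},
    "Military branches":     {"Navy", "Army", "Marines", "Air Force"},
    "Words before 'shower'": {"Baby", "Bridal", "April", "Cold"},
}

counts = Counter(word for words in plausible.values() for word in words)
red_herrings = sorted(w for w, n in counts.items() if n > 1)
print(red_herrings)  # ['Baby', 'Navy']: each fits two candidate groups
```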
Reasoning and Justifications
As part of our evaluation, we also looked at how well LLMs could explain their reasoning. Even when their groupings were correct, they sometimes gave incorrect or unclear justifications for their choices.
For example, an LLM might group words correctly but fail to articulate why they fit together in its explanation. This gap highlights the importance of understanding not just how to categorize words but also why those categorizations make sense.
Future Directions
To better prepare LLMs for tasks like Connections, we suggest more focused training. Strategies such as learning to spot red herrings, words that appear to fit a group but do not, and receiving real-time feedback on groupings could improve their performance.
Additionally, training on synthetic data that mimics the game could help bridge the gap between LLMs and human experts. Simulating the game environment and letting models practice in it could drive further improvements.
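As a sketch of what synthetic data generation could look like (our own construction, assuming a hand-built bank of labeled categories), one could sample four categories and four words from each to mint new boards:

```python
import random

# Hypothetical category bank: name -> more than four candidate words.
CATEGORY_BANK = {
    "Shades of blue":     ["Navy", "Teal", "Sky", "Royal", "Baby"],
    "Card games":         ["Bridge", "Hearts", "Spades", "War", "Rummy"],
    "Words before 'dog'": ["Hot", "Watch", "Under", "Corn", "Bull"],
    "___ board":          ["Key", "Surf", "Card", "Bill", "Chalk"],
}

def make_puzzle(bank: dict[str, list[str]], rng: random.Random) -> dict[str, list[str]]:
    """Sample four categories and four words from each; reject draws where
    the same word lands in two groups. A richer bank with overlapping
    candidate words would naturally produce red herrings."""
    while True:
        names = rng.sample(sorted(bank), 4)
        puzzle = {name: rng.sample(bank[name], 4) for name in names}
        words = [w for group in puzzle.values() for w in group]
        if len(set(words)) == 16:  # all 16 board words must be distinct
            return puzzle

print(make_puzzle(CATEGORY_BANK, random.Random(0)))
```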
Conclusion
In evaluating LLMs against human players using the New York Times Connections game, we find that while these models are powerful tools for processing language, their abstract reasoning capacities are still lacking. The depth of knowledge and the different reasoning types required to excel in the game showcase areas for improvement.
With more training and better data, it is possible that LLMs could enhance their abilities in abstract reasoning tasks. However, as of now, expert human players significantly outperform LLMs, showing that understanding and reasoning remain complex challenges for artificial intelligence.
Title: Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
Abstract: The New York Times Connections game has emerged as a popular and challenging pursuit for word puzzle enthusiasts. We collect 438 Connections games to evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our results show that even the best-performing LLM, Claude 3.5 Sonnet, which has otherwise shown impressive reasoning abilities on a wide variety of benchmarks, can only fully solve 18% of the games. Novice and expert players perform better than Claude 3.5 Sonnet, with expert human players significantly outperforming it. We create a taxonomy of the knowledge types required to successfully cluster and categorize words in the Connections game. We find that while LLMs perform relatively well on categorizing words based on semantic relations, they struggle with other types of knowledge such as Encyclopedic Knowledge, Multiword Expressions or knowledge that combines both Word Form and Meaning. Our results establish the New York Times Connections game as a challenging benchmark for evaluating abstract reasoning capabilities in AI systems.
Authors: Prisha Samadarshi, Mariam Mustafa, Anushka Kulkarni, Raven Rothkopf, Tuhin Chakrabarty, Smaranda Muresan
Last Update: 2024-10-13
Language: English
Source URL: https://arxiv.org/abs/2406.11012
Source PDF: https://arxiv.org/pdf/2406.11012
Licence: https://creativecommons.org/licenses/by/4.0/