Simple Science

Cutting edge science explained simply

# Computer Science # Computation and Language # Artificial Intelligence # Machine Learning

Assessing AI's Creative Problem-Solving Skills

New dataset highlights AI performance in creative tasks with distractions.

― 5 min read


AI vs. Human Creativity: Studying AI's struggle with creative tasks.

Artificial intelligence (AI) has long sought to mimic human thinking. Recently, researchers have focused on large language models (LLMs), which have shown impressive capabilities. However, while many tests measure how well these models imitate human behavior, few assess their ability to solve creative problems. Creative problem-solving in humans involves making connections between different ideas, a skill that many researchers have studied.

One challenge in creative problem-solving is the presence of misleading information, often called "red herrings." These distractors pull people toward wrong answers and away from the correct ones. In studies, researchers have found that showing participants words that look similar to the intended answer but are incorrect can induce a fixation effect, making it harder to think of the right answer.

To understand how LLMs deal with creative problem-solving and red herrings, researchers have created a new dataset based on a British quiz show called "Only Connect." In the show's "Connecting Wall" segment, contestants must group 16 mixed-up clue words into four categories, identifying the correct relationships among them. The show is designed with built-in red herrings, making it a useful case for examining how LLMs tackle these creative challenges.

The Only Connect Wall Dataset

The dataset consists of 618 walls, each containing 16 clue words. The goal is to sort these words into four connected groups, with each group sharing a specific relationship. The clues cover various topics, such as history, famous people, and cultural references. However, each wall also contains red herrings, clue words that appear to fit in more than one group, adding a layer of complexity.
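To make that structure concrete, here is a minimal sketch of what a single wall could look like in code. The wall, its groups, and the field names are all hypothetical illustrations, not entries or schema from the actual OCW dataset.

```python
# Hypothetical wall for illustration only; the real OCW dataset's
# contents and schema may differ.
wall = {
    "wall_id": "example_001",
    "groups": [
        {"connection": "___ Bridge",     "clues": ["London", "Tower", "Millennium", "Humber"]},
        {"connection": "Famous Davids",  "clues": ["Bowie", "Attenborough", "Beckham", "Hockney"]},
        {"connection": "Types of tea",   "clues": ["Green", "Earl Grey", "Oolong", "Builder's"]},
        {"connection": "Shades of blue", "clues": ["Navy", "Royal", "Sky", "Powder"]},
    ],
}

# The 16 clues are presented shuffled. A word like "Green" (a tea, but also
# a shade) acts as a red herring because it plausibly fits more than one group.
clues = [clue for group in wall["groups"] for clue in group["clues"]]
```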

Researchers gathered this dataset by watching episodes of the show and recording the correct groupings and connections for each wall. The dataset is structured to make it easy to evaluate how well LLMs handle these tasks, focusing specifically on their creative problem-solving abilities.

Tasks and Evaluation

The dataset includes two main tasks:

  1. Grouping: Evaluating how well LLMs can cluster clue words into the correct categories.
  2. Connections: Assessing how accurately LLMs can identify the relationships among words in each category.

For the grouping task, researchers measure success using several metrics, including the number of correctly solved walls and the accuracy of the groupings. For the connections task, they look at exact matches, as well as less strict measures that allow for some variation.
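As a rough illustration of the grouping metrics just described, the sketch below counts fully solved walls and the fraction of groups recovered exactly. The function names are invented here, and the paper's actual metric definitions may be stricter or more fine-grained.

```python
# Minimal sketch of the grouping metrics described above: the number of
# fully solved walls and the fraction of groups recovered exactly.

def score_wall(predicted_groups, gold_groups):
    """Return (wall_solved, n_correct_groups) for one wall.

    Each argument is a list of four groups; each group is a collection of
    four clue words, and order within a group does not matter.
    """
    gold_sets = [frozenset(g) for g in gold_groups]
    correct = sum(frozenset(p) in gold_sets for p in predicted_groups)
    return correct == len(gold_sets), correct

def aggregate(all_predictions, all_golds):
    solved = groups_correct = total_groups = 0
    for pred, gold in zip(all_predictions, all_golds):
        wall_solved, n_correct = score_wall(pred, gold)
        solved += wall_solved
        groups_correct += n_correct
        total_groups += len(gold)
    return {"walls_solved": solved, "group_accuracy": groups_correct / total_groups}
```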

The evaluation aims to see how LLMs perform compared to humans, particularly in their ability to handle the distractions created by red herrings. Researchers compared the performance of various LLMs, including the latest models from OpenAI.

Methodology

To evaluate the models, researchers employed different techniques. For the grouping task, they used clustering algorithms on word embeddings, which are mathematical representations of words based on their meanings. The algorithms attempt to find groups that match the correct answers by looking for patterns in how the words relate to each other.
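One way to realize this pipeline is to cluster precomputed word vectors with an off-the-shelf algorithm. The sketch below uses plain k-means from scikit-learn, which does not enforce groups of exactly four words, so it is a simplification of whatever constrained method the study actually applies; the `embed` function is assumed to exist.

```python
# Minimal sketch of the embeddings-plus-clustering idea, assuming each of
# the 16 clue words can be mapped to a vector by `embed`.
import numpy as np
from sklearn.cluster import KMeans

def group_clues(clue_words, embed):
    """embed: callable mapping a word to a 1-D NumPy vector."""
    vectors = np.stack([embed(w) for w in clue_words])
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(vectors)
    groups = [[] for _ in range(4)]
    for word, label in zip(clue_words, labels):
        groups[label].append(word)
    return groups
```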

For the connections task, they applied a method called few-shot in-context learning (ICL). This means that they provided the models with a few examples of how to solve the tasks, testing how well they could generalize from these examples to new problems.
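A few-shot prompt for the connections task might be assembled as in the sketch below. The wording and examples are invented for illustration, and `ask_llm` is a placeholder for whichever model API is actually used.

```python
# Rough sketch of a few-shot in-context-learning prompt for naming the
# connection within a group. Examples and phrasing are illustrative only.

FEW_SHOT_EXAMPLES = [
    ("Bowie, Attenborough, Beckham, Hockney", "Famous Davids"),
    ("Green, Earl Grey, Oolong, Builder's", "Types of tea"),
]

def build_prompt(group_words):
    lines = ["Name the connection shared by each group of four words.", ""]
    for words, connection in FEW_SHOT_EXAMPLES:
        lines += [f"Words: {words}", f"Connection: {connection}", ""]
    lines += [f"Words: {', '.join(group_words)}", "Connection:"]
    return "\n".join(lines)

# prediction = ask_llm(build_prompt(["Navy", "Royal", "Sky", "Powder"]))
```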

Researchers also used a mix of static and contextual embeddings. Static embeddings provide a fixed representation of words, while contextual embeddings consider the surrounding words to give a more nuanced meaning.
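The sketch below shows one way to obtain each flavour of embedding, using GloVe vectors for the static case and a BERT encoder for the contextual case. These particular models are assumptions for illustration, not necessarily the ones evaluated in the paper.

```python
# Static vs. contextual embeddings; model choices here are illustrative.
import torch
import gensim.downloader
from transformers import AutoTokenizer, AutoModel

static_vectors = gensim.downloader.load("glove-wiki-gigaword-300")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
contextual_model = AutoModel.from_pretrained("bert-base-uncased")

def static_embed(word):
    # One fixed vector per word, regardless of the surrounding clues.
    return static_vectors[word.lower()]

def contextual_embed(word, wall_clues):
    # Encode the whole wall, then average the vectors of the tokens that
    # belong to `word`, so its representation depends on the other clues.
    text = ", ".join(wall_clues)
    inputs = tokenizer(text, return_tensors="pt")
    word_ids = set(tokenizer(word, add_special_tokens=False)["input_ids"])
    with torch.no_grad():
        hidden = contextual_model(**inputs).last_hidden_state.squeeze(0)
    token_ids = inputs["input_ids"].squeeze(0).tolist()
    positions = [i for i, tok in enumerate(token_ids) if tok in word_ids]
    if not positions:  # fall back if subword splits differ in context
        positions = list(range(len(token_ids)))
    return hidden[positions].mean(dim=0).numpy()
```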

Results

The findings revealed some interesting insights. For the grouping task, the best-performing model managed only a small fraction of solutions compared to human performance. This suggests that while LLMs show promise, they still fall short of human capabilities in creative problem-solving.

One surprising outcome was that providing more examples in few-shot learning did not necessarily lead to better performance. Researchers speculated that this might arise from the nature of the clues, which often require background knowledge to understand fully.

In the connections task, performance was still below human levels, although the more advanced models showed some improvement with more examples. Again, this underlines the challenges faced by LLMs when dealing with complex relationships between words.

Challenges and Limitations

Researchers also noted limitations in their approach. The dataset is primarily based on UK-centric clues, which may not translate well to other languages or cultures. This may restrict the generalization of their findings to a broader range of contexts.

Moreover, the order of clues can significantly impact model performance. Researchers attempted to mitigate this issue by randomizing the order of clues in their evaluations, but future work could explore this further.
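A simple way to implement that mitigation, sketched below, is to shuffle each wall's clues before building the evaluation input; this is an illustrative snippet, not the authors' exact procedure.

```python
# Shuffle the 16 clues so results do not hinge on presentation order.
import random

def randomize_clues(clues, seed=None):
    shuffled = list(clues)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```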

Some models struggled with the context of the clues, which sometimes led to misinterpretations. In certain instances, the models produced irrelevant answers or included clue words in their predictions when they should not have.

Conclusion

The exploration of how LLMs tackle creative problem-solving tasks illuminates some strengths and weaknesses in current AI systems. The findings suggest areas for future research, particularly in enhancing how these models handle misleading information.

The Only Connect Wall dataset serves as a valuable resource for researchers interested in evaluating creative problem-solving abilities in AI. The ongoing development and refinement of LLMs will be crucial to bridging the gap between human-like creativity and machine learning.

Future Directions

Going forward, researchers are encouraged to explore additional datasets that incorporate a broader range of cultural references and challenge LLMs with various languages. Improved models that account for context and ambiguity could lead to better performance in creative tasks.

By continuing to investigate the relationship between human cognitive processes and AI capabilities, the field can move closer to developing systems that can truly think creatively. Strategies such as retrieval-augmented models may provide new avenues for addressing the challenges posed by misleading cues and enhance performance in creative problem-solving tasks.

Original Source

Title: Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset

Abstract: The quest for human imitative AI has been an enduring topic in AI research since its inception. The technical evolution and emerging capabilities of the latest cohort of large language models (LLMs) have reinvigorated the subject beyond academia to the cultural zeitgeist. While recent NLP evaluation benchmark tasks test some aspects of human-imitative behaviour (e.g., BIG-bench's 'human-like behavior' tasks), few, if not none, examine creative problem solving abilities. Creative problem solving in humans is a well-studied topic in cognitive neuroscience with standardized tests that predominantly use the ability to associate (heterogeneous) connections among clue words as a metric for creativity. Exposure to misleading stimuli - distractors dubbed red herrings - impede human performance in such tasks via the fixation effect and Einstellung paradigm. In cognitive neuroscience studies, such fixations are experimentally induced by pre-exposing participants to orthographically similar incorrect words to subsequent word-fragments or clues. The popular British quiz show Only Connect's Connecting Wall segment essentially mimics Mednick's Remote Associates Test (RAT) formulation with built-in, deliberate red herrings, which makes it an ideal proxy dataset to explore and study fixation effect and Einstellung paradigm from cognitive neuroscience in LLMs. In this paper we present the novel Only Connect Wall (OCW) dataset and report results from our evaluation of selected pre-trained language models and LLMs on creative problem solving tasks like grouping clue words by heterogeneous connections, and identifying correct open knowledge domain connections in respective groups. We synthetically generate two additional datasets: OCW-Randomized, OCW-WordNet to further analyze our red-herrings hypothesis in language models. The code and link to the dataset are available at https://github.com/TaatiTeam/OCW.

Authors: Saeid Naeini, Raeid Saqur, Mozhgan Saeidi, John Giorgi, Babak Taati

Last Update: 2023-11-08

Language: English

Source URL: https://arxiv.org/abs/2306.11167

Source PDF: https://arxiv.org/pdf/2306.11167

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
