Simple Science

Cutting edge science explained simply

# Computer Science # Computation and Language # Artificial Intelligence # Machine Learning

Assessing AI's Creative Problem-Solving Skills

New dataset highlights AI performance in creative tasks with distractions.

― 5 min read


AI vs. Human Creativity: Studying AI's struggle with creative tasks.

Artificial intelligence (AI) has long sought to mimic human thinking. Recently, researchers have focused on large language models (LLMs), which have shown impressive capabilities. However, while many tests measure how well these models imitate human behavior, few assess their ability to solve creative problems. Creative problem-solving in humans involves making connections between different ideas, a skill that many researchers have studied.

One challenge in creative problem-solving is the presence of misleading information, often called "red herrings." These distractors pull people toward wrong answers and away from the correct ones. In studies, researchers have found that showing participants words that look similar to the intended answer but are incorrect can induce a fixation effect, making it harder to think of the right answer.

To understand how LLMs deal with creative problem-solving and red herrings, researchers have created a new dataset based on a British quiz show called "Only Connect." In the show's "Connecting Wall" segment, contestants must group 16 mixed-up clue words into four categories, identifying the correct relationships among them. The show is designed with built-in red herrings, making it a useful case for examining how LLMs tackle these creative challenges.

The Only Connect Wall Dataset

The dataset consists of 618 walls, each containing 16 clue words. The goal is to sort these words into four connected groups, with each group sharing a specific relationship. The clues cover various topics, such as history, famous people, and cultural references. However, each wall also contains red herrings, clue words that appear to fit in more than one group, adding a layer of complexity.
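To make that structure concrete, here is a minimal sketch of what a single wall could look like in code. The wall, its groups, and the field names are all hypothetical illustrations, not entries or schema from the actual OCW dataset.

```python
# Hypothetical wall for illustration only; the real OCW dataset's
# contents and schema may differ.
wall = {
    "wall_id": "example_001",
    "groups": [
        {"connection": "___ Bridge",     "clues": ["London", "Tower", "Millennium", "Humber"]},
        {"connection": "Famous Davids",  "clues": ["Bowie", "Attenborough", "Beckham", "Hockney"]},
        {"connection": "Types of tea",   "clues": ["Green", "Earl Grey", "Oolong", "Builder's"]},
        {"connection": "Shades of blue", "clues": ["Navy", "Royal", "Sky", "Powder"]},
    ],
}

# The 16 clues are presented shuffled. A word like "Green" (a tea, but also
# a shade) acts as a red herring because it plausibly fits more than one group.
clues = [clue for group in wall["groups"] for clue in group["clues"]]
```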

Researchers gathered this dataset by watching episodes of the show and recording the correct groupings and connections for each wall. The dataset is structured to make it easy to evaluate how well LLMs handle these tasks, focusing specifically on their creative problem-solving abilities.

Tasks and Evaluation

The dataset includes two main tasks:

  1. Grouping: Evaluating how well LLMs can cluster clue words into the correct categories.
  2. Connections: Assessing how accurately LLMs can identify the relationships among words in each category.

For the grouping task, researchers measure success using several metrics, including the number of correctly solved walls and the accuracy of the groupings. For the connections task, they look at exact matches, as well as less strict measures that allow for some variation.
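As a rough illustration of the grouping metrics just described, the sketch below counts fully solved walls and the fraction of groups recovered exactly. The function names are invented here, and the paper's actual metric definitions may be stricter or more fine-grained.

```python
# Minimal sketch of the grouping metrics described above: the number of
# fully solved walls and the fraction of groups recovered exactly.

def score_wall(predicted_groups, gold_groups):
    """Return (wall_solved, n_correct_groups) for one wall.

    Each argument is a list of four groups; each group is a collection of
    four clue words, and order within a group does not matter.
    """
    gold_sets = [frozenset(g) for g in gold_groups]
    correct = sum(frozenset(p) in gold_sets for p in predicted_groups)
    return correct == len(gold_sets), correct

def aggregate(all_predictions, all_golds):
    solved = groups_correct = total_groups = 0
    for pred, gold in zip(all_predictions, all_golds):
        wall_solved, n_correct = score_wall(pred, gold)
        solved += wall_solved
        groups_correct += n_correct
        total_groups += len(gold)
    return {"walls_solved": solved, "group_accuracy": groups_correct / total_groups}
```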

The evaluation aims to see how LLMs perform compared to humans, particularly in their ability to handle the distractions created by red herrings. Researchers compared the performance of various LLMs, including the latest models from OpenAI.

Methodology

To evaluate the models, researchers employed different techniques. For the grouping task, they used clustering algorithms on word embeddings, which are mathematical representations of words based on their meanings. The algorithms attempt to find groups that match the correct answers by looking for patterns in how the words relate to each other.
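One way to realize this pipeline is to cluster precomputed word vectors with an off-the-shelf algorithm. The sketch below uses plain k-means from scikit-learn, which does not enforce groups of exactly four words, so it is a simplification of whatever constrained method the study actually applies; the `embed` function is assumed to exist.

```python
# Minimal sketch of the embeddings-plus-clustering idea, assuming each of
# the 16 clue words can be mapped to a vector by `embed`.
import numpy as np
from sklearn.cluster import KMeans

def group_clues(clue_words, embed):
    """embed: callable mapping a word to a 1-D NumPy vector."""
    vectors = np.stack([embed(w) for w in clue_words])
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(vectors)
    groups = [[] for _ in range(4)]
    for word, label in zip(clue_words, labels):
        groups[label].append(word)
    return groups
```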

For the connections task, they applied a method called few-shot in-context learning (ICL). This means that they provided the models with a few examples of how to solve the tasks, testing how well they could generalize from these examples to new problems.
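A few-shot prompt for the connections task might be assembled as in the sketch below. The wording and examples are invented for illustration, and `ask_llm` is a placeholder for whichever model API is actually used.

```python
# Rough sketch of a few-shot in-context-learning prompt for naming the
# connection within a group. Examples and phrasing are illustrative only.

FEW_SHOT_EXAMPLES = [
    ("Bowie, Attenborough, Beckham, Hockney", "Famous Davids"),
    ("Green, Earl Grey, Oolong, Builder's", "Types of tea"),
]

def build_prompt(group_words):
    lines = ["Name the connection shared by each group of four words.", ""]
    for words, connection in FEW_SHOT_EXAMPLES:
        lines += [f"Words: {words}", f"Connection: {connection}", ""]
    lines += [f"Words: {', '.join(group_words)}", "Connection:"]
    return "\n".join(lines)

# prediction = ask_llm(build_prompt(["Navy", "Royal", "Sky", "Powder"]))
```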

Researchers also used a mix of static and contextual embeddings. Static embeddings provide a fixed representation of words, while contextual embeddings consider the surrounding words to give a more nuanced meaning.
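The sketch below shows one way to obtain each flavour of embedding, using GloVe vectors for the static case and a BERT encoder for the contextual case. These particular models are assumptions for illustration, not necessarily the ones evaluated in the paper.

```python
# Static vs. contextual embeddings; model choices here are illustrative.
import torch
import gensim.downloader
from transformers import AutoTokenizer, AutoModel

static_vectors = gensim.downloader.load("glove-wiki-gigaword-300")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
contextual_model = AutoModel.from_pretrained("bert-base-uncased")

def static_embed(word):
    # One fixed vector per word, regardless of the surrounding clues.
    return static_vectors[word.lower()]

def contextual_embed(word, wall_clues):
    # Encode the whole wall, then average the vectors of the tokens that
    # belong to `word`, so its representation depends on the other clues.
    text = ", ".join(wall_clues)
    inputs = tokenizer(text, return_tensors="pt")
    word_ids = set(tokenizer(word, add_special_tokens=False)["input_ids"])
    with torch.no_grad():
        hidden = contextual_model(**inputs).last_hidden_state.squeeze(0)
    token_ids = inputs["input_ids"].squeeze(0).tolist()
    positions = [i for i, tok in enumerate(token_ids) if tok in word_ids]
    if not positions:  # fall back if subword splits differ in context
        positions = list(range(len(token_ids)))
    return hidden[positions].mean(dim=0).numpy()
```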

Results

The findings revealed some interesting insights. For the grouping task, the best-performing model managed only a small fraction of solutions compared to human performance. This suggests that while LLMs show promise, they still fall short of human capabilities in creative problem-solving.

One surprising outcome was that providing more examples in few-shot learning did not necessarily lead to better performance. Researchers speculated that this might arise from the nature of the clues, which often require background knowledge to understand fully.

In the connections task, performance was still below human levels, although the more advanced models showed some improvement with more examples. Again, this underlines the challenges faced by LLMs when dealing with complex relationships between words.

Challenges and Limitations

Researchers also noted limitations in their approach. The dataset is primarily based on UK-centric clues, which may not translate well to other languages or cultures. This may restrict the generalization of their findings to a broader range of contexts.

Moreover, the order of clues can significantly impact model performance. Researchers attempted to mitigate this issue by randomizing the order of clues in their evaluations, but future work could explore this further.
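A simple way to implement that mitigation, sketched below, is to shuffle each wall's clues before building the evaluation input; this is an illustrative snippet, not the authors' exact procedure.

```python
# Shuffle the 16 clues so results do not hinge on presentation order.
import random

def randomize_clues(clues, seed=None):
    shuffled = list(clues)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```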

Some models struggled with the context of the clues, which sometimes led to misinterpretations. In certain instances, the models produced irrelevant answers or included clue words in their predictions when they should not have.

Conclusion

The exploration of how LLMs tackle creative problem-solving tasks illuminates some strengths and weaknesses in current AI systems. The findings suggest areas for future research, particularly in enhancing how these models handle misleading information.

The Only Connect Wall dataset serves as a valuable resource for researchers interested in evaluating creative problem-solving abilities in AI. The ongoing development and refinement of LLMs will be crucial to bridging the gap between human-like creativity and machine learning.

Future Directions

Going forward, researchers are encouraged to explore additional datasets that incorporate a broader range of cultural references and challenge LLMs with various languages. Improved models that account for context and ambiguity could lead to better performance in creative tasks.

By continuing to investigate the relationship between human cognitive processes and AI capabilities, the field can move closer to developing systems that can truly think creatively. Strategies such as retrieval-augmented models may provide new avenues for addressing the challenges posed by misleading cues and enhance performance in creative problem-solving tasks.

Original Source

Title: Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset

Abstract: The quest for human imitative AI has been an enduring topic in AI research since its inception. The technical evolution and emerging capabilities of the latest cohort of large language models (LLMs) have reinvigorated the subject beyond academia to the cultural zeitgeist. While recent NLP evaluation benchmark tasks test some aspects of human-imitative behaviour (e.g., BIG-bench's 'human-like behavior' tasks), few, if not none, examine creative problem solving abilities. Creative problem solving in humans is a well-studied topic in cognitive neuroscience with standardized tests that predominantly use the ability to associate (heterogeneous) connections among clue words as a metric for creativity. Exposure to misleading stimuli - distractors dubbed red herrings - impede human performance in such tasks via the fixation effect and Einstellung paradigm. In cognitive neuroscience studies, such fixations are experimentally induced by pre-exposing participants to orthographically similar incorrect words to subsequent word-fragments or clues. The popular British quiz show Only Connect's Connecting Wall segment essentially mimics Mednick's Remote Associates Test (RAT) formulation with built-in, deliberate red herrings, which makes it an ideal proxy dataset to explore and study fixation effect and Einstellung paradigm from cognitive neuroscience in LLMs. In this paper we present the novel Only Connect Wall (OCW) dataset and report results from our evaluation of selected pre-trained language models and LLMs on creative problem solving tasks like grouping clue words by heterogeneous connections, and identifying correct open knowledge domain connections in respective groups. We synthetically generate two additional datasets: OCW-Randomized, OCW-WordNet to further analyze our red-herrings hypothesis in language models. The code and link to the dataset are available at https://github.com/TaatiTeam/OCW.

Authors: Saeid Naeini, Raeid Saqur, Mozhgan Saeidi, John Giorgi, Babak Taati

Last Update: 2023-11-08

Language: English

Source URL: https://arxiv.org/abs/2306.11167

Source PDF: https://arxiv.org/pdf/2306.11167

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
