
DetectBench: A New Standard for Evidence Detection in Language Models

DetectBench evaluates LLMs on their ability to detect hidden evidence in reasoning tasks.


Detecting evidence is essential for reasoning tasks. This article discusses a new benchmark called DetectBench, which tests how well large language models (LLMs) can identify and connect implicit evidence within long contexts. The goal is to improve how these models perform in reasoning tasks that depend on understanding context.

What is DetectBench?

DetectBench is a set of 3,928 multiple-choice questions, with each question averaging around 994 tokens. Each question typically contains about 4.55 pieces of hidden evidence that must be pieced together to arrive at the correct answer. On average, solving each question requires making about 7.62 logical steps.
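
The paper does not prescribe a storage format for these items; purely as an illustration, the hypothetical Python sketch below shows the fields a single question would need to carry. All names here are invented for this example.

```python
from dataclasses import dataclass

@dataclass
class DetectBenchItem:
    """Hypothetical representation of a single DetectBench question.

    Field names are invented for illustration; the released dataset may
    use a different schema.
    """
    context: str                # long passage, roughly 994 tokens on average
    question: str               # the multiple-choice question
    options: list[str]          # candidate answers
    answer_index: int           # index of the correct option
    evidence: list[str]         # hidden evidence sentences (about 4.55 per question)
    reasoning_steps: list[str]  # annotated logical jumps (about 7.62 per question)

item = DetectBenchItem(
    context="A long detective-style passage ...",
    question="Who could not have committed the theft?",
    options=["The gardener", "The cook", "The driver", "The guest"],
    answer_index=2,
    evidence=["The driver was seen across town at the time of the theft.", "..."],
    reasoning_steps=["The theft happened at 9 pm.", "The driver was elsewhere at 9 pm.", "..."],
)
```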

The aim is to evaluate LLMs' abilities to identify and connect hidden evidence in complex tasks. The authors also introduce two methods to boost LLM performance: the Detective Reasoning Prompt and Detective Reasoning Fine-tuning. Their experiments show that current LLMs struggle significantly with evidence detection compared to humans.

Importance of Evidence Detection

Evidence detection is critical because it helps in understanding the underlying context of a question. Many existing tasks assess the ability to find evidence and reason within that context. For example, reading comprehension or fact verification tasks often present clear evidence that is easy for models to find. However, in real-life scenarios, evidence is often not as obvious, requiring deeper reasoning to connect the dots.

The Challenge for LLMs

LLMs often fail to recognize the hidden evidence in a context. This may lead them to produce random or incorrect answers. The difference between clear and subtle evidence can be significant, making it harder for models to reason effectively. Therefore, it is crucial to assess whether LLMs can actually find and connect these hidden pieces of evidence to formulate logical answers.

Design of DetectBench

The design of DetectBench aims to create a realistic setting for evidence detection and reasoning. The questions in this benchmark are derived from detective puzzles, where answers are not straightforward. The benchmark is structured so that:

  1. Evidence is not easily recognized through direct text matching.
  2. Multiple pieces of evidence must be combined for effective reasoning.
  3. Each question comes with detailed annotations showing how the reasoning process leads to the answer.

Testing Human and LLM Performance

To gauge the effectiveness of DetectBench, researchers invited human participants to answer questions from the benchmark. Compared to LLMs, humans demonstrated significantly higher accuracy in both detecting evidence and answering questions correctly. This finding confirms the need for better tools and strategies to improve LLM capabilities.

Detective Reasoning Prompt

One of the key strategies introduced in this research is the Detective Reasoning Prompt, which consists of four stages (a prompt-assembly sketch follows the list):

  1. Evidence Detection: Encourages the model to find all pieces of evidence in the given context.
  2. Evidence Association: Helps the model understand how different pieces of evidence connect and generate new insights.
  3. Answer Inspiration: Guides the model in identifying the relevant evidence needed to formulate an answer.
  4. Weighted Reasoning: Reinforces the importance of the reasoning process in determining the final answer.
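
The exact wording of the prompt is defined in the paper and not reproduced in this summary. As a rough illustration, here is a minimal Python sketch of how a four-stage prompt of this shape might be assembled, with the stage instructions paraphrased rather than quoted:

```python
# Minimal sketch of assembling a four-stage detective-style prompt.
# The stage wording is paraphrased for illustration, not the paper's exact prompt.

STAGES = [
    ("Evidence Detection",
     "List every piece of evidence in the context, even details that seem minor."),
    ("Evidence Association",
     "Explain how the pieces of evidence connect and what new facts they imply."),
    ("Answer Inspiration",
     "Pick out the evidence and implications most relevant to the question."),
    ("Weighted Reasoning",
     "Reason step by step from that evidence to the final answer, then state your choice."),
]

def build_detective_prompt(context: str, question: str, options: list[str]) -> str:
    option_text = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    stage_text = "\n".join(
        f"Step {i + 1} ({name}): {instruction}"
        for i, (name, instruction) in enumerate(STAGES)
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        f"Options:\n{option_text}\n\n"
        f"Work through the following steps:\n{stage_text}"
    )
```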

Detective Reasoning Fine-tuning

In addition to prompts, a fine-tuning strategy was developed to enhance models' abilities in evidence detection. By using DetectBench as a source of targeted training data, models can learn to detect evidence and reason over it more effectively.

The results of these improvements indicate that fine-tuning significantly increases both evidence detection accuracy and overall performance. Models trained in this way show greater success in handling questions from DetectBench.
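
The paper's actual training recipe is not detailed in this summary. The sketch below only illustrates the general idea: turning an annotated question into a supervised (prompt, target) pair whose target spells out the evidence and reasoning before the final answer. The formatting and parameter names are assumptions.

```python
# Hypothetical sketch: build a supervised training pair from an annotated question.
# The target asks the model to write out evidence and reasoning before the answer.
# Formatting and field names are invented; the paper's training format may differ.

def to_training_pair(context: str, question: str, options: list[str],
                     evidence: list[str], steps: list[str], answer_index: int) -> dict[str, str]:
    option_text = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nOptions:\n{option_text}"
    target = (
        "Evidence:\n" + "\n".join(f"- {e}" for e in evidence)
        + "\n\nReasoning:\n" + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
        + f"\n\nAnswer: {chr(65 + answer_index)}"
    )
    return {"prompt": prompt, "completion": target}
```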

Comparison with Other Benchmarks

DetectBench stands out from traditional benchmarks in information retrieval and commonsense reasoning. Most existing benchmarks present evidence that is clear and easy to find, while DetectBench focuses on implicit evidence that models must work to uncover. This unique design aims to reflect more accurately the challenges faced in real-world reasoning tasks.

Performance Analysis

The results from testing various LLMs on DetectBench reveal several trends (a simple scoring sketch follows the list):

  • LLMs generally struggle with evidence detection. For instance, GPT4-Turbo had an average evidence-detection score of only 44.4, while open-source models scored even lower.
  • There is a clear link between how well models detect evidence and how accurately they can answer questions. When given direct prompts about evidence, model performance improved significantly.
  • The Detective Reasoning Prompt was found to outperform other prompting methods, leading to better reasoning and evidence detection.
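
The official scoring procedure is not reproduced here; the sketch below shows one simple way comparable numbers could be computed, treating evidence detection as recall over the annotated evidence sentences and answering as plain multiple-choice accuracy.

```python
# Illustrative metrics only: evidence detection scored as recall over the annotated
# evidence sentences, plus plain multiple-choice accuracy. The official DetectBench
# scoring may count matches differently.

def evidence_recall(predicted: list[str], gold: list[str]) -> float:
    pred = {p.strip().lower() for p in predicted}
    found = sum(1 for g in gold if g.strip().lower() in pred)
    return found / len(gold) if gold else 0.0

def answer_accuracy(predicted: list[int], gold: list[int]) -> float:
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold) if gold else 0.0
```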

Additional Findings

Further analysis of the models revealed that longer texts and more complex questions tend to lower performance. For example, as the context length increased, accuracy dropped notably. This indicates that while models may recognize evidence, the complexity of the reasoning steps can hinder their ability to provide correct answers.
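
Assuming per-question records of context length and correctness are available, one simple way to check this trend is to bucket accuracy by length, as in the illustrative sketch below (the bucket edges are arbitrary):

```python
# Illustrative analysis: bucket per-question correctness by context length to see
# whether accuracy drops as contexts grow. Bucket edges are arbitrary choices.
from collections import defaultdict

def accuracy_by_length(results: list[tuple[int, bool]],
                       edges: tuple[int, ...] = (500, 1000, 2000)) -> dict[str, float]:
    """results holds (context_length_in_tokens, answered_correctly) per question."""
    buckets: dict[str, list[bool]] = defaultdict(list)
    for length, correct in results:
        label = next((f"<= {e}" for e in edges if length <= e), f"> {edges[-1]}")
        buckets[label].append(correct)
    return {label: sum(vals) / len(vals) for label, vals in buckets.items()}
```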

The researchers also created two additional datasets: DetectBench-Test-Hard and DetectBench-Test-Distract, aimed at distinguishing model performance further. These datasets feature longer contexts and more intricate logical steps, making the reasoning process even more challenging.

Ethical Considerations

The benchmarks used in this study include sensitive topics, such as crime. There is a concern that LLMs prioritizing safety may refuse to answer questions related to these topics, potentially limiting their effectiveness. The researchers aim to strike a balance, ensuring that models can handle sensitive questions while maintaining safety standards.

Conclusion

In summary, DetectBench serves as a valuable tool for assessing and improving LLMs' abilities in evidence detection and reasoning. By focusing on implicit evidence and incorporating innovative prompting and fine-tuning strategies, this benchmark provides insights that can help refine the performance of LLMs. The results suggest that with the right training and approach, LLMs can improve significantly in understanding and reasoning based on complex contexts, which is key for their future development and application.

Original Source

Title: DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence?

Abstract: Detecting evidence within the context is a key step in the process of reasoning task. Evaluating and enhancing the capabilities of LLMs in evidence detection will strengthen context-based reasoning performance. This paper proposes a benchmark called DetectBench for verifying the ability to detect and piece together implicit evidence within a long context. DetectBench contains 3,928 multiple-choice questions, with an average of 994 tokens per question. Each question contains an average of 4.55 pieces of implicit evidence, and solving the problem typically requires 7.62 logical jumps to find the correct answer. To enhance the performance of LLMs in evidence detection, this paper proposes Detective Reasoning Prompt and Finetune. Experiments demonstrate that the existing LLMs' abilities to detect evidence in long contexts are far inferior to humans. However, the Detective Reasoning Prompt effectively enhances the capability of powerful LLMs in evidence detection, while the Finetuning method shows significant effects in enhancing the performance of weaker LLMs. Moreover, when the abilities of LLMs in evidence detection are improved, their final reasoning performance is also enhanced accordingly.

Authors: Zhouhong Gu, Lin Zhang, Xiaoxuan Zhu, Jiangjie Chen, Wenhao Huang, Yikai Zhang, Shusen Wang, Zheyu Ye, Yan Gao, Hongwei Feng, Yanghua Xiao

Last Update: 2024-11-11

Language: English

Source URL: https://arxiv.org/abs/2406.12641

Source PDF: https://arxiv.org/pdf/2406.12641

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
