Simple Science

Cutting edge science explained simply

Computer Science · Computation and Language · Information Retrieval

Analyzing Question Answering Datasets

A study of datasets and metrics in question answering research.

Jamshid Mozafari, Abdelrahman Abdallah, Bhawna Piryani, Adam Jatowt

― 4 min read



In this article, we look at the different datasets used in our research. The information includes details such as the number of questions and the number of hints provided in each dataset.

Dataset Details

We examined three main datasets: TriviaQA, NQ, and WebQ. Here are the details regarding the datasets.

| Dataset | Scenario | Number of Questions | Number of Hints |
| --- | --- | --- | --- |
| TriviaQA | Finetuned | 11,313 | 105,709 |
| TriviaQA | Vanilla | 11,313 | 103,018 |
| NQ | Finetuned | 3,610 | 33,131 |
| NQ | Vanilla | 3,610 | 30,976 |
| WebQ | Finetuned | 2,032 | 16,978 |
| WebQ | Vanilla | 2,032 | 15,812 |
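
To make these counts concrete, here is a minimal sketch of how a single question-with-hints record might be organized; the field names and hint texts are assumptions made for illustration, not the actual schema of these datasets.

```python
# Illustrative sketch only: the field names are assumed for illustration and
# are not the datasets' actual schema.
import json

record = {
    "question": (
        "How many dot positions are usually used in each letter "
        "of the Braille system?"
    ),
    "answers": ["6", "six"],
    "hints": [
        # Hint texts here are invented placeholders.
        "The Braille cell is a small rectangular grid of raised dots.",
        "The cell has two columns of dots.",
    ],
    "dataset": "TriviaQA",
    "scenario": "Vanilla",
}

print(json.dumps(record, indent=2))
```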

Dataset Statistics

Summary statistics for the datasets are also important for our study.

| Statistic | TriviaQA | NQ | WebQ |
| --- | --- | --- | --- |
| Training | 14,645 | 1,000 | 1,000 |
| Validation | 140,973 | 9,638 | 9,619 |
| Test | 14.18 | 14.08 | 13.95 |
| Avg. Hint Length | 14.98 | 15.07 | 15.14 |
| Avg. Hints/Question | 9.62 | 9.63 | 9.61 |
| Avg. Entities/Question | 1.35 | 1.40 | 1.35 |
| Avg. Entities/Hint | 0.96 | 1.00 | 0.98 |
| Avg. Sources/Question | 6.27 | 6.17 | 6.71 |
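
Averaged statistics such as hints per question and hint length can be derived directly from records like the one sketched earlier; below is a minimal sketch of that computation over a toy list of records, purely for illustration.

```python
# Toy records; in practice these would be loaded from the dataset files.
records = [
    {"question": "q1", "hints": ["first hint text", "second hint text", "third"]},
    {"question": "q2", "hints": ["only hint for q2"]},
]

total_hints = sum(len(r["hints"]) for r in records)

# Avg. Hints/Question: total hints divided by number of questions.
avg_hints_per_question = total_hints / len(records)

# Avg. Hint Length: mean number of words per hint.
avg_hint_length = (
    sum(len(hint.split()) for r in records for hint in r["hints"]) / total_hints
)

print(f"Avg. Hints/Question: {avg_hints_per_question:.2f}")
print(f"Avg. Hint Length:    {avg_hint_length:.2f}")
```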

Metrics Used

In this section, we discuss the metrics used to evaluate the methods in our research. We used the scikit-learn library to compute these metrics.

Accuracy (ACC)

This metric reports the proportion of questions for which the model's answer is correct.

Exact Match (EM)

This measures whether the model's answer matches the correct answer exactly as it appears.

Precision (PR)

This is the fraction of words in the passage that also appear in the correct answer.

Recall (RC)

This is the fraction of words from the correct answer that appear in the retrieved passage.

F1-measure (F1)

This is the harmonic mean of precision and recall, balancing the two.

Contains (CON)

This metric checks if the retrieved passage has the entire correct answer.

BERTScore (BERT)

This metric measures how similar the retrieved passage is to the answer, using contextual word embeddings from BERT.
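
As a rough guide, here is a minimal sketch of how the string-matching metrics above could be computed at the token level between a predicted answer (or passage) and the gold answer. It follows common definitions rather than the authors' exact evaluation code (which they report building on scikit-learn); BERTScore would additionally require a dedicated package such as bert-score.

```python
# Minimal sketch of the lexical metrics described above; illustrative only,
# not the paper's evaluation script.
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split into tokens (a simplified normalization)."""
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def contains(passage: str, gold: str) -> float:
    """CON: 1.0 if the gold answer appears verbatim in the passage."""
    return float(" ".join(normalize(gold)) in " ".join(normalize(passage)))


def precision_recall_f1(prediction: str, gold: str) -> tuple[float, float, float]:
    """Token-level PR, RC, and F1 (harmonic mean of PR and RC)."""
    pred_tokens = Counter(normalize(prediction))
    gold_tokens = Counter(normalize(gold))
    overlap = sum((pred_tokens & gold_tokens).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    pr = overlap / sum(pred_tokens.values())
    rc = overlap / sum(gold_tokens.values())
    f1 = 2 * pr * rc / (pr + rc)
    return pr, rc, f1


print(exact_match("top cat", "Top Cat"))                    # 1.0
print(contains("the gang was led by top cat", "top cat"))   # 1.0
print(precision_recall_f1("daniel day", "henry fonda"))     # (0.0, 0.0, 0.0)
```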

Experimental Results

In this section, we provide results from our experiments in various scenarios. We examine how different conditions and factors affect the results.

In the result tables, the number of hints determines how much context the reader receives, while the ranking column lists the methods used to rerank these hints; a sketch of this step follows.
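
The snippet below truncates a ranked hint list to the top-k entries and joins them into a single context block for the reader. The scores and hint texts are invented for illustration, and the formatting is an assumption rather than the paper's exact setup.

```python
# Illustrative only: scores and hint texts are invented; the real reranking
# methods and context formatting in the paper may differ.
def build_context(scored_hints: list[tuple[str, float]], top_k: int) -> str:
    """Keep the top_k highest-scoring hints and concatenate them into one block."""
    ranked = sorted(scored_hints, key=lambda pair: pair[1], reverse=True)
    selected = [hint for hint, _ in ranked[:top_k]]
    return "\n".join(f"Hint {i + 1}: {hint}" for i, hint in enumerate(selected))


scored_hints = [
    ("The character starred in a 1960s animated TV series.", 0.82),
    ("The gang lived in a Manhattan alley.", 0.64),
    ("Benny the Ball was his closest sidekick.", 0.91),
]
print(build_context(scored_hints, top_k=2))
```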

Results for TriviaQA Dataset

In our experiments, we analyzed the results of using T5-3b as the reader. We applied both zero-shot and few-shot learning strategies on the TriviaQA dataset.
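
As a minimal sketch of this reading step (not the paper's actual code), the snippet below queries a T5 checkpoint from Hugging Face with a question plus hints as context. The prompt format and hint text are assumptions, and a small checkpoint is used so the example stays runnable on modest hardware.

```python
# Sketch of the reading step with a T5 model and hints as context.
# Assumptions: checkpoint name, prompt format, and hint text are illustrative;
# swap in "t5-3b" (with enough GPU memory) to mirror the reader used here.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

question = (
    "Who was the leader of the gang whose members included "
    "Benny the Ball, Brain, and Choo Choo?"
)
context = (
    "Hint 1: The character starred in a 1960s animated TV series.\n"
    "Hint 2: Benny the Ball was his closest sidekick."
)

prompt = f"question: {question} context: {context}"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```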

Here are the results based on different ranking methods and hint amounts.

[Table: EM, F1, PR, RC, CON, and BERTScore on TriviaQA for each number of hints and ranking method, with separate zero-shot and few-shot panels; numeric values are not reproduced in this summary.]

NQ Dataset Results

Similar to the previous dataset, we provide results for the NQ dataset using T5-3b under zero-shot and few-shot conditions.

[Table: EM, F1, PR, RC, CON, and BERTScore on NQ for each number of hints and ranking method, with separate zero-shot and few-shot panels; numeric values are not reproduced in this summary.]

WebQ Dataset Results

Lastly, we present results for the WebQ dataset under the same conditions.

[Table: EM, F1, PR, RC, CON, and BERTScore on WebQ for each number of hints and ranking method, with separate zero-shot and few-shot panels; numeric values are not reproduced in this summary.]

Performance of T5-3b and LLaMA-7b

We also examined the performance of T5-3b and LLaMA-7b in various scenarios using different hint methods.

Case Studies

In this section, we present several case studies that illustrate the prompts we chose, along with examples from our experiments.

Sample Questions and Answers

Here are some sample questions from the datasets, along with how the models responded:

| Question | Retriever | LLaMA-70b | True Answer |
| --- | --- | --- | --- |
| How many dot positions are usually used in each letter of the Braille system? | 6 | six | 6, six |
| Who was the leader of the gang whose members included Benny the Ball, Brain, and Choo Choo? | the bowery boys | top cat | top cat |
| Which Glasgow group signed to Creation Records and recorded their debut single "All Fall Down", in 1985? | primal scream | the pastels | the jesus and mary chain |
| Who is the only man to win a best actor Oscar playing brothers? | jack nicholson | daniel day | henry fonda |

Hints Generated

Our case studies illustrate how hints were generated for various questions. Each hint provided context to help the models find the correct answers.
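
To illustrate how such hints could be produced, here is a minimal sketch of prompting an instruction-tuned LLM; the checkpoint, prompt wording, and number of hints requested are assumptions for illustration and are not the paper's exact HINTQA prompts.

```python
# Illustrative hint-generation prompt. The checkpoint and prompt wording are
# assumptions; the LLaMA checkpoint below is gated on the Hugging Face Hub,
# and any instruction-tuned causal LM can be substituted.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

question = (
    "Which Glasgow group signed to Creation Records and recorded their "
    'debut single "All Fall Down", in 1985?'
)
prompt = (
    "Give five short hints that point toward the answer to the question below, "
    "without stating the answer directly.\n"
    f"Question: {question}\nHints:"
)

result = generator(prompt, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"])
```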

Conclusion

In this article, we explored the datasets used in our research, the metrics used for evaluation, our experimental results, and case studies that illustrate the practical application of our methods. The goal is to contribute to the understanding of how different models perform in answering questions with the aid of contextual hints.

Original Source

Title: Exploring Hint Generation Approaches in Open-Domain Question Answering

Abstract: Automatic Question Answering (QA) systems rely on contextual information to provide accurate answers. Commonly, contexts are prepared through either retrieval-based or generation-based methods. The former involves retrieving relevant documents from a corpus like Wikipedia, whereas the latter uses generative models such as Large Language Models (LLMs) to generate the context. In this paper, we introduce a novel context preparation approach called HINTQA, which employs Automatic Hint Generation (HG) techniques. Unlike traditional methods, HINTQA prompts LLMs to produce hints about potential answers for the question rather than generating relevant context. We evaluate our approach across three QA datasets including TriviaQA, NaturalQuestions, and Web Questions, examining how the number and order of hints impact performance. Our findings show that the HINTQA surpasses both retrieval-based and generation-based approaches. We demonstrate that hints enhance the accuracy of answers more than retrieved and generated contexts.

Authors: Jamshid Mozafari, Abdelrahman Abdallah, Bhawna Piryani, Adam Jatowt

Last Update: Sep 24, 2024

Language: English

Source URL: https://arxiv.org/abs/2409.16096

Source PDF: https://arxiv.org/pdf/2409.16096

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
