Simple Science

Cutting edge science explained simply

Computer Science · Computation and Language · Information Retrieval

Analyzing Question Answering Datasets

A study of datasets and metrics in question answering research.

Jamshid Mozafari, Abdelrahman Abdallah, Bhawna Piryani, Adam Jatowt

― 4 min read



In this article, we look at the different datasets used in our research. The information includes details such as the number of questions and the number of hints provided in each dataset.

Dataset Details

We examined three main datasets: TriviaQA, NQ, and WebQ. Here are the details regarding the datasets.

| Dataset | Scenario | Number of Questions | Number of Hints |
| --- | --- | --- | --- |
| TriviaQA | Finetuned | 11,313 | 105,709 |
| TriviaQA | Vanilla | 11,313 | 103,018 |
| NQ | Finetuned | 3,610 | 33,131 |
| NQ | Vanilla | 3,610 | 30,976 |
| WebQ | Finetuned | 2,032 | 16,978 |
| WebQ | Vanilla | 2,032 | 15,812 |
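
To make these counts concrete, here is a minimal sketch of how a single question-with-hints record might be organized; the field names and hint texts are assumptions made for illustration, not the actual schema of these datasets.

```python
# Illustrative sketch only: the field names are assumed for illustration and
# are not the datasets' actual schema.
import json

record = {
    "question": (
        "How many dot positions are usually used in each letter "
        "of the Braille system?"
    ),
    "answers": ["6", "six"],
    "hints": [
        # Hint texts here are invented placeholders.
        "The Braille cell is a small rectangular grid of raised dots.",
        "The cell has two columns of dots.",
    ],
    "dataset": "TriviaQA",
    "scenario": "Vanilla",
}

print(json.dumps(record, indent=2))
```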

Dataset Statistics

Summary statistics for the datasets are also important for our study.

| Statistic | TriviaQA | NQ | WebQ |
| --- | --- | --- | --- |
| Training | 14,645 | 1,000 | 1,000 |
| Validation | 140,973 | 9,638 | 9,619 |
| Test | 14.18 | 14.08 | 13.95 |
| Avg. Hint Length | 14.98 | 15.07 | 15.14 |
| Avg. Hints/Question | 9.62 | 9.63 | 9.61 |
| Avg. Entities/Question | 1.35 | 1.40 | 1.35 |
| Avg. Entities/Hint | 0.96 | 1.00 | 0.98 |
| Avg. Sources/Question | 6.27 | 6.17 | 6.71 |
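
Averaged statistics such as hints per question and hint length can be derived directly from records like the one sketched earlier; below is a minimal sketch of that computation over a toy list of records, purely for illustration.

```python
# Toy records; in practice these would be loaded from the dataset files.
records = [
    {"question": "q1", "hints": ["first hint text", "second hint text", "third"]},
    {"question": "q2", "hints": ["only hint for q2"]},
]

total_hints = sum(len(r["hints"]) for r in records)

# Avg. Hints/Question: total hints divided by number of questions.
avg_hints_per_question = total_hints / len(records)

# Avg. Hint Length: mean number of words per hint.
avg_hint_length = (
    sum(len(hint.split()) for r in records for hint in r["hints"]) / total_hints
)

print(f"Avg. Hints/Question: {avg_hints_per_question:.2f}")
print(f"Avg. Hint Length:    {avg_hint_length:.2f}")
```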

Metrics Used

In this section, we discuss the metrics used to evaluate the methods in our research. We used the scikit-learn library to compute these metrics.

Accuracy (ACC)

This metric reports the proportion of questions for which the model's answer is correct.

Exact Match (EM)

This measures whether the model's answer matches the correct answer exactly as it appears.

Precision (PR)

This is the fraction of words in the passage that also appear in the correct answer.

Recall (RC)

This is the fraction of words from the correct answer that appear in the retrieved passage.

F1-measure (F1)

This is the harmonic mean of precision and recall, balancing the two.

Contains (CON)

This metric checks if the retrieved passage has the entire correct answer.

BERTScore (BERT)

This metric measures how similar the retrieved passage is to the answer, using contextual word embeddings from BERT.
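
As a rough guide, here is a minimal sketch of how the string-matching metrics above could be computed at the token level between a predicted answer (or passage) and the gold answer. It follows common definitions rather than the authors' exact evaluation code (which they report building on scikit-learn); BERTScore would additionally require a dedicated package such as bert-score.

```python
# Minimal sketch of the lexical metrics described above; illustrative only,
# not the paper's evaluation script.
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split into tokens (a simplified normalization)."""
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def contains(passage: str, gold: str) -> float:
    """CON: 1.0 if the gold answer appears verbatim in the passage."""
    return float(" ".join(normalize(gold)) in " ".join(normalize(passage)))


def precision_recall_f1(prediction: str, gold: str) -> tuple[float, float, float]:
    """Token-level PR, RC, and F1 (harmonic mean of PR and RC)."""
    pred_tokens = Counter(normalize(prediction))
    gold_tokens = Counter(normalize(gold))
    overlap = sum((pred_tokens & gold_tokens).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    pr = overlap / sum(pred_tokens.values())
    rc = overlap / sum(gold_tokens.values())
    f1 = 2 * pr * rc / (pr + rc)
    return pr, rc, f1


print(exact_match("top cat", "Top Cat"))                    # 1.0
print(contains("the gang was led by top cat", "top cat"))   # 1.0
print(precision_recall_f1("daniel day", "henry fonda"))     # (0.0, 0.0, 0.0)
```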

Experimental Results

In this section, we provide results from our experiments in various scenarios. We examine how different conditions and factors affect the results.

In the result tables, the number of hints determines how much context the reader receives, while the ranking column lists the methods used to rerank these hints; a sketch of this step follows.
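
The snippet below truncates a ranked hint list to the top-k entries and joins them into a single context block for the reader. The scores and hint texts are invented for illustration, and the formatting is an assumption rather than the paper's exact setup.

```python
# Illustrative only: scores and hint texts are invented; the real reranking
# methods and context formatting in the paper may differ.
def build_context(scored_hints: list[tuple[str, float]], top_k: int) -> str:
    """Keep the top_k highest-scoring hints and concatenate them into one block."""
    ranked = sorted(scored_hints, key=lambda pair: pair[1], reverse=True)
    selected = [hint for hint, _ in ranked[:top_k]]
    return "\n".join(f"Hint {i + 1}: {hint}" for i, hint in enumerate(selected))


scored_hints = [
    ("The character starred in a 1960s animated TV series.", 0.82),
    ("The gang lived in a Manhattan alley.", 0.64),
    ("Benny the Ball was his closest sidekick.", 0.91),
]
print(build_context(scored_hints, top_k=2))
```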

Results for TriviaQA Dataset

In our experiments, we analyzed the results of using T5-3b as the reader. We applied both zero-shot and few-shot learning strategies on the TriviaQA dataset.
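
As a minimal sketch of this reading step (not the paper's actual code), the snippet below queries a T5 checkpoint from Hugging Face with a question plus hints as context. The prompt format and hint text are assumptions, and a small checkpoint is used so the example stays runnable on modest hardware.

```python
# Sketch of the reading step with a T5 model and hints as context.
# Assumptions: checkpoint name, prompt format, and hint text are illustrative;
# swap in "t5-3b" (with enough GPU memory) to mirror the reader used here.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

question = (
    "Who was the leader of the gang whose members included "
    "Benny the Ball, Brain, and Choo Choo?"
)
context = (
    "Hint 1: The character starred in a 1960s animated TV series.\n"
    "Hint 2: Benny the Ball was his closest sidekick."
)

prompt = f"question: {question} context: {context}"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```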

Here are the results based on different ranking methods and hint amounts.

[Table: EM, F1, PR, RC, CON, and BERTScore on TriviaQA for each number of hints and ranking method, with separate zero-shot and few-shot panels; numeric values are not reproduced in this summary.]

NQ Dataset Results

Similar to the previous dataset, we provide results for the NQ dataset using T5-3b under zero-shot and few-shot conditions.

[Table: EM, F1, PR, RC, CON, and BERTScore on NQ for each number of hints and ranking method, with separate zero-shot and few-shot panels; numeric values are not reproduced in this summary.]

WebQ Dataset Results

Lastly, we present results for the WebQ dataset under the same conditions.

[Table: EM, F1, PR, RC, CON, and BERTScore on WebQ for each number of hints and ranking method, with separate zero-shot and few-shot panels; numeric values are not reproduced in this summary.]

Performance of T5-3b and LLaMA-7b

We also examined the performance of T5-3b and LLaMA-7b in various scenarios using different hint methods.

Case Studies

In this section, we present several case studies that illustrate the prompts we chose, along with examples from our experiments.

Sample Questions and Answers

Here are some sample questions from the datasets, along with how the models responded:

| Question | Retriever | LLaMA-70b | True Answer |
| --- | --- | --- | --- |
| How many dot positions are usually used in each letter of the Braille system? | 6 | six | 6, six |
| Who was the leader of the gang whose members included Benny the Ball, Brain, and Choo Choo? | the bowery boys | top cat | top cat |
| Which Glasgow group signed to Creation Records and recorded their debut single "All Fall Down", in 1985? | primal scream | the pastels | the jesus and mary chain |
| Who is the only man to win a best actor Oscar playing brothers? | jack nicholson | daniel day | henry fonda |

Hints Generated

Our case studies illustrate how hints were generated for various questions. Each hint provided context to help the models find the correct answers.
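
To illustrate how such hints could be produced, here is a minimal sketch of prompting an instruction-tuned LLM; the checkpoint, prompt wording, and number of hints requested are assumptions for illustration and are not the paper's exact HINTQA prompts.

```python
# Illustrative hint-generation prompt. The checkpoint and prompt wording are
# assumptions; the LLaMA checkpoint below is gated on the Hugging Face Hub,
# and any instruction-tuned causal LM can be substituted.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

question = (
    "Which Glasgow group signed to Creation Records and recorded their "
    'debut single "All Fall Down", in 1985?'
)
prompt = (
    "Give five short hints that point toward the answer to the question below, "
    "without stating the answer directly.\n"
    f"Question: {question}\nHints:"
)

result = generator(prompt, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"])
```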

Conclusion

In this article, we explored the datasets used in our research, the metrics used for evaluation, our experimental results, and case studies that illustrate the practical application of our methods. The goal is to contribute to the understanding of how different models perform in answering questions with the aid of contextual hints.

Original Source

Title: Exploring Hint Generation Approaches in Open-Domain Question Answering

Abstract: Automatic Question Answering (QA) systems rely on contextual information to provide accurate answers. Commonly, contexts are prepared through either retrieval-based or generation-based methods. The former involves retrieving relevant documents from a corpus like Wikipedia, whereas the latter uses generative models such as Large Language Models (LLMs) to generate the context. In this paper, we introduce a novel context preparation approach called HINTQA, which employs Automatic Hint Generation (HG) techniques. Unlike traditional methods, HINTQA prompts LLMs to produce hints about potential answers for the question rather than generating relevant context. We evaluate our approach across three QA datasets including TriviaQA, NaturalQuestions, and Web Questions, examining how the number and order of hints impact performance. Our findings show that the HINTQA surpasses both retrieval-based and generation-based approaches. We demonstrate that hints enhance the accuracy of answers more than retrieved and generated contexts.

Authors: Jamshid Mozafari, Abdelrahman Abdallah, Bhawna Piryani, Adam Jatowt

Last Update: Sep 24, 2024

Language: English

Source URL: https://arxiv.org/abs/2409.16096

Source PDF: https://arxiv.org/pdf/2409.16096

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
