Hybrid-SQuAD: The Future of Scholarly Q&A
A dataset combining text and structured data for better scholarly question answering.
Tilahun Abedissa Taffa, Debayan Banerjee, Yaregal Assabie, Ricardo Usbeck
In the world of research, finding accurate answers to questions can be tricky. Most systems that try to answer such questions focus on a single type of data, either text or Knowledge Graphs (KGs). However, scholarly information often spans a mix of different sources. To address this, a new dataset called Hybrid-SQuAD has been created. It helps systems answer questions by pulling information from both text and structured data.
What is Hybrid-SQuAD?
Hybrid-SQuAD stands for Hybrid Scholarly Question Answering Dataset. It is a large collection of questions and answers designed to improve how scholarly questions are answered. The dataset contains roughly 10,500 question-answer pairs generated by a large language model. The questions draw on the scholarly KGs DBLP and SemOpenAlex together with corresponding text from Wikipedia, so that answers must be found by consulting multiple sources rather than just one.
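To make the dataset's shape concrete, here is a minimal sketch of what a single Hybrid-SQuAD record could look like. The field names are illustrative assumptions, not the published schema; the actual format is available in the project's GitHub repository listed below.

```python
# Hypothetical Hybrid-SQuAD record; the field names are illustrative
# assumptions, not the official schema.
sample_record = {
    "question": ("What is the main research interest of the author "
                 "of 'An Example Paper Title'?"),
    "answer": "question answering over knowledge graphs",
    "sources": {
        "kg": ["DBLP", "SemOpenAlex"],  # structured facts (triples)
        "text": "Wikipedia",            # unstructured context
    },
    "question_type": "bridge",  # bridge / comparison / text-based / complex
}
```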
The Need for Hybrid Approaches
Scholarly questions often require information that is spread across different locations. For example, someone might need to look up a paper's authors in a KG of publications and then check Wikipedia for personal details about those authors. A typical question could be, "What is the main research interest of the author of a specific paper?" This cannot be answered from a single source; both KG facts and text are needed. That is where Hybrid-SQuAD comes in, making it easier to pull together all the data needed for an answer. The sketch below illustrates such a two-hop lookup.
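The two-hop lookup can be sketched in a few lines of Python. This is not the paper's pipeline, just an illustration: the SPARQL endpoint is the DBLP snapshot listed in the paper's resources, the predicates are taken from the public DBLP RDF schema, and the paper title is a placeholder.

```python
import requests
from SPARQLWrapper import SPARQLWrapper, JSON

# Step 1 (KG): find an author of a paper via the DBLP SPARQL endpoint.
sparql = SPARQLWrapper("https://dblp-april24.skynet.coypu.org/sparql")
sparql.setQuery("""
    PREFIX dblp: <https://dblp.org/rdf/schema#>
    SELECT ?name WHERE {
      ?paper dblp:title "An Example Paper Title" ;
             dblp:authoredBy ?author .
      ?author dblp:primaryCreatorName ?name .
    } LIMIT 1
""")
sparql.setReturnFormat(JSON)
bindings = sparql.query().convert()["results"]["bindings"]
author = bindings[0]["name"]["value"] if bindings else "Unknown Author"

# Step 2 (text): fetch the author's Wikipedia introduction; a reader
# model would then extract the research interest from this passage.
summary = requests.get(
    "https://en.wikipedia.org/api/rest_v1/page/summary/"
    + author.replace(" ", "_"),
    timeout=10,
).json()
print(summary.get("extract", "no article found"))
```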
Dataset Construction
Creating this dataset involved a thorough process:
- Data Collection: The team gathered structured data from DBLP, a KG of computer science publications, and SemOpenAlex, which contains scholarly metadata. They also collected related texts from Wikipedia.
- Generating Questions: Using a large language model, they created question-answer pairs based on the gathered information, reflecting the complexity of real scholarly inquiries (a minimal sketch of this step follows the list).
- Quality Check: The researchers reviewed the generated questions to ensure they were clear and made sense. Any questions with incomplete answers were revised to improve quality.
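As a rough illustration of the generation step, the sketch below prompts an instruction-tuned LLM (the paper's resource list mentions meta-llama/Meta-Llama-3.1-8B-Instruct) with a KG fact and a text passage. The prompt wording, triple, and passage are our own assumptions, not the paper's actual prompt or data.

```python
from transformers import pipeline

# Hypothetical question-generation step; the model is the one named in
# the paper's resource list, but the prompt is an illustration only.
generator = pipeline("text-generation",
                     model="meta-llama/Meta-Llama-3.1-8B-Instruct")

kg_fact = "(Jane Doe, worksAt, Example University)"  # made-up triple
passage = "Jane Doe is a computer scientist known for her work on QA."

prompt = (
    "Write one question whose answer requires BOTH the knowledge-graph "
    "fact and the text passage below, followed by its answer.\n"
    f"Fact: {kg_fact}\nPassage: {passage}\nQuestion:"
)
result = generator(prompt, max_new_tokens=80, do_sample=False)
print(result[0]["generated_text"])
```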
Types of Questions in Hybrid-SQuAD
The questions in this dataset cover various types:
- Bridge Questions: These require linking data from different sources to find answers, for example finding the citation count of an author involved in a particular work.
- Comparison Questions: These ask for comparisons between entities, such as determining which of two authors has the higher citation count (a sketch of resolving such a question follows this list).
- Text-based Questions: Some questions involve extracting specific information from text, such as the primary research focus of an author.
- Complex Questions: A few questions require information from multiple sources at once, so both textual and KG data are needed to find the answer.
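Resolving a comparison question reduces to two KG lookups plus a comparison. In the sketch below, get_citation_count is a hypothetical helper with stand-in data; in practice the counts would come from a SemOpenAlex query.

```python
# Hypothetical comparison-question resolver with stand-in data.
def get_citation_count(author: str) -> int:
    """Stand-in for a SemOpenAlex lookup of an author's citation count."""
    counts = {"Author A": 1530, "Author B": 872}
    return counts[author]

def answer_comparison(author_a: str, author_b: str) -> str:
    """Return the author with the higher citation count."""
    if get_citation_count(author_a) > get_citation_count(author_b):
        return author_a
    return author_b

print(answer_comparison("Author A", "Author B"))  # -> Author A
```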
Model Performance
To see how well systems can answer these questions, a retrieval-augmented generation (RAG) baseline model was developed. It achieved an exact match score of 69.65 on the Hybrid-SQuAD test set. In contrast, ChatGPT struggled, reaching only about 3% accuracy when tested without any supporting context.
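Exact match is the standard SQuAD-style metric: a prediction counts as correct only if it equals the gold answer after light normalization. A minimal sketch of the metric (the paper's exact normalization rules are not reproduced here):

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and
    articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

print(exact_match("The Knowledge Graphs", "knowledge graphs"))  # True
```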
Importance of Hybrid-SQuAD
Hybrid-SQuAD is essential because it encourages further progress in how we answer complex scholarly questions. By pushing the boundaries of existing systems and methodologies, it can help establish new standards in academic research and data integration.
Conclusion
Hybrid-SQuAD is a significant step towards improving how we address scholarly questions. By combining different types of data into a rich resource for building better question-answering systems, it aims to make scholarly research more accurate and efficient. Researchers now have one more tool in their toolbox, making the quest for knowledge a bit easier.
Title: Hybrid-SQuAD: Hybrid Scholarly Question Answering Dataset
Abstract: Existing Scholarly Question Answering (QA) methods typically target homogeneous data sources, relying solely on either text or Knowledge Graphs (KGs). However, scholarly information often spans heterogeneous sources, necessitating the development of QA systems that integrate information from multiple heterogeneous data sources. To address this challenge, we introduce Hybrid-SQuAD (Hybrid Scholarly Question Answering Dataset), a novel large-scale QA dataset designed to facilitate answering questions incorporating both text and KG facts. The dataset consists of 10.5K question-answer pairs generated by a large language model, leveraging the KGs DBLP and SemOpenAlex alongside corresponding text from Wikipedia. In addition, we propose a RAG-based baseline hybrid QA model, achieving an exact match score of 69.65 on the Hybrid-SQuAD test set.
Authors: Tilahun Abedissa Taffa, Debayan Banerjee, Yaregal Assabie, Ricardo Usbeck
Last Update: Dec 5, 2024
Language: English
Source URL: https://arxiv.org/abs/2412.02788
Source PDF: https://arxiv.org/pdf/2412.02788
Licence: https://creativecommons.org/licenses/by-sa/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arXiv for use of its open access interoperability.
Reference Links
- https://www.latex-project.org/help/documentation/encguide.pdf
- https://www.w3.org/TR/rdf-sparql-query/
- https://dblp.org
- https://semopenalex.org/resource/semopenalex:UniversalSearch
- https://orkg.org
- https://openai.com/blog/chatgpt
- https://github.com/semantic-systems/hybrid-squad
- https://www.quora.com/
- https://stackexchange.com/
- https://www.mturk.com/
- https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
- https://sbert.net
- https://huggingface.co/google/flan-t5-small
- https://huggingface.co/deepset/bert-base-cased-squad2
- https://blog.dblp.org/2022/03/02/dblp-in-rdf/
- https://semopenalex.org/authors/context
- https://semopenalex.org/institutions/context
- https://dblp-april24.skynet.coypu.org/sparql
- https://semoa.skynet.coypu.org/sparql
- https://drive.google.com/file/d/1ISxvb4q1TxcYRDWlyG-KalInSOeZqpyI/view?usp=drive_link
- https://orcid.org
- https://pypi.org/project/beautifulsoup4/
- https://huggingface.co/BAAI/bge-small-en-v1.5
- https://huggingface.co/meta-llama/Meta-Llama-3-8B
- https://www.w3.org/1999/02/
- https://dblp.org/rdf/schema#
- https://semopenalex.org/ontology/
- https://purl.org/spar/bido/
- https://dbpedia.org/ontology/
- https://dbpedia.org/property/
- https://xmlns.com/foaf/0.1/
- https://www.w3.org/ns/org#
- https://www.w3.org/
- https://www.w3.org/2002/07/owl#