Evaluating Text-to-SQL Systems with Real Data
Analyzing Text-to-SQL systems using user-generated questions from the FootballDB project.
― 7 min read
Table of Contents
Text-to-SQL systems let users ask questions in plain language and translate those questions into SQL commands that can retrieve data from a database. These systems have become more popular as people want easier ways to access data without needing to know how to write SQL themselves.
Over time, new advanced language models have helped improve how these systems work. However, while many of these systems show high scores on controlled tests, there has not been enough focus on how well they perform in real-life situations with actual user questions.
This paper investigates how robust Text-to-SQL systems are when faced with different database structures by evaluating them through a project focused on real-world applications. We look at FootballDB, a setup that was used during the FIFA World Cup 2022, where users submitted thousands of questions in natural language. Our analysis includes real-time user questions, showing how the system handles different types of database designs.
The Rise of Text-to-SQL Systems
Relational databases have long relied on SQL as the standard way to query structured data. In recent years, more people, including those without programming backgrounds, want to utilize these databases for decision-making. This has led to the growth of Text-to-SQL systems, which allow users to ask questions and receive answers without needing to learn SQL.
Over the years, methods for converting plain language to SQL have advanced. From simple keyword-based systems to complex deep learning models, these systems have improved their accuracy significantly. However, the current tools are not fully ready for everyday use in enterprise situations.
Despite various performance benchmarks created from synthetic data, there is still no systematic investigation of how these systems perform with real User Queries. Most previous surveys are more about design principles rather than practical evaluations. This paper fills that gap by providing an extensive analysis based on real-world applications.
The FootballDB Project
The FootballDB dataset captures nearly a century of football World Cup data. It includes information about games, players, teams, and clubs, dating back to the inaugural World Cup in 1930. Our team updated the dataset continuously as new matches occurred during the World Cup.
A total of 6,000 natural language questions were logged from users during this project. The key features of this dataset include:
- Real User Interaction: All questions came from actual users interacting with the system.
- Comprehensive Information: The dataset covers information about 22 World Cups, 86 national teams, and thousands of players and clubs.
- Continuous Updates: New data points were added after each match, ensuring the dataset remained current.
System Deployment
We designed the FootballDB interface to allow users to easily ask questions about World Cup data. The backend system used PostgreSQL to manage the database.
During the live operation, we used a model called ValueNet to translate user queries into SQL commands. ValueNet was chosen because its source code is open and it has been successfully used in other projects.
User Engagement
To attract users to our system, we targeted the FIFA World Cup. Football is a universally recognized event, making it easier to gain interest from a diverse audience. The deployment started in October 2022 and lasted through the World Cup. Feedback from users helped us to refine the system over time.
Iteration of Improvements
Initial User Feedback: The first interactions revealed that users were often more interested in queries about player clubs and coaching histories. We expanded the database to include this information.
Expert Interface: We created a specialized interface for users with some knowledge of SQL. They could provide feedback on the SQL queries generated by the system, helping us improve accuracy.
Data Validation: We implemented a manual validation process to ensure the SQL queries generated were correct. This took a considerable amount of effort but was essential for refining our dataset.
Labeling Automation: We began developing techniques to automate parts of the labeling process, which was previously done manually. This helped speed up the time needed to prepare data for training the model.
Final Model Update: After cleaning the dataset, we retrained the system to better handle user queries. This process improved the model’s accuracy significantly.
Data Model Design
The design of the database structure is critical for how well Text-to-SQL systems perform. We focused on two main goals:
Simplifying Queries: We aimed to create a structure that would make it easier for users to write questions.
Reducing Errors: We wanted to minimize the chances of the system generating incorrect SQL queries.
Initial Data Model
The starting version of our data model had 13 tables containing various details about the tournament. During the validation process, we found some weaknesses in the structure, particularly with how data was connected.
Changes Through Iteration
Version 2: We modified the data model to ease the query-writing process. This version aimed to reduce the number of complex joins between tables.
Version 3: Further refinements were made to boost user experience. Unnecessary tables were removed, and more intuitive relationships were created, allowing users to ask simpler questions.
Experimental Evaluation
We conducted a detailed analysis to evaluate how well different Text-to-SQL systems perform with the FootballDB dataset. We focused on several key research questions:
Impact of Data Models: How do different database designs affect the accuracy of SQL translations?
Language Model Performance: How do small, medium, and large language models compare in executing these tasks?
Effect of Training Data Size: How does the volume of training data influence system performance?
Query Complexity: What role does the complexity of the user queries play?
Inference Time: How quickly can these systems process queries in real-time scenarios?
Experimental Setup
From the 6,000 user interactions logged, we filtered the data to create a training set of 1,000 questions. We then evaluated the systems based on various sizes of training data to see how it affected their accuracy.
Performance Metrics
To evaluate system efficiency, we used a straightforward measure known as exact execution matching. This method looks at whether the results from the system's SQL queries match the expected outcomes.
System Performance Findings
We tested multiple Text-to-SQL systems, including small, medium, and large language models. Each system's performance was affected by several factors:
Data Model: Some systems performed better with simplified data models, while others showed little difference in performance across different models.
Language Model Size: Our findings indicated that larger models generally provide better accuracy. However, this comes at the cost of increased computational resources.
Training Data Size: More training data tended to improve the overall performance, but the gain was marginal after a certain point.
Complexity of Queries: More complex queries resulted in lower accuracy across the systems.
Inference Time Analysis
We measured how long it took each system to generate a response after receiving a user query. The results varied widely, with some models taking over five minutes to respond to a question.
Lessons Learned
Through the deployment and evaluation phases, we discovered key insights that can guide future Text-to-SQL system designs:
User Training: Educating users about the database content can help reduce out-of-scope questions.
Expert Features: Providing advanced features for knowledgeable users can improve system feedback and help refine performance.
Automation in Data Processing: Finding ways to automate labor-intensive tasks such as query labeling can save time and resources.
Importance of Data Models: The choice of data model significantly affects the overall performance of Text-to-SQL systems, with simpler designs generally resulting in better user query handling.
Filtering Queries: Reducing the complexity of generated SQL queries can enhance performance without requiring additional training data.
Conclusion
This study emphasizes the importance of real-world evaluations for Text-to-SQL systems and how they can be improved by considering various design choices. By releasing the FootballDB dataset, we hope to help other researchers develop and test their systems effectively. While the challenge of translating natural language queries to SQL is not yet fully solved, our findings highlight the potential for growth and improvement in this area.
The work also brings attention to the practical limitations of currently available systems, particularly regarding inference time, which needs addressing for more widespread adoption in real-world applications. Going forward, the challenge remains to optimize these systems for better performance while making them accessible for everyday users.
Title: Evaluating the Data Model Robustness of Text-to-SQL Systems Based on Real User Queries
Abstract: Text-to-SQL systems (also known as NL-to-SQL systems) have become an increasingly popular solution for bridging the gap between user capabilities and SQL-based data access. These systems translate user requests in natural language to valid SQL statements for a specific database. Recent Text-to-SQL systems have benefited from the rapid improvement of transformer-based language models. However, while Text-to-SQL systems that incorporate such models continuously reach new high scores on -- often synthetic -- benchmark datasets, a systematic exploration of their robustness towards different data models in a real-world, realistic scenario is notably missing. This paper provides the first in-depth evaluation of the data model robustness of Text-to-SQL systems in practice based on a multi-year international project focused on Text-to-SQL interfaces. Our evaluation is based on a real-world deployment of FootballDB, a system that was deployed over a 9 month period in the context of the FIFA World Cup 2022, during which about 6K natural language questions were asked and executed. All of our data is based on real user questions that were asked live to the system. We manually labeled and translated a subset of these questions for three different data models. For each data model, we explore the performance of representative Text-to-SQL systems and language models. We further quantify the impact of training data size, pre-, and post-processing steps as well as language model inference time. Our comprehensive evaluation sheds light on the design choices of real-world Text-to-SQL systems and their impact on moving from research prototypes to real deployments. Last, we provide a new benchmark dataset to the community, which is the first to enable the evaluation of different data models for the same dataset and is substantially more challenging than most previous datasets in terms of query complexity.
Authors: Jonathan Fürst, Catherine Kosten, Farhad Nooralahzadeh, Yi Zhang, Kurt Stockinger
Last Update: 2024-11-29 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2402.08349
Source PDF: https://arxiv.org/pdf/2402.08349
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.