Leveraging LLMs for Efficient Tabular Data Annotation
This paper showcases a method for using LLMs to annotate tabular data with minimal human effort.
― 14 min read
Table of Contents
Tabular data is everywhere in our daily lives and online, but labeling this data has often required a lot of human effort, making it hard to scale for machine learning tasks. This paper presents a new way to use Large Language Models (LLMs) to help understand and label tabular data.
With our method, we can quickly and easily label a vast amount of tabular data. We can customize how the labeling is done based on specific needs, demonstrating this with examples such as SQL query creation and identifying input-target columns in tables. We are sharing a dataset called AnnotatedTables, which contains 32,119 databases with annotations made by LLMs. This includes 405,616 valid SQL programs, making it the biggest SQL dataset linked to tabular data that allows for query testing.
To show how useful our methods and dataset are, we conducted two follow-up studies.
- We looked at whether LLMs could change SQL programs into a different programming language called Rel, which LLMs had not learned before, while still getting the same results. By adjusting our prompts based on execution feedback, we found that LLMs could make decent translations with just a few examples. 
- We assessed how well a neural network model, TabPFN, performed on 2,720 tables with input-target columns that had been identified and labeled by LLMs. Overall, TabPFN performed similarly to a baseline method, but results could differ greatly from one data table to another, so both models can be useful depending on the situation. 
Our work highlights how effective LLMs can be in streamlining the process of labeling large amounts of diverse tabular data.
Tabular data is one of the most common forms of data, used in areas like finance, healthcare, and marketing. The tabular format allows us to organize structured data into relational databases. Machine learning models can then understand and create SQL queries for these databases, and LLMs have become a key method in this space. Recent studies show that LLMs can excel in generating SQL with well-crafted prompts due to the vast amount of open-source SQL code they have been trained on.
However, having SQL code alone is not sufficient for all tabular machine learning challenges. For instance, when testing how well a model can translate text to SQL, both the tabular data and SQL code are needed to measure the success rate. While we may have plenty of either SQL code or tabular data, gathering both simultaneously can be quite challenging.
This issue leads to what we term the tabular dataset annotation problem. Previous popular datasets like WikiSQL and Spider have limited sizes because the ways they were labeled do not scale to exploit the vast amounts of tabular data or SQL code available online. WikiSQL collects its tabular data from Wikipedia, a constrained source that doesn't capture the complexities of real-world tabular data. Its SQL programs are manually annotated through a service called Amazon Mechanical Turk, and in WikiSQL, each SQL program only queries one table at a time without joining tables.
In contrast, Spider includes more complex queries across multiple tables, but it too relies on a small group of college students for annotations. Human annotation is expensive and time-consuming, making it difficult to expand these datasets.
Recent work has shown that LLMs can successfully generate SQL code from instructions, prompting us to consider whether LLMs can also label other types of data and replace costly human labelling for tabular datasets. Inspired by LLMs' ability to perform multiple tasks, we see promise in addressing the tabular dataset annotation problem.
This paper introduces AnnotatedTables, a large-scale tabular dataset filled with LLM annotations. To avoid the bottleneck of time-consuming human annotations, we use LLMs to interpret tabular data and produce synthetic labels. Our dataset includes varied tabular data seen in real-world applications while also being general enough for other data sources.
Through careful design of prompts, we guide LLMs to create SQL code with no prior examples, and this approach has proven to be a quick and cost-effective way to build a comprehensive dataset without relying on human effort. To ensure the quality of the SQL code generated by the LLM, we validate it by executing the queries on the tabular data, confirming the results of these generated SQL queries.
AnnotatedTables is now the largest collection of executable SQL code related to the corresponding tabular data. SQL programs are not the only annotations that LLMs can provide. Some table columns may serve as input for predictions, and LLMs can be used to identify these input-target relationships, making them usable for tabular classification.
Our work demonstrates how LLMs can simplify and automate the process of creating tabular datasets. To validate our dataset's usefulness and quality, we carried out two additional studies. First, we examined the ability of LLMs to learn a new programming language through examples. Second, we scaled up an evaluation of a recent classification model, TabPFN, using diverse real-world data. Both studies highlight how LLM annotation can adapt our dataset to meet specific research goals.
In summary, the main contributions of our work are as follows:
- We present methods using in-context learning with LLMs to understand and annotate a large volume of real-world tabular data, effectively tackling the challenges of traditional human annotations.
- We introduce a cross-domain dataset, AnnotatedTables, which is currently the largest SQL dataset with execution-capable tabular data.
- We assess the quality of the annotations generated by LLMs comprehensively throughout our research and demonstrate the research potential of AnnotatedTables through two follow-up studies.
- We introduce a technique called Incremental Prompt Engineering to teach LLMs how to translate SQL to Rel programs with adequate accuracy, even for a language that the LLMs had not been previously trained on.
- We utilize the identified input-target columns annotated by LLMs to evaluate TabPFN's performance on various tabular classification tasks.
In the related work section, we discuss the intersection of database systems and deep learning. Language models can be trained to generate SQL queries from natural language descriptions. Additionally, neural networks can be used to estimate the number of rows returned by queries, implementing optimizations. Other database-related deep learning applications include query answering and anomaly detection.
All these areas could benefit from a large multi-domain dataset like AnnotatedTables. The LLM annotations can be used for acquiring large training datasets in tabular format, vital for tasks like text-to-SQL generation and cardinality estimation. Our dataset's support for executing queries is necessary for these tasks to validate their accuracy.
Currently, most common SQL datasets rely on human annotations, whereas our approach is the first to create a tabular dataset annotated by LLMs. This innovation enables us to generate a significant volume of annotations that match or even exceed what is available on the web. While existing SQL datasets often contain English descriptions, we do not evaluate or ensure their quality, as SQL programs can be validated by execution, while English descriptions cannot be.
LLMs have also been previously applied to data annotation in text fields, often performing better than human annotators. They can detect harmful speech and provide natural language explanations as necessary annotations. Moreover, LLMs are used for evaluating natural language generation, aligning closely with human assessments.
In constructing our dataset, we begin by collecting public tabular data. Our aim is to tap into the wealth of unannotated data available online, focusing on curated tables from practical data science applications. We found that Kaggle, a data science platform, meets our criteria well. Kaggle hosts tables that are crowd-sourced by a community of data enthusiasts and professionals. Users can share and analyze datasets on a wide range of real-world topics.
Ratings and insights from users contribute to the curation of usable and dependable data. Although individual Kaggle datasets have been studied in prior research, our approach aggregates and examines them collectively, revealing a substantial and diverse collection of quality data. Using Kaggle's public API, we searched for datasets between 10 megabytes and 1 gigabyte to ensure faster data collection and processing. We also ensured that each dataset met a minimum usability rating to filter low-quality datasets. In total, we collected 70,000 Kaggle datasets for further processing.
The method we used for annotating data with LLMs is both flexible and automated, and can potentially be applied to other sources beyond Kaggle datasets.
To assist LLMs in interpreting tabular data, we developed a textual description for each database as part of the prompt. Since using entire tables in a prompt may exceed LLM input limits, we opted for shorter descriptions that include basic schema and example rows. The schema describes the table's structure, listing the table name, column names, and the data type of each column.
To extract the schema, we discovered that tabular data were stored in CSV files within the Kaggle datasets. Since CSV files do not contain table names, we relied on the filenames for that purpose. The header row of the CSV files provided the column names. Our initial inspection revealed that the column names and filenames are usually clear and descriptive, allowing the LLM to easily interpret the table's information.
An existing data analysis software called pandas was employed to load the tables and derive the column data types. Our schema extraction method proved effective and practical. Some Kaggle datasets might include multiple tables, necessitating the creation of multiple schemas. Together, these schemas outline the database's structure, indicating how many tables exist, the columns and types within those tables, and if any column might serve as a JOIN key.
Example rows from the table are also crucial, allowing LLMs to construct SQL queries effectively, such as WHERE statements to filter data. For example, in a query like SELECT * FROM CUSTOMERS WHERE COUNTRY='MEXICO', the table name and column name can be derived from the schema, while the value 'MEXICO' would be found in a table row. Thus, we provide example rows alongside the schema so that LLMs can create WHERE statements in SQL accurately.
For the example row, we selected the first row of each table and converted it into a textual format for the LLM. Our schema and example row descriptions are broad enough that they can be used for nearly all forms of tabular data.
Through zero-shot learning, we directed LLMs to generate SQL code that queries our collected tables, which is one of the annotation types. Currently, OpenAI's ChatGPT is among the most efficient LLMs for various tasks, and we chose to use it as our annotation LLM.
When generating SQL for each database, we create a prompt that includes instructions, the database schema, and the example row from each table. The instruction asks the LLM to produce SQL queries "that typical users might write to access this database." It also requests more complex queries to improve dataset value, especially encouraging more JOINs when multiple tables are available.
For every Kaggle dataset with tabular data, we prompted the LLM to generate fifteen SQL program annotations along with natural language explanations.
Some SQL programs generated may be incorrect, so we put a system in place to check the quality of these annotations. Since AnnotatedTables has access to the tabular data, we validated the SQL programs by executing them. If a SQL program runs and returns non-empty results, we consider it valid. Programs that return empty results are marked as such. To ensure accurate comparisons during our follow-up study on SQL-to-Rel translation, we do not use SQL queries that return empty results.
We acknowledged in our research that we do not validate the English descriptions generated by the LLM. All intermediate data artifacts are made public as part of AnnotatedTables.
In examining the quality of annotations, we found that LLMs could create SQL programs reflecting human intent effectively. We found a high percentage (82.25%) of valid SQL programs, indicating good quality overall. The LLM could create complex queries featuring a range of SQL components.
In nearly all categories of SQL components except JOINs, the percentages of valid queries exceeded 50%. However, JOIN queries only achieved a 49.47% validity rate, potentially pointing towards a need for deeper understanding or programming skill when joining multiple tables.
We noted that certain simple SQL statements appeared frequently, like SELECT * FROM table, which accounted for a large chunk (48.4%) of valid SQL programs. While these statements are valid, they are less informative than more complicated queries for further learning.
In our prompts, we directed the LLM to create SQL programs "with relatively high complexity," achieving some success. However, the conflicting goal of creating complex and correct queries may lead LLMs to produce simpler queries to maintain accuracy.
As shown, invalid SQL annotations can arise for various reasons. Sometimes, the LLM creates non-existent table names or leaves placeholder text, or in some CSV files, column names may be numerical, which are invalid in SQL queries. Additionally, while some correct SQL programs may return empty results, they are labeled invalid for the purpose of measuring execution accuracy.
We utilized AnnotatedTables to study LLMs' ability to learn a new programming language, Rel.
Since there are limited examples of open-source Rel code available, we turned to few-shot in-context learning as a natural solution. Although ICL has been studied, translating between two distinct programming languages is a unique research challenge.
All SQL programs in AnnotatedTables can be executed, making it possible to evaluate translation accuracy effectively. The SQL programs mimic typical database usage, and translating these queries enables us to create a sizable dataset of Rel programs valuable for subsequent research.
In learning a new programming language, getting enough training data can be challenging. Our results indicate that translation accuracy improves steadily with more in-context examples, achieving around 40% accuracy with 45 translation examples.
To formulate the task of SQL-to-Rel translation, we start with an executable SQL program that operates on a specific table. Our goal is to develop a model capable of translating the SQL program into a Rel program, ensuring both yield the same query results when executed on the same table.
Execution accuracy measures how often translated SQL and Rel programs produce the same non-empty results. However, comparing SQL and Rel programs directly can be difficult due to inherent language differences.
To enhance the translation capabilities, we introduced a technique called Incremental Prompt Engineering (IPE), which employs execution feedback to refine the learned translation examples. We began with a few initial SQL-to-Rel translation examples, instructing the language model to translate unseen SQL programs, and calculated execution accuracy for each step.
The feedback allows us to identify where the LLM fails to translate and to add those examples to the next round of training, increasing the number of examples steadily until we reach satisfactory translation accuracy.
Through the IPE process, we observed improved translation accuracy as we added more examples. Once we surpassed 30 examples, further examples offered diminishing returns, converging around the 40% accuracy mark.
For our evaluation of TabPFN, we used AnnotatedTables to test its classification performance on many diverse datasets. TabPFN is a new type of neural network trained with synthetic data drawn from a Bayesian prior.
Once trained, it processes all data at once, making predictions in a single forward pass. Given its advanced results in smaller benchmarks, we were curious how TabPFN would measure up against the complexities of real-world tabular data.
TabPFN is compared against standard methods and a couple of leading AutoML frameworks, using various datasets to assess its resilience. As we extend our evaluation to 2,720 classification problems, we move far beyond the earlier benchmarks.
In our experimental setup, we maintain the same parameters as the original authors for TabPFN and train AutoGluon using a one or five-minute time budget, splitting the data only once per table to save time.
We rely on the LLM annotations to identify suitable input and target columns within the Kaggle datasets for TabPFN. This process ensures that the input columns are numeric while the target is categorical for effective classification.
In our findings, TabPFN shows solid performance across the 2,720 classification problems tested, with some significant results compared to AutoGluon, demonstrating that LLMs can accurately identify input-target columns and the tabular structure needed for effective classification tasks.
Lastly, using AnnotatedTables highlighted the potential for LLMs to revolutionize how we create machine learning datasets. The ability to use LLMs to automate the annotation process can reduce costs and make it easier for researchers to get the customized data they need for their specific projects, opening the door to new avenues in data science.
Title: AnnotatedTables: A Large Tabular Dataset with Language Model Annotations
Abstract: Tabular data is ubiquitous in real-world applications and abundant on the web, yet its annotation has traditionally required human labor, posing a significant scalability bottleneck for tabular machine learning. Our methodology can successfully annotate a large amount of tabular data and can be flexibly steered to generate various types of annotations based on specific research objectives, as we demonstrate with SQL annotation and input-target column annotation as examples. As a result, we release AnnotatedTables, a collection of 32,119 databases with LLM-generated annotations. The dataset includes 405,616 valid SQL programs, making it the largest SQL dataset with associated tabular data that supports query execution. To further demonstrate the value of our methodology and dataset, we perform two follow-up research studies. 1) We investigate whether LLMs can translate SQL programs to Rel programs, a database language previously unknown to LLMs, while obtaining the same execution results. Using our Incremental Prompt Engineering methods based on execution feedback, we show that LLMs can produce adequate translations with few-shot learning. 2) We evaluate the performance of TabPFN, a recent neural tabular classifier trained on Bayesian priors, on 2,720 tables with input-target columns identified and annotated by LLMs. On average, TabPFN performs on par with the baseline AutoML method, though the relative performance can vary significantly from one data table to another, making both models viable for practical applications depending on the situation. Our findings underscore the potential of LLMs in automating the annotation of large volumes of diverse tabular data.
Authors: Yaojie Hu, Ilias Fountalis, Jin Tian, Nikolaos Vasiloglou
Last Update: 2024-06-24 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2406.16349
Source PDF: https://arxiv.org/pdf/2406.16349
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.