Improving Table Structure Recognition with Aligned Datasets
Aligning datasets enhances model performance in table structure recognition tasks.
― 4 min read
Table structure recognition (TSR) is essential for extracting the data contained in tables across many kinds of documents. For machines to learn this task well, the datasets they learn from must be clear, consistent, and free of mistakes. However, many existing benchmark datasets contain errors and inconsistencies that degrade the performance of the machine learning models trained on them.
In this article, we discuss how aligning benchmark datasets can improve model performance for TSR. We focus on two large training datasets, FinTabNet and PubTables-1M, and on ICDAR-2013, a smaller dataset that is widely used for evaluation.
The Importance of Consistent Datasets
A dataset's annotations need to be consistent both internally and with those of other datasets. Even small annotation errors can distort how models train and how their performance is measured. A benchmark dataset may look fine in isolation, but combining it with others that are not aligned can lead to poor results. This misalignment acts as an additional source of noise for any model that relies on these datasets.
The Effects of Errors and Inconsistencies
Errors range from outright labeling mistakes to subtle inconsistencies across datasets. By "misalignment," we mean that datasets for the same task may be labeled differently, which can confuse models and lead to incorrect predictions. This article explores how correcting these errors yields a significant boost in model performance.
Selected Datasets
For our study, we used FinTabNet and PubTables-1M for training and ICDAR-2013 as the evaluation benchmark. FinTabNet contains around 113,000 tables from financial reports, while PubTables-1M includes nearly one million tables from scientific documents. The ICDAR-2013 dataset holds tables from a variety of documents, manually annotated by experts, which makes it useful for gauging model performance despite its small size.
Data Processing Steps
To align these datasets, we corrected numerous mistakes in the original annotations. Each dataset exhibited its own error types, such as incorrect bounding boxes for table cells or inconsistent labeling. For example, some tables included empty rows that serve no logical purpose and can therefore be treated as errors.
We also added missing labels to improve the quality and usability of the datasets. This involved defining bounding boxes for rows and columns and labeling header cells correctly. Each correction step was carried out carefully to raise the overall quality of the training data.
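To make one such correction concrete, here is a minimal Python sketch of dropping fully empty rows from a table annotation. The `Cell` and `TableAnnotation` structures and their field names are our own hypothetical schema, not the paper's actual annotation format.

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    row: int   # row index within the table
    col: int   # column index within the table
    text: str  # extracted cell text ("" if empty)

@dataclass
class TableAnnotation:
    cells: list[Cell] = field(default_factory=list)

def drop_empty_rows(table: TableAnnotation) -> TableAnnotation:
    """Remove rows in which no cell contains any text."""
    rows = {c.row for c in table.cells}
    # A row is empty if none of its cells contain text.
    empty = {r for r in rows
             if not any(c.text.strip() for c in table.cells if c.row == r)}
    kept = [c for c in table.cells if c.row not in empty]
    # Re-index the remaining rows so indices stay contiguous.
    remap = {r: i for i, r in enumerate(sorted(rows - empty))}
    for c in kept:
        c.row = remap[c.row]
    return TableAnnotation(cells=kept)
```

Analogous passes would tighten cell bounding boxes and add missing row, column, and header labels.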
Training the Model
We used the Table Transformer (TATR) model for our experiments. TATR frames table structure recognition as object detection, using a small set of object classes to identify table components. The architecture was held fixed throughout; only the training data changed.
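As a rough illustration of this detection framing: TATR predicts labeled bounding boxes over the table image. The class names below follow the public microsoft/table-transformer repository; the `Detection` wrapper is our own hypothetical container.

```python
from dataclasses import dataclass

# The six structure classes TATR detects (names as in the
# microsoft/table-transformer repository).
STRUCTURE_CLASSES = [
    "table",
    "table column",
    "table row",
    "table column header",
    "table projected row header",
    "table spanning cell",
]

@dataclass
class Detection:
    label: str                               # one of STRUCTURE_CLASSES
    bbox: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    score: float                             # model confidence

# Under this framing, a table's "structure" is simply the set of detected
# rows, columns, headers, and spanning cells; individual grid cells are
# recovered afterwards by intersecting the row and column boxes.
```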
In our experiments, we trained the model on both the original and the corrected datasets. Each model was evaluated after training, letting us observe how improvements in the datasets directly affected model performance.
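The protocol is essentially a small grid over data variants, not models. A schematic driver might look like the following, where `train_tatr` and `evaluate` are stand-in names of our own, not the paper's actual training code:

```python
# One fixed architecture, several training-set variants,
# one shared evaluation benchmark.
DATASET_VARIANTS = [
    "fintabnet-original", "fintabnet-corrected",
    "pubtables1m-original", "pubtables1m-corrected",
    "combined-corrected",
]

def train_tatr(dataset: str) -> str:
    """Stand-in: train the fixed TATR architecture on one data variant."""
    return f"tatr[{dataset}]"

def evaluate(model: str, benchmark: str) -> dict:
    """Stand-in: score the trained model on the benchmark."""
    return {"exact_match_accuracy": float("nan")}  # filled in by real eval

for variant in DATASET_VARIANTS:
    model = train_tatr(variant)
    metrics = evaluate(model, benchmark="icdar-2013")
    print(variant, metrics["exact_match_accuracy"])
```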
Results of Dataset Corrections
After aligning the datasets and correcting errors, we saw substantial improvements in model performance. For instance, the exact match accuracy of TATR evaluated on ICDAR-2013 rose from 42% to 65% when trained on the corrected FinTabNet, and from 65% to 75% when trained on the corrected PubTables-1M.
We also established new performance benchmarks, achieving a directed adjacency relation (DAR) score of 0.965 and an exact match accuracy of 81% on ICDAR-2013 by combining the two corrected training datasets. This shows that cleaning up data can lead to significantly better outcomes.
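Exact match accuracy is a strict metric: a table counts as correct only if the entire predicted structure matches the ground truth. A minimal sketch of computing it over a test set follows; representing each structure as a frozenset of cell tuples is our assumption, chosen so that equality means the full grid was recovered.

```python
def exact_match_accuracy(predictions, ground_truths) -> float:
    """Fraction of tables whose predicted structure matches exactly.

    Each structure is assumed to be in a canonical representation,
    e.g. a frozenset of (row, col, rowspan, colspan) tuples, so that
    equality means the whole table grid was recovered.
    """
    assert len(predictions) == len(ground_truths)
    hits = sum(p == g for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Toy example: 3 of 4 tables reconstructed perfectly -> 0.75
preds = [frozenset({(0, 0, 1, 1)}), frozenset({(0, 0, 1, 2)}),
         frozenset({(1, 0, 1, 1)}), frozenset({(0, 1, 1, 1)})]
golds = [frozenset({(0, 0, 1, 1)}), frozenset({(0, 0, 1, 2)}),
         frozenset({(1, 0, 1, 1)}), frozenset({(0, 0, 1, 1)})]
print(exact_match_accuracy(preds, golds))  # 0.75
```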
The Role of Canonicalization
A major step in our approach was a technique called canonicalization, which standardizes labels across datasets so that structurally equivalent tables receive identical annotations. Our ablation experiments showed that this step was particularly effective at boosting model performance: by making annotations more consistent, we reduced confusion for the models and improved their overall accuracy.
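As one illustrative canonicalization rule (a simplified example of our own, not the paper's full procedure): a header label followed by blank cells can be folded into a single spanning cell, so a wide header is annotated the same way regardless of how a source dataset split it up.

```python
def canonicalize_header_row(cells: list[str]) -> list[tuple[str, int]]:
    """Merge blanks in a header row into spanning cells.

    Input:  header cell texts, left to right ("" marks a blank cell).
    Output: (text, colspan) pairs, so equivalent headers get identical
    annotations however a source dataset distributed the blanks.
    Simplified illustration only, not the paper's full procedure.
    """
    spans: list[tuple[str, int]] = []
    for text in cells:
        if text == "" and spans:
            # Fold the blank cell into the preceding cell's span.
            prev_text, prev_span = spans[-1]
            spans[-1] = (prev_text, prev_span + 1)
        else:
            spans.append((text, 1))
    return spans

# "Revenue" spanning three columns is annotated identically whether the
# source labels it as one wide cell or as one cell plus two blanks.
print(canonicalize_header_row(["Revenue", "", "", "Costs", ""]))
# -> [('Revenue', 3), ('Costs', 2)]
```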
Conclusion
This work highlights the importance of aligned, corrected datasets for table structure recognition. By aligning benchmark datasets, we showed that model performance can improve significantly: even existing models perform better when trained on cleaner, more consistent data.
In future work, it will be crucial to continue refining these datasets and to explore further improvements to model training. We encourage researchers to weigh the quality of the data they use, since it plays a significant role in the success of their models. Better benchmarks for table structure recognition pave the way for tools that handle tabular data efficiently, benefiting fields from finance to science and beyond.
Title: Aligning benchmark datasets for table structure recognition
Abstract: Benchmark datasets for table structure recognition (TSR) must be carefully processed to ensure they are annotated consistently. However, even if a dataset's annotations are self-consistent, there may be significant inconsistency across datasets, which can harm the performance of models trained and evaluated on them. In this work, we show that aligning these benchmarks—removing both errors and inconsistency between them—improves model performance significantly. We demonstrate this through a data-centric approach where we adopt one model architecture, the Table Transformer (TATR), that we hold fixed throughout. Baseline exact match accuracy for TATR evaluated on the ICDAR-2013 benchmark is 65% when trained on PubTables-1M, 42% when trained on FinTabNet, and 69% combined. After reducing annotation mistakes and inter-dataset inconsistency, performance of TATR evaluated on ICDAR-2013 increases substantially to 75% when trained on PubTables-1M, 65% when trained on FinTabNet, and 81% combined. We show through ablations over the modification steps that canonicalization of the table annotations has a significantly positive effect on performance, while other choices balance necessary trade-offs that arise when deciding a benchmark dataset's final composition. Overall we believe our work has significant implications for benchmark design for TSR and potentially other tasks as well. Dataset processing and training code will be released at https://github.com/microsoft/table-transformer.
Authors: Brandon Smock, Rohith Pesala, Robin Abraham
Last Update: 2023-05-23
Language: English
Source URL: https://arxiv.org/abs/2303.00716
Source PDF: https://arxiv.org/pdf/2303.00716
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.