Improving Table Structure Recognition with Aligned Datasets
Aligning datasets enhances model performance in table structure recognition tasks.
― 4 min read
Table structure recognition (TSR) is essential for extracting the data contained in tables across many kinds of documents. For machines to learn this task well, the datasets they learn from must be clear, consistent, and free of mistakes. However, many existing benchmark datasets contain errors and inconsistencies that degrade the performance of the machine learning models trained on them.
In this article, we discuss how aligning benchmark datasets can improve model performance for TSR. We focus on two large training datasets, FinTabNet and PubTables-1M, and on ICDAR-2013, a smaller dataset that is widely used for evaluation.
The Importance of Consistent Datasets
A dataset's annotations need to be consistent both internally and with those of other datasets. Even small annotation errors can distort how models train and how their performance is measured. A benchmark dataset may look fine in isolation, but combining it with others that are not aligned can lead to poor results. This misalignment acts as an additional source of noise for any model that relies on these datasets.
The Effects of Errors and Inconsistencies
Errors range from outright labeling mistakes to subtle inconsistencies across datasets. By "misalignment," we mean that datasets for the same task may be labeled differently, which can confuse models and lead to incorrect predictions. This article explores how correcting these errors yields a significant boost in model performance.
Selected Datasets
For our study, we used FinTabNet and PubTables-1M for training and ICDAR-2013 as the evaluation benchmark. FinTabNet contains around 113,000 tables from financial reports, while PubTables-1M includes nearly one million tables from scientific documents. The ICDAR-2013 dataset holds tables from a variety of documents, manually annotated by experts, which makes it useful for gauging model performance despite its small size.
Data Processing Steps
To align these datasets, we corrected numerous mistakes in the original annotations. Each dataset exhibited its own error types, such as incorrect bounding boxes for table cells or inconsistent labeling. For example, some tables included empty rows that serve no logical purpose and can therefore be treated as errors.
We also added missing labels to improve the quality and usability of the datasets. This involved defining bounding boxes for rows and columns and labeling header cells correctly. Each correction step was carried out carefully to raise the overall quality of the training data.
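To make one such correction concrete, here is a minimal Python sketch of dropping fully empty rows from a table annotation. The `Cell` and `TableAnnotation` structures and their field names are our own hypothetical schema, not the paper's actual annotation format.

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    row: int   # row index within the table
    col: int   # column index within the table
    text: str  # extracted cell text ("" if empty)

@dataclass
class TableAnnotation:
    cells: list[Cell] = field(default_factory=list)

def drop_empty_rows(table: TableAnnotation) -> TableAnnotation:
    """Remove rows in which no cell contains any text."""
    rows = {c.row for c in table.cells}
    # A row is empty if none of its cells contain text.
    empty = {r for r in rows
             if not any(c.text.strip() for c in table.cells if c.row == r)}
    kept = [c for c in table.cells if c.row not in empty]
    # Re-index the remaining rows so indices stay contiguous.
    remap = {r: i for i, r in enumerate(sorted(rows - empty))}
    for c in kept:
        c.row = remap[c.row]
    return TableAnnotation(cells=kept)
```

Analogous passes would tighten cell bounding boxes and add missing row, column, and header labels.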
Training the Model
We used the Table Transformer (TATR) model for our experiments. TATR frames table structure recognition as object detection, using a small set of object classes to identify table components. The architecture was held fixed throughout; only the training data changed.
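As a rough illustration of this detection framing: TATR predicts labeled bounding boxes over the table image. The class names below follow the public microsoft/table-transformer repository; the `Detection` wrapper is our own hypothetical container.

```python
from dataclasses import dataclass

# The six structure classes TATR detects (names as in the
# microsoft/table-transformer repository).
STRUCTURE_CLASSES = [
    "table",
    "table column",
    "table row",
    "table column header",
    "table projected row header",
    "table spanning cell",
]

@dataclass
class Detection:
    label: str                               # one of STRUCTURE_CLASSES
    bbox: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    score: float                             # model confidence

# Under this framing, a table's "structure" is simply the set of detected
# rows, columns, headers, and spanning cells; individual grid cells are
# recovered afterwards by intersecting the row and column boxes.
```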
In our experiments, we trained the model on both the original and the corrected datasets. Each model was evaluated after training, letting us observe how improvements in the datasets directly affected model performance.
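The protocol is essentially a small grid over data variants, not models. A schematic driver might look like the following, where `train_tatr` and `evaluate` are stand-in names of our own, not the paper's actual training code:

```python
# One fixed architecture, several training-set variants,
# one shared evaluation benchmark.
DATASET_VARIANTS = [
    "fintabnet-original", "fintabnet-corrected",
    "pubtables1m-original", "pubtables1m-corrected",
    "combined-corrected",
]

def train_tatr(dataset: str) -> str:
    """Stand-in: train the fixed TATR architecture on one data variant."""
    return f"tatr[{dataset}]"

def evaluate(model: str, benchmark: str) -> dict:
    """Stand-in: score the trained model on the benchmark."""
    return {"exact_match_accuracy": float("nan")}  # filled in by real eval

for variant in DATASET_VARIANTS:
    model = train_tatr(variant)
    metrics = evaluate(model, benchmark="icdar-2013")
    print(variant, metrics["exact_match_accuracy"])
```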
Results of Dataset Corrections
After aligning the datasets and correcting errors, we saw substantial improvements in model performance. For instance, the exact match accuracy of TATR evaluated on ICDAR-2013 rose from 42% to 65% when trained on the corrected FinTabNet, and from 65% to 75% when trained on the corrected PubTables-1M.
We also established new performance benchmarks, achieving a directed adjacency relation (DAR) score of 0.965 and an exact match accuracy of 81% on ICDAR-2013 by combining the two corrected training datasets. This shows that cleaning up data can lead to significantly better outcomes.
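Exact match accuracy is a strict metric: a table counts as correct only if the entire predicted structure matches the ground truth. A minimal sketch of computing it over a test set follows; representing each structure as a frozenset of cell tuples is our assumption, chosen so that equality means the full grid was recovered.

```python
def exact_match_accuracy(predictions, ground_truths) -> float:
    """Fraction of tables whose predicted structure matches exactly.

    Each structure is assumed to be in a canonical representation,
    e.g. a frozenset of (row, col, rowspan, colspan) tuples, so that
    equality means the whole table grid was recovered.
    """
    assert len(predictions) == len(ground_truths)
    hits = sum(p == g for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Toy example: 3 of 4 tables reconstructed perfectly -> 0.75
preds = [frozenset({(0, 0, 1, 1)}), frozenset({(0, 0, 1, 2)}),
         frozenset({(1, 0, 1, 1)}), frozenset({(0, 1, 1, 1)})]
golds = [frozenset({(0, 0, 1, 1)}), frozenset({(0, 0, 1, 2)}),
         frozenset({(1, 0, 1, 1)}), frozenset({(0, 0, 1, 1)})]
print(exact_match_accuracy(preds, golds))  # 0.75
```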
The Role of Canonicalization
A major step in our approach was a technique called canonicalization, which standardizes labels across datasets so that structurally equivalent tables receive identical annotations. Our ablation experiments showed that this step was particularly effective at boosting model performance: by making annotations more consistent, we reduced confusion for the models and improved their overall accuracy.
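As one illustrative canonicalization rule (a simplified example of our own, not the paper's full procedure): a header label followed by blank cells can be folded into a single spanning cell, so a wide header is annotated the same way regardless of how a source dataset split it up.

```python
def canonicalize_header_row(cells: list[str]) -> list[tuple[str, int]]:
    """Merge blanks in a header row into spanning cells.

    Input:  header cell texts, left to right ("" marks a blank cell).
    Output: (text, colspan) pairs, so equivalent headers get identical
    annotations however a source dataset distributed the blanks.
    Simplified illustration only, not the paper's full procedure.
    """
    spans: list[tuple[str, int]] = []
    for text in cells:
        if text == "" and spans:
            # Fold the blank cell into the preceding cell's span.
            prev_text, prev_span = spans[-1]
            spans[-1] = (prev_text, prev_span + 1)
        else:
            spans.append((text, 1))
    return spans

# "Revenue" spanning three columns is annotated identically whether the
# source labels it as one wide cell or as one cell plus two blanks.
print(canonicalize_header_row(["Revenue", "", "", "Costs", ""]))
# -> [('Revenue', 3), ('Costs', 2)]
```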
Conclusion
This work highlights the importance of aligned, corrected datasets for table structure recognition. By aligning benchmark datasets, we showed that model performance can improve significantly: even existing models perform better when trained on cleaner, more consistent data.
In future work, it will be crucial to continue refining these datasets and to explore further improvements to model training. We encourage researchers to weigh the quality of the data they use, since it plays a significant role in the success of their models. Better benchmarks for table structure recognition pave the way for tools that handle tabular data efficiently, benefiting fields from finance to science and beyond.
Title: Aligning benchmark datasets for table structure recognition
Abstract: Benchmark datasets for table structure recognition (TSR) must be carefully processed to ensure they are annotated consistently. However, even if a dataset's annotations are self-consistent, there may be significant inconsistency across datasets, which can harm the performance of models trained and evaluated on them. In this work, we show that aligning these benchmarks—removing both errors and inconsistency between them—improves model performance significantly. We demonstrate this through a data-centric approach where we adopt one model architecture, the Table Transformer (TATR), that we hold fixed throughout. Baseline exact match accuracy for TATR evaluated on the ICDAR-2013 benchmark is 65% when trained on PubTables-1M, 42% when trained on FinTabNet, and 69% combined. After reducing annotation mistakes and inter-dataset inconsistency, performance of TATR evaluated on ICDAR-2013 increases substantially to 75% when trained on PubTables-1M, 65% when trained on FinTabNet, and 81% combined. We show through ablations over the modification steps that canonicalization of the table annotations has a significantly positive effect on performance, while other choices balance necessary trade-offs that arise when deciding a benchmark dataset's final composition. Overall we believe our work has significant implications for benchmark design for TSR and potentially other tasks as well. Dataset processing and training code will be released at https://github.com/microsoft/table-transformer.
Authors: Brandon Smock, Rohith Pesala, Robin Abraham
Last Update: 2023-05-23
Language: English
Source URL: https://arxiv.org/abs/2303.00716
Source PDF: https://arxiv.org/pdf/2303.00716
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.