Improving Data Labeling in Active Learning
Two methods aim to enhance data labeling for better classification results.
Supervised classification methods help solve various real-world problems by making predictions based on labeled data. The effectiveness of these methods depends heavily on the quality of the labels used during training. However, gathering good quality labels can be challenging and costly, making it hard to utilize these algorithms effectively in real situations.
To address this problem, researchers often use active learning. This technique focuses on choosing the most informative data samples for labeling, thereby maximizing the efficiency of the labeling process. Yet, for active learning to work well, the labels obtained from experts must be of sufficient quality and quantity. In many cases, this creates a dilemma: should we ask multiple experts to label the same sample to ensure quality, or should we focus on getting more samples labeled in total?
This article discusses the issue of poor-quality annotations in active learning setups. The goal is to present two new methods for unifying annotations from different experts while making use of unlabeled data. The proposed methods are designed to work effectively even when different experts annotate largely non-overlapping sets of samples.
The Challenges of Labeling Data
Supervised learning algorithms play a major role in building prediction models for various tasks. However, their success primarily relies on having a well-labeled dataset during training. In real life, we often start with either no labels or just a few, as labeling data requires significant human effort and financial resources.
To make the labeling process more efficient and affordable, active learning techniques are widely implemented. Active learning algorithms select the most valuable samples from a larger pool of unlabeled data, which are then sent to experts for annotation. While some labels can be generated through automated methods, many tasks still rely on human input, especially in areas like security alert notifications.
Human annotators are not perfect, and their labels may contain errors, which negatively affects the performance of models built on those labels. The likelihood of mistakes is influenced by the complexity of the task and the expertise of the annotators. When these errors accumulate, it becomes necessary to apply correction methods. Two common approaches include unifying annotations from multiple experts or identifying and filtering out incorrect labels.
The first approach takes advantage of the fact that, even when individual experts make mistakes, some of them will still label a given sample correctly, so combining their annotations can cancel out individual errors. This method usually requires multiple experts to label each sample, which is a challenge when resources are limited. The second approach seeks to find and eliminate mislabeled samples, but it runs the risk of discarding accurate labels, which could lead to an oversimplified model that misses vital information.
Proposed Methods
This paper introduces two algorithms that improve the process of unifying annotations: inferred consensus and simulated consensus. Both algorithms build on a well-known method called Expectation-Maximization (EM) and aim to enhance labeling even when samples lack multiple expert annotations.
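The article does not reproduce the algorithms in detail, but both rest on the classic EM idea of alternating between estimating consensus labels and estimating annotator reliability. The sketch below is a minimal, Dawid-Skene-style EM loop for binary labels over a possibly sparse expert-by-sample annotation matrix; it illustrates the general mechanism rather than the paper's exact formulation, and names such as `em_consensus` are placeholders.

```python
import numpy as np

def em_consensus(annotations, n_iter=50):
    """Dawid-Skene-style EM for binary labels with per-expert accuracies.

    annotations: (n_experts, n_samples) array holding 0, 1, or np.nan
    where an expert did not annotate a sample.
    Returns (label_probs, expert_accuracies).
    """
    observed = ~np.isnan(annotations)
    votes = np.nan_to_num(annotations, nan=0.0)
    n_experts, n_samples = annotations.shape

    # Initialise consensus probabilities with a soft majority vote;
    # samples with no annotations default to 0.5.
    counts = observed.sum(axis=0)
    label_probs = np.divide(votes.sum(axis=0), counts,
                            out=np.full(n_samples, 0.5), where=counts > 0)
    accuracies = np.full(n_experts, 0.8)  # initial guess for every expert

    for _ in range(n_iter):
        # E-step: posterior probability that each true label is 1,
        # combining votes weighted by each expert's estimated accuracy.
        prior1 = np.clip(label_probs.mean(), 1e-6, 1 - 1e-6)
        log_p1 = np.full(n_samples, np.log(prior1))
        log_p0 = np.full(n_samples, np.log(1 - prior1))
        for e in range(n_experts):
            m = observed[e]
            log_p1[m] += np.log(np.where(annotations[e, m] == 1,
                                         accuracies[e], 1 - accuracies[e]) + 1e-9)
            log_p0[m] += np.log(np.where(annotations[e, m] == 0,
                                         accuracies[e], 1 - accuracies[e]) + 1e-9)
        label_probs = 1.0 / (1.0 + np.exp(log_p0 - log_p1))

        # M-step: re-estimate each expert's accuracy from the soft labels.
        for e in range(n_experts):
            m = observed[e]
            if m.any():
                agreement = np.where(annotations[e, m] == 1,
                                     label_probs[m], 1 - label_probs[m])
                accuracies[e] = agreement.mean()

    return label_probs, accuracies
```

Missing annotations are simply skipped in both steps, which is what makes this style of consensus usable when experts label mostly disjoint subsets of the data.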
Inferred consensus uses the existing expert annotations to predict labels for samples those experts did not annotate. The idea is to estimate how an expert would have labeled a sample they never actually saw. For each expert, a machine learning model is trained on the samples that expert has labeled, and this model is then used to estimate the expert's labels for the entire dataset.
Simulated consensus refines the inferred approach by using each expert's model to infer labels only for the samples that expert did not see, keeping the original annotations elsewhere. This helps create a more reliable set of labels while keeping track of the quality of each annotator's contributions.
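Neither variant is given as pseudocode in the article, so the following is a speculative sketch under the description above: for each expert, a classifier (here a scikit-learn `LogisticRegression`, purely as a stand-in) is fitted on the samples that expert labeled; the inferred variant replaces the expert's column with model predictions for every sample, while the simulated variant keeps the observed annotations and fills in predictions only for unseen samples. The densified matrix could then be fed to an EM-style consensus step such as the one sketched earlier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def densify_annotations(X, annotations, mode="simulated"):
    """Fill in missing entries of a sparse expert-by-sample annotation matrix.

    X: (n_samples, n_features) feature matrix.
    annotations: (n_experts, n_samples) with 0/1 labels and np.nan where
    an expert did not annotate a sample.
    mode: "inferred" replaces each expert's column with model predictions
    for all samples; "simulated" keeps the observed annotations and
    predicts only the missing ones.
    """
    dense = annotations.copy()
    for e in range(annotations.shape[0]):
        seen = ~np.isnan(annotations[e])
        labels = annotations[e, seen]
        if seen.sum() < 2 or len(np.unique(labels)) < 2:
            continue  # too few annotations to fit a per-expert model
        model = LogisticRegression(max_iter=1000).fit(X[seen], labels)
        if mode == "inferred":
            dense[e] = model.predict(X)
        else:  # "simulated": fill in only the samples this expert never saw
            unseen = ~seen
            if unseen.any():
                dense[e, unseen] = model.predict(X[unseen])
    return dense
```

Keeping the observed annotations in the simulated variant preserves the signal needed to estimate each annotator's reliability, which is in line with the article's description of why this variant is more robust.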
Addressing Imbalanced Datasets
When using algorithms like EM, it is important to account for how class labels are assigned. A common threshold for distinguishing between classes is usually set at 0.5, but this can be problematic in cases of imbalanced data, where one class is much less frequent than another.
In situations where the class distribution is unknown, determining an effective threshold can be difficult. This article proposes calculating a threshold from the probabilities predicted for all samples during training: by averaging the predicted probabilities for each class, a more informed cut-off point can be derived, which improves the models' performance on imbalanced datasets.
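The article describes this threshold only at a high level. One simple reading, sketched below, is to take the mean predicted positive-class probability over the training samples as the decision boundary instead of the fixed 0.5; the exact calculation used in the paper may differ.

```python
import numpy as np

def adaptive_threshold(probs):
    """Use the mean predicted positive-class probability as the cut-off.

    On imbalanced data this pulls the decision boundary toward the
    minority class instead of fixing it at 0.5.
    """
    return float(np.mean(probs))

def assign_labels(probs):
    return (probs >= adaptive_threshold(probs)).astype(int)

# Example: heavily imbalanced predicted probabilities.
probs = np.array([0.05, 0.10, 0.08, 0.02, 0.60, 0.07])
print(adaptive_threshold(probs))  # ~0.153 rather than 0.5
print(assign_labels(probs))       # [0 0 0 0 1 0]
```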
Experimental Setup
To evaluate the effectiveness of the proposed algorithms, a testing setup was created that resembles real-world active learning scenarios. Since it is impractical to obtain human labels solely for experimentation, a method was developed to generate annotations using known public datasets.
The process involved creating binary labels for a set number of experts by simulating their annotation behavior. This was achieved by drawing from statistical distributions that determined how likely each expert was to annotate a given sample, as well as each expert's accuracy.
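The exact distributions are not specified in this summary, so the sketch below is only an illustration: each simulated expert gets an accuracy and a coverage rate drawn from Beta distributions (the parameters are placeholders), then labels a random subset of samples, flipping the true label with probability one minus their accuracy.

```python
import numpy as np

def simulate_annotations(y_true, n_experts=5, seed=0):
    """Simulate sparse, noisy binary annotations for a pool of experts.

    y_true: (n_samples,) ground-truth 0/1 labels.
    Returns an (n_experts, n_samples) matrix with 0/1 annotations and
    np.nan where an expert did not label a sample.
    """
    rng = np.random.default_rng(seed)
    n = len(y_true)
    annotations = np.full((n_experts, n), np.nan)
    for e in range(n_experts):
        accuracy = rng.beta(8, 2)   # how often this expert is correct
        coverage = rng.beta(2, 5)   # fraction of samples this expert labels
        labeled = rng.random(n) < coverage
        correct = rng.random(n) < accuracy
        noisy = np.where(correct, y_true, 1 - y_true)
        annotations[e, labeled] = noisy[labeled]
    return annotations

# Example usage with a small imbalanced label vector.
y = np.array([0] * 90 + [1] * 10)
A = simulate_annotations(y, n_experts=4)
print(np.isnan(A).mean())  # fraction of missing annotations
```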
The experiments were conducted across four public research datasets with different characteristics. This diversity was essential to ensure the robustness of the proposed methods in various settings. The researchers followed a repeated testing procedure for each dataset to gather meaningful results and assess statistical significance.
Evaluation Metrics
Three types of evaluation metrics were used to assess the proposed methods; a rough code sketch covering all three follows the list.
Metrics on Annotation Quality: These metrics evaluate the methods' effectiveness in providing accurate probabilities for each sample based on the annotations received from experts.
Expert Quality Estimation: These metrics measure how well the algorithms can assess the reliability of each expert based on their annotations.
Machine Learning Model Performance: Finally, the evaluation includes metrics from the machine learning models trained on the estimated labels, measuring how well these models perform on test datasets.
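The concrete metrics are not listed in this summary, so the sketch below shows one plausible instantiation of the three groups: ranking quality (AUC) of the estimated label probabilities against the ground truth, correlation between estimated and true annotator accuracies, and balanced accuracy of a downstream classifier trained on the estimated hard labels. The metrics reported in the paper may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

def evaluate(label_probs, y_true, est_expert_acc, true_expert_acc,
             X_train, X_test, y_test, threshold=0.5):
    """Toy versions of the three metric groups described above."""
    # (1) Annotation quality: how well do the estimated probabilities
    # rank samples by their true class?
    annotation_auc = roc_auc_score(y_true, label_probs)

    # (2) Expert quality estimation: agreement between estimated and
    # true annotator accuracies (Pearson correlation here).
    expert_corr = np.corrcoef(est_expert_acc, true_expert_acc)[0, 1]

    # (3) Downstream model performance: train on the estimated hard
    # labels and score on a held-out test set.
    y_est = (label_probs >= threshold).astype(int)  # assumes both classes appear
    model = LogisticRegression(max_iter=1000).fit(X_train, y_est)
    model_score = balanced_accuracy_score(y_test, model.predict(X_test))

    return {"annotation_auc": annotation_auc,
            "expert_accuracy_corr": expert_corr,
            "downstream_balanced_accuracy": model_score}
```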
Results and Discussion
The results demonstrated that the simulated consensus algorithm significantly outperformed other approaches in most cases. This finding suggests that introducing simulated annotations helps achieve better label quality and improves models' accuracy.
The study also revealed that the quality of the trained models varied depending on the dataset used. While the proposed consensus methods performed well in structured datasets, their advantage weakened in imbalanced scenarios where majority voting with the default threshold performed unexpectedly well.
Conclusion
In conclusion, this article addresses the challenge of poor-quality data annotations in active learning environments. By introducing two new methods for unifying annotations, we can enhance the labeling process and improve the performance of classification algorithms. These methods can manage imbalanced datasets effectively without needing prior information about class distributions.
The findings suggest that simulating expert annotations can lead to better assessment of label quality and annotator reliability. Future work should explore these methods in additional contexts and further examine the relationship between label quality and the performance of machine learning models.
The implications of this research extend to various fields where active learning is applied, indicating a clear pathway forward for improving data labeling processes in a wide range of applications. Further experimentation and validation will help solidify the results presented and encourage ongoing exploration in this area.
Title: Robust Assignment of Labels for Active Learning with Sparse and Noisy Annotations
Abstract: Supervised classification algorithms are used to solve a growing number of real-life problems around the globe. Their performance is strictly connected with the quality of labels used in training. Unfortunately, acquiring good-quality annotations for many tasks is infeasible or too expensive to be done in practice. To tackle this challenge, active learning algorithms are commonly employed to select only the most relevant data for labeling. However, this is possible only when the quality and quantity of labels acquired from experts are sufficient. Unfortunately, in many applications, a trade-off between annotating individual samples by multiple annotators to increase label quality vs. annotating new samples to increase the total number of labeled instances is necessary. In this paper, we address the issue of faulty data annotations in the context of active learning. In particular, we propose two novel annotation unification algorithms that utilize unlabeled parts of the sample space. The proposed methods require little to no intersection between samples annotated by different experts. Our experiments on four public datasets indicate the robustness and superiority of the proposed methods in both, the estimation of the annotator's reliability, and the assignment of actual labels, against the state-of-the-art algorithms and the simple majority voting.
Authors: Daniel Kałuża, Andrzej Janusz, Dominik Ślęzak
Last Update: 2023-07-25 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2307.14380
Source PDF: https://arxiv.org/pdf/2307.14380
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.