
Categories: Computer Science, Machine Learning, Computer Vision and Pattern Recognition

Improving Medical Image Classification with Active Label Cleaning

A new method enhances classification despite noisy labels and imbalanced datasets.



Tackling Noisy Labels in Medical Imaging: a method for robust medical image classification.

Medical image classification can greatly help diagnose diseases. However, training labels are often incorrect, which makes it challenging to train accurate models. This is especially true when some diseases are rare and have far fewer images. In this context, noisy labels, meaning incorrect labels, can cause a drop in model performance. This article discusses a method that aims to improve the training of classifiers in the presence of noisy labels and imbalanced datasets.

The Problem of Noisy Labels

In the real world, many factors can lead to noisy labels in medical images: poor-quality annotations, automated label generation, or labels taken from unreliable sources. This noise distorts the learning process, because the model tries to fit the incorrect training labels, which reduces its ability to perform well on new, unseen data.

In medical datasets, conditions vary in how common they are. Some diseases have many images available, while others have far fewer. For instance, a rare skin condition might contribute only a handful of images to the dataset, making it hard for the model to learn about it effectively. On such imbalanced data, traditional noisy-label methods often struggle: they tend to mistake the scarce minority-class samples for noisy ones.

Importance of Clean Labels

For accurate predictions, obtaining clean labels is crucial. A clean label is simply a correct label that accurately describes an image. If the model is trained with noisy labels, it might misclassify important images, especially those from minority classes. This means that special strategies are needed to identify and clean these labels, allowing the model to improve its performance gradually.

Active Label Cleaning Approach

To tackle the issue of noisy labels, a two-phase approach is proposed. The first phase focuses on training robustly even when labels are noisy. The second phase actively cleans those labels. By combining the two stages, the method improves classification performance significantly.

Phase 1: Learning with Noisy Labels

In the initial phase, the model is trained while accounting for the noise present in the labels. The idea is to learn which samples are likely to be clean and which are noisy by separating the labels based on their reliability. Standard methods, however, often fall short on imbalanced datasets, since they may mistakenly flag underrepresented samples as noisy. A common reliability signal is the per-sample training loss, as sketched below.
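
The article does not spell out the exact separation rule, so here is a minimal sketch of a widely used loss-based heuristic (popularized by LNL methods such as DivideMix): fit a two-component Gaussian mixture to the per-sample losses and treat the low-loss component as clean. The function name `split_clean_noisy` and the threshold are illustrative assumptions, not names from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def split_clean_noisy(per_sample_losses, clean_threshold=0.5):
    """Split sample indices into likely-clean and likely-noisy sets."""
    losses = np.asarray(per_sample_losses, dtype=np.float64).reshape(-1, 1)
    # Min-max normalize so the mixture fit is independent of loss scale.
    losses = (losses - losses.min()) / (losses.max() - losses.min() + 1e-8)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses)
    # The component with the smaller mean loss models the clean samples.
    clean_component = int(np.argmin(gmm.means_.ravel()))
    p_clean = gmm.predict_proba(losses)[:, clean_component]
    clean_idx = np.where(p_clean >= clean_threshold)[0]
    noisy_idx = np.where(p_clean < clean_threshold)[0]
    return clean_idx, noisy_idx
```

Note how a pure loss criterion penalizes minority classes: their samples tend to have higher losses simply because the model has seen them less often, which is exactly the failure mode the Variance of Gradients technique described later is meant to correct.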

Phase 2: Active Label Cleaning

After the first phase, the next step is to clean the noisy labels. An annotation budget is set that limits how many samples can be relabeled. An active learning sampler then selects the most informative samples to clean, as in the sketch below. The selected samples are sent to experts for relabeling, and the model is updated accordingly; by focusing the budget on key samples, the model improves significantly.
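
As a concrete illustration, here is a minimal sketch of entropy-based acquisition under a fixed budget (one of the strategies compared later in the article). `probs` is assumed to be the model's softmax outputs for the samples flagged as noisy in phase 1; the function name is hypothetical.

```python
import numpy as np

def select_for_relabeling(probs, budget):
    """Return the indices of the `budget` most uncertain samples."""
    probs = np.asarray(probs)
    # Predictive entropy: highest where the model is least certain,
    # so expert relabeling is likely to be most informative there.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[::-1][:budget]
```

The returned indices are sent to the annotators; once the corrected labels come back, those samples move into the clean set and training resumes.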

Addressing Class Imbalance

The challenge of class imbalance arises when certain classes have far fewer samples. For example, in a dataset containing multiple skin conditions, one condition might have far fewer images than the others. To ensure that the model learns effectively, strategies should balance how the classes are represented during training; one standard option is sketched below.
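
The article does not specify the balancing mechanism, so this sketch shows one standard remedy: inverse-frequency class weights passed to the loss function, so that rare classes contribute more per sample. The helper name is an illustrative assumption.

```python
import numpy as np
import torch

def inverse_frequency_weights(labels, num_classes):
    """Weight each class inversely to its frequency in `labels`."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    weights = counts.sum() / (num_classes * np.maximum(counts, 1.0))
    return torch.tensor(weights, dtype=torch.float32)

# Example: 8 skin-condition classes with a long-tailed label list.
# criterion = torch.nn.CrossEntropyLoss(
#     weight=inverse_frequency_weights(train_labels, num_classes=8))
```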

Variance Of Gradients

One novel technique introduced in this approach is the Variance of Gradients (VOG). While traditional methods rely on a sample's loss to decide whether it is clean or noisy, VOG instead analyzes how a sample's gradients change over the course of training. This identifies underrepresented samples more accurately and ensures that minority classes are not discarded as noise during training. A minimal sketch follows.
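
The sketch below follows the original VOG formulation by Agarwal et al.: for each sample, take the gradient of the true-class score with respect to the input at several training checkpoints, then measure how much that gradient varies across checkpoints. How the paper integrates this score into its sample selection is not detailed here, and checkpoint loading and batching are omitted.

```python
import torch

def vog_score(checkpoints, x, y):
    """Variance of input gradients for one sample across checkpoints.

    `checkpoints` is a list of the same network saved at different
    training stages; `x` is one input image tensor and `y` its
    (possibly noisy) integer label.
    """
    grads = []
    for model in checkpoints:
        model.eval()
        inp = x.clone().detach().requires_grad_(True)
        logits = model(inp.unsqueeze(0))  # shape (1, num_classes)
        # Gradient of the class-y score with respect to the input pixels.
        logits[0, y].backward()
        grads.append(inp.grad.detach().clone())
    grads = torch.stack(grads)            # (num_checkpoints, *x.shape)
    # Per-pixel variance across checkpoints, averaged into one scalar.
    return grads.var(dim=0, unbiased=False).mean().item()
```

Intuitively, samples the model keeps "changing its mind" about receive high variance; combined with the loss signal, this helps distinguish genuinely mislabeled samples from merely underrepresented ones.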

Datasets Used

The effectiveness of the proposed method is shown on two datasets: ISIC-2019 and NCT-CRC-HE-100K. The ISIC-2019 dataset contains images of skin diseases, while the NCT-CRC-HE-100K dataset contains histopathology images. Both exhibit significant class imbalance, providing a suitable testbed for how well the method performs in real-world settings.

ISIC-2019 Dataset

This dataset comprises over 25,000 images of various skin diseases, which are divided into training, validation, and test sets. The distribution among the classes is uneven, leading to challenges when training classifiers. The goal remains to ensure that the model learns effectively across all represented conditions despite the imbalance.

NCT-CRC-HE-100K Dataset

The long-tailed NCT-CRC-HE-100K dataset is another key data source, with roughly 100,000 histopathology images. Like ISIC-2019, it suffers from class imbalance, allowing a thorough evaluation of the proposed method's ability to manage noisy labels.

Experiments and Results

To validate the effectiveness of the proposed method, various experiments were conducted. The performance of the active label cleaning approach was compared against several baseline methods.

Active Learning Comparison

Different active learning strategies were tested, including random sampling and entropy-based sampling (the random baseline is sketched below). The goal was to see how well these strategies could select samples for relabeling and improve the model's performance. Results showed that starting from a model trained on noisy data was generally less effective than starting from the clean samples identified by the proposed method.
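
For contrast with the entropy-based sketch earlier, the random baseline simply spends the budget uniformly over the noisy pool. The values for `noisy_idx` and `budget` here are toy stand-ins, not figures from the paper.

```python
import numpy as np

noisy_idx = np.arange(1000)  # toy stand-in for phase-1 noisy indices
budget = 50                  # toy annotation budget

rng = np.random.default_rng(seed=0)
random_pick = rng.choice(noisy_idx, size=budget, replace=False)
print(random_pick[:10])
```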

Conclusion

The proposed two-phase approach combining learning with noisy labels and active label cleaning demonstrates significant improvements in medical image classification tasks, especially in handling noisy labels and class imbalance. By effectively relabeling important samples and using innovative techniques like Variance of Gradients, the method presents a practical way to enhance the robustness of classifiers in the face of label noise.

In summary, the key takeaways include the importance of clean labels, the effectiveness of active learning in cleaning noisy labels, and the benefits of addressing class imbalance. By focusing on these areas, medical image classification can become more accurate, ultimately aiding in better diagnosis and treatment of various health conditions.

Original Source

Title: Active Label Refinement for Robust Training of Imbalanced Medical Image Classification Tasks in the Presence of High Label Noise

Abstract: The robustness of supervised deep learning-based medical image classification is significantly undermined by label noise. Although several methods have been proposed to enhance classification performance in the presence of noisy labels, they face some challenges: 1) a struggle with class-imbalanced datasets, leading to the frequent overlooking of minority classes as noisy samples; 2) a singular focus on maximizing performance using noisy datasets, without incorporating experts-in-the-loop for actively cleaning the noisy labels. To mitigate these challenges, we propose a two-phase approach that combines Learning with Noisy Labels (LNL) and active learning. This approach not only improves the robustness of medical image classification in the presence of noisy labels, but also iteratively improves the quality of the dataset by relabeling the important incorrect labels, under a limited annotation budget. Furthermore, we introduce a novel Variance of Gradients approach in the LNL phase, which complements the loss-based sample selection by also sampling under-represented samples. Using two imbalanced noisy medical classification datasets, we demonstrate that our proposed technique is superior to its predecessors at handling class imbalance by not misidentifying clean samples from minority classes as mostly noisy samples.

Authors: Bidur Khanal, Tianhong Dai, Binod Bhattarai, Cristian Linte

Last Update: 2024-10-24

Language: English

Source URL: https://arxiv.org/abs/2407.05973

Source PDF: https://arxiv.org/pdf/2407.05973

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
