Latest Articles for Data Quality

Statistics Theory Validating Statistical Models with Contaminated Data

This article discusses how contaminated data complicates the validation of statistical models.

2025-07-31T22:27:48+00:00 ― 6 min read

Computer Vision and Pattern Recognition Advancing Offline Reinforcement Learning with SeMOPO

SeMOPO improves learning from low-quality data by separating useful information from noise.

2025-07-29T13:07:42+00:00 ― 4 min read

Machine Learning Improving Offline Multi-Agent Reinforcement Learning Research Standards

Examining the key issues in offline MARL and proposing standardized solutions.

2025-07-29T05:53:12+00:00 ― 6 min read

Methodology Evaluating Non-Probability Data in Statistics

A look at the role of non-probability data in modern statistical methods.

2025-07-28T21:52:00+00:00 ― 6 min read

Machine Learning The Growing Importance of Data Valuation

Assessing data worth is key to improving machine learning outcomes.

2025-07-28T21:01:56+00:00 ― 7 min read

Machine Learning Evaluating Feature Selection Methods in Noisy Data

Methods for identifying important features in low-quality data environments.

2025-07-28T00:47:18+00:00 ― 6 min read

Computation and Language A Closer Look at GLM-4 Models

GLM-4 models show improved capabilities in language understanding and generation.

2025-07-27T06:52:54+00:00 ― 8 min read

Machine Learning Improving EHR Data Generation for Better Healthcare Insights

A new model enhances synthetic EHR data for improved healthcare applications.

2025-07-26T21:39:54+00:00 ― 5 min read

Machine Learning Improving Pseudo-labeling with DIPS Framework

DIPS addresses data quality issues in pseudo-labeling for better machine learning outcomes.

2025-07-26T18:38:12+00:00 ― 5 min read

Computation and Language Introducing FineWeb: A New Dataset for Language Models

FineWeb offers 15 trillion tokens to improve language model training.

2025-07-24T10:01:12+00:00 ― 7 min read

Computation and Language Small Language Models and Noise Management

This article examines how small language models learn to handle noise in data.

2025-07-21T07:53:30+00:00 ― 4 min read

Computer Vision and Pattern Recognition VideoEval: A New Standard for Video Model Evaluation

VideoEval sets a new benchmark for assessing video foundation models effectively.

2025-07-17T18:26:24+00:00 ― 5 min read

Machine Learning Addressing Model Collapse in AI Training

This article discusses how better data selection and feedback can counter model collapse.

2025-07-16T12:48:16+00:00 ― 4 min read

Computer Vision and Pattern Recognition Improving Dataset Quality through Label Error Detection

A new method enhances detection of mislabeled images and text in datasets.

2025-07-16T06:37:36+00:00 ― 5 min read

Databases Enhancing Data Management with Semantic SQL Transducer

Discover how the Semantic SQL Transducer improves data clarity and management.

2025-07-15T15:52:48+00:00 ― 6 min read

Machine Learning The Impact of Noisy Data on Machine Learning Accuracy

Exploring how noisy data affects model performance on unseen data.

2025-07-08T17:53:20+00:00 ― 7 min read

Image and Video Processing Improving Disease Detection Through Quality Dataset Management

Using UMAP to spot labeling errors in medical image datasets.

2025-07-08T10:56:15+00:00 ― 6 min read

Computation and Language Detecting Errors in Machine Translation

This article discusses challenges in detecting hallucinations in machine translation across various languages.

2025-07-08T06:15:42+00:00 ― 5 min read

Computation and Language Introducing LawLuo: A New Approach to Legal Assistance

LawLuo combines multiple agents for enhanced legal consultation experiences.

2025-07-08T02:10:48+00:00 ― 6 min read

Computation and Language The Challenges of Regurgitative Training in LLMs

This paper examines the drawbacks of using LLM-generated data for training new models.

2025-07-05T23:08:00+00:00 ― 7 min read

Computation and Language Advancing Synthetic Data for Language Models

A new method enhances synthetic data quality for better language model alignment.

2025-06-30T13:24:06+00:00 ― 5 min read

Databases Advancements in Entity Resolution with ASPen

Introducing ASPen, a system to improve data quality through advanced entity resolution techniques.

2025-06-28T15:11:12+00:00 ― 6 min read

Artificial Intelligence EU AI Act: Addressing Uncertainty in AI Systems

New rules focus on transparency and managing uncertainty in AI technology.

2025-06-25T09:53:54+00:00 ― 6 min read

Computation and Language Adapting Language Models with Limited Resources

Research on training language models for underrepresented languages efficiently.

2025-06-20T19:49:30+00:00 ― 6 min read

Computation and Language Optimizing Language Models for Medical Texts

A study on improving language models using focused medical articles.

2025-06-17T05:19:12+00:00 ― 5 min read

Software Engineering Addressing Fairness Debt in AI Systems

This article explores identifying and managing biases in AI for fair outcomes.

2025-06-16T23:15:48+00:00 ― 5 min read

Computer Vision and Pattern Recognition Aligning AI to Human Visual Understanding

A framework to improve AI's performance in visual tasks by mimicking human judgments.

2025-06-14T06:13:12+00:00 ― 5 min read

Computer Vision and Pattern Recognition Assessing the Quality of Image Captions

This article evaluates sentiment and meaning in image captions.

2025-06-12T04:58:36+00:00 ― 4 min read

Computer Vision and Pattern Recognition The Impact of Labeling on Machine Learning Performance

This article highlights how label variations affect machine learning models.

2025-06-12T01:09:30+00:00 ― 7 min read

Methodology Improving Data Readiness for AI Success

Enhance data quality through visual analysis for effective AI projects.

2025-06-09T10:27:08+00:00 ― 5 min read

Image and Video Processing Challenges in Histopathological Image Analysis Using Deep Learning

Investigation of dataset issues impacting tissue image classification accuracy.

2025-06-09T01:40:10+00:00 ― 5 min read

Statistics Theory Bayesian Methods for Mismatched Data

A new approach to accurately match records in error-prone datasets.

2025-06-04T01:56:52+00:00 ― 5 min read

Machine Learning Improving K-Means Clustering with Missing Data

New methods enhance K-means clustering by addressing missing data issues.

2025-06-02T11:24:00+00:00 ― 5 min read

Biological Physics PDBBind-Opt: Improving Drug Discovery Data

New systems enhance protein-ligand interaction data for better drug design.

2025-05-30T23:26:45+00:00 ― 6 min read

Machine Learning The Quirks and Challenges of Vision-Language Models

An overview of the strengths and flaws in today's Vision-Language Models.

2025-05-28T19:26:51+00:00 ― 6 min read

Computation and Language Evaluating Wikipedia's Quality Across Languages

This piece examines the varying quality of Wikipedia content in different languages.

2025-05-27T10:10:12+00:00 ― 7 min read

Artificial Intelligence Understanding Class Granularity in Knowledge Graphs

Class Granularity helps organize knowledge graphs for better information retrieval.

2025-05-26T10:01:39+00:00 ― 6 min read

Software Engineering The Hidden Risks of Bad Data in Deep Learning

Bad data can lead to poor model performance in deep learning applications.

2025-05-20T17:13:12+00:00 ― 6 min read

Machine Learning Navigating the Challenges of Label Noise in Deep Learning

Label noise can hinder deep learning models; new methods improve accuracy.

2025-05-01T16:21:20+00:00 ― 7 min read

Machine Learning Tackling the Challenge of Cyberbullying Detection

Understanding data biases in machine learning for effective cyberbullying detection.

2025-04-30T03:12:00+00:00 ― 8 min read