A new benchmark sheds light on hallucination in vision-language models.
― 5 min read
This study highlights the importance of dataset granularity in improving image-text retrieval systems.
― 5 min read
Introducing an efficient way to evaluate the quality of generated samples using latent density scores.
― 8 min read
A new benchmark evaluates models' understanding of long videos and language.
― 5 min read
HaloQuest addresses hallucination issues in vision-language models with a new dataset.
― 9 min read
A new benchmark aims to improve evaluation of open information extraction (OIE) systems for clearer performance insights.
― 5 min read
A new benchmark tests vision-language models on minimal changes in images and captions.
― 6 min read
This study highlights the need for LLMs to know when to abstain.
― 6 min read
Proper scoring rules enhance the evaluation of probabilistic forecasts across various fields.
― 7 min read
A framework for better estimating treatment effects in paired cluster-randomized experiments.
― 6 min read
Using AI-generated relevance judgments for efficient evaluation of information retrieval systems.
― 7 min read
A new method improves evaluation accuracy in authorship verification by reducing topic leakage.
― 8 min read
A new framework enhances evaluation of retrieval-augmented generation (RAG) systems in specialized domains.
― 8 min read
New methods offer better ways to evaluate language understanding in models.
― 6 min read
MicroSSIM enhances image quality assessment in microscopy for better scientific outcomes.
― 5 min read
A new framework for assessing the performance of RAG systems.
― 7 min read
ArabLegalEval assesses LLMs' performance in handling Arabic legal information.
― 6 min read
New benchmark tackles relation hallucinations in multimodal large language models.
― 6 min read
A novel approach to assessing health-related answers generated by AI models.
― 6 min read
Soda-Eval sets new standards for chatbot evaluation methods.
― 6 min read
A new benchmark and dataset enhance evaluation of medical language models.
― 5 min read
A new approach to assessing how citations support statements in generated text.
― 6 min read
Researchers examine the reliability of metrics for language model safety.
― 6 min read
A multi-domain benchmark assesses LLMs' code generation abilities across various fields.
― 6 min read
A new system optimizes AI responses for legal fields, focusing on New York City's Local Law 144.
― 6 min read
A study on the effectiveness of image matching methods in diverse scenarios.
― 6 min read
Examining the effectiveness of large vision-language models (LVLMs) in generating multilingual art explanations.
― 7 min read
This study evaluates how well AI categorizes images compared to humans.
― 7 min read
A fresh evaluation method for large language models using nested API calls.
― 5 min read
OpenACE provides a fair benchmark for assessing audio codecs across various conditions.
― 5 min read
Learn how to evaluate and compare images effectively.
― 4 min read
VERA enhances the accuracy and relevance of language model responses.
― 5 min read
RAGProbe automates the evaluation of RAG systems, improving their performance and reliability.
― 6 min read
A new dataset enhances evaluation of language models' accuracy on clinical trial information.
― 7 min read
A dataset helps AI systems learn to handle distracting visuals.
― 6 min read
A study on how models follow instructions during complex dialogues.
― 6 min read
HealthQ evaluates AI's ability to ask questions in patient care.
― 7 min read
Exploring methods to improve multimodal models in breaking down visual questions.
― 6 min read
Introducing MemSim, a tool for assessing memory effectiveness in language model assistants.
― 5 min read
Introducing a new model and benchmark for evaluating multi-audio tasks.
― 5 min read