A new dataset improves assessment of molecular knowledge in language models.
― 7 min read
SPHINX-V enhances AI's ability to interpret images through user interaction.
― 6 min read
BEAR improves assessment of relational knowledge in language models.
― 8 min read
This study examines how language models handle different expressions of the same reasoning problems.
― 4 min read
A new dataset evaluates how language models handle harmful content across cultures.
― 5 min read
A new benchmark improves how we assess the accuracy of LVLMs.
― 5 min read
An assessment of how well LLMs remember factual information and the factors involved.
― 5 min read
This study offers improved methods for assessing text-to-image models.
― 6 min read
A study evaluating few-shot learning methods for Polish language classification.
― 4 min read
New metrics improve evaluation of information extraction systems in handwritten documents.
― 6 min read
WorkBench tests agents' ability to perform realistic office tasks with a unique evaluation method.
― 6 min read
Assessing how LLMs adapt to new information and handle biases.
― 7 min read
A new method for assessing language models' alignment with human values.
― 7 min read
Combining human reviewers with LLMs improves biomedical research evaluations.
― 6 min read
A challenge focusing on deep generative models for realistic medical image generation.
― 8 min read
A new system for assessing language models using real-world data streams.
― 5 min read
A new method to assess commonsense reasoning in AI models through open-ended tasks.
― 8 min read
New GAIA dataset sheds light on action quality in AI-generated content.
― 7 min read
A new method to assess generative models with minimal data generation.
― 5 min read
A new benchmark tests compositional reasoning in advanced models.
― 7 min read
New dataset helps assess AI text accuracy and reliability.
― 6 min read
A new benchmark evaluates how language models handle text changes.
― 6 min read
A toolkit for assessing performance of retrieval-augmented models in specific domains.
― 9 min read
VideoVista offers a comprehensive evaluation for video question-answering models.
― 5 min read
Methods for measuring treatment effects across diverse groups and time frames.
― 4 min read
This article presents a new method for assessing text-to-image models effectively.
― 6 min read
Dysca introduces a new way to assess LVLM performance using synthetic data.
― 6 min read
A new method measures how language models adapt their beliefs with new evidence.
― 9 min read
A new benchmark to assess AI agents' performance in biomedical literature and knowledge graphs.
― 5 min read
Introducing FairMedFM to evaluate the fairness of foundation models in healthcare.
― 6 min read
This study assesses how medical LVLMs perform amidst hallucinations using a new dataset.
― 6 min read
Exploring machine learning models and new datasets for improved security.
― 7 min read
FKEA offers a fresh way to assess generative models without needing reference datasets.
― 5 min read
A look at the benefits of segment-level evaluation methods for translation quality.
― 8 min read
New metrics and EdgeHead module enhance 3D detection for autonomous vehicles.
― 6 min read
A new approach enhances the accuracy of language model evaluations.
― 7 min read
Improving how models handle evidence in long documents builds user trust.
― 4 min read
BiasAlert enhances bias detection in language models for fairer AI outputs.
― 5 min read
A new method to assess accuracy in language model outputs.
― 4 min read
A new benchmark sheds light on hallucination in vision language models.
― 5 min read