Latest Articles for Evaluation

Computer Vision and Pattern Recognition Introducing VideoVista: A New Benchmark for Video QA

VideoVista offers a comprehensive evaluation for video question-answering models.

2025-07-27T13:35:48+00:00 ― 5 min read

Distributed, Parallel, and Cluster Computing Improving Reproducibility in Distributed Systems Research

This article explores methods to enhance the reliability of research artifacts in computing.

2025-07-27T08:04:00+00:00 ― 7 min read

Computation and Language A Closer Look at GLM-4 Models

GLM-4 models show improved capabilities in language understanding and generation.

2025-07-27T06:52:54+00:00 ― 8 min read

Computation and Language Evaluating Language Models: A New Approach

A study on using LLMs to judge other LLMs and its implications.

2025-07-27T04:30:42+00:00 ― 7 min read

Computation and Language Rationales in Argument Ranking by Language Models

A study on how language models generate persuasive rationales for argument evaluation.

2025-07-26T20:52:30+00:00 ― 5 min read

Computation and Language New Language Models Enhance Galician Accessibility

Two new models aim to improve technology access for Galician speakers.

2025-07-26T20:44:36+00:00 ― 5 min read

Computation and Language Challenges of Machine Translation in Metaphors

Examining the difficulties of translating metaphorical language in machine translation.

2025-07-26T17:58:42+00:00 ― 6 min read

Computer Vision and Pattern Recognition Introducing DF40: A New Dataset for Deepfake Detection

DF40 offers a comprehensive approach to improving deepfake detection methods.

2025-07-26T16:08:06+00:00 ― 6 min read

Computation and Language Evaluating Honesty in Large Language Models

This study assesses the honesty of LLMs in three key areas.

2025-07-26T14:33:18+00:00 ― 5 min read

Information Retrieval Improving Question-Answering Systems in Corporations

Discover how companies enhance their question-answering systems for better user support.

2025-07-26T12:26:54+00:00 ― 4 min read

Artificial Intelligence Assessing AI's Understanding of Algorithms

A study on how AI comprehends algorithms and their implications.

2025-07-26T11:31:36+00:00 ― 6 min read

Computation and Language Evaluating Cross-Domain Text Classification with Depth

A new metric improves evaluation of text classification models across different domains.

2025-07-26T10:44:12+00:00 ― 7 min read

Computation and Language Data Contamination in Language Models: A Growing Concern

Data contamination affects the evaluation of large language models significantly.

2025-07-26T10:12:36+00:00 ― 5 min read

Computation and Language Evaluating Large Language Models for Ethical Alignment

A new method for assessing whether LLMs align with human values.

2025-07-26T05:12:24+00:00 ― 6 min read

Computer Vision and Pattern Recognition Addressing Bias in AI: The VLBiasBench Approach

A new tool to evaluate biases in large vision-language models.

2025-07-26T01:15:24+00:00 ― 6 min read

Computation and Language Assessing Diversity in Automatic Poetry Generation

A study evaluates how machines create varied and creative poetry compared to humans.

2025-07-25T20:38:54+00:00 ― 6 min read

Computation and Language Evaluating Counter Narratives Against Hate Speech

A new method improves how we assess counter narratives to hate speech.

2025-07-25T20:15:12+00:00 ― 6 min read

Computation and Language Introducing InternLM-Law: A Model for Legal Queries

InternLM-Law enhances responses to diverse Chinese legal questions with advanced training.

2025-07-25T15:30:48+00:00 ― 7 min read

Computation and Language The Role of User Profiles in Language Models

Exploring how user profiles improve personalization in language models.

2025-07-25T14:11:48+00:00 ― 6 min read

Computation and Language Evaluating Model Performance in Understanding Plan Dependencies

Research shows models struggle with step dependencies in cooking recipes.

2025-07-25T11:41:42+00:00 ― 5 min read

Computation and Language A New Way to Evaluate Language Models

This paper presents a method to assess language models across various prompts.

2025-07-25T08:45:12+00:00 ― 6 min read

Computation and Language Evaluating Gender Bias in Language Models Across Regions

A new method addresses regional differences in gender bias evaluation.

2025-07-25T07:13:06+00:00 ― 6 min read

Computation and Language New Dataset Enhances Language Models for Multi-Turn Conversations

M2Lingual dataset improves instruction-following capabilities across various languages.

2025-07-24T23:03:18+00:00 ― 5 min read

Computer Vision and Pattern Recognition A New Approach to Evaluating Text-to-Image Models

This article presents a new method for assessing text-to-image models effectively.

2025-07-24T20:25:18+00:00 ― 6 min read

Computation and Language Evaluating Italian Language Models with INVALSI Tests

This study benchmarks language models' performance using Italian INVALSI tests.

2025-07-24T09:37:30+00:00 ― 7 min read

Computation and Language Advancements in RAG Systems: A New Evaluation Framework

RAGBench introduces a comprehensive dataset for evaluating Retrieval-Augmented Generation systems.

2025-07-24T05:24:42+00:00 ― 6 min read

Computer Vision and Pattern Recognition Evaluating Large Vision-Language Models with Dysca

Dysca introduces a new way to assess LVLM performance using synthetic data.

2025-07-24T03:49:54+00:00 ― 6 min read

Mathematical Software Advancements in Topology Optimization Techniques

A look at modern methods in engineering design for efficiency and performance.

2025-07-23T22:52:08+00:00 ― 7 min read

Computation and Language Advancements in Causal Event Extraction Methods

A new approach improves causal event extraction using human-centered evaluation.

2025-07-23T21:38:36+00:00 ― 5 min read

Machine Learning Evaluating the Impact of Deferring Systems in Machine Learning

Assessing how deferring to human experts affects prediction accuracy in ML models.

2025-07-23T14:11:48+00:00 ― 8 min read

Machine Learning Advancing Bayesian Optimization with Robust Entropy Search

Introducing a new method for better solutions in complex engineering and robotics tasks.

2025-07-23T07:31:16+00:00 ― 6 min read

Computation and Language Evaluating Datasets for Hate Speech Detection

A study assessing the quality of datasets for identifying hate speech online.

2025-07-23T04:07:54+00:00 ― 7 min read

Computation and Language Evaluating Belief Revision in Language Models

A new method measures how language models adapt their beliefs with new evidence.

2025-07-22T18:07:30+00:00 ― 9 min read

Computer Vision and Pattern Recognition Rethinking Evaluation Methods for Multimodal Models

A new benchmark improves evaluation of multimodal models by minimizing biases.

2025-07-22T12:12:00+00:00 ― 6 min read

Artificial Intelligence Assessing LLMs with GraphArena Tool

GraphArena evaluates LLM performance on graph problems using real-world data.

2025-07-22T10:13:30+00:00 ― 6 min read

Discrete Mathematics Fair Credit in Group Projects: A New Approach

Explore a fair method for sharing credit in group projects.

2025-07-21T22:54:06+00:00 ― 6 min read

Computation and Language Evaluating Language Models for Scientific Research

A new benchmark for assessing large language models in hypothesis testing.

2025-07-21T19:52:24+00:00 ― 6 min read

Artificial Intelligence Introducing CRAB: A New Benchmark for Language Models

CRAB enhances testing for language models in real-world environments.

2025-07-21T18:41:18+00:00 ― 6 min read

Information Retrieval Evaluating Information Retrieval Systems in Changing Environments

This article examines the impact of temporal changes on information retrieval system evaluations.

2025-07-21T15:08:00+00:00 ― 5 min read

Computer Vision and Pattern Recognition Addressing Fairness in Medical Imaging Models

Introducing FairMedFM to evaluate the fairness of foundation models in healthcare.

2025-07-21T07:45:36+00:00 ― 6 min read