Learn how to rank opinions and choices fairly using statistical principles.
― 6 min read
A study on assessing NLG systems for accurate medical diagnoses.
― 6 min read
A look at how AI models grasp essential knowledge of the world.
― 6 min read
AdvEval exposes weaknesses in Natural Language Generation evaluation metrics.
― 6 min read
A new framework for evaluating large language models with human insight.
― 8 min read
Learn how seven-valued logic enhances decision-making with multiple criteria.
― 6 min read
A new approach to evaluating biases in automated AI evaluation metrics.
― 6 min read
Evaluating methods for precise control of text features in LLM outputs.
― 13 min read
A new framework assesses language models on emotional intelligence and creativity.
― 7 min read
WeShap values improve data labeling quality for machine learning models.
― 7 min read
A new approach to improve safety assessments of AI systems using diverse perspectives.
― 5 min read
Hierarchical Prompting Taxonomy improves evaluation methods for language models.
― 6 min read
A study on using LLMs to judge other LLMs and its implications.
― 7 min read
IPEval assesses language models' understanding of intellectual property concepts.
― 5 min read
A comprehensive study on language models’ performance across 10 Indic languages.
― 7 min read
New benchmarks improve how we evaluate generated time-lapse videos.
― 6 min read
This article examines methods for assessing text summaries using large language models.
― 7 min read
A new method for assessing text-to-video models focuses on dynamics.
― 6 min read
A new benchmark tackles language model performance worldwide.
― 7 min read
A new method for evaluating the storytelling quality of machine-generated narratives.
― 7 min read
A study on enhancing AI's ability to follow natural language instructions.
― 8 min read
A new scale helps measure user experiences in explainable AI systems.
― 5 min read
A new benchmark assesses language models on scientific coding challenges across multiple fields.
― 5 min read
Introducing a method to assess AI models on unseen data more effectively.
― 6 min read
A toolkit designed for better evaluation of human-bot interactions.
― 5 min read
A new benchmark to evaluate models analyzing music and language.
― 6 min read
A new framework assesses how image models interpret graphical information through channel accuracy.
― 5 min read
A new framework to assess sparse autoencoders through chess and Othello.
― 5 min read
Researchers discuss the impact of LLMs on evaluating information retrieval systems.
― 5 min read
A new approach to assess LLMs with diverse evaluation sets.
― 6 min read
A new approach to assess language models with varied instructions and tasks.
― 6 min read
A look at evaluating trustworthy AI systems and the methods involved.
― 5 min read
This study examines how LLMs assess bug report summaries compared to human evaluators.
― 6 min read
LongGenBench assesses large language models in generating high-quality long text.
― 5 min read
Using IRT for deeper evaluation of computer vision model performance.
― 5 min read
VisScience tests large models on scientific reasoning using text and images.
― 5 min read
This article discusses the challenges and solutions in evaluating grounded question answering models.
― 9 min read
Introducing a dataset to assess the performance of RAG systems in real-world scenarios.
― 5 min read
Michelangelo evaluates language models on their ability to reason through long contexts.
― 4 min read
A tool to assess language models' relevance and appropriateness in Filipino contexts.
― 5 min read