A study evaluates language models on handling multiple tasks simultaneously.
― 7 min read
A new benchmark tests LLMs' abilities with structured data formats.
― 6 min read
VCEval offers an automated way to assess online course effectiveness.
― 5 min read
A new benchmark targets compositionality in video understanding and language models.
― 6 min read
A new method enhances testing for language models using real user data.
― 5 min read
The Nemotron-4 340B family delivers powerful models for diverse applications and synthetic data generation.
― 7 min read
Evaluating how language models handle cultural cues in real tasks.
― 7 min read
VideoVista offers a comprehensive evaluation for video question-answering models.
― 5 min read
This article explores methods to enhance the reliability of research artifacts in computing.
― 7 min read
GLM-4 models show improved capabilities in language understanding and generation.
― 8 min read
A study of using LLMs to judge other LLMs, and the implications of doing so.
― 7 min read
A study on how language models generate persuasive rationales for argument evaluation.
― 5 min read
Two new models aim to improve technology access for Galician speakers.
― 5 min read
Examining the difficulties of translating metaphorical language in machine translation.
― 6 min read
DF40 offers a comprehensive approach to improving deepfake detection methods.
― 6 min read
This study assesses the honesty of LLMs in three key areas.
― 5 min read
Discover how companies enhance their question-answering systems for better user support.
― 4 min read
A study of how well AI systems comprehend algorithms, and what that implies.
― 6 min read
A new metric improves evaluation of text classification models across different domains.
― 7 min read
Data contamination affects the evaluation of large language models significantly.
― 5 min read
A new method for assessing LLMs aligns with human values.
― 6 min read
A new tool to evaluate biases in large vision-language models.
― 6 min read
A study evaluates how machines create varied and creative poetry compared to humans.
― 6 min read
A new method improves how we assess counter-narratives to hate speech.
― 6 min read
InternLM-Law enhances responses to diverse Chinese legal questions with advanced training.
― 7 min read
Exploring how user profiles improve personalization in language models.
― 6 min read
Research shows models struggle with step dependencies in cooking recipes.
― 5 min read
This paper presents a method to assess language models across various prompts.
― 6 min read
New method addresses regional differences in gender bias evaluation.
― 6 min read
M2Lingual dataset improves instruction-following capabilities across various languages.
― 5 min read
This article presents a new method for assessing text-to-image models effectively.
― 6 min read
This study benchmarks language models' performance using Italian INVALSI tests.
― 7 min read
RAGBench introduces a comprehensive dataset for evaluating Retrieval-Augmented Generation systems.
― 6 min read
Dysca introduces a new way to assess LVLM performance using synthetic data.
― 6 min read
A look at modern methods in engineering design for efficiency and performance.
― 7 min read
A new approach improves causal event extraction using human-centered evaluation.
― 5 min read
Assessing how deferring to human experts affects prediction accuracy in ML models.
― 8 min read
Introducing a new method for better solutions in complex engineering and robotics tasks.
― 6 min read
A study assessing the quality of datasets for identifying hate speech online.
― 7 min read
A new method measures how language models adapt their beliefs with new evidence.
― 9 min read