GraphArena evaluates LLM performance on graph problems using real-world data.
― 6 min read
Explore a fair method for sharing credit in group projects.
― 6 min read
A new benchmark for assessing large language models in hypothesis testing.
― 6 min read
CRAB enhances testing for language models in real-world environments.
― 6 min read
This article examines the impact of temporal changes on information retrieval system evaluations.
― 5 min read
Introducing FairMedFM to evaluate the fairness of foundation models in healthcare.
― 6 min read
New dataset enhances Arabic language model performance and fosters effective communication.
― 6 min read
Studying how quantization affects performance in different languages.
― 5 min read
Exploring machine learning models and new datasets for improved security.
― 7 min read
A new benchmark addresses challenges in code retrieval for developers.
― 6 min read
New methods enhance the trustworthiness of text generated by language models.
― 4 min read
A tool to identify misleading answers from large language models.
― 6 min read
Discover the importance and challenges of assessing LLM performance effectively.
― 5 min read
A look into foundation model leaderboards and their evaluation issues.
― 6 min read
The study reveals the bias in AI evaluation tools favoring longer responses.
― 4 min read
A new approach enhances the accuracy of language model evaluations.
― 7 min read
A new method for selecting diverse languages in natural language processing research.
― 6 min read
A new benchmark assesses the temporal reasoning abilities of large language models.
― 5 min read
Innovative approach to create effective acquisition functions for Bayesian optimization.
― 6 min read
A new dataset enhances accuracy in evaluating story summaries generated by language models.
― 5 min read
A new method to assess data analytics agents for better business insights.
― 5 min read
A challenge to enhance robots' understanding of human interactions.
― 6 min read
A new framework aims to automate paper reviews for better quality feedback.
― 7 min read
Introducing DictaLM 2.0 and DictaLM 2.0-Instruct for improved Hebrew language processing.
― 6 min read
This study examines how well models represent diverse cultures.
― 8 min read
A project focused on improving story generation in Arabic using advanced models.
― 6 min read
A fresh approach to assessing large language models for better performance insights.
― 5 min read
Research presents new methods for evaluating speech recognition systems in Polish.
― 6 min read
Discover how synthetic data helps retailers protect customer privacy while gaining insights.
― 6 min read
DocBench benchmarks LLM-based systems for reading and responding to various document formats.
― 4 min read
A framework to assess LLMs' abilities in data-related tasks with code interpreters.
― 5 min read
Examining the impact of LLMs on social stereotyping and ways to improve outcomes.
― 5 min read
This study proposes a novel evaluation method for video-text comprehension.
― 6 min read
Analyzing the importance and difficulties of assessing multimodal AI models.
― 6 min read
A new dataset to improve question answering performance using long, human-crafted responses.
― 6 min read
Phi-3 models focus on safety and aligning with human values.
― 6 min read
Examining issues with large language models in predicting missing list items.
― 6 min read
A study comparing AI models and human evaluations of scientific summaries.
― 5 min read
A new benchmark assesses language models on scientific coding challenges across multiple fields.
― 5 min read
Check-Eval uses checklists to improve text quality evaluation.
― 6 min read