Latest Articles for Benchmark

Machine Learning Predicting Language Model Performance on Benchmarks

Researchers analyze the predictability of language model performance as training compute scales.

2025-09-18T05:27:54+00:00 ― 6 min read

Computer Vision and Pattern Recognition Examining Backdoor Learning in Deep Neural Networks

A look at backdoor attacks and defenses in deep learning models.

2025-09-14T06:00:24+00:00 ― 6 min read

Software Engineering Evaluating Code Generation Models for Efficiency

This paper assesses the efficiency of code generated by various models.

2025-09-11T17:42:12+00:00 ― 6 min read

Computation and Language Evaluating Language Models with New Benchmark

This article presents a benchmark to assess large language models with complex tasks.

2025-09-11T04:55:54+00:00 ― 6 min read

Artificial Intelligence Evaluating LLMs in Asynchronous Planning Tasks

This study assesses large language models' capabilities in complex planning scenarios.

2025-09-10T23:16:12+00:00 ― 6 min read

Artificial Intelligence Evaluating Robot Behavior Using Video-Language Models

Research examines the use of VLMs to assess robot actions.

2025-09-10T19:19:12+00:00 ― 7 min read

Machine Learning Advancements in Molecular Modeling and Design

Exploring the role of large language models in molecular science.

2025-09-10T18:55:30+00:00 ― 7 min read

Robotics Testing Robots for Unexpected Challenges

Exploring methods to improve robot performance in unpredictable environments.

2025-09-09T02:53:54+00:00 ― 4 min read

Audio and Speech Processing Introducing AV-SUPERB: A New Benchmark for Audio-Visual Models

AV-SUPERB evaluates audio-visual models across a range of tasks.

2025-09-08T22:32:35+00:00 ― 5 min read

Information Retrieval Advances in Long Document Retrieval Models

New tools improve how systems retrieve information from long documents.

2025-09-08T20:26:48+00:00 ― 4 min read

Computation and Language Evaluating Medical AI: A New Benchmark for Med-MLLMs

This benchmark assesses the performance of medical language models in healthcare.

2025-09-07T01:47:12+00:00 ― 7 min read

Computation and Language Event-Level Knowledge Editing: A New Approach

A method to keep AI models updated based on real-world events.

2025-09-06T00:54:06+00:00 ― 6 min read

Computation and Language Assessing Multimodal Language Models on Social Media Tasks

New benchmark tests MLLMs on social media tasks like misinformation and hate speech.

2025-09-05T16:28:30+00:00 ― 10 min read

Robotics Advancing Robot Code Generation with RobotScript

RobotScript enhances how robots execute tasks from natural language.

2025-09-05T03:58:00+00:00 ― 7 min read

Cryptography and Security Detecting Hardware Trojans: New Approaches

A fresh perspective on finding hidden threats in hardware design.

2025-09-03T18:55:06+00:00 ― 5 min read

Artificial Intelligence Improving Reasoning Evaluation in Language Models

New methods aim to better evaluate reasoning skills in AI language models.

2025-09-02T23:25:54+00:00 ― 6 min read

Software Engineering Introducing DyPyBench: A New Python Benchmark Tool

DyPyBench offers a diverse set of projects for dynamic analysis in Python.

2025-09-02T10:15:54+00:00 ― 6 min read

Computation and Language AI Transforming Web Development Through Visual Design

AI's capability to turn designs into code is reshaping web development.

2025-09-01T08:03:48+00:00 ― 8 min read

Software Engineering Evaluating Language Models: The Data Contamination Challenge

Study reveals significant data overlap affecting language model evaluations in code generation.

2025-09-01T02:16:12+00:00 ― 6 min read

Bioinformatics Evaluating Large Language Models for Bio-Image Analysis

Assessing LLM performance through a dedicated benchmark for bio-image analysis.

2025-08-31T18:04:57+00:00 ― 6 min read

Computation and Language Evaluating Language Processing Tools for Better Performance

A new method for assessing language processing tools shows promise for improvement.

2025-08-31T11:31:24+00:00 ― 5 min read

Computer Vision and Pattern Recognition Efficient Evaluation of Pre-trained Object Detectors

A method for assessing the transferability of pre-trained models for object detection.

2025-08-29T09:37:18+00:00 ― 4 min read

Robotics New Benchmark for Robot Learning in Daily Tasks

A resource designed to help robots learn everyday tasks effectively.

2025-08-29T07:46:42+00:00 ― 6 min read

Computation and Language Evaluating Large Language Models in Decision-Making

A look at assessing the decision-making capabilities of large language models.

2025-08-29T02:44:12+00:00 ― 7 min read

Computation and Language Enhancing NLP for Diverse Dialects

A framework to improve NLP performance across various language dialects.

2025-08-28T20:51:00+00:00 ― 4 min read

Machine Learning New Benchmark Reveals Limitations of Vision Language Models

A fresh benchmark uncovers strengths and weaknesses of VLLMs in multimodal tasks.

2025-08-28T00:50:12+00:00 ― 6 min read

Computational Physics Monte Carlo Computational Summit: Advancing Simulation Techniques

Experts gather to discuss Monte Carlo simulations and GPU enhancements.

2025-08-27T20:09:15+00:00 ― 6 min read

Software Engineering Challenging Code Generation Models with New Benchmarks

New benchmarks reveal strengths and weaknesses of coding language models.

2025-08-25T06:36:30+00:00 ― 3 min read

Computation and Language Introducing Meerkat-7B: A New Era in Medical AI

Meerkat-7B sets a new standard for open-source medical language models.

2025-08-24T01:22:42+00:00 ― 6 min read

Computer Vision and Pattern Recognition Advancements in Video Summarization Techniques

New methods improve video summarization using large datasets and advanced models.

2025-08-22T11:11:42+00:00 ― 7 min read

Computation and Language Improving Long Text Comprehension in Language Models

Research reveals challenges LLMs face in understanding long texts and proposes new benchmarks.

2025-08-21T09:07:30+00:00 ― 6 min read

Hardware Architecture Performance Monitoring Unit for RISC-V in Space Applications

Exploring the design and benefits of a PMU for RISC-V processors used in space.

2025-08-21T07:56:24+00:00 ― 5 min read

Software Engineering Analyzing Code Generation Benchmarks for Quality Issues

This study examines quality problems in prompts for code generation models.

2025-08-19T17:45:24+00:00 ― 4 min read

Computer Vision and Pattern Recognition Evaluating Visual Perception in Language Models

A new benchmark reveals gaps in the visual understanding of large language models.

2025-08-18T12:23:42+00:00 ― 7 min read

Computation and Language Evaluating the Accuracy of Large Vision-Language Models

A new benchmark improves how we assess LVLMs and their accuracy.

2025-08-17T06:46:12+00:00 ― 5 min read

Logic in Computer Science CHC-COMP 2023: Evaluating Constrained Horn Clause Solvers

The CHC competition showcased advances in solvers and their applications in program verification.

2025-08-17T00:50:42+00:00 ― 6 min read

Computation and Language Challenges in Interpreting Indirect Responses

This article explores how to improve the understanding of indirect answers.

2025-08-16T21:56:54+00:00 ― 5 min read

Computation and Language Advancing Few-Shot Learning for Polish Language Tasks

A study evaluating few-shot learning methods for Polish language classification.

2025-08-15T22:38:36+00:00 ― 4 min read

Computation and Language Introducing PatentGPT: Specialized LLMs for Intellectual Property

PatentGPT models are designed to address unique challenges in Intellectual Property.

2025-08-15T17:38:24+00:00 ― 4 min read

Software Engineering Evaluating Smart Contract Security Tools

A study on the effectiveness of SAST tools for smart contracts.

2025-08-15T17:30:30+00:00 ― 8 min read