Latest Articles for Benchmark

Computation and Language Evaluating Multimodal Large Language Models

New benchmarks reveal challenges for MLLMs in real-world tasks with long contexts.

2025-08-15T10:16:00+00:00 ― 7 min read

Software Engineering Examining Multilingual Bias in Code Generation Models

This article explores the bias in code generation models across different languages.

2025-08-15T03:25:12+00:00 ― 8 min read

Computation and Language Understanding Code Hallucinations in Language Models

An overview of code hallucinations in LLMs and their impact on software development.

2025-08-15T01:58:18+00:00 ― 6 min read

Computer Vision and Pattern Recognition Introducing Wake Vision: A New Dataset for TinyML

Wake Vision enhances person detection for TinyML with a vast dataset.

2025-08-14T17:24:48+00:00 ― 7 min read

Computation and Language Challenges and Opportunities in AI Text Generation Explainability

This paper discusses the need for explainability in AI text generation models.

2025-08-11T02:54:30+00:00 ― 6 min read

Computation and Language Evaluating Toxicity in Multilingual Language Models

New benchmark assesses toxicity in large language models across various languages.

2025-08-10T21:30:36+00:00 ― 7 min read

Computational Finance Using SSD to Build Stronger Portfolios

Learn how second-order stochastic dominance can enhance your investment strategy.

2025-08-09T19:12:57+00:00 ― 6 min read

Artificial Intelligence Evaluating LLMs in Mathematical Modeling with Mamo

A new benchmark assesses LLMs' abilities in mathematical modeling processes.

2025-08-09T14:10:24+00:00 ― 5 min read

Neural and Evolutionary Computing Improving Differential Evolution with GPUs

Exploring how GPUs enhance the efficiency of Differential Evolution algorithms.

2025-08-06T19:01:24+00:00 ― 5 min read

Computer Vision and Pattern Recognition Advancements in Multi-modal Chain-of-Thought Reasoning

New benchmark aims to improve AI understanding of text and images.

2025-08-06T17:50:18+00:00 ― 7 min read

Machine Learning WeiPer: A New Method for OOD Detection

WeiPer improves out-of-distribution detection in machine learning models using weight adjustments.

2025-08-06T07:49:54+00:00 ― 7 min read

Artificial Intelligence Evaluating Large Language Models in Multi-Turn Math Interactions

This study measures LLMs’ performance in complex math dialogues.

2025-08-05T07:12:36+00:00 ― 7 min read

Machine Learning Improving Link Predictions with Clear Explanations

LinkLogic provides clarity and reliability for link prediction in knowledge graphs.

2025-08-03T12:56:42+00:00 ― 6 min read

Computation and Language Advancing Autoformalization with Lean 4

New methods and benchmarks aim to simplify formalizing mathematics through Lean 4.

2025-08-03T08:59:42+00:00 ― 6 min read

Machine Learning LLMs Struggle with Basic Reasoning Tasks

Recent tests reveal LLMs' weaknesses in simple reasoning despite high benchmark scores.

2025-08-02T09:01:54+00:00 ― 5 min read

Machine Learning Dynamic Benchmarks for Evaluating Language Models

A new system assesses language models using real-world data streams.

2025-08-02T01:23:42+00:00 ― 5 min read

Machine Learning Addressing Label Noise in Graph Neural Networks

A new benchmark helps improve GNN performance amid label noise challenges.

2025-08-01T13:01:06+00:00 ― 7 min read

Robotics Bench2Drive: A New Standard for Testing Autonomous Driving Systems

Bench2Drive offers a fair evaluation method for autonomous driving technologies.

2025-08-01T06:02:24+00:00 ― 6 min read

Artificial Intelligence Addressing Ill-Defined Problems in Language Models

New methods improve language models' performance on complex reasoning tasks.

2025-07-31T22:55:48+00:00 ― 7 min read

Computer Vision and Pattern Recognition Evaluating Prompt Performance in Image Generation and Retrieval

A study introduces a new benchmark for prompt performance in creating and retrieving images.

2025-07-31T18:43:00+00:00 ― 10 min read

Machine Learning New Insights into Language Model Scaling Performance

An analysis of existing models shows how language model performance changes as model size increases.

2025-07-31T14:57:12+00:00 ― 8 min read

Machine Learning Evaluating Java Programming Skills of LLMs

A new benchmark assesses LLMs on Java programming tasks.

2025-07-31T06:52:00+00:00 ― 6 min read

Computer Vision and Pattern Recognition Improving Video Captioning with Causal Understanding

A new method creates better video captions by focusing on narratives and causality.

2025-07-31T02:39:12+00:00 ― 5 min read

Cryptography and Security Evaluating the Role of Large Language Models in Vulnerability Detection

A new benchmark tests LLMs' ability to find software vulnerabilities.

2025-07-30T14:48:12+00:00 ― 5 min read

Computation and Language New Benchmark Evaluates Multilingual Language Models

A new benchmark assesses multilingual model performance in semantic retrieval tasks.

2025-07-30T12:18:06+00:00 ― 7 min read

Computer Vision and Pattern Recognition CMC-Bench: A New Standard in Image Compression

Discover how CMC-Bench is transforming image compression techniques.

2025-07-30T02:46:45+00:00 ― 6 min read

Software Engineering DafnyBench: Improving Software Verification with Machine Learning

DafnyBench benchmarks software verification tools, paving the way for reliable programming.

2025-07-29T23:23:54+00:00 ― 5 min read

Computer Vision and Pattern Recognition Evaluating Video Comprehension in Multimodal Language Models

A new benchmark aims to assess MLLMs in video understanding across multiple topics.

2025-07-29T22:20:42+00:00 ― 6 min read

Computer Vision and Pattern Recognition Challenging the Limits of Vision-Language Models

A new benchmark tests compositional reasoning in advanced models.

2025-07-29T19:42:42+00:00 ― 7 min read

Machine Learning Introducing GuardAgents: A New Safety Layer for LLMs

A framework to enhance safety in LLM agents across various applications.

2025-07-29T07:43:48+00:00 ― 7 min read

Computation and Language Evaluating Temporal Reasoning in Large Language Models

A new benchmark assesses how well models understand time and events.

2025-07-29T07:20:06+00:00 ― 6 min read

Machine Learning Measuring Variance in Language Model Benchmarks

This article examines methods to assess variance in language model evaluation benchmarks.

2025-07-28T23:26:06+00:00 ― 7 min read

Computation and Language Advancing AI for Southeast Asia's Languages

SEACrowd aims to improve AI representation for Southeast Asian languages and cultures.

2025-07-28T21:03:54+00:00 ― 7 min read

Computer Vision and Pattern Recognition Advancements in Image Manipulation Detection

A new benchmark helps researchers improve image integrity detection methods.

2025-07-28T11:35:06+00:00 ― 6 min read

Artificial Intelligence Evaluating LLMs with a New Benchmark for Search Problems

A study on improving LLMs' problem-solving abilities using a new framework.

2025-07-28T01:18:54+00:00 ― 7 min read

Machine Learning Advancing Language Model Evaluation Standards

A new method enhances testing for language models using real user data.

2025-07-27T21:06:06+00:00 ― 5 min read

Computation and Language Evaluating Unlearning in Language Models

New methods reveal challenges in unlearning knowledge from language models.

2025-07-27T17:24:54+00:00 ― 6 min read

Computation and Language The Impact of Long-Context Language Models

Long-context language models streamline complex tasks and improve interaction with AI.

2025-07-27T08:59:18+00:00 ― 7 min read

Computation and Language Assessing Reasoning in Language Models

A new benchmark evaluates reasoning skills in language models.

2025-07-26T22:11:30+00:00 ― 7 min read

Databases The Evolution of GPU Databases

Examining the advancements in GPU database technology and their performance.

2025-07-26T19:49:18+00:00 ― 8 min read