A study evaluates language models on handling multiple tasks simultaneously.
― 7 min read
A new benchmark tests LLMs' abilities with structured data formats.
― 6 min read
VCEval offers an automated way to assess online course effectiveness.
― 5 min read
A new benchmark targets compositionality in video understanding and language models.
― 6 min read
A new method enhances testing for language models using real user data.
― 5 min read
The Nemotron-4 340B family delivers powerful models for diverse applications and synthetic data generation.
― 7 min read
Evaluating how language models handle cultural cues in real tasks.
― 7 min read
VideoVista offers a comprehensive evaluation for video question-answering models.
― 5 min read
This article explores methods to enhance the reliability of research artifacts in computing.
― 7 min read
GLM-4 models show improved capabilities in language understanding and generation.
― 8 min read
A study of using LLMs to judge other LLMs, and the implications of doing so.
― 7 min read
A study on how language models generate persuasive rationales for argument evaluation.
― 5 min read
Two new models aim to improve technology access for Galician speakers.
― 5 min read
Examining the difficulties of translating metaphorical language in machine translation.
― 6 min read
DF40 offers a comprehensive approach to improving deepfake detection methods.
― 6 min read
This study assesses the honesty of LLMs in three key areas.
― 5 min read
Discover how companies enhance their question-answering systems for better user support.
― 4 min read
A study of how well AI systems comprehend algorithms, and what that implies.
― 6 min read
A new metric improves evaluation of text classification models across different domains.
― 7 min read
Data contamination affects the evaluation of large language models significantly.
― 5 min read
A new method for assessing LLMs aligns with human values.
― 6 min read
A new tool to evaluate biases in large vision-language models.
― 6 min read
A study evaluates how machines create varied and creative poetry compared to humans.
― 6 min read
A new method improves how we assess counter-narratives to hate speech.
― 6 min read
InternLM-Law enhances responses to diverse Chinese legal questions with advanced training.
― 7 min read
Exploring how user profiles improve personalization in language models.
― 6 min read
Research shows models struggle with step dependencies in cooking recipes.
― 5 min read
This paper presents a method to assess language models across various prompts.
― 6 min read
New method addresses regional differences in gender bias evaluation.
― 6 min read
M2Lingual dataset improves instruction-following capabilities across various languages.
― 5 min read
This article presents a new method for assessing text-to-image models effectively.
― 6 min read
This study benchmarks language models' performance using Italian INVALSI tests.
― 7 min read
RAGBench introduces a comprehensive dataset for evaluating Retrieval-Augmented Generation systems.
― 6 min read
Dysca introduces a new way to assess LVLM performance using synthetic data.
― 6 min read
A look at modern methods in engineering design for efficiency and performance.
― 7 min read
A new approach improves causal event extraction using human-centered evaluation.
― 5 min read
Assessing how deferring to human experts affects prediction accuracy in ML models.
― 8 min read
Introducing a new method for better solutions in complex engineering and robotics tasks.
― 6 min read
A study assessing the quality of datasets for identifying hate speech online.
― 7 min read
A new method measures how language models adapt their beliefs with new evidence.
― 9 min read