Dieuwke Hupkes

This article examines methods for assessing variance in language model evaluation benchmarks.

2025-07-28T23:26:06+00:00 ― 7 min read

A study of using LLMs to judge other LLMs, and the implications of that practice.

2025-07-27T04:30:42+00:00 ― 7 min read

Data contamination affects both the measured performance of language models and the methods used to evaluate them.

2025-05-29T09:48:09+00:00 ― 6 min read

Are NLI tasks still relevant for testing large language models?

2025-05-14T07:05:20+00:00 ― 6 min read