This article examines methods to assess variance in language model evaluation benchmarks.
― 7 min read
Cutting-edge science explained simply
A study on using LLMs to judge other LLMs and its implications.
― 7 min read
Data contamination impacts the performance of language models and evaluation methods.
― 6 min read
Are NLI tasks still relevant for testing large language models?
― 6 min read