A new benchmark assesses how well models verify financial claims in complex documents.
― 7 min read
Cutting-edge science explained simply
ChemSafetyBench tests chatbots on chemical safety and knowledge.
― 6 min read