Linear Digressions

Benchmarking AI Models

Synopsis

How do you know if a new AI model is actually better than the last one? It turns out answering that question is a lot messier than it sounds. This week we dig into the world of LLM benchmarks — the standardized tests used to compare models — exploring two canonical examples: MMLU, a 14,000-question multiple-choice gauntlet spanning medicine, law, and philosophy, and SWE-bench, which throws real GitHub bugs at models to see if they can fix them. Along the way: Goodhart's Law, data contamination, canary strings, and why acing a test isn't always the same as being smart.