Hacker News posts about MMLU
- Show HN: Forecaster Arena – Testing LLMs on real events with prediction markets (forecasterarena.com)
- Laws of (New) Media – By Andrew McLuhan (www.a16z.news)
- Super Mario 64 for the PS1 (github.com)
- Illuminating the Insides of Mlx Models (github.com)
- Nature Is Laughing at the AI Build Out (markmaunder.com)
- What happens when the coding becomes the least interesting part of the work (simonwillison.net)
- HTML Kong (2016) (www.xn--8ws00zhy3a.com)
- Bloom: an open source tool for automated behavioral evaluations (www.anthropic.com)
- Preventing agent doom loops (with reasoning traces) (0xmmo.notion.site)
- New medical LLM beats Med-PaLM-2, GPT-4 on MMLU benchmarks (huggingface.co)
- Show HN: Run MMLU benchmark on any LLM endpoint (mmlu.borgcloud.ai)
- Multilingual MMLU Dataset from OpenAI (OpenAI/Mmmlu) (huggingface.co)
- MMLU-Pro: Advanced edition of MMLU & new Leaderboard (huggingface.co)
- Gemini Benchmark – MMLU (compared with GPT-4-turbo, Mixtral) (hub.zenoml.com)
- Multitask Language Understanding (MMLU) on Helm (crfm.stanford.edu)
- OLMo 1.7–7B: A 24 point improvement on MMLU (blog.allenai.org)
- Show HN: BenchFlow – run AI benchmarks as an API (github.com)
- Show HN: Open-source study to measure end user satisfaction levels with LLMs (open-llm-initiative.com)
- Show HN: I built the LLM Comparison Tool I wish existed (llm-stats.com)
- Show HN: LLM Benchmarking Suite (github.com)
- Show HN: Atlas: Independent Evals and Benchmarking for Generative AI Models (app.layerlens.ai)