Hacker News posts about MMLU

WOS: a Rust ARM64 kernel from scratch with MMU and GICv2 working (github.com)

2 points by ulrichxtan 29 days ago | 3 comments
US holds off blacklisting DeepSeek, more than 100 firms deemed security risks (www.reuters.com)

537 points by giuliomagnifico 20 days ago | 603 comments
Robust Jobserver (codeberg.org)

3 points by birdculture 15 days ago | discuss
Local AI Is Not Ready for Coding. Yet? (mmlac.com)

3 points by speckx 20 days ago | discuss
City counsellors under fire for AI Orange Line map [video] (www.youtube.com)

2 points by functionmouse 8 days ago | discuss
GraphRAG – a knowledge graph LLMs can traverse and write back to (github.com)

2 points by mmkumar 29 days ago | discuss
Hey Nico, you didn't vibe code your data room but stole it from Papermark (twitter.com)

620 points by mmunj 12 days ago | 291 comments
Show HN: Crew – Let Claude Code agents talk to each other (github.com)

4 points by mmoustafa 3 days ago | 2 comments
Corgi makes things worse, claims Postmark is overcharging (despite being Free) (twitter.com)

3 points by mmunj 10 days ago | discuss
Show HN: Finding signal in noisy street recording over 2 weeks (amlucas.github.io)

3 points by amlucas 13 days ago | discuss
New medical LLM beats Med-PaLM-2, GPT-4 on MMLU benchmarks (huggingface.co)

16 points by samjulien almost 2 years ago | 2 comments
Show HN: Run MMLU benchmark on any LLM endpoint (mmlu.borgcloud.ai)

2 points by lostmsu about 1 year ago | discuss
LLM Comparison/Test: 25 SOTA LLMs (Including QwQ) Through 59 MMLU-Pro CS Runs (huggingface.co)

2 points by ororm over 1 year ago | discuss
Multilingual MMLU Dataset from OpenAI (OpenAI/Mmmlu) (huggingface.co)

2 points by ekojs almost 2 years ago | discuss
Graph-based multi-agents smash long-context benchmarks–89% MMLU-Pro on 8B models (github.com)

1 point by FeelTheAGI2 5 months ago | 1 comment
iAsk Pro LLM achieves 93.89% on MMLU, beats GPT-4o and Claude 3.5 Sonnet (iask.ai)

1 point by fariszr almost 2 years ago | discuss
Show HN: AutoThink – Boosts local LLM performance with adaptive reasoning

397 points by codelion about 1 year ago | 68 comments
Show HN: Arch-Router – 1.5B model for LLM routing by preferences, not benchmarks

66 points by adilhafeez about 1 year ago | 15 comments
Launch HN: General Instinct (YC P26) – Frontier models on edge devices

63 points by guanming0717 about 1 month ago | 16 comments
Show HN: BenchFlow – run AI benchmarks as an API (github.com)

24 points by xdotli over 1 year ago | 1 comment
Show HN: QwQ-32B APIs – o1 like reasoning at 1% the cost

17 points by ozgune over 1 year ago | 3 comments
Show HN: Open-source study to measure end user satisfaction levels with LLMs (open-llm-initiative.com)

12 points by sparacha almost 2 years ago | 2 comments
LLM Benchmark: Frontier models now statistically indistinguishable

7 points by js4ever 7 months ago | 4 comments
Show HN: I built the LLM Comparison Tool I wish existed (llm-stats.com)

7 points by JonathanChavez over 1 year ago | 3 comments
Show HN: Flint – A 30B model fine-tuned for less repetition (springboards.ai)

6 points by thmsmxwll 3 months ago | 2 comments
Show HN: Claude Code 2.0 router – preference-aligned routing to multiple LLMs (github.com)

4 points by adilhafeez 9 months ago | 1 comment
Show HN: Forecaster Arena – Testing LLMs on real events with prediction markets (forecasterarena.com)

4 points by setrf 7 months ago | discuss
Show HN: 1.5B LLM routing model that aligns to preferences, not leaderboards (huggingface.co)

4 points by honorable_coder 12 months ago | discuss
Show HN: MALLM – A Multi-Agent Framework for Task Solving (github.com)

4 points by jpwahle about 1 year ago | discuss
Diffusion LLM may make most of the AI engineering stack obsolete

3 points by victorpiles99 4 months ago | discuss
Show HN: Model-literals, model-aliases, and preference-aligned routing for LLMs (docs.archgw.com)

2 points by honorable_coder 10 months ago | discuss
Show HN: LLM Benchmarking Suite (github.com)

2 points by Dhyaneesh over 1 year ago | discuss
Show HN: OptiLLMBench – Test how inference optimization tricks scale up LLMs

2 points by codelion over 1 year ago | discuss
Show HN: AIBenchy – Independent AI Leaderboard (aibenchy.com)

1 point by XCSme 5 months ago | 1 comment
Show HN: MarginDash – See which AI customers are profitable (margindash.com)

1 point by gdhaliwal23 5 months ago | 1 comment
Ask HN: Are LLMs getting better, how can you tell?

1 point by ahamilton454 over 1 year ago | 1 comment
ARCHE3-7B – Sparse Moe with SmartRouter and Foundation Curriculum Training

1 point by OpenSynapseLabs 3 months ago | discuss
Show HN: Mafia Arena – LLMs play social deduction games against each other (mafia-arena.com)

1 point by mohsen1 6 months ago | discuss
Show HN: Arch-Router – Aligning LLM Routing with Human Preferences (arxiv.org)

1 point by honorable_coder 11 months ago | discuss