Hacker News posts about Benchmarks

V-JEPA 2 world model and new benchmarks for physical reasoning (ai.meta.com)

300 points by mfiguiere 4 days ago | 91 comments
Terminal-Bench: a benchmark for AI agents in terminal environments (www.tbench.ai)

17 points by mikemerrill 27 days ago | 3 comments
Datadog open sources a SOTA time series model and 350M point benchmark (www.datadoghq.com)

13 points by chrisdevs 24 days ago | 1 comment
Comparing Elasticsearch, Tempo, ClickHouse and VictoriaLogs in Tracing Benchmark (victoriametrics.com)

6 points by valyala 5 days ago | discuss
AMD Ryzen AI Max+ Pro 395 Linux Benchmarks: Incredible Performance (www.phoronix.com)

5 points by transpute 27 days ago | 1 comment
Show HN: I built an LLM benchmark for Svelte 5 (github.com)

4 points by khromov 14 days ago | 2 comments
Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents (arxiv.org)

4 points by nobody9999 21 days ago | 1 comment
Benchmarks: OpenCL Kernel Latency ~76x Lower for Lunar Lake with Updated Runtime (www.phoronix.com)

4 points by rbanffy 23 days ago | 1 comment
Comparison of Waymo Crash Rates to Human Benchmarks at 56.7M Miles (arxiv.org)

4 points by PaulHoule 28 days ago | discuss
DeepSeek R1 0528 scored 71% on the aider polyglot coding benchmark (3rd) (twitter.com)

3 points by amrrs 6 days ago | discuss
New MCP-Ready Coding LLM Benchmark Structure (feat. Internet Based on Matrix) (blog.hermesloom.org)

3 points by sigalor 7 days ago | discuss
I'm Open-Sourcing My Custom Benchmark GUI (probablydance.com)

3 points by chmaynard 15 days ago | discuss
Python ASGI Framework Benchmarks (gist.github.com)

3 points by harrisonerd 15 days ago | discuss
Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents (arxiv.org)

3 points by gfto 19 days ago | discuss
HTTP Compliance Benchmark (blog.kourier.io)

3 points by gocp 19 days ago | discuss
Ask HN: Any PDF Benchmarks?

2 points by nnurmanov 3 days ago | 1 comment
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence (arxiv.org)

2 points by badmonster 17 days ago | 1 comment
Measuring AGI: Interactive Reasoning Benchmarks [video] (www.youtube.com)

2 points by danielmorozoff about 21 hours ago | discuss
Embedding Benchmark for Retrieval (huggingface.co)

2 points by fzliu 5 days ago | discuss
Show HN: Which LLM Finds Obscure Knife-Brand URLs Cheapest? (8-Model Benchmark) (new.knife.day)

2 points by p-s-v 11 days ago | discuss
Retrieval Embedding Benchmark (huggingface.co)

2 points by fzliu 12 days ago | discuss
Open-Sourcing my Custom Benchmark GUI (probablydance.com)

2 points by ibobev 13 days ago | discuss
Show HN: Comprehensive Benchmark Suite for Story Visualization (github.com)

2 points by hzwer 14 days ago | discuss
LiveSQLBench: Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks (livesqlbench.ai)

2 points by terelak 16 days ago | discuss
Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents (arxiv.org)

2 points by pontiacbandit8 20 days ago | discuss
I tried to benchmark our network storage, and this happened (www.simplyblock.io)

2 points by panrobo 23 days ago | discuss
AMD vs. Nvidia Inference Benchmark: Performance per Cost / Million Tokens (semianalysis.com)

2 points by _aavaa_ 23 days ago | discuss
An Analysis of Search Benchmark, the Game (jpountz.github.io)

2 points by aravindputrevu 25 days ago | discuss
Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents (arxiv.org)

2 points by Fake4d 26 days ago | discuss
Rust Scripting Languages Benchmark (github.com)

1 point by 9d 7 days ago | 1 comment