Hacker News posts about Evals

Show HN: Agent-evals – Claude skill to build your own evals (github.com)

9 points by sauercrowd 2 days ago | 1 comment
Zork-bench: An LLM reasoning eval based on text adventure games (www.lowimpactfruit.com)

5 points by mnky9800n 13 days ago | discuss
Gbrain-Evals (github.com)

4 points by mjtk 13 days ago | 1 comment
AI evals are becoming the new compute bottleneck (huggingface.co)

4 points by gmays 2 days ago | discuss
Product Evals in Three Simple Steps (eugeneyan.com)

3 points by eigenBasis 9 days ago | discuss
I built a multi-turn clinical safety eval framework for LLMs (medium.com)

3 points by deepikaa_s 19 days ago | discuss
Show HN: Legal Action Boundary Eval for agentic legal workflows (github.com)

2 points by kankouadio_vx 14 days ago | 2 comments
Cyborg Evals (www.lesswrong.com)

2 points by frmsaul 6 days ago | 1 comment
Dolibarr 23.0.0: PHP eval() whitelist bypass → RCE via two bugs (CVE-2026-22666) (jivasecurity.com)

2 points by jiva 23 days ago | 1 comment
Show HN: AHD – an open-source linter and eval framework for AI-generated UI (ahd.adastra.computer)

2 points by HereticLocke 1 day ago | discuss
Evals Skills for AI Agents (github.com)

2 points by paulaq 3 days ago | discuss
Zork-bench: An LLM reasoning eval based on text adventure games (www.lowimpactfruit.com)

2 points by nicholasjbs 9 days ago | discuss
Show HN: Phinite – The OS layer (eval, observe, govern, A2A native) (www.phinite.ai)

2 points by PhiniteAI 11 days ago | discuss
Show HN: Claude Code skills for building LLM evals (github.com)

2 points by paulaq 13 days ago | discuss
Show HN: FieldOps-Bench an open eval for physical-world AI agents (www.camerasearch.ai)

1 point by Aeroi 15 days ago | 1 comment
LLM-eval-kit: Distributed LLM evaluation framework (v0.3.0) (github.com)

1 point by benmeryem_ai 5 days ago | discuss
Task-Specific LLM Evals That Do and Don't Work (eugeneyan.com)

1 point by eigenBasis 6 days ago | discuss
AI evals are becoming the new compute bottleneck (huggingface.co)

1 point by ibobev 6 days ago | discuss
Build AI evals from real failures (latitude.so)

1 point by paulaq 17 days ago | discuss
Code → Eval → HLD → LLD → Code (p10q.com)

1 point by tmsh 18 days ago | discuss
Bootstrapping AI Evals from Context (Why 'Just Asking Claude' Fails) (scorable.ai)

1 point by Arimbr 20 days ago | discuss
Coding evals are broken. CI is green while AI code quality goes unmeasured (www.stet.sh)

1 point by bisonbear 21 days ago | discuss
Show HN: 2500 vision benchmarks / evals for Vision Language Models (github.com)

1 point by zakariaelhjouji 28 days ago | discuss
Show HN: Nyx – multi-turn, adaptive, offensive testing harness for AI agents (fabraix.com)

20 points by zachdotai 17 days ago | 8 comments
Show HN: Spec27 – Spec-driven validation for AI agents (www.spec27.ai)

13 points by njyx 7 days ago | 9 comments
Show HN: Netlify for Agents (netlify.ai)

13 points by bobfunk 14 days ago | 4 comments
Show HN: I forced Claude to play Tetris in Emacs (imgur.com)

13 points by iLemming 26 days ago | 3 comments
I built a hiring platform that watches engineers work in a real CAD tool

7 points by mind_uncapped 10 days ago | discuss
Show HN: I built a way to see if your SDK is AI-friendly

4 points by nguyenhu 9 days ago | discuss
Show HN: API Ingest – Agentic Search (Inter) API Docs (github.com)

3 points by mohidbutt 14 days ago | 2 comments