Hacker News posts about evals
- Zork-bench: An LLM reasoning eval based on text adventure games (www.lowimpactfruit.com)
- Gbrain-Evals (github.com)
- AI evals are becoming the new compute bottleneck (huggingface.co)
- Product Evals in Three Simple Steps (eugeneyan.com)
- Cyborg Evals (www.lesswrong.com)
- Show HN: AHD – an open-source linter and eval framework for AI-generated UI (ahd.adastra.computer)
- Evals Skills for AI Agents (github.com)
- Show HN: Phinite – The OS layer (eval, observe, govern, A2A native) (www.phinite.ai)
- Show HN: Claude Code skills for building LLM evals (github.com)
- Show HN: FieldOps-Bench an open eval for physical-world AI agents (www.camerasearch.ai)
- Task-Specific LLM Evals That Do and Don't Work (eugeneyan.com)
- Build AI evals from real failures (latitude.so)
- Code → Eval → HLD → LLD → Code (p10q.com)
- Show HN: Spec27 – Spec-driven validation for AI agents (www.spec27.ai)
- Show HN: Netlify for Agents (netlify.ai)
- Show HN: I forced Claude to play Tetris in Emacs (imgur.com)