Hacker News posts about Evals
- Show HN: LLMadness – March Madness Model Evals (llmadness.com)
- Test Evals Are Not Enough (voratiq.com)
- Better practical evals for real-world LLM agents (www.colehoffer.ai)
- Selectively reducing eval awareness and murder in Gemma 3 27B via steering (www.lesswrong.com)
- Eval awareness in Claude Opus 4.6's BrowseComp performance (www.anthropic.com)
- AoE 2 Build Order as an Eval for LLM's (wraitii.github.io)
- We ran 600 agent evals – steering hooks hit 100% accuracy, prompts hit 82% (strandsagents.com)
- Bonsai: Use It Where Eval() Would Be Reckless (danfry1.github.io)
- Quantifying infrastructure noise in agentic coding evals (www.anthropic.com)
- Show HN: Skill Eval – A framework for testing the quality of AI agent skills (blog.mgechev.com)
- Evals Skills for Coding Agents (hamel.dev)
- Show HN: Optimize_anything: A Universal API for Optimizing Any Text Parameter (gepa-ai.github.io)