Hacker News posts about evals
- The Vanta AI Quality Eval Maturity Model (www.vanta.com)
- Proof of AGI is the impossibility of evals (thewatershed.markpesce.com)
- Pelican, or pelican't? A hint at Claude evals (noperator.dev)
- Zork-bench: An LLM reasoning eval based on text adventure games (www.lowimpactfruit.com)
- SchemaFlow: Agentic Database Change Impact Analysis, SQL Gen and Eval Guardrails (developers.openai.com)
- Agent Judge: Solving Long-Context Evals for Production Agents (www.judgmentlabs.ai)
- Exercises in benchmarking, evals, and experimental design, part 6 (www.patreon.com)
- Realistic Evals, or You're Blind (zozo123.github.io)
- What 50k Runs of a 5-Line Eval Taught Us (code.visualstudio.com)
- Command injection in NLTK collocations via eval() (aydinnyunus.github.io)
- Macro Evals for Agentic Systems (developers.openai.com)
- We had to build new evals for Fable (hex.tech)
- What happens when you run evals on brainrot? (www.scorecard.io)
- Reality: The Final Eval – Vending Bench Eval (www.latent.space)
- Show HN: HermesBench – workflow reliability evals for personal AI agents (verkyyi.github.io)
- Agent evals should feel like real work (www.zohaib.cc)
- Bateschess – Chess Analytics Feeding Stockfish Evals into LLMs (bateschess.com)
- How to Debug AI Agents with Traces and Evals (medium.com)