Hacker News posts about HumanEval

Maincoder-1B – an open 1B-parameter coding model with 76% HumanEval (huggingface.co)

4 points by MainNews 7 months ago | 3 comments
Show HN: S0 Tuning – +23.6pp on HumanEval by tuning state, not weights (github.com)

2 points by jacknotold 4 months ago | 2 comments
Show HN: Aflow achieves 96.2% on HumanEval at 4.55% of GPT-4's cost (twitter.com)

1 point by metagpt almost 2 years ago | discuss
Launch HN: Strata (YC P25) – One MCP server for AI to handle thousands of tools

133 points by wirehack 10 months ago | 66 comments
LLM Benchmark: Frontier models now statistically indistinguishable

7 points by js4ever 7 months ago | 4 comments
Show HN: I built the LLM Comparison Tool I wish existed (llm-stats.com)

7 points by JonathanChavez over 1 year ago | 3 comments
Show HN: Benchmark AI on your actual code (GPT-5, Claude, Grok, Gemini, o3) (codelens.ai)

7 points by codelensai 10 months ago | discuss
Show HN: European Swallow AI – Sonnet-quality coding at $2.60/M tokens (www.europeanswallowai.com)

7 points by joaquim_d 10 months ago | discuss
Diffusion LLM may make most of the AI engineering stack obsolete

3 points by victorpiles99 5 months ago | discuss
Show HN: Loki Mode hit 99.67% SWE-Bench – MAF built a SaaS overnight (github.com)

2 points by slogansand 7 months ago | 5 comments
Q Evaluation Harness: open-source evals for LLMs on q/kdb+ (github.com)

2 points by erfan_mhi 12 months ago | discuss
Show HN: AIBenchy – Independent AI Leaderboard (aibenchy.com)

1 point by XCSme 5 months ago | 1 comment
ARCHE3-7B – Sparse MoE with SmartRouter and Foundation Curriculum Training

1 point by OpenSynapseLabs 4 months ago | discuss
Show HN: Kore – Stack based language where compiler is the reward function (github.com)

1 point by processorx 6 months ago | discuss
Show HN: Atlas: Independent Evals and Benchmarking for Generative AI Models (app.layerlens.ai)

1 point by Arch223 about 1 year ago | discuss
Beat GPT-4o at Python by searching with 100 dumb LLaMAs (modal.com)

4 points by thundergolfer almost 2 years ago | 1 comment
Beat GPT-4o at Python with 100 dumb LLaMAs (modal.com)

1 point by pierremenard almost 2 years ago | discuss
GPT5 is the best coding LLM because other LLMs admit it?

1 point by adinhitlore 11 months ago | 8 comments
Humanely dealing with humungus crawlers (flak.tedunangst.com)

83 points by freediver 11 months ago | 54 comments
Show HN: HumanAlarm – Real people knock on your door to wake you up (humanalarm.com)

38 points by soelost 11 months ago | 56 comments
Valuing Humans in the Age of Superintelligence: HumaneRank (roadtoartificia.com)

10 points by jlaporte over 1 year ago | 35 comments
Humanely Dealing with Humungus Crawlers (flak.tedunangst.com)

9 points by dpassens 11 months ago | discuss
Refactoring Humanely and "Accidental Pomodoro" (melatonin.dev)

6 points by wlll almost 2 years ago | discuss
How to Kill Bugs Humanely (reducing-suffering.org)

5 points by LlamaTrauma over 1 year ago | discuss
Humanely dealing with humungus crawlers (flak.tedunangst.com)

4 points by carlesfe 11 months ago | 1 comment
Humanely Dealing with Humungus Crawlers (flak.tedunangst.com)

3 points by Bogdanp 11 months ago | discuss
Show HN: Steps.org – Humanely Curated AI Prompts for Porn Addiction Recovery (www.steps.org)

2 points by tiagom87 8 months ago | 1 comment
Ask HN: What tech job would let me get away with the least real work possible?

73 points by makemethrowaway 7 months ago | 70 comments
Vivarium: The keeper of a lab's animals stumbles onto a secret [fiction] (jsomers.net)

71 points by jsomers over 1 year ago | 17 comments
Show HN: The $10 coffee that tanked my credit score (cretit.com)

1 point by soelost 10 months ago | 6 comments