Hacker News posts about Common Crawl
- Show HN: CrawlerCheck v1.5 – Operators Directory and 25 new AI crawlers (crawlercheck.com)
- Large language model data pipelines and Common Crawl (blog.christianperone.com)
- Common Crawl maintains a free, open repository of web crawl data (commoncrawl.org)
- Publishers Target Common Crawl in Fight over AI Training Data (www.wired.com)
- Common Crawl May/June 2024 Newsletter (commoncrawl.org)
- 200M unique domains extracted from Common Crawl (zenodo.org)
- Discovering Shopify Domains: A Journey Through Common Crawl Data (alistechtales.substack.com)
- Common Crawl Statistics Now Available on Hugging Face (commoncrawl.org)
- The Company Quietly Funneling Paywalled Articles to AI Developers (www.theatlantic.com)
- The Nonprofit Feeding the Internet to AI Companies (www.theatlantic.com)
- The Nonprofit Doing the AI Industry's Dirty Work (www.theatlantic.com)
- The Company Funneling Paywalled Articles to AI Developers (www.theatlantic.com)
- Legality of Publishing Web Crawls (2020) (skeptric.com)
- Crawler operators, please stop destroying the commons (lunnova.dev)
- Show HN: Wispbit - Linter for AI coding agents (wispbit.com)
- Show HN: Wispbit – Keep codebase standards alive (wispbit.com)
- Show HN: Copy from URL – A tiny tool to bypass robots.txt for AI chatbot (copyfromurl.com)
- Show HN: Skillz – Use Claude Skills Anywhere (github.com)