Hacker News posts about Common Crawl

US publishers tell Common Crawl to stop scraping and delete archive (pressgazette.co.uk)

30 points by thm 28 days ago | 10 comments
Show HN: Infini-News – 1.36B news articles from Common Crawl, queryable in ms (cs2.uni-graz.at)

6 points by ruggsea 6 days ago | 1 comment
Common Crawl maintains a free, open repository of web crawl data (commoncrawl.org)

27 points by doener over 1 year ago | 1 comment
Ask HN: Is Common Crawl used exhaustively by any search engine?

8 points by n1xis10t 8 months ago | 1 comment
Show HN: Help improve language coverage in Common Crawl

8 points by ccgreg about 1 year ago | discuss
Ask HN: Is there a service that offers Common Crawl as an API?

7 points by georgehill about 1 year ago | 3 comments
Cybersecurity Data Extraction from Common Crawl (arxiv.org)

5 points by PaulHoule 4 months ago | discuss
A Change to Common Crawl Dataset Size Reporting (commoncrawl.org)

3 points by ccgreg 3 months ago | 1 comment
Cc-downloader: command-line tool for downloading Common Crawl data via HTTPS (github.com)

2 points by simonpure over 1 year ago | discuss
Publishers Demand Accountability from Common Crawl over Unauthorized Use (www.newsmediaalliance.org)

1 point by thm 2 months ago | 2 comments
Crawlgraph – Backlink lookup using Common Crawl ($99 lifetime) (crawlgraph.com)

1 point by pucilpet about 1 month ago | discuss
Publishers Tell Common Crawl to Stop Unauthorized Scraping (www.mediapost.com)

1 point by jaredwiener about 2 months ago | discuss
I made Common Crawl's 4.4B edges queryable for backlink lookups (crawlgraph.com)

1 point by pucilpet 2 months ago | discuss
Show HN: I built a 50 site sampler from CommonCrawl refreshing every 30 minutes (randcrawl.com)

1 point by whothatcodeguy 5 months ago | discuss
200M unique domains extracted from Common Crawl (zenodo.org)

1 point by networkcat 6 months ago | discuss
Discovering Shopify Domains: A Journey Through Common Crawl Data (alistechtales.substack.com)

1 point by gmays almost 2 years ago | discuss
Common Crawl Statistics Now Available on Hugging Face (commoncrawl.org)

1 point by nceqs3 almost 2 years ago | discuss
Show HN: Randomly discovered websites from the open internet every 60 minutes (randcrawl.com)

4 points by whothatcodeguy 5 months ago | 1 comment
Show HN: Rhiza – easily create shortcuts and add entries to PATH (github.com)

2 points by skardy over 1 year ago | discuss
Humanties Last War

1 point by vlan121 9 months ago | 1 comment
Show HN: hot or not for .ai websites (ratemyaisite.com)

1 point by prolly97 3 months ago | discuss
The Company Quietly Funneling Paywalled Articles to AI Developers (www.theatlantic.com)

33 points by breve 8 months ago | 16 comments
The Nonprofit Feeding the Internet to AI Companies (www.theatlantic.com)

10 points by ForHackernews 8 months ago | 5 comments
The Nonprofit Doing the AI Industry's Dirty Work (www.theatlantic.com)

9 points by kgwgk 8 months ago | 2 comments
The Company Funneling Paywalled Articles to AI Developers (www.theatlantic.com)

3 points by CaptainZapp 8 months ago | discuss
IPv6 Adoption Across the TopK Web Hosts (commoncrawl.org)

2 points by miyuru 4 months ago | discuss
Crawler operators, please stop destroying the commons (lunnova.dev)

7 points by xena about 1 year ago | discuss
Show HN: CrawlerCheck v1.5 – Operators Directory and 25 new AI crawlers (crawlercheck.com)

1 point by bogozi 5 months ago | discuss
Show HN: Free tool to find RSS feeds, even if not linked on the page

152 points by domysee almost 2 years ago | 48 comments
Show HN: Wispbit - Linter for AI coding agents (wispbit.com)

31 points by dearilos 9 months ago | 14 comments
Quick but powerful research for AI agents with data scraping and selenium

8 points by alexvomwald over 1 year ago | 6 comments
Ask HN: Why do we still buy things by browsing catalogs?

6 points by dannythecount 4 months ago | 8 comments