Idea for a new prize: big LLM maker segments its training data (or maybe even just #CommonCrawl) by originating person, runs DataSHAP on the segments, gives a prize to the highest scorer.
I have no idea how to think about who it would be.
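A rough sketch of what that could look like, assuming "DataSHAP" here means Data Shapley-style valuation estimated by sampling random permutations. Everything below is hypothetical: the contributor names, the `TOY_QUALITY` numbers, and `toy_utility`, which stands in for "train a model on these people's data and measure how good it is". A real run would need a real training-and-evaluation step, which is where the cost lies.

```python
import math
import random
from typing import Callable, Dict, FrozenSet, List


def monte_carlo_data_shapley(
    contributors: List[str],
    utility: Callable[[FrozenSet[str]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate each contributor's Shapley value by averaging marginal gains
    over randomly sampled orderings of the contributors."""
    rng = random.Random(seed)
    totals = {name: 0.0 for name in contributors}
    for _ in range(num_permutations):
        order = contributors[:]
        rng.shuffle(order)
        included = set()
        previous = utility(frozenset())
        for name in order:
            included.add(name)
            current = utility(frozenset(included))
            # Marginal gain from adding this person's segment to the data so far.
            totals[name] += current - previous
            previous = current
    return {name: total / num_permutations for name, total in totals.items()}


# Hypothetical per-person "usefulness" of each data segment (made-up numbers).
TOY_QUALITY = {"alice": 3.0, "bob": 1.0, "carol": 0.5}


def toy_utility(subset: FrozenSet[str]) -> float:
    # Stand-in for model quality, with diminishing returns as data is added.
    return math.sqrt(sum(TOY_QUALITY[name] for name in subset))


scores = monte_carlo_data_shapley(list(TOY_QUALITY), toy_utility)
winner = max(scores, key=scores.get)
print(winner, scores)
```

The scores always add up to the utility of the full dataset minus the utility of the empty one, so the prize would go to whoever accounts for the largest share of that total.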
Latest posts tagged with #CommonCrawl on Bluesky
winbuzzer.com/2026/02/15/p...
Publishers Block Internet Archive Over AI Scraping Fears
#AI #WaybackMachine #InternetArchive #Google #Reddit #OpenAI #BigTech #TheNewYorkTimes #NewsPublishers #AIScraping #OpenWeb #CommonCrawl #PerplexityAI #Media
Text Shot: A recent article in The Atlantic (“The Nonprofit Doing the AI Industry’s Dirty Work,” November 4, 2025) makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has “lied to publishers” about our activities. This allegation is untrue. It misrepresents both how Common Crawl operates and the values that guide our work.
Common Crawl - Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good commoncrawl.org/blog/setting-t… #AI #CommonCrawl #data #WebArchiving (wow, that Atlantic piece was bad, needing this rebuttal)
Common Crawl defends archive practices amid deletion claims #CommonCrawl #DataArchive #Nonprofit #DigitalPreservation #DataCollection
The Nonprofit Doing the AI Industry’s Dirty Work -- The web archive Common Crawl has been quietly funneling paywalled articles to AI companies—and lying to publishers about it. #AI #CommonCrawl #TheAtlantic
www.theatlantic.com/technology/2...
A non-profit has built a massive #internet database—and served training data to #AI firms despite pleas from publishers to stop, Alex Reisner reports.
Generative AI in its current form would probably not be possible without #CommonCrawl
www.theatlantic.com/technology/2...
Common Crawl supplies paywalled content to AI companies despite publisher objections #AI #Journalism #DataEthics #Paywall #CommonCrawl
The Nonprofit Doing the AI Industry’s Dirty Work www.theatlantic.com/technology/2... #tech #AI #CommonCrawl #PrivacyRights #TechRegulation #SiliconValley #BigBrother
Upcoming Event (October 22nd) Hosted by @stanfordhai.bsky.social: Common Crawl Foundation: Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data
hai.stanford.edu/events/commo... @commoncrawl.bsky.social #AI #commoncrawl #datasets #data
Two trade bodies representing 800 French press titles have obtained the removal of their press content from #CommonCrawl, which is used, among other things, to feed AI systems. A piece of news worth remembering.
www.lalettre.fr/fr/medias_pr...
Several French #MediaHouses are protesting against the unauthorized use of their content by #AI systems.
The focus is particularly on freely accessible databases such as #CommonCrawl, whose content is used to train #LanguageModels.
The #Publishers are demanding the removal […]
It is one of the most influential databases on the web, and most people have never even heard of it, let alone understand how it affects a site's marketing. (Psst: this is how some of your Google bits are showing up in ChatGPT.)
www.searchengineworld.com/who-what-whe...
#SEO #CommonCrawl #Google #ChatGPT
Starting to see (and getting a bit excited about) some components of openwebsearch.eu, and I was wondering if the EU will finally get its own Common Crawl-like dataset (commoncrawl.org).
It seems the crawling results aren't publicly accessible yet, and there's already some discussion about […]
Ouch #CommonCrawl
Nearly 12,000 API keys and passwords found in AI training dataset
www.bleepingcomputer.com/news/securit...
#news #tech #technology #AI #CommonCrawl #security #privacy
📈 AI & Data -- How Open Data is Powering AI and Driving Innovation 📈
linkedin.com/posts/aitrek...
#OpenData #AIInnovation #CommonCrawl #LAION #ResearchAndDevelopment #AI #ArtificialIntelligence #Data #Innovation #Technology #Business
Everything you wanted to know but were afraid to ask about #CommonCrawl.
Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI https://piv.al/411QdzC
This is an important and approachable read for anyone interested in understanding LLMs.
#AI
#tech bubble: who else has been looking to protect their IP from #CommonCrawl for years now? I'd love to have a chat.