#CommonCrawl

Latest posts tagged with #CommonCrawl on Bluesky

Posts tagged #CommonCrawl

Idea for a new prize: big LLM maker segments its training data (or maybe even just #CommonCrawl) by originating person, runs DataSHAP on the segments, gives a prize to the highest scorer.

I have no idea how to think about who it would be.

winbuzzer.com/2026/02/15/p...

Publishers Block Internet Archive Over AI Scraping Fears

#AI #WaybackMachine #InternetArchive #Google #Reddit #OpenAI #BigTech #TheNewYorkTimes #NewsPublishers #AIScraping #OpenWeb #CommonCrawl #PerplexityAI #Media

Text Shot: A recent article in The Atlantic (“The Nonprofit Doing the AI Industry’s Dirty Work,” November 4, 2025) makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has “lied to publishers” about our activities.

This allegation is untrue. It misrepresents both how Common Crawl operates and the values that guide our work.

Common Crawl - Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good commoncrawl.org/blog/setting-t… #AI #CommonCrawl #data #WebArchiving (wow, that Atlantic piece was bad, needing this rebuttal)

Common Crawl defends archive practices amid deletion claims
Nonprofit Common Crawl issued November 4 statement defending data collection methods, citing technical constraints preventing content deletion.

Common Crawl defends archive practices amid deletion claims #CommonCrawl #DataArchive #Nonprofit #DigitalPreservation #DataCollection

The Nonprofit Doing the AI Industry’s Dirty Work -- The web archive Common Crawl has been quietly funneling paywalled articles to AI companies—and lying to publishers about it. #AI #CommonCrawl #TheAtlantic

www.theatlantic.com/technology/2...

The Company Quietly Funneling Paywalled Articles to AI Developers
“You shouldn’t have put your content on the internet if you didn’t want it to be on the internet,” Common Crawl’s executive director says.

A non-profit has built a massive #internet database—and served training data to #AI firms despite pleas from publishers to stop, Alex Reisner reports.
Generative AI in its current form would probably not be possible without #CommonCrawl

www.theatlantic.com/technology/2...

Common Crawl supplies paywalled content to AI companies despite publisher objections
Nonprofit organization Common Crawl provides major AI companies access to millions of paywalled news articles while claiming compliance with publisher removal requests, investigation reveals.

Common Crawl supplies paywalled content to AI companies despite publisher objections #AI #Journalism #DataEthics #Paywall #CommonCrawl

The Nonprofit Doing the AI Industry’s Dirty Work www.theatlantic.com/technology/2... #tech #AI #CommonCrawl #PrivacyRights #TechRegulation #SiliconValley #BigBrother

#AI
#THEFT
#COPYRIGHT
#COMMONCRAWL

Common Crawl Foundation | Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data | Stanford HAI
Learn about Common Crawl's insights from a recent data product and informed solutions for the future of public web data.

Upcoming Event (October 22nd) Hosted by @stanfordhai.bsky.social: Common Crawl Foundation: Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data
hai.stanford.edu/events/commo... @commoncrawl.bsky.social #AI #commoncrawl #datasets #data

APIG and SEPM secure the removal of press content plundered by AI from Common Crawl - LA LETTRE
The first offensive by newspaper and magazine publishers to protect themselves against the plundering of their content by the AI giants has paid off.

Two trade bodies representing 800 French press titles have secured the removal of their press content from #CommonCrawl, which is used to feed AI systems, among other things. A piece of news worth remembering.
www.lalettre.fr/fr/medias_pr...

Original post on mastodon.online

Several French #MediaCompanies are protesting the unauthorized use of their content by #AI systems.

The focus is particularly on freely accessible databases such as #CommonCrawl, whose content is used to train #LanguageModels.

The #Publishers are demanding the removal […]

What is the Common Crawl Database, and Why should a Site Owner Care?
What Is Common Crawl? It is one of the most influential data sources on the web and most site owners don't even realize their content is in it. So what is it?

It is one of the most influential databases on the web, and most people have never even heard of it, let alone understood how it impacts a site's marketing. (Psst: this is how some of your Google bits are showing up in ChatGPT.)

www.searchengineworld.com/who-what-whe...

#SEO #CommonCrawl #Google #ChatGPT

Original post on paperbay.org

Starting to see (and getting a bit excited about) some components of openwebsearch.eu, and I was wondering if the EU will finally get its own Common Crawl-like dataset (commoncrawl.org).

It seems the crawling results aren't publicly accessible yet, and there's already some discussion about […]

Nearly 12,000 API keys and passwords found in AI training dataset
Close to 12,000 valid secrets that include API keys and passwords have been found in the Common Crawl dataset used for training multiple artificial intelligence models.

Ouch #CommonCrawl

Nearly 12,000 API keys and passwords found in AI training dataset

www.bleepingcomputer.com/news/securit...

#news #tech #technology #AI #CommonCrawl #security #privacy

AI Trekkers 🚀 AIT on LinkedIn: #opendata #aiinnovation #commoncrawl #laion #researchanddevelopment #ai… 📈 AI & Data -- How Open Data is Powering AI and Driving Innovation 📈 Open data is revolutionizing the field of AI and driving innovation across various…

📈 AI & Data -- How Open Data is Powering AI and Driving Innovation 📈

linkedin.com/posts/aitrek...

#OpenData #AIInnovation #CommonCrawl #LAION #ResearchAndDevelopment #AI #ArtificialIntelligence #Data #Innovation #Technology #Business

Everything you wanted to know but were afraid to ask about #CommonCrawl.

Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI https://piv.al/411QdzC

This is an important and approachable read for anyone interested in understanding LLMs.

#AI

#tech bubble: who else has been looking to protect their IP from #CommonCrawl for years now? I'd love to have a chat.
