A suburban caveman house
Quit freaking out. Remember that in 10,000 B.C., when America had ZERO international trade, a family could afford a house like this on a single income.
Me showing Claude what I've been working on
i am sick of “more monkeys jumping on the bed” discourse. it’s as though these people have no memory of 2017 when one fell off and bumped his head. doctor spoke out against it, mama endorsed doctor’s findings. i’m limiting replies to followers because i do not have the energy for YMMJOTBers today
Buckle up because we're banging into the new year with my annual retrospective of the last year in databases! Highlights include license change blowback, Databricks vs. Snowflake gangwar, @duckdb.org's shotgun weddings, and buying a quarterback to impress your lover: www.cs.cmu.edu/~pavlo/blog/...
The new family Christmas Eve tradition: watching Verandah Santa and The Sign episodes from Bluey!
See you at AWS re:Invent next week! If you're in Vegas happy to catch up on anything data curation related!
Words have no meaning anymore.
I am excited about the release of our results on web-scale text data curation @datologyai.com. Our curation pipeline transforms the RedPajama V1 dataset into the DAIT dataset which outperforms the best publicly-available pretraining datasets for training LLMs better, faster, smaller.
Tired: Bringing up politics at Thanksgiving
Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving
That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
If you're interested in Data-Centric AI, follow The DatologyAI Starter Pack for damn-good data memes and occasional data curation insights: go.bsky.app/NJ9sTot
Amazon S3 just grew "append"! It's only available for the more expensive, lower latency S3 Express One Zone bucket class but you can now append data to an object up to 10,000 times - previously you could only atomically replace a whole object with an updated version simonwillison.net/2024/Nov/22/...
This is the most interesting and most impactful data pipeline problem I have ever worked on (and if you know me, you know that’s saying something.)
So happy to be able to share this work with the world! And now it’s time for a little vacation.
🧵 We’ve spent the last few months at @datologyai.bsky.social building a state-of-the-art data curation pipeline and I’m SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥
Web-Scale Data Curation is a frontier challenge - I'm excited to show the progress we've made in just 6 months @datologyai
tl;dr: we've pretrained the most data-efficient and best-in-class CLIP models!
Read on to see how our product powers multimodal data curation
1/n 🧵
Oh finagle from twitter was so good at this!
I think AWS can be a great place to get GPUs from: we get them from them, and we had a great time. It depends on how many and what time frames you are looking at.