“They said it could not be done”. We’re releasing Pleias 1.0, the first suite of models trained on open data (either permissively licensed or uncopyrighted): Pleias-3b, Pleias-1b and Pleias-350m, all based on the two-trillion-token set from Common Corpus.
05.12.2024 16:39
👍 249
🔁 85
💬 11
📌 19
███░░░░░░░░░ ~25% trained
"A painting of a mountain lake with a boat in the foreground, surrounded by lush green grass, trees, and rocks. The sky is filled with white, fluffy clouds, creating a peaceful atmosphere."
06.12.2024 22:28
👍 13
🔁 3
💬 2
📌 0
Great study on misinformation. Just want to point out that this kind of work is impossible without the fair use doctrine. Massive copying, computational analysis, ...
29.11.2024 22:44
👍 33
🔁 12
💬 2
📌 1
Hi, so I've spent the past almost-decade studying research uses of public social media data, e.g. ML researchers using content from Twitter, Reddit, and Mastodon.
Anyway, buckle up, this is about to be a VERY long thread with lots of thoughts and links to papers. 🧵
27.11.2024 15:33
👍 964
🔁 452
💬 59
📌 123
Making a bsky dataset is a bit like breaking glaze. It's in users' best interests to know how easy it is, but they'll hate you for it.
27.11.2024 04:10
👍 2
🔁 0
💬 0
📌 0
Sincerely do not tell anyone in the replies what the fire hose is lmao
15.11.2024 22:14
👍 18
🔁 6
💬 3
📌 0
100%. And I think the challenge is real not because it requires complicated technology, but because both AI orgs and rights holders see opt-outs as a compromise that they'd need to be forced into.
14.11.2024 03:04
👍 2
🔁 0
💬 1
📌 0