Original essay: papers.ssrn.com/sol3/papers....
New article!
My thoughts on the "slow death of scaling" essay by Sara Hooker
Ok, I'll confess! I too like Roland Emmerich's Godzilla. I even like the creature design in this film!
The one Frankenstein film to rule them all!
Thank you, @realgdt.bsky.social!
The Consciousness API: What if consciousness isn't contained within us, but rather we are temporary antennas, tuning into a vast, universal broadcast of awareness?
In this article, I explore the story behind some of the ideas introduced in the Transformer paper, from the fundamental attention mechanism that lies at its heart to the surprisingly simple explanation for its name.
You may find it interesting!
Link below 👇
We're particularly proud to release Aya Vision 8B - it's compact and efficient, outperforming models up to 11x its size.
Releasing open weights helps to make breakthroughs in VLMs accessible to the research community.
Event on Mozilla AI discord: discord.gg/QTCRfefF?eve...
ProGen paper: www.biorxiv.org/content/10.1...
🧬 Join us this Wednesday on the @mozilla.ai Discord server for the second session of our Biological Representation Learning series, where we discuss landmark papers in the field!
We will be presenting the ProGen protein language model paper from Salesforce. See you there!
📢 Join us on Discord for our first Blueprints Hub event 📢
Discover Blueprints and learn how to transform text into podcast-style conversations using entirely open source tools.
🗓️ Wednesday, Jan. 22nd
⏰ 1:30-2:00 PM EST
📍 Event: discord.gg/BaYFBaeh?eve...
#OpenSource #AI #Blueprints #MozillaAI
As @cohereforai.bsky.social joins the Bluesky family, we will be sharing paper gems from when we first started as a lab.
This paper is part of a larger research agenda focused on better representing the long tail: making AI work for almost all real-world distributions.
Meet Helium-1 preview, our 2B multilingual LLM targeting edge and mobile devices, released under a CC-BY license. Start building with it today!
huggingface.co/kyutai/heliu...
And lastly, big thanks to you for making it this far 🤗, and don't forget to read the paper!
www.dataprovenance.org/Multimodal_D...
11/n
Big thanks to Melissa Heikkilä for featuring our work in MIT Tech Review.
www.technologyreview.com/2024/12/18/1...
Xuhui Zhou, Caiming Xiong, Luis Villa,
@stellaathena.bsky.social, Alex Pentland,
@sarahooker.bsky.social, Jad Kabbara
9/n
An Dinh, Shrestha Mohanty, Deividas Mataciunas,
Tobin South, Jianguo Zhang,
@arielnlee.bsky.social, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester JV Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma
8/n
A big thanks to all the contributors to this huge and magnificent effort. I'm truly honored to have had the chance to work alongside all of you: Manan Dey, Nayan Saxena,
Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Naana Obeng-Marnu, Da Yin, Kun Qian, Yizhi Li, Minnie Liang
7/n
This work was supported by the Mozilla Foundation Data Futures Lab, and was led by: @shaynelongpre.bsky.social, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska,
William Brannon, and Robert Mahari
6/n
4️⃣ Linguistic representation has not improved by most measures: Gini coefficients for text and speech datasets show significant concentration, indicating limited progress in diversifying data sources.
5/n
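As a side note for readers unfamiliar with the measure: a minimal sketch of how a Gini coefficient captures the concentration the thread describes. The counts below are hypothetical, not taken from the audit.

```python
def gini(values):
    """Gini coefficient: 0.0 = perfectly even shares, values near 1 = highly concentrated."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    # Rank-weighted sum identity for the Gini coefficient over sorted values
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n

# Hypothetical token counts per data source
even = [100, 100, 100, 100]
skewed = [1000, 10, 5, 1]
print(gini(even))    # → 0.0 (no concentration)
print(gini(skewed))  # ≈ 0.74 (one source dominates)
```

A coefficient that stays high over time is what "limited progress in diversifying data sources" means here: a few sources keep holding most of the mass.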
3️⃣ Geographical representation has not improved in a decade: datasets from African and South American organizations account for less than 0.2% of content across all modalities, while North American and European organizations account for 93% of text tokens and over 60% of speech and video hours.
4/n
2️⃣ Inconsistent dataset licenses: while ~30% of datasets have permissive licenses, 78%+ of their sources carry hidden anti-crawling or licensing restrictions, making compliance a minefield.
3/n
🔑 Key Findings
1️⃣ The web is still the primary source: the internet, social media platforms, and synthetically generated data are increasingly the predominant sources of multimodal data, displacing curated sources.
2/n
✨ Excited to share our latest work from The Data Provenance Initiative!
This is the most comprehensive audit of multimodal training data to date, covering ~4000 datasets from 1990 to 2024 and more than 400 unique tasks in 608 languages!
🧵 1/n
EPIC!
500!
Our Community Computer Vision Course repo just reached 500 stars on GitHub: github.com/johko/comput... 🤩
I'm really proud of all the amazing content the community has contributed here, and that people keep adding cool and helpful material 💪
The Hudsucker Proxy is the most underrated Coen Brothers film!
Funny thought: if "post-training" refers mostly to supervised instruction-tuning and alignment of a "pre-trained" model, then where does the actual "training" happen?!