We're working on many new features for you. Currently we're focusing on improving language identification, so if you want to help or contribute, please join our community 💬 on Discord: https://t.co/toLKAPje4E
✨ Colossal OSCAR 1.0 has also been made possible thanks to the continuous support of Inria, ALMAnaCH and Common Crawl. Special thanks for the contributions of @ujj.bsky.social, Rua Ismail, @sobamchan.bsky.social, Sebastian Nagel and Benoît Sagot.
As Colossal OSCAR 1.0 is based on Common Crawl, our annotations are distributed under the CC0 (Creative Commons Zero) license. For the textual content, however, users agree to the Common Crawl Terms of Use:
https://commoncrawl.org/terms-of-use/
Colossal OSCAR 1.0 is only a partial annotation of the WET files of 10 Common Crawl snapshots. The original data is included only for convenience, especially for researchers looking for data in lower-resource languages. 🗣️
Colossal OSCAR 1.0 is by far our largest release to date, almost 10 times as big as previous releases. We're still working on statistics and documentation, so please bear with us while we finish these for you in the coming days and weeks.
📣 The OSCAR Project and DFKI are happy to announce the release of Colossal OSCAR 1.0, which is now available on the Hugging Face Hub 🤗 at https://huggingface.co/datasets/oscar-corpus/colossal-oscar-1.0
Colossal OSCAR 1.0 was put together by @pjox.bsky.social as part of the OpenGPT-X collaboration.