OSCAR Project

@oscarproject

The Open Super-large Crawled Aggregated coRpus

22 Followers
7 Following
6 Posts
Joined 24.07.2023

Latest posts by OSCAR Project @oscarproject

Join the OSCAR Project Discord Server! Check out the OSCAR Project community on Discord - hang out with 365 other members and enjoy free voice and text chat.

👀 We're working on many new features for you. Currently we're focusing on improving language identification, so if you want to help or contribute, please join our community 💬 on Discord: https://t.co/toLKAPje4E

10.08.2023 15:50 👍 0 🔁 0 💬 0 📌 0

✨ Colossal OSCAR 1.0 has also been made possible thanks to the continuous support of Inria, ALMAnaCH and Common Crawl. Special thanks for the contributions of @ujj.bsky.social, Rua Ismail, @sobamchan.bsky.social, Sebastian Nagel and Benoît Sagot.

10.08.2023 15:49 👍 0 🔁 0 💬 1 📌 0

Terms of Use – Common Crawl

As Colossal OSCAR 1.0 is based on Common Crawl, our annotations are distributed under the CC0 (Creative Commons Zero) license. For the textual content, however, users agree to the Common Crawl Terms of Use 📄
👉 https://commoncrawl.org/terms-of-use/

10.08.2023 15:46 👍 0 🔁 0 💬 1 📌 0

Colossal OSCAR 1.0 is just a partial annotation of the WET files of 10 Common Crawl snapshots. The original data is included only for convenience, especially for researchers looking for data in lower-resource languages. 🗣️

10.08.2023 15:45 👍 0 🔁 0 💬 1 📌 0

Colossal OSCAR 1.0 is by far our largest release to date, almost 10 times as big as previous releases. We're still working on statistics and documentation, so please bear with us while we finish these for you in the coming days and weeks. 🤓🧑‍🔬📊

10.08.2023 15:44 👍 0 🔁 0 💬 1 📌 0

📣 The OSCAR Project and DFKI are happy to announce the release of Colossal OSCAR 1.0 📚, which is now available on the Hugging Face Hub 🤗 at https://huggingface.co/datasets/oscar-corpus/colossal-oscar-1.0

Colossal OSCAR 1.0 was put together by @pjox.bsky.social as part of the OpenGPT-X collaboration.

10.08.2023 15:44 👍 6 🔁 1 💬 1 📌 2