@oscar-project

OSCAR

The Open Super-large Crawled Aggregated coRpus

The OSCAR project (Open Super-large Crawled Aggregated coRpus) is an open source project that aims to provide web-based multilingual resources and datasets for Machine Learning (ML) and Artificial Intelligence (AI) applications. The project focuses specifically on providing large quantities of unannotated raw data of the kind commonly used in the pre-training of large deep learning models. The OSCAR project has developed high-performance data pipelines specifically designed to classify and filter large amounts of web data. The project has also paid special attention to improving the data quality of web-based corpora and to providing data for low-resource languages, so that these new ML/AI technologies are accessible to as many communities as possible.

Data is distributed by language in both original and deduplicated form. There are currently 166 different languages available. If you use OSCAR, please consider giving us some feedback by writing to our mail address. Please also consider citing our papers.

If you want to contribute to OSCAR, please open a pull request!

Since 2019, the OSCAR Project has been funded by Inria (project-team ALMAnaCH) and the PRAIRIE institute. Starting in 2023, DFKI and the German Federal Ministry for Economic Affairs and Climate Action (BMWK), through the OpenGPT-X project, joined Inria, ALMAnaCH and the PRAIRIE institute in providing funding for the OSCAR Project. During 2022 and at the beginning of 2023, OSCAR was also briefly funded by the University of Mannheim.

If you are interested in OSCAR and would like to access the corpus, send us an email at our mail address, with "OSCAR Access Request" as the mail title. Please include your name, last name, affiliation, contact details, the languages you need and a brief description of how you intend to use OSCAR.
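
As an illustration only, the access request described above could be drafted with Python's standard `email` library. This is a minimal sketch: the helper name `build_access_request`, the field labels in the body, and the placeholder address `oscar-team@example.org` are assumptions made for the example and are not the project's real contact address or a required format. The sketch only builds the message; it does not send it.

```python
from email.message import EmailMessage


def build_access_request(name, last_name, affiliation, contact,
                         languages, intended_use,
                         to_address="oscar-team@example.org"):
    """Build (but do not send) an OSCAR access request email.

    `to_address` is a placeholder; replace it with the project's
    actual mail address before sending.
    """
    msg = EmailMessage()
    msg["To"] = to_address
    # The mail title requested by the project goes in the Subject header.
    msg["Subject"] = "OSCAR Access Request"
    # The body lists the items the project asks applicants to include.
    body = "\n".join([
        f"Name: {name}",
        f"Last name: {last_name}",
        f"Affiliation: {affiliation}",
        f"Contact details: {contact}",
        f"Languages needed: {', '.join(languages)}",
        f"Intended use: {intended_use}",
    ])
    msg.set_content(body)
    return msg


if __name__ == "__main__":
    # Hypothetical applicant details, for demonstration only.
    request = build_access_request(
        "Ada", "Lovelace", "Example University", "ada@example.org",
        ["fr", "de"], "Pre-training a small language model",
    )
    print(request)
```

The printed message can be pasted into any mail client; the code itself is not tied to a particular mail server.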

Grab the latest OSCAR release here! 🚀

Join our Discord community here! 💬

Pinned

  1. ungoliant Public

    🕷️ The pipeline for the OSCAR corpus

    Rust 162 14

  2. oscar-tools Public

    The original tooling for the OSCAR corpus rewritten in Rust

    Rust 5 3

  3. oscar-website Public

    The website of the Oscar Project

    TeX 10 14

Repositories

Showing 10 of 16 repositories
  • documentation
    3 Apache-2.0 4 9 0 Updated Oct 2, 2024
  • oscar-statistics Public

    Compute statistics for OSCAR Monthly releases

    Rust 2 Apache-2.0 0 0 0 Updated Sep 2, 2024
  • oscar-io Public

    Readers/Writers for OSCAR Corpus

    Rust 0 Apache-2.0 1 2 1 Updated Apr 16, 2024
  • oscar-tools Public

    The original tooling for the OSCAR corpus rewritten in Rust

    Rust 5 Apache-2.0 3 9 6 Updated Dec 19, 2023
  • ungoliant Public

    🕷️ The pipeline for the OSCAR corpus

    Rust 162 Apache-2.0 14 18 (1 issue needs help) 13 Updated Dec 18, 2023
  • ut1-rs Public

    ut1-blocklist rust library

    Rust 1 MIT 0 1 0 Updated Nov 9, 2023
  • oscar-website Public

    The website of the Oscar Project

    TeX 10 Apache-2.0 14 2 0 Updated Nov 9, 2023
  • .github Public
    0 1 0 1 Updated Aug 8, 2023
  • OSCAR-CommonCrawl-Collab
    Jupyter Notebook 2 0 2 0 Updated Mar 12, 2023
  • download_oscar Public

    Downloading all files of a language from the OSCAR (Open Super-large Crawled Aggregated coRpus)

    Python 10 MIT 2 0 1 Updated Feb 28, 2023
