An open-source datalake to ingest, organize, and efficiently store all data contributions made to GPT4All.
Hosted version: https://api.gpt4all.io
The core datalake architecture is a simple HTTP API (written in FastAPI) that ingests JSON in a fixed schema, performs some integrity checking, and stores it. This JSON is transformed into storage-efficient Arrow/Parquet files and stored in a target filesystem.
- Data is stored on disk / S3 in Parquet files in subdirectories organized by day. These Parquet files have a standardized schema, allowing for easy manipulation in any programming language.
- The input data model can be found here.
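
The ingestion flow described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the datalake's actual code: the field names in `EXAMPLE_SCHEMA`, the helper names `validate_contribution` and `daily_parquet_path`, and the `YYYY-MM-DD` directory naming are all hypothetical. The real input data model and storage layout are defined in the repository.

```python
import json
from datetime import datetime, timezone

# Hypothetical fixed schema: field name -> required Python type.
# The real input data model lives in the repository and may differ.
EXAMPLE_SCHEMA = {
    "source": str,
    "prompt": str,
    "response": str,
    "submitter_id": str,
}


def validate_contribution(raw_json: str) -> dict:
    """Parse a JSON payload and check it against EXAMPLE_SCHEMA."""
    record = json.loads(raw_json)
    if not isinstance(record, dict):
        raise ValueError("payload must be a JSON object")
    missing = sorted(set(EXAMPLE_SCHEMA) - set(record))
    if missing:
        raise ValueError(f"missing fields: {missing}")
    unexpected = sorted(set(record) - set(EXAMPLE_SCHEMA))
    if unexpected:
        raise ValueError(f"unexpected fields: {unexpected}")
    for name, expected_type in EXAMPLE_SCHEMA.items():
        if not isinstance(record[name], expected_type):
            raise ValueError(f"field {name!r} must be {expected_type.__name__}")
    return record


def daily_parquet_path(root: str, received_at: datetime, file_name: str) -> str:
    """Build a path under a per-day subdirectory (assumed YYYY-MM-DD, in UTC)."""
    day = received_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"{root.rstrip('/')}/{day}/{file_name}"
```

In a real deployment, validated records would then be batched and written out as Parquet files at paths like these. That step is omitted here to keep the sketch free of third-party dependencies.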
Nomic AI will provide automatic snapshots of this raw Parquet data. You will be able to interact with the snapshots:
- In their raw exported form.
- In automatic Atlas maps over their raw, cleaned, and curated forms.
- Through downloads where the data has been curated, de-duplicated and cleaned for LLM training/finetuning.
By sending data to the GPT4All-Datalake you agree to the following.
Data sent to this datalake will be used to train open-source large language models and released to the public. There is no expectation of privacy for any data entering this datalake. You can, however, expect attribution. If you attach a unique identifier that associates you as the data contributor, Nomic will retain that identifier in any LLM training runs that it conducts. You will receive credit and public attribution if Nomic releases any model trained on your submitted data. You can also submit data anonymously.
While open-sourced under an Apache-2 License, this datalake runs on infrastructure managed and paid for by Nomic AI. You are welcome to run this datalake on your own infrastructure! We just ask that you also release the underlying data that gets sent into it under the same attribution terms.
- Clone the repository.
- Run `make testenv` to build all Docker images and launch the HTTP server.
- Go to http://localhost/docs to view the API documentation.
- You can run the unit tests with `make test`. Any edits made to the FastAPI app will hot reload.