- Extract Data (✅)
- Train GPT2
- Build an API for GPT2 and Diffusers (✅ Diffusers part done; GPT2 part still pending)
For the dataset, we used the following sources:
- For English, we extracted the data from the Project Gutenberg website.
  - We used the dataset by mateibejan to extract the txt files.
  - We took a subset of the books listed in that dataset.
- For Tamil, we extracted the data from Siruvarmalar, and used the Oscar/unshuffled_deduplicated_ta dataset to enlarge the corpus and for pretraining.
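
Project Gutenberg txt files usually wrap the book text in license and header material, delimited by `*** START OF ... ***` and `*** END OF ... ***` marker lines. The sketch below shows one way the English txt files could be cleaned before training. It is an assumption about preprocessing, not a description of the pipeline actually used here, and the function name `strip_gutenberg_boilerplate` is hypothetical.

```python
import re

# Marker lines commonly found in Project Gutenberg plain-text files.
# Assumption: files follow the "*** START OF THE/THIS PROJECT GUTENBERG EBOOK ... ***" convention.
_START = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
_END = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END markers.

    If a marker is missing, fall back to the start or end of the file,
    so unmarked text is returned unchanged (apart from trimming whitespace).
    """
    start = _START.search(text)
    end = _END.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end and end.start() >= begin else len(text)
    return text[begin:finish].strip()


if __name__ == "__main__":
    sample = (
        "Header and license notes\n"
        "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "Once upon a time.\n"
        "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "More license text\n"
    )
    print(strip_gutenberg_boilerplate(sample))
```

Falling back to the full text when a marker is missing keeps the function safe to run over a mixed set of files, at the cost of leaving boilerplate in files that use a different marker format.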