AutoPureData

Automated Filtering of Undesirable Web Data to Update LLM Knowledge

Created by Praneeth Vadlapati (@prane-eth)

Note

Please star ⭐ the repository to show your support.

Why AutoPureData?

LLMs (Generative AI) such as ChatGPT do not have up-to-date information. They are not updated automatically with the latest data because the web contains a lot of unsafe or unwanted text.

This project automatically collects web data and filters out unwanted text using AI and LLMs. The auto-filtered data can then be used to update the knowledge of LLMs automatically.

What is filtered:

Unsafe content ☣️: Toxic, threat, insult, discrimination, political, self-harm, religious, violence, sexual, profanity, flirtation, spam, scam, misleading, and more
Content from unreliable sources 📰: Unsafe websites and unindexed domains (that are not crawled by search engines)
Personal details 👤: Phone, address, credit card, SSN, IP address, and more
Attacks 🛡️: Adversarial attack attempts (with Data Poisoning)
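
To make the "Personal details" category above more concrete, here is a minimal sketch of how rows containing such details could be flagged. It is an illustration only: the project describes filtering with AI and LLMs, whereas this sketch swaps in plain regular expressions, and the names PII_PATTERNS, flag_personal_details, and keep_row are hypothetical and not taken from the repository.

```python
import re

# Hypothetical patterns for illustration only. They are deliberately simple
# (US-style formats) and are NOT the project's LLM-based detection.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def flag_personal_details(text: str) -> list[str]:
    """Return the sorted names of the categories whose pattern matches the text."""
    return sorted(
        name for name, pattern in PII_PATTERNS.items() if pattern.search(text)
    )


def keep_row(text: str) -> bool:
    """A row is kept only when no personal-detail pattern matches."""
    return not flag_personal_details(text)


if __name__ == "__main__":
    samples = [
        "Call 555-123-4567 or use SSN 123-45-6789",
        "Server at 192.168.0.1",
        "The weather is nice today.",
    ]
    for sample in samples:
        print(sample, "->", flag_personal_details(sample) or "clean")
```

Regular expressions like these miss many real-world formats and can produce false positives, which is one reason a model-based check may be preferable for this kind of filtering.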

Languages supported: Only English for now (more languages will be added when contributors are available)

🚀 Quick Start

pip install -r requirements.txt
cp .env.example .env

Now, edit the .env file and add your API keys.
Run Data_flagging.ipynb to collect and filter the latest web data, then run Analytics_and_Filtering.ipynb to manually correct the flagging.
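
The two-step workflow (automatic flagging, then manual correction) can be pictured with the sketch below. It is not the notebooks' actual code: the FlaggedRow class, its field names, and filter_rows are hypothetical, and the only idea illustrated is that a reviewer's correction, when present, overrides the automatic flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlaggedRow:
    """One web-text row with its flags (hypothetical structure)."""
    text: str
    auto_flag: bool                     # True if the automated step marked the row as unwanted
    manual_flag: Optional[bool] = None  # reviewer's correction; None means "not reviewed"

    @property
    def is_unwanted(self) -> bool:
        # A manual correction, when present, takes precedence over the automatic flag.
        return self.auto_flag if self.manual_flag is None else self.manual_flag


def filter_rows(rows: list[FlaggedRow]) -> list[str]:
    """Keep only the text of rows that are not unwanted after corrections."""
    return [row.text for row in rows if not row.is_unwanted]


if __name__ == "__main__":
    rows = [
        FlaggedRow("A neutral news summary.", auto_flag=False),
        FlaggedRow("A spam advertisement.", auto_flag=True),
        # Flagged automatically, but the reviewer marked it as acceptable.
        FlaggedRow("A medical article that was flagged by mistake.", auto_flag=True, manual_flag=False),
    ]
    print(filter_rows(rows))
```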

After the filtering process, the data can be used with an LLM as shown in Usage_with_LLMs.ipynb.

That notebook pushes the filtered data to Pinecone DB and uses it with an LLM.
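
The idea of storing filtered text and retrieving it for an LLM can be sketched without any external service. The example below is a stand-in, not the notebook's code: it does not call Pinecone or any LLM, it uses a toy bag-of-words similarity instead of a real embedding model, and the function names embed, cosine, top_match, and build_prompt are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real setup would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_match(query: str, documents: list[str]) -> str:
    """Return the stored document most similar to the query."""
    query_vector = embed(query)
    return max(documents, key=lambda doc: cosine(query_vector, embed(doc)))


def build_prompt(query: str, documents: list[str]) -> str:
    """Place the best-matching filtered text in front of the question."""
    return f"Context: {top_match(query, documents)}\n\nQuestion: {query}"


if __name__ == "__main__":
    filtered_documents = [
        "the new phone model launched in june",
        "rain is expected over the weekend",
    ]
    print(build_prompt("when did the phone launch", filtered_documents))
```

In a real pipeline, the in-memory list would be replaced by a vector database and the resulting prompt would be sent to an LLM.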

🛠️ Contributing

The code is still in progress and has a lot of room for improvement.
Contributions are welcome! Feel free to create an issue for any bug reports or suggestions.
Contributions that add more filters or make the code more efficient are especially helpful.
To contribute, star ⭐ the repository and create an issue. If I can't solve it, I will allow anyone to create a pull request.

📄 Research Paper

A pre-print of the research paper is available on arXiv:2406.19271

📑 Citation

To reference my paper, please cite it as follows:

@misc{vadlapati2024autopuredataautomatedfilteringweb,
	title={{AutoPureData: Automated Filtering of Web Data for LLM Fine-tuning}},
	author={Praneeth Vadlapati},
	year={2024},
	eprint={2406.19271},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2406.19271}, 
}

🪪 License

Copyright (c) 2024 Praneeth Vadlapati
Please refer to the LICENSE file for more information.

⚠️ Disclaimer

The code is not intended for use in production environments. This code is for educational and research purposes only.

No author is responsible for any misuse or damage caused by this code. Use it at your own risk. The code is provided as is without any guarantees or warranty.

Note: The experiment was not re-run using Llama 3.1, as the same accuracy was achieved using Llama 3.

🌐 Acknowledgements

Special thanks to Groq (https://groq.com/) for a fast Llama 3 inference engine
Dataset: HuggingFace FineWeb https://huggingface.co/datasets/HuggingFaceFW/fineweb
Unsafe text detections: Meta Llama Guard 2 https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md
Unwanted text detections using LLM: Meta Llama 3 (70B) https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
Analytics page: Gradio https://gradio.app/
Vector DB: Pinecone https://www.pinecone.io/

📧 Contact

For personal queries, please find my contact details here: linktr.ee/prane.eth
