This project was created to show basic analysis of public fake news datasets. The main idea is to make each analysis replicable, so anyone can add their own analysis and use the datasets for their own experiments and data mining. Every dataset has its own Python Jupyter notebook with a simple analysis, which can help in choosing an appropriate dataset.
To run all Jupyter notebooks with the appropriate libraries installed, we recommend using Docker.
With Docker installed, run the following command to build the Docker image and start the container:

```bash
./scripts/run.sh -b
```
Note: the next time, when no build is needed (because the image has already been built), you can start the container by omitting the `-b` argument.
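For example, a typical workflow might look like this (both commands come straight from the instructions above):

```bash
# First run: build the image and start the container
./scripts/run.sh -b

# Subsequent runs: the image already exists, so the build step can be skipped
./scripts/run.sh
```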
A list of all processed datasets with a simple comparison is stored in the `datasets/README.md` file.
All dataset analyses are stored in the `datasets/` folder. Each dataset has its own folder with a simple description in a README file and a Jupyter notebook (a folder can also include other files, e.g. the data itself).
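As a rough sketch of the expected layout (using the hypothetical dataset name `some_dataset`; the exact contents follow from the contribution steps below):

```
datasets/
├── README.md                  # comparison table of all datasets
└── some_dataset/              # one folder per dataset
    ├── README.md              # link, potential tasks, description, attributes
    ├── some_dataset.ipynb     # Jupyter notebook with the analysis
    └── data/                  # dataset files (.csv/.tsv, stored via Git LFS)
```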
Dataset files (e.g. `.csv` or `.tsv` files) are stored using Git LFS (see Git LFS for more information).
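Note that in a fresh clone without Git LFS set up, the dataset files may be present only as small pointer files. A minimal sketch of fetching the real files, using standard Git LFS commands (nothing project-specific):

```bash
# One-time setup: install the Git LFS hooks for your user
git lfs install

# Download the actual dataset files, replacing the LFS pointer files
git lfs pull
```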
When adding a new dataset, please follow these steps (a command-line walkthrough is sketched after the list):
- Call the `./scripts/create_structure.sh {name}` script with the name argument supplied in `snake_case` format (e.g. `fake_news_detection_kaggle`). This script will create the needed folders and files in the `datasets/{name}` folder.
- Add the data into the `datasets/{name}/data` directory.
- Update the `datasets/{name}/README.md` file to provide a link, potential tasks, a description, and attribute descriptions. Please follow the template file structure.
- Update the `datasets/{name}/{name}.ipynb` file with an analysis of the dataset. Please follow the template file structure.
- Add the dataset and its details into the table of datasets in the `datasets/README.md` file (please follow alphabetical order).
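For illustration, adding a hypothetical dataset named `fake_news_detection_kaggle` might look like this (the source path of the data file is made up; only `create_structure.sh` and the target paths come from the steps above):

```bash
# Create datasets/fake_news_detection_kaggle/ with the template files
./scripts/create_structure.sh fake_news_detection_kaggle

# Put the data into the dataset's data/ directory (tracked via Git LFS)
cp ~/Downloads/fake_news.csv datasets/fake_news_detection_kaggle/data/

# Then fill in the generated README and notebook:
#   datasets/fake_news_detection_kaggle/README.md
#   datasets/fake_news_detection_kaggle/fake_news_detection_kaggle.ipynb
# and add a row for the dataset to the table in datasets/README.md
```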
Prepared datasets still to be finished:
- coaid
- that_is_a_known_lie
- fake_health
- fake_covid