This code is adapted from the study "BERT for Arabic Topic Modeling: An Experimental Study on BERTopic Technique" (code).
Please refer to this blog post for more details about this repository.
Create the conda environment
conda create -n AraTop python=3.7 anaconda
conda activate AraTop
Install requirements
pip install bertopic
pip install flair
The dataset is based on the ArabGend dataset (2022) [1] and comprises 108,053 tweets.
Get the tweet IDs from the data file or from [1], then retrieve the tweets using the Twitter API
pip install twarc
twarc2 hydrate ids.txt tweets.json
twarc2 hydrate twitt_ID.txt tweets.json
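The ID file passed to twarc2 hydrate is plain text with one tweet ID per line. The sketch below shows one way to produce such a file from a CSV data file. The column name tweet_id and the helper names extract_ids and write_id_file are assumptions made for illustration; adjust them to the layout of the actual data file.

```python
import csv
from typing import Iterable, List, Mapping


def extract_ids(rows: Iterable[Mapping[str, str]], id_column: str = "tweet_id") -> List[str]:
    """Collect unique, non-empty tweet IDs, preserving first-seen order.

    `id_column` is a hypothetical column name; change it to match your data.
    """
    seen = set()
    ids = []
    for row in rows:
        tweet_id = (row.get(id_column) or "").strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


def write_id_file(csv_path: str, out_path: str, id_column: str = "tweet_id") -> int:
    """Read a CSV data file and write one tweet ID per line; return the count."""
    with open(csv_path, newline="", encoding="utf-8") as src:
        ids = extract_ids(csv.DictReader(src), id_column)
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.writelines(tweet_id + "\n" for tweet_id in ids)
    return len(ids)
```

For example, write_id_file("data.csv", "ids.txt") would create a file that can be passed to twarc2 hydrate (data.csv is a placeholder name).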
Convert the JSON file to CSV with twarc-csv
pip3 install --upgrade twarc-csv
twarc2 csv --no-json-encode-all tweets.json tweets_CSV.csv
csvcut --columns id,text tweets_CSV.csv
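csvcut prints the selected columns to standard output, so the result has to be redirected if it is needed as a file. As an alternative, the same column selection can be sketched in Python with the standard csv module. The function name select_columns and the output file name below are hypothetical.

```python
import csv
import io
from typing import Sequence, TextIO


def select_columns(src: TextIO, dst: TextIO, columns: Sequence[str] = ("id", "text")) -> int:
    """Copy only `columns` from a CSV stream to another; return the row count."""
    reader = csv.DictReader(src)
    # extrasaction="ignore" silently drops every column not listed in `columns`.
    writer = csv.DictWriter(dst, fieldnames=list(columns), extrasaction="ignore")
    writer.writeheader()
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count


# Tiny in-memory demonstration (no files needed).
demo_in = io.StringIO("id,author_id,text\n1,9,hello\n")
demo_out = io.StringIO()
select_columns(demo_in, demo_out)
```

With real files, open both with newline="" and encoding="utf-8" so that tweets containing line breaks inside quoted fields are read and written correctly, for example open("tweets_CSV.csv", newline="", encoding="utf-8") as the source and a placeholder name such as tweets_id_text.csv as the destination.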
To clean and pre-process the dataset
python arabic_cleaner.py
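For orientation, the sketch below shows the kind of steps that are common when cleaning Arabic tweets: removing URLs and mentions, stripping diacritics and tatweel, and normalizing alef variants. It is an illustration under those assumptions and does not reproduce what arabic_cleaner.py actually does; the function name clean_tweet is hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Arabic diacritics (tashkeel) U+064B..U+0652 plus the superscript alef U+0670.
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")
# Tatweel (kashida), the elongation character.
TATWEEL_RE = re.compile(r"\u0640")
# Alef with madda, hamza above, and hamza below.
ALEF_RE = re.compile(r"[\u0622\u0623\u0625]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Apply a few common normalization steps to one tweet."""
    text = URL_RE.sub(" ", text)          # drop links
    text = MENTION_RE.sub(" ", text)      # drop @mentions
    # Keep hashtag words but drop the '#' and split on underscores.
    text = text.replace("#", " ").replace("_", " ")
    text = DIACRITICS_RE.sub("", text)    # strip diacritics
    text = TATWEEL_RE.sub("", text)       # strip elongation
    text = ALEF_RE.sub("\u0627", text)    # map alef variants to bare alef
    return WHITESPACE_RE.sub(" ", text).strip()
```

Whether each step helps depends on the embedding model: some pretrained Arabic models expect their own preprocessing, so the steps should be matched to the model in use.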
[1] ArabGend: Gender Analysis and Inference on Arabic Twitter
For topic modeling via UMAP
run_umap.sh
For topic modeling via HDBSCAN
run_hdbscan.sh
For the joint model (UMAP + HDBSCAN)
run_joint.sh
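In BERTopic, UMAP reduces the dimensionality of the document embeddings and HDBSCAN clusters the reduced vectors; both can be passed explicitly through the umap_model and hdbscan_model arguments. The sketch below shows a joint configuration. The parameter values, the language="multilingual" embedding choice, and the helper names are illustrative assumptions, not the settings used by the scripts in this repository.

```python
from typing import List

# Illustrative starting values only; tune them for your corpus.
UMAP_PARAMS = dict(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    random_state=42,  # fixed seed so repeated runs are comparable
)
HDBSCAN_PARAMS = dict(
    min_cluster_size=10,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,  # needed to assign topics to unseen documents later
)


def build_topic_model():
    """Build a BERTopic model with explicit UMAP and HDBSCAN components."""
    # Imported lazily so the parameter dictionaries above can be inspected
    # without the heavy dependencies installed.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    return BERTopic(
        language="multilingual",  # assumption: a multilingual sentence embedder
        umap_model=UMAP(**UMAP_PARAMS),
        hdbscan_model=HDBSCAN(**HDBSCAN_PARAMS),
    )


def fit_topics(docs: List[str]):
    """Fit the model on cleaned tweets and return it with the topic assignments."""
    model = build_topic_model()
    topics, _probs = model.fit_transform(docs)
    return model, topics
```

After fitting, model.get_topic_info() returns a table of topics and their sizes; topic -1 collects the documents that HDBSCAN treats as outliers.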
Load the trained model
python infer.py
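As a rough picture of what inference with a saved BERTopic model involves, the sketch below loads a model with BERTopic.load and assigns topics to new documents with transform. It is not the content of infer.py; the function names and the model path argument are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def load_and_infer(model_path: str, docs: List[str]) -> List[int]:
    """Load a saved BERTopic model and return one topic ID per document."""
    # Imported lazily so topic_counts below works without bertopic installed.
    from bertopic import BERTopic

    model = BERTopic.load(model_path)
    topics, _probs = model.transform(docs)
    # Convert to plain ints in case the library returns a NumPy array.
    return [int(topic) for topic in topics]


def topic_counts(topics: List[int]) -> Dict[int, int]:
    """Count documents per topic ID; -1 is BERTopic's outlier topic."""
    return dict(Counter(topics))
```

The input documents should go through the same cleaning as the training data, otherwise the topic assignments may not be comparable.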
The implementation of the project relies on resources from BERTopic, Huggingface Transformers, and SBERT. We thank the original authors for their well-organized codebase.