openml-topic-model

OpenML hosts about 40,000 datasets. This project groups those datasets into topics based on their descriptions.

In this repo:

The data folder contains the latest version of the downloaded descriptions.
The src folder contains the source code: getdata.py obtains the dataset descriptions, preprocess.py preprocesses them and builds a preprocessed dataframe, and model.py implements the topic-modeling algorithms. utils.py and preprocess.py also provide helper functions used by the other files.
The config.py file lets you configure whether the dataset is downloaded again (DOWNLOAD_DATASET_AGAIN), whether it is preprocessed again, and which preprocessing methods are applied.
Once the parameters are set in config.py, run the model with run_model.py; the results should then be available in the results folder.
We currently support LDA with different parameters and seeded LDA. Support for contextualized topic models will be added soon.
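
As a rough illustration of how the configuration flags described above could gate the pipeline stages, here is a minimal sketch. Only DOWNLOAD_DATASET_AGAIN is a name taken from this README; PREPROCESS_AGAIN, PREPROCESSING_STEPS and the plan_steps helper are hypothetical stand-ins, and the real config.py and run_model.py may be organised differently.

```python
from types import SimpleNamespace

# Stand-in for config.py. DOWNLOAD_DATASET_AGAIN is named in this README;
# PREPROCESS_AGAIN and PREPROCESSING_STEPS are hypothetical names used
# only for illustration.
config = SimpleNamespace(
    DOWNLOAD_DATASET_AGAIN=False,
    PREPROCESS_AGAIN=True,
    PREPROCESSING_STEPS=["lowercase", "remove_stopwords"],
)


def plan_steps(cfg):
    """Return the pipeline stages that would run for a given configuration."""
    steps = []
    if cfg.DOWNLOAD_DATASET_AGAIN:
        # Otherwise the descriptions already stored in the data folder are reused.
        steps.append("download descriptions")
    if cfg.PREPROCESS_AGAIN:
        steps.append("preprocess: " + ", ".join(cfg.PREPROCESSING_STEPS))
    steps.append("fit topic model")
    steps.append("write results")
    return steps


if __name__ == "__main__":
    for step in plan_steps(config):
        print(step)
```

With both flags set to False, the sketch would skip straight to model fitting, which is the point of caching the downloaded and preprocessed data between runs.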
