Clustering documents according to document similarity, with a focus on scientific publications.
Download the repo and execute `pip install -e .` in the project root.
To run the tests, you may need to have poppler-cpp-devel installed on your system. Make sure you have an active conda environment with Python >=3.9, hatch, and nox.
Run `nox -s test` to test the code.
`generate_sample.py` is a good starting point. The code is very readable and hackable.
The recommended way to use this software is with a conda-managed python-environment.
Make sure conda is installed and create a new environment with `conda env create --file environment.yml`.
(You can later update it with `conda env update -f environment.yml`.)
`conda activate document-clustering && python generate_sample.py` will run the code, but after activating the conda environment, `./generate_sample.py` should work just as well.
(I personally prefer `ipython -i generate_sample.py`.)
The script starts by fetching the sample of documents from arXiv and terminates on error.
However, all completed downloads are saved to disk and are reused on subsequent runs.
(The cache is located at `~/.cache/document-clustering`.)
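The caching behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function name `cached_fetch`, the injected `fetch` callable, and the hashed file-name layout are all assumptions made for the example. Only the cache directory comes from the text above.

```python
import hashlib
from pathlib import Path
from typing import Callable

# Cache directory as stated in this README; the layout inside it is assumed.
CACHE_DIR = Path.home() / ".cache" / "document-clustering"


def cached_fetch(key: str, fetch: Callable[[str], bytes],
                 cache_dir: Path = CACHE_DIR) -> bytes:
    """Return the payload for `key`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so arbitrary identifiers map to safe file names (assumption).
    path = cache_dir / hashlib.sha256(key.encode("utf-8")).hexdigest()
    if path.exists():
        return path.read_bytes()
    data = fetch(key)  # may raise; nothing is written to the cache in that case
    # Write to a temporary file and rename it, so an interrupted run
    # does not leave a truncated file that looks like a finished download.
    tmp = path.with_name(path.name + ".part")
    tmp.write_bytes(data)
    tmp.replace(path)
    return data
```

With this shape, a run that fails halfway keeps every file it already finished, and the next run only calls `fetch` for the documents that are still missing.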
The code was mostly written from scratch and by referencing the scikit-learn documentation (it's pretty good).
But there are several earlier in-house implementations in the `old/` directory on the beta-writer branch 70482b7 that I also had a look at:
- `old/2019-beta-writer`: plain cosine-based document clustering
- `old/2021-beta-writer`: minor revision of `2019-beta-writer` with duplicate avoidance
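
To give a feel for what "plain cosine-based document clustering" with "duplicate avoidance" can mean, here is a small standard-library sketch. It is not the code from `old/` and not the scikit-learn-based code in this repository. The greedy single-pass strategy, the function names, and both thresholds are assumptions chosen only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-case the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.5, duplicate_threshold=0.98):
    """Greedy single-pass clustering; returns lists of document indices."""
    vectors = [bag_of_words(d) for d in docs]
    clusters = []  # the first index of each cluster acts as its representative
    for i, vec in enumerate(vectors):
        # Duplicate avoidance: skip documents nearly identical to a kept one.
        if any(cosine(vec, vectors[j]) >= duplicate_threshold
               for c in clusters for j in c):
            continue
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, vectors[c[0]])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])  # no cluster is similar enough: start a new one
        else:
            best.append(i)
    return clusters
```

Calling `cluster(list_of_texts)` returns index lists such as one cluster per topic, with exact or near-exact repeats left out. A real pipeline would more likely use TF-IDF weighting and a proper clustering algorithm instead of raw counts and a greedy pass.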
`document-clustering` (excluding the aforementioned beta-writer code) is distributed under the terms of the MIT license.