Citizens Document Clustering

Document clustering for Citizens Foundation.
Built on document embeddings from Gensim by RaRe-Technologies.
Our Doc2Vec model expects lemmas as input, although inflected word forms also work.
Icelandic texts are lemmatized with ice_lemmatizer.py, which is built on Reynir by Mideind.
English texts are lemmatized with en_lemmatizer.py, which uses spaCy.
This repository is a work in progress.

Doc2Vec

Includes:

A script to train a Doc2Vec model.
A script to test a Doc2Vec model.
A script to infer a vector from a previously unseen document.
A script to compute the similarity (a float) between every pair of documents in the model.
A couple of short texts (not suited for training a reliable model) used for testing.
NOTE: This plot is only an example to show the relations between the files.
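
The pairwise similarity step can be illustrated without any dependencies. The following is only a minimal sketch, assuming each document is already represented by a vector and that similarity means cosine similarity; the function names and the tiny three-dimensional vectors are hypothetical and are not taken from the scripts in this repository.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors as a float."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def pairwise_similarities(doc_vectors):
    """Map each unordered pair of document ids to its cosine similarity."""
    return {
        (id_a, id_b): cosine_similarity(doc_vectors[id_a], doc_vectors[id_b])
        for id_a, id_b in combinations(sorted(doc_vectors), 2)
    }


# Hypothetical vectors for illustration; real Doc2Vec vectors are much longer.
doc_vectors = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [2.0, 0.0, 2.0],
    "doc_c": [0.0, 1.0, 0.0],
}
similarities = pairwise_similarities(doc_vectors)
```

In this toy input, doc_a and doc_b point in the same direction and doc_c is orthogonal to both, so the pairs cover the two extremes of the score for non-negative vectors.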

Spelling

Includes:

A script that checks whether a compound word has been split in two, a common spelling mistake in Icelandic.
- bílakjallari | *bíla kjallari
A script that catches spelling mistakes, based on Word2Vec and probability.
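
The split-word check can be sketched as a lookup of joined neighbouring tokens in a list of known words. This is only an illustration under stated assumptions: the function name, the hand-written vocabulary, and the example sentence are hypothetical, and the actual script in this repository may work differently.

```python
def find_split_compounds(tokens, vocabulary):
    """Return (index, joined_word) for adjacent tokens that form a known word when joined."""
    hits = []
    for i in range(len(tokens) - 1):
        joined = (tokens[i] + tokens[i + 1]).lower()
        if joined in vocabulary:
            hits.append((i, joined))
    return hits


# Hypothetical vocabulary of known (lowercase) word forms.
vocabulary = {"hér", "er", "bíla", "kjallari", "bílakjallari"}

tokens = "Hér er bíla kjallari".split()
hits = find_split_compounds(tokens, vocabulary)
```

A check like this only flags candidates: two adjacent words can legitimately stand next to each other even when their concatenation is also a word, so a real tool would need some further evidence before suggesting a correction.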

Word2Vec

Includes:

A script to train a Word2Vec model.
- Currently, the model is only used to correct spelling mistakes.
- It might later be used to classify documents further based on keyword vectors.

License

AGPL

