The goal of the project is to perform topic modeling (TM) on 90 Russian plays written between 1747 and 1943 (RusDraCor, https://dracor.org/rus). The plays are encoded in the TEI standard. The procedure can be reproduced and reapplied to an updated version of the corpus.
- the processed TEI-XML files with the characters' proper names excluded
- the stop-word and proper-name lists and the script for removing them
- the preprocessed corpora of 90 Russian plays
  - each folder has subfolders byauthor, bycharacter, byplay, bysex
  - lemmatisation and POS tagging were done with the pymystem3 Python module (a wrapper for Mystem)
- the corpus version that was used for the project (Only Nouns corpus)
  - it also includes subfolders bygenre and byyear_range
  - to check out the TM (modeling only noun-based topics), you will need only this folder
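
The noun-only restriction mentioned above can be sketched as follows. This is a minimal illustration, not the project's own script: the function names `noun_lemmas` and `lemmatise_nouns` are hypothetical, and the sketch assumes that taking the first reading Mystem returns for each token is good enough. It relies on the structure returned by `Mystem.analyze`, where each token may carry an `analysis` list whose entries hold a lemma (`lex`) and a grammar string (`gr`) that begins with the part-of-speech tag (`S` for nouns).

```python
def noun_lemmas(analysis):
    """Return lemmas of tokens whose first Mystem reading is a noun (POS tag 'S')."""
    lemmas = []
    for token in analysis:
        readings = token.get("analysis")
        if not readings:
            # Punctuation, whitespace, and unrecognised tokens have no readings.
            continue
        best = readings[0]
        # 'gr' looks like 'S,жен,од=им,ед'; the POS tag precedes the first ',' or '='.
        pos = best["gr"].split("=")[0].split(",")[0]
        if pos == "S":
            lemmas.append(best["lex"])
    return lemmas


def lemmatise_nouns(text, mystem=None):
    """Analyse raw text with pymystem3 and keep only noun lemmas (hypothetical helper)."""
    if mystem is None:
        # Imported lazily so that noun_lemmas() can be used without pymystem3 installed.
        from pymystem3 import Mystem
        mystem = Mystem()
    return noun_lemmas(mystem.analyze(text))
```

When processing many speeches, it is probably worth creating one `Mystem()` instance and passing it in through the `mystem` argument, since starting the analyser for every call is slow.
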
Script or file | Description | Dependencies |
---|---|---|
stopwords_and_others/extract_capitalised_words.py | Extracting all capitalised words that are not at the beginning of a sentence | os, re, nltk |
stopwords_and_others/characters(proper)_names.txt | The list of capitalised words, filtered to keep only the characters' proper names | |
stopwords_and_others/remove_characters(proper)_names_from_TEI.py | Removing proper names from the TEI documents | os, re |
scripts_for_text_extraction/get_plays_texts_clean_POS_restriction.py | Extracting the characters' speech texts from the TEI documents with POS restrictions (different options available) | os, re, codecs, glob, lxml, pymystem3 |
classification_using_TM_vectors_gender.py | Choosing the best model by means of a character-gender classification task | sklearn |
semantic_vectors.py | Choosing the best model by calculating the "semdensity" of topics | sklearn, numpy, glob, re, matplotlib, wordcloud, random, gensim, logging, pymystem3, a pre-downloaded word-vector model |
topic_modeling_predict_year.py | Applying the model to find the temporal distribution of topics | sklearn, numpy, glob, re, matplotlib, wordcloud, random |
topic_modeling_predict_genre.py | Applying the model to find the distribution of topics by genre | sklearn, numpy, glob, re, matplotlib, wordcloud, random |
topic_modeling_predict_author.py | Applying the model to find the distribution of topics by author | sklearn, numpy, glob, re, matplotlib, wordcloud, random |
topic_modeling_predict_gender.py | Applying the model to find the distribution of topics by the characters' gender | sklearn, numpy, glob, re, matplotlib, wordcloud, random |
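
The first step in the table, collecting capitalised words that do not open a sentence, can be illustrated roughly as below. This is only a sketch under stated assumptions: the repository's script lists nltk as a dependency, whereas this version substitutes a simple regular-expression sentence splitter so that it runs with the standard library alone, and the function name `capitalised_non_initial` is hypothetical.

```python
import re

# Assumption: a sentence ends at '.', '!', '?' or '…' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")
# A word is a run of letters (any script), optionally joined by hyphens.
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def capitalised_non_initial(text):
    """Return the sorted set of capitalised words that are not sentence-initial."""
    found = set()
    for sentence in SENTENCE_END.split(text):
        words = WORD.findall(sentence)
        # Skip the first word: it is capitalised regardless of being a name.
        for word in words[1:]:
            if word[0].isupper():
                found.add(word)
    return sorted(found)


if __name__ == "__main__":
    demo = "Вчера приехал Иван. Он видел Марью Ивановну! Дом был пуст."
    print(capitalised_non_initial(demo))
```

The result contains inflected forms rather than lemmas, and it will also pick up capitalised words that are not names, which is consistent with the table's second row describing a further filtering of the list.
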