The `SubtextAnalyzer` module is a program designed to take as input a small sample of English text and decide whether it contains any subtext. At present, there are only four types of potential output: `violent`, `sexual`, `depressive`, or `no_subtext`.
It functions by loading Google's pre-trained Word2Vec embeddings, then copying and retraining those embeddings in the direction of each viable subtext. The result is four sets of new embeddings to compare against the original Google embeddings.
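The source doesn't say how the Google embeddings are loaded; a minimal sketch, assuming gensim's `KeyedVectors` and Google's standard `GoogleNews-vectors-negative300.bin` release, might look like this:

```python
# Minimal sketch of loading Google's pre-trained Word2Vec embeddings.
# Using gensim here is an assumption; the project may read the binary
# release some other way.
from gensim.models import KeyedVectors

google_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
print(google_vectors["violence"].shape)  # each vector is 300-dimensional
```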
The program functions in three parts:
`WordsToData` takes as input a large text file of English which is deemed to contain subtext of a particular type, let's say `depressive`, as well as a vocabulary size, `vocab_size`. We then use Google's pre-trained embeddings and take the `vocab_size` most frequently used words as our vocabulary. `WordsToData.py` will then create a new file in which each word is replaced with the corresponding index of that word. If a word from the input file is not in our vocabulary, it is instead replaced by a 0. This does not take into account multi-word phrases like "New_Orleans" or "total_recall," many of which have been helpfully captured as single words in Google's model.
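As a hedged illustration (the actual `WordsToData.py` may be structured differently), the core word-to-index transformation could look like this, where `vocab` maps the `vocab_size` most frequent words to indices and unknown words map to 0:

```python
# Hypothetical sketch of the WordsToData step: replace each word in a
# corpus file with its vocabulary index, writing 0 for out-of-vocabulary
# words. Function name and file handling are illustrative assumptions.
def words_to_data(input_path, output_path, vocab):
    """vocab maps each of the vocab_size most frequent words to an index >= 1."""
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            indices = [str(vocab.get(word, 0)) for word in line.split()]
            dst.write(" ".join(indices) + "\n")

# Example with a tiny vocabulary:
# vocab = {"the": 1, "rain": 2, "falls": 3}
# "the rain falls on New_Orleans" -> "1 2 3 0 0"
```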
`SubtextRetraining` is a TensorFlow program that retrains the Google word embeddings on the file that `WordsToData` has produced. It uses a skip-gram model and noise-contrastive estimation (NCE) sampling to create the new embeddings for each subtext.
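The exact hyperparameters aren't stated in the source; this is a sketch of the standard TensorFlow skip-gram/NCE recipe with assumed values, where the embedding table is initialized from the original Google vectors and nudged by retraining:

```python
import tensorflow as tf

# Assumed sizes; the real program's values aren't given in the source.
vocab_size, embed_dim, num_sampled = 50000, 300, 64

# Stand-in for the matrix of original Google vectors (the real program
# would initialize this from the pre-trained embeddings).
google_matrix = tf.random.uniform([vocab_size, embed_dim], -1.0, 1.0)

embeddings = tf.Variable(google_matrix)  # retrained in place
nce_weights = tf.Variable(
    tf.random.truncated_normal([vocab_size, embed_dim], stddev=embed_dim ** -0.5))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

def skipgram_nce_loss(center_ids, context_ids):
    """Skip-gram step: predict a context word from its center word, scoring
    the true pair against num_sampled random noise words (NCE)."""
    embed = tf.nn.embedding_lookup(embeddings, center_ids)
    labels = tf.cast(tf.reshape(context_ids, (-1, 1)), tf.int64)
    return tf.reduce_mean(tf.nn.nce_loss(
        weights=nce_weights,
        biases=nce_biases,
        labels=labels,
        inputs=embed,
        num_sampled=num_sampled,
        num_classes=vocab_size))
```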
Finally, `SubtextAnalyzer` will take a string of English text and break it down into individual words. Each word will then be tested against the original word embeddings and all of the new subtext-specific embeddings. For each word and each subtext, we look at the cosine distance between the vector representation of that word in its original Google-trained form and its new subtext-specific form. Whichever subtext has the greatest cumulative distance over all the input words is chosen as the subtext of the input and is returned as the answer.
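A minimal NumPy sketch of that scoring step (variable names and the handling of `no_subtext` are assumptions, not the repository's actual API):

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity; 0 means the two vectors point the same way.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def choose_subtext(words, original, subtext_embeddings, vocab):
    # original and each retrained matrix are [vocab_size, embed_dim] arrays;
    # vocab maps words to row indices (0 = out of vocabulary).
    scores = {}
    for name, retrained in subtext_embeddings.items():
        scores[name] = sum(
            cosine_distance(original[vocab.get(w, 0)], retrained[vocab.get(w, 0)])
            for w in words
        )
    # How "no_subtext" is selected isn't specified in the source; presumably
    # it wins when every cumulative distance stays small.
    return max(scores, key=scores.get)
```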
As it stands, the program has a rather low success rate. I believe this is due, in large part, to the relatively small sample text I have for each subtext. The inputs I am using have about 5,000-20,000 words per subtext and are pulled from very varied sources. At present, the results are based more on the principle of "was this word ever retrained for this subtext? If so, then the cosine distance is quite large." For `SubtextAnalyzer` to work properly, the question needs to be: "when this word was retrained for this particular subtext, how different did it become from its original embedding?"
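One hedged reading of that fix, purely a sketch on my part and not something the repository implements: restrict the score to words that actually occurred in each subtext's training file, and average rather than sum, so that merely having been retrained no longer dominates. Reusing `cosine_distance` from the sketch above:

```python
def choose_subtext_refined(words, original, subtext_embeddings, vocab, trained_vocab):
    # trained_vocab[name] is the set of words that occurred in that subtext's
    # training file (a hypothetical input; the current program does not track it).
    scores = {}
    for name, retrained in subtext_embeddings.items():
        seen = [w for w in words if w in trained_vocab[name] and w in vocab]
        if not seen:
            scores[name] = 0.0
            continue
        # Average drift over retrained words only, instead of a raw sum.
        scores[name] = sum(
            cosine_distance(original[vocab[w]], retrained[vocab[w]]) for w in seen
        ) / len(seen)
    return max(scores, key=scores.get)
```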