We use the Vector Space Model (VSM) and the Latent Semantic Indexing (LSI) model to compute document similarity on a portion of the People's Daily corpus, which contains about 3,000 documents.
There are two files under the data directory:
199801_clear_1.txt is the People's Daily corpus.
small_data_for_test.txt is a small dataset used only for testing the code.
See dictionary_builder.py
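
As a rough illustration of what a dictionary-building step can look like, here is a minimal sketch. It is not taken from dictionary_builder.py: the function name `build_dictionary`, the `min_count` parameter, and the assumption that each document is a whitespace-separated string of tokens are all hypothetical.

```python
from collections import Counter

def build_dictionary(docs, min_count=1):
    """Map each token to an integer id (hypothetical helper, not the repo's API)."""
    # Assumption: each document is already segmented into whitespace-separated tokens.
    counts = Counter(token for doc in docs for token in doc.split())
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    vocab = sorted(
        (t for t, c in counts.items() if c >= min_count),
        key=lambda t: (-counts[t], t),
    )
    return {token: idx for idx, token in enumerate(vocab)}

# Toy documents standing in for the real corpus.
docs = ["economy reform news", "economy growth", "sports news"]
dictionary = build_dictionary(docs)
print(dictionary)
```

Raising `min_count` drops rare tokens, which is one common way to keep the vocabulary small before building document vectors.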
See doc_similarity_VSM.py
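
For readers unfamiliar with VSM, the sketch below shows one common formulation: documents become TF-IDF weighted term vectors and similarity is the cosine of the angle between them. It is not the code in doc_similarity_VSM.py. The helper names (`tfidf_vectors`, `cosine`), the smoothed IDF formula, and the whitespace tokenization are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: token -> weight) per document."""
    # Assumption: documents are whitespace-separated token strings.
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1.
            t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, tf in term_freq.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy documents standing in for the real corpus.
docs = ["economy reform news", "economy growth news", "sports match result"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

In this formulation, documents that share no terms always score zero, which is the limitation LSI is meant to soften.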
See doc_similarity_LSI.py
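
The idea behind LSI can be sketched with a truncated singular value decomposition of the term-document matrix: documents are projected onto the top k latent dimensions and compared there by cosine similarity. The code below is an illustrative sketch using NumPy, not the contents of doc_similarity_LSI.py. The function names, the toy matrix, and the choice k = 2 are assumptions.

```python
import numpy as np

def lsi_doc_vectors(term_doc, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    # term_doc has shape (n_terms, n_docs).
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Columns of diag(s_k) @ Vt_k are the document coordinates in latent space.
    return (np.diag(s[:k]) @ Vt[:k]).T  # shape: (n_docs, k)

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    Xn = X / norms
    return Xn @ Xn.T

# Toy term-document count matrix (rows: terms, columns: documents).
term_doc = np.array([
    [1, 1, 0, 0],  # economy
    [1, 0, 0, 0],  # reform
    [0, 1, 0, 0],  # growth
    [0, 0, 1, 1],  # sports
    [0, 0, 1, 0],  # match
    [0, 0, 0, 1],  # result
], dtype=float)

doc_vecs = lsi_doc_vectors(term_doc, k=2)
sims = cosine_matrix(doc_vecs)
print(np.round(sims, 3))
```

In this toy matrix the first two documents share only one term, yet after reduction to two dimensions they land close together, while documents from the other topic stay far apart. The value of k is a tuning choice and would need to be much larger for a corpus of thousands of documents.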