Searching similar documents in a large corpus of documents.

ankushbhatia2/docsearch

docsearch

docsearch is a Python library for finding similar documents in a large corpus, based on a recent project of mine. It compares documents using either a 2-layer Earth Mover's Distance (my research) or a Jensen-Shannon distance over the documents' latent topic distributions and word embeddings.

I will describe the methodology in more detail once my paper is published.
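Until the methodology is published, the Jensen-Shannon option can at least be illustrated in isolation. The following is a minimal sketch, assuming each document has already been reduced to a topic distribution (a list of probabilities summing to 1). The function names `kl_divergence` and `jensen_shannon_distance` are hypothetical and are not part of docsearch; this is the textbook definition, not necessarily how the library computes it.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms where p is zero contribute nothing and are skipped.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence, in [0, 1]."""
    # Mixture distribution: the pointwise average of p and q.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(jsd, 0.0))


# Hypothetical topic distributions for three documents.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

print(jensen_shannon_distance(doc_a, doc_b))  # small: similar topic mix
print(jensen_shannon_distance(doc_a, doc_c))  # larger: different topic mix
```

The distance is symmetric and bounded, which makes it convenient for ranking documents against a query.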

Installation

pip install docsearch

PyPI project: https://pypi.org/project/docsearch/

Classes

DocSearch() :

(i) __init__() takes 5 optional arguments:

"""
:param n_topics: number of topics (default 100)
:param wv_size: word embedding dimension (default 100)
:param stop_words: stop words list (default list)
:param min_word_freq: minimum word frequency (default 15000)
:param sim_metric: allowed values: ['jenson-shannon', 'emd']
"""

(ii) fit() takes a single argument: the list of documents.

(iii) get_most_similar_documents() takes 2 arguments: query_document and k, the number of similar documents to return.
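To make the role of k concrete, here is a minimal sketch of top-k retrieval over precomputed document vectors. It is not the library's implementation: `most_similar` and `total_variation` are hypothetical names, and total variation distance is used only as a simple stand-in for the 'jenson-shannon' and 'emd' metrics.

```python
import heapq


def total_variation(p, q):
    """Stand-in distance: half the L1 difference between two distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def most_similar(query, corpus, k=10, distance=total_variation):
    """Return the k (index, distance) pairs closest to the query, nearest first."""
    scored = ((i, distance(query, doc)) for i, doc in enumerate(corpus))
    # nsmallest keeps only k candidates in memory instead of sorting everything.
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical topic distributions for a three-document corpus.
corpus = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
]

top = most_similar([0.7, 0.2, 0.1], corpus, k=2)
print(top)
```

Passing the distance as a parameter mirrors the idea of a configurable sim_metric: the ranking logic stays the same whichever metric is plugged in.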

Usage

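```python
import pandas as pd
# Assumed import path; adjust if the package exposes DocSearch elsewhere.
from docsearch import DocSearch

docsearch = DocSearch()

# Load a dataset with a 'text' column holding one document per row.
path = "path/to/dataset.csv"
df = pd.read_csv(path)

# Fit on the documents.
docsearch.fit(df['text'])

# Query with the document at row label 100.
print(docsearch.get_most_similar_documents([str(df.at[100, 'text'])]))
```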
