Using text mining to build a plagiarism detector based on similarity of documents.

ShreshthSaxena/Text-Mining

Text-Mining

The data is a collection of documents (text and PDF files) in the "msc-plagiarism-assigment" folder.

The assignment is divided into three parts:

a) Normalize the text and create a similarity matrix using the Jaccard index. Apply hierarchical clustering. Cut the dendrogram at k and identify clusters of similar documents.

b) Create a TF-IDF matrix of the collection. Using cosine distance, create a similarity matrix. Cluster the documents using k-means clustering, and find the number of clusters (k) that minimizes the sum of squared errors (SSE). Apply hierarchical clustering. Cut the dendrogram at k and identify clusters of similar documents.

c) Perform latent semantic analysis (LSA) using a reduced latent space with 4 dimensions. For each topic, identify the 5 top-weighted terms. Compute the similarity matrix for the documents in the reduced space. Apply hierarchical clustering. Cut the dendrogram at k and identify clusters of similar documents.
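Part a) can be sketched in plain Python as follows. This is a minimal illustration, not the code in this repository: the function names (`normalize`, `jaccard`, `similarity_matrix`, `cluster`), the word-level tokenization, the choice of average linkage, and the four toy documents are all assumptions made for the example. Stopping the merges once k clusters remain is equivalent to cutting the dendrogram at k.

```python
import re
from itertools import combinations


def normalize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_matrix(docs):
    """Pairwise Jaccard similarity between the token sets of all documents."""
    tokens = [normalize(d) for d in docs]
    n = len(tokens)
    return [[jaccard(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]


def cluster(sim, k):
    """Average-linkage agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(sim))]

    def link(c1, c2):
        # Mean similarity over all cross-cluster document pairs.
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the two most similar clusters (a < b, so deleting b is safe).
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: link(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy documents standing in for the real collection.
docs = [
    "The cat sat on the mat.",
    "The cat sat on a mat!",
    "Stock prices fell sharply today.",
    "Stock prices fell sharply this morning.",
]
sim = similarity_matrix(docs)
groups = cluster(sim, k=2)
print(groups)
```

Documents that land in the same group share most of their vocabulary, which makes them candidates for closer inspection.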
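For part b), the sketch below builds a TF-IDF matrix, a cosine similarity matrix, and an SSE value for each k using a plain k-means loop. It is an illustration under stated assumptions rather than the repository's implementation: the helper names, the unsmoothed `log(N / df)` weighting, the seeding of k-means with the first k rows, and the toy documents are all hypothetical choices. Note that SSE generally keeps falling as k grows and reaches zero when every document is its own cluster, so in practice one usually looks for the "elbow" where the decrease levels off.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_matrix(docs):
    """Return (vocab, rows) with rows[d][t] = term count * log(N / df)."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    n = len(docs)
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    rows = [[c[t] * math.log(n / df[t]) for t in vocab] for c in counts]
    return vocab, rows


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def kmeans_sse(rows, k, iterations=20):
    """Plain k-means seeded with the first k rows; returns (labels, SSE)."""
    centers = [list(r) for r in rows[:k]]

    def sqdist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    labels = []
    for _ in range(iterations):
        # Assign each row to its nearest center, then recompute the centers.
        labels = [min(range(k), key=lambda c: sqdist(r, centers[c])) for r in rows]
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sqdist(r, centers[lab]) for r, lab in zip(rows, labels))
    return labels, sse


# Hypothetical toy documents standing in for the real collection.
docs = [
    "The cat sat on the mat.",
    "The cat sat on a mat!",
    "Stock prices fell sharply today.",
    "Stock prices fell sharply this morning.",
]
vocab, rows = tfidf_matrix(docs)
sim = [[cosine(u, v) for v in rows] for u in rows]
sse_by_k = {k: kmeans_sse(rows, k)[1] for k in range(1, len(rows) + 1)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 3))
```

Cosine distance is one minus the cosine similarity, so either form of the matrix can be passed on to the hierarchical clustering step.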
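Part c) might look like the following sketch, which performs LSA through a truncated singular value decomposition with NumPy. The term list and the document-term counts are invented toy data, and the variable names are hypothetical; a real run would start from the TF-IDF or count matrix of the actual collection. Ranking terms by absolute weight within each topic is also an assumption, since the sign of a singular vector is arbitrary.

```python
import numpy as np

# Hypothetical vocabulary and document-term counts (rows = documents).
terms = ["cat", "mat", "sat", "dog", "stock", "prices", "fell", "market"]
X = np.array(
    [
        [2, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0],
        [0, 0, 1, 2, 0, 0, 0, 0],
        [0, 0, 0, 0, 2, 1, 1, 0],
        [0, 0, 0, 0, 1, 1, 1, 1],
        [1, 0, 0, 0, 0, 1, 0, 2],
    ],
    dtype=float,
)

n_topics = 4  # size of the reduced latent space
top_n = 5     # number of top-weighted terms to report per topic

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Document coordinates in the latent space, and term weights per topic.
doc_vectors = U[:, :n_topics] * s[:n_topics]
topic_terms = Vt[:n_topics]

# For each topic, the terms with the largest absolute weight.
top_terms = [
    [terms[i] for i in np.argsort(-np.abs(row))[:top_n]] for row in topic_terms
]

# Cosine similarity between documents in the reduced space.
norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
unit = doc_vectors / np.where(norms == 0, 1.0, norms)
sim = unit @ unit.T

for topic, words in enumerate(top_terms):
    print(topic, words)
print(np.round(sim, 2))
```

The resulting `sim` matrix plays the same role as the similarity matrices in parts a) and b) and can be converted to distances for hierarchical clustering.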
