The data is a collection of documents (text and PDF files) contained in the "msc-plagiarism-assigment" folder.
a) Normalize the text and create a similarity matrix using the Jaccard index. Apply hierarchical clustering. Cut the dendrogram at k and identify clusters of similar documents.
b) Create a TF-IDF matrix of the collection. Using cosine distance, create a similarity matrix. Cluster the documents using k-means clustering, and find the number of clusters (k) that minimizes SSE. Apply hierarchical clustering. Cut the dendrogram at k and identify clusters of similar documents.
c) Perform LSA using a reduced latent space with 4 dimensions. For each topic, identify the set of 5 top-weighted terms. Find the similarity matrix for the documents in the reduced space. Apply hierarchical clustering. Cut the dendrogram at k and identify clusters of similar documents.