Text-Mining

The data is a collection of documents (text and PDF files) in the "msc-plagiarism-assigment" folder.

The assignment is divided into three parts:

a) Normalize the text and create a similarity matrix using the Jaccard index. Apply hierarchical clustering, cut the dendrogram at k, and identify clusters of similar documents.

b) Create the TF-IDF matrix of the collection and, using cosine distance, create a similarity matrix. Cluster the documents using k-means clustering and find the number of clusters (k) that minimizes SSE. Apply hierarchical clustering, cut the dendrogram at k, and identify clusters of similar documents.

c) Perform LSA using a reduced latent space with 4 dimensions. For each topic, identify the 5 top-weighted terms. Compute the similarity matrix for the documents in the reduced space. Apply hierarchical clustering, cut the dendrogram at k, and identify clusters of similar documents.
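
Part (a) and the hierarchical clustering step shared by all three parts can be sketched as follows. This is a minimal illustration, not the code in this repository: the toy `docs` list, the helper names `normalize`, `jaccard_matrix` and `cluster_from_similarity`, the regex tokenizer, and the choice of average linkage are all assumptions made for the example.

```python
import re

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical toy documents; the real input would be the text extracted
# from the files in the data folder.
docs = [
    "The cat sat on the mat.",
    "A cat sat on a mat!",
    "Stock prices fell sharply today.",
    "Stock prices fell again today.",
]


def normalize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_matrix(token_sets):
    """Pairwise Jaccard index |A & B| / |A | B| for a list of token sets."""
    n = len(token_sets)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            union = token_sets[i] | token_sets[j]
            shared = token_sets[i] & token_sets[j]
            score = len(shared) / len(union) if union else 0.0
            sim[i, j] = sim[j, i] = score
    return sim


def cluster_from_similarity(sim, k):
    """Average-linkage clustering on 1 - similarity, cut into k clusters."""
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)  # condensed form for linkage
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=k, criterion="maxclust")


similarity = jaccard_matrix([normalize(d) for d in docs])
labels = cluster_from_similarity(similarity, k=2)
print(np.round(similarity, 2))
print(labels)
```

Cutting with `criterion="maxclust"` asks SciPy for at most k flat clusters, which is one way to read "cut the dendrogram at k". A different linkage method (for example `"complete"` or `"ward"`) may group borderline documents differently.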

Repository contents:

msc-plagiarism-assigment
plots
to_test
.gitattributes
Jaccard Index.py
Latent Semantic Analysis.py
README.md
REPORT.pdf
REPORT2b.pdf
REPORT2c.pdf
cosine similarity.py
dm2c.py

ShreshthSaxena/Text-Mining

Folders and files

Latest commit

History

Repository files navigation

Text-Mining

The Assignment is divided into 3 parts :

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages