#NewsClustering#
This is the repo for the CS473 Final Project. It aims to cluster news threads from the Purdue Newsroom and produce a summary of each thread.
#Getting Started#
- Install Python (2.5 or later, but earlier than 3.0)
- Install the dependencies for gensim, including NumPy and SciPy: `sudo apt-get install python-dev`
- Install gensim
- Install stemming 1.0. If easy_install is installed, you can install it with `easy_install -U stemming`

Detailed installation steps are available on the gensim website.
- Run parse.py
- Run transform.py
- Run lda.py
- Run index.py
#To-do#
- Save document URLs
- Transform to sparse
- LDA step
- Inverse hash stemmed word to original word
- Do TF-IDF transform to corpus
- Similarity queries for all the documents (Clustering)
- Summarize topics
- Evaluation Datasets for single-label text categorization
#Data#
The file "2012data.txt" in the Data directory is used as the input for this project. Each record has the following structure:
url\n
<Content>\n
title\n
content\n
<\Content>\n
\n
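
A minimal sketch of reading this layout. It assumes each record starts with its URL, that the first line inside the `<Content>` block is the title, and that the closing tag is written literally as `<\Content>`, as shown above. The function name `parse_records` and the sample text are made up for illustration; this is not the code in parse.py.

```python
def parse_records(text):
    """Split text in the layout above into (url, title, content) tuples."""
    records = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        # Skip the blank separator lines between records.
        if not lines[i].strip():
            i += 1
            continue
        url = lines[i].strip()
        i += 1
        if i >= len(lines) or lines[i].strip() != "<Content>":
            raise ValueError("expected <Content> after url %r" % url)
        i += 1
        # Collect everything up to the closing tag.
        body = []
        while i < len(lines) and lines[i].strip() != "<\\Content>":
            body.append(lines[i])
            i += 1
        if i >= len(lines):
            raise ValueError("unterminated record for url %r" % url)
        i += 1  # skip the closing tag
        title = body[0].strip() if body else ""
        content = "\n".join(body[1:]).strip()
        records.append((url, title, content))
    return records


# Hypothetical sample input, not taken from 2012data.txt.
sample = (
    "http://example.com/a\n<Content>\nTitle A\nBody A\n<\\Content>\n\n"
    "http://example.com/b\n<Content>\nTitle B\nLine 1\nLine 2\n<\\Content>\n\n"
)
records = parse_records(sample)
```

Each tuple in `records` holds the URL, the title, and the remaining content lines joined back together.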
#Doc_index#
doc[doc_b] is a list of pairs. Each pair is (topic_id, probability).

wordlist[list[listofwords]] is a list of lists of words, each word paired with its corresponding percentage, e.g. [['0.006', 'student'], ['0.005', 'engin'], ['0.004', 'agricultur'], ['0.004', 'program'] ...
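
A small sketch of how these two structures could be used together. It assumes `doc` can be indexed by an integer document id and that `wordlist` is indexed by topic_id; neither is confirmed above. The helper names `dominant_topic` and `top_words` are hypothetical, and all values except the four word pairs quoted above are made up.

```python
# Illustrative data only: doc[0] is one document's (topic_id, probability) pairs.
doc = [
    [(0, 0.2), (1, 0.8)],
]

# wordlist[topic_id] is a list of [percentage, word] pairs (assumed indexing).
wordlist = [
    [['0.006', 'student'], ['0.005', 'engin'],
     ['0.004', 'agricultur'], ['0.004', 'program']],
    [['0.007', 'research'], ['0.005', 'univers']],  # made-up second topic
]


def dominant_topic(pairs):
    """Return the topic_id with the highest probability."""
    return max(pairs, key=lambda pair: pair[1])[0]


def top_words(wordlist, topic_id, n=3):
    """Return the first n stemmed words listed for a topic."""
    return [word for _, word in wordlist[topic_id][:n]]


best = dominant_topic(doc[0])
words = top_words(wordlist, best)
```

Note that the words are stems (e.g. 'engin'), so mapping them back to original words is a separate step.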
#Resources#
Python Official Tutorial