NewsClustering is a final project developed by a group of three students (Ilaria Ceppa, Marco Grandi and Marco Ponza) for the Information Retrieval course.
The goal of the project was to develop, test, and analyze the results of clustering software that uses GibbsLDA++ to generate clusters of Italian news articles.
The final report is available in this repository (in Italian only).
The project can be compiled by typing:
make clean
make all
and the help message can be displayed with:
./clusteringLDA --help
To run the application on a news dataset, type:
./clusteringLDA [-v] [-a alpha] [-b beta] [-n clusters] [-t terms] [-m size] [-i iter] [-s step] [-o file] [-c clust] [-d string] dataset_file
where:
-v: shows the parameter values before running the application;
-a alpha: sets the alpha parameter of GibbsLDA++;
-b beta: sets the beta parameter of GibbsLDA++;
-n clusters: sets the number of clusters to generate;
-t terms: sets the number of terms shown in the output file;
-m size: sets the minimum cluster size (smaller clusters are removed);
-i iter: sets the number of GibbsLDA++ iterations;
-s step: sets the number of iterations after which a temporary model is generated;
-o file: sets the output file;
-c clust: name of the model generated by GibbsLDA++;
-d string: sets the preprocessing algorithms NOT to use:
    . (dot): disables the punctuation filter;
    s: disables stopwords;
    w: disables shingling;
    i: disables the IDF filter;
    m: disables cluster-size thresholding;
    p: disables the document filter.
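To show how the options above combine, here is a minimal sketch of an invocation. Every value in it is an illustrative assumption rather than a recommended setting: the file names news.txt and clusters.txt are hypothetical, and passing several letters together to -d (here "sw") is assumed from the description of that option, not confirmed.

```shell
# All values below are illustrative assumptions, not recommended defaults.
DATASET="news.txt"      # hypothetical dataset file
OUTPUT="clusters.txt"   # hypothetical output file
DISABLED="sw"           # assumes -d accepts several letters concatenated

# 20 clusters, 10 terms per cluster, 1000 iterations, temporary model every 100
CMD="./clusteringLDA -v -n 20 -t 10 -i 1000 -s 100 -o $OUTPUT -d $DISABLED $DATASET"

# Print the command, then run it only if the binary and the dataset exist.
echo "$CMD"
if [ -x ./clusteringLDA ] && [ -f "$DATASET" ]; then
    $CMD
fi
```

With these settings the stopword and shingling steps would be skipped, while the remaining preprocessing steps stay enabled.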