Accelerating K-means clustering for document data sets with an architecture-friendly pruning method

nttcslab/KmeansDocData

Accelerating K-Means Clustering for Documents
with an Architecture-Friendly Pruning Method

This repository contains supplemental materials, including an additional document and code for applying our K-means clustering algorithm, AF-ICP, to large-scale, high-dimensional sparse data sets such as the 8.2M-sized PubMed data set, and for comparing it with the other algorithms (ICP, TA-ICP, and CS-ICP) in ./Comparision. The code is implemented in C.

Requirements for executing codes

  1. OS: CentOS 7.6 and later
  2. g++ (GCC): >= 8.2.0
  3. perl: >= 5.16
  4. perf: 3.10
  5. bzip2 (optional)
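
Before building, it may help to confirm that the listed tools are on your PATH. The following is a minimal sketch, not part of the repository; `check_tool` is a hypothetical helper name, and the script only reports presence, not versions.

```shell
#!/bin/sh
# Report whether a required tool is available on PATH.
# check_tool is a hypothetical helper for this sketch, not a repository script.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# Tools named in the requirements list (bzip2 is optional).
for tool in g++ perl perf bzip2; do
    check_tool "$tool"
done
```

To compare against the minimum versions above, `g++ --version`, `perl -v`, and `perf --version` print the installed versions.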

Quick start: AF-ICP in five iterations

  1. Prepare the 8.2M-sized PubMed data set by following the procedure in dataset.
    This procedure creates ./dataset/pubmed.8_2M.db, which the code in this repository reads.
    If you cannot download the original data (docword.pubmed.txt) from the UCI Machine Learning Repository, you can download pubmed.8_2M.db.bz2 instead. Then execute bzip2 -d pubmed.8_2M.db.bz2 to extract pubmed.8_2M.db and move it to the ./dataset directory.
  2. Execute make -f Makefile_itr5_aficp in ./src.
    This builds ./bin/itr5_aficp on your system.
  3. Execute the perl script ./itr5_exeAFICP_8.2Mpubmed_perf.pl in ./exe.
    The 8.2M-sized PubMed data set is loaded from ./dataset/pubmed.8_2M.db (3.8GB) in around two minutes; then, given K=10,000, AF-ICP is executed with 50-thread parallel processing (the default).
    You can change the default values in the perl scripts. For instance, the number of threads is set by $NumThreads in the script.
    A log file is generated in ./Log.
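
The quick-start steps can be sketched as one shell script, run from the repository root. This is an illustrative sketch under stated assumptions, not a script shipped with the repository: `DRY_RUN`, `run_in`, and `quick_start` are hypothetical names, and the fallback archive pubmed.8_2M.db.bz2 is assumed to sit in the repository root. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Dry-run sketch of the quick-start steps; run from the repository root.
# DRY_RUN, run_in and quick_start are hypothetical helpers, not repository code.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Run a command inside a directory, or just print it in dry-run mode.
run_in() {
    dir=$1
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "(cd $dir && $*)"
    else
        (cd "$dir" && "$@")
    fi
}

quick_start() {
    # Step 1 (fallback path only): unpack a downloaded archive into ./dataset.
    # Assumes the archive was saved in the repository root.
    if [ ! -f dataset/pubmed.8_2M.db ] && [ -f pubmed.8_2M.db.bz2 ]; then
        run_in . bzip2 -d pubmed.8_2M.db.bz2 &&
        run_in . mv pubmed.8_2M.db dataset/ || return 1
    fi
    # Step 2: build the AF-ICP binary with the five-iteration Makefile.
    run_in src make -f Makefile_itr5_aficp || return 1
    # Step 3: run AF-ICP on the PubMed data set from ./exe.
    run_in exe perl ./itr5_exeAFICP_8.2Mpubmed_perf.pl
}

quick_start
```

Setting `DRY_RUN=0` in the environment executes the commands instead of printing them. The script stops at the first failing step, so the perl script does not start after a failed build.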

Compare AF-ICP with other algorithms, ICP, TA-ICP, and CS-ICP

Go to Comparison.

License

See LICENSE for details.
