takeuchi-lab/Post-Hierarchical-Clustering-Inference

Post Hierarchical Clustering Inference

Abstract

Analyzing data with an underlying cluster structure, such as gene expression data or customers' purchase histories, and discovering trends specific to each cluster is an important task. However, clustering results depend on subjective choices, such as domain knowledge about the data, and are therefore not objective. We thus consider using statistical hypothesis testing to evaluate the reliability of clustering. When performing two-step inference, such as inference after clustering, the influence of the clustering step must be taken into account and corrected for appropriately. In this study, we first apply Ward's method, a hierarchical clustering method, to data whose mean differs from cluster to cluster, and obtain the hierarchical cluster structure. We then perform two valid hypothesis tests at each branch by exploiting the framework of Selective Inference.

Environmental Requirement

  • gcc 8.2.0
  • GNU Make 3.81
  • Install Eigen and OpenMP if you compile the C++ source
  • Python version 3.7.0
  • If a Python "ImportError" occurs, please install the missing packages

Usage

Hypothesis testing for differences between cluster centers

Under the cluster directory.

Compile (When necessary)

$ cd /cluster/cpp_source
$ make

Warning: Change the path of eigen as needed.

Preprocessing (When applied to real data)

preprocess.py

  • Split the data into a part for variance estimation and a part for p-value calculation (variance estimation : p-value calculation = 2 : 8).
  • Each variable of the data for p-value calculation is normalized.
  1. Execute preprocess.py on the data you want to preprocess.
  2. The data, stat, and interval directories are created, and the following files are created in the data directory.
    • data.csv : Data for p-value calculation
    • d_ind.csv : Indices of the data used for p-value calculation
    • estimate.csv : Data for variance estimation
    • sigma.csv, xi.csv :

Example

$ python preprocess.py data.csv
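
The split and normalization described above can be sketched as follows. This is a minimal illustration and not the code of preprocess.py: the random row split, the z-score normalization, and the function name split_and_normalize are all assumptions.

```python
import numpy as np


def split_and_normalize(X, ratio=0.2, seed=0):
    """Split rows of X 2:8 and normalize the larger part.

    Assumptions (not taken from preprocess.py): rows are split at random,
    and "normalized" means z-scoring each variable (column).
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    n_est = int(round(ratio * X.shape[0]))
    est_ind = np.sort(perm[:n_est])   # rows used for variance estimation
    d_ind = np.sort(perm[n_est:])     # rows used for p-value calculation
    estimate = X[est_ind]
    data = X[d_ind]
    # Standardize each variable of the p-value data.
    data = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    return data, d_ind, estimate


X = np.random.default_rng(1).normal(size=(50, 3))
data, d_ind, estimate = split_and_normalize(X)
```

Here data, d_ind, and estimate play the roles of data.csv, d_ind.csv, and estimate.csv.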

Clustering and Hypothesis Testing

Calculate p-value for one step

run calc_p.py

Arguments

  • path of data.csv
  • path of sigma.csv
  • path of xi.csv
  • step (0 ~ n - 2, where n is the sample size)
  • Number of cores (1: no parallel computation, 2 or more: parallel)

Example (parallel computation using 3 cores in the first step) 

$ python calc_p.py data/data.csv data/sigma.csv data/xi.csv 0 3
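
To illustrate what a step refers to, the sketch below runs Ward clustering with SciPy on random data. It assumes that step k corresponds to row k of the linkage matrix, which is plausible because output.csv uses the same format as Z of scipy linkage, but it is not confirmed by the source.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
n, d = 6, 2
X = rng.normal(size=(n, d))

# Ward linkage performs n - 1 merges; row k of Z describes merge k.
Z = linkage(X, method="ward")
steps = list(range(Z.shape[0]))  # step values 0 .. n - 2

step = 0
left, right, dist, size = Z[step]
print(f"step {step}: merge clusters {int(left)} and {int(right)} "
      f"(new cluster size {int(size)})")
```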

Calculate p-value for all steps

run calc_p_all.py

Arguments

  • path of data.csv
  • path of sigma.csv
  • path of xi.csv
  • Number of cores (1: no parallel computation, 2 or more: parallel)

Example (not parallel computation) 

$ python calc_p_all.py data/data.csv data/sigma.csv data/xi.csv 1

Warning: calc_p.py and calc_p_all.py call an executable file with the exe extension, so please change the extension as appropriate for your environment.

For both calc_p.py and calc_p_all.py, the p-value calculation results are output under the result directory:

  • naive_p.csv
  • selective_p.csv

Other outputs

  • stat directory : a csv file describing the test statistic and the number of dimensions of the data at each step
  • interval directory : the intervals required to calculate the selective p-value
  • cluster_result directory : the following csv file
    • output.csv (same format as Z of scipy linkage)
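
Because output.csv follows the format of a SciPy linkage matrix, it can in principle be loaded with NumPy and checked with SciPy. The sketch below uses a temporary stand-in file. The comma delimiter and the absence of a header row are assumptions about the file layout.

```python
import os
import tempfile

import numpy as np
from scipy.cluster.hierarchy import is_valid_linkage, linkage

# Stand-in for cluster_result/output.csv (hypothetical content).
rng = np.random.default_rng(0)
Z_true = linkage(rng.normal(size=(8, 3)), method="ward")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.csv")
    np.savetxt(path, Z_true, delimiter=",")
    Z = np.loadtxt(path, delimiter=",")

# Columns: first child, second child, merge distance, size of new cluster.
valid = is_valid_linkage(Z)
```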

Demo

Simple demo data is placed directly under cluster/data/


Example

$ python calc_p_all.py data/demo_data.csv data/demo_sigma.csv data/demo_xi.csv 1

Display dendrogram with p-value.

$ python display_dendro_p_cluster.py


Demo (synthetic)

The synthetic experiment program is in the demo_synthetic directory.
Please run calc_synthetic.py

Arguments

  • Number of epochs (default: 1000)
  • Sample size
  • Dimension
  • step (0 ~ n - 2, where n is the sample size)
  • Number of cores (1: no parallel computation, 2 or more: parallel)
  • Mean : Generate two clusters by shifting the mean of half of the data

Setting mean (the sixth argument) to 0.0 runs an FPR experiment; setting it to a value greater than 0.0 runs a TPR experiment.
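
The data generation implied by the Mean argument can be sketched as follows. This is an illustration only: the standard normal noise, the shift applied to every dimension of the first half of the rows, and the function name make_synthetic are assumptions, not the code of calc_synthetic.py.

```python
import numpy as np


def make_synthetic(n, d, mean, seed=None):
    """Generate n samples in d dimensions with half of them shifted by `mean`.

    Assumption: noise is standard normal and the shift is applied to every
    dimension of the first n // 2 rows.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X[: n // 2] += mean
    return X


X_null = make_synthetic(50, 5, 0.0, seed=0)  # no true clusters (FPR setting)
X_alt = make_synthetic(30, 10, 2.0, seed=0)  # two true clusters (TPR setting)
```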

FPR
Setting mean to 0.0 runs an FPR experiment, and the sample size argument n is treated as the maximum sample size.

Example1 (FPR experiment, sample size 50, dimension 5, first step, 1000 repetitions, no parallel computation)

$ python calc_synthetic.py 1000 50 5 0 1 0.0

TPR
Setting mean to a value greater than 0.0 runs a TPR experiment, and the given mean is treated as the maximum value (0.5, 1.0, 1.5, ..., up to the given mean).
In the TPR experiment, any value given for the step argument is automatically replaced with the last step.

Example2 (TPR experiment, sample size 30, dimension 10, last step, 100 repetitions, parallel computation using 3 cores, mean 2.0)

$ python calc_synthetic.py 100 30 10 28 3 2.0
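
Both FPR and TPR are rejection rates over the repeated experiments: the fraction of p-values below a significance level. A minimal sketch follows. The level 0.05, the function name rejection_rate, and the sample p-values are assumptions for illustration, not results from this repository.

```python
import numpy as np


def rejection_rate(p_values, alpha=0.05):
    """Fraction of p-values below alpha.

    With mean 0.0 this estimates the FPR; with mean > 0.0 it estimates the TPR.
    """
    p_values = np.asarray(p_values, dtype=float)
    return float(np.mean(p_values < alpha))


p = [0.01, 0.20, 0.04, 0.60, 0.90]  # made-up p-values
rate = rejection_rate(p)
```

For a valid test, the rate under mean 0.0 should stay close to the chosen significance level.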

Hypothesis testing for differences between each dimension of cluster centers

Under the each_dim directory

Preprocessing (When applied to real data)

preprocess.py

Same as preprocess.py in Hypothesis testing for differences between cluster centers

Clustering and Hypothesis Testing

Calculate p-value for one step

run execute.py

Arguments

  • path of data.csv
  • path of Sigma.csv
  • path of Xi.csv
  • step (0 ~ n - 2, where n is the sample size)
  • Number of cores (1: no parallel computation, 2 or more: parallel)

Example

$ python execute.py data/data.csv data/Sigma.csv data/Xi.csv 0 1

Calculate p-value for all steps

run execute_allstep.py

Arguments

  • path of data.csv
  • path of Sigma.csv
  • path of Xi.csv
  • Number of cores (1: no parallel computation, 2 or more: parallel)

Example

$ python execute_allstep.py data/data.csv data/Sigma.csv data/Xi.csv 1

For both execute.py and execute_allstep.py, the following csv files are output under the result directory.

  • output.csv (It has the same format as Z of scipy linkage)
  • naive_p.csv
  • selective_p.csv

Demo

Simple demo data is placed under each_dim/data/.
The data are not separated in the first dimension (horizontal axis) but are separated in the second dimension (vertical axis).
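
Data of this kind can be imitated with the sketch below, which shifts half of the points along the second dimension only and then inspects the two clusters merged at the last Ward step. The sample size, the shift of 8.0, and the noise model are assumptions. The actual demo data may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 2))
X[: n // 2, 1] += 8.0  # separate half of the points in the second dimension only

Z = linkage(X, method="ward")
# Cutting into two clusters recovers the pair merged at the last step.
labels = fcluster(Z, t=2, criterion="maxclust")

center_a = X[labels == 1].mean(axis=0)
center_b = X[labels == 2].mean(axis=0)
diff = np.abs(center_a - center_b)  # per-dimension difference of cluster centers
```

The difference in diff[1] is large while diff[0] stays near zero, which is the situation the per-dimension tests are meant to distinguish.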


Example

$ python execute_allstep.py data/demo_data.csv data/demo_Sigma.csv data/demo_Xi.csv 1

Display dendrogram with p-values.

$ python display_dendro_p_dim.py 0

A dendrogram annotated with the p-values of the test in the first dimension is obtained. Since the data are not separated in the first dimension, the p-values are large.


Test results in the second dimension. The p-value at the top step is small because the data are actually separated in the second dimension.

$ python display_dendro_p_dim.py 1


Demo (synthetic)

The synthetic experiment program is in the demo_synthetic directory.
Please run execute_synthetic.py

Arguments

  • Number of epochs (default: 1000)
  • Sample size
  • Dimension
  • step (0 ~ n - 2, where n is the sample size) or last
  • Number of cores (1: no parallel computation, 2 or more: parallel)
  • Mean : Generate two clusters by shifting the mean of half of the data

Setting mean (the sixth argument) to 0.0 runs an FPR experiment; setting it to a value greater than 0.0 runs a TPR experiment.

FPR
Setting mean to 0.0 runs an FPR experiment, and the sample size argument is treated as the maximum sample size.

Example1 (FPR experiment, sample size 50, dimension 5, first step, 1000 repetitions, no parallel computation)

$ python execute_synthetic.py 1000 50 5 0 1 0.0

TPR
Setting mean to a value greater than 0.0 runs a TPR experiment, and the given mean is treated as the maximum value (0.5, 1.0, 1.5, ..., up to the given mean).
In the TPR experiment, any value given for the step argument is automatically replaced with the last step.

Example2 (TPR experiment, sample size 30, dimension 10, last step, 100 repetitions, parallel computation using 3 cores, mean 2.0)

$ python execute_synthetic.py 100 30 10 28 3 2.0

Display dendrogram with p-value

Please import the pv_dendrogram function from /cluster/display_dendro_p_cluster.py or /each_dim/display_dendro_p_cluster.py

pv_dendrogram(sp, nap, start, output, root=0, width=100, height=0, decimal_place=3, font_size=15, **kwargs)

Arguments

  • sp: ndarray
    Selective p-values. One-dimensional array.
  • nap: ndarray
    Naive p-values. One-dimensional array.
  • start: int
    The hierarchy level from which p-values are displayed. If start = 0, the p-values of all steps are displayed.
  • output: list, ndarray
    Z of scipy.cluster.hierarchy.linkage.
  • root: int
    Number of times the square root is applied to the distances in Z. Default is 0.
  • width: double
    Horizontal spacing between the naive-p and selective-p values at each step. Default is 100.
  • height: double
    Vertical offset at which the naive-p and selective-p values of each step are displayed. The larger the value, the higher they are displayed. Default is 0.
  • decimal_place: int
    Number of decimal places displayed. Default is 3.
  • font_size: int
    Font size of the naive-p values, selective-p values, and legend. Default is 15.
  • **kwargs:
    Keyword arguments passed on to scipy.cluster.hierarchy.dendrogram.

Returns:
Output of scipy.cluster.hierarchy.dendrogram.
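
The sketch below illustrates two points from the description above using SciPy only: the effect that the root argument is described as having on the distance column of Z, and the dictionary that scipy.cluster.hierarchy.dendrogram returns. The helper apply_root is hypothetical and is not part of pv_dendrogram. Passing no_plot=True avoids drawing a figure.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
Z = linkage(rng.normal(size=(10, 2)), method="ward")


def apply_root(Z, root=0):
    """Take the square root of the distance column of Z `root` times.

    Hypothetical helper mirroring the description of the `root` argument.
    """
    Z = np.array(Z, dtype=float, copy=True)
    for _ in range(root):
        Z[:, 2] = np.sqrt(Z[:, 2])
    return Z


Z_scaled = apply_root(Z, root=2)  # distances raised to the power 1/4
result = dendrogram(Z_scaled, no_plot=True)  # dict with keys such as 'icoord', 'dcoord', 'leaves'
```

Compressing the distances this way keeps the merge order unchanged, so the tree shape is the same and only the branch heights change.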

Structure of directory

root/
    |- cluster/
          |- cluster_result/
          |- cpp_source/
          |- data/
          |- interval/
          |- stat/
          |- calc_p_all.py
          |- calc_p.py
          |- ...
    |- each_dim/
          |- cpp_source/
          |- data/
          |- result/...
          |- execute_allstep.py
          |- execute.py
          |- ...
    |- figs
    |- README.md

Notes

  • data.csv, sigma.csv, and xi.csv must contain values only; otherwise an error will occur.
