Analyzing data with underlying cluster structure, such as gene expression data or customers' purchase histories, and discovering trends specific to each cluster is an important task. However, clustering results depend on subjective choices such as domain knowledge of the data, and are not objective. We therefore consider using statistical hypothesis testing to evaluate the reliability of clustering. However, in two-step inference such as inference after clustering, the influence of the clustering step must be taken into account and corrected appropriately. In this study, we first apply the Ward method, a hierarchical clustering algorithm, to data whose mean structure differs across clusters, obtaining the cluster hierarchy. We then perform two valid hypothesis testing methods at each branch by exploiting the framework of Selective Inference.
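The clustering step described above corresponds to standard Ward linkage. A minimal, illustrative sketch using scipy (not this repository's own implementation) of how a cluster hierarchy is built from data with cluster-specific means:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# toy data: two clusters with different mean structures
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(25, 5)),
               rng.normal(4.0, 1.0, size=(25, 5))])

# Ward's method builds the hierarchy bottom-up, at each step merging
# the pair of clusters that minimizes the increase in within-cluster variance
Z = linkage(X, method="ward")

labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
print(len(set(labels)))  # 2
```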
- gcc 8.2.0
- GNU Make 3.81
- Install eigen and OpenMP if compiling the C++ sources
- Python version 3.7.0
- Please install the required packages if a Python `ImportError` occurs
Under the `cluster` directory.

$ cd /cluster/cpp_source
$ make

Warning: change the path of eigen as needed.
preprocess.py

- Splits the data into a portion for variance estimation and a portion for p-value calculation (variance estimation : p-value calculation = 2 : 8).
- The data for p-value calculation is normalized for each variable.
- Run preprocess.py on the data you want to preprocess. The `data`, `stat`, and `interval` directories are created, and the following files are created in the `data` directory.
Example
$ python preprocess.py data.csv
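The preprocessing above (a 2 : 8 split followed by per-variable normalization) can be sketched as follows; the function name and random split here are illustrative assumptions, not the repository's actual implementation.

```python
import numpy as np

def split_and_normalize(X, ratio=0.2, seed=0):
    """Split the rows of X 2:8 into (variance-estimation, p-value) parts,
    then standardize each variable (column) of the p-value part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_var = int(len(X) * ratio)
    X_var = X[idx[:n_var]]           # used to estimate the variance
    X_p = X[idx[n_var:]]             # used to compute p-values
    # normalize each variable of the p-value part
    X_p = (X_p - X_p.mean(axis=0)) / X_p.std(axis=0)
    return X_var, X_p

X = np.random.default_rng(1).normal(size=(100, 5))
X_var, X_p = split_and_normalize(X)
print(X_var.shape, X_p.shape)  # (20, 5) (80, 5)
```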
Calculate p-value for one step
run calc_p.py
Arguments
- path of `data.csv`
- path of `sigma.csv`
- path of `xi.csv`
- step (0 ~ n - 2)
- Whether to compute in parallel (2 or more: parallel)
Example (parallel computation using 3 cores in the first step)
$ python calc_p.py data/data.csv data/sigma.csv data/xi.csv 0 3
Calculate p-value for all steps
run calc_p_all.py
Arguments
- path of `data.csv`
- path of `sigma.csv`
- path of `xi.csv`
- Whether to compute in parallel (2 or more: parallel)
Example (no parallel computation)
$ python calc_p_all.py data/data.csv data/sigma.csv data/xi.csv 1
Warning: the file extension of the executable invoked by `calc_p_all.py` and `calc_p.py` is `exe`, so please change it as appropriate for your platform.
For both `calc_p.py` and `calc_p_all.py`, the p-value calculation results are output under the `result` directory:

- naive_p.csv
- selective_p.csv
Other outputs

- `stat` directory: outputs a csv file describing the test statistic and the dimensionality of the data at each step.
- `interval` directory: outputs the intervals required when calculating the selective p-value.
- `cluster_result` directory: outputs the following csv file.
  - output.csv (same format as the `Z` matrix of scipy's `linkage`)
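For reference, `Z` is the (n − 1) × 4 linkage matrix produced by `scipy.cluster.hierarchy.linkage`: each row records the two merged cluster indices, the merge distance, and the size of the new cluster. A minimal illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# four 2-D points forming two obvious pairs
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
Z = linkage(X, method="ward")  # same Ward criterion used in this repository

print(Z.shape)  # (3, 4): n - 1 merge steps, 4 columns
# columns: [cluster index 1, cluster index 2, merge distance, new cluster size]
print(Z)
```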
Simple demo data is placed directly under cluster/data/
Example
$ python calc_p_all.py data/demo_data.csv data/demo_sigma.csv data/demo_xi.csv 1
Display the dendrogram with p-values.
$ python display_dendro_p_cluster.py
The synthetic experiments program is in the `demo_synthetic` directory.

Please run calc_synthetic.py
Arguments
- Number of epochs (default: 1000)
- Sample size
- Dimension
- step (0 ~ n - 2)
- Whether to compute in parallel (2 or more: parallel)
- Mean: generate two clusters by changing the mean of half of the data

Setting mean (the sixth argument) to 0.0 gives an FPR experiment, and setting it to a value greater than 0.0 gives a TPR experiment.
FPR
Setting mean to 0.0 gives an FPR experiment, and the specified sample size n is treated as the maximum size.
Example 1 (FPR experiment, first step, 1000 repetitions, no parallel computation)
$ python calc_synthetic.py 1000 50 5 0 1 0.0
TPR
Setting mean to a value greater than 0.0 gives a TPR experiment; the specified mean is treated as the maximum, and results are computed for means 0.5, 1.0, 1.5, ..., up to that value.
In the TPR experiment, any value entered in the argument of step will be automatically changed to the last step.
Example 2 (TPR experiment, last step, 100 repetitions, parallel computation using 3 cores, mean 2.0)
$ python calc_synthetic.py 100 30 10 28 3 2.0
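The synthetic data described above (two clusters created by changing the mean of half of the samples) can be sketched as follows; this is an illustrative reconstruction, not the script's exact code.

```python
import numpy as np

def make_two_clusters(n=30, dim=10, mean=2.0, seed=0):
    """Standard normal data in which the first half of the rows is
    shifted by `mean`; mean == 0.0 gives the null (FPR) setting."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, dim))
    X[: n // 2] += mean  # change the mean of half of the data
    return X

X = make_two_clusters()
print(X.shape)  # (30, 10)
```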
Under the `each_dim` directory.

preprocess.py

Same as preprocess.py in "Hypothesis testing for differences between cluster centers".
Calculate p-value for one step
run execute.py
Arguments
- path of `data.csv`
- path of `Sigma.csv`
- path of `Xi.csv`
- step (0 ~ n - 2)
- Whether to compute in parallel (2 or more: parallel)
Example
$ python execute.py data/data.csv data/Sigma.csv data/Xi.csv 0 1
run execute_allstep.py
Arguments
- path of `data.csv`
- path of `Sigma.csv`
- path of `Xi.csv`
- Whether to compute in parallel (2 or more: parallel)
Example
$ python execute_allstep.py data/data.csv data/Sigma.csv data/Xi.csv 1
For both `execute.py` and `execute_allstep.py`, the following csv files are output under the `result` directory.

- output.csv (same format as the `Z` matrix of scipy's `linkage`)
- naive_p.csv
- selective_p.csv
Simple demo data is placed under `each_dim/data/`.
The data are not separated in the first dimension (horizontal axis) but are separated in the second dimension (vertical axis).
Example
$ python execute_allstep.py data/demo_data.csv data/demo_Sigma.csv data/demo_Xi.csv 1
Display dendrogram with p-values.
$ python display_dendro_p_dim.py 0
This yields a dendrogram with the p-values of the test in the first dimension. Since the data are not separated in the first dimension, large p-values are obtained.

Test results in the second dimension: at the top step the p-value is small, because the data are actually separated in the second dimension.
$ python display_dendro_p_dim.py 1
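Demo data of this shape (overlapping in dimension 1, separated in dimension 2) can be generated along these lines; this sketch is illustrative, and the bundled demo file may have been generated differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 2))
# dimension 1 (column 0): both halves share one distribution -> no separation
# dimension 2 (column 1): shift the second half upward -> clear separation
X[n // 2:, 1] += 4.0

# the group means differ essentially only in the second dimension
print(np.abs(X[:n // 2].mean(axis=0) - X[n // 2:].mean(axis=0)))
```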
The synthetic experiments program is in the `demo_synthetic` directory.

Please run execute_synthetic.py
Arguments
- Number of epochs (default: 1000)
- Sample size
- Dimension
- step (0 ~ n - 2) or last
- Whether to compute in parallel (2 or more: parallel)
- Mean: generate two clusters by changing the mean of half of the data

Setting mean (the sixth argument) to 0.0 gives an FPR experiment, and setting it to a value greater than 0.0 gives a TPR experiment.
FPR
Setting mean to 0.0 gives an FPR experiment, and the specified sample size is treated as the maximum size.
Example 1 (FPR experiment, first step, 1000 repetitions, no parallel computation)
$ python execute_synthetic.py 1000 50 5 0 1 0.0
TPR
Setting mean to a value greater than 0.0 gives a TPR experiment; the specified mean is treated as the maximum, and results are computed for means 0.5, 1.0, 1.5, ..., up to that value.
In the TPR experiment, any value entered in the argument of step will be automatically changed to the last step.
Example 2 (TPR experiment, last step, 100 repetitions, parallel computation using 3 cores, mean 2.0)
$ python execute_synthetic.py 100 30 10 28 3 2.0
Please import the `pv_dendrogram` function from `/cluster/display_dendro_p_cluster.py` or `/each_dim/display_dendro_p_dim.py`.

pv_dendrogram
pv_dendrogram(sp, nap, start, output, root=0, width=100, height=0, decimal_place=3, font_size=15, **kwargs)
Arguments
- sp: ndarray
  One-dimensional ndarray (the selective p-values).
- nap: ndarray
  One-dimensional ndarray (the naive p-values).
- start: int
  From which hierarchy level to display the p-values. If start = 0, the p-values of all steps are displayed.
- output: list, ndarray
  `Z` of scipy.cluster.hierarchy.linkage.
- root: int
  Takes the square root of the distances in `Z` the specified number of times. Default is 0.
- width: double
  Width between the naive and selective p-values at each step. Default is 100.
- height: double
  Height at which the naive and selective p-values of each step are displayed; the larger the value, the higher they are placed. Default is 0.
- decimal_place: int
  Number of decimal places displayed. Default is 3.
- font_size: int
  Font size of the naive p-values, selective p-values, and legend. Default is 15.
- **kwargs:
  Keyword arguments of scipy.cluster.hierarchy.dendrogram can be specified.
Returns:
Output of scipy.cluster.hierarchy.dendrogram.
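Since `pv_dendrogram` returns the output of `scipy.cluster.hierarchy.dendrogram`, the return value is the usual dendrogram dictionary. A minimal look at that structure, using scipy directly rather than this repository's function:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[0.0], [0.2], [5.0], [5.3]])
Z = linkage(X, method="ward")

# no_plot=True returns the plotting data without drawing anything
R = dendrogram(Z, no_plot=True)
print(sorted(R.keys()))   # includes 'icoord', 'dcoord', 'ivl', 'leaves', ...
print(R["ivl"])           # leaf labels in display order
```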
root/
|- cluster/
|- cluster_result/
|- cpp_source/
|- data/
|- interval/
|- stat/
|- calc_p_all.py
|- calc_p.py
|- ...
|- each_dim/
|- cpp_source/
|- data/
|- result/...
|- execute_allstep.py
|- execute.py
|- ...
|- figs
|- README.md
`data.csv`, `sigma.csv`, and `xi.csv` must be in a value-only format (numbers only, with no header row or index column); otherwise an error will occur.
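A value-only csv of the expected shape can be written with numpy; the header-free layout shown here reflects the assumption stated above.

```python
import os
import tempfile

import numpy as np

X = np.random.default_rng(0).normal(size=(10, 3))
path = os.path.join(tempfile.gettempdir(), "data.csv")

# savetxt writes numbers only: no header row, no index column
np.savetxt(path, X, delimiter=",")

loaded = np.loadtxt(path, delimiter=",")
print(loaded.shape)  # (10, 3)
```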