Analyzing data with underlying cluster structure, such as gene expression data or customers' purchase histories, and discovering trends specific to each cluster is an important task. However, clustering results depend on subjective choices such as domain knowledge of the data, and are not objective. We therefore consider using statistical hypothesis testing to evaluate the reliability of clustering. However, in two-step inference such as inference after clustering, the influence of the clustering step must be taken into account and corrected appropriately. In this study, we first apply the Ward method, a hierarchical clustering algorithm, to data whose mean structure differs across clusters, obtaining the cluster hierarchy. We then perform two valid hypothesis testing methods at each branch by exploiting the framework of Selective Inference.
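The clustering step described above corresponds to standard Ward linkage. A minimal, illustrative sketch using scipy (not this repository's own implementation) of how a cluster hierarchy is built from data with cluster-specific means:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# toy data: two clusters with different mean structures
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(25, 5)),
               rng.normal(4.0, 1.0, size=(25, 5))])

# Ward's method builds the hierarchy bottom-up, at each step merging
# the pair of clusters that minimizes the increase in within-cluster variance
Z = linkage(X, method="ward")

labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
print(len(set(labels)))  # 2
```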
- gcc 8.2.0
- GNU Make 3.81
- Install eigen and OpenMP if compiling the C++ sources
- Python version 3.7.0
- Please install the required packages if a Python `ImportError` occurs
Under the `cluster` directory.

$ cd /cluster/cpp_source
$ make

Warning: change the path of eigen as needed.
preprocess.py

- Splits the data into a portion for variance estimation and a portion for p-value calculation (variance estimation : p-value calculation = 2 : 8).
- The data for p-value calculation is normalized for each variable.
- Run preprocess.py on the data you want to preprocess. The `data`, `stat`, and `interval` directories are created, and the following files are created in the `data` directory.
Example
$ python preprocess.py data.csv
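The preprocessing above (a 2 : 8 split followed by per-variable normalization) can be sketched as follows; the function name and random split here are illustrative assumptions, not the repository's actual implementation.

```python
import numpy as np

def split_and_normalize(X, ratio=0.2, seed=0):
    """Split the rows of X 2:8 into (variance-estimation, p-value) parts,
    then standardize each variable (column) of the p-value part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_var = int(len(X) * ratio)
    X_var = X[idx[:n_var]]           # used to estimate the variance
    X_p = X[idx[n_var:]]             # used to compute p-values
    # normalize each variable of the p-value part
    X_p = (X_p - X_p.mean(axis=0)) / X_p.std(axis=0)
    return X_var, X_p

X = np.random.default_rng(1).normal(size=(100, 5))
X_var, X_p = split_and_normalize(X)
print(X_var.shape, X_p.shape)  # (20, 5) (80, 5)
```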
Calculate p-value for one step
run calc_p.py
Arguments
- path of `data.csv`
- path of `sigma.csv`
- path of `xi.csv`
- step (0 ~ n - 2)
- Whether to compute in parallel (2 or more: parallel)
Example (parallel computation using 3 cores in the first step)
$ python calc_p.py data/data.csv data/sigma.csv data/xi.csv 0 3
Calculate p-value for all steps
run calc_p_all.py
Arguments
- path of `data.csv`
- path of `sigma.csv`
- path of `xi.csv`
- Whether to compute in parallel (2 or more: parallel)
Example (no parallel computation)
$ python calc_p_all.py data/data.csv data/sigma.csv data/xi.csv 1
Warning: the file extension of the executable invoked by `calc_p_all.py` and `calc_p.py` is `exe`, so please change it as appropriate for your platform.
For both `calc_p.py` and `calc_p_all.py`, the p-value calculation results are output under the `result` directory:

- naive_p.csv
- selective_p.csv
Other outputs

- `stat` directory: outputs a csv file describing the test statistic and the dimensionality of the data at each step.
- `interval` directory: outputs the intervals required when calculating the selective p-value.
- `cluster_result` directory: outputs the following csv file.
  - output.csv (same format as the `Z` matrix of scipy's `linkage`)
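For reference, `Z` is the (n − 1) × 4 linkage matrix produced by `scipy.cluster.hierarchy.linkage`: each row records the two merged cluster indices, the merge distance, and the size of the new cluster. A minimal illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# four 2-D points forming two obvious pairs
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
Z = linkage(X, method="ward")  # same Ward criterion used in this repository

print(Z.shape)  # (3, 4): n - 1 merge steps, 4 columns
# columns: [cluster index 1, cluster index 2, merge distance, new cluster size]
print(Z)
```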
Simple demo data is placed directly under cluster/data/
Example
$ python calc_p_all.py data/demo_data.csv data/demo_sigma.csv data/demo_xi.csv 1
Display the dendrogram with p-values.
$ python display_dendro_p_cluster.py
The synthetic experiments program is in the `demo_synthetic` directory.

Please run calc_synthetic.py
Arguments
- Number of epochs (default: 1000)
- Sample size
- Dimension
- step (0 ~ n - 2)
- Whether to compute in parallel (2 or more: parallel)
- Mean: generate two clusters by changing the mean of half of the data

Setting mean (the sixth argument) to 0.0 gives an FPR experiment, and setting it to a value greater than 0.0 gives a TPR experiment.
FPR
Setting mean to 0.0 gives an FPR experiment, and the specified sample size n is treated as the maximum size.
Example 1 (FPR experiment, first step, 1000 repetitions, no parallel computation)
$ python calc_synthetic.py 1000 50 5 0 1 0.0
TPR
Setting mean to a value greater than 0.0 gives a TPR experiment; the specified mean is treated as the maximum, and results are computed for means 0.5, 1.0, 1.5, ..., up to that value.
In the TPR experiment, any value entered in the argument of step will be automatically changed to the last step.
Example 2 (TPR experiment, last step, 100 repetitions, parallel computation using 3 cores, mean 2.0)
$ python calc_synthetic.py 100 30 10 28 3 2.0
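The synthetic data described above (two clusters created by changing the mean of half of the samples) can be sketched as follows; this is an illustrative reconstruction, not the script's exact code.

```python
import numpy as np

def make_two_clusters(n=30, dim=10, mean=2.0, seed=0):
    """Standard normal data in which the first half of the rows is
    shifted by `mean`; mean == 0.0 gives the null (FPR) setting."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, dim))
    X[: n // 2] += mean  # change the mean of half of the data
    return X

X = make_two_clusters()
print(X.shape)  # (30, 10)
```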
Under the `each_dim` directory.

preprocess.py

Same as preprocess.py in "Hypothesis testing for differences between cluster centers".
Calculate p-value for one step
run execute.py
Arguments
- path of `data.csv`
- path of `Sigma.csv`
- path of `Xi.csv`
- step (0 ~ n - 2)
- Whether to compute in parallel (2 or more: parallel)
Example
$ python execute.py data/data.csv data/Sigma.csv data/Xi.csv 0 1
run execute_allstep.py
Arguments
- path of `data.csv`
- path of `Sigma.csv`
- path of `Xi.csv`
- Whether to compute in parallel (2 or more: parallel)
Example
$ python execute_allstep.py data/data.csv data/Sigma.csv data/Xi.csv 1
For both `execute.py` and `execute_allstep.py`, the following csv files are output under the `result` directory.

- output.csv (same format as the `Z` matrix of scipy's `linkage`)
- naive_p.csv
- selective_p.csv
Simple demo data is placed under `each_dim/data/`.
The data are not separated in the first dimension (horizontal axis) but are separated in the second dimension (vertical axis).
Example
$ python execute_allstep.py data/demo_data.csv data/demo_Sigma.csv data/demo_Xi.csv 1
Display dendrogram with p-values.
$ python display_dendro_p_dim.py 0
This yields a dendrogram with the p-values of the test in the first dimension. Since the data are not separated in the first dimension, large p-values are obtained.

Test results in the second dimension: at the top step the p-value is small, because the data are actually separated in the second dimension.
$ python display_dendro_p_dim.py 1
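Demo data of this shape (overlapping in dimension 1, separated in dimension 2) can be generated along these lines; this sketch is illustrative, and the bundled demo file may have been generated differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 2))
# dimension 1 (column 0): both halves share one distribution -> no separation
# dimension 2 (column 1): shift the second half upward -> clear separation
X[n // 2:, 1] += 4.0

# the group means differ essentially only in the second dimension
print(np.abs(X[:n // 2].mean(axis=0) - X[n // 2:].mean(axis=0)))
```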
The synthetic experiments program is in the `demo_synthetic` directory.

Please run execute_synthetic.py
Arguments
- Number of epochs (default: 1000)
- Sample size
- Dimension
- step (0 ~ n - 2) or last
- Whether to compute in parallel (2 or more: parallel)
- Mean: generate two clusters by changing the mean of half of the data

Setting mean (the sixth argument) to 0.0 gives an FPR experiment, and setting it to a value greater than 0.0 gives a TPR experiment.
FPR
Setting mean to 0.0 gives an FPR experiment, and the specified sample size is treated as the maximum size.
Example 1 (FPR experiment, first step, 1000 repetitions, no parallel computation)
$ python execute_synthetic.py 1000 50 5 0 1 0.0
TPR
Setting mean to a value greater than 0.0 gives a TPR experiment; the specified mean is treated as the maximum, and results are computed for means 0.5, 1.0, 1.5, ..., up to that value.
In the TPR experiment, any value entered in the argument of step will be automatically changed to the last step.
Example 2 (TPR experiment, last step, 100 repetitions, parallel computation using 3 cores, mean 2.0)
$ python execute_synthetic.py 100 30 10 28 3 2.0
Please import the `pv_dendrogram` function from `/cluster/display_dendro_p_cluster.py` or `/each_dim/display_dendro_p_dim.py`.

pv_dendrogram
pv_dendrogram(sp, nap, start, output, root=0, width=100, height=0, decimal_place=3, font_size=15, **kwargs)
Arguments
- sp: ndarray
  One-dimensional ndarray (the selective p-values).
- nap: ndarray
  One-dimensional ndarray (the naive p-values).
- start: int
  From which hierarchy level to display the p-values. If start = 0, the p-values of all steps are displayed.
- output: list, ndarray
  `Z` of scipy.cluster.hierarchy.linkage.
- root: int
  Takes the square root of the distances in `Z` the specified number of times. Default is 0.
- width: double
  Width between the naive and selective p-values at each step. Default is 100.
- height: double
  Height at which the naive and selective p-values of each step are displayed; the larger the value, the higher they are placed. Default is 0.
- decimal_place: int
  Number of decimal places displayed. Default is 3.
- font_size: int
  Font size of the naive p-values, selective p-values, and legend. Default is 15.
- **kwargs:
  Keyword arguments of scipy.cluster.hierarchy.dendrogram can be specified.
Returns:
Output of scipy.cluster.hierarchy.dendrogram.
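Since `pv_dendrogram` returns the output of `scipy.cluster.hierarchy.dendrogram`, the return value is the usual dendrogram dictionary. A minimal look at that structure, using scipy directly rather than this repository's function:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[0.0], [0.2], [5.0], [5.3]])
Z = linkage(X, method="ward")

# no_plot=True returns the plotting data without drawing anything
R = dendrogram(Z, no_plot=True)
print(sorted(R.keys()))   # includes 'icoord', 'dcoord', 'ivl', 'leaves', ...
print(R["ivl"])           # leaf labels in display order
```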
root/
|- cluster/
|- cluster_result/
|- cpp_source/
|- data/
|- interval/
|- stat/
|- calc_p_all.py
|- calc_p.py
|- ...
|- each_dim/
|- cpp_source/
|- data/
|- result/...
|- execute_allstep.py
|- execute.py
|- ...
|- figs
|- README.md
`data.csv`, `sigma.csv`, and `xi.csv` must be in a value-only format (numbers only, with no header row or index column); otherwise an error will occur.
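A value-only csv of the expected shape can be written with numpy; the header-free layout shown here reflects the assumption stated above.

```python
import os
import tempfile

import numpy as np

X = np.random.default_rng(0).normal(size=(10, 3))
path = os.path.join(tempfile.gettempdir(), "data.csv")

# savetxt writes numbers only: no header row, no index column
np.savetxt(path, X, delimiter=",")

loaded = np.loadtxt(path, delimiter=",")
print(loaded.shape)  # (10, 3)
```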