Skip to content

Latest commit

 

History

History
100 lines (66 loc) · 6.09 KB

README.md

File metadata and controls

100 lines (66 loc) · 6.09 KB

The Gap Statistic

An implementation of the gap statistic in R. The gap statistic, proposed by Robert Tibshirani, Guenther Walther, and Trevor Hastie, is a method for estimating the number of clusters in a set of data. For details, please check the original paper "Estimating the number of clusters in a data set via the gap statistic".

I found that the existing implementation clusGap in the cluster package is not completely align to the original algorithm, so I decided to push my implementation onto github for those who are interested.

Warning: the code is not robust. If you encounter any problem with the code, please contact me at Github or send me an email.

What's the Difference?

The Reference Distribution

The original paper proposed two ways to generate reference distribution:

Unif: Generate each reference feature uniformly over the range of the observed values for that feature.

And:

PC: Generate the reference features from a uniform distribution over a box aligned with the principal components of the data.

The clusGap only provides the PC method, but the experiment results in the paper says that the PC way does not always out perform the Unif method; sometimes the Unif way actually does better (see Table 1 of the paper).

I implemented both ways to generate the reference distribution. To decide which method to use, just determine the parameter pc when using the function. For instance:

	gap(iris[,-5], pc = F)

If we assign pc to False, the Unif way will be used; If we assign pc to True, the PC way will be used.

Note: There's another implementation implemented by echen on Github. Only the Unif way is implemented in echen's code.

Standard Deviation Calculation

The paper says we should select the smallest k where 1 < k <= max such that Gap(k) <= Gap(k+1) - s(k+1), where s is some value multiplied by standard deviation of the within-group dispersion drawn from several samples. In clusGap, the built-in var function is used to calculate standard deviation (as we know, standard deviation is the square root of variance.) However, in the built-in var function, the denominator of the standard deviation is deducted by the degree of freedom (which is 1), which is different from the original algorithm (the denominator of the standard deviation proposed in the paper is not deducted by 1)!

Despite the fact that it may not cause big difference, I still decide to implement the original paper's version (not to use the built-in function var).

Example

The usage is similar to clusGap. For instance:

	source('your/path/to/gap.r')
	set.seed(1)
	x = c(rnorm(10, 0, 1), rnorm(10, 5, 1.5), rnorm(10, 10, 1))
	y = c(rnorm(10, 10, 1), rnorm(10, 5, 1), rnorm(10, 0, 1))
	data = cbind(x,y)
	gap.stat(data = data, max = 10, clusFUN = kmeans, pc = T, B = 50, iter.max = 10, tibs = F) # B indicates how many samples to select from the reference distribution.
	# The "iter.max" parameter is for the "kmeans" function.

Note that you only have to specify the data like gap.stat(data), which gives you the same result. The result is:

	logW    ElogW          S           Gap         GapMinusS
	[1,] 3.932728 3.764804 0.06969463 -0.16792329 -0.23761792
	[2,] 3.190214 3.120146 0.06970189 -0.07006775 -0.13976964
	[3,] 2.477820 2.776090 0.06688969  0.29827007  0.23138038
	[4,] 2.327328 2.577955 0.07746779  0.25062745  0.17315966
	[5,] 2.215417 2.399859 0.07217596  0.18444261  0.11226665
	[6,] 2.050635 2.274811 0.07124480  0.22417568  0.15293088
	[7,] 1.974787 2.132784 0.08352412  0.15799618  0.07447206
	[8,] 1.821773 2.018260 0.08526062  0.19648702  0.11122640
	[9,] 1.664564 1.915915 0.09375918  0.25135046  0.15759129
	[10,] 1.569001 1.766674 0.10887759  0.19767347  0.08879588
	
	$GlobalMaxClusterNum
	[1] 3
	
	$TibsClusterNum
	[1] 3

	$ClusterNum
	[1] 3

You can plot(data) if you'd like to see the distribution of the generated data.

$GlobalMaxClusterNum gives the number of cluster that have the biggest gap of within-group dispersion between the given data and reference distribution; $TibsClusterNum gives the number of cluster and is generated by the original method proposed in the paper. If you assign the parameter tibs to True, then $ClusterNum gives same number as $TibsClusterNum; otherwise, $ClusterNum gives same number as $GlobalMaxClusterNum.

If you want to specify a certain clustering method (kmeans, hclust, etc), just add an argument clusFUN. For instance:

	gap.stat(iris[,-5], clusFUN = hclust, method = "average")

The method = "average" argument is used by the hclust function. This instruction makes gap.stat to use hierarchical (agglomerative) clustering with average linkage to cluster the data.

Example - Other Approaches

The paper considers 4 other approaches to be baselines/benchmarks. To make it easier to compare performances of different approaches, I wrapped up the mentioned 4 approaches. To use them, just source('your/path/to/other.r'). For instance:

	source('your/path/to/other.r')
	set.seed(1)
	x = c(rnorm(10, 0, 1), rnorm(10, 5, 1.5), rnorm(10, 10, 1))
	y = c(rnorm(10, 10, 1), rnorm(10, 5, 1), rnorm(10, 0, 1))
	data = cbind(x,y)
	KL.stat(data, max = 10)
	H.stat(data)
	s.stat(data)
	CH.stat(data)

The 4 functions takes same parameters. Note that these 4 functions do not take pc and B arguments.