diff --git a/docs/_main_files/figure-html/unnamed-chunk-100-1.png b/docs/_main_files/figure-html/unnamed-chunk-100-1.png
index 184b26c..1e6c325 100644
Binary files a/docs/_main_files/figure-html/unnamed-chunk-100-1.png and b/docs/_main_files/figure-html/unnamed-chunk-100-1.png differ
diff --git a/docs/_main_files/figure-html/unnamed-chunk-101-1.png b/docs/_main_files/figure-html/unnamed-chunk-101-1.png
index c16cc72..184b26c 100644
Binary files a/docs/_main_files/figure-html/unnamed-chunk-101-1.png and b/docs/_main_files/figure-html/unnamed-chunk-101-1.png differ
diff --git a/docs/_main_files/figure-html/unnamed-chunk-102-1.png b/docs/_main_files/figure-html/unnamed-chunk-102-1.png
new file mode 100644
index 0000000..c16cc72
Binary files /dev/null and b/docs/_main_files/figure-html/unnamed-chunk-102-1.png differ
diff --git a/docs/_main_files/figure-html/unnamed-chunk-43-1.png b/docs/_main_files/figure-html/unnamed-chunk-43-1.png
index cfbad06..dabb070 100644
Binary files a/docs/_main_files/figure-html/unnamed-chunk-43-1.png and b/docs/_main_files/figure-html/unnamed-chunk-43-1.png differ
diff --git a/docs/_main_files/figure-html/unnamed-chunk-74-1.png b/docs/_main_files/figure-html/unnamed-chunk-74-1.png
new file mode 100644
index 0000000..d4fc79e
Binary files /dev/null and b/docs/_main_files/figure-html/unnamed-chunk-74-1.png differ
diff --git a/docs/_main_files/figure-html/unnamed-chunk-77-1.png b/docs/_main_files/figure-html/unnamed-chunk-77-1.png
index 3f7d3c3..ddafd43 100644
Binary files a/docs/_main_files/figure-html/unnamed-chunk-77-1.png and b/docs/_main_files/figure-html/unnamed-chunk-77-1.png differ
diff --git a/docs/_main_files/figure-html/unnamed-chunk-78-1.png b/docs/_main_files/figure-html/unnamed-chunk-78-1.png
index f19ef62..3f7d3c3 100644
Binary files a/docs/_main_files/figure-html/unnamed-chunk-78-1.png and b/docs/_main_files/figure-html/unnamed-chunk-78-1.png differ
diff --git a/docs/_main_files/figure-html/unnamed-chunk-79-1.png b/docs/_main_files/figure-html/unnamed-chunk-79-1.png
index 3f55a0d..f19ef62 100644
Binary files a/docs/_main_files/figure-html/unnamed-chunk-79-1.png and b/docs/_main_files/figure-html/unnamed-chunk-79-1.png differ
diff --git a/docs/_main_files/figure-html/unnamed-chunk-82-1.png b/docs/_main_files/figure-html/unnamed-chunk-82-1.png
new file mode 100644
index 0000000..ba0dda2
Binary files /dev/null and b/docs/_main_files/figure-html/unnamed-chunk-82-1.png differ
diff --git a/docs/_main_files/figure-html/unnamed-chunk-91-1.png b/docs/_main_files/figure-html/unnamed-chunk-91-1.png
index d539b78..caa592f 100644
Binary files a/docs/_main_files/figure-html/unnamed-chunk-91-1.png and b/docs/_main_files/figure-html/unnamed-chunk-91-1.png differ
diff --git a/docs/_main_files/figure-html/unnamed-chunk-92-1.png b/docs/_main_files/figure-html/unnamed-chunk-92-1.png
index 6d22954..d539b78 100644
Binary files a/docs/_main_files/figure-html/unnamed-chunk-92-1.png and b/docs/_main_files/figure-html/unnamed-chunk-92-1.png differ
diff --git a/docs/_main_files/figure-html/unnamed-chunk-94-1.png b/docs/_main_files/figure-html/unnamed-chunk-94-1.png
index 6d22954..d91a2d3 100644
Binary files a/docs/_main_files/figure-html/unnamed-chunk-94-1.png and b/docs/_main_files/figure-html/unnamed-chunk-94-1.png differ
diff --git a/docs/_main_files/figure-html/unnamed-chunk-95-1.png b/docs/_main_files/figure-html/unnamed-chunk-95-1.png
index 55d3eed..6d22954 100644
Binary files a/docs/_main_files/figure-html/unnamed-chunk-95-1.png and b/docs/_main_files/figure-html/unnamed-chunk-95-1.png differ
diff --git a/docs/_main_files/figure-html/unnamed-chunk-98-1.png b/docs/_main_files/figure-html/unnamed-chunk-98-1.png
index 184b26c..55d3eed 100644
Binary files a/docs/_main_files/figure-html/unnamed-chunk-98-1.png and b/docs/_main_files/figure-html/unnamed-chunk-98-1.png differ
diff --git a/docs/calibration-and-validation.html b/docs/calibration-and-validation.html
index dc11180..f798631 100644
--- a/docs/calibration-and-validation.html
+++ b/docs/calibration-and-validation.html
@@ -220,7 +220,7 @@
Let’s look at the plots, starting with the summary plot.
plot(m)
The plot is very similar to the one we have seen for the PCA model; the only difference is that it shows modelling power instead of loadings. Modelling power is a measure of the contribution of each variable to the model and varies from 0 to 1. Variables with modelling power below 0.1 are usually considered irrelevant.
Let’s take a closer look at the residuals plot with different values of alpha (we will keep the number of components equal to three in all cases).
m1 = simca(X.ver, 'versicolor', ncomp = 3, cv = 1, alpha = 0.01)
@@ -233,7 +233,7 @@ Calibration and validation
plotResiduals(m2)
plotResiduals(m3)
plotResiduals(m4)
As you can see, using alpha = 0.01 reduced the number of false negatives to zero, as the acceptance limits became larger, while alpha = 0.15 gives a lot of incorrectly rejected class members. It must be noted that decreasing alpha will also lead to a larger number of false positives, which we cannot see in this case.
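The trade-off above can be illustrated with a small base R sketch. This is a generic illustration that uses a chi-square quantile as the critical limit for a distance statistic; it is not the exact statistic mdatools uses for SIMCA acceptance limits.

```r
# Generic illustration (not mdatools internals): the critical limit grows as
# alpha shrinks, so alpha = 0.01 accepts more class members (fewer false
# negatives) than alpha = 0.15, at the price of accepting more strangers.
alpha <- c(0.01, 0.05, 0.10, 0.15)
lim <- qchisq(1 - alpha, df = 2)   # example limit for a distance statistic
print(round(setNames(lim, paste0("alpha=", alpha)), 2))
```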
The classification results can be shown both graphically and numerically. Here is a prediction plot for the results.
par(mfrow = c(2, 1))
plotPredictions(res)
plotPredictions(res, ncomp = 2)
So we can see that for the model with three components we have no false positives (specificity = 1) and one false negative (sensitivity = 24/25 = 0.96). You can also show the predictions as a matrix with -1 and +1 using the method showPredictions(),
or get the array with predicted class values directly, as shown in the example below (for 10 rows, different numbers of components and the first classification variable).
show(res$c.pred[31:40, 1:3, 1])
## Comp 1 Comp 2 Comp 3
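The structure behind this indexing can be sketched with a plain base R array filled with made-up values (a hypothetical stand-in for real SIMCA output, not produced by mdatools):

```r
# Sketch: c.pred is a 3D array, nObjects x nComponents x nClasses,
# holding +1 (accepted) and -1 (rejected) for each object
set.seed(42)
n.obj <- 40; n.comp <- 3
c.pred <- array(sample(c(-1, 1), n.obj * n.comp, replace = TRUE),
                dim = c(n.obj, n.comp, 1),
                dimnames = list(NULL, paste("Comp", 1:n.comp), "versicolor"))
# rows 31:40, all three components, first (and only) classification variable
subset <- c.pred[31:40, 1:3, 1]
print(subset)
```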
diff --git a/docs/models-and-results-1.html b/docs/models-and-results-1.html
index a79485d..1305d97 100644
--- a/docs/models-and-results-1.html
+++ b/docs/models-and-results-1.html
@@ -232,6 +232,24 @@ Model calibration
## IQ
## -0.088658626
You can see a summary for the regression coefficients object by calling the function summary() for the object m$coeffs, as shown below. By default it shows only the estimated regression coefficients for the selected y-variable and number of components. However, if you employ Jack-Knifing (see section Variable selection below), the object with regression coefficients will also contain some statistics, including the standard error, the p-value (for a test of whether the coefficient is equal to zero in the population) and the confidence interval. All statistics in this case will be shown automatically with summary().
summary(m$coeffs)
## Regression coefficients for Shoesize
+## ------------------------------------
+## Height Weight Hairleng Age Income
+## 0.176077659 0.175803980 -0.164627444 0.046606027 0.059998121
+## Beer Wine Sex Swim Region
+## 0.133136867 0.002751573 -0.164627444 0.173739533 -0.031357608
+## IQ
+## -0.003353428
+summary(m$coeffs, ncomp = 3)
## Regression coefficients for Shoesize
+## ------------------------------------
+## Height Weight Hairleng Age Income
+## 0.210411676 0.197646483 -0.138824482 0.026613035 -0.000590693
+## Beer Wine Sex Swim Region
+## 0.148917913 0.138138095 -0.138824482 0.223962000 0.010392542
+## IQ
+## -0.088658626
You can also get the corrected coefficients, which can be applied directly to the raw data (without centering and standardization), by using the method getRegcoeffs():
show(getRegcoeffs(m, ncomp = 3))
## Shoesize
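The back-transformation behind this kind of correction can be sketched in base R: coefficients estimated on autoscaled data are rescaled by sd(y)/sd(x), and the intercept is recovered from the column means. This is a sketch of the standard formula using lm() on simulated data, not the mdatools source code.

```r
# Sketch: un-standardize regression coefficients so they apply to raw data
set.seed(1)
X <- matrix(rnorm(30 * 3), 30, 3)
y <- as.numeric(X %*% c(1, -2, 0.5) + rnorm(30))
Xs <- scale(X)                # autoscaled predictors
ys <- as.numeric(scale(y))    # autoscaled response
b.std <- coef(lm(ys ~ Xs - 1))                # coefficients for autoscaled data
b.raw <- b.std * sd(y) / apply(X, 2, sd)      # rescale the slopes
b0    <- mean(y) - sum(b.raw * colMeans(X))   # recover the intercept
# both forms give identical predictions
print(round(c(Intercept = b0, b.raw), 4))
```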
@@ -321,7 +339,7 @@ Model validation
par(mfrow = c(1, 2))
plotRMSE(m1)
plotRMSE(m2)
Parameter cv
has the same format as for PCA. If it is a number, it will be used as the number of segments for random cross-validation, e.g. for cv = 2
cross-validation with two segments will be carried out. For full cross-validation use cv = 1
like we did in the example above. For more advanced options you can provide a list with the name of the cross-validation method, the number of segments and the number of iterations, e.g. cv = list('rand', 4, 4)
for running random cross-validation with four segments and four repetitions, or cv = list('ven', 8)
for a systematic split into eight segments (venetian blinds).
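The systematic split can be sketched in base R: objects are assigned to segments in a cyclic pattern. This is an illustration of the general idea, not the mdatools implementation.

```r
# Sketch: "venetian blinds" split of 24 objects into 8 segments
n <- 24
seg <- rep(1:8, length.out = n)      # 1,2,...,8,1,2,... pattern
segments <- split(seq_len(n), seg)   # row indices belonging to each segment
print(segments)
```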
The method summary()
for a model shows performance statistics calculated using the optimal number of components for each of the results.
summary(m1)
diff --git a/docs/multiclass-classification.html b/docs/multiclass-classification.html
index 3acc771..22d838f 100644
--- a/docs/multiclass-classification.html
+++ b/docs/multiclass-classification.html
@@ -231,7 +231,7 @@ Multiclass classification
Now we apply the combined model to the test set and look at the predictions.
res = predict(m, X.t, c.t)
plotPredictions(res)
In this case the predictions are shown only for the number of components each model found optimal. The class names along the y-axis correspond to the individual models. Similarly, we can show the predicted values.
show(res$c.pred[20:30, 1, 1:3])
## setosa virginica versicolor
@@ -250,17 +250,17 @@ Multiclass classification
par(mfrow = c(1, 2))
plotModelDistance(m, 1)
plotModelDistance(m, 2)
The second plot shows the discrimination power, mentioned at the beginning of the section.
par(mfrow = c(1, 2))
plotDiscriminationPower(m, c(1, 3), show.labels = T)
plotDiscriminationPower(m, c(2, 3), show.labels = T)
And, finally, a Cooman’s plot showing the orthogonal distances from objects to two selected classes/models.
par(mfrow = c(1, 2))
plotCooman(m, c(1, 3), show.labels = T)
plotCooman(m, c(2, 3), show.labels = T)
diff --git a/docs/plotting-methods-1.html b/docs/plotting-methods-1.html
index c75bbea..9cce77b 100644
--- a/docs/plotting-methods-1.html
+++ b/docs/plotting-methods-1.html
@@ -259,7 +259,7 @@ Plotting methods
par(mfrow = c(1, 2))
plotPredictions(m1)
plotPredictions(m1, ncomp = 1)
The plots for variables are available only for a model object and include:
@@ -301,7 +301,7 @@ Plotting methods
And, of course, both model and result objects have the method plot(),
which gives an overview.
plot(m1)
Excluding rows and columns
From v. 0.8.0, the PCA implementation, as well as any other method in mdatools,
can exclude rows and columns from calculations. The implementation works similarly to what was described for PCA. For example, it can be useful if you have some candidates for outliers, or you do variable selection and do not want to remove rows and columns physically from the data matrix. In this case you can just specify two additional parameters, exclcols
and exclrows,
using either numbers or names of the rows/columns to be excluded. You can also specify a vector with logical values (all TRUE
s will be excluded).
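The three equivalent ways of pointing at the same rows can be sketched with plain base R vectors (hypothetical names, not tied to any dataset):

```r
# Sketch: indices, names and a logical vector selecting the same rows
all.names <- c("Lars", "Peter", "Rasmus", "Mette")
excl.idx  <- c(1, 4)                        # by numbers
excl.name <- c("Lars", "Mette")             # by names
excl.log  <- c(TRUE, FALSE, FALSE, TRUE)    # all TRUEs will be excluded
# all three point at the same rows:
print(which(excl.log))
print(match(excl.name, all.names))
```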
diff --git a/docs/search_index.json b/docs/search_index.json
index 7192189..da929d8 100644
--- a/docs/search_index.json
+++ b/docs/search_index.json
@@ -14,7 +14,7 @@
["models-and-results.html", "Models and results", " Models and results In mdatools, any method for data analysis, such as PCA, PLS regression, SIMCA classification and so on, can create two types of objects — a model and a result. Every time you build a model you get a model object. Every time you apply the model to a dataset you get a result object. Thus for PCA, the objects have classes pca and pcares correspondingly. Each object includes a list with variables (e.g. loadings for a model, scores and explained variance for a result) and provides a number of methods for investigation. Model calibration Let’s see how this works using a simple example — the People data. We used this data when we were playing with plots; it consists of 32 objects (persons from Scandinavian and Mediterranean countries, 16 male and 16 female) and 12 variables (height, weight, shoesize, annual income, beer and wine consumption and so on). More information about the data can be found using ?people. We will first load the data matrix and split it into two subsets as follows: library(mdatools) data(people) idx = seq(4, 32, 4) X.c = people[-idx, ] X.t = people[idx, ] So X.c is the calibration subset we are going to use for creating a PCA model and X.t is a subset we will apply the calibrated model to. Now let’s calibrate the model and show some information about the model object: m = pca(X.c, 7, scale = T, info = "People PCA model") m = selectCompNum(m, 5) Here pca is a function that builds (calibrates) a PCA model and returns the model object. The function selectCompNum allows you to select an “optimal” number of components for the model. In our case we calibrate the model with 7 principal components (the second argument of the function pca()); however, e.g. after investigation of the explained variance, we found out that 5 components is optimal. In this case we have two choices. 
Either recalibrate the model using 5 components, or use the model that is calibrated already but “tell” the model that 5 components is the optimal number. In this case the model will keep all results calculated for 7 components but will use the optimal number of components when necessary, for example when showing the residuals plot for the model or when the PCA model is used in SIMCA classification. Finally, the function print prints the model object info: print(m) ## ## PCA model (class pca) ## ## ## Call: ## pca(x = X.c, ncomp = 7, scale = T, info = "People PCA model") ## ## Major fields: ## $loadings - matrix with loadings ## $eigenvals - eigenvalues for components ## $ncomp - number of calculated components ## $ncomp.selected - number of selected components ## $center - values for centering data ## $scale - values for scaling data ## $cv - number of segments for cross-validation ## $alpha - significance level for Q residuals ## $calres - results (scores, etc) for calibration set As you can see there are no scores, explained variance values, residuals and so on. This is because they are not part of a PCA model; they are results of applying the model to the calibration set. But loadings, eigenvalues, numbers of calculated and selected principal components, vectors for centering and scaling the data, number of segments for cross-validation (if used) and significance levels are the model fields: m$loadings[1:4, 1:4] ## Comp 1 Comp 2 Comp 3 Comp 4 ## Height -0.3792846 0.08004057 -0.06676611 0.04512380 ## Weight -0.3817929 0.08533809 -0.08527883 -0.04051629 ## Hairleng 0.3513874 -0.22676635 -0.02273504 0.01575716 ## Shoesize -0.3776985 0.12503739 -0.02117369 0.09929010 One can also notice that the model object has a particular field — calres, which is in fact a PCA result object containing results of applying the model to the calibration set. 
If we look at the object description we will get the following: print(m$calres) ## ## Results for PCA decomposition (class pcares) ## ## Major fields: ## $scores - matrix with score values ## $T2 - matrix with T2 distances ## $Q - matrix with Q residuals ## $ncomp.selected - selected number of components ## $expvar - explained variance for each component ## $cumexpvar - cumulative explained variance And if we want to look at scores, here is the way: m$calres$scores[1:4, 1:4] ## Comp 1 Comp 2 Comp 3 Comp 4 ## Lars -5.108742 -1.2714943 1.0765871 1.08910438 ## Peter -3.021811 -0.3163758 -0.2958259 -1.36053121 ## Rasmus -2.887335 -0.4428721 0.1231706 -1.15070563 ## Mette 1.116457 -1.3716444 -1.6344512 -0.03803356 Both model and result objects also have related functions (methods), first of all for visualizing various values (e.g. scores plot, loadings plot, etc.). Some of the functions will be discussed later in this chapter, a full list can be found in help for a proper method. The result object is also created every time you apply a model to a new data. Like in many built-in R methods, method predict() is used in this case. The first argument of the method is always a model object. Here is a PCA example (assuming we have already built the model): res = predict(m, X.t) print(res) ## ## Results for PCA decomposition (class pcares) ## ## Major fields: ## $scores - matrix with score values ## $T2 - matrix with T2 distances ## $Q - matrix with Q residuals ## $ncomp.selected - selected number of components ## $expvar - explained variance for each component ## $cumexpvar - cumulative explained variance Model validation Any model can be validated with cross-validation or/and test set validation. The validation results are, of course, represented by result objects, which are fields of a model object similar to calres, but with names cvres and testres correspondingly. 
Here is how to build a PCA model with full cross-validation and test set validation (we will use X.t as test data) at the same time: m = pca(X.c, 7, scale = T, cv = 1, x.test = X.t, info = "PCA model") m = selectCompNum(m, 5) Parameter cv specifies options for cross-validation. If a numeric value is provided, it will be used as the number of segments for random cross-validation, e.g. for cv = 2 cross-validation with two segments will be used. For full cross-validation use cv = 1 like we did in the example above (this is perhaps a bit misleading, but I keep this option for compatibility). For more advanced options you can provide a list with the name of the cross-validation method, the number of segments and the number of iterations, e.g. cv = list('rand', 4, 4) for running random cross-validation with four segments and four repetitions. And here is the model object info: print(m) ## ## PCA model (class pca) ## ## ## Call: ## pca(x = X.c, ncomp = 7, scale = T, cv = 1, x.test = X.t, info = "PCA model") ## ## Major fields: ## $loadings - matrix with loadings ## $eigenvals - eigenvalues for components ## $ncomp - number of calculated components ## $ncomp.selected - number of selected components ## $center - values for centering data ## $scale - values for scaling data ## $cv - number of segments for cross-validation ## $alpha - significance level for Q residuals ## $calres - results (scores, etc) for calibration set ## $cvres - results for cross-validation ## $testres - results for test set As you can see, we have all three types of results now — calibration (calres), cross-validation (cvres) and test set validation (testres). 
Let us compare, for example, the explained variance values for the results: var = data.frame(cal = m$calres$expvar, cv = m$cvres$expvar, test = m$testres$expvar) show(round(var, 1)) ## cal cv test ## Comp 1 54.2 43.1 44.8 ## Comp 2 20.3 21.2 17.2 ## Comp 3 13.1 14.7 17.0 ## Comp 4 7.9 13.0 8.0 ## Comp 5 2.3 3.6 4.4 ## Comp 6 1.1 2.0 2.4 ## Comp 7 0.5 0.8 0.7 Every model and every result has a method summary(), which shows some statistics for evaluation of a model performance. Here are some examples. summary(m) ## ## PCA model (class pca) summary ## ## Info: ## PCA model ## ## Eigvals Expvar Cumexpvar ## Comp 1 6.509 54.24 54.24 ## Comp 2 2.434 20.28 74.52 ## Comp 3 1.572 13.10 87.62 ## Comp 4 0.946 7.88 95.51 ## Comp 5 0.272 2.27 97.77 ## Comp 6 0.137 1.14 98.92 ## Comp 7 0.058 0.48 99.39 summary(m$calres) ## ## Summary for PCA results ## ## Selected components: 5 ## ## Expvar Cumexpvar ## Comp 1 54.24 54.24 ## Comp 2 20.28 74.52 ## Comp 3 13.10 87.62 ## Comp 4 7.88 95.51 ## Comp 5 2.27 97.77 ## Comp 6 1.14 98.92 ## Comp 7 0.48 99.39 The same methodology is used for any other method, e.g. PLS or SIMCA. In the next section we will look at how to use plotting functions for models and results. "],
["plotting-methods.html", "Plotting methods", " Plotting methods First of all, you can use the methods mdaplot() and mdaplotg() (or any others, e.g. ggplot2) for easy visualisation of the results, as they are all available as matrices with proper names, attributes, etc. In the example below we create several scores and loadings plots. Here I assume that the last model you have created was the one with test set and cross-validation. par(mfrow = c(1, 2)) mdaplot(m$calres$scores, type = 'p', show.labels = T, show.lines = c(0, 0)) mdaplot(m$loadings, type = 'p', show.labels = T, show.lines = c(0, 0)) To simplify this routine, every model and result class also has a number of functions for visualization. Thus for PCA the function list includes scores and loadings plots, explained variance and cumulative explained variance plots, T2 vs. Q residuals and many others. A function that does the same for different models and results always has the same name. For example, plotPredictions will show a predicted vs. measured plot for a PLS model and PLS result, an MLR model and MLR result, a PCR model and PCR result and so on. The first argument must always be either a model or a result object. The major difference between plots for a model and plots for a result is the following. A plot for a result always shows one set of data objects — one set of points, lines or bars. For example, predicted vs. measured values for a calibration set, or scores values for a test set and so on. For such plots the method mdaplot() is used and you can provide any arguments available for this method (e.g. color group scores for calibration results). A plot for a model in most cases shows several sets of data objects, e.g. predicted values for calibration and validation. In this case, a corresponding method uses mdaplotg() and therefore you can adjust the plot using the arguments described for that method. 
Here are some examples for results: par(mfrow = c(2, 2)) plotScores(m$calres, show.labels = T) plotScores(m$calres, c(1, 3), pch = 18, cgroup = X.c[, 'Income'], show.labels = T, labels = 'indices') plotResiduals(m$calres, show.labels = T, cgroup = X.c[, 'Weight']) plotVariance(m$calres, type = 'h', show.labels = T, labels = 'values') The color grouping option is not available for the group (model) plots as colors are used there to underline the groups. Now let’s look at similar plots (plus loadings) for a model. par(mfrow = c(2, 2)) plotScores(m, c(1, 3), show.labels = T) plotLoadings(m, c(1, 3), show.labels = T) plotResiduals(m, col = c('red', 'green', 'blue')) plotVariance(m, type = 'h', show.labels = T, labels = 'values') Method plot() shows the main four PCA plots as a model (or results) overview. plot(m, show.labels = T) You do not have to care about labels, names, legend and so on; however, if necessary you can always change almost anything. See the full list of methods available for PCA with ?pca and ?pcares. Support for images As described before, images can be used as a source of data for any method. In this case the results related to objects/pixels will inherit all necessary attributes and can be shown as images as well. In the example below we make a PCA model for the image data from the package and show scores and residuals. data(pellets) X = mda.im2data(pellets) m = pca(X) par(mfrow = c(2, 2)) imshow(m$calres$scores) imshow(m$calres$Q) imshow(m$calres$scores, 2) imshow(m$calres$Q, 2) Manual x-values for loading line plot As discussed in the previous chapter, you can specify a special attribute, 'xaxis.values', for a dataset, which will be used as manual x-values in bar and line plots. When we create any model and/or results, the most important attributes, including this one, are inherited. For example, when you make a loading line plot it will be shown using the attribute values. 
data(simdata) X = simdata$spectra.c attr(X, 'xaxis.name') = 'Wavelength, nm' attr(X, 'xaxis.values') = simdata$wavelength m = pca(X, 3) plotLoadings(m, 1:3, type = 'l') Excluding rows and columns From v. 0.8.0, the PCA implementation, as well as any other method in mdatools, can exclude rows and columns from calculations. For example, it can be useful if you have some candidates for outliers, or you do variable selection and do not want to remove rows and columns physically from the data matrix. In this case you can just specify two additional parameters, exclcols and exclrows, using either numbers or names of the rows/columns to be excluded. You can also specify a vector with logical values (all TRUEs will be excluded). The excluded rows are not used for creating a model and calculation of the model’s and results’ performance (e.g. explained variance). However, main results (for PCA — scores and residuals) are calculated for these rows as well and are set hidden, so you will not see them on plots. You can always e.g. show scores for excluded objects by using show.excluded = TRUE. It is implemented via attributes “known” to the plotting methods from mdatools, so if you use e.g. ggplot2 you will see all points. The excluded columns are not used for any calculations either; the corresponding results (e.g. loadings or regression coefficients) will have zero values for such columns and will also be hidden on plots. Here is a simple example. 
data(people) m = pca(people, 5, scale = T, exclrows = c('Lars', 'Federico'), exclcols = 'Income') par(mfrow = c(2, 2)) plotScores(m, show.labels = T) plotScores(m, show.excluded = T, show.labels = T) plotResiduals(m, show.excluded = T, show.labels = T) plotLoadings(m, show.excluded = T, show.labels = T) # show matrix with loadings (look at row Income and attribute "exclrows") show(m$loadings) ## Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 ## Height -0.386393327 0.10697019 -0.004829174 0.12693029 -0.13128331 ## Weight -0.391013398 0.07820097 0.051916032 0.04049593 -0.14757465 ## Hairleng 0.350435073 -0.11623295 -0.103852349 -0.04969503 -0.73669997 ## Shoesize -0.385424793 0.13805817 -0.069172117 0.01049098 -0.17075488 ## Age -0.103466285 0.18964288 -0.337243182 -0.89254403 -0.02998028 ## Income 0.000000000 0.00000000 0.000000000 0.00000000 0.00000000 ## Beer -0.317356319 -0.38259695 0.044338872 -0.03908064 -0.21419831 ## Wine 0.140711271 0.57861817 -0.059833970 0.12347379 -0.41488773 ## Sex 0.364537185 -0.23838610 0.010818891 0.04025631 -0.18263577 ## Swim -0.377470722 0.04330411 0.008151288 0.18149268 -0.30163601 ## Region 0.140581701 0.60435183 0.040969200 0.15147464 0.17857614 ## IQ 0.009849911 0.09372132 0.927669306 -0.32978247 -0.11815762 ## attr(,"exclrows") ## [1] 6 ## attr(,"name") ## [1] "Loadings" ## attr(,"xaxis.name") ## [1] "Components" Such behavior will help to exclude and include rows and columns interactively, when a GUI add-in for mdatools is available."],
["partial-least-squares-regression.html", "Partial least squares regression", " Partial least squares regression Partial least squares regression is a linear regression method, which uses principles similar to PCA: data is decomposed using latent variables. Because in this case we have two datasets, predictors (\\(X\\)) and responses (\\(Y\\)), we do the decomposition for both, computing scores, loadings and residuals: \\(X = TP^T + E_x\\), \\(Y = UQ^T + E_y\\). In addition to that, the orientation of the latent variables in PLS is selected to maximize the covariance between the X-scores, \\(T\\), and the Y-scores, \\(U\\). This approach makes it possible to work with datasets where more traditional Multiple Linear Regression fails — when the number of variables exceeds the number of observations and when the X-variables are mutually correlated. But in the end a PLS model is a linear model, where the response value is a linear combination of the predictors, so the main outcome is a vector with regression coefficients. There are two main algorithms for PLS, NIPALS and SIMPLS; in mdatools only the latter is implemented. PLS model and PLS result objects have a lot of components and performance statistics, which can be visualised via plots. Besides that, the implemented pls() method calculates selectivity ratio and VIP scores, which can be used for selection of the most important variables. We will discuss most of the methods in this chapter and you can get the full list using ?pls. "],
-["models-and-results-1.html", "Models and results", " Models and results Like we discussed for PCA, mdatools creates two types of objects — a model and a result. Every time you build a PLS model you get a model object. Every time you apply the model to a dataset you get a result object. For PLS, the objects have classes pls and plsres correspondingly. Model calibration Let’s use the same People data and create a PLS model for prediction of Shoesize (column number four) using the other 11 variables as predictors. As usual, we start with preparing datasets (we will also split the data into calibration and test subsets): library(mdatools) data(people) idx = seq(4, 32, 4) X.c = people[-idx, -4] y.c = people[-idx, 4, drop = F] X.t = people[idx, -4] y.t = people[idx, 4, drop = F] So X.c and y.c are predictors and response values for the calibration subset. Now let’s calibrate the model and show some information about the model object: m = pls(X.c, y.c, 7, scale = T, info = "Shoesize prediction model") ## Warning in selectCompNum.pls(model): No validation results were found! m = selectCompNum(m, 3) As you can see, the procedure is very similar to PCA; here we use 7 latent variables and select the first 3 as the optimal number. Here is the info for the model object: print(m) ## ## PLS model (class pls) ## ## Call: ## pls.cal(x = x, y = y, ncomp = ncomp, center = center, scale = scale, ## method = method, cv = cv, alpha = alpha, coeffs.ci = coeffs.ci, ## coeffs.alpha = coeffs.alpha, info = info, light = light, ## exclcols = exclcols, exclrows = exclrows, ncomp.selcrit = ncomp.selcrit) ## ## Major fields: ## $ncomp - number of calculated components ## $ncomp.selected - number of selected components ## $coeffs - object (regcoeffs) with regression coefficients ## $xloadings - vector with x loadings ## $yloadings - vector with y loadings ## $weights - vector with weights ## $calres - results for calibration set ## ## Try summary(model) and plot(model) to see the model performance. 
As expected, we see loadings for predictors and responses, a matrix with weights, and a special object (regcoeffs) for regression coefficients. The values of the regression coefficients are available in m$coeffs$values; it is an array with dimension nVariables x nComponents x nPredictors. The reason to use an object instead of just an array is mainly to be able to get and plot regression coefficients for different methods. Besides that, it is possible to calculate confidence intervals and other statistics for the coefficients using the Jack-Knife method (will be shown later), which produces extra entities. The regression coefficients can be shown as a plot using either the function plotRegcoeffs() for the PLS model object or the function plot() for the object with regression coefficients. You need to specify for which predictor (if you have more than one y-variable) and for which number of components you want to see the coefficients. By default it shows values for the optimal number of components and the first y-variable, as shown in the example below. par(mfrow = c(2, 2)) plotRegcoeffs(m) plotRegcoeffs(m, ncomp = 2) plot(m$coeffs, ncomp = 3, type = 'h', show.labels = T) plot(m$coeffs, ncomp = 2) The model keeps regression coefficients calculated for centered and standardized data, without intercept, etc. Here are the values for three PLS components. show(m$coeffs$values[, 3, 1]) ## Height Weight Hairleng Age Income ## 0.210411676 0.197646483 -0.138824482 0.026613035 -0.000590693 ## Beer Wine Sex Swim Region ## 0.148917913 0.138138095 -0.138824482 0.223962000 0.010392542 ## IQ ## -0.088658626 You can see a summary for the regression coefficients object by calling the function summary() for the object m$coeffs, as shown below. By default it shows only the estimated regression coefficients for the selected y-variable and number of components. 
However, if you employ Jack-Knifing (see section Variable selection below), the object with regression coefficients will also contain some statistics, including the standard error, the p-value (for a test of whether the coefficient is equal to zero in the population) and the confidence interval. All statistics in this case will be shown automatically with summary(). You can also get the corrected coefficients, which can be applied directly to the raw data (without centering and standardization), by using the method getRegcoeffs(): show(getRegcoeffs(m, ncomp = 3)) ## Shoesize ## Intercept 1.251537e+01 ## Height 8.105287e-02 ## Weight 5.110732e-02 ## Hairleng -5.375404e-01 ## Age 1.147785e-02 ## Income -2.580586e-07 ## Beer 6.521476e-03 ## Wine 1.253340e-02 ## Sex -5.375404e-01 ## Swim 1.164947e-01 ## Region 4.024083e-02 ## IQ -2.742712e-02 ## attr(,"name") ## [1] "Regression coefficients for Shoesize" It also returns all statistics in case Jack-Knifing was applied. Similar to PCA, the model object may contain three fields for results obtained using the calibration set (calres), cross-validation (cvres) and test set validation (testres). All three have class plsres; here is how calres looks: print(m$calres) ## ## PLS results (class plsres) ## ## Call: ## plsres(y.pred = yp, y.ref = y.ref, ncomp.selected = object$ncomp.selected, ## xdecomp = xdecomp, ydecomp = ydecomp) ## ## Major fields: ## $ncomp.selected - number of selected components ## $yp - array with predicted y values ## $y - matrix with reference y values ## $rmse - root mean squared error ## $r2 - coefficient of determination ## $slope - slope for predicted vs. measured values ## $bias - bias for prediction vs. measured values ## $ydecomp - decomposition of y values (ldecomp object) ## $xdecomp - decomposition of x values (ldecomp object) The xdecomp and ydecomp are objects similar to pcares; they contain scores, residuals and variances for decomposition of X and Y correspondingly. 
print(m$calres$xdecomp) ## ## Results of data decomposition (class ldecomp) ## ## Major fields: ## $scores - matrix with score values ## $T2 - matrix with T2 distances ## $Q - matrix with Q residuals ## $ncomp.selected - selected number of components ## $expvar - explained variance for each component ## $cumexpvar - cumulative explained variance Other fields are mostly various performance statistics, including slope, coefficient of determination (R2), bias, and root mean squared error (RMSE). Besides that, the results also include reference y-values and array with predicted y-values. The array has dimension nObjects x nComponents x nResponses. PLS predictions for a new set can be obtained using method predict: res = predict(m, X.t, y.t) print(res) ## ## PLS results (class plsres) ## ## Call: ## plsres(y.pred = yp, y.ref = y.ref, ncomp.selected = object$ncomp.selected, ## xdecomp = xdecomp, ydecomp = ydecomp) ## ## Major fields: ## $ncomp.selected - number of selected components ## $yp - array with predicted y values ## $y - matrix with reference y values ## $rmse - root mean squared error ## $r2 - coefficient of determination ## $slope - slope for predicted vs. measured values ## $bias - bias for prediction vs. measured values ## $ydecomp - decomposition of y values (ldecomp object) ## $xdecomp - decomposition of x values (ldecomp object) Model validation Validation is implemented similar to PCA, the only difference is that you need to provide two datasets for a test set — one for predictors (x.test) and one for response (y.test) values. Cross-validation is very important for PLS as it helps to find optimal number of PLS components (so test set performance is more fair as in this case you do not use test set for optimization). Therefore, it is always recommended to use cross-validation. 
You probably have noticed a small warning we got when created the first PLS model in this chapter: m = pls(X.c, y.c, 7, scale = T, info = "Shoesize prediction model") ## Warning in selectCompNum.pls(model): No validation results were found! When you create a model, it tries to select optimal number of components automatically (which, of course, you can always change later). To do that, the method uses RMSE values, calculated for different number of components and cross-validation predictions. So, if we do not use cross-validation, it warns use about this. There are two different ways/criteria for automatic selection. One is using first local minimum on the RMSE plot and second is so called Wold criterion, based on a ratio between PRESS values for current and next component. You can select which criterion to use by specifying parameter ncomp.selcrit (either 'min' or 'wold') as it is shown below. m1 = pls(X.c, y.c, 7, scale = T, cv = 1, ncomp.selcrit = 'min') show(m1$ncomp.selected) ## [1] 5 m2 = pls(X.c, y.c, 7, scale = T, cv = 1, ncomp.selcrit = 'wold') show(m2$ncomp.selected) ## [1] 4 And here are the RMSE plots (they are identical of course): par(mfrow = c(1, 2)) plotRMSE(m1) plotRMSE(m2) Parameter cv has the same format as for PCA. If it is a number, it will be used as number of segments for random cross-validation, e.g. if cv = 2 cross-validation with two segments will be carried out. For full cross-validation use cv = 1 like we did in the example above. For more advanced option you can provide a list with name of cross-validation method, number of segments and number of iterations, e.g. cv = list('rand', 4, 4) for running random cross-validation with four segments and four repetitions or cv = list('ven', 8) for systematic split into eight segments (venetian blinds). Method summary() for model shows performance statistics calculated using optimal number of components for each of the results. 
summary(m1) ## ## PLS model (class pls) summary ## ## Performance and validation: ## Number of selected components: 5 ## X cumexpvar Y cumexpvar RMSE Slope Bias RPD ## Cal 97.64 98.19 0.521 0.98 0e+00 7.59 ## CV 92.90 96.22 0.753 0.98 -2e-04 5.26 If you want more details run summary() for one of the result objects. summary(m1$calres) ## ## PLS regression results (class plsres) summary ## ## Response variable Shoesize: ## X expvar X cumexpvar Y expvar Y cumexpvar RMSE Slope Bias RPD ## Comp 1 50.505 50.505 93.779 93.779 0.966 0.938 0 4.1 ## Comp 2 20.979 71.484 2.926 96.705 0.703 0.967 0 5.6 ## Comp 3 8.667 80.151 0.917 97.622 0.597 0.976 0 6.6 ## Comp 4 5.847 85.998 0.479 98.101 0.534 0.981 0 7.4 ## Comp 5 11.642 97.640 0.088 98.189 0.521 0.982 0 7.6 ## Comp 6 0.495 98.135 0.347 98.536 0.468 0.985 0 8.4 ## Comp 7 0.442 98.577 0.202 98.738 0.435 0.987 0 9.1 There is no column for R2 as Y cumexpvar values are the same. "],
+["models-and-results-1.html", "Models and results", " Models and results Like we discussed in PCA, mdatools creates two types of objects — a model and a result. Every time you build a PLS model you get a model object. Every time you apply the model to a dataset you get a result object. For PLS, the objects have classes pls and plsres correspondingly. Model calibration Let’s use the same People data and create a PLS-model for prediction of Shoesize (column number four) using the other 11 variables as predictors. As usual, we start with preparing the datasets (we will also split the data into calibration and test subsets): library(mdatools) data(people) idx = seq(4, 32, 4) X.c = people[-idx, -4] y.c = people[-idx, 4, drop = F] X.t = people[idx, -4] y.t = people[idx, 4, drop = F] So X.c and y.c are the predictor and response values for the calibration subset. Now let’s calibrate the model and show information about the model object: m = pls(X.c, y.c, 7, scale = T, info = "Shoesize prediction model") ## Warning in selectCompNum.pls(model): No validation results were found! m = selectCompNum(m, 3) As you can see, the procedure is very similar to PCA; here we use 7 latent variables and select the first 3 as the optimal number. Here is the info for the model object: print(m) ## ## PLS model (class pls) ## ## Call: ## pls.cal(x = x, y = y, ncomp = ncomp, center = center, scale = scale, ## method = method, cv = cv, alpha = alpha, coeffs.ci = coeffs.ci, ## coeffs.alpha = coeffs.alpha, info = info, light = light, ## exclcols = exclcols, exclrows = exclrows, ncomp.selcrit = ncomp.selcrit) ## ## Major fields: ## $ncomp - number of calculated components ## $ncomp.selected - number of selected components ## $coeffs - object (regcoeffs) with regression coefficients ## $xloadings - vector with x loadings ## $yloadings - vector with y loadings ## $weights - vector with weights ## $calres - results for calibration set ## ## Try summary(model) and plot(model) to see the model performance. 
As expected, we see loadings for predictors and responses, a matrix with weights, and a special object (regcoeffs) for regression coefficients. The values of the regression coefficients are available in m$coeffs$values; it is an array with dimension nVariables x nComponents x nResponses. The reason to use an object instead of just an array is mainly to be able to get and plot regression coefficients for different methods. Besides that, it is possible to calculate confidence intervals and other statistics for the coefficients using the Jack-Knife method (shown later), which produces extra entities. The regression coefficients can be shown as a plot using either function plotRegcoeffs() for the PLS model object or function plot() for the object with regression coefficients. You need to specify for which response (if you have more than one y-variable) and which number of components you want to see the coefficients. By default it shows values for the optimal number of components and the first y-variable, as in the example below. par(mfrow = c(2, 2)) plotRegcoeffs(m) plotRegcoeffs(m, ncomp = 2) plot(m$coeffs, ncomp = 3, type = 'h', show.labels = T) plot(m$coeffs, ncomp = 2) The model keeps the regression coefficients calculated for centered and standardized data, without intercept. Here are the values for three PLS components. show(m$coeffs$values[, 3, 1]) ## Height Weight Hairleng Age Income ## 0.210411676 0.197646483 -0.138824482 0.026613035 -0.000590693 ## Beer Wine Sex Swim Region ## 0.148917913 0.138138095 -0.138824482 0.223962000 0.010392542 ## IQ ## -0.088658626 You can see a summary for the regression coefficients object by calling function summary() for the object m$coeffs as shown below. By default it shows only the estimated regression coefficients for the selected y-variable and number of components. 
However, if you employ Jack-Knifing (see section Variable selection below), the object with regression coefficients will also contain some statistics, including the standard error, p-value (for testing if the coefficient is equal to zero in the population) and confidence interval. All statistics in this case will be shown automatically with summary(). summary(m$coeffs) ## Regression coefficients for Shoesize ## ------------------------------------ ## Height Weight Hairleng Age Income ## 0.176077659 0.175803980 -0.164627444 0.046606027 0.059998121 ## Beer Wine Sex Swim Region ## 0.133136867 0.002751573 -0.164627444 0.173739533 -0.031357608 ## IQ ## -0.003353428 summary(m$coeffs, ncomp = 3) ## Regression coefficients for Shoesize ## ------------------------------------ ## Height Weight Hairleng Age Income ## 0.210411676 0.197646483 -0.138824482 0.026613035 -0.000590693 ## Beer Wine Sex Swim Region ## 0.148917913 0.138138095 -0.138824482 0.223962000 0.010392542 ## IQ ## -0.088658626 You can also get the corrected coefficients, which can be applied directly to the raw data (without centering and standardization), by using method getRegcoeffs(): show(getRegcoeffs(m, ncomp = 3)) ## Shoesize ## Intercept 1.251537e+01 ## Height 8.105287e-02 ## Weight 5.110732e-02 ## Hairleng -5.375404e-01 ## Age 1.147785e-02 ## Income -2.580586e-07 ## Beer 6.521476e-03 ## Wine 1.253340e-02 ## Sex -5.375404e-01 ## Swim 1.164947e-01 ## Region 4.024083e-02 ## IQ -2.742712e-02 ## attr(,"name") ## [1] "Regression coefficients for Shoesize" It also returns all the statistics if Jack-Knifing was applied. Similar to PCA, the model object may contain three fields for results obtained using the calibration set (calres), cross-validation (cvres) and test set validation (testres). 
All three have class plsres; here is how calres looks: print(m$calres) ## ## PLS results (class plsres) ## ## Call: ## plsres(y.pred = yp, y.ref = y.ref, ncomp.selected = object$ncomp.selected, ## xdecomp = xdecomp, ydecomp = ydecomp) ## ## Major fields: ## $ncomp.selected - number of selected components ## $yp - array with predicted y values ## $y - matrix with reference y values ## $rmse - root mean squared error ## $r2 - coefficient of determination ## $slope - slope for predicted vs. measured values ## $bias - bias for prediction vs. measured values ## $ydecomp - decomposition of y values (ldecomp object) ## $xdecomp - decomposition of x values (ldecomp object) The xdecomp and ydecomp are objects similar to pcares; they contain scores, residuals and variances for the decomposition of X and Y respectively. print(m$calres$xdecomp) ## ## Results of data decomposition (class ldecomp) ## ## Major fields: ## $scores - matrix with score values ## $T2 - matrix with T2 distances ## $Q - matrix with Q residuals ## $ncomp.selected - selected number of components ## $expvar - explained variance for each component ## $cumexpvar - cumulative explained variance The other fields are mostly various performance statistics, including slope, coefficient of determination (R2), bias, and root mean squared error (RMSE). Besides that, the results also include the reference y-values and an array with predicted y-values. The array has dimension nObjects x nComponents x nResponses. 
PLS predictions for a new set can be obtained using method predict: res = predict(m, X.t, y.t) print(res) ## ## PLS results (class plsres) ## ## Call: ## plsres(y.pred = yp, y.ref = y.ref, ncomp.selected = object$ncomp.selected, ## xdecomp = xdecomp, ydecomp = ydecomp) ## ## Major fields: ## $ncomp.selected - number of selected components ## $yp - array with predicted y values ## $y - matrix with reference y values ## $rmse - root mean squared error ## $r2 - coefficient of determination ## $slope - slope for predicted vs. measured values ## $bias - bias for prediction vs. measured values ## $ydecomp - decomposition of y values (ldecomp object) ## $xdecomp - decomposition of x values (ldecomp object) Model validation Validation is implemented similarly to PCA; the only difference is that you need to provide two datasets for a test set — one with predictor values (x.test) and one with response values (y.test). Cross-validation is very important for PLS as it helps to find the optimal number of PLS components (test set performance is also fairer in this case, since you do not use the test set for optimization). Therefore, it is always recommended to use cross-validation. You probably have noticed a small warning we got when we created the first PLS model in this chapter: m = pls(X.c, y.c, 7, scale = T, info = "Shoesize prediction model") ## Warning in selectCompNum.pls(model): No validation results were found! When you create a model, it tries to select the optimal number of components automatically (which, of course, you can always change later). To do that, the method uses RMSE values calculated for different numbers of components and cross-validation predictions. So, if we do not use cross-validation, it warns us about this. There are two different criteria for automatic selection. One uses the first local minimum on the RMSE plot and the second is the so-called Wold criterion, based on the ratio between the PRESS values for the current and the next component. 
You can select which criterion to use by specifying parameter ncomp.selcrit (either 'min' or 'wold') as shown below. m1 = pls(X.c, y.c, 7, scale = T, cv = 1, ncomp.selcrit = 'min') show(m1$ncomp.selected) ## [1] 5 m2 = pls(X.c, y.c, 7, scale = T, cv = 1, ncomp.selcrit = 'wold') show(m2$ncomp.selected) ## [1] 4 And here are the RMSE plots (they are identical of course): par(mfrow = c(1, 2)) plotRMSE(m1) plotRMSE(m2) Parameter cv has the same format as for PCA. If it is a number, it will be used as the number of segments for random cross-validation, e.g. if cv = 2, cross-validation with two segments will be carried out. For full cross-validation use cv = 1 like we did in the example above. As a more advanced option, you can provide a list with the name of the cross-validation method, the number of segments and the number of iterations, e.g. cv = list('rand', 4, 4) for running random cross-validation with four segments and four repetitions, or cv = list('ven', 8) for a systematic split into eight segments (venetian blinds). Method summary() for the model shows performance statistics calculated using the optimal number of components for each of the results. 
summary(m1$calres) ## ## PLS regression results (class plsres) summary ## ## Response variable Shoesize: ## X expvar X cumexpvar Y expvar Y cumexpvar RMSE Slope Bias RPD ## Comp 1 50.505 50.505 93.779 93.779 0.966 0.938 0 4.1 ## Comp 2 20.979 71.484 2.926 96.705 0.703 0.967 0 5.6 ## Comp 3 8.667 80.151 0.917 97.622 0.597 0.976 0 6.6 ## Comp 4 5.847 85.998 0.479 98.101 0.534 0.981 0 7.4 ## Comp 5 11.642 97.640 0.088 98.189 0.521 0.982 0 7.6 ## Comp 6 0.495 98.135 0.347 98.536 0.468 0.985 0 8.4 ## Comp 7 0.442 98.577 0.202 98.738 0.435 0.987 0 9.1 There is no column for R2 as Y cumexpvar values are the same. "],
["plotting-methods-1.html", "Plotting methods", " Plotting methods Plotting methods, again, work similarly to PCA, so in this section we will look in more detail at the available methods rather than at how to customize them. PLS produces a lot of different results and many more possible plots. Here is a list of methods which will work both for a model and for particular results. Methods for summary statistics. Methods Description plotRMSE(obj, ny = 1, ...) RMSE values vs. number of components in a model plotXVariance(obj) explained variance for X decomposition for each component plotXCumVariance(obj) same as above but cumulative plotYVariance(obj) explained variance for Y decomposition for each component plotYCumVariance(obj) same as above but cumulative Here and in some other methods parameter ny is used to specify which y-variable you want to see a plot for (if y is multivariate). Methods for objects. Methods Description plotPredictions(obj, ny = 1, ncomp) Plot with predicted vs. measured (reference) y-values plotXScores(obj, comp) Scores for decomposition of X (similar to PCA scores plot) plotYScores(obj, comp) Scores for decomposition of y (similar to PCA scores plot) plotXResiduals(obj, ncomp) Residuals for decomposition of X (similar to PCA residuals plot) plotYResiduals(obj, ncomp) Residuals for y vs. real (reference) y-values plotXYScores(obj, ncomp) Y-scores vs. X-scores for a particular PLS component. Parameter comp allows providing the numbers of selected components (one or several) to show the plot for, while parameter ncomp assumes that only one number is expected (number of components in a model or an individual component). So if, e.g., you created a model with five components and selected three, you can also see, for example, the prediction plot when using only one or four components. 
Here is an example for the m1 model: par(mfrow = c(1, 2)) plotPredictions(m1) plotPredictions(m1, ncomp = 1) The plots for variables are available only for a model object and include: Methods Description plotXLoadings(obj, comp) Loadings for decomposition of X (similar to PCA loadings plot) plotYLoadings(obj, comp) Loadings for decomposition of y (similar to PCA loadings plot) plotWeights(obj, comp) Weights (W) for PLS decomposition plotRegcoeffs(obj, ny, ncomp) Plot with regression coefficients plotVIPScores(obj, ny) VIP scores for the predictors plotSelectivityRatio(obj, ny, ncomp) Selectivity ratio of the predictors And, of course, both model and result objects have method plot() for giving an overview. plot(m1) Excluding rows and columns From v. 0.8.0, the PCA implementation as well as any other method in mdatools can exclude rows and columns from calculations. The implementation works similarly to what was described for PCA. For example, it can be useful if you have some candidates for outliers, or do variable selection and do not want to remove rows and columns physically from the data matrix. In this case you can just specify two additional parameters, exclcols and exclrows, using either numbers or names of rows/columns to be excluded. You can also specify a vector with logical values (all TRUEs will be excluded). The excluded rows are not used for creating a model and calculating the model’s and results’ performance (e.g. explained variance). However, the main results (for PLS — scores, predictions, residuals) are calculated for these rows as well and set hidden, so you will not see them on plots. You can always show e.g. scores for the excluded objects by using show.excluded = TRUE. It is implemented via attributes known to the plotting methods from mdatools, so if you use e.g. ggplot2 you will see all points. The excluded columns are not used for any calculations either; the corresponding results (e.g. loadings, weights or regression coefficients) will have zero values for such columns and will also be hidden on plots. "],
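The three forms that exclrows/exclcols accept — numeric indices, names, or a logical vector — all reduce to the same thing: a mask over rows or columns. The Python helper below is a hypothetical sketch of that reduction (make_excl_mask is not an mdatools function), using R-style 1-based indices.

```python
def make_excl_mask(names, excl):
    """Turn an exclusion spec into a boolean mask over variables.
    Mirrors the three forms exclrows/exclcols accept: 1-based indices
    (as in R), names, or a logical vector (TRUEs are excluded)."""
    # a logical vector is already a mask
    if excl and all(isinstance(e, bool) for e in excl):
        return list(excl)
    mask = [False] * len(names)
    for e in excl:
        # 1-based index for numbers, lookup for names
        i = e - 1 if isinstance(e, int) else names.index(e)
        mask[i] = True
    return mask

names = ["Height", "Weight", "IQ"]
print(make_excl_mask(names, [3]))         # [False, False, True]
print(make_excl_mask(names, ["Weight"]))  # [False, True, False]
```

Anything the mask marks as True is then left out of the calculations but, as described above, still carried through the results (hidden on plots, zero coefficients for excluded columns).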
["variable-selection.html", "Variable selection", " Variable selection PLS calculates several statistics which can be used to select the most important (or remove the least important) variables in order to improve performance and make the model simpler. The first two are VIP scores (Variable Importance in Projection) and selectivity ratio. All details and theory can be found e.g. here. Both parameters can be shown as plots and as a vector of values for a selected y-variable. Selectivity ratio is calculated for all possible components in a model, but VIP scores (due to computational time) only for the selected number of components, and they are recalculated every time you change the optimal number of components using the selectCompNum() method. Here are some plots. par(mfrow = c(2, 2)) plotVIPScores(m1, type = 'h', show.labels = T) plotSelectivityRatio(m1, type = 'b', show.labels = T) plotSelectivityRatio(m1, ncomp = 1, type = 'h', show.labels = T) plotSelectivityRatio(m1, ncomp = 2, type = 'h', show.labels = T) In the example below, I create two other PLS models by excluding variables with VIP score or selectivity ratio below a threshold (0.5 and 1 correspondingly) and show the performance of both. 
m3 = pls(X.c, y.c, 5, scale = T, cv = 1, exclcols = getVIPScores(m1, ncomp = 2) < 0.5) summary(m3) ## ## PLS model (class pls) summary ## ## Performance and validation: ## Number of selected components: 4 ## X cumexpvar Y cumexpvar RMSE Slope Bias RPD ## Cal 85.29 98.15 0.527 0.98 0.0000 7.50 ## CV 80.53 96.34 0.741 0.97 -0.0382 5.34 m4 = pls(X.c, y.c, 5, scale = T, cv = 1, exclcols = getSelectivityRatio(m1, ncomp = 2) < 1) summary(m4) ## ## PLS model (class pls) summary ## ## Performance and validation: ## Number of selected components: 1 ## X cumexpvar Y cumexpvar RMSE Slope Bias RPD ## Cal 86.80 94.97 0.868 0.95 0.0000 4.56 ## CV 84.37 93.91 0.955 0.94 0.0034 4.14 Another way is to make inference about the regression coefficients and calculate confidence intervals and p-values for each variable. This can be done using the Jack-Knife approach, where the model is cross-validated using a sufficient number of segments (at least ten) and the statistics are calculated using the distribution of regression coefficient values obtained at each step. There are two parameters, coeffs.ci and coeffs.alpha: the first selects the method (so far only Jack-Knife is available, the value is 'jk') and the second is the significance level for computing the confidence intervals (0.1 by default). Here is an example. mjk = pls(X.c, y.c, 7, scale = T, coeffs.ci = 'jk', coeffs.alpha = 0.05) If the number of segments is not specified, as in the example above, full cross-validation will be used. The statistics are calculated for each y-variable and each available number of components. When you show a plot for the regression coefficients, the confidence intervals will be shown automatically. You can change this by using parameter show.ci = FALSE. 
par(mfrow = c(2, 2)) plotRegcoeffs(mjk, type = 'h', show.labels = T) plotRegcoeffs(mjk, ncomp = 2, type = 'h', show.labels = T) plotRegcoeffs(mjk, type = 'h', show.labels = T, show.ci = F) plotRegcoeffs(mjk, ncomp = 2, type = 'h', show.labels = T, show.ci = F) Calling function summary() for regression coefficients allows to get all calculated statistics. summary(mjk$coeffs, ncomp = 2) ## Regression coefficients for Shoesize ## ------------------------------------ ## Estimated coefficients t-value SE p-value 95% CI (lo) ## Height 0.19357436 25.6 0.007507308 0.000 0.17804431 ## Weight 0.18935489 24.6 0.007629829 0.000 0.17357139 ## Hairleng -0.17730737 -17.3 0.010263098 0.000 -0.19853820 ## Age 0.02508113 0.9 0.026825735 0.358 -0.03041213 ## Income 0.01291497 0.4 0.034034881 0.703 -0.05749155 ## Beer 0.11441573 6.3 0.018145181 0.000 0.07687956 ## Wine 0.08278735 2.7 0.030127820 0.012 0.02046321 ## Sex -0.17730737 -17.3 0.010263098 0.000 -0.19853820 ## Swim 0.19426871 14.6 0.013195834 0.000 0.16697105 ## Region 0.02673161 1.1 0.023419425 0.271 -0.02171516 ## IQ -0.01705852 -0.5 0.032839748 0.603 -0.08499272 ## 95% CI (up) ## Height 0.20910441 ## Weight 0.20513839 ## Hairleng -0.15607653 ## Age 0.08057439 ## Income 0.08332148 ## Beer 0.15195189 ## Wine 0.14511150 ## Sex -0.15607653 ## Swim 0.22156638 ## Region 0.07517838 ## IQ 0.05087567 ## ## Degrees of freedom: 23, t-value: 2.07 summary(mjk$coeffs, ncomp = 2, alpha = 0.01) ## Regression coefficients for Shoesize ## ------------------------------------ ## Estimated coefficients t-value SE p-value 99% CI (lo) ## Height 0.19357436 25.6 0.007507308 0.000 0.172498826 ## Weight 0.18935489 24.6 0.007629829 0.000 0.167935399 ## Hairleng -0.17730737 -17.3 0.010263098 0.000 -0.206119327 ## Age 0.02508113 0.9 0.026825735 0.358 -0.050227714 ## Income 0.01291497 0.4 0.034034881 0.703 -0.082632366 ## Beer 0.11441573 6.3 0.018145181 0.000 0.063476112 ## Wine 0.08278735 2.7 0.030127820 0.012 -0.001791551 ## Sex -0.17730737 -17.3 
0.010263098 0.000 -0.206119327 ## Swim 0.19426871 14.6 0.013195834 0.000 0.157223579 ## Region 0.02673161 1.1 0.023419425 0.271 -0.039014575 ## IQ -0.01705852 -0.5 0.032839748 0.603 -0.109250717 ## 99% CI (up) ## Height 0.21464989 ## Weight 0.21077438 ## Hairleng -0.14849541 ## Age 0.10038997 ## Income 0.10846230 ## Beer 0.16535534 ## Wine 0.16736626 ## Sex -0.14849541 ## Swim 0.23131385 ## Region 0.09247780 ## IQ 0.07513367 ## ## Degrees of freedom: 23, t-value: 2.81 Function getRegcoeffs() in this case may also return the corresponding t-value, standard error, p-value, and confidence interval for each of the coefficients (except the intercept) if the user specifies the parameter full. The standard errors and confidence intervals are also computed for the raw, unstandardized variables (similar to the coefficients). show(getRegcoeffs(mjk, ncomp = 2, full = T)) ## Estimated coefficients t-value SE p-value ## Intercept 1.342626e+01 NA NA NA ## Height 7.456695e-02 25.5886755 2.891897e-03 2.116375e-18 ## Weight 4.896328e-02 24.6148053 1.972917e-03 5.012592e-18 ## Hairleng -6.865495e-01 -17.2818695 3.973960e-02 1.137354e-14 ## Age 1.081716e-02 0.9380153 1.156959e-02 3.579832e-01 ## Income 5.642220e-06 0.3857258 1.486897e-05 7.032445e-01 ## Beer 5.010542e-03 6.2692224 7.946214e-04 2.136646e-06 ## Wine 7.511377e-03 2.7258935 2.733526e-03 1.204856e-02 ## Sex -6.865495e-01 -17.2818695 3.973960e-02 1.137354e-14 ## Swim 1.010496e-01 14.6146811 6.863861e-03 3.941369e-13 ## Region 1.035071e-01 1.1283478 9.068204e-02 2.708059e-01 ## IQ -5.277164e-03 -0.5268369 1.015919e-02 6.033516e-01 ## 95% CI (lo) 95% CI (up) ## Intercept NA NA ## Height 6.858461e-02 8.054929e-02 ## Weight 4.488199e-02 5.304457e-02 ## Hairleng -7.687571e-01 -6.043419e-01 ## Age -1.311636e-02 3.475068e-02 ## Income -2.511659e-05 3.640103e-05 ## Beer 3.366742e-03 6.654342e-03 ## Wine 1.856647e-03 1.316611e-02 ## Sex -7.687571e-01 -6.043419e-01 ## Swim 8.685061e-02 1.152486e-01 ## Region -8.408298e-02 2.910972e-01 ## IQ -2.629305e-02 
1.573872e-02 ## attr(,"name") ## [1] "Regression coefficients for Shoesize" It is also possible to change significance level for confidence intervals. show(getRegcoeffs(mjk, ncomp = 2, full = T, alpha = 0.01)) ## Estimated coefficients t-value SE p-value ## Intercept 1.342626e+01 NA NA NA ## Height 7.456695e-02 25.5886755 2.891897e-03 2.116375e-18 ## Weight 4.896328e-02 24.6148053 1.972917e-03 5.012592e-18 ## Hairleng -6.865495e-01 -17.2818695 3.973960e-02 1.137354e-14 ## Age 1.081716e-02 0.9380153 1.156959e-02 3.579832e-01 ## Income 5.642220e-06 0.3857258 1.486897e-05 7.032445e-01 ## Beer 5.010542e-03 6.2692224 7.946214e-04 2.136646e-06 ## Wine 7.511377e-03 2.7258935 2.733526e-03 1.204856e-02 ## Sex -6.865495e-01 -17.2818695 3.973960e-02 1.137354e-14 ## Swim 1.010496e-01 14.6146811 6.863861e-03 3.941369e-13 ## Region 1.035071e-01 1.1283478 9.068204e-02 2.708059e-01 ## IQ -5.277164e-03 -0.5268369 1.015919e-02 6.033516e-01 ## 99% CI (lo) 99% CI (up) ## Intercept NA NA ## Height 6.644843e-02 8.268548e-02 ## Weight 4.342464e-02 5.450192e-02 ## Hairleng -7.981119e-01 -5.749871e-01 ## Age -2.166256e-02 4.329689e-02 ## Income -3.609997e-05 4.738441e-05 ## Beer 2.779773e-03 7.241311e-03 ## Wine -1.625491e-04 1.518530e-02 ## Sex -7.981119e-01 -5.749871e-01 ## Swim 8.178042e-02 1.203188e-01 ## Region -1.510678e-01 3.580821e-01 ## IQ -3.379741e-02 2.324309e-02 ## attr(,"name") ## [1] "Regression coefficients for Shoesize" The p-values, t-values and standard errors are stored each as a 3-way array similar to regression coefficients. The selection can be made by comparing e.g. p-values with a threshold similar to what we have done with VIP-scores and selectivity ratio. exclcols = mjk$coeffs$p.values[, 2, 1] > 0.05 show(exclcols) ## Height Weight Hairleng Age Income Beer Wine Sex ## FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE ## Swim Region IQ ## FALSE TRUE TRUE Here p.values[, 2, 1] means values for all predictors, model with two components, first y-variable. 
newm = pls(X.c, y.c, 3, scale = T, cv = 1, exclcols = exclcols) summary(newm) ## ## PLS model (class pls) summary ## ## Performance and validation: ## Number of selected components: 3 ## X cumexpvar Y cumexpvar RMSE Slope Bias RPD ## Cal 97.85 97.79 0.575 0.98 0.0000 6.87 ## CV 96.64 96.49 0.727 0.97 -0.0161 5.44 show(getRegcoeffs(newm)) ## Shoesize ## Intercept 4.952691046 ## Height 0.091079963 ## Weight 0.052708955 ## Hairleng -0.315927643 ## Age 0.000000000 ## Income 0.000000000 ## Beer 0.009320427 ## Wine 0.018524717 ## Sex -0.315927643 ## Swim 0.134354170 ## Region 0.000000000 ## IQ 0.000000000 ## attr(,"exclrows") ## Age Income Region IQ ## 4 5 10 11 ## attr(,"name") ## [1] "Regression coefficients for Shoesize" As you can see, the variables Age, Income, Region and IQ have been excluded as they are not related to Shoesize, which seems to be correct. Variable selection, as well as everything described above, can also be carried out for PLS discriminant analysis (PLS-DA), which will be explained in one of the next chapters. "],
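The Jack-Knife statistics used for the coefficient inference above can be sketched outside of mdatools. The snippet below is a hypothetical Python illustration, assuming full (leave-one-out) cross-validation: the standard error of a coefficient is estimated from the spread of its leave-one-out estimates, and the t-value is the full-model coefficient divided by that error. Turning the t-value into a p-value or confidence interval additionally needs a t-distribution quantile (e.g. scipy.stats.t.ppf), which is omitted here.

```python
import math

def jackknife_se(b_loo):
    """Jack-Knife standard error of a regression coefficient from its
    leave-one-out estimates: se^2 = (n - 1) / n * sum((b_i - mean)^2)."""
    n = len(b_loo)
    m = sum(b_loo) / n
    return math.sqrt((n - 1) / n * sum((b - m) ** 2 for b in b_loo))

# toy leave-one-out estimates of a single coefficient
b_loo = [1.0, 1.2, 0.8, 1.0]
se = jackknife_se(b_loo)
t = (sum(b_loo) / len(b_loo)) / se  # tested against n - 1 degrees of freedom
print(round(se, 4))  # 0.2449
```

Variables with a t-value close to zero (large p-value) are the ones that end up in the exclusion mask, exactly as done with p.values[, 2, 1] above.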
["simca-classification.html", "SIMCA classification", " SIMCA classification SIMCA (Soft Independent Modelling of Class Analogy) is a simple one-class classification method mainly based on PCA. The general idea is to create a PCA model using data for samples/objects belonging to a class and classify new objects based on how well the model fits them. The decision is made using two residual distances — \\(Q\\) (squared residual distance from an object to its projection onto the PCA space) and \\(T^2\\) (distance between the projection of the object and the origin of the PC space). The \\(T^2\\) distance is calculated for normalized scores. The first distance (also known as “orthogonal distance”) shows how well the new object follows the same trend as the objects from the class used to create the model, while the second (also known as “score distance”) tells how extreme the object is. Both distances may have certain statistical limits, which can be used to cut off strangers and accept class members with a pre-defined expected ratio of false negatives (\\(\\alpha\\)). There are several ways to calculate the limits; for example, see this paper. In mdatools, so far, we use the simplest way, suggested by Svante Wold, where the limit for the orthogonal distance is calculated using the F-distribution and the limit for the score distance is computed using the Hotelling T2 distribution. More methods will be available in future releases. The classification performance is assessed using true/false positives and negatives and statistics showing the ability of a classification model to recognize class members (sensitivity, or true positive rate) and how good the model is at identifying strangers (specificity, or true negative rate). In addition to that, the model also calculates the percentage of misclassified objects. 
All statistics are calculated for calibration and validation (if any) results, but one must be aware that specificity cannot be computed without objects that do not belong to the class; therefore, calibration and cross-validation results in SIMCA do not have specificity values. It must also be noted that any SIMCA model or result is also a PCA object, and all plots, methods and statistics available for PCA can be used for SIMCA objects as well. "],
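The SIMCA decision rule and the performance statistics described above can be sketched as follows. This is a hypothetical Python illustration (simca_classify and sensitivity_specificity are not mdatools functions), assuming the Q and T2 limits have already been computed from the chosen distributions.

```python
def simca_classify(q, t2, q_lim, t2_lim):
    """Accept an object as a class member only if both its orthogonal
    distance (Q) and its score distance (T2) are within the limits."""
    return q <= q_lim and t2 <= t2_lim

def sensitivity_specificity(pred, is_member):
    """pred / is_member: parallel lists of booleans (accepted? / true member?)."""
    tp = sum(p and m for p, m in zip(pred, is_member))
    tn = sum(not p and not m for p, m in zip(pred, is_member))
    sens = tp / sum(is_member)                      # true positive rate
    spec = tn / (len(is_member) - sum(is_member))   # true negative rate
    return sens, spec

# toy example: three class members, two strangers, limits q_lim=2, t2_lim=10
q      = [0.5, 1.0, 3.0, 5.0, 0.8]
t2     = [4.0, 9.0, 2.0, 1.0, 20.0]
member = [True, True, True, False, False]
pred = [simca_classify(a, b, 2.0, 10.0) for a, b in zip(q, t2)]
print(sensitivity_specificity(pred, member))  # sensitivity 2/3, specificity 1.0
```

The example also makes the caveat above concrete: without the two strangers in the data, the true-negative count would be undefined and specificity could not be computed.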
diff --git a/docs/variable-selection.html b/docs/variable-selection.html
index 04d0dda..c341652 100644
--- a/docs/variable-selection.html
+++ b/docs/variable-selection.html
@@ -185,7 +185,7 @@ Variable selection
plotSelectivityRatio(m1, type = 'b', show.labels = T)
plotSelectivityRatio(m1, ncomp = 1, type = 'h', show.labels = T)
plotSelectivityRatio(m1, ncomp = 2, type = 'h', show.labels = T)
-
+
In the example below, I create two other PLS models by excluding variables with VIP score or selectivity ratio below a threshold (I use 0.5 and 1 correspondingly) and show the performance for both.
m3 = pls(X.c, y.c, 5, scale = T, cv = 1, exclcols = getVIPScores(m1, ncomp = 2) < 0.5)
summary(m3)
@@ -216,7 +216,7 @@ Variable selection
plotRegcoeffs(mjk, ncomp = 2, type = 'h', show.labels = T)
plotRegcoeffs(mjk, type = 'h', show.labels = T, show.ci = F)
plotRegcoeffs(mjk, ncomp = 2, type = 'h', show.labels = T, show.ci = F)
-
+
Calling function summary()
for regression coefficients allows to get all calculated statistics.
summary(mjk$coeffs, ncomp = 2)
## Regression coefficients for Shoesize