Commit
Merge branch 'develop'
svkucheryavski committed Apr 7, 2018
2 parents c2d4507 + c2a4384 commit e6b949e
Showing 7 changed files with 68 additions and 68 deletions.
Binary file modified docs/_main_files/figure-html/unnamed-chunk-43-1.png
6 changes: 3 additions & 3 deletions docs/calibration-and-validation.html
@@ -184,7 +184,7 @@ <h1>
<section class="normal" id="section-">
<div id="calibration-and-validation" class="section level2 unnumbered">
<h2>Calibration and validation</h2>
<p>The model calibration is similar to PCA, but there are several additional arguments which are important for classification. The first one is a class name: a string that can be used later, e.g. for identifying class members for testing. The second important argument is the significance level, <code>alpha</code>. This parameter is used for computing the statistical limits and can be considered as the probability of false negatives. The default value is 0.05. Finally, the parameter <code>lim.type</code> selects the method for computing the critical limits for the residuals, as described in the PCA chapter.</p>
<p>The model calibration is similar to PCA, but there are several additional arguments which are important for classification. The first one is a class name: a string that can be used later, e.g. for identifying class members for testing. The second important argument is the significance level, <code>alpha</code>. This parameter is used for computing the statistical limits and can be considered as the probability of false negatives. The default value is 0.05. Finally, the parameter <code>lim.type</code> selects the method for computing the critical limits for the distances, as described in the PCA chapter.</p>
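<p>As an illustration of these arguments, a calibration call could look like the following sketch (this assumes the <code>simca()</code> constructor from mdatools accepts the arguments described above; the data matrix <code>x.cal</code>, containing objects of the target class only, and the chosen values are hypothetical):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(mdatools)

# x.cal is a hypothetical matrix with measurements of the target class only
m = simca(x.cal, classname = "setosa", ncomp = 3,
          alpha = 0.05, lim.type = "ddmoments")
summary(m)</code></pre></div>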
<p>In this chapter, as well as for describing the other classification methods, we will use the famous Iris dataset available in R. The dataset includes 150 measurements of three Iris species: <em>Setosa</em>, <em>Virginica</em> and <em>Versicolor</em>. The measurements are length and width of petals and sepals in cm. Use <code>?iris</code> for more details.</p>
<p>Let’s get the data and split it into calibration and test sets.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">data</span>(iris)
@@ -298,10 +298,10 @@ <h3>Predictions and validation with a test set</h3>
<div id="class-belonging-probabilities" class="section level3 unnumbered">
<h3>Class belonging probabilities</h3>
<p>In addition to the array with predicted classes, the object with SIMCA results also contains an array with class belonging probabilities. The probabilities are calculated depending on how close a particular object is to the critical limit border.</p>
<p>To compute the probability we use the same theoretical distribution for the Q and T<sup>2</sup> residuals as for computing the critical values (defined by the parameter <code>lim.type</code>). The distribution is used to calculate a p-value, the chance of getting an object with the given residual value or larger. The p-value is then compared with the significance level, <span class="math inline">\(\alpha\)</span>, and the probability, <span class="math inline">\(\pi\)</span>, is calculated as follows:</p>
<p>To compute the probability we use the same theoretical distribution for the Q and T<sup>2</sup> distances as for computing the critical values (defined by the parameter <code>lim.type</code>). The distribution is used to calculate a p-value, the chance of getting an object with the given distance value or larger. The p-value is then compared with the significance level, <span class="math inline">\(\alpha\)</span>, and the probability, <span class="math inline">\(\pi\)</span>, is calculated as follows:</p>
<p><span class="math display">\[\pi = 0.5 (p / \alpha) \]</span></p>
<p>So if the p-value equals the significance level (which happens when an object lies exactly on the acceptance border), the probability is 0.5. If the p-value is e.g. 0.04, <span class="math inline">\(\pi = 0.4\)</span>, or 40%, and the object is rejected as a stranger (here we assume that <span class="math inline">\(\alpha = 0.05\)</span>). If the p-value is e.g. 0.06, <span class="math inline">\(\pi = 0.6\)</span>, or 60%, and the object is accepted as a member of the class. If the p-value is larger than <span class="math inline">\(2\times\alpha\)</span>, the probability is set to 1.</p>
<p>In the case of a rectangular acceptance area (<code>lim.type = 'jm'</code> or <code>'chisq'</code>) the probability is computed separately for the Q and T<sup>2</sup> residuals and the smaller of the two is taken. In the case of a triangular acceptance area (<code>lim.type = 'ddmoments'</code> or <code>'ddrobust'</code>) the probability is calculated for a combination of the residuals.</p>
<p>In the case of a rectangular acceptance area (<code>lim.type = 'jm'</code> or <code>'chisq'</code>) the probability is computed separately for the Q and T<sup>2</sup> values and the smaller of the two is taken. In the case of a triangular acceptance area (<code>lim.type = 'ddmoments'</code> or <code>'ddrobust'</code>) the probability is calculated for a combination of the distances.</p>
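<p>The rule above can be summarised as a small sketch (an illustration of the formula only, not the internal code of the package):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># class belonging probability from a p-value, following the rule above
# (the value is capped at 1 once the p-value reaches 2 * alpha)
class.prob = function(p, alpha = 0.05) {
  pmin(1, 0.5 * p / alpha)
}

class.prob(c(0.01, 0.04, 0.05, 0.06, 0.20))
# 0.1 0.4 0.5 0.6 1.0</code></pre></div>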
<p>Here is how to show the probability values that correspond to the predictions shown in the previous code chunk.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">show</span>(res<span class="op">$</span>p.pred[<span class="dv">31</span><span class="op">:</span><span class="dv">40</span>, <span class="dv">1</span><span class="op">:</span><span class="dv">3</span>, <span class="dv">1</span>])</code></pre></div>
<pre><code>## Comp 1 Comp 2 Comp 3
2 changes: 1 addition & 1 deletion docs/plotting-methods.html
@@ -189,7 +189,7 @@ <h2>Plotting methods</h2>
<span class="kw">mdaplot</span>(m<span class="op">$</span>calres<span class="op">$</span>scores, <span class="dt">type =</span> <span class="st">&#39;p&#39;</span>, <span class="dt">show.labels =</span> T, <span class="dt">show.lines =</span> <span class="kw">c</span>(<span class="dv">0</span>, <span class="dv">0</span>))
<span class="kw">mdaplot</span>(m<span class="op">$</span>loadings, <span class="dt">type =</span> <span class="st">&#39;p&#39;</span>, <span class="dt">show.labels =</span> T, <span class="dt">show.lines =</span> <span class="kw">c</span>(<span class="dv">0</span>, <span class="dv">0</span>))</code></pre></div>
<p><img src="_main_files/figure-html/unnamed-chunk-55-1.png" width="864" /></p>
<p>To simplify this routine, every model and result class also has a number of functions for visualization. For PCA, for example, the function list includes scores and loadings plots, explained variance and cumulative explained variance plots, T<sup>2</sup> vs. Q residuals plots and many others.</p>
<p>To simplify this routine, every model and result class also has a number of functions for visualization. For PCA, for example, the function list includes scores and loadings plots, explained variance and cumulative explained variance plots, T<sup>2</sup> distances vs. Q residuals plots and many others.</p>
<p>A function that does the same thing for different models and results always has the same name. For example, <code>plotPredictions</code> will show a predicted vs. measured plot for a PLS model and PLS result, an MLR model and MLR result, a PCR model and PCR result and so on. The first argument must always be either a model or a result object.</p>
<p>The major difference between plots for a model and plots for a result is the following. A plot for a result always shows one set of data objects: one set of points, lines or bars. For example, predicted vs. measured values for the calibration set, or score values for the test set, and so on. For such plots the method <code>mdaplot()</code> is used and you can provide any arguments available for this method (e.g. color grouping of the scores for calibration results).</p>
<p>A plot for a model, in contrast, in most cases shows several sets of data objects, e.g. predicted values for calibration and validation. In this case the corresponding method uses <code>mdaplotg()</code> and you can therefore adjust the plot using the arguments described for that method.</p>
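<p>For example, calling the same function for a model and for one of its results gives the two kinds of plots described above (a hypothetical sketch, assuming a PLS model <code>m.pls</code> has been calibrated earlier):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># model plot: several sets of objects (e.g. calibration and validation), uses mdaplotg()
plotPredictions(m.pls)

# result plot: one set of objects (calibration results only), uses mdaplot()
plotPredictions(m.pls$calres)</code></pre></div>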
96 changes: 48 additions & 48 deletions docs/randomized-pca-algorithms.html
@@ -201,12 +201,12 @@ <h2>Randomized PCA algorithms</h2>
t1 =<span class="st"> </span><span class="kw">system.time</span>({m1 =<span class="st"> </span><span class="kw">pca</span>(D, <span class="dt">ncomp =</span> <span class="dv">2</span>)})
<span class="kw">show</span>(t1)</code></pre></div>
<pre><code>## user system elapsed
## 60.262 3.103 65.328</code></pre>
## 59.397 3.086 62.987</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># randomized SVD with p = 5 and q = 1</span>
t2 =<span class="st"> </span><span class="kw">system.time</span>({m2 =<span class="st"> </span><span class="kw">pca</span>(D, <span class="dt">ncomp =</span> <span class="dv">2</span>, <span class="dt">rand =</span> <span class="kw">c</span>(<span class="dv">5</span>, <span class="dv">1</span>))})
<span class="kw">show</span>(t2)</code></pre></div>
<pre><code>## user system elapsed
## 34.870 3.322 42.607</code></pre>
## 34.448 2.643 37.416</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># compare variances</span>
<span class="kw">summary</span>(m1)</code></pre></div>
<pre><code>##
@@ -215,8 +215,8 @@ <h2>Randomized PCA algorithms</h2>
## Info:
##
## Eigvals Expvar Cumexpvar
## Comp 1 112.597 62.18 62.18
## Comp 2 49.704 27.45 89.63</code></pre>
## Comp 1 112.699 62.17 62.17
## Comp 2 49.897 27.52 89.69</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">summary</span>(m2)</code></pre></div>
<pre><code>##
## PCA model (class pca) summary
@@ -226,33 +226,33 @@ <h2>Randomized PCA algorithms</h2>
##
## Parameters for randomized algorithm: q = 5, p = 1
## Eigvals Expvar Cumexpvar
## Comp 1 112.597 62.18 62.18
## Comp 2 49.704 27.45 89.63</code></pre>
## Comp 1 112.699 62.17 62.17
## Comp 2 49.897 27.52 89.69</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># compare loadings</span>
<span class="kw">show</span>(m1<span class="op">$</span>loadings[<span class="dv">1</span><span class="op">:</span><span class="dv">10</span>, ])</code></pre></div>
<pre><code>## Comp 1 Comp 2
## [1,] 3.049489e-06 -0.0000979183
## [2,] -3.900622e-02 0.0692210369
## [3,] -6.844649e-02 0.0747632444
## [4,] -8.140565e-02 0.0121164621
## [5,] -7.440682e-02 -0.0612545176
## [6,] -4.905702e-02 -0.0782478934
## [7,] -1.158937e-02 -0.0229581801
## [8,] 2.871800e-02 0.0533882207
## [9,] 6.184087e-02 0.0801510534
## [10,] 7.973446e-02 0.0329061559</code></pre>
## [1,] 7.799545e-05 -1.616151e-05
## [2,] -3.861521e-02 -6.939509e-02
## [3,] -6.814709e-02 -7.538469e-02
## [4,] -8.140091e-02 -1.254632e-02
## [5,] -7.476187e-02 6.095674e-02
## [6,] -4.946691e-02 7.770438e-02
## [7,] -1.173607e-02 2.264508e-02
## [8,] 2.901043e-02 -5.343363e-02
## [9,] 6.228145e-02 -8.011626e-02
## [10,] 7.983831e-02 -3.260527e-02</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">show</span>(m2<span class="op">$</span>loadings[<span class="dv">1</span><span class="op">:</span><span class="dv">10</span>, ])</code></pre></div>
<pre><code>## Comp 1 Comp 2
## [1,] 3.049489e-06 -0.0000979183
## [2,] -3.900622e-02 0.0692210369
## [3,] -6.844649e-02 0.0747632444
## [4,] -8.140565e-02 0.0121164621
## [5,] -7.440682e-02 -0.0612545176
## [6,] -4.905702e-02 -0.0782478934
## [7,] -1.158937e-02 -0.0229581801
## [8,] 2.871800e-02 0.0533882207
## [9,] 6.184087e-02 0.0801510534
## [10,] 7.973446e-02 0.0329061559</code></pre>
## [1,] 7.799545e-05 1.616151e-05
## [2,] -3.861521e-02 6.939509e-02
## [3,] -6.814709e-02 7.538469e-02
## [4,] -8.140091e-02 1.254632e-02
## [5,] -7.476187e-02 -6.095674e-02
## [6,] -4.946691e-02 -7.770438e-02
## [7,] -1.173607e-02 -2.264508e-02
## [8,] 2.901043e-02 5.343363e-02
## [9,] 6.228145e-02 8.011626e-02
## [10,] 7.983831e-02 3.260527e-02</code></pre>
<p>As you can see, the explained variance values, eigenvalues and loadings are practically identical in the two models (the sign of a loading vector may be flipped, which is arbitrary in PCA), and the second method is about twice as fast.</p>
<p>It is possible to make the PCA decomposition even faster if only loadings and scores are needed. In this case you can use the method <code>pca.run()</code> and skip the other steps, such as calculation of residuals, variances, critical limits and so on. However, in this case the data matrix must be centered (and scaled if necessary) manually prior to the decomposition. Here is an example using the data generated in the previous code.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">D =<span class="st"> </span><span class="kw">scale</span>(D, <span class="dt">center =</span> T, <span class="dt">scale =</span> F)
@@ -261,37 +261,37 @@ <h2>Randomized PCA algorithms</h2>
t1 =<span class="st"> </span><span class="kw">system.time</span>({P1 =<span class="st"> </span><span class="kw">pca.run</span>(D, <span class="dt">method =</span> <span class="st">&#39;svd&#39;</span>, <span class="dt">ncomp =</span> <span class="dv">2</span>)})
<span class="kw">show</span>(t1)</code></pre></div>
<pre><code>## user system elapsed
## 26.312 0.272 27.052</code></pre>
## 25.966 0.261 26.293</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># randomized SVD with p = 5 and q = 1</span>
t2 =<span class="st"> </span><span class="kw">system.time</span>({P2 =<span class="st"> </span><span class="kw">pca.run</span>(D, <span class="dt">method =</span> <span class="st">&#39;svd&#39;</span>, <span class="dt">ncomp =</span> <span class="dv">2</span>, <span class="dt">rand =</span> <span class="kw">c</span>(<span class="dv">5</span>, <span class="dv">1</span>))})
<span class="kw">show</span>(t2)</code></pre></div>
<pre><code>## user system elapsed
## 2.120 0.045 2.166</code></pre>
## 2.085 0.043 2.130</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># compare loadings</span>
<span class="kw">show</span>(P1<span class="op">$</span>loadings[<span class="dv">1</span><span class="op">:</span><span class="dv">10</span>, ])</code></pre></div>
<pre><code>## [,1] [,2]
## [1,] 3.049489e-06 -0.0000979183
## [2,] -3.900622e-02 0.0692210369
## [3,] -6.844649e-02 0.0747632444
## [4,] -8.140565e-02 0.0121164621
## [5,] -7.440682e-02 -0.0612545176
## [6,] -4.905702e-02 -0.0782478934
## [7,] -1.158937e-02 -0.0229581801
## [8,] 2.871800e-02 0.0533882207
## [9,] 6.184087e-02 0.0801510534
## [10,] 7.973446e-02 0.0329061559</code></pre>
## [1,] 7.799545e-05 -1.616151e-05
## [2,] -3.861521e-02 -6.939509e-02
## [3,] -6.814709e-02 -7.538469e-02
## [4,] -8.140091e-02 -1.254632e-02
## [5,] -7.476187e-02 6.095674e-02
## [6,] -4.946691e-02 7.770438e-02
## [7,] -1.173607e-02 2.264508e-02
## [8,] 2.901043e-02 -5.343363e-02
## [9,] 6.228145e-02 -8.011626e-02
## [10,] 7.983831e-02 -3.260527e-02</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">show</span>(P2<span class="op">$</span>loadings[<span class="dv">1</span><span class="op">:</span><span class="dv">10</span>, ])</code></pre></div>
<pre><code>## [,1] [,2]
## [1,] 3.049489e-06 -0.0000979183
## [2,] -3.900622e-02 0.0692210369
## [3,] -6.844649e-02 0.0747632444
## [4,] -8.140565e-02 0.0121164621
## [5,] -7.440682e-02 -0.0612545176
## [6,] -4.905702e-02 -0.0782478934
## [7,] -1.158937e-02 -0.0229581801
## [8,] 2.871800e-02 0.0533882207
## [9,] 6.184087e-02 0.0801510534
## [10,] 7.973446e-02 0.0329061559</code></pre>
## [1,] 7.799545e-05 1.616151e-05
## [2,] -3.861521e-02 6.939509e-02
## [3,] -6.814709e-02 7.538469e-02
## [4,] -8.140091e-02 1.254632e-02
## [5,] -7.476187e-02 -6.095674e-02
## [6,] -4.946691e-02 -7.770438e-02
## [7,] -1.173607e-02 -2.264508e-02
## [8,] 2.901043e-02 5.343363e-02
## [9,] 6.228145e-02 8.011626e-02
## [10,] 7.983831e-02 3.260527e-02</code></pre>
<p>As you can see, the loadings are essentially the same (again up to an arbitrary sign) but the randomized algorithm is about 15 times faster.</p>

</div>
