---
title: \vspace{3in}"Machine Learning Project"
author: "Xiaoyan Lu, Anya Conti, Khushbakht Irfanullah, Xueyan Hou"
date: "12 December 2018"
output:
  pdf_document:
    number_sections: true
  html_document: default
---
\centering
\newpage
\tableofcontents
\newpage
#### Introduction
The coordinate descent algorithm is a way of 'solving optimization problems by performing minimization along coordinate directions or coordinate hyperplanes'. It aims to minimize a function over a large set of variables by optimising one coordinate at a time while keeping all the other coordinates fixed. Coordinate descent has various applications, particularly within machine learning, because of its easy implementation and scalability. In this report we implement coordinate descent algorithms to solve the Lasso and the Elastic Net problems.
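As a toy illustration of the idea (this example is our own and is not part of the project code), coordinate descent on a smooth convex function of two variables simply cycles through exact one-dimensional minimisations:
```{r}
# Toy example (ours): minimise f(b1, b2) = (b1 - 1)^2 + (b2 + 2)^2 + 0.5*b1*b2
# by repeatedly minimising over one coordinate while holding the other fixed.
b <- c(0, 0)
for(it in 1:50){
  b[1] <- 1 - 0.25 * b[2]    # exact minimiser of f over b1 with b2 fixed
  b[2] <- -2 - 0.25 * b[1]   # exact minimiser of f over b2 with b1 fixed
}
b  # converges to the joint minimiser (1.6, -2.4)
```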
#### Lasso
The Lasso (Least Absolute Shrinkage and Selection Operator) is a powerful method that performs two main tasks: variable selection and regularisation. It does this by applying a shrinking process, penalising the coefficients of the regression variables and shrinking some of them to zero. Hence, the Lasso yields sparse models by using an l1 norm penalty, which forces some of the coefficient estimates to be exactly zero when the tuning parameter $\lambda$ is large enough. This in turn helps minimise the prediction error of the model. We are now going to use the coordinate descent algorithm to solve the Lasso problem, which requires us to minimize the following with respect to $\beta_j$:
$$\frac{1}{2n}\sum_{i=1}^n\Big(y_i-\sum_{j=1}^px_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^p|\beta_j|$$
The method we are going to implement is as follows:

1. Initialise the coefficients of all the variables to zero, i.e. $\beta_j = 0$ for $j = 1,\dots,p$.
2. Repeat the following steps until convergence:
    a. Compute the partial residuals, i.e. $r_{ij} = y_i - \sum_{k\neq j}x_{ik}\beta_k$.
    b. Compute the least squares coefficient of these residuals, i.e. $\beta_j^*=\frac{1}{n}\sum_{i=1}^nx_{ij}r_{ij}$.
    c. Update the coefficients by soft thresholding, i.e. $\beta_j=\text{sign}(\beta_j^*)(|\beta_j^*|-\lambda)_+$.
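The soft-thresholding update in step (c) can be written as a stand-alone operator. The helper below is a small sketch of our own (the function name is not part of the project code) showing how it shrinks small values exactly to zero:
```{r}
# Soft-thresholding operator S(z, gamma) = sign(z) * (|z| - gamma)_+
# (illustrative helper of our own, mirroring the update used in step (c)).
softThreshold <- function(z, gamma){
  sign(z) * pmax(abs(z) - gamma, 0)
}
softThreshold(c(-2, -0.3, 0.1, 1.5), gamma = 0.5)  # -1.5  0.0  0.0  1.0
```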
We implemented the Lasso in R with the following function:
```{r}
coordDescAlg.Lasso <- function(lambda, y, x){
  x <- as.matrix(x)
  converge <- FALSE                    # initialising converge as FALSE
  n <- dim(x)[1]                       # number of observations (rows)
  p <- dim(x)[2]                       # number of variables (columns)
  BetaHat <- rep(0, p)                 # initialising BetaHat
  BetaHat.star <- rep(0, p)            # initialising BetaHat.star
  newBetaHat <- rep(0, p)              # initialising newBetaHat
  r <- matrix(rep(0, n*p), ncol = p)   # creating the partial-residual matrix
  while(!converge){
    for(j in 1:p){
      r[,j] <- y - x[,-j] %*% BetaHat[-j]                             # computing partial residuals
      BetaHat.star[j] <- (t(x[,j]) %*% r[,j]) / (t(x[,j]) %*% x[,j])  # least squares coefficient of the residuals
      newBetaHat[j] <- sign(BetaHat.star[j]) *
        pmax(abs(BetaHat.star[j]) - n*(lambda / (t(x[,j]) %*% x[,j])), 0)  # updating newBetaHat by soft thresholding
      if(all(round(BetaHat, 5) == round(newBetaHat, 5))){              # stop once the coefficients stop changing
        converge <- TRUE
        break
      } else {
        BetaHat <- newBetaHat
      }
    }
  }
  return(BetaHat)
}
```
Note that, rather than requiring the columns of $X$ to be standardised in advance, each update divides by $x_j^Tx_j$, so the function can be applied to unstandardised data directly.
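As a quick illustration of how the function might be called (the simulated data, seed and value of $\lambda$ below are our own arbitrary choices, not part of the project's analysis):
```{r}
# Small illustrative call on simulated data (data, seed and lambda are ours).
set.seed(1)
xDemo <- matrix(rnorm(50 * 4), ncol = 4)
yDemo <- xDemo %*% c(2, 0, -1, 0) + rnorm(50)
coordDescAlg.Lasso(lambda = 0.5, y = yDemo, x = xDemo)
```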
The Lasso can be advantageous in many situations because of its ability to shrink and remove redundant coefficients, which helps provide good accuracy. Moreover, the Lasso increases model interpretability as it eliminates irrelevant variables that are not associated with the response variable, thereby reducing overfitting. It further improves on Ridge Regression by using the l1 norm penalty to actually provide us with a subset of variables from the model, that is, to carry out successful variable selection. However, despite these advantages, there are certain situations in which the Lasso does not perform well.
Firstly, when faced with a very complex model where the number of dimensions is greater than the number of observations, that is, p > n, the Lasso ends up selecting at most n variables before it saturates, which may mean that relevant variables are left out of the model. Secondly, in situations where a group of variables has very high pairwise correlation, the Lasso tends to select only one variable from the group, which might result in poor selection as all the other variables in that group are ignored. Finally, even though the Lasso dominates Ridge Regression for variable selection when n > p, it falls behind in prediction performance.
#### Elastic Net
The Elastic Net enables us to overcome these limitations of the Lasso by combining elements of both the Lasso and Ridge Regression. It does this by adding a quadratic element to the penalty, using both the l1 and l2 norm penalties, which removes the limitation on the number of selected variables and also stabilises the selection among grouped variables. Hence, like the Lasso, it provides sparse solutions, but it also has the advantage of performing well with highly correlated variables.
We are now going to use the coordinate descent algorithm to solve the Elastic Net problem, which follows a very similar method to the Lasso. However, for the Elastic Net, we need to minimize the following with respect to $\beta_j$:
$$\frac{1}{2n}\sum_{i=1}^n\Big(y_i-\sum_{j=1}^px_{ij}\beta_j\Big)^2 + \lambda_1\sum_{j=1}^p|\beta_j| + \lambda_2\sum_{j=1}^p\beta_j^2$$
Differentiating with respect to $\beta_j$ (taking $\text{sign}(\beta_j)$ as the subgradient of $|\beta_j|$) and setting the result to zero gives
$$-\frac{1}{n}\sum_{i=1}^n\Big(y_i - \sum_{k=1}^px_{ik}\beta_k\Big)x_{ij} + \lambda_1\,\text{sign}(\beta_j) + 2\lambda_2\beta_j = 0.$$
Writing the partial residuals as $r_{ij} = y_i - \sum_{k\neq j}x_{ik}\beta_k$ and assuming standardised predictors, so that $\frac{1}{n}\sum_{i=1}^nx_{ij}^2 = 1$, this becomes
$$-\frac{1}{n}\sum_{i=1}^nr_{ij}x_{ij} + \beta_j + \lambda_1\,\text{sign}(\beta_j) + 2\lambda_2\beta_j = 0,$$
that is, with $\beta_j^*=\frac{1}{n}\sum_{i=1}^nx_{ij}r_{ij}$,
$$\beta_j(1 + 2\lambda_2) = \beta_j^* - \lambda_1\,\text{sign}(\beta_j).$$
Solving this by soft thresholding (Zou and Hastie, 2005) then gives us:
$$\beta_j=\frac{\text{sign}(\beta_j^*)(|\beta_j^*|-\lambda_1)_+}{1 + 2\lambda_2}$$
We then implemented this Elastic Net Algorithm in R with the following code:
```{r}
coordDescAlg.ElNet <- function(lambda, y, x){
  # lambda has two components: lambda[1] (l1 penalty) and lambda[2] (l2 penalty)
  x <- as.matrix(x)
  converge <- FALSE                    # initialising converge as FALSE
  n <- dim(x)[1]                       # number of observations (rows)
  p <- dim(x)[2]                       # total number of variables
  BetaHat <- rep(0, p)                 # initialising BetaHat
  BetaHat.star <- rep(0, p)            # initialising BetaHat.star
  newBetaHat <- rep(0, p)              # initialising newBetaHat
  r <- matrix(rep(0, n*p), ncol = p)   # creating the partial-residual matrix
  while(!converge){
    for(j in 1:p){
      r[,j] <- y - x[,-j] %*% BetaHat[-j]                             # computing partial residuals
      BetaHat.star[j] <- (t(x[,j]) %*% r[,j]) / (t(x[,j]) %*% x[,j])  # least squares coefficient of the residuals
      newBetaHat[j] <- sign(BetaHat.star[j]) *
        pmax(abs(BetaHat.star[j]) - ((n*lambda[1] / (t(x[,j]) %*% x[,j])) / (1 + 2*lambda[2])), 0)  # soft thresholding; the l1 threshold is divided by (1 + 2*lambda[2])
      if(all(round(BetaHat, 5) == round(newBetaHat, 5))){              # stop once the coefficients stop changing
        converge <- TRUE
        break
      } else {
        BetaHat <- newBetaHat
      }
    }
  }
  return(BetaHat)
}
```
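As a quick sanity check (again an illustration of our own, with data, seed and penalty values chosen arbitrarily), setting $\lambda_2 = 0$ makes the Elastic Net update identical to the Lasso update, so the two functions should return the same coefficients:
```{r}
# Illustrative check (data, seed and lambda values are ours): with lambda_2 = 0
# the Elastic Net soft-thresholding update reduces to the Lasso update.
set.seed(2)
xChk <- matrix(rnorm(60 * 5), ncol = 5)
yChk <- xChk %*% c(1, 0, 0, 2, 0) + rnorm(60)
cbind(Lasso = coordDescAlg.Lasso(0.3, yChk, xChk),
      ElNet = coordDescAlg.ElNet(c(0.3, 0), yChk, xChk))
```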
#### Selecting the Regularisation parameters for Lasso and Elastic Net
To implement the Lasso and the Elastic Net as efficient feature selection approaches, we need a method to select the values of the tuning parameters ($\lambda$ for the Lasso; $\lambda_1$ and $\lambda_2$ for the Elastic Net). We have done this using the validation set approach for both problems. Under this approach, we divide the dataset into a training, a validation, and a test set, and use the training set to fit the model. We do not select these subsets at random because the dataset is already randomly generated. We then evaluate the fitted model on the validation set for different values of $\lambda$, calculate the corresponding MSE, and find the value of $\lambda$ that minimises the MSE using our validation function below. For the Lasso, we use the optim() function with the Brent method, which is designed for one-dimensional minimisation, to find the $\lambda$ with the minimum MSE; for the Elastic Net, we use the L-BFGS-B method, as the optimisation is two-dimensional. This function thereby finds the optimal values of the tuning parameters for both the Lasso ($\lambda$) and the Elastic Net ($\lambda_1, \lambda_2$), and returns the corresponding coefficient estimates (BetaHat) and the MSE.
```{r}
# Called by the Validation function; calculates the validation MSE for a given lambda (Lasso).
MSEValid.Lasso <- function(lambda, x, y, sizes){
  testSize <- sizes[1]                               # size of the training dataset
  validSize <- sizes[2]                              # size of the validation dataset
  ytrain <- y[1:testSize]                            # selecting the training data
  xtrain <- x[1:testSize,]
  yvalid <- y[(testSize+1):(testSize + validSize)]   # selecting the validation set
  xvalid <- x[(testSize+1):(testSize + validSize),]
  BetaHat <- coordDescAlg.Lasso(lambda, ytrain, xtrain)   # fitting the Lasso for the given lambda
  MSE <- mean((yvalid - xvalid %*% BetaHat)^2)            # validation MSE for the given lambda
  return(MSE)
}

# Called by the Validation function; calculates the validation MSE for a given lambda pair (Elastic Net).
MSEValid.ElasticNet <- function(lambda, x, y, sizes){
  testSize <- sizes[1]                               # size of the training dataset
  validSize <- sizes[2]                              # size of the validation dataset
  ytrain <- y[1:testSize]                            # selecting the training data
  xtrain <- x[1:testSize,]
  yvalid <- y[(testSize+1):(testSize + validSize)]   # selecting the validation set
  xvalid <- x[(testSize+1):(testSize + validSize),]
  BetaHat <- coordDescAlg.ElNet(lambda, ytrain, xtrain)   # fitting the Elastic Net for the given lambdas
  MSE <- mean((yvalid - xvalid %*% BetaHat)^2)            # validation MSE for the given lambdas
  return(MSE)
}

Validation <- function(x, y, sizes){
  testSize <- sizes[1]
  # minimise the validation MSE over lambda for the Lasso (one-dimensional, Brent)
  MinMSE.Lasso <- optim(c(0.0001), MSEValid.Lasso, x = x, y = y, sizes = sizes,
                        method = "Brent", lower = 0, upper = 4)
  BetaHat.Lasso <- coordDescAlg.Lasso(MinMSE.Lasso$par, y[1:testSize], x[1:testSize,])
  # minimise the validation MSE over (lambda_1, lambda_2) for the Elastic Net (L-BFGS-B)
  MinMSE.ElasticNet <- optim(c(0.0001, 0.0001), MSEValid.ElasticNet, x = x, y = y, sizes = sizes,
                             method = "L-BFGS-B", lower = c(0, 0), upper = c(4, 1))
  BetaHat.ElasticNet <- coordDescAlg.ElNet(MinMSE.ElasticNet$par, y[1:testSize], x[1:testSize,])
  return(list(BetaHat.Lasso, MinMSE.Lasso$value, BetaHat.ElasticNet, MinMSE.ElasticNet$value))
}
```
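To illustrate how this validation search might be used (the data, dimensions and seed below are our own choices, not part of the reported analysis): the first `sizes[1]` rows act as the training set and the next `sizes[2]` rows as the validation set.
```{r}
# Illustrative call of the validation search on simulated data (ours).
set.seed(3)
xVal <- matrix(rnorm(80 * 4), ncol = 4)
yVal <- xVal %*% c(3, 0, 1.5, 0) + rnorm(80)
valOut <- Validation(x = xVal, y = yVal, sizes = c(20, 20))
valOut[[1]]  # Lasso coefficients fitted at the selected lambda
valOut[[2]]  # validation MSE achieved by the Lasso at that lambda
```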
#### Simulations
The Elastic Net aims to overcome the limitations of the Lasso, and is expected to perform well not only in prediction accuracy but also in variable selection. Therefore, in order to show this in action, we carried out several different simulations.
To carry out the simulation process, we first simulate the data using our function `mySample`. This randomly generates a multivariate sample $\pmb X$ in which each column follows a standard normal distribution, with the correlation between two variables given by $corr(i,j) = \rho^{|i-j|}$; in the baseline case we set $\rho = 0.5$. Since each variable is standard normal, the covariance matrix is equal to this correlation matrix (our function `covMatrix`). Our dependent variable is given by $\pmb y = \pmb X\pmb\beta + \sigma\epsilon$ with $\epsilon \sim N(\pmb 0, \pmb I_n)$, where $\epsilon$ is the error and $\sigma$ is its standard deviation. Our baseline case is $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^{T}$ and $\sigma = 3$. We generate a sample of 240 observations, of which 20 form the training set, 20 form the validation set, and the remaining 200 form the test set. We do not need to select these at random, since each observation is independently and identically distributed. We then create a function called `Testing` which runs one round of the simulation: it generates one data set, estimates the model on the training data using our Lasso and Elastic Net functions, optimises the tuning parameters using the validation data, and returns the MSE at the selected parameters (in our implementation this is the validation-set MSE) together with the number of non-zero coefficients. Finally, the `Simulation` function runs multiple rounds at a time and returns the number of non-zero coefficients and the MSE for each respective model. In our baseline case, we run 50 simulations.
The parameters which we have are as follows, with the baseline values given:
\begin{table}
\caption{Baseline Values}
\centering
\begin{tabular}{c c}
\hline\hline
Parameters & Values \\[0.5ex]
\hline
Number of simulations & 50 \\
Number of observations & 240 \\
Size of Training Set & 20 \\
Size of Validation Set & 20 \\
$\beta$ & $(3, 1.5, 0, 0, 2, 0, 0, 0)^{T}$\\
$\sigma$ & 3 \\
$\rho$ & 0.5 \\[1ex]
\hline
\end{tabular}
\end{table}
```{r}
library(MASS)  # for mvrnorm, used in mySample below

# Runs the whole simulation study: repeats Testing 'times' times and stacks the results.
Simulation <- function(times, obs, trainsize, validsize, rho, errorSD, betaVals){
  Output <- NULL
  for(i in 1:times){
    Output <- rbind(Output, Testing(n = obs, sizes = c(trainsize, validsize), rho = rho,
                                    errorSD = errorSD, betaVals = betaVals))
    print(i)  # progress indicator
  }
  return(Output)
}

# Called by Simulation; runs one round of the simulation and returns the number of
# non-zero coefficients and the MSE at the optimal lambda for each model.
Testing <- function(n, sizes, rho, errorSD, betaVals){
  SampleOut <- mySample(n, rho, errorSD, betaVals)
  x <- SampleOut[[1]]
  y <- SampleOut[[2]]
  Output <- Validation(x = x, y = y, sizes = sizes)
  NumbNonZeros.Lasso <- sum(Output[[1]] != 0)   # non-zero coefficients from the Lasso
  NumbNonZeros.ElNet <- sum(Output[[3]] != 0)   # non-zero coefficients from the Elastic Net
  MSE.Lasso <- Output[[2]]
  MSE.ElNet <- Output[[4]]
  return(c(NumbNonZeros.Lasso, MSE.Lasso, NumbNonZeros.ElNet, MSE.ElNet))
}

# Called by Testing; randomly generates one data set.
mySample <- function(n, rho, errorSD, betaVals){
  p <- dim(betaVals)[1]
  SigmaMat <- covMatrix(rho, p)
  mu <- rep(0, p)
  x <- mvrnorm(n = n, mu = mu, Sigma = SigmaMat)
  error <- rnorm(n)
  y <- x %*% betaVals + errorSD*error
  return(list(x, y))
}

# Called by mySample; creates the variance-covariance matrix with corr(i,j) = rho^|i-j|.
covMatrix <- function(rho, p){
  covMatrix <- matrix(rep(0, p^2), ncol = p)  # initialising covMatrix
  for(i in 1:p){
    for(j in 1:p){
      covMatrix[i,j] <- rho^abs(i-j)
    }
  }
  return(covMatrix)
}
```
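As a small check of the data-generating process (our own illustration, not part of the reported simulations), the empirical correlation matrix of a large sample from `mySample` should be close to $\rho^{|i-j|}$:
```{r}
# Illustrative check of the simulated correlation structure (sample size and
# seed are our own choices).
set.seed(4)
chk <- mySample(n = 5000, rho = 0.5, errorSD = 3,
                betaVals = matrix(c(3, 1.5, 0, 0, 2, 0, 0, 0), ncol = 1))
round(cor(chk[[1]])[1:4, 1:4], 2)  # approximately 0.5^|i-j|
```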
##### Baseline Results
The baseline results are shown in the table below:
\begin{table}[!htbp] \centering
\caption{Baseline Simulation Results}
\begin{tabular}{@{\extracolsep{5pt}} ccc}
\\[-1.8ex]\hline
\hline \\[-1.8ex]
& Mean & Var \\
\hline \\[-1.8ex]
NonZeroCoefs.Lasso & $3.560$ & $4.415$ \\
MSE.Lasso & $12.671$ & $25.843$ \\
NonZeroCoefs.ElNet & $5.080$ & $6.034$ \\
MSE.ElNet & $12.119$ & $28.183$ \\
\hline \\[-1.8ex]
\end{tabular}
\end{table}
Our baseline results show the Elastic Net outperforming the Lasso, with a lower mean squared error across our simulations. This might be because of the Elastic Net's grouped selection property, which tends to treat highly correlated variables similarly, whether removing or keeping them. In addition, the number of non-zero coefficients was higher for the Elastic Net, which is consistent with the form of its penalty.
We are then curious about how each specific parameter might affect the results. Thus, we run simulations over various values of one parameter while keeping all the others constant, and then graph the changes in MSE and in the number of non-zero coefficients as a function of the changing parameter. The graphs show the mean for the Lasso in red and the mean for the Elastic Net in blue, with a shaded area of one standard error around the mean. For the most part, the values of the other parameters are kept at the baseline values for consistency.
As we will see in all the results, the mean MSE for the Lasso is generally higher than that of the Elastic Net; however, this is not always the case, and the two are always very close. The difference in the number of non-zero coefficients is more apparent: the Lasso is always lower.
##### a. Changing the number of simulations
We used numbers of simulations in the range of 5 to 100, in intervals of 5 simulations. The other parameters are all set to the baseline values. The results are shown below:
ADD PLOT
The number of simulations does not appear to have a large effect on either the MSE or the number of non-zero coefficients. While it makes sense that it would not have a large effect, we had expected the results to stabilise more at higher numbers of simulations.
##### b. Changing the number of observations
We used the number of observations in the range of 50 to 1000, in intervals of 25 observations. The other parameters are all set to the baseline values. The results are shown below:
ADD PLOT
The number of observations does not appear to have a large effect on either the MSE or the number of non-zero coefficients. This makes sense because we are still training on a training set of the same size, which is what the model is based on.
##### c. Changing the size of the training dataset
We varied the size of the training set from 15 to 210 observations, in intervals of 5. The other parameters are all set to the baseline values. The results are shown below:
ADD PLOT
The MSE of both models drops off at the start as the size of the training set increases, but then levels off after a while. This makes sense, as a larger training set means that we can better estimate the true model and minimises the effect of any outliers. If most of the observations in a small training set happen to be similar but not representative of the overall distribution, the estimates will be skewed; however, as the training size increases, the law of large numbers makes this less likely. The number of non-zero coefficients increases as the size of the training set increases, but again levels off after a while, probably for similar reasons.
##### d. Changing the size of the validation dataset
We varied the size of the validation set from 15 to 210 observations, in intervals of 5. The other parameters are all set to the baseline values. The results are shown below:
The size of the validation set does not appear to have a large effect on either the MSE or the number of non-zero coefficients. This makes sense because we are still training on a training set of the same size, which is what the model is based on.
##### e. Changing sparsity levels of Beta
We used the following $\beta$ values: $\beta = (3, 1.5, -2, 4, 2, -1.3, 6, -3)^T$, where each successive iteration set one additional value of $\beta$ to 0, starting from the last, to increase the sparsity. For example, our second run used $\beta = (3, 1.5, -2, 4, 2, -1.3, 6, 0)^T$, our third run used $\beta = (3, 1.5, -2, 4, 2, -1.3, 0, 0)^T$, etc. For most of the other parameters we used the baseline values, with the exception of $\rho$: to prevent the correlation between variables from affecting the results, we set $\rho = 0$ so that each variable is independent. The results are shown below.
For low sparsity levels, the MSE of the Lasso is much higher than that of the Elastic Net, but this goes down as the sparsity increases until they are essentially the same when $\beta = (0,0,0,0,0,0,0,0)^T$. In addition, the standard error of the Lasso's MSE is very large at low sparsity levels and slowly decreases to being much lower, which makes sense because of the penalty term, under which the Lasso estimates high numbers of zero coefficients; the Elastic Net penalty, in contrast, does not produce as many zero coefficients. The Elastic Net seems to be a lot more stable across sparsity levels and definitely outperforms the Lasso at low sparsity levels; at high sparsity levels, they are very similar. For the number of non-zero coefficients, the Elastic Net is very close to 0 non-zero coefficients at low sparsity levels, as it should be, and its standard error is very small. As the sparsity level increases, the number of non-zero coefficients decreases, as it should, though the standard error increases. In contrast, at low sparsity levels the Lasso has lower numbers of non-zero coefficients compared to the Elastic Net, which is not good, and also has extremely high standard errors. As the sparsity level increases, the number of non-zero coefficients goes to 0 as it should, and the standard error greatly decreases.
##### f. Changing sigma
We used $\sigma$ values ranging from -10 to 10 in intervals of 0.5. The other parameters were set to the baseline values. The results are shown below.
ADD PLOT
The change in $\sigma$ seemed to have fairly similar effects on the MSE of both models: when $|\sigma|$ is high, both have larger MSEs with higher standard errors, as expected, since the error then has a larger effect on $\pmb y$. The two models are very close, though the Elastic Net actually seems to have a slightly higher MSE. As $|\sigma|$ gets closer to 0, the MSEs and their standard errors both decrease in an almost parabolic shape, with the Elastic Net having the lower MSE.
For the number of non-zero coefficients, this is lower at high levels of $|\sigma|$, though the two models are very similar to each other. As the value of $|\sigma|$ decreases, the number of non-zero coefficients increases, far more for the Elastic Net than for the Lasso. In addition, the standard errors decrease for the Elastic Net as $\sigma$ approaches 0, but not for the Lasso.
##### g. Changing Rho
For $\rho$, we used values ranging from -0.95 to 0.95 in intervals of 0.05. All the other parameters are set to the baseline values. The results are shown below.
ADD PLOT
There does not appear to be a very obvious pattern in the MSEs. At very high levels of $|\rho|$, the MSE does appear to decrease slightly for both models, though a bit more so for the Lasso than for the Elastic Net. The number of non-zero coefficients again shows no very obvious pattern, though the means for both models increase as $\rho$ increases and the standard errors decrease a little as $\rho$ increases. For the Elastic Net, this might be because of the grouped selection property mentioned above.
#### Conclusion
In general, the Elastic Net seems to have a lower mean squared error than the Lasso, and the Lasso tends to set more of the $\beta$ coefficients to zero. However, this can change depending on the circumstances, as we show in our simulations, where we change the parameters one at a time and show how this affects the MSE and the number of non-zero coefficients. A number of the parameters did not change the outcome much, such as the number of simulations, the number of observations, and the size of the validation set. The value of $\rho$ had a slight impact on the MSE and on the number of non-zero $\beta$ coefficients, but not much. The parameters with the largest effect were the size of the training data set, the sparsity level of $\beta$, and the value of $\sigma$.
Unfortunately, a lot of these parameters are unknown in the real world, so we will almost never know which situation we are in, and thus which model would be better. However, we can always increase the size of the training set compared to the size of the validation and test sets which improves the model accuracy for both the algorithms. We could estimate the variance-covariance matrix so we have some idea of how the variables are correlated with each other. However, the true sparsity level of $\beta$ and the value of $\sigma$ will never be known unless it is a simulated data set such as this.
If we are interested in decreasing the number of variables, the Lasso tends to be a better choice; if we are interested in model accuracy or in preserving the effects of more variables for analysis, then the Elastic Net tends to be better.
\newpage
#### References
* Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society, Series B 67(2): 301-320.
* James, G., Witten, D., Hastie, T. and Tibshirani, R. An Introduction to Statistical Learning (ISLR), Chapter 6.
* Friedman, J., Hastie, T. and Hofling, H. (2007). Pathwise coordinate optimization, The Annals of Applied Statistics 1: 302-332.
* Tibshirani, R. (1996). Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B 58: 267-288.
* Tseng, P. (1988). Coordinate ascent for maximizing nondifferentiable concave functions, Technical Report.
* Tseng, P. (2001). Convergence of block coordinate descent method for nondifferentiable minimization, J. Opt. Theory Appl. 109: 474-494.
* https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2929880/#R5
\newpage
#### Appendix
```{r}
library(MASS)     # for mvrnorm
library(ggplot2)  # for the plots below

coordDescAlg.Lasso <- function(lambda, y, x){
  x <- as.matrix(x)
  converge <- FALSE                    # initialising converge as FALSE
  n <- dim(x)[1]                       # number of observations (rows)
  p <- dim(x)[2]                       # number of variables (columns)
  BetaHat <- rep(0, p)
  BetaHat.star <- rep(0, p)
  newBetaHat <- rep(0, p)
  r <- matrix(rep(0, n*p), ncol = p)   # partial-residual matrix
  while(!converge){
    for(j in 1:p){
      r[,j] <- y - x[,-j] %*% BetaHat[-j]                             # computing partial residuals
      BetaHat.star[j] <- (t(x[,j]) %*% r[,j]) / (t(x[,j]) %*% x[,j])  # least squares coefficient of the residuals
      newBetaHat[j] <- sign(BetaHat.star[j]) *
        pmax(abs(BetaHat.star[j]) - n*(lambda / (t(x[,j]) %*% x[,j])), 0)  # soft thresholding
      if(all(round(BetaHat, 5) == round(newBetaHat, 5))){              # stop once the coefficients stop changing
        converge <- TRUE
        break
      } else {
        BetaHat <- newBetaHat
      }
    }
  }
  return(BetaHat)
}

coordDescAlg.ElNet <- function(lambda, y, x){
  # lambda has two components: lambda[1] (l1 penalty) and lambda[2] (l2 penalty)
  x <- as.matrix(x)
  converge <- FALSE
  n <- dim(x)[1]
  p <- dim(x)[2]
  BetaHat <- rep(0, p)
  BetaHat.star <- rep(0, p)
  newBetaHat <- rep(0, p)
  r <- matrix(rep(0, n*p), ncol = p)
  while(!converge){
    for(j in 1:p){
      r[,j] <- y - x[,-j] %*% BetaHat[-j]                             # computing partial residuals
      BetaHat.star[j] <- (t(x[,j]) %*% r[,j]) / (t(x[,j]) %*% x[,j])  # least squares coefficient of the residuals
      newBetaHat[j] <- sign(BetaHat.star[j]) *
        pmax(abs(BetaHat.star[j]) - ((n*lambda[1] / (t(x[,j]) %*% x[,j])) / (1 + 2*lambda[2])), 0)  # soft thresholding
      if(all(round(BetaHat, 5) == round(newBetaHat, 5))){              # stop once the coefficients stop changing
        converge <- TRUE
        break
      } else {
        BetaHat <- newBetaHat
      }
    }
  }
  return(BetaHat)
}
# Runs the whole simulation study: repeats Testing 'times' times and stacks the results.
Simulation <- function(times, obs, trainsize, validsize, rho, errorSD, betaVals){
  Output <- NULL
  for(i in 1:times){
    Output <- rbind(Output, Testing(n = obs, sizes = c(trainsize, validsize), rho = rho,
                                    errorSD = errorSD, betaVals = betaVals))
    print(i)  # progress indicator
  }
  return(Output)
}

# Called by Simulation; runs one round of the simulation and returns the number of
# non-zero coefficients and the MSE at the optimal lambda for each model.
Testing <- function(n, sizes, rho, errorSD, betaVals){
  SampleOut <- mySample(n, rho, errorSD, betaVals)
  x <- SampleOut[[1]]
  y <- SampleOut[[2]]
  Output <- Validation(x = x, y = y, sizes = sizes)
  NumbNonZeros.Lasso <- sum(Output[[1]] != 0)   # non-zero coefficients from the Lasso
  NumbNonZeros.ElNet <- sum(Output[[3]] != 0)   # non-zero coefficients from the Elastic Net
  MSE.Lasso <- Output[[2]]
  MSE.ElNet <- Output[[4]]
  return(c(NumbNonZeros.Lasso, MSE.Lasso, NumbNonZeros.ElNet, MSE.ElNet))
}

# Called by Testing; randomly generates one data set.
mySample <- function(n, rho, errorSD, betaVals){
  p <- dim(betaVals)[1]
  SigmaMat <- covMatrix(rho, p)
  mu <- rep(0, p)
  x <- mvrnorm(n = n, mu = mu, Sigma = SigmaMat)
  error <- rnorm(n)
  y <- x %*% betaVals + errorSD*error
  return(list(x, y))
}

# Called by mySample; creates the variance-covariance matrix with corr(i,j) = rho^|i-j|.
covMatrix <- function(rho, p){
  covMatrix <- matrix(rep(0, p^2), ncol = p)  # initialising covMatrix
  for(i in 1:p){
    for(j in 1:p){
      covMatrix[i,j] <- rho^abs(i-j)
    }
  }
  return(covMatrix)
}

Validation <- function(x, y, sizes){
  testSize <- sizes[1]
  # minimise the validation MSE over lambda for the Lasso (one-dimensional, Brent)
  MinMSE.Lasso <- optim(c(0.0001), MSEValid.Lasso, x = x, y = y, sizes = sizes,
                        method = "Brent", lower = 0, upper = 4)
  BetaHat.Lasso <- coordDescAlg.Lasso(MinMSE.Lasso$par, y[1:testSize], x[1:testSize,])
  # minimise the validation MSE over (lambda_1, lambda_2) for the Elastic Net (L-BFGS-B)
  MinMSE.ElasticNet <- optim(c(0.0001, 0.0001), MSEValid.ElasticNet, x = x, y = y, sizes = sizes,
                             method = "L-BFGS-B", lower = c(0, 0), upper = c(4, 1))
  BetaHat.ElasticNet <- coordDescAlg.ElNet(MinMSE.ElasticNet$par, y[1:testSize], x[1:testSize,])
  return(list(BetaHat.Lasso, MinMSE.Lasso$value, BetaHat.ElasticNet, MinMSE.ElasticNet$value))
}

# Called by Validation; calculates the validation MSE for a given lambda (Lasso).
MSEValid.Lasso <- function(lambda, x, y, sizes){
  testSize <- sizes[1]                               # size of the training dataset
  validSize <- sizes[2]                              # size of the validation dataset
  ytrain <- y[1:testSize]                            # selecting the training data
  xtrain <- x[1:testSize,]
  yvalid <- y[(testSize+1):(testSize + validSize)]   # selecting the validation set
  xvalid <- x[(testSize+1):(testSize + validSize),]
  BetaHat <- coordDescAlg.Lasso(lambda, ytrain, xtrain)   # fitting the Lasso for the given lambda
  MSE <- mean((yvalid - xvalid %*% BetaHat)^2)            # validation MSE for the given lambda
  return(MSE)
}

# Called by Validation; calculates the validation MSE for a given lambda pair (Elastic Net).
MSEValid.ElasticNet <- function(lambda, x, y, sizes){
  testSize <- sizes[1]                               # size of the training dataset
  validSize <- sizes[2]                              # size of the validation dataset
  ytrain <- y[1:testSize]                            # selecting the training data
  xtrain <- x[1:testSize,]
  yvalid <- y[(testSize+1):(testSize + validSize)]   # selecting the validation set
  xvalid <- x[(testSize+1):(testSize + validSize),]
  BetaHat <- coordDescAlg.ElNet(lambda, ytrain, xtrain)   # fitting the Elastic Net for the given lambdas
  MSE <- mean((yvalid - xvalid %*% BetaHat)^2)            # validation MSE for the given lambdas
  return(MSE)
}
#Baseline
output <- Simulation(times = 50, obs = 240, trainsize = 20, validsize = 20, rho = 0.5, errorSD = 3, betaVals = matrix(c(3,1.5,0,0,2,0,0,0), ncol = 1))
mysummary <- cbind(colMeans(output),apply(output, 2, var))
rownames(mysummary) <- c("NonZeroCoefs.Lasso", "MSE.Lasso", "NonZeroCoefs.ElNet", "MSE.ElNet")
colnames(mysummary) <- c("Mean", "Var")
mysummary
meanAndvar <- function(x){
  cbind(colMeans(x), apply(x, 2, var))
}
#Change number of simulations
NSims <- seq(5,100,5)
outputSims <- lapply(NSims, Simulation, obs = 240, trainsize = 20, validsize = 20, rho = 0.5, errorSD = 3, betaVals = matrix(c(3,1.5,0,0,2,0,0,0), ncol = 1))
mysummary <- lapply(outputSims, meanAndvar)
NonZeroCoefsSummary.Lasso <- data.frame("NSims" = NSims, Mod = "Lasso", "Mean" = sapply(mysummary, "[[", 1,1), "Var" = sapply(mysummary, "[[", 1,2))
NonZeroCoefsSummary.ElNet <- data.frame("NSims" = NSims, Mod = "ElNet", "Mean" = sapply(mysummary, "[[", 3,1), "Var" = sapply(mysummary, "[[", 3,2))
NonZeroCoefsSummary <- rbind(NonZeroCoefsSummary.Lasso, NonZeroCoefsSummary.ElNet)
MSEsummary.Lasso <- data.frame("NSims" = NSims, Mod = "Lasso", "Mean" = sapply(mysummary, "[[", 2,1), "Var" = sapply(mysummary, "[[", 2,2))
MSEsummary.ElNet <- data.frame("NSims" = NSims, Mod = "ElNet", "Mean" = sapply(mysummary, "[[", 4,1), "Var" = sapply(mysummary, "[[", 4,2))
MSEsummary <- rbind(MSEsummary.Lasso, MSEsummary.ElNet)
ggplot(MSEsummary) + ggtitle("MSE by Number of Simulations") +
geom_line(aes(x = NSims, y = Mean, color = Mod)) +
geom_ribbon(aes(x= NSims, ymin = Mean - sqrt(Var), ymax = Mean + sqrt(Var), fill = Mod), alpha = 0.3)
ggplot(NonZeroCoefsSummary) + ggtitle("Non-Zero Coefficients by Number of Simulations") +
geom_line(aes(x = NSims, y = Mean, color = Mod)) +
geom_ribbon(aes(x= NSims, ymin = Mean - sqrt(Var), ymax = Mean + sqrt(Var), fill = Mod), alpha = 0.3)
#Change number of observations
Nobs <- seq(50,1000,25)
outputNobs <- lapply(Nobs, Simulation, times = 50, trainsize = 20, validsize = 20, rho = 0.5, errorSD = 3, betaVals = matrix(c(3,1.5,0,0,2,0,0,0), ncol = 1))
mysummary <- lapply(outputNobs, meanAndvar)
NonZeroCoefsSummary.Lasso <- data.frame("Nobs" = Nobs, Mod = "Lasso", "Mean" = sapply(mysummary, "[[", 1,1), "Var" = sapply(mysummary, "[[", 1,2))
NonZeroCoefsSummary.ElNet <- data.frame("Nobs" = Nobs, Mod = "ElNet", "Mean" = sapply(mysummary, "[[", 3,1), "Var" = sapply(mysummary, "[[", 3,2))
NonZeroCoefsSummary <- rbind(NonZeroCoefsSummary.Lasso, NonZeroCoefsSummary.ElNet)
MSEsummary.Lasso <- data.frame("Nobs" = Nobs, Mod = "Lasso", "Mean" = sapply(mysummary, "[[", 2,1), "Var" = sapply(mysummary, "[[", 2,2))
MSEsummary.ElNet <- data.frame("Nobs" = Nobs, Mod = "ElNet", "Mean" = sapply(mysummary, "[[", 4,1), "Var" = sapply(mysummary, "[[", 4,2))
MSEsummary <- rbind(MSEsummary.Lasso, MSEsummary.ElNet)
ggplot(MSEsummary) + ggtitle("MSE by Number of Observations") +
geom_line(aes(x = Nobs, y = Mean, color = Mod)) +
geom_ribbon(aes(x= Nobs, ymin = Mean - sqrt(Var), ymax = Mean + sqrt(Var), fill = Mod), alpha = 0.3)
ggplot(NonZeroCoefsSummary) + ggtitle("Non-Zero Coefficients by Number of Observations") +
geom_line(aes(x = Nobs, y = Mean, color = Mod)) +
geom_ribbon(aes(x= Nobs, ymin = Mean - sqrt(Var), ymax = Mean + sqrt(Var), fill = Mod), alpha = 0.3)
#Change size of training data
TrainNumb <- seq(15,210,5)
outputTrainNumb <- lapply(TrainNumb, Simulation, times = 50, obs = 240, validsize = 20, rho = 0.5, errorSD = 3, betaVals = matrix(c(3,1.5,0,0,2,0,0,0), ncol = 1))
mysummary <- lapply(outputTrainNumb, meanAndvar)
NonZeroCoefsSummary.Lasso <- data.frame("TrainNumb" = TrainNumb, Mod = "Lasso", "Mean" = sapply(mysummary, "[[", 1,1), "Var" = sapply(mysummary, "[[", 1,2))
NonZeroCoefsSummary.ElNet <- data.frame("TrainNumb" = TrainNumb, Mod = "ElNet", "Mean" = sapply(mysummary, "[[", 3,1), "Var" = sapply(mysummary, "[[", 3,2))
NonZeroCoefsSummary <- rbind(NonZeroCoefsSummary.Lasso, NonZeroCoefsSummary.ElNet)
MSEsummary.Lasso <- data.frame("TrainNumb" = TrainNumb, Mod = "Lasso", "Mean" = sapply(mysummary, "[[", 2,1), "Var" = sapply(mysummary, "[[", 2,2))
MSEsummary.ElNet <- data.frame("TrainNumb" = TrainNumb, Mod = "ElNet", "Mean" = sapply(mysummary, "[[", 4,1), "Var" = sapply(mysummary, "[[", 4,2))
MSEsummary <- rbind(MSEsummary.Lasso, MSEsummary.ElNet)
ggplot(MSEsummary) + ggtitle("MSE by Size of Training Set") +
geom_line(aes(x = TrainNumb, y = Mean, color = Mod)) +
geom_ribbon(aes(x= TrainNumb, ymin = Mean - sqrt(Var), ymax = Mean + sqrt(Var), fill = Mod), alpha = 0.3)
ggplot(NonZeroCoefsSummary) + ggtitle("Non-Zero Coefficients by Size of Training Set") +
geom_line(aes(x = TrainNumb, y = Mean, color = Mod)) +
geom_ribbon(aes(x= TrainNumb, ymin = Mean - sqrt(Var), ymax = Mean + sqrt(Var), fill = Mod), alpha = 0.3)
#Change size of validation data
ValidNumb <- seq(15,210,5)
outputValidNumb <- lapply(ValidNumb, Simulation, times = 50, obs = 240, trainsize = 20, rho = 0.5, errorSD = 3, betaVals = matrix(c(3,1.5,0,0,2,0,0,0), ncol = 1))
mysummary <- lapply(outputValidNumb, meanAndvar)
NonZeroCoefsSummary.Lasso <- data.frame("ValidNumb" = ValidNumb, Mod = "Lasso", "Mean" = sapply(mysummary, "[[", 1,1), "Var" = sapply(mysummary, "[[", 1,2))
NonZeroCoefsSummary.ElNet <- data.frame("ValidNumb" = ValidNumb, Mod = "ElNet", "Mean" = sapply(mysummary, "[[", 3,1), "Var" = sapply(mysummary, "[[", 3,2))
NonZeroCoefsSummary <- rbind(NonZeroCoefsSummary.Lasso, NonZeroCoefsSummary.ElNet)
MSEsummary.Lasso <- data.frame("ValidNumb" = ValidNumb, Mod = "Lasso", "Mean" = sapply(mysummary, "[[", 2,1), "Var" = sapply(mysummary, "[[", 2,2))
MSEsummary.ElNet <- data.frame("ValidNumb" = ValidNumb, Mod = "ElNet", "Mean" = sapply(mysummary, "[[", 4,1), "Var" = sapply(mysummary, "[[", 4,2))
MSEsummary <- rbind(MSEsummary.Lasso, MSEsummary.ElNet)
ggplot(MSEsummary) + ggtitle("MSE by Size of Validation Set") +
geom_line(aes(x = ValidNumb, y = Mean, color = Mod)) +
geom_ribbon(aes(x= ValidNumb, ymin = Mean - sqrt(Var), ymax = Mean + sqrt(Var), fill = Mod), alpha = 0.3)
ggplot(NonZeroCoefsSummary) + ggtitle("Non-Zero Coefficients by Size of Validation Set") +
geom_line(aes(x = ValidNumb, y = Mean, color = Mod)) +
geom_ribbon(aes(x= ValidNumb, ymin = Mean - sqrt(Var), ymax = Mean + sqrt(Var), fill = Mod), alpha = 0.3)
#Change sparsity levels of beta
BetaVals.Sparsity <- list(matrix(c(0,0,0,0,0,0,0,0), ncol = 1),
matrix(c(3,0,0,0,0,0,0,0), ncol = 1),
matrix(c(3,1.5,0,0,0,0,0,0), ncol = 1),
matrix(c(3,1.5,-2,0,0,0,0,0), ncol = 1),
matrix(c(3,1.5,-2,4,0,0,0,0), ncol = 1),
matrix(c(3,1.5,-2,4,2,0,0,0), ncol = 1),
matrix(c(3,1.5,-2,4,2,-1.3,0,0), ncol = 1),
matrix(c(3,1.5,-2,4,2,-1.3,6,0), ncol = 1),
matrix(c(3,1.5,-2,4,2,-1.3,6,-3), ncol = 1))
SparsityLevels <- seq(1,0, by = -1/8)
outputSparsity <- lapply(BetaVals.Sparsity, Simulation, times = 20, obs = 240, trainsize = 20, validsize = 20, rho = 0, errorSD = 3)
mysummary <- lapply(outputSparsity, meanAndvar)
NonZeroCoefsSummary.Lasso <- data.frame("SparsityLevels" = SparsityLevels, Mod = "Lasso", "Mean" = sapply(mysummary, "[[", 1,1), "Var" = sapply(mysummary, "[[", 1,2))
NonZeroCoefsSummary.ElNet <- data.frame("SparsityLevels" = SparsityLevels, Mod = "ElNet", "Mean" = sapply(mysummary, "[[", 3,1), "Var" = sapply(mysummary, "[[", 3,2))
NonZeroCoefsSummary <- rbind(NonZeroCoefsSummary.Lasso, NonZeroCoefsSummary.ElNet)
MSEsummary.Lasso <- data.frame("SparsityLevels" = SparsityLevels, Mod = "Lasso", "Mean" = sapply(mysummary, "[[", 2,1), "Var" = sapply(mysummary, "[[", 2,2))
MSEsummary.ElNet <- data.frame("SparsityLevels" = SparsityLevels, Mod = "ElNet", "Mean" = sapply(mysummary, "[[", 4,1), "Var" = sapply(mysummary, "[[", 4,2))
MSEsummary <- rbind(MSEsummary.Lasso, MSEsummary.ElNet)
ggplot(MSEsummary) + ggtitle("MSE by Sparsity Levels") +
geom_line(aes(x = SparsityLevels, y = Mean, color = Mod)) +
geom_ribbon(aes(x= SparsityLevels, ymin = Mean - sqrt(Var), ymax = Mean + sqrt(Var), fill = Mod), alpha = 0.3)
ggplot(NonZeroCoefsSummary) + ggtitle("Non-Zero Coefficients by Sparsity Levels") +
geom_line(aes(x = SparsityLevels, y = Mean, color = Mod)) +
geom_ribbon(aes(x= SparsityLevels, ymin = Mean - sqrt(Var), ymax = Mean + sqrt(Var), fill = Mod), alpha = 0.3)
#Change rho
RhoVals <- seq(-.95, .95, .05)
outputrho <- lapply(RhoVals, Simulation, times = 50, obs = 240, trainsize = 20, validsize = 20, errorSD = 3, betaVals = matrix(c(3,1.5,0,0,2,0,0,0), ncol = 1))
mysummary <- lapply(outputrho, meanAndvar)
NonZeroCoefsSummary.Lasso <- data.frame("RhoVals" = RhoVals, Mod = "Lasso", "Mean" = sapply(mysummary, "[[", 1,1), "Var" = sapply(mysummary, "[[", 1,2))
NonZeroCoefsSummary.ElNet <- data.frame("RhoVals" = RhoVals, Mod = "ElNet", "Mean" = sapply(mysummary, "[[", 3,1), "Var" = sapply(mysummary, "[[", 3,2))
NonZeroCoefsSummary <- rbind(NonZeroCoefsSummary.Lasso, NonZeroCoefsSummary.ElNet)
MSEsummary.Lasso <- data.frame("RhoVals" = RhoVals, Mod = "Lasso", "Mean" = sapply(mysummary, "[[", 2,1), "Var" = sapply(mysummary, "[[", 2,2))
MSEsummary.ElNet <- data.frame("RhoVals" = RhoVals, Mod = "ElNet", "Mean" = sapply(mysummary, "[[", 4,1), "Var" = sapply(mysummary, "[[", 4,2))
MSEsummary <- rbind(MSEsummary.Lasso, MSEsummary.ElNet)
ggplot(MSEsummary) + ggtitle("MSE by Correlation between Variables") +
geom_line(aes(x = RhoVals, y = Mean, color = Mod)) +
geom_ribbon(aes(x= RhoVals, ymin = Mean - sqrt(Var), ymax = Mean + sqrt(Var), fill = Mod), alpha = 0.3)
ggplot(NonZeroCoefsSummary) + ggtitle("Non-Zero Coefficients by Correlation between Variables") +
geom_line(aes(x = RhoVals, y = Mean, color = Mod)) +
geom_ribbon(aes(x= RhoVals, ymin = Mean - sqrt(Var), ymax = Mean + sqrt(Var), fill = Mod), alpha = 0.3)
#Change sigma
SigVals <- seq(-10, 10, .5)
outputsigma <- lapply(SigVals, Simulation, times = 50, obs = 240, trainsize = 20, validsize = 20, rho = 0.5, betaVals = matrix(c(3,1.5,0,0,2,0,0,0), ncol = 1))
mysummary <- lapply(outputsigma, meanAndvar)
NonZeroCoefsSummary.Lasso <- data.frame("SigVals" = SigVals, Mod = "Lasso", "Mean" = sapply(mysummary, "[[", 1,1), "Var" = sapply(mysummary, "[[", 1,2))
NonZeroCoefsSummary.ElNet <- data.frame("SigVals" = SigVals, Mod = "ElNet", "Mean" = sapply(mysummary, "[[", 3,1), "Var" = sapply(mysummary, "[[", 3,2))
NonZeroCoefsSummary <- rbind(NonZeroCoefsSummary.Lasso, NonZeroCoefsSummary.ElNet)
MSEsummary.Lasso <- data.frame("SigVals" = SigVals, Mod = "Lasso", "Mean" = sapply(mysummary, "[[", 2,1), "Var" = sapply(mysummary, "[[", 2,2))
MSEsummary.ElNet <- data.frame("SigVals" = SigVals, Mod = "ElNet", "Mean" = sapply(mysummary, "[[", 4,1), "Var" = sapply(mysummary, "[[", 4,2))
MSEsummary <- rbind(MSEsummary.Lasso, MSEsummary.ElNet)
ggplot(MSEsummary) + ggtitle("MSE by Standard Deviation of Error") +
geom_line(aes(x = SigVals, y = Mean, color = Mod)) +
geom_ribbon(aes(x= SigVals, ymin = Mean - sqrt(Var), ymax = Mean + sqrt(Var), fill = Mod), alpha = 0.3)
ggplot(NonZeroCoefsSummary) + ggtitle("Non-Zero Coefficients by Standard Deviation of Error") +
geom_line(aes(x = SigVals, y = Mean, color = Mod)) +
geom_ribbon(aes(x= SigVals, ymin = Mean - sqrt(Var), ymax = Mean + sqrt(Var), fill = Mod), alpha = 0.3)
```