[{"id":0,"href":"/docs/lassopack/help/lasso2_help/","title":"help lasso2","section":"Help files","content":" ---------------------------------------------------------------------------------------------------------------------------------- help lasso2 lassopack v1.4.2 ---------------------------------------------------------------------------------------------------------------------------------- Title lasso2 -- Program for lasso, square-root lasso, elastic net, ridge, adaptive lasso and post-estimation OLS Syntax Full syntax lasso2 depvar regressors [if exp] [in range] [, alpha(real) sqrt adaptive adaloadings(string) adatheta(real) ols lambda(numlist) lcount(integer) lminratio(real) lmax(real) lglmnet notpen(varlist) partial(varlist) psolver(string) norecover ploadings(string) unitloadings prestd stdcoef fe noftools noconstant tolopt(real) tolzero(real) maxiter(int) plotpath(method) plotvar(varlist) plotopt(string) plotlabel ic(string) lic(string) ebicgamma(real) noic long displayall postall postresults verbose vverbose wnorm] Note: the fe option will take advantage of the ftools package (if installed) for the fixed-effects transformation; the speed gains using this package can be large. See help ftools or click on ssc install ftools to install. Estimators Description ---------------------------------------------------------------------------------------------------------------------------- alpha(real) elastic net parameter, which controls the degree of L1-norm (lasso-type) to L2-norm (ridge-type) penalization. alpha=1 corresponds to the lasso (the default estimator), and alpha=0 to ridge regression. alpha must be in the interval [0,1]. sqrt square-root lasso estimator. adaptive adaptive lasso estimator. The penalty loading for predictor j is set to 1/abs(beta0(j))^theta where beta0(j) is the OLS estimate or univariate OLS estimate if p\u0026gt;n. Theta is the adaptive exponent, and can be controlled using the adatheta(real) option. adaloadings(string) alternative initial estimates, beta0, used for calculating adaptive loadings. For example, this could be the vector e(b) from an initial lasso2 estimation. The elements of the vector are raised to the power -theta (note the minus). See adaptive option. adatheta(real) exponent for calculating adaptive penalty loadings. See adaptive option. Default=1. ols post-estimation OLS. If lambda is a list, post-estimation OLS results are displayed and returned in e(betas). If lambda is a scalar, post-estimation OLS is always displayed, and this option controls whether standard or post-estimation OLS results are stored in e(b). ---------------------------------------------------------------------------------------------------------------------------- See overview of estimation methods. Lambda(s) Description ---------------------------------------------------------------------------------------------------------------------------- lambda(numlist) a scalar lambda value or list of descending lambda values. Each lambda value must be greater than 0. If not specified, the default list is used which is given by exp(rangen(log(lmax),log(lminratio*lmax),lcount)) (see mf_range). lcount(integer)† number of lambda values for which the solution is obtained. Default is 100. lminratio(real)† ratio of minimum to maximum lambda. lminratio must be between 0 and 1. Default is 1/1000. lmax(real)† maximum lambda value. 
Default is 2*max(X'y), and max(X'y) in the case of the square-root lasso (where X is the pre-standardized regressor matrix and y is the vector of the response variable). fdev minimum fractional change in deviance (R-sq) to stop looping through lambdas (path) devmax maximum fraction of explained deviance (R-sq) to stop looping through lambdas (path) nodevcrit override criteria to exit path; loop through all lambdas in list lic(string) after first lasso2 estimation using list of lambdas, estimate model corresponding to minimum information criterion. 'aic', 'bic', 'aicc', and 'ebic' (the default) are allowed. Note the lower case spelling. See Information criteria for the definition of each information criterion. ebicgamma(real) controls the xi parameter of the EBIC. xi needs to lie in the [0,1] interval. xi=0 is equivalent to the BIC. The default choice is xi=1-log(n)/(2*log(p)). postresults Used in combination with lic(). Stores estimation results of the model selected by information criterion in e(). lglmnet Use the parameterizations for lambda, alpha, standardization, etc. employed by glmnet by Friedman et al. (2010). ---------------------------------------------------------------------------------------------------------------------------- † Not applicable if lambda() is specified. Loadings \u0026amp; standardization Description ---------------------------------------------------------------------------------------------------------------------------- notpen(varlist) sets penalty loadings to zero for predictors in varlist. Unpenalized predictors are always included in the model. partial(varlist) variables in varlist are partialled out prior to estimation. psolver(string) override default solver used for partialling out (one of: qr, qrxx, lu, luxx, svd, svdxx, chol; default=qrxx) norecover suppresses recovery of partialled out variables after estimation. ploadings(matrix) a row-vector of penalty loadings; overrides the default standardization loadings (in the case of the lasso, =sqrt(avg(x^2))). The size of the vector should equal the number of predictors (excluding partialled out variables and excluding the constant). unitloadings penalty loadings set to a vector of ones; overrides the default standardization loadings (in the case of the lasso, =sqrt(avg(x^2)). nostd is a synonym for unitloadings. prestd dependent variable and predictors are standardized prior to estimation rather than standardized \"on the fly\" using penalty loadings. See here for more details. By default the coefficient estimates are un-standardized (i.e., returned in original units). stdcoef return coefficients in standard deviation units, i.e., don't un-standardize. stdall return all results (coefficients, information criteria, norms, etc.) in standardized units. ---------------------------------------------------------------------------------------------------------------------------- See discussion of standardization. FE \u0026amp; constant Description ---------------------------------------------------------------------------------------------------------------------------- fe within-transformation is applied prior to estimation. Requires data to be xtset. noftools do not use ftools package for fixed-effects transform (slower; rarely used) noconstant suppress constant from estimation. Default behaviour is to partial the constant out (i.e., to center the regressors). 
---------------------------------------------------------------------------------------------------------------------------- Optimization Description ---------------------------------------------------------------------------------------------------------------------------- tolopt(real) tolerance for lasso shooting algorithm (default=1e-10) tolzero(real) minimum below which coeffs are rounded down to zero (default=1e-4) maxiter(int) maximum number of iterations for the lasso shooting algorithm (default=10,000) ---------------------------------------------------------------------------------------------------------------------------- Plotting options* Description ---------------------------------------------------------------------------------------------------------------------------- plotpath(method) plots the coefficients path as a function of the L1-norm (norm), lambda (lambda) or the log of lambda (lnlambda) plotvar(varlist) list of variables to be included in the plot plotopt(string) additional plotting options passed on to line. For example, use plotopt(legend(off)) to turn off the legend. plotlabel displays variable labels in graph. ---------------------------------------------------------------------------------------------------------------------------- * Plotting is not available if lambda is a scalar value. Display options Description ---------------------------------------------------------------------------------------------------------------------------- displayall* display full coefficient vectors including unselected variables (default: display only selected, unpenalized and partialled-out) postall* post full coefficient vector including unselected variables in e(b) (default: e(b) has only selected, unpenalized and partialled-out) long† show long output; instead of showing only the points at which predictors enter or leave the model, all models are shown. verbose show additional output vverbose show even more output ic(string)† controls which information criterion is shown in the output. 'aic', 'bic', 'aicc', and 'ebic' (the default' are allowed). Note the lower case spelling. See Information criteria for the definition of each information criterion. noic† suppresses the calculation of information criteria. This will lead to speed gains if alpha\u0026lt;1, since calculation of effective degrees of freedom requires one inversion per lambda. wnorm† displays L1 norm of beta estimates weighted by penalty loadings, i.e., ||Psi*beta||(1) instead of ||beta||(1), which is the default. Note that this also affects plotting if plotpath(norm)} is specified. ---------------------------------------------------------------------------------------------------------------------------- * Only applicable if lambda is a scalar value. † Only applicable if lambda is a list (the default). Replay syntax lasso2 [, plotpath(method) plotvar(varlist) plotopt(string) plotlabel long postresults lic(string) ic(string) wnorm] Replay options Description ---------------------------------------------------------------------------------------------------------------------------- long show long output; instead of showing only the points at which predictors enter or leave the model, all models are shown. ic(string) controls which information criterion is shown in the output. 'aic', 'bic', 'aicc', and 'ebic' (the default) are allowed. Note the lower case spelling. See Information criteria for the definition of each information criterion. lic(string) estimate model corresponding to minimum information criterion. 
'aic', 'bic', 'aicc', and 'ebic' (the default) are allowed. Note the lower case spelling. See Information criteria for the definition of each information criterion. postresults store estimation results in e() if lic(string) is used plotpath(method) see Plotting options above plotvar(varlist) see Plotting options above plotopt(string) see Plotting options above plotlabel see Plotting options above ---------------------------------------------------------------------------------------------------------------------------- Only applicable if lambda was a list in the previous lasso2 estimation. Postestimation: predict [type] newvar [if] [in] [, xb residuals u e ue xbu ols lambda(real) lid(int) approx noisily postresults] Predict options Description ---------------------------------------------------------------------------------------------------------------------------- xb compute predicted values (the default) residuals compute residuals e generate overall error component e(it). Only after fe. ue generate combined residuals, i.e., u(i) + e(it). Only after fe. xbu prediction including fixed effect, i.e., a + xb + u(i). Only after fe. u fixed effect, i.e., u(i). Only after fe. ols use post-estimation OLS for prediction lambda(real)‡ lambda value for prediction. Ignored if lasso2 was called with scalar lambda value. lid(int)‡ index of lambda value for prediction. lic(string) selects which information criterion to use for prediction. approx‡ linear approximation is used instead of re-estimation. Faster, but only exact if coefficient path is piecewise linear. Only supported in combination with lambda(). noisily displays beta used for prediction. postresults‡ store estimation results in e() if re-estimation is used ---------------------------------------------------------------------------------------------------------------------------- ‡ Only applicable if lambda was a list in the previous lasso2 estimation. lasso2 may be used with time-series or panel data, in which case the data must be tsset or xtset first; see help tsset or xtset. All varlists may contain time-series operators or factor variables; see help varlist. Contents Description Coordinate descent algorithm Penalization level Standardization of variables Information criteria Estimators lasso2 vs. Friedman et al.'s glmnet and StataCorp's lasso Examples and demonstration --Data set --General demonstration --Information criteria --Plotting --Predicted values --Standardization --Penalty loadings and notpen() --Partialling vs penalization --Adaptive lasso --Replication of glmnet and StataCorp's lasso Saved results References Website Installation Acknowledgements Citation of lassopack Description lasso2 solves the following problem 1/N RSS + lambda/N*alpha*||Psi*beta||[1] + lambda/(2*N)*(1-alpha)*||Psi*beta||[2], where RSS = sum(y(i)-x(i)'beta)^2 denotes the residual sum of squares, beta is a p-dimensional parameter vector, lambda is the overall penalty level, ||.||[j] denotes the L(j) vector norm for j=1,2; alpha the elastic net parameter, which determines the relative contribution of L1 (lasso-type) to L2 (ridge-type) penalization. Psi is a p by p diagonal matrix of predictor-specific penalty loadings. Note that lasso2 treats Psi as a row vector. N number of observations Note: the above lambda and alpha differ from the definitions used in parts of the lasso and elastic net literature, e.g., the R package glmnet by Friedman et al. (2010). We have here adopted an objective function following Belloni et al. (2012). 
See below and below for more discussion and examples of how to use the lglmnet option to replicate glmnet output. In addition, if the option sqrt is specified, lasso2 estimates the square-root lasso (sqrt-lasso) estimator, which is defined as the solution to the following objective function: sqrt(1/N*RSS) + lambda/N*||Psi*beta||[1]. Coordinate descent algorithm lasso2 implements the elastic net and sqrt-lasso using coordinate descent algorithms. The algorithm (then referred to as \"shooting\") was first proposed by Fu (1998) for the lasso, and by Van der Kooij (2007) for the elastic net. Belloni et al. (2011) implement the coordinate descent for the sqrt-lasso, and have kindly provided Matlab code. Coordinate descent algorithms repeatedly cycle over predictors j=1,...,p and update single coefficient estimates until convergence. Suppose the predictors are centered and standardized to have unit variance. In that case, the update for coefficient j is obtained using univariate regression of the current partial residuals (i.e., excluding the contribution of predictor j) against predictor j. The algorithm requires an initial beta estimate for which the Ridge estimate is used. If the coefficient path is obtained for a list of lambda values, lasso2 starts from the largest lambda value and uses previous estimates as warm starts. See Friedman et al. (2007, 2010), and references therein, for further information. Penalization level: choice of lambda (and alpha) Penalized regression methods, such as the elastic net and the sqrt-lasso, rely on tuning parameters that control the degree and type of penalization. The estimation methods implemented in lasso2 use two tuning parameters: lambda, which controls the general degree of penalization, and alpha, which determines the relative contribution of L1-type to L2-type penalization. lasso2 obtains elastic net and sqrt-lasso solutions for a given lambda value or a list of lambda values, and for a given alpha value (default=1). lassopack offers three approaches for selecting the \"optimal\" lambda (and alpha) value: (1) The penalty level may be chosen by cross-validation in order to optimize out-of-sample prediction performance. K-fold cross-validation and rolling cross-validation (for panel and time-series data) are implemented in cvlasso. cvlasso also supports cross-validation across alpha. (2) Theoretically justified and feasible penalty levels and loadings are available for the lasso and sqrt-lasso via the separate command rlasso. The penalization is chosen to dominate the noise of the data-generating process (represented by the score vector), which allows derivation of theoretical results with regard to consistent prediction and parameter estimation. Since the error variance is in practice unknown, Belloni et al. (2012) introduce the rigorous (or feasible) lasso that relies on an iterative algorithm for estimating the optimal penalization and is valid in the presence of non-Gaussian and heteroskedastic errors. Belloni et al. (2016) extend the framework to the panel data setting. In the case of the sqrt-lasso under homoskedasticity, the optimal penalty level is independent of the unknown error variance, leading to a practical advantage and better performance in finite samples (see Belloni et al., 2011, 2014). See help rlasso for more details. (3) Lambda can also be selected using information criteria. 
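For orientation, the three approaches map onto the following command patterns (a sketch only; depvar and x1-x100 are placeholder names, and each command is documented in its own help file):
 . cvlasso depvar x1-x100
(cross-validation over lambda, and optionally over alpha; see help cvlasso)
 . rlasso depvar x1-x100
(theory-driven, "rigorous" penalization; see help rlasso)
 . lasso2 depvar x1-x100, lic(ebic)
(selection of lambda by information criterion, discussed next)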
lasso2 calculates four information criteria: Akaike Information Criterion (AIC; Akaike, 1974), Bayesian Information Criterion (BIC; Schwarz, 1978), Extended Bayesian information criterion (EBIC; Chen \u0026amp; Chen, 2008) and the corrected AIC (AICc; Sugiura, 1978, and Hurvich, 1989). By default, lasso2 displays EBIC in the output, but all four information criteria are stored in e(aic), e(bic), e(ebic) and e(aicc). See section Information criteria for more information. Standardization of variables Standard practice is for predictors to be \"standardized\", i.e., normalized to have mean zero and unit variance. By default lasso2 achieves this by incorporating the standardization into the penalty loadings. We refer to this method as standardization \"on the fly\", as standardization occurs during rather than before estimation. Alternatively, the option prestd causes the predictors to be standardized prior to the estimation. Standardizing \"on the fly\" via the penalty loadings and pre-standardizing the data prior to estimation are theoretically equivalent. The default standardizing \"on the fly\" is often faster. The prestd option can lead to improved numerical precision or more stable results in the case of difficult problems; the cost is the computation time required to pre-standardize the data. Estimators Ridge regression (Hoerl \u0026amp; Kennard, 1970) The ridge estimator can be written as betahat(ridge) = (X'X+lambda*I(p))^(-1)X'y. Thus, even if X'X is not full rank (e.g. because p\u0026gt;n), the problem becomes nonsingular by adding a constant to the diagonal of X'X. Another advantage of the ridge estimator over least squares stems from the variance-bias trade-off. Ridge regression may improve over ordinary least squares by inducing a mild bias while decreasing the variance. For this reason, ridge regression is a popular method in the context of multicollinearity. In contrast to estimators relying on L1-penalization, the ridge does not yield sparse solutions and keeps all predictors in the model. Lasso estimator (Tibshirani, 1996) The lasso minimizes the residual sum of squares subject to a constraint on the absolute size of coefficient estimates. Tibshirani (1996) motivates the lasso with two major advantages over least squares. First, due to the nature of the L1-penalty, the lasso tends to produce sparse solutions and thus facilitates model interpretation. Secondly, similar to ridge regression, lasso can outperform least squares in terms of prediction due to lower variance. Another advantage is that the lasso is computationally attractive due to its convex form. This is in contrast to model selection based on AIC or BIC (which employ L0 penalization) where each possible sub-model has to be estimated. Elastic net (Zou \u0026amp; Hastie, 2005) The elastic net applies a mix of L1 (lasso-type) and L2 (ridge-type) penalization. It combines some of the strengths of lasso and ridge regression. In the presence of groups of correlated regressors, the lasso selects typically only one variable from each group, whereas the ridge tends to produce similar coefficient estimates for groups of correlated variables. On the other hand, the ridge does not yield sparse solutions impeding model interpretation. The elastic net is able to produce sparse solutions (for some alpha greater than zero) and retains (or drops) correlated variables jointly. 
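To make the contrast concrete, the three penalization schemes can be compared directly on the prostate cancer data used in the examples below (lambda=10 is chosen purely for illustration):
 . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0) lambda(10)
 . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(1) lambda(10)
 . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0.5) lambda(10)
The ridge fit (alpha=0) retains all predictors, the lasso (alpha=1) yields a sparse solution, and the elastic net (alpha=0.5) mixes the two penalties.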
Adaptive lasso (Zou, 2006) The lasso is only variable selection consistent under the rather strong \"irrepresentable condition\", which imposes constraints on the degree of correlation between predictors in the true model and predictors outside of the model (see Zhao \u0026amp; Yu, 2006; Meinshausen \u0026amp; Bühlmann, 2006). Zou (2006) proposes the adaptive lasso which uses penalty loadings of 1/abs(beta0(j))^theta where beta0 is an initial estimator. The adaptive lasso is variable-selection consistent for fixed p under weaker assumptions than the standard lasso. If p\u0026lt;n, OLS can be used as the initial estimator. Huang et al. (2008) suggest to use univariate OLS if p\u0026gt;n. Other initial estimators are possible. Square-root lasso (Belloni et al., 2011, 2014) The sqrt-lasso is a modification of the lasso that minimizes (RSS)^(1/2) instead of RSS, while also imposing an L1-penalty. The main advantage of the sqrt-lasso over the standard lasso is that the theoretically grounded, data-driven optimal lambda is independent of the unknown error variance under homoskedasticity. See rlasso. Post-estimation OLS Penalized regression methods induce a bias that can be alleviated by post-estimation OLS, which applies OLS to the predictors selected by the first-stage variable selection method. For the case of the lasso, Belloni and Chernozhukov ( 2013) have shown that the post-lasso OLS performs at least as well as the lasso under mild additional assumptions. For further information on the lasso and related methods, see for example the textbooks by Hastie et al. (2009, 2015; both available for free) and Bühlmann \u0026amp; Van de Geer (2011). Information criteria The information criteria supported by lasso2 are the Akaike information criterion (AIC, Akaike, 1974), the Bayesian information criterion (BIC, Schwarz, 1978), the corrected AIC (Sugiura, 1978; Hurvich, 1989), and the Extended BIC (Chen \u0026amp; Chen, 2008). These are given by (omitting dependence on lambda and alpha): AIC = N*log(RSS/N) + 2*df BIC = N*log(RSS/N) + df*log(N) AICc = N*log(RSS/N) + 2*df*N/(N-df) EBIC = BIC + 2*xi*df*log(p) where RSS(lambda,alpha) is the residual sum of squares and df(lambda,alpha) is the effective degrees of freedom, which is a measure of model complexity. In the linear regression model, the degrees of freedom is simply the number of regressors. Zou et al. (2007) show that the number of non-zero coefficients is an unbiased and consistent estimator of df(lambda,alpha) for the lasso. More generally, the degrees of freedom of the elastic net can be calculated as the trace of the projection matrix. With an unbiased estimator for df available, the above information criteria can be employed to select tuning parameters. The BIC is known to be model selection consistent if the true model is among the candidate models, whereas the AIC tends to yield an overfitted model. On the other hand, the AIC is loss efficient in the sense that it selects the model that minimizes the squared average prediction error, while the BIC does not possess this property. Zhang et al. (2010) show that these principles also apply when AIC and BIC are employed to select the tuning parameter for penalized regression. Both AIC and BIC tend to overselect regressors in the small-N-large-p case. The AICc corrects the small sample bias of the AIC which can be especially severe in the high-dimensional context. 
Similarily, the EBIC addresses the shortcomings of the BIC when p is large by imposing a larger penalty on the number of coefficients. Chen \u0026amp; Chen (2008) show that the EBIC performs better in terms of false discovery rate at the cost of a negligible reduction in the positive selection rate. The EBIC depends on an additional parameter, xi (denoted as gamma in the original article), which can be controlled using ebicgamma(real). gamma=0 is equivalent to the BIC. We follow Chen \u0026amp; Chen (2008, p. 768) and use xi=1-log(n)/(2*log(p)) as the default choice. An upper and lower threshold is applied to ensure that xi lies in the [0,1] interval. The EBIC is displayed in the output of lasso2 by default (if lambda is a list), but all four information criteria are returned in e(). The lambda values that minimize the information criteria for a given alpha are returned in e(laic), e(lbic), e(laicc) and e(lebic), respectively. To change the default display, use the ic(string) option. noic suppresses the calculation of information criteria, which leads to a speed gain if alpha\u0026lt;1. lasso2 vs. Hastie et al.'s (2010) glmnet and StataCorp's lasso The parameterization used by lasso2 differs from StataCorp's lasso in only one respect: lambda(StataCorp) = (1/2N)*lambda(lasso2). The elastic net parameter alpha is the same in both parameterizations. See below for examples. The parameterization used by Hastie et al.'s (2010) glmnet uses the same convention as StataCorp for lambda: lambda(glmnet) = (1/2N)*lambda(lasso2). However, the glmnet treatment of the elastic net parameter alpha differs from both lasso2 and StataCorp's lasso. The glmnet objective function is defined such that the dependent variable is assumed already to have been standardized. Because the L2 norm is nonlinear, this affects the interpretation of alpha. Specifically, the default lasso2 and StataCorp's lasso parameterization means that alpha is not invariant changes in the scale of the dependent variable. The glmnet parameterization of alpha, however, is scale-invariant - a useful feature. lasso2 provides an lglmnet option that enables the user to employ the glmnet parameterization for alpha and lambda. See below for examples of its usage and how to replicate glmnet output. We recommend the use of the lglmnet option in particular with cross-validation over alpha (see cvlasso). Example using prostate cancer data (Stamey et al., 1989) Data set The data set is available through Hastie et al. (2009) on the authors' website. The following variables are included in the data set of 97 men: Predictors lcavol log(cancer volume) lweight log(prostate weight) age patient age lbph log(benign prostatic hyperplasia amount) svi seminal vesicle invasion lcp log(capsular penetration) gleason Gleason score pgg45 percentage Gleason scores 4 or 5 Outcome lpsa log(prostate specific antigen) Load prostate cancer data. . insheet using https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data, clear tab General demonstration Estimate coefficient lasso path over (default) list of lambda values. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45 The replay syntax can be used to re-display estimation results. . lasso2 User-specified lambda list. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(100 50 10) The list of returned e() objects depends on whether lambda() is a list (the default) or a scalar value. For example, if lambda is a scalar, one vector of coefficient estimates is returned. 
If lambda is a list, the whole coefficient path for a range of lambda values is obtained. The last row of e(betas) is equal to the row vector e(b). . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(100 50 10) . ereturn list . mat list e(betas) . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10) . ereturn list . mat list e(b) Sqrt-lasso. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, sqrt Ridge regression. All predictors are included in the model. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0) Elastic net with alpha=0.1. Even though alpha is close to zero (Ridge regression), the elastic net can produce sparse solutions. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0.1) The option ols triggers the use of post-estimation OLS. OLS alleviates the shrinkage bias induced by L1 and L2 norm penalization. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, ols . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, sqrt ols . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0.1) ols Information criteria lasso2 calculates four information criteria: AIC, BIC, EBIC and AICc. The EBIC is shown by default in the output along with the R-squared. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45 To see another information criterion in the outout, use the ic(string) option where string can be replaced by aic, bic, ebic or aicc (note the lower case spelling). For example, to display AIC: . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, ic(aic) In fact, there is no need to re-run the full model. We can make use of the replay syntax: . lasso2, ic(aic) The long option triggers extended output; instead of showing only the points at which predictors enter or leave the model, all models are shown. An asterisk marks the model (i.e., the value of lambda) that minimizes the information criterion (here, AIC). . lasso2, ic(aic) long To estimate the model corresponding to the minimum information criterion, click on the link at the bottom of the output or type one of the following: . lasso2, lic(aic) . lasso2, lic(ebic) . lasso2, lic(bic) . lasso2, lic(aicc) To store the estimation results of the selected model, add the postresults option. . lasso2, lic(ebic) . ereturn list . lasso2, lic(ebic) postres . ereturn list The same can also be achieved in one line without using the replay syntax. lasso2 first obtains the full coefficient path for a list lambda values, and then runs the model selected by AIC. Again, postresults can be used to store results of the selected model. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lic(aic) postres Plotting Plot coefficients against lambda: As lambda increases, the coefficient estimates are shrunk towards zero. Lambda=0 corresponds to OLS and if lambda is sufficiently large the model is empty. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, plotpath(lambda) Plot coefficients against L1 norm. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, plotpath(norm) The replay syntax can also be used for plotting. . lasso2, plotpath(norm) Only selected variables are plotted. . lasso2, plotpath(norm) plotvar(lcavol svi) The variable names can be displayed directly next to each series using plotlabel. plotopt(legend(off)) suppresses the legend. . lasso2, plotpath(lambda) plotlabel plotopt(legend(off)) . 
lasso2, plotpath(norm) plotlabel plotopt(legend(off)) Predicted values xbhat1 is generated by re-estimating the model for lambda=10. The noisily option triggers the display of the estimation results. xbhat2 is generated by linear approximation using the two beta estimates closest to lambda=10. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45 . cap drop xbhat1 . predict double xbhat1, xb l(10) noisily . cap drop xbhat2 . predict double xbhat2, xb l(10) approx The model is estimated explicitly using lambda=10. If lasso2 is called with a scalar lambda value, the subsequent predict command requires no lambda() option. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(10) . cap drop xbhat3 . predict double xbhat3, xb All three methods yield the same results. However note that the linear approximation is only exact for the lasso which is piecewise linear. . sum xbhat1 xbhat2 xbhat3 It is also possible to obtain predicted values by referencing a specific lambda ID using the lid() option. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45 . cap drop xbhat4 . predict double xbhat4, xb lid(21) . cap drop xbhat5 . predict double xbhat5, xb l(25.45473900468241) . sum xbhat4 xbhat5 Standardization By default lasso2 standardizes the predictors to have unit variance. Standardization is done by default \"on the fly\" via penalty loadings. The coefficient estimates are returned in original units. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) Instead of standardizing \"on the fly\" by setting penalization loadings equal to standardization loadings, we can standardize the regressors prior to estimation with the prestd option. Both methods are equivalent in theory. Standardizing \"on the fly\" tends to be faster, but pre-standardization may lead to more stable results in the case of difficult problems. See here for more information. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) . mat list e(Psi) . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) prestd . mat list e(Psi) The used penalty loadings are stored in e(Psi). In the first case above, the standardization loadings are returned. In the second case the penalty loadings are equal to one for all regressors. To get the coefficients in standard deviation units, stdcoef can be specified along with the prestd option. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) prestd stdcoef We can override any form of standardization with the unitloadings options, which sets the penalty loadings to a vector of 1s. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) unitloadings The same logic applies to the sqrt-lasso (and elastic net). . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt prestd Penalty loadings and notpen() By default the penalty loading vector is a vector of standard deviations. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) . mat list e(Psi) We can set the penalty loading for specific predictors to zero, implying no penalization. Unpenalized predictor are always included in the model. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) notpen(lcavol) . mat list e(Psi) We can specify custom penalty loadings. The option ploadings expects a row vector of size p where p is the number of regressors (excluding the constant, which is partialled out). 
Because we pre-standardize the data (and we are using the lasso) the results are equivalent to the results above (standardizing on the fly and specifying lcavol as unpenalized). . mat myloadings = (0,1,1,1,1,1,1,1) . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) ploadings(myloadings) prestd . mat list e(Psi) Partialling vs penalization If lambda and the penalty loadings are kept constant, partialling out and not penalizing of variables yields the same results for the included/penalized regressors. Yamada (2017) shows that the equivalence of partialling out and not penalizing holds for lasso and ridge regression. The examples below suggest that the same result also holds for the elastic net in general and the sqrt-lasso. Note that the equivalence only holds if the regressor matrix and other penalty loadings are the same. Below we use the unitloadings option to achieve this; alternatively we could use the ploadings(.) option. Lasso. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) notpen(lcavol) unitload . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) partial(lcavol) unitload Sqrt-lasso. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt notpen(lcavol) unitload . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) sqrt partial(lcavol) unitload Ridge regression. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0) notpen(lcavol) unitload . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0) partial(lcavol) unitload Elastic net. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0.5) notpen(lcavol) unitload . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0.5) partial(lcavol) unitload Partialling-out is implemented in Mata using one of Mata's solvers. In cases where the variables to be partialled out are collinear or nearly so, different solvers may generate different results. Users may wish to check the stability of their results in such cases. The psolver(.) option can be used to specify the Mata solver used. The default behavior for solving AX=B for X is to use the QR decomposition applied to (A'A) and (A'B), i.e., qrsolve((A'A),(A'B)), abbreviated qrxx. Available options are qr, qrxx, lu, luxx, svd, svdxx, where, e.g., svd indicates using svsolve(A,B) and svdxx indicates using svsolve((A'A),(A'B)). lasso2 will warn if collinear variables are dropped when partialling out. Adaptive lasso The adaptive lasso relies on an initial estimator to calculate the penalty loadings. The penalty loadings are given by 1/abs(beta0(j))^theta, where beta0(j) denotes the initial estimate for predictor j. By default, lasso2 uses OLS as the initial estimator as originally suggested by Zou (2006). If the number of parameters exceeds the numbers of observations, univariate OLS is used; see Huang et al. (2008). . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive . mat list e(Psi) See the OLS estimates for comparison. . reg lpsa lcavol lweight age lbph svi lcp gleason pgg45 Theta (the exponent for calculating the adaptive loadings) can be changed using the adatheta() option. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive adat(2) . mat list e(Psi) Other initial estimators such as ridge regression are possible. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, l(10) alpha(0) . mat bhat_ridge = e(b) . 
lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive adaloadings(bhat_ridge) . mat list e(Psi) Replication of glmnet and StataCorp's lasso Use Stata's auto dataset with missing data dropped. The variable price1000 is used to illustrate scaling effects. . sysuse auto, clear . drop if rep78==. . gen double price1000 = price/1000 To load the data into R for comparison with glmnet, use the following commands. The packages haven and tidyr need to be installed. auto \u0026lt;- haven::read_dta(\"http://www.stata-press.com/data/r9/auto.dta\") auto \u0026lt;- tidyr::drop_na() n \u0026lt;- nrow(auto) price \u0026lt;- auto$price X \u0026lt;- auto[, c(\"mpg\", \"rep78\", \"headroom\", \"trunk\", \"weight\", \"length\", \"turn\", \"displacement\", \"gear_ratio\", \"foreign\")] X$foreign \u0026lt;- as.integer(X$foreign) X \u0026lt;- as.matrix(X) Replication of StataCorp's lasso and elasticnet requires only the rescaling of lambda by 2N. N=69 so the lasso2 lambda becomes 138000/(2*69) = 1000 . lasso2 price mpg-foreign, lambda(138000) . lasso linear price mpg-foreign, grid(1, min(1000)) . lassoselect lambda = 1000 . lassocoef, display(coef, penalized) . lasso2 price mpg-foreign, alpha(0.6) lambda(138000) . elasticnet linear price mpg-foreign, alphas(0.6) grid(1, min(1000)) . lassoselect alpha = 0.6 lambda = 1000 . lassocoef, display(coef, penalized) glmnet uses the same definition of the lasso L0 penalty as StataCorp's lasso, so lasso2's default parameterization again requires only rescaling by 2N. When the lglmnet option is used with the lglmnet option, the L0 penalty should be provided using the glmnet definition. To estimate in R, load glmnet with library(\"glmnet\") and use the following command: r\u0026lt;-glmnet(X,price,alpha=1,lambda=1000,thresh=1e-15) . lasso2 price mpg-foreign, lambda(138000) . lasso2 price mpg-foreign, lambda(1000) lglmnet The R code below uses glmnet to estimate an elastic net model. lasso2 with the lglmnet option will replicate it. r\u0026lt;-glmnet(X,price,alpha=0.6,lambda=1000,thresh=1e-15) . lasso2 price mpg-foreign, alpha(0.6) lambda(1000) lglmnet lasso2's default parameterization of the elastic net (like StataCorp's elasticnet) is not invariant to scaling: . lasso2 price mpg-foreign, alpha(0.6) lambda(138000) . lasso2 price1000 mpg-foreign, alpha(0.6) lambda(138) When lasso2 uses the glmnet parameterization of the elastic net via the glmnet options, results are invariant to scaling: the only difference is that the coefficients change by the same factor of proportionality as the dependent variable. . lasso2 price mpg-foreign, alpha(0.6) lambda(1000) lglmnet . lasso2 price1000 mpg-foreign, alpha(0.6) lambda(1) lglmnet The reason that the default lasso2/StataCorp parameterization is not invariant to scaling is because the penalty on L2 norm is influenced by scaling, and this in turn affects the relative weights on the L1 and L2 penalties. The example below shows how to reparameterize so that the default lasso2 parameterization for the elastic net replicates the glmnet parameterization. The example using the scaling above, where the dependent variable is price1000 and the glmnet lambda=1. The large-sample standard deviation of price1000 = 2.8912586. . qui sum price1000 . di r(sd) * 1/sqrt( r(N)/(r(N)-1)) The lasso2 alpha = alpha(lglmnet)*SD(y) / (1-alpha(glmnet) + alpha(glmnet)*SD(y)). In this example, alpha = (0.6*2.8912586)/( 1-0.6 + 0.6*2.89125856) = 0.81262488. . 
di (0.6*2.8912586)/( 1-0.6 + 0.6*2.8912586) The lasso2 lambda = 2N*lambda(lglmnet) * (alpha(lglmnet) + (1-alpha(lglmnet))/SD(y)). In this example, lambda = 2*69*1 * (0.6 + (1-0.6)/2.8912586) = 101.89203. . di 2*69*( 0.6 + (1-0.6)/2.8912586) lasso2 using the glmnet and then replicated using the lasso2/StataCorp parameterization: . lasso2 price1000 mpg-foreign, alpha(0.6) lambda(1) lglmnet . lasso2 price1000 mpg-foreign, alpha(.81262488) lambda(101.89203) Saved results The set of returned e-class objects depends on whether lambda is a scalar or a list (the default). scalars e(N) sample size e(cons) =1 if constant is present, 0 otherwise e(fe) =1 if fixed effects model is used, 0 otherwise e(alpha) elastic net parameter e(sqrt) =1 if the sqrt-lasso is used, 0 otherwise e(ols) =1 if post-estimation OLS results are returned, 0 otherwise e(adaptive) =1 if adaptive loadings are used, 0 otherwise e(p) number of penalized regressors in model e(notpen_ct) number of unpenalized variables e(partial_ct) number of partialled out regressors (incl constant) e(prestd) =1 if pre-standardized e(lcount) number of lambda values scalars (only if lambda is a list) e(lmax) largest lambda value e(lmin) smallest lambda value scalars (only if lambda is a scalar) e(lambda) penalty level e(r2) R-sq for lasso estimation e(rmse) root mean squared error e(rmseOLS) root mean squared error of post-estimation OLS e(objfn) minimized objective function e(k) number of selected and unpenalized/partialled-out regressors including constant (if present) e(s) number of selected regressors e(s0) number of selected and unpenalized regressors including constant (if present) e(df) L0 norm (\"effective degrees of freedom\") e(niter) number of iterations e(maxiter) maximum number of iterations e(tss) total sum of squares e(aicmin) minimum AIC e(bicmin) minimum BIC e(aiccmin) minimum AICc e(ebicmin) minimum EBIC e(laic) lambda corresponding to minimum AIC e(lbic) lambda corresponding to minimum BIC e(laicc) lambda corresponding to minimum AICc e(lebic) lambda corresponding to minimum EBIC macros e(cmd) command name e(depvar) name of dependent variable e(varX) all predictors e(varXmodel) penalized predictors e(partial) partialled out predictors e(notpen) unpenalized predictors e(method) estimation method macros (only if lambda is a scalar) e(selected) selected predictors e(selected0) selected predictors excluding constant matrices e(Psi) row vector used penalty loadings e(stdvec) row vector of standardization loadings matrices (only if lambda is a list) e(lambdamat) column vector of lambdas used for estimations e(lambdamat0) full initial vector of lambdas (includes lambdas not used for estimation) e(l1norm) column vector of L1 norms for each lambda value (excludes the intercept) e(wl1norm) column vector of weighted L1 norms for each lambda value (excludes the intercept), see wnorm e(betas) matrix of estimates, where each row corresponds to one lambda value. The intercept is stored in the last column. e(IC) matrix of information criteria (AIC, AICC, BIC, EEBIC) for each lambda value (NB: All the matrices above but in standardized units are also saved with the same name preceded by \"s\".) 
e(dof) column vector of L0 norm for each lambda value (excludes the intercept) e(ess) column vector of explained sum of squares for each lambda value e(rss) column vector of residual sum of squares for each lambda value e(rsq) column vector of R-squared for each lambda value matrices (only if lambda is a scalar) e(beta) coefficient vector e(betaOLS) coefficient vector of post-estimation OLS e(betaAll) full coefficient vector including omitted, factor base variables, etc. e(betaAllOLS) full post-estimation OLS coefficient vector including omitted, factor base variables, etc. (NB: All the matrices above but in standardized units are also saved with the same name preceded by \"s\".) e(b) posted coefficient vector (see postall and displayall). Used for prediction. functions e(sample) estimation sample References Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. https://doi.org/10.1109/TAC.1974.1100705 Belloni, A., Chernozhukov, V., \u0026amp; Wang, L. (2011). Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika 98(4), 791–806. https://doi.org/10.1093/biomet/asr043 Belloni, A., Chen, D., Chernozhukov, V., \u0026amp; Hansen, C. (2012). Sparse Models and Methods for Optimal Instruments With an Application to Eminent Domain. Econometrica 80(6), 2369–2429. https://doi.org/10.3982/ECTA9626 Belloni, A., \u0026amp; Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2), 521–547. https://doi.org/10.3150/11-BEJ410 Belloni, A., Chernozhukov, V., \u0026amp; Wang, L. (2014). Pivotal estimation via square-root Lasso in nonparametric regression. The Annals of Statistics 42(2), 757–788. https://doi.org/10.1214/14-AOS1204 Belloni, A., Chernozhukov, V., Hansen, C., \u0026amp; Kozbur, D. (2016). Inference in High Dimensional Panel Models with an Application to Gun Control. Journal of Business \u0026amp; Economic Statistics 34(4), 590–605. Methodology. https://doi.org/10.1080/07350015.2015.1102733 Bühlmann, P., \u0026amp; Meinshausen, N. (2006). High-dimensional graphs and variable selection with the Lasso. {it:The Annals of Statistics], 34(3), 1436–1462. http://doi.org/10.1214/009053606000000281 Bühlmann, P., \u0026amp; Van de Geer, S. (2011). Statistics for High-Dimensional Data. Berlin, Heidelberg: Springer-Verlag. Chen, J., \u0026amp; Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), 759–771. https://doi.org/10.1093/biomet/asn034 Correia, S. 2016. FTOOLS: Stata module to provide alternatives to common Stata commands optimized for large datasets. https://ideas.repec.org/c/boc/bocode/s458213.html Fu, W. J. (1998). Penalized Regressions: The Bridge Versus the Lasso. Journal of Computational and Graphical Statistics 7(3), 397–416. https://doi.org/10.2307/1390712 Friedman, J., Hastie, T., Höfling, H., \u0026amp; Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics 1(2), 302–332. https://doi.org/10.1214/07-AOAS131 Friedman, J., Hastie, T., \u0026amp; Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1), 1–22. https://doi.org/10.18637/jss.v033.i01 Hastie, T., Tibshirani, R., \u0026amp; Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). New York: Springer-Verlag. 
https://web.stanford.edu/~hastie/ElemStatLearn/ Hastie, T., Tibshirani, R., \u0026amp; Wainwright, M. J. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton: CRC Press, Taylor \u0026amp; Francis. https://www.stanford.edu/~hastie/StatLearnSparsity/ Hoerl, A. E., \u0026amp; Kennard, R. W. (1970). Ridge Regression: Applications to Nonorthogonal Problems. Technometrics 12(1), 69–82. https://doi.org/10.1080/00401706.1970.10488635 Huang, J., Ma, S., \u0026amp; Zhang, C.-H. (2008). Adaptive Lasso for Sparse High-Dimensional Regression Models Supplement. Statistica Sinica 18, 1603–1618. https://doi.org/10.2307/24308572 Hurvich, C. M., \u0026amp; Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76(2), 297–307. http://doi.org/10.1093/biomet/76.2.297 Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/aos/1176344136 Stamey, T. A., Kabalin, J. N., Mcneal, J. E., Johnstone, I. M., Freiha, F., Redwine, E. A., \u0026amp; Yang, N. (1989). Prostate Specific Antigen in the Diagnosis and Treatment of Adenocarcinoma of the Prostate. II. Radical Prostatectomy Treated Patients. The Journal of Urology 141(5), 1076–1083. https://doi.org/10.1016/S0022-5347(17)41175-X Sugiura, N. (1978). Further analysts of the data by akaike’ s information criterion and the finite corrections. Communications in Statistics - Theory and Methods, 7(1), 13–26. http://doi.org/10.1080/03610927808827599 Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58(1), 267–288. https://doi.org/10.2307/2346178 Van der Kooij A (2007). Prediction Accuracy and Stability of Regrsssion with Optimal Scaling Transformations. Ph.D. thesis, Department of Data Theory, University of Leiden. http://hdl.handle.net/1887/12096 Yamada, H. (2017). The Frisch–Waugh–Lovell theorem for the lasso and the ridge regression. Communications in Statistics - Theory and Methods 46(21), 10897–10902. https://doi.org/10.1080/03610926.2016.1252403 Zhang, Y., Li, R., \u0026amp; Tsai, C.-L. (2010). Regularization Parameter Selections via Generalized Information Criterion. Journal of the American Statistical Association, 105(489), 312–323. http://doi.org/10.1198/jasa.2009.tm08013 Zhao, P., \u0026amp; Yu, B. (2006). On Model Selection Consistency of Lasso. Journal of Machine Learning Research, 7, 2541–2563. http://dl.acm.org/citation.cfm?id=1248547.1248637 Zou, H., \u0026amp; Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B: Statistical Methodology 67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735 Zou, H., Hastie, T., \u0026amp; Tibshirani, R. (2007). On the \"degrees of freedom\" of the lasso. Ann. Statist., 35(5), 2173–2192. https://doi.org/10.1214/009053607000000127 Website Please check our website https://statalasso.github.io/ for more information. Installation lasso2 is part of the lassopack package. To get the latest stable version of lassopack from our website, check the installation instructions at https://statalasso.github.io/installation/. We update the stable website version more frequently than the SSC version. 
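The SSC version can be installed or updated from within Stata using the standard ssc and adoupdate commands:
 . ssc install lassopack
 . adoupdate lassopack, update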
Earlier versions of lassopack are also available from the website. To verify that lassopack is correctly installed, click on or type whichpkg lassopack (which requires whichpkg to be installed; ssc install whichpkg). Acknowledgements Thanks to Alexandre Belloni, who provided Matlab code for the square-root lasso estimator, Sergio Correia for advice on the use of the FTOOLS package, and Jan Ditzen. Citation of lasso2 lasso2 is not an official Stata command. It is a free contribution to the research community, like a paper. Please cite it as such: Ahrens, A., Hansen, C.B., Schaffer, M.E. 2018 (updated 2020). LASSOPACK: Stata module for lasso, square-root lasso, elastic net, ridge, adaptive lasso estimation and cross-validation http://ideas.repec.org/c/boc/bocode/s458458.html Ahrens, A., Hansen, C.B. and M.E. Schaffer. 2020. lassopack: model selection and prediction with regularized regression in Stata. The Stata Journal, 20(1):176-235. https://journals.sagepub.com/doi/abs/10.1177/1536867X20909697. Working paper version: https://arxiv.org/abs/1901.05397. Authors Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland achim.ahrens@gess.ethz.ch Christian B. Hansen, University of Chicago, USA Christian.Hansen@chicagobooth.edu Mark E. Schaffer, Heriot-Watt University, UK m.e.schaffer@hw.ac.uk Also see Help: cvlasso, rlasso, lassologit, ivlasso, pdslasso (if installed). "},{"id":1,"href":"/docs/lassopack/","title":"LASSOPACK","section":"Docs","content":" When would you want to use lassopack? # lassopack is a suite of programs for regularized regression methods suitable for the high-dimensional setting where the number of predictors, \\(p\\) , may be large and possibly greater than the number of observations, \\(N\\) .\nHigh-dimensional models # The regularized regression methods implemented in lassopack can deal with situations where the number of regressors is large or may even exceed the number of observations under the assumption of sparsity.\nHigh-dimensionality can arise when (see Belloni et al., 2014):\nThere are many variables available for each unit of observation. For example, in cross-country regressions the number of observations is naturally limited by the number of countries, whereas the number of potentially relevant explanatory variables is often large. There are only few observed variables, but the functional form through which these regressors enter the model is unknown. We can then use a large set of transformations (e.g. dummy variables, interaction terms and polynomials) to approximate the true functional form. Model selection # Identifying the true model is a fundamental problem in applied econometrics. A standard approach is to use hypothesis testing to identify the correct model (e.g. general-to-specific approach). However, this is problematic if the number of regressors is large due to many false positives. Furthermore, sequential hypothesis testing induces a pre-test bias.\nLasso, elastic net and square-root lasso set some coefficient estimates to exactly zero, and thus allow for simultaneous estimation and model selection. The adaptive lasso is known to exhibit good properties as a model selector as shown by Zou (2006).\nPrediction # If there are many predictors, OLS is likely to suffer from overfitting: good in-sample fit (large \\(R^2\\) ), but poor out-of-sample prediction performance. 
Regularized regression methods tend to outperform OLS in terms of out-of-sample prediction.\nRegularization techniques exploit the variance-bias-tradeoff: they reduce the complexity of the model (through shrinkage or by dropping variables). In doing so, they introduce a bias, but also reduce the variance of the prediction, which can result in improved prediction performance.\nForecasting with time-series or panel data # lassopack can also applied to time-series or panel data. For example, Medeiros \u0026amp; Mendes (2016) prove model selection consistency of the adaptive lasso when applied to time-series data with non-Gaussian, heteroskedastic errors.\n"},{"id":2,"href":"/docs/ddml/models/","title":"Model overview","section":"DDML","content":" Supported models # Throughout we use \\(Y\\) to denote the outcome variable, \\(X\\) to denote confounders, \\(Z\\) to denote instrumental variable(s), and \\(D\\) to denote the treatment variable(s) of interest.\nFor a full discussion, please check our working paper.\nPartial linear model [partial] # \\[ Y = a.D \u0026#43; g(X) \u0026#43; U \\\\ D = m(X) \u0026#43; V \\quad\\quad~~\\] where the aim is to estimate \\(a\\) while flexibly controlling for \\(X\\) . ddml allows for multiple treatment variables, which may be binary or continuous.\nInteractive model [interactive] # \\[ Y = g(X,D) \u0026#43; U\\\\ D = m(X) \u0026#43; V\\quad\\] which relaxes the assumption that \\(X\\) and \\(D\\) are separable. \\(D\\) is a binary treatment variable. We are interested in the Average Treatment Effect or Average Treatment Effect of the Treated.\nPartial linear IV model [iv] # \\[ Y = a.D \u0026#43; g(X) \u0026#43; U\\\\ Z = m(X) \u0026#43; V\\quad\\quad~~\\] The parameter of interest is \\(a\\) . We leverage instrumental variables \\(Z\\) to identify \\(a\\) , while flexibly controlling for \\(X\\) .\nFlexible IV model [fiv] # \\[ Y = a.D \u0026#43; g(X) \u0026#43; U ~~~\\\\ D= m(Z) \u0026#43; g(X) \u0026#43; V \\] As in the Partial Linear Model, we are interested in \\(a\\) . The Flexible Partially Linear IV Model allows for approximation of optimal instruments, but relies on a stronger independence assumption than the Partially Linear IV Model.\nInteractive IV model [interactiveiv] # \\[ Y = g(Z,X) \u0026#43; U\\\\ D = h(Z,X) \u0026#43; V\\\\ Z = m(X) \u0026#43; E~~~\\] where the aim is to estimate the local average treatment effect, while flexibly controlling for \\(X\\) . Both \\(Z\\) and \\(D\\) are binary.\n"},{"id":3,"href":"/docs/pdslasso/pdslasso_models/","title":"Models","section":"PDSLASSO","content":" Many instruments # Belloni et al. (2012, Econometrica) consider the model\n\\[y_i = \\alpha d_i \u0026#43; \\varepsilon_i \\\\ d_i = z_i\u0026#39;\\delta \u0026#43; u_i\\] where \\(y_i\\) is the dependent variable, \\(d_i\\) is an endogenous regressors and \\(z_i\\) is a \\(p_z\\) -dimensional vector of instruments. \\(p_z\\) is allowed to be large and may even exceed the sample size. We refer to \\(z_i\\) as high-dimensional. The interest lies in estimating the causal effect of endogenous variable \\(d_i\\) on the outcome variable \\(y_i\\) .\nThe choice and specification of instruments is crucial for the estimation of \\(\\alpha\\) . However, often it is a priori not clear how to select or specify instruments. 
The situation of many instruments can arise because there are simply many instruments available and/or because we need to consider a large number of transformations of elementary variables to approximate the complex relationship between the endogenous regressor \\(d_i\\) and the instruments \\(z_i\\) .\nBelloni et al. suggest applying the lasso with theory-driven penalization to the equation \\(d_i = z_i\u0026#39;\\delta \u0026#43; u_i\\) . Under the assumption of (approximate) sparsity, the rigorous lasso (or square-root lasso) can be applied to select appropriate instruments and to predict \\(d_i\\) . \\(\\hat{d}_i=z_i\u0026#39;\\hat\\delta\\) is then used as an estimate of the optimal instrument, where \\(\\hat\\delta\\) is either the lasso, square-root lasso, post-lasso or post square-root lasso estimator. Instrument selection using lasso and square-root lasso is implemented in ivlasso.\nMany controls # Next, we consider the case where \\(d_i\\) is exogenous, but there are many control variables.\n\\(y_i = \\alpha d_i \u0026#43; x_i\u0026#39;\\beta \u0026#43; \\varepsilon_i\\) In this setting, we allow the \\(p_x\\) -dimensional vector of controls, \\(x_i\\) , to be high-dimensional. The problem the researcher faces is that the \u0026ldquo;right\u0026rdquo; set of controls is not known. In traditional practice, this presents her with a difficult choice: use too few controls, or the wrong ones, and omitted variable bias will be present; use too many, and the model will suffer from overfitting.\nThe post-double-selection (PDS) methodology introduced in Belloni, Chernozhukov and Hansen (2014) uses the lasso estimator to select the controls. Specifically, the lasso is used twice:\nestimate a lasso regression with \\(y_i\\) as the dependent variable and the control variables \\(x_i\\) as regressors;\nestimate a lasso regression with \\(d_i\\) as the dependent variable and again the control variables \\(x_i\\) as regressors. The lasso estimator achieves a sparse solution, i.e., most coefficients are set to zero. The final choice of control variables to include in the OLS regression of \\(y_i\\) on \\(d_i\\) is the union of the controls selected in steps 1 and 2, hence the name post-double selection for the methodology.\nThe post-regularization or CHS methodology is closely related. Instead of using the lasso-selected controls in the second-step OLS estimation, the selected variables are used to construct orthogonalized versions of the dependent variable and the exogenous causal variables of interest. The orthogonalized versions are based either on the lasso or post-lasso estimated coefficients; the post-lasso is OLS applied to lasso-selected variables. See Chernozhukov, Hansen \u0026amp; Spindler (2015) for details.\nThe post-double-selection and post-regularization approaches for many controls are implemented in pdslasso.\nMany controls and many instruments # Chernozhukov, Hansen \u0026amp; Spindler (2015) also consider the case where we have both many instruments and many controls:\n\\[y_i = \\alpha d_i \u0026#43; x_i\u0026#39;\\beta \u0026#43;\\varepsilon_i\\\\ d_i = x_i\u0026#39;\\gamma \u0026#43; z_i\u0026#39;\\delta \u0026#43; u_i\\] where \\(p_x\\gg N\\) and/or \\(p_z\\gg N\\) are allowed. 
The above model can be estimated using ivlasso, which allows for low and/or high-dimensional sets of instruments.\nTo summarise, ivlasso and pdslasso implement methods for:\nendogenous and/or exogenous regressors, low and high-dimensional instruments, low and high-dimensional control variables. "},{"id":4,"href":"/docs/lassopack/package_overview/","title":"Package overview","section":"LASSOPACK","content":" The package consists of the following programs: # lasso2 implements lasso, square-root lasso, elastic net, ridge regression, adaptive lasso and post-estimation OLS. The lasso (Least Absolute Shrinkage and Selection Operator, Tibshirani 1996), the square-root-lasso (Belloni et al. 2011) and the adaptive lasso (Zou 2006) are regularization methods that use \\(\\ell_1\\) norm penalization to achieve sparse solutions: of the full set of \\(p\\) predictors, typically most will have coefficients set to zero. Ridge regression (Hoerl \u0026amp; Kennard 1970) relies on \\(\\ell_2\\) norm penalization; the elastic net (Zou \u0026amp; Hastie 2005) uses a mix of \\(\\ell_1\\) and \\(\\ell_2\\) penalization.\ncvlasso supports \\(K\\) -fold cross-validation and h-step ahead rolling cross-validation (for time-series and panel data) to choose the penalization parameters for all the implemented estimators.\nrlasso implements theory-driven penalization for the lasso and square-root lasso that can be applied to cross-section and panel data. rlasso uses the theory-driven penalization methodology of Belloni et al. (2012, 2013, 2014, 2016) for the lasso and square-root lasso. In addition, rlasso implements the Chernozhukov et al. (2013) sup-score test of joint significance of the regressors that is suitable for the high-dimensional setting.\nlassologit, cvlassologit and rlassologit for logistic regression.\n"},{"id":5,"href":"/docs/pdslasso/","title":"PDSLASSO","section":"Docs","content":" When would you want to use pdslasso? # pdslasso and ivlasso are routines for estimating structural parameters in linear models with many controls and/or many instruments. The routines use methods for estimating sparse high-dimensional models, specifically the lasso (Least Absolute Shrinkage and Selection Operator, Tibshirani 1996) and the square-root-lasso (Belloni et al. 2011, 2014).\nThe purpose of pdslasso is to improve causal inference when the aim is to assess the effect of one or a few (possibly endogenous) regressors on the outcome variable. pdslasso allows the user to select control variables and/or instruments.\nMany control variables # The primary interest in an econometric analysis often lies in one or a few regressors, for which we want to estimate the causal effect on an outcome variable. However, to allow for a causal interpretation we need to control for confounding factors. Lasso-type techniques can be employed to appropriately select controls and thus improve the robustness of causal inference.\nMany instruments # High-dimensional instruments can arise when there is an inherently large number of potentially relevant instruments or when it\u0026rsquo;s unclear how these instruments should be specified (e.g. dummy variables, interaction effects).\nMethods # Two approaches are implemented in pdslasso and ivlasso:\nThe post-double-selection methodology of Belloni et al. (2012, 2013, 2014, 2015, 2016). The post-regularization methodology of Chernozhukov, Hansen and Spindler (2015). 
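Both approaches are invoked through the same estimation commands. As a rough sketch of the syntax (the variable names here are purely hypothetical: an outcome y, a causal variable of interest d, high-dimensional controls x1-x100, and instruments z1-z50):
. pdslasso y d (x1-x100)
. ivlasso y (x1-x100) (d = z1-z50)
The variables in parentheses are treated as high-dimensional and are selected over by the lasso; a single call reports the post-double-selection (PDS) results alongside the CHS lasso- and post-lasso-orthogonalized results.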
For instrumental variable estimation, ivlasso implements weak-identification-robust hypothesis tests and confidence sets using the Chernozhukov et al. (2013) sup-score test.\nThe implementation of these methods in pdslasso and ivlasso requires the Stata program rlasso (available in the separate Stata module lassopack), which provides lasso and square-root lasso estimation with data-driven penalization.\n"},{"id":6,"href":"/docs/pystacked/","title":"PYSTACKED","section":"Docs","content":" Stacked generalization with pystacked # pystacked implements stacked generalization (Wolpert, 1992) via scikit-learn\u0026rsquo;s sklearn.ensemble.StackingRegressor and sklearn.ensemble.StackingClassifier. Stacking is a way of combining predictions from multiple supervised machine learners (the \u0026ldquo;base learners\u0026rdquo;) into a final prediction to improve performance. The currently-supported base learners are:\nLinear regression Logistic regression Lasso, ridge and elastic net Support vector machines Gradient boosted trees Random forest Neural nets (Multi-layer Perceptron) pystacked can also be used with a single base learner and, thus, provides an easy-to-use API for scikit-learn\u0026rsquo;s machine learning algorithms.\npystacked has just been released (October 2021). Please try it out and let us know if you run into problems. Feedback welcome and appreciated. "},{"id":7,"href":"/docs/ddml/crossfit/","title":"Algorithm","section":"DDML","content":" DDML Algorithm # DDML estimators proceed in two stages:\nCross-fitting to estimate conditional expectation functions. Second stage estimation based on Neyman orthogonal scores. Chernozhukov et al. (2018) show that cross-fitting ensures that we can leverage a large class of machine learners for causal inference \u0026ndash; including popular machine learners such as random forests or gradient boosting. Cross-fitting ensures independence between the estimation error from the first step and the regression residual in the second stage.\nTo illustrate the estimation methodology, let us consider the Partially Linear Model: \\[ Y = a.D \u0026#43; g(X) \u0026#43; U \\\\ D = m(X) \u0026#43; V \\quad\\quad~~\\] Under conditional orthogonality, we can write \\[a = \\frac{E\\left[\\big(Y - \\ell(\\bm{X})\\big)\\big(D - m(\\bm{X})\\big)\\right]}{E\\left[(D - m(\\bm{X}))^2\\right]}.\\] where \\(m(\\bm{X})\\equiv E[D\\vert X] \\) and \\(\\ell(\\bm{X})\\equiv E[Y\\vert X]\\) .\nDDML uses cross-fitting to estimate the conditional expectation functions, which are then used to obtain the DDML estimate of \\(a\\) .\nTo implement cross-fitting, we randomly split the sample into \\(K\\) evenly-sized folds, denoted as \\(I_1,\\ldots, I_K\\) . For each fold \\(k\\) , the conditional expectations \\(\\ell_0\\) and \\(m_0\\) are estimated using only observations not in the \\(k\\) th fold \u0026ndash; i.e., in \\(I^c_k\\equiv I \\setminus I_k\\) \u0026ndash; resulting in \\(\\hat{\\ell}_{I^c_{k}}\\) and \\(\\hat{m}_{I^c_{k}}\\) , respectively, where the subscript \\({I^c_{k}}\\) indicates the subsample used for estimation. The out-of-sample predictions for an observation \\(i\\) in the \\(k\\) th fold are then computed via \\(\\hat{\\ell}_{I^c_{k}}(\\bm{X}_i)\\) and \\(\\hat{m}_{I^c_{k}}(\\bm{X}_i)\\) . 
Repeating this procedure for all \\(K\\) folds then allows for computation of the DDML estimator for \\(a\\) : \\[ \\hat{a}_n = \\frac{\\frac{1}{n}\\sum_{i=1}^n \\big(Y_i-\\hat{\\ell}_{I^c_{k_i}}(\\bm{X}_i)\\big)\\big(D_i-\\hat{m}_{I^c_{k_i}}(\\bm{X}_i)\\big)}{\\frac{1}{n}\\sum_{i=1}^n \\big(D_i-\\hat{m}_{I^c_{k_i}}(\\bm{X}_i)\\big)^2},\\] where \\(k_i\\) denotes the fold of the \\(i\\) th observation.\n"},{"id":8,"href":"/docs/pdslasso/pdslasso_demo/","title":"Demonstration","section":"PDSLASSO","content":" Demonstration # We demonstrate the use of pdslasso and ivlasso using the data set of Acemoglu, Johnson \u0026amp; Robinson (2001).\n. clear . use https://statalasso.github.io/dta/AJR.dta Basic OLS # We are interested in the effect of institutions (measured by avexpr) on income (logpgp95). We ignore endogeneity issues for now and begin with a simple regression of logpgp95 against avexpr:\n. reg logpgp95 avexpr Source | SS df MS Number of obs = 64 -------------+---------------------------------- F(1, 62) = 72.82 Model | 37.0420118 1 37.0420118 Prob \u0026gt; F = 0.0000 Residual | 31.5397067 62 .508704946 R-squared = 0.5401 -------------+---------------------------------- Adj R-squared = 0.5327 Total | 68.5817185 63 1.08859871 Root MSE = .71324 ------------------------------------------------------------------------------ logpgp95 | Coef. Std. Err. t P\u0026gt;|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- avexpr | .522107 .061185 8.53 0.000 .3997999 .6444142 _cons | 4.660383 .4085062 11.41 0.000 3.843791 5.476976 ------------------------------------------------------------------------------ Exogenous regressors and many controls # We have 24 control variables that control for geography (latitude, continent dummies). This doesn\u0026rsquo;t seem like a lot, but we only have 64 observations! At the same time, we are only interested in avexpr. The problem is that the \u0026ldquo;right\u0026rdquo; set of controls is not known \u0026ndash; use too few controls, or the wrong ones, and omitted variable bias will be present; use too many, and the model will suffer from overfitting.\nSo, we treat the remaining variables as high-dimensional controls by placing them into parentheses and letting the lasso decide which controls are important.\n. pdslasso logpgp95 avexpr (lat_abst edes1975 avelf temp* humid* steplow-oilres) OLS using CHS lasso-orthogonalized vars ------------------------------------------------------------------------------ logpgp95 | Coef. Std. Err. z P\u0026gt;|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- avexpr | .4262511 .0540552 7.89 0.000 .3203049 .5321974 ------------------------------------------------------------------------------ OLS using CHS post-lasso-orthogonalized vars ------------------------------------------------------------------------------ logpgp95 | Coef. Std. Err. z P\u0026gt;|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- avexpr | .391257 .0574894 6.81 0.000 .2785799 .503934 ------------------------------------------------------------------------------ OLS with PDS-selected variables and full regressor set ------------------------------------------------------------------------------ logpgp95 | Coef. Std. Err. z P\u0026gt;|z| [95% Conf. 
Interval] -------------+---------------------------------------------------------------- avexpr | .3913455 .0561862 6.97 0.000 .2812225 .5014684 edes1975 | .0091289 .003184 2.87 0.004 .0028883 .0153694 avelf | -.9974943 .2474453 -4.03 0.000 -1.482478 -.5125104 zinc | -.0079226 .0280604 -0.28 0.778 -.0629201 .0470748 _cons | 5.764133 .3773706 15.27 0.000 5.024501 6.503766 ------------------------------------------------------------------------------ Three different estimation results are presented, which correspond to three different approaches:\npost-regularization with the lasso: (1) we obtain the lasso residuals from regressing logpgp95 against the set of controls; (2) we obtain the lasso residuals from regressing avexpr against the set of controls; (3) OLS regression using the orthogonalized versions of logpgp95 and avexpr. post-regularization with the post-lasso: same as above but using post-lasso residuals instead of lasso residuals. post-double-selection: OLS of logpgp95 against avexpr and the set of controls selected in regressions (1) and (2). All three approaches are valid.\nEndogenous regressor and all controls # Since the relationship between income and institutions suffers from reverse causality, we use settler mortality (logem4) as an instrument, as suggested by Acemoglu et al. The rationale for using logem4 is that the disease environment (malaria, yellow fever, etc.) is exogenous because diseases were almost always fatal to settlers (no immunity), but less serious for natives (some degree of immunity).\nWe also need to control for other highly persistent factors that are related to institutions \u0026amp; GDP. For now, we include all control variables:\n. pdslasso logpgp95 lat_abst edes1975 avelf temp* humid* steplow-oilres (avexpr=logem4) IV using CHS lasso-orthogonalized vars ------------------------------------------------------------------------------ logpgp95 | Coef. Std. Err. z P\u0026gt;|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- avexpr | 1.174461 .3166948 3.71 0.000 .5537506 1.795172 ... (output omitted) ------------------------------------------------------------------------------ IV using CHS post-lasso-orthogonalized vars ------------------------------------------------------------------------------ logpgp95 | Coef. Std. Err. z P\u0026gt;|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- avexpr | 1.065556 .2492286 4.28 0.000 .5770768 1.554035 ... (output omitted) ------------------------------------------------------------------------------ IV with PDS-selected variables and full regressor set ------------------------------------------------------------------------------ logpgp95 | Coef. Std. Err. z P\u0026gt;|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- avexpr | .7126678 .1649034 4.32 0.000 .389463 1.035873 ... (output omitted) ------------------------------------------------------------------------------ Selected instruments and selected controls: # Including all controls seems inefficient. Thus, we use the lasso to select controls in the IV regression. To this end, we place our high-dimensional controls in parentheses as above.\n. ivlasso logpgp95 (lat_abst edes1975 avelf temp* humid* steplow-oilres) (avexpr=logem4) IV using CHS lasso-orthogonalized vars ------------------------------------------------------------------------------ logpgp95 | Coef. Std. Err. z P\u0026gt;|z| [95% Conf. 
Interval] -------------+---------------------------------------------------------------- avexpr | .7710621 .1502209 5.13 0.000 .4766344 1.06549 ------------------------------------------------------------------------------ IV using CHS post-lasso-orthogonalized vars ------------------------------------------------------------------------------ logpgp95 | Coef. Std. Err. z P\u0026gt;|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- avexpr | .8798503 .2727401 3.23 0.001 .3452896 1.414411 ------------------------------------------------------------------------------ IV with PDS-selected variables and full regressor set ------------------------------------------------------------------------------ logpgp95 | Coef. Std. Err. z P\u0026gt;|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- avexpr | .8413527 .2487658 3.38 0.001 .3537807 1.328925 edes1975 | .0019949 .0058535 0.34 0.733 -.0094777 .0134675 avelf | -.8777934 .3557117 -2.47 0.014 -1.574975 -.1806113 zinc | -.0739391 .0526534 -1.40 0.160 -.1771378 .0292597 _cons | 2.975816 1.555107 1.91 0.056 -.0721371 6.02377 ------------------------------------------------------------------------------ More # More information can be found in the help file:\nhelp ivlasso help pdslasso "},{"id":9,"href":"/docs/lassopack/estimators/","title":"Estimation methods","section":"LASSOPACK","content":" Ridge regression # The ridge estimator (Hoerl \u0026amp; Kennard, 1970) can be written as\n\\[\\hat{\\beta}_{Ridge} = (X\u0026#39;X\u0026#43;\\lambda I_p)^{-1}X\u0026#39;y.\\] Thus, even if the regressor matrix is not full rank (e.g. because \\(p\u0026gt;N\\) ), the problem becomes nonsingular by adding a constant to the diagonal of \\(X\u0026#39;X\\) . Another advantage of the ridge estimator over least squares stems from the variance-bias trade-off. Ridge regression may improve over ordinary least squares by inducing a mild bias while decreasing the variance. For this reason, ridge regression is a popular method in the context of multicollinearity. In contrast to estimators relying on \\(\\ell_1\\) -penalization, the ridge does not yield sparse solutions and keeps all predictors in the model.\nLasso estimator # The lasso minimizes the residual sum of squares (RSS) subject to a constraint on the absolute size of coefficient estimates. Tibshirani (1996) motivates the lasso with two major advantages over least squares. First, due to the nature of the \\(\\ell_1\\) -penalty, the lasso tends to produce sparse solutions and thus facilitates model interpretation. Secondly, similar to ridge regression, lasso can outperform least squares in terms of prediction due to lower variance. Another advantage is that the lasso is computationally attractive due to its convex form. This is in contrast to model selection based on AIC or BIC (which employ \\(\\ell_0\\) penalization) where each possible sub-model has to be estimated.\nElastic net # The elastic net applies a mix of \\(\\ell_1\\) (lasso-type) and \\(\\ell_2\\) (ridge-type) penalization. It combines some of the strengths of lasso and ridge regression. In the presence of groups of correlated regressors, the lasso selects typically only one variable from each group, whereas the ridge tends to produce similar coefficient estimates for groups of correlated variables. On the other hand, the ridge does not yield sparse solutions impeding model interpretation. 
The elastic net is able to produce sparse solutions (for some \\(\\alpha\\) greater than zero) and retains (or drops) correlated variables jointly. (Zou \u0026amp; Hastie, 2005)\nAdaptive lasso # The lasso is only variable selection consistent under the rather strong irrepresentable condition, which imposes constraints on the degree of correlation between predictors in the true model and predictors outside of the model (see Zhao \u0026amp; Yu, 2006; Meinshausen \u0026amp; Bühlmann, 2006). Zou (2006) proposes the adaptive lasso which uses penalty loadings \\(|\\hat\\beta_{0,j}|^{-\\theta}\\) for \\(j=1,...,p\\) where \\(\\hat\\beta_{0,j}\\) is an initial estimator. The adaptive lasso is variable-selection consistent for fixed \\(p\\) under weaker assumptions than the standard lasso. If \\(p\u0026lt;N\\) , OLS can be used as the initial estimator. Huang et al. (2008) suggest to use univariate OLS if \\(p\u0026gt;N\\) . Other initial estimators are possible. (Zou, 2006)\nSquare-root lasso # The sqrt-lasso is a modification of the lasso that minimizes sqrt(RSS) instead of RSS, while also imposing an \\(\\ell_1\\) -penalty. The main advantage of the sqrt-lasso over the standard lasso is that the theoretically grounded, data-driven optimal \\(\\lambda\\) is independent of the unknown error variance under homoskedasticity. (Belloni et al., 2011, 2014)\nPost-estimation OLS # Penalized regression methods induce a bias that can be alleviated by post-estimation OLS, which applies OLS to the predictors selected by the first-stage variable selection method. For the case of the lasso, Belloni and Chernozhukov (2013) have shown that the post-lasso OLS performs at least as well as the lasso under mild additional assumptions.\nFor further information on the lasso and related methods, see for example the textbooks by Hastie et al. (2009, 2015; both available for free) and Bühlmann \u0026amp; Van de Geer (2011).\n"},{"id":10,"href":"/docs/lassopack/help/cvlasso_help/","title":"help cvlasso","section":"Help files","content":" ---------------------------------------------------------------------------------------------------------------------------------- help cvlasso lassopack v1.4.2 ---------------------------------------------------------------------------------------------------------------------------------- Title cvlasso -- Program for cross-validation using lasso, square-root lasso, elastic net, adaptive lasso and post-OLS estimators Syntax Full syntax cvlasso depvar regressors [if exp] [in range] [, alpha(numlist) alphacount(int) sqrt adaptive adaloadings(string) adatheta(real) ols lambda(real) lcount(integer) lminratio(real) lmax(real) lopt lse lglmnet notpen(varlist) partial(varlist) psolver(string) ploadings(string) unitloadings prestd fe noftools noconstant tolopt(real) tolzero(real) maxiter(int) nfolds(int) foldvar(varname) savefoldvar(varname) rolling h(int) origin(int) fixedwindow seed(real) plotcv plotopt(string) saveest(string)] Note: the fe option will take advantage of the ftools package (if installed) for the fixed-effects transform; the speed gains using this package can be large. See help ftools or click on ssc install ftools to install. Estimators Description ---------------------------------------------------------------------------------------------------------------------------- alpha(numlist) a scalar elastic net parameter or an ascending list of elastic net parameters. If the number of alpha values is larger than 1, cross-validation is conducted over alpha (and lambda). 
The default is alpha=1, which corresponds to the lasso estimator. The elastic net parameter controls the degree of L1-norm (lasso-type) to L2-norm (ridge-type) penalization. Each alpha value must be in the interval [0,1]. alphacount(real) number of alpha values used for cross-validation across alpha. By default, cross-validation is only conducted across lambda, but not over alpha. Ignored if alpha() is specified. sqrt square-root lasso estimator. adaptive adaptive lasso estimator. The penalty loading for predictor j is set to 1/abs(beta0(j))^theta where beta0(j) is the OLS estimate or univariate OLS estimate if p\u0026gt;n. Theta is the adaptive exponent, and can be controlled using the adatheta(real) option. adaloadings(string) alternative initial estimates, beta0, used for calculating adaptive loadings. For example, this could be the vector e(b) from an initial lasso2 estimation. The elements of the vector are raised to the power -theta (note the minus). See adaptive option. adatheta(real) exponent for calculating adaptive penalty loadings. See adaptive option. Default=1. ols post-estimation OLS. Note that cross-validation using OLS will in most cases lead to no unique optimal lambda (since MSPE is a step function over lambda). ---------------------------------------------------------------------------------------------------------------------------- See overview of estimation methods. Lambda(s) Description ---------------------------------------------------------------------------------------------------------------------------- lambda(numlist) a scalar lambda value or list of descending lambda values. Each lambda value must be greater than 0. If not specified, the default list is used which is given by exp(rangen(log(lmax),log(lminratio*lmax),lcount)) (see mf_range). lcount(integer)† number of lambda values for which the solution is obtained. Default is 100. lminratio(real)† ratio of minimum to maximum lambda. lminratio must be between 0 and 1. Default is 1/1000. lmax(real)† maximum lambda value. Default is 2*max(X'y), and max(X'y) in the case of the square-root lasso (where X is the pre-standardized regressor matrix and y is the vector of the response variable). lopt after cross-validation, estimate model with lambda that minimized the mean-squared prediction error lse after cross-validation, estimate model with largest lambda that is within one standard deviation from lopt lglmnet use the parameterizations for lambda, alpha, standardization, etc. employed by glmnet by Friedman et al. (2010). ---------------------------------------------------------------------------------------------------------------------------- † Not applicable if lambda() is specified. Loadings \u0026amp; standardization Description ---------------------------------------------------------------------------------------------------------------------------- notpen(varlist) sets penalty loadings to zero for predictors in varlist. Unpenalized predictors are always included in the model. partial(varlist) variables in varlist are partialled out prior to estimation. psolver(string) override default solver used for partialling out (one of: qr, qrxx, lu, luxx, svd, svdxx, chol; default=qrxx) ploadings(matrix) a row-vector of penalty loadings; overrides the default standardization loadings (in the case of the lasso, =sqrt(avg(x^2))). The size of the vector should equal the number of predictors (excluding partialled out variables and excluding the constant). 
unitloadings penalty loadings set to a vector of ones; overrides the default standardization loadings (in the case of the lasso, =sqrt(avg(x^2)). prestd dependent variable and predictors are standardized prior to estimation rather than standardized \"on the fly\" using penalty loadings. See here for more details. By default the coefficient estimates are un-standardized (i.e., returned in original units). ---------------------------------------------------------------------------------------------------------------------------- See discussion of standardization in the lasso2 help file. Also see Section Data transformations in cross-validation below. FE \u0026amp; constant Description ---------------------------------------------------------------------------------------------------------------------------- fe within-transformation is applied prior to estimation. Requires data to be xtset. noftools do not use FTOOLS package for fixed-effects transform (slower; rarely used) noconstant suppress constant from estimation. Default behaviour is to partial the constant out (i.e., to center the regressors). ---------------------------------------------------------------------------------------------------------------------------- Optimization Description ---------------------------------------------------------------------------------------------------------------------------- tolopt(real) tolerance for lasso shooting algorithm (default=1e-10) tolzero(real) minimum below which coeffs are rounded down to zero (default=1e-4) maxiter(int) maximum number of iterations for the lasso shooting algorithm (default=10,000) ---------------------------------------------------------------------------------------------------------------------------- Fold variable options Description ---------------------------------------------------------------------------------------------------------------------------- nfolds(integer) the number of folds used for K-fold cross-validation. Default is 10. foldvar(varname) user-specified variable with fold IDs, ranging from 1 to #folds. If not specified, fold IDs are randomly generated such that each fold is of approximately equal size. savefoldvar(varname) saves the fold ID variable. Not supported in combination with rolling. rolling uses rolling h-step ahead cross-validation. Requires the data to be tsset. h(integer)‡ changes the forecasting horizon. Default is 1. origin(integer)‡ controls the number of observations in the first training dataset. fixedwindow‡ ensures that the size of the training dataset is always the same. seed(real) set seed for the generation of a random fold variable. Only relevant if fold variable is randomly generated. ---------------------------------------------------------------------------------------------------------------------------- ‡ Only applicable with rolling option. Plotting options Description ---------------------------------------------------------------------------------------------------------------------------- plotcv plots the estimated mean-squared prediction error as a function of ln(lambda) plotopt(varlist) overwrites the default plotting options. All options are passed on to line. 
---------------------------------------------------------------------------------------------------------------------------- Display options Description ---------------------------------------------------------------------------------------------------------------------------- omitgrid suppresses the display of mean-squared prediction errors ---------------------------------------------------------------------------------------------------------------------------- Store lasso2 results Description ---------------------------------------------------------------------------------------------------------------------------- saveest(string) saves lasso2 results from each step of the cross-validation in string1, ..., stringK where K is the number of folds. Intermediate results can be restored using estimates restore. ---------------------------------------------------------------------------------------------------------------------------- cvlasso may be used with time-series or panel data, in which case the data must be tsset or xtset first; see help tsset or xtset. All varlists may contain time-series operators or factor variables; see help varlist. Replay syntax cvlasso [, lopt lse postresults plotcv(method) plotopt(string)] Replay options Description ---------------------------------------------------------------------------------------------------------------------------- lopt show estimation results using the model corresponding to lambda=e(lopt) lse show estimation results using the model corresponding to lambda=e(lse) postresults post lasso2 estimation results (to be used in combination with lse or lopt) plotcv(method) see plotting options above plotopt(string) see plotting options above ---------------------------------------------------------------------------------------------------------------------------- Postestimation: predict [type] newvar [if] [in] [, xb u e ue xbu residuals lopt lse noisily] Predict options Description ---------------------------------------------------------------------------------------------------------------------------- xb compute predicted values (the default) residuals compute residuals e generate overall error component e(it). Only after fe. ue generate combined residuals, i.e., u(i) + e(it). Only after fe. xbu prediction including fixed effect, i.e., a + xb + u(i). Only after fe. u fixed effect, i.e., u(i). Only after fe. lopt use lambda that minimized the mean-squared prediction error lse use the largest lambda that is within one standard deviation from lopt noisily displays beta used for prediction. ---------------------------------------------------------------------------------------------------------------------------- Contents Description Partitioning of folds Data transformations in cross-validation cvlasso vs. Friedman et al.'s glmnet and StataCorp's lasso Examples of usage --General demonstration --Rolling cross-validation with time-series data --Rolling cross-validation with panel data Saved results References Website Installation Acknowledgements Citation of lassopack Description cvlasso implements K-fold cross-validation and h-step ahead rolling cross-validation for the following estimators: lasso, square-root lasso, adaptive lasso, ridge regression, elastic net. See lasso2 for more information about these estimators. The purpose of cross-validation is to assess the out-of-sample prediction performance of the estimator. The steps for K-fold cross-validation over lambda can be summarized as follows: 1. 
Split the data into K groups, referred to as folds, of approximately equal size. Let n(k) denote the number of observations in the kth data partition with k=1,...,K. 2. The first fold is treated as the validation dataset and the remaining K-1 parts constitute the training dataset. The model is fit to the training data for a given value of lambda. The resulting estimate is denoted as betahat(1,lambda). The mean-squared prediction error for group 1 is computed as MSPE(1,lambda)=1/n(1)*sum([y(i) - x(i)'betahat(1,lambda)]^2) for all i in the first data partition. The procedure is repeated for k=2,...,K. Thus, MSPE(2,lambda), ..., MSPE(K,lambda) are calculated. 3. The K-fold cross-validation estimate of the MSPE, which serves as a measure of prediction performance, is CV(lambda)=1/K*sum(MSPE(k,lambda)). 4. Step 2 and 3 are repeated for a range of lambda values. h-step ahead rolling cross-validation proceeds in a similar way, except that the partitioning of training and validation takes account of the time-series structure. Specifically, the training window is iteratively extended (or moved forward) by one step. See below for more details. Partitioning of folds cvlasso supports K-fold cross-validation and cross-validation using rolling h-step ahead forecasts. K-fold cross-validation is the standard approach and relies on a fold ID variable. Rolling h-step ahead cross-validation is applicable with time-series data, or panels with large time dimension. K-fold cross-validation The fold ID variable marks the observations which are used as validation data. For example, a fold ID variable (with three folds) could have the following structure: +------------------+ | fold y x | |------------------| | 3 y1 x1 | | 2 y2 x2 | | 1 y3 x3 | | 3 y4 x4 | | 1 y5 x5 | | 2 y6 x6 | +------------------+ It is instructive to illustrate the cross-validation process implied by the above fold ID variable. Let T denote a training observation and V denote a validation point. The division of folds can be summarized as follows: Step 1 2 3 +- -+ 1 | T T V | 2 | T V T | 3 | V T T | i 4 | T T V | 5 | V T T | 6 | T V T | +- -+ In the first step, the 3rd and 5th observation are in the validation dataset and remaining data constitute the training dataset. In the second step, the validation dataset includes the 2nd and 6th observation, etc. By default, the fold ID variable is randomly generated such that each fold is of approximately equal size. The default number of folds is equal to 10, but can be changed using the nfolds() option. Rolling h-step ahead cross-validation To allow for time-series data, cvlasso supports cross-validation using rolling h-step forecasts (option rolling); see Hyndman, 2016. To use rolling cross-validation, the data must be tsset or xtset. The options h() and origin() control the forecasting horizon and the starting point of the rolling forecast, respectively. The following matrix illustrates the division between training and validation data over the course of the cross-validation for the case of 1-step ahead forecasting (the default when rolling is specified). Step 1 2 3 4 5 +- -+ 1 | T T T T T | 2 | T T T T T | 3 | T T T T T | t 4 | V T T T T | 5 | . V T T T | 6 | . . V T T | 7 | . . . V T | 8 | . . . . V | +- -+ In the first iteration (illustrated in the first column), the first three observations are in the training dataset, which corresponds to origin(3). The option h() controls the forecasting horizon used for cross-validation (the default is 1). 
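For instance, a call matching this illustration would be (y here is a hypothetical tsset time series used with one lag as predictor): . cvlasso y L.y, rolling origin(3) Here h() is left at its default of 1, so the penalty level is chosen to minimize the mean-squared error of the 1-step ahead forecasts.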
If h(2) is specified, which corresponds to 2-step ahead forecasting, the structure changes to: Step 1 2 3 4 5 +- -+ 1 | T T T T T | 2 | T T T T T | 3 | T T T T T | 4 | . T T T T | t 5 | V . T T T | 6 | . V . T T | 7 | . . V . T | 8 | . . . V . | 9 | . . . . V | +- -+ The fixedwindow option ensures that the size of the training dataset is always the same. In this example (using h(1)), each step uses three data points for training: Step 1 2 3 4 5 +- -+ 1 | T . . . . | 2 | T T . . . | 3 | T T T . . | t 4 | V T T T . | 5 | . V T T T | 6 | . . V T T | 7 | . . . V T | 8 | . . . . V | +- -+ Data transformations in cross-validation An important principle in cross-validation is that the training dataset should not contain information from the validation dataset. This mimics the real-world situation where out-of-sample predictions are made not knowing what the true response is. The principle applies not only to individual observations (the training and validation data do not overlap) but also to data transformations. Specifically, data transformations applied to the training data should not use information from the validation data or full dataset. In particular, standardization using the full sample violates this principle. cvlasso implements this principle for all data transformations supported by lasso2: data standardization, fixed effects and partialling-out. In most applications using the estimators supported by cvlasso, predictors are standardized to have mean zero and unit variance. The above principle means that the standardization applied to the training data is based only on observations in the training data; further, the standardization transformation applied to the validation data will also be based only on the means and variances of the observations in the training data. The same applies to the fixed effects transformation: the group means used to implement the within transformation to both the training data and the validation data are calculated using only the training data. Similarly, the projection coefficients used to \"partial out\" variables are estimated using only the training data and are applied to both the training dataset and the validation dataset. cvlasso vs. Friedman et al.'s (2010) glmnet and StataCorp's lasso The parameterization used by cvlasso and lasso2 differs from StataCorp's lasso in only one respect: lambda(StataCorp) = (1/2N)*lambda(lasso2). The elastic net parameter alpha is the same in both parameterizations. See the lasso2 help file for examples. The parameterization used by Friedman et al.'s (2010) glmnet uses the same convention as StataCorp for lambda: lambda(glmnet) = (1/2N)*lambda(lasso2). However, the glmnet treatment of the elastic net parameter alpha differs from both cvlasso/lasso2 and StataCorp's lasso. The glmnet objective function is defined such that the dependent variable is assumed already to have been standardized. Because the L2 norm is nonlinear, this affects the interpretation of alpha. Specifically, the default cvlasso/lasso2 and StataCorp's lasso parameterization means that alpha is not invariant to changes in the scale of the dependent variable. The glmnet parameterization of alpha, however, is scale-invariant - a useful feature. cvlasso and lasso2 provide an lglmnet option that enables the user to employ the glmnet parameterization for alpha and lambda. See the lasso2 help file for examples of its usage and how to replicate glmnet output. 
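As a small numerical illustration of the lambda conversion stated above (the numbers are purely hypothetical): with N=100 observations, a value of lambda=20 on the cvlasso/lasso2 scale corresponds to lambda(StataCorp) = lambda(glmnet) = (1/(2*100))*20 = 0.1.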
We recommend the use of the lglmnet option in particular with cross-validation over alpha; see below for an example. General introduction using K-fold cross-validation Dataset The dataset is available through Hastie et al. (2015) on the authors' website. The following variables are included in the dataset of 97 men: Predictors lcavol log(cancer volume) lweight log(prostate weight) age patient age lbph log(benign prostatic hyperplasia amount) svi seminal vesicle invasion lcp log(capsular penetration) gleason Gleason score pgg45 percentage Gleason scores 4 or 5 Outcome lpsa log(prostate specific antigen) Load prostate cancer data. . insheet using https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data, clear tab General demonstration 10-fold cross-validation across lambda. The lambda value that minimizes the mean-squared prediction error is indicated by an asterisk (*). A hat (^) marks the largest lambda at which the MSPE is within one standard error of the minimal MSPE. The former is returned in e(lopt), the latter in e(lse). We use seed(123) throughout this demonstration for replicability of folds. . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) . di e(lopt) . di e(lse) Estimate the full model Estimate the full model with either e(lopt) or e(lse). cvlasso internally calls lasso2 with lambda=lopt or lse, respectively. . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, lopt seed(123) . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, lse seed(123) The same as above can be achieved using the replay syntax. . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) . cvlasso, lopt . cvlasso, lse If postresults is specified, cvlasso posts the lasso2 estimation results. . cvlasso, lopt postres . ereturn list Cross-validation over lambda and alpha alpha() can be a scalar or list of elastic net parameters. Each alpha value must lie in the interval [0,1]. If alpha() is a list longer than 1, cvlasso cross-validates over lambda and alpha. The table at the end of the output indicates the alpha value that minimizes the empirical MSPE. We recommend using the glmnet parameterization of the elastic net because alpha in this parameterization is invariant to scaling (see above for discussion and the lasso2 help file for illustrative examples). . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, alpha(0 0.1 0.5 1) lc(10) lglmnet seed(123) Alternatively, the alphacount() option can be used to control the number of alpha values used for cross-validation. . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, alphac(3) lc(10) lglmnet seed(123) Plotting We can plot the estimated mean-squared prediction error over lambda. Note that the plotting feature is not supported if we cross-validate over alpha. . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) plotcv Prediction The predict postestimation command allows one to obtain predicted values and residuals for lambda=e(lopt) or lambda=e(lse). . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) . cap drop xbhat1 . predict double xbhat1, lopt . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) . cap drop xbhat2 . predict double xbhat2, lse Store intermediate steps cvlasso internally calls lasso2. To see intermediate estimation results, we can use the saveest(string) option. . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) nfolds(3) saveest(step) . estimates dir . estimates restore step1 . 
estimates replay step1 Time-series example using rolling h-step ahead cross-validation Load airline passenger data. . webuse air2, clear There are 144 observations in the sample. origin() controls the sample range used for training and validation. In this example, origin(130) implies that data up to and including t=130 are used for training in the first iteration. Data points t=131 to 144 are successively used for validation. The notation `a-b (v)' indicates that data a to b are used for estimation (training), and data point v is used for forecasting (validation). Note that the training dataset starts with t=13 since 12 lags are used as predictors. . cvlasso air L(1/12).air, rolling origin(130) The optimal model includes lags 1, 11 and 12. . cvlasso, lopt The option h() controls the forecasting horizon (default=1). . cvlasso air L(1/12).air, rolling origin(130) h(2) In the above examples, the size of the training dataset increases by one data point each step. To keep the size of the training dataset fixed, specify fixedwindow. . cvlasso air L(1/12).air, rolling origin(130) fixedwindow Cross-validation over alpha with alpha={0, 0.1, 0.5, 1}. . cvlasso air L(1/12).air, rolling origin(130) alpha(0 0.1 0.5 1) Plot mean-squared prediction errors against ln(lambda). . cvlasso air L(1/12).air, rolling origin(130) . cvlasso, plotcv Panel data example using rolling h-step ahead cross-validation Rolling cross-validation can also be applied to panel data. For demonstration, load Grunfeld data. . webuse grunfeld, clear Apply 1-step ahead cross-validation. . cvlasso mvalue L(1/10).mvalue, rolling origin(1950) The model selected by cross-validation: . cvlasso, lopt Same as above with fixed size of training data. . cvlasso mvalue L(1/10).mvalue, rolling origin(1950) fixedwindow Saved results cvlasso saves the following in e(): scalars e(N) sample size e(nfolds) number of folds e(lmax) largest lambda e(lmin) smallest lambda e(lcount) number of lambdas e(sqrt) =1 if sqrt-lasso, 0 otherwise e(adaptive) =1 if adaptive loadings are used, 0 otherwise e(ols) =1 if post-estimation OLS, 0 otherwise e(partial_ct) number of partialled out predictors e(notpen_ct) number of not penalized predictors e(prestd) =1 if pre-standardized, 0 otherwise e(nalpha) number of alphas e(h) forecasting horizon for rolling forecasts (only returned if rolling is specified) e(origin) number of observations in first training dataset (only returned if rolling is specified) e(lopt) optimal lambda (may be missing if no unique minimum MSPE) e(lse) lambda se (may be missing if no unique minimum MSPE) e(mspemin) minimum MSPE macros e(cmd) cvlasso e(method) indicates which estimator is used (e.g. lasso, elastic net) e(cvmethod) indicates whether K-fold or rolling cross-validation is used e(varXmodel) predictors (excluding partialled-out variables) e(varX) predictors e(partial) partialled out predictors e(notpen) not penalized predictors e(depvar) dependent variable matrices e(lambdamat) column vector of lambda values functions e(sample) estimation sample In addition, if cvlasso cross-validates over alpha and lambda: scalars e(alphamin) optimal alpha, i.e., the alpha that minimizes the empirical MSPE macros e(alphalist) list of alpha values matrices e(mspeminmat) minimum MSPE for each alpha In addition, if cvlasso cross-validates over lambda only: scalars e(alpha) elastic net parameter matrices e(mspe) matrix of MSPEs for each fold and lambda where each column corresponds to one lambda value and each row corresponds to one fold. 
e(mmspe) column vector of MSPEs for each lambda e(cvsd) column vector of standard deviations of MSPE (for each lambda) e(cvupper) column vector equal to MSPE + 1 standard deviation e(cvlower) column vector equal to MSPE - 1 standard deviation References Correia, S. 2016. FTOOLS: Stata module to provide alternatives to common Stata commands optimized for large datasets. https://ideas.repec.org/c/boc/bocode/s458213.html Hyndman, Rob J. (2016). Cross-validation for time series. Hyndsight blog, 5 December 2016. https://robjhyndman.com/hyndsight/tscv/ See lasso2 for further references. Website Please check our website https://statalasso.github.io/ for more information. Installation cvlasso is part of the lassopack package. To get the latest stable version of lassopack from our website, check the installation instructions at https://statalasso.github.io/installation/. We update the stable website version more frequently than the SSC version. Earlier versions of lassopack are also available from the website. To verify that lassopack is correctly installed, click on or type whichpkg lassopack (which requires whichpkg to be installed; ssc install whichpkg). Acknowledgements Thanks to Sergio Correia for advice on the use of the FTOOLS package. Citation of cvlasso cvlasso is not an official Stata command. It is a free contribution to the research community, like a paper. Please cite it as such: Ahrens, A., Hansen, C.B., Schaffer, M.E. 2018 (updated 2020). LASSOPACK: Stata module for lasso, square-root lasso, elastic net, ridge, adaptive lasso estimation and cross-validation http://ideas.repec.org/c/boc/bocode/s458458.html Ahrens, A., Hansen, C.B. and M.E. Schaffer. 2020. lassopack: model selection and prediction with regularized regression in Stata. The Stata Journal, 20(1):176-235. https://journals.sagepub.com/doi/abs/10.1177/1536867X20909697. Working paper version: https://arxiv.org/abs/1901.05397. Authors Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland achim.ahrens@gess.ethz.ch Christian B. Hansen, University of Chicago, USA Christian.Hansen@chicagobooth.edu Mark E. Schaffer, Heriot-Watt University, UK m.e.schaffer@hw.ac.uk Also see Help: lasso2, lassologit, rlasso (if installed) "},{"id":11,"href":"/docs/lassopack/regularized_reg/","title":"Regularized regression","section":"LASSOPACK","content":" Regularized regression # lasso2 solves the elastic net problem\n\\[\\frac{1}{N} \\sum_{i=1}^N (y_i - x_i\u0026#39;\\beta)^2 \u0026#43; \\frac{\\lambda}{N} \\alpha ||\\Psi\\beta ||_1 \u0026#43; \\frac{\\lambda}{2N}(1-\\alpha)||\\Psi\\beta||_2^2\\] where\n\\(\\sum_{i=1}^N (y_i - x_i\u0026#39;\\beta)^2\\) is the residual sum of squares (RSS), \\(\\beta\\) is a \\(p\\) -dimensional parameter vector, \\(\\lambda\\) is the overall penalty level, which controls the general degree of penalization, \\(\\alpha\\) is the elastic net parameter, which determines the relative contribution of \\(\\ell_1\\) (lasso-type) to \\(\\ell_2\\) (ridge-type) penalization. \\(\\alpha=1\\) corresponds to the lasso; \\(\\alpha=0\\) is ridge regression. \\(\\Psi\\) is a \\(p\\) by \\(p\\) diagonal matrix of predictor-specific penalty loadings. \\(N\\) is the number of observations In addition, lasso2 estimates the square-root lasso (sqrt-lasso) estimator, which is defined as the solution to the following objective function:\n\\[\\sqrt{\\frac{1}{N} \\sum_{i=1}^N (y_i - x_i\u0026#39;\\beta)^2} \u0026#43; \\frac{\\lambda}{N} \\alpha ||\\Psi\\beta ||_1\\] lasso2 implements the elastic net and sqrt-lasso using coordinate descent algorithms. 
The algorithm (then referred to as \u0026ldquo;shooting\u0026rdquo;) was first proposed by Fu (1998) for the lasso, and by Van der Kooij (2007) for the elastic net. Belloni et al. (2011) implement the coordinate descent for the sqrt-lasso, and have kindly provided Matlab code.\nPenalized regression methods, such as the elastic net and the sqrt-lasso, rely on tuning parameters that control the degree and type of penalization. The estimation methods implemented in lasso2 use two tuning parameters: \\(\\lambda\\) and \\(\\alpha\\) . How to select lambda # lassopack offers three approaches for selecting the \u0026ldquo;optimal\u0026rdquo; \\(\\lambda\\) and \\(\\alpha\\) value, which are implemented in lasso2, cvlasso and rlasso, respectively.\nCross-validation: The penalty level \\(\\lambda\\) may be chosen by cross-validation in order to optimize out-of-sample prediction performance. \\(K\\) -fold cross-validation and rolling cross-validation (for panel and time-series data) are implemented in cvlasso. cvlasso also supports cross-validation across \\(\\alpha\\) . Theory-driven: Theoretically justified and feasible penalty levels and loadings are available for the lasso and sqrt-lasso via the separate command rlasso. The penalization is chosen to dominate the noise of the data-generating process (represented by the score vector), which allows derivation of theoretical results with regard to consistent prediction and parameter estimation. Since the error variance is in practice unknown, Belloni et al. (2012) introduce the rigorous (or feasible) lasso that relies on an iterative algorithm for estimating the optimal penalization and is valid in the presence of non-Gaussian and heteroskedastic errors. Belloni et al. (2016) extend the framework to the panel data setting. In the case of the sqrt-lasso under homoskedasticity, the optimal penalty level is independent of the unknown error variance, leading to a practical advantage and better performance in finite samples (see Belloni et al., 2011, 2014). Information criteria: \\(\\lambda\\) can also be selected using information criteria. lasso2 calculates four information criteria: Akaike Information Criterion (AIC; Akaike, 1974), Bayesian Information Criterion (BIC; Schwarz, 1978), Extended Bayesian information criterion (EBIC; Chen \u0026amp; Chen, 2008) and the corrected AIC (AICc; Sugiura, 1978, and Hurvich, 1989). "},{"id":12,"href":"/docs/lassopack/lasso2/","title":"Getting started","section":"LASSOPACK","content":" Load data # For demonstration purposes we use the prostate cancer data set, which has been widely applied to demonstrate the lasso and related techniques.\nTo load prostate cancer data:\n. insheet using /// https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data, /// clear tab General demonstration # By default, lasso2 uses the lasso estimator (i.e., alpha(1)). Like Stata\u0026rsquo;s regress, lasso2 expects the dependent variable to be named first followed by a list of predictors.\n. lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45 Knot| ID Lambda s L1-Norm EBIC R-sq | Entered/removed ------+---------------------------------------------------------+---------------- 1| 1 163.62492 1 0.00000 31.41226 0.0000 | Added _cons. 2| 2 149.08894 2 0.06390 26.66962 0.0916 | Added lcavol. 3| 9 77.73509 3 0.40800 -12.63533 0.4221 | Added svi. 4| 11 64.53704 4 0.60174 -18.31145 0.4801 | Added lweight. 5| 21 25.45474 5 1.35340 -42.20238 0.6123 | Added pgg45. 6| 22 23.19341 6 1.39138 -38.93672 0.6175 | Added lbph. 
7| 29 12.09306 7 1.58269 -39.94418 0.6389 | Added age. 8| 35 6.92010 8 1.71689 -38.84649 0.6516 | Added gleason. 9| 41 3.95993 9 1.83346 -35.69248 0.6567 | Added lcp. Use 'long' option for full output. Type e.g. 'lasso2, lic(ebic)' to run the model selected by EBIC. lasso2 obtains the solution for a list of \\(\\lambda\\) values (see third column). As \\(\\lambda\\) decreases, predictors are added to the model. The last column on the right indicates which predictors enter or leave the active set. Plotting # We can plot the coefficient path using plotpath(). The option accepts lnlambda, lambda and norm, which tell lasso2 to plot the coefficient estimates against \\(\\ln\\lambda\\) , \\(\\lambda\\) or the \\(\\ell_1\\) -norm, respectively. For example, to plot the coefficients against \\(\\ln\\lambda\\) : . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, /// plotpath(lnlambda) /// plotopt(legend(off)) /// plotlabel plotopt() allows us to specify additional plotting options that are passed on to Stata\u0026rsquo;s line command. In the example, legend(off) is used to suppress the legend. plotlabel triggers the display of variable labels next to the lines. The resulting graph looks as follows: ![](/img/plotpath.png#center) To plot the coefficients against the \\(\\ell_1\\) -norm, we can use: . lasso2, plotpath(norm) /// plotlabel /// plotopt(legend(off)) By omitting the variable names (before the comma), we make use of the replay syntax. lasso2 uses the previous lasso2 results, which avoids time-consuming re-estimation. This only works if lasso2 results are in memory. 
This creates the following graph: ![](/img/plotpath_norm.png#center) The lambda() option # We can use the lambda() option to estimate the model that corresponds to a specific value of \\(\\lambda\\) . . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, /// lambda(10) --------------------------------------------------- Selected | Lasso Post-est OLS ------------------+-------------------------------- lcavol | 0.5000819 0.5234981 lweight | 0.5144276 0.6152349 age | -0.0036627 -0.0190343 lbph | 0.0468469 0.0954908 svi | 0.5695171 0.6358643 pgg45 | 0.0017981 0.0035248 ------------------+-------------------------------- Partialled-out*| ------------------+-------------------------------- _cons | -0.0014767 0.5214696 --------------------------------------------------- The output shows the lasso and post-estimation OLS estimates corresponding to \\(\\lambda=10\\) . However, note that there is no justification for using \\(\\lambda=10\\) over any other positive value. 
To pick the \u0026ldquo;optimal\u0026rdquo; value for \\(\\lambda\\) , we can use cross-validation (see cvlasso), theory-driven penalization (rlasso) or information criteria as discussed below. Estimators # The default estimator of lasso2 is the lasso, which corresponds to alpha(1). The alpha() option controls the elastic net parameter, which determines the relative contribution of \\(\\ell_1\\) (lasso-type) to \\(\\ell_2\\) (ridge-type) penalization. For ridge regression, use alpha(0): . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, /// alpha(0) /// lambda(500) --------------------------------------------------- Selected | Ridge Post-est OLS ------------------+-------------------------------- lcavol | 0.1497346 0.5643413 lweight | 0.2497274 0.6220198 age | 0.0016636 -0.0212482 lbph | 0.0294061 0.0967125 svi | 0.2913161 0.7616733 lcp | 0.0687956 -0.1060509 gleason | 0.0771692 0.0492279 pgg45 | 0.0023278 0.0044575 ------------------+-------------------------------- Partialled-out*| ------------------+-------------------------------- _cons | 0.6322244 0.1815609 --------------------------------------------------- In contrast to the lasso, the ridge includes all predictors in the model. It does not perform variable selection, even if \\(\\lambda\\) is large, whereas the elastic net can yield sparse solutions even when \\(\\alpha\\) is small: . 
lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, /// alpha(0.1) /// lambda(500) --------------------------------------------------- Selected | Elastic net Post-est OLS | (alpha=0.100) ------------------+-------------------------------- lcavol | 0.1222143 0.5378499 lweight | 0.1236962 0.6620155 svi | 0.1854247 0.6991923 lcp | 0.0424339 -0.0813594 gleason | 0.0116789 0.0322875 pgg45 | 0.0008686 0.0036868 ------------------+-------------------------------- Partialled-out*| ------------------+-------------------------------- _cons | 1.7319371 -1.1240093 --------------------------------------------------- Lastly, to employ the square-root lasso, use the sqrt option: . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, /// sqrt /// lambda(20) Note that the alpha() and sqrt options are incompatible. Information criteria # To select the \u0026ldquo;best\u0026rdquo; value of lambda, lasso2 offers four information criteria: - the Akaike Information Criterion (\\(AIC\\) ), - the Bayesian Information Criterion (\\(BIC\\) ), - the Extended Bayesian Information Criterion (\\(EBIC\\) ), - and the corrected Akaike Information Criterion (\\(AICc\\) ). By default, the \\(EBIC\\) is shown in the output. To estimate the model corresponding to the minimum information criterion, the lic() option is used, which accepts aic, bic, ebic and aicc. lic() can either be specified in the first lasso2 call or using the replay syntax (to avoid re-estimation). . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, /// lic(aic) The same can be achieved in two steps using the replay syntax: . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45 . lasso2, lic(aic) Use lambda=7.594796178345335 (selected by AIC). 
--------------------------------------------------- Selected | Lasso Post-est OLS ------------------+-------------------------------- lcavol | 0.5057140 0.5234981 lweight | 0.5386738 0.6152349 age | -0.0073599 -0.0190343 lbph | 0.0585468 0.0954908 svi | 0.5854749 0.6358643 pgg45 | 0.0022134 0.0035248 ------------------+-------------------------------- Partialled-out*| ------------------+-------------------------------- _cons | 0.1243026 0.5214696 --------------------------------------------------- As indicated in the output, the AIC selects lambda=7.59. Predicted values # The predict post-estimation command can be used to obtain residuals and predicted values. In the output below, xbhat1 is generated by re-estimating the model for \\(\\lambda=10\\) . The noisily option triggers the display of the estimation results. xbhat2 is generated by linear approximation using the two beta estimates closest to lambda=10. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45 . cap drop xbhat1 . predict double xbhat1, xb l(10) noisily . cap drop xbhat2 . predict double xbhat2, xb l(10) approx Alternatively, we can explicitly run the model using lambda(10). If lasso2 is called with a scalar \\(\\lambda\\) value, the subsequent predict command requires no lambda() option. . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, /// lambda(10) . cap drop xbhat3 . predict double xbhat3, xb All three methods yield the same results. However, note that the linear approximation is only exact for the lasso, which is piecewise linear. Adaptive lasso # The adaptive lasso relies on an initial estimator to calculate the penalty loadings. By default, lasso2 uses OLS as the initial estimator, as originally suggested by Zou (2006). If the number of parameters exceeds the number of observations, univariate OLS is used; see Huang et al. (2008). . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, /// adaptive . mat list e(Ups) See the OLS estimates for comparison. . reg lpsa lcavol lweight age lbph svi lcp gleason pgg45 Other initial estimators such as ridge regression are possible: . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, /// l(10) alpha(0) . mat bhat_ridge = e(b) . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, adaptive adaloadings(bhat_ridge) . mat list e(Ups) More # Please check the help file for more information and examples. . 
help lasso2 "},{"id":13,"href":"/docs/pystacked/getting_started/","title":"Getting started","section":"PYSTACKED","content":" Getting started # Before we get into stacking, let\u0026rsquo;s first use pystacked as a \u0026ldquo;regular\u0026rdquo; program for machine learning.\nGradient boosting # We load the example data set and randomly split the data into a training/test sample.\n. clear all . insheet using /// https://statalasso.github.io/dta/housing.csv, /// clear comma . set seed 789 . gen train=runiform() . replace train=train\u0026gt;.75 As an example, we run pystacked with gradient boosting:\n. pystacked medv crim-lstat if train, /// type(regress) method(gradboost) Single base learner: no stacking done. Stacking weights: --------------------------------------- Method | Weight -----------------+--------------------- gradboost | 1.0000000 Python seed\nSince pystacked uses Python, we also need to take the Python seed into account to ensure replicability. pystacked does this automatically for you. The default (pyseed(-1)) draws a number between 0 and 10^8 in Stata which is then used as a Python seed. This way, you only need to deal with the Stata seed. For example, set seed 42 is sufficient, as the Python seed is generated automatically. We use the type option to indicate that we consider a regression task rather than a classification task. method(gradboost) selects gradient boosting with regression trees. We will later see that we can specify more than one learner in method().\nThe output isn\u0026rsquo;t particularly informative since\u0026ndash;for now\u0026ndash;we only consider one method. Yet, pystacked has fitted 100 boosted trees in the background. We can use predict in the Stata-typical way to obtain predicted values:\n. predict double ygb if !train Options # We can pass options to scikit-learn and fine-tune our gradient booster. For this, we use the cmdopt1() option. We need to use cmdopt1() because gradboost is the first (and only) method we are using with pystacked.\nThe options of each base learner are listed when you type:\n. 
_pyparse, type(reg) method(gradboost) print Default options: loss(squared_error) learning_rate(.1) n_estimators(100) subsample(1) criterion(friedman_mse) min_samples_split(2) min_samples_leaf(1) min_weight_fraction_leaf(0) max_depth(3) min_impurity_decrease(0) init(None) random_state(rng) max_features(None) alpha(.9) max_leaf_nodes(None) warm_start(False) validation_fraction(.1) n_iter_no_change(None) tol(.0001) ccp_alpha(0) Check the scikit-learn documentation here to get more information about each option.\nscikit-learn documentation\nPlease study the scikit-learn documentation carefully when using pystacked. For demonstration, we try two things: (1) we reduce the learning rate from 0.1 to 0.01, and (2) we reduce the learning rate and, in addition, increase the number of trees to 1000.\n. pystacked medv crim-lstat if train, /// type(regress) method(gradboost) /// cmdopt1(learning_rate(.01)) . predict double ygb2 if !train . pystacked medv crim-lstat if train, /// type(regress) method(gradboost) /// cmdopt1(learning_rate(.01) n_estimators(1000)) . predict double ygb3 if !train We can then compare the performance across the three models:\n. gen double rgb_sq=(medv-ygb)^2 . gen double rgb2_sq=(medv-ygb2)^2 . gen double rgb3_sq=(medv-ygb3)^2 . sum rgb* if !train Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- rgb_sq | 361 14.19902 38.15371 .0000138 469.2815 rgb2_sq | 361 28.66256 50.82527 .0019822 437.9858 rgb3_sq | 361 13.38564 38.00065 8.70e-07 546.1962 Reducing the learning rate without increasing the number of trees deteriorates prediction performance; if we also increase the number of trees, prediction performance improves slightly relative to the default settings.\nPipelines # We can make use of pipelines to pre-process our predictors. This will become especially useful in the context of stacking when we want to pass 2nd-order polynomials to one method, but not the other. Here, we use lasso with and without the poly2 pipeline:\n. pystacked medv crim-lstat if train, /// type(regress) method(lassocv) . predict double ylasso1 if !train . pystacked medv crim-lstat if train, /// type(regress) method(lassocv) /// pipe1(poly2) . predict double ylasso2 if !train . gen double rlasso1_sq=(medv-ylasso1)^2 . gen double rlasso2_sq=(medv-ylasso2)^2 . sum *sq if !train Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- rgb_sq | 361 14.19902 38.15371 .0000138 469.2815 rgb2_sq | 361 28.66256 50.82527 .0019822 437.9858 rgb3_sq | 361 13.38564 38.00065 8.70e-07 546.1962 rlasso1_sq | 361 28.03915 76.37714 .0006696 897.5598 rlasso2_sq | 361 22.85806 76.73137 4.81e-06 952.1228 The interactions and squared terms improve the performance of the lasso. However, gradient boosting performs much better in this application.
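Looking ahead to stacking, the two learners could in principle be combined in a single pystacked call. The following is only a sketch; it assumes that the numbered options pipe1(), cmdopt1(), pipe2(), cmdopt2(), etc. refer to the first and second entry listed in method(), in line with the single-learner examples above:\n. pystacked medv crim-lstat if train, /// type(regress) method(lassocv gradboost) /// pipe1(poly2) /// cmdopt2(learning_rate(.01) n_estimators(1000)) . predict double ystacked if !train pystacked would then report the stacking weight assigned to each base learner; stacking itself is covered in the stacking part of the pystacked documentation.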
\n"},{"id":14,"href":"/docs/lassopack/help/rlasso_help/","title":"help rlasso","section":"Help files","content":" ---------------------------------------------------------------------------------------------------------------------------------- help rlasso lassopack v1.4.2 ---------------------------------------------------------------------------------------------------------------------------------- Title rlasso -- Program for lasso and sqrt-lasso estimation with data-driven penalization Syntax rlasso depvar regressors [weight] [if exp] [in range] [ , sqrt partial(varlist) pnotpen(varlist) psolver(string) norecover noconstant fe noftools robust cluster(varlist) bw(int) kernel(string) center xdependent numsim(int) prestd tolopt(real) tolpsi(real) tolzero(real) maxiter(int) maxpsiiter(int) maxabsx lassopsi corrnumber(int) lalternative gamma(real) maq c(real) c0(real) supscore ssnumsim(int) testonly seed(real) displayall postall ols verbose vverbose ] Note: the fe option will take advantage of the ftools package (if installed) for the fixed-effects transform; the speed gains using this package can be large. See help ftools or click on ssc install ftools to install. General options Description ---------------------------------------------------------------------------------------------------------------------------- sqrt use sqrt-lasso (default is standard lasso) noconstant suppress constant from regression (cannot be used with aweights or pweights) fe fixed-effects model (requires data to be xtset) noftools do not use FTOOLS package for fixed-effects transform (slower; rarely used) partial(varlist) variables partialled-out prior to lasso estimation, including the constant (if present); to partial-out just the constant, specify partial(_cons) pnotpen(varlist) variables not penalized by lasso psolver(string) override default solver used for partialling out (one of: qr, qrxx, lu, luxx, svd, svdxx, chol; default=qrxx) norecover suppress recovery of partialled out variables after estimation. 
robust lasso penalty loadings account for heteroskedasticity cluster(varlist) lasso penalty loadings account for clustering; both standard (1-way) and 2-way clustering supported bw(int) lasso penalty loadings account for autocorrelation (AC) using bandwidth int; use with robust to account for both heteroskedasticity and autocorrelation (HAC) kernel(string) kernel used for HAC/AC penalty loadings (one of: bartlett, truncated, parzen, thann, thamm, daniell, tent, qs; default=bartlett) center center moments in heteroskedastic and cluster-robust loadings lassopsi use lasso or sqrt-lasso residuals to obtain penalty loadings (psi) (default is post-lasso) corrnumber(int) number of high-correlation regressors used to obtain initial residuals; default=5; if =0, then depvar is used in place of residuals prestd standardize data prior to estimation (default is standardize during estimation via penalty loadings) seed(real) set Stata's random number seed prior to xdep and supscore simulations (default=leave state unchanged) Lambda Description ---------------------------------------------------------------------------------------------------------------------------- xdependent penalty level is estimated depending on X numsim(int) number of simulations used for the X-dependent case (default=5000) lalternative alternative (less sharp) lambda0 = 2c*sqrt(N)*sqrt(2*log(2*p/gamma)) (sqrt-lasso = replace 2c with c) gamma(real) \"gamma\" in lambda0 function (default = 0.1/log(N); cluster-lasso = 0.1/log(N_clust)) maq (HAC/AC with truncated kernel only) \"gamma\" in lambda0 function = 0.1/log(N/(bw+1)); mimics cluster-robust c(real) \"c\" in lambda0 function (default = 1.1) c0(real) (rarely used) \"c\" in lambda0 function in first iteration only when iterating to obtain penalty loadings (default = 1.1) Optimization Description ---------------------------------------------------------------------------------------------------------------------------- tolopt(real) tolerance for lasso shooting algorithm (default=1e-10) tolpsi(real) tolerance for penalty loadings algorithm (default=1e-4) tolzero(real) minimum below which coeffs are rounded down to zero (default=1e-4) maxiter(int) maximum number of iterations for the lasso shooting algorithm (default=10k) maxpsiiter(int) maximum number of lasso-based iterations for penalty loadings (psi) algorithm (default=2) maxabsx (sqrt-lasso only) use max(abs(x_ij)) as initial penalty loadings as per Belloni et al. 
(2014) Sup-score test Description ---------------------------------------------------------------------------------------------------------------------------- supscore report sup-score test of statistical significance testonly report only sup-score test; do not estimate lasso regression ssgamma(real) test level for conservative critical value for the sup-score test (default = 0.05, i.e., 5% significance level) ssnumsim(int) number of simulations for sup-score test multiplier bootstrap (default=500; 0 =\u0026gt; do not simulate) Display and post Description ---------------------------------------------------------------------------------------------------------------------------- displayall display full coefficient vectors including unselected variables (default: display only selected, unpenalized and partialled-out) postall post full coefficient vector including unselected variables in e(b) (default: e(b) has only selected, unpenalized and partialled-out) ols post OLS coefs using lasso-selected variables in e(b) (default is lasso coefs) verbose show additional output vverbose show even more output dots show dots corresponding to repetitions in simulations (xdep and supscore) ---------------------------------------------------------------------------------------------------------------------------- Postestimation: predict [type] newvar [if] [in] [ , xb u e ue xbu resid lasso noisily ols ] predict is not currently supported after fixed-effects estimation. Options Description ---------------------------------------------------------------------------------------------------------------------------- xb generate fitted values (default) residuals generate residuals e generate overall error component e(it). Only after fe. ue generate combined residuals, i.e., u(i) + e(it). Only after fe. xbu prediction including fixed effect, i.e., a + xb + u(i). Only after fe. u fixed effect, i.e., u(i). Only after fe. noisily displays beta used for prediction. lasso use lasso coefficients for prediction (default is posted e(b) matrix) ols use OLS coefficients based on lasso-selected variables for prediction (default is posted e(b) matrix) ---------------------------------------------------------------------------------------------------------------------------- Replay: rlasso [ , displayall ] Options Description ---------------------------------------------------------------------------------------------------------------------------- displayall display full coefficient vectors including unselected variables (default: display only selected, unpenalized and partialled-out) ---------------------------------------------------------------------------------------------------------------------------- rlasso may be used with time-series or panel data, in which case the data must be tsset or xtset first; see help tsset or xtset. aweights and pweights are supported; see help weights. pweights is equivalent to aweights + robust. All varlists may contain time-series operators or factor variables; see help varlist. Contents Description Estimation methods Penalty loadings Sup-score test of joint significance Computational notes Miscellaneous Version notes Examples of usage Saved results References Website Installation Acknowledgements Citation of lassopack Description rlasso is a routine for estimating the coefficients of a lasso or square-root lasso (sqrt-lasso) regression where the lasso penalization is data-dependent and where the number of regressors p may be large and possibly greater than the number of observations. 
The lasso (Least Absolute Shrinkage and Selection Operator, Tibshirani 1996) is a regression method that uses regularization and the L1 norm. rlasso implements a version of the lasso that allows for heteroskedastic and clustered errors; see Belloni et al. (2012, 2013, 2014, 2016). For an overview of rlasso and the theory behind it, see Ahrens et al. (2020). The default estimator implemented by rlasso is the lasso. An alternative that does not involve estimating the error variance is the square-root-lasso (sqrt-lasso) of Belloni et al. (2011, 2014), available with the sqrt option. The lasso and sqrt-lasso estimators achieve sparse solutions: of the full set of p predictors, typically most will have coefficients set to zero and only s\u0026lt;\u0026lt;p will be non-zero. The \"post-lasso\" estimator is OLS applied to the variables with non-zero lasso or sqrt-lasso coefficients, i.e., OLS using the variables selected by the lasso or sqrt-lasso. The lasso/sqrt-lasso and post-lasso coefficients are stored in e(beta) and e(betaOLS), respectively. By default, rlasso posts the lasso or sqrt-lasso coefficients in e(b). To post in e(b) the OLS coefficients based on lasso- or sqrt-lasso-selected variables, use the ols option. Estimation methods rlasso solves the following problem min 1/N RSS + lambda/N*||Psi*beta||_1, where RSS = sum(y(i)-x(i)'beta)^2 denotes the residual sum of squares; beta is a p-dimensional parameter vector; lambda is the overall penalty level; ||.||_1 denotes the L1-norm, i.e., sum_i(abs(a[i])); Psi is a p by p diagonal matrix of predictor-specific penalty loadings (note that rlasso treats Psi as a row vector); and N is the number of observations. If the option sqrt is specified, rlasso estimates the sqrt-lasso estimator, which is defined as the solution to: min sqrt(1/N*RSS) + lambda/N*||Psi*beta||_1. Note: the above lambda differs from the definition used in parts of the lasso and elastic net literature; see for example the R package glmnet by Friedman et al. (2010). The objective functions here follow the format of Belloni et al. (2011, 2012). Specifically, lambda(r)=2*N*lambda(GN) where lambda(r) is the penalty level used by rlasso and lambda(GN) is the penalty level used by glmnet. rlasso obtains the solutions to the lasso and sqrt-lasso using coordinate descent algorithms. The algorithm was first proposed by Fu (1998) for the lasso (then referred to as \"shooting\"). For further details of how the lasso and sqrt-lasso solutions are obtained, see lasso2. rlasso first estimates the lasso penalty level and then uses the coordinate descent algorithm to obtain the lasso coefficients. For the homoskedastic case, a single penalty level lambda is applied; in the heteroskedastic and cluster cases, the penalty loadings vary across regressors. The methods are discussed in detail in Belloni et al. (2012, 2013, 2014, 2016) and are described only briefly here. For a detailed discussion of an R implementation of rlasso, see Spindler et al. (2016). For compatibility with the wider lasso literature, the documentation here uses \"lambda\" to refer to the penalty level that, combined with the possibly regressor-specific penalty loadings, is used with the estimation algorithm to obtain the lasso coefficients. \"lambda0\" refers to the component of the overall lasso penalty level that does not depend on the error variance. Note that this terminology differs from that in the R implementation of rlasso by Spindler et al. (2016). 
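For instance, a brief sketch using the prostate cancer data from the examples further below (the stored results e(lambda) and e(lambda0) are listed under saved results): . rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45 . di e(lambda0) . di e(lambda) . di e(lambda)/e(lambda0) In the default homoskedastic lasso, the last ratio is the estimate of the error standard deviation by which lambda0 is scaled, as described next.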
The default lambda0 for the lasso is 2c*sqrt(N)*invnormal(1-gamma/(2p)), where p is the number of penalized regressors and c and gamma are constants with default values of 1.1 and 0.1/log(N), respectively. In the cluster-lasso (Belloni et al. 2016) the default gamma is 0.1/log(N_clust), where N_clust is the number of clusters (saved in e(N_clust)). The default lambda0s for the sqrt-lasso are the same except that 2c is replaced with c. The constant c\u0026gt;1.0 is a slack parameter; gamma controls the confidence level. The alternative formula lambda0 = 2c*sqrt(N)*sqrt(2*log(2p/gamma)) is available with the lalt option. The constants c and gamma can be set using the c(real) and gamma(real) options. The xdep option is another alternative that implements an \"X-dependent\" penalty level lambda0; see Belloni and Chernozhukov (2011) and Belloni et al. (2013) for discussion. The default lambda for the lasso in the i.i.d. case is lambda0*rmse, where rmse is an estimate of the standard deviation of the error. The sqrt-lasso differs from the standard lasso in that the penalty term lambda is pivotal in the homoskedastic case and does not depend on the error variance. The default for the sqrt-lasso in the i.i.d. case is lambda=lambda0=c*sqrt(N)*invnormal(1-gamma/(2*p)) (note the absence of the factor of \"2\" vs. the lasso lambda). Penalty loadings As is standard in the lasso literature, regressors are standardized to have unit variance. By default, standardization is achieved by incorporating the standard deviations of the regressors into the penalty loadings. In the default homoskedastic case, the penalty loadings are the vector of standard deviations of the regressors. The normalized penalty loadings are the penalty loadings normalized by the SDs of the regressors. In the homoskedastic case the normalized penalty loadings are a vector of 1s. rlasso saves the vector of penalty loadings, the vector of normalized penalty loadings, and the vector of SDs of the regressors X in e(.) macros. Penalty loadings are constructed after the partialling-out of unpenalized regressors and/or the FE (fixed-effects) transformation, if applicable. An alternative to partialling-out unpenalized regressors with the partial(varlist) option is to give them penalty loadings of zero with the pnotpen(varlist) option. By the Frisch-Waugh-Lovell Theorem for the lasso (Yamada 2017), the estimated lasso coefficients are the same in theory (but see below) whether the unpenalized regressors are partialled-out or given zero penalty loadings, so long as the same penalty loadings are used for the penalized regressors in both cases. Note that the calculation of the penalty loadings in both the partial(.) and pnotpen(.) cases involves adjustments for the partialled-out variables. This is different from the lasso2 handling of unpenalized variables specified in the lasso2 option notpen(.), where no such adjustment of the penalty loadings is made (and is why the two no-penalization options are named differently). Regressor-specific penalty loadings for the heteroskedastic and clustered cases are derived following the methods described in Belloni et al. (2012, 2013, 2014, 2015, 2016). 
The penalty loadings for the heteroskedastic-robust case have elements of the form sqrt[avg(x^2e^2)]/sqrt[avg(e^2)] where x is a (demeaned) regressor, e is the residual, and sqrt[avg(e^2)] is the root mean squared error; the normalized penalty loadings have elements sqrt[avg(x^2e^2)]/(sqrt[avg(x^2)]sqrt[avg(e^2)]) where the sqrt[avg(x^2)] in the denominator is SD(x), the standard deviation of x. This corresponds to the presentation of penalty loadings in Belloni et al. (2014; see Algorithm 1 but note that in their presentation, the predictors x are assumed already to be standardized). NB: in the presentation we use here, the penalty loadings for the lasso and sqrt-lasso are the same; what differs is the overall penalty term lambda. The cluster-robust case is similar to the heteroskedastic case except that the numerator sqrt[avg(x^2e^2)] in the heteroskedastic case is replaced by sqrt[avg(u_i^2)], where (using the notation of the Stata manual's discussion of the _robust command) u_i is the sum of x_ij*e_ij over the j members of cluster i; see Belloni et al. (2016). Again in the presentation used here, the cluster-lasso and cluster-sqrt-lasso penalty loadings are the same. The unit vector is again the benchmark for the standardized penalty loadings. NB: also following _robust, the denominator of avg(u_i^2) and Tbar is (N_clust-1). cluster(varname1 varname2) implements two-way cluster-robust penalty loadings (Cameron et al. 2011; Thompson 2011). \"Two-way cluster-robust\" means the penalty loadings accommodate arbitrary within-group correlation in two distinct non-nested categories defined by varname1 and varname2. Note that the asymptotic justification for the two-way cluster-robust approach requires both dimensions to be \"large\" (go off to infinity). Autocorrelation-consistent (AC) and heteroskedastic and autocorrelation-consistent (HAC) penalty loadings can be obtained by using the bw(int) option on its own (AC) or in combination with the robust option (HAC), where int specifies the bandwidth; see Chernozhukov et al. (2018, 2020) and Ahrens et al. (2020). Syntax and usage follow that used by ivreg2; see the ivreg2 help file for details. The default is to use the Bartlett kernel; this can be changed using the kernel option. The full list of kernels available is (abbreviations in parentheses): Bartlett (bar); Truncated (tru); Parzen (par); Tukey-Hanning (thann); Tukey-Hamming (thamm); Daniell (dan); Tent (ten); and Quadratic-Spectral (qua or qs). AC and HAC penalty loadings can also be used for (large T) panel data; this requires the dataset to be xtset. Note that for some kernels it is possible in finite samples to obtain negative variances and hence undefined penalty loadings; the same is true of two-way cluster-robust loadings. Intuitively, this arises because the covariance term in a calculation like var+var-2cov is \"too big\". When this happens, rlasso issues a warning and (arbitrarily) replaces 2cov with cov. The center option centers the x_ij*e_ij terms (or in the cluster-lasso case, the u_i terms) prior to calculating the penalty loadings. Sup-score test of joint significance rlasso with the supscore option reports a test of the null hypothesis H0: beta_1 = ... = beta_p = 0, i.e., a test of the joint significance of the regressors (or, alternatively, a test that H0: s=0; of the full set of p regressors, none is in the true model). The test follows Chernozhukov et al. (2013, Appendix M); see also Belloni et al. (2012, 2013). 
(The variables are assumed to be rescaled to be centered and with unit variance.) If the null hypothesis is correct and the rest of the model is well-specified (including the assumption that the regressors are orthogonal to the disturbance e), then E(e*x_j) = E((y-beta_0)*x_j) = 0, j=1...p where beta_0 is the intercept. The sup-score statistic is S=sqrt(N)*max_j(abs(avg((y-b_0)*x_j))/(sqrt(avg(((y-b_0)*x_j)^2)))), where: (a) the numerator abs(avg((y-b_0)*x_j)) is the absolute value of the average score for regressor x_j and b_0 is the sample mean of y; (b) the denominator sqrt(avg(((y-b_0)*x_j)^2)) is the sample standard deviation of the score; (c) the statistic is sqrt(N) times the maximum across the p regressors of the ratio of (a) to (b). The p-value for the sup-score test is obtained by a multiplier bootstrap procedure simulating the statistic W, defined as W=sqrt(N)*max_j(abs(avg((y-b_0)*x_j*u))/(sqrt(avg(((y-b_0)*x_j)^2)))) where u is an iid standard normal variate independent of the data. The ssnumsim(int) option controls the number of simulated draws (default=500); ssnumsim(0) requests that the sup-score statistic be reported without a simulation-based p-value. rlasso also reports a conservative critical value (asymptotic bound) as per Belloni et al. (2012, 2013), defined as c*invnormal(1-gamma/(2p)); this can be set by the ssgamma(real) option (default = 0.05). Computational notes A computational alternative to the default of standardizing \"on the fly\" (i.e., incorporating the standardization into the lasso penalty loadings) is to standardize all variables to have unit variance prior to computing the lasso coefficients. This can be done using the prestd option. The results are equivalent in theory. The prestd option can lead to improved numerical precision or more stable results in the case of difficult problems; the cost is the (typically small) computation time required to standardize the data. Either the partial(varlist) option or the pnotpen(varlist) option can be used for variables that should not be penalized by the lasso. The options are equivalent in theory (see above), but numerical results can differ in practice because of the different calculation methods used. Partialling-out variables can lead to improved numerical precision or more stable results in the case of difficult problems vs. specifying the variables as unpenalized, but may be slower in terms of computation time. Both the partial(varlist) and pnotpen(varlist) options use least squares. This is implemented in Mata using one of Mata's solvers. In cases where the variables to be partialled out are collinear or nearly so, different solvers may generate different results. Users may wish to check the stability of their results in such cases. The psolver(.) option can be used to specify the Mata solver used. The default behavior of rlasso when solving AX=B for X is to use the QR decomposition applied to (A'A) and (A'B), i.e., qrsolve((A'A),(A'B)), abbreviated qrxx. Available options are qr, qrxx, lu, luxx, svd, svdxx, where, e.g., svd indicates using svsolve(A,B) and svdxx indicates using svsolve((A'A),(A'B)). rlasso will warn if collinear variables are dropped when partialling out. By default the constant (if present) is not penalized if there are no regressors being partialled out; this is equivalent to mean-centering prior to estimation. The exception to this is if aweights or pweights are specified, in which case the constant is partialled-out. 
The partial(varlist) option will automatically also partial out the constant (if present); to partial out just the constant, specify partial(_cons). The within transformation implemented by the fe option automatically mean-centers the data; the nocons option is redundant in this case and may not be specified with this option. The prestd and pnotpen(varlist) vs. partial(varlist) options can be used as simple checks for numerical stability by comparing results that should be equivalent in theory. If the results differ, the values of the minimized objective functions (e(pmse) or e(prmse)) can be compared. The fe fixed-effects option is equivalent to (but computationally faster and more accurate than) specifying unpenalized panel-specific dummies. The fixed-effects (\"within\") transformation also removes the constant as well as the fixed effects. The panel variable used by the fe option is the panel variable set by xtset. To use weights with fixed effects, the ftools package must be installed. Miscellaneous By default rlasso reports only the set of selected variables and their lasso and post-lasso coefficients; the omitted coefficients are not reported in the regression output. The postall and displayall options allow the full coefficient vector (with coefficients of unselected variables set to zero) to be either posted in e(b) or displayed as output. rlasso, like the lasso in general, accommodates possibly perfectly-collinear sets of regressors. Stata's factor variables are supported by rlasso (as well as by lasso2). Users therefore have the option of specifying as regressors one or more complete sets of factor variables or interactions with no base levels using the ibn prefix. This can be interpreted as allowing rlasso to choose the members of the base category. The choice of whether to use partial(varlist) or pnotpen(varlist) will depend on the circumstances faced by the user. The partial(varlist) option can be helpful in dealing with data that have scaling problems or collinearity issues; in these cases it can be more accurate and/or achieve convergence faster than the pnotpen(varlist) option. The pnotpen(varlist) option will sometimes be faster because it avoids the pre-estimation transformation employed by partial(varlist). The two options can be used simultaneously (but not for the same variables). The treatment of standardization, penalization and partialling-out in rlasso differs from that of lasso2. In the rlasso treatment, standardization incorporates the partialling-out of regressors listed in the pnotpen(varlist) list as well as those in the partial(varlist) list. This is in order to maintain the equivalence of the lasso estimator irrespective of which option is used for unpenalized variables (see the discussion of the Frisch-Waugh-Lovell Theorem for the lasso above). In the lasso2 treatment, standardization takes place after the partialling-out of only the regressors listed in the notpen(varlist) option. In other words, rlasso adjusts the penalty loadings for any unpenalized variables; lasso2 does not. For further details, see lasso2. The initial overhead for fixed-effects estimation and/or partialling out and/or pre-estimation standardization (creating temporary variables and then transforming the data) can be noticeable for large datasets. For problems that involve looping over data, users may wish to first transform the data by hand. 
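To illustrate the equivalence of partial(varlist) and pnotpen(varlist) discussed above, a brief sketch using the prostate cancer data from the examples below (age is made unpenalized here purely for illustration): . rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, partial(age) . di e(pmse) . rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, pnotpen(age) . di e(pmse) In theory the two minimized objective functions and the lasso coefficients of the penalized regressors coincide; a noticeable discrepancy points to numerical issues that may be addressed with, e.g., the prestd or psolver(.) options.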
If a small number of correlations is set using the corrnum(int) option, users may want to increase the number of penalty loadings iterations from the default of 2 to something higher using the maxpsiiter(int) option. The sup-score p-value is obtained by simulation, which can be time-consuming for large datasets. To skip this and use only the conservative (asymptotic bound) critical value, set the number of simulations to zero with the ssnumsim(0) option. Version notes Detailed version notes can be found inside the ado files rlasso.ado and lassoutils.ado. Noteworthy changes appear below. In versions of lassoutils prior to 1.1.01 (8 Nov 2018), the very first iteration to obtain penalty loadings set the constant c=0.55. This was dropped in version 1.1.01, and the constant c is unchanged in all iterations. To replicate the previous behavior of rlasso, use the c0(real) option. For example, with the default value of c=1.1, to replicate the earlier behavior use c0(0.55). In versions of lassoutils prior to 1.1.01 (8 Nov 2018), the sup-score test statistic S was N*max_j rather than sqrt(N)*max_j as in Chernozhukov et al. (2013), and similarly for the simulated statistic W. Examples using prostate cancer data from Hastie et al. (2009) Load prostate cancer data. . clear . insheet using https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data, tab Estimate lasso using data-driven lambda penalty; default homoskedasticity case. . rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45 Use square-root lasso instead. . rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, sqrt Illustrate relationships between lambda, lambda0 and penalty loadings: Basic usage: homoskedastic case, lasso . rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45 lambda=lambda0*SD is lasso penalty; incorporates the estimate of the error variance default lambda0 is 2c*sqrt(N)*invnormal(1-gamma/(2*p)) . di e(lambda) . di e(lambda0) In the homoskedastic case, penalty loadings are the vector of SDs of penalized regressors . mat list e(ePsi) ...and the standardized penalty loadings are a vector of 1s. . mat list e(sPsi) Heteroskedastic case, lasso . rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, robust lambda and lambda0 are the same as for the homoskedastic case . di e(lambda) . di e(lambda0) Penalty loadings account for heteroskedasticity as well as incorporating SD(x) . mat list e(ePsi) ...and the standardized penalty loadings are not a vector of 1s. . mat list e(sPsi) Homoskedastic case, sqrt-lasso . rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, sqrt with the sqrt-lasso, the default lambda=lambda0=c*sqrt(N)*invnormal(1-gamma/(2*p)); note the difference by a factor of 2 vs. the standard lasso lambda0 . di e(lambda) . di e(lambda0) rlasso vs. lasso2 (if installed) . rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45 lambda=lambda0*SD is lasso penalty; incorporates the estimate of the error variance default lambda0 is 2c*sqrt(N)*invnormal(1-gamma/(2*p)) . di %8.5f e(lambda) Replicate rlasso estimates using rlasso lambda and lasso2 . lasso2 lpsa lcavol lweight age lbph svi lcp gleason pgg45, lambda(44.34953) Examples using data from Acemoglu-Johnson-Robinson (2001) Load and reorder AJR data for Table 6 and Table 8 (datasets need to be in current directory). . clear . (click to download maketable6.zip from economics.mit.edu) . unzipfile maketable6 . (click to download maketable8.zip from economics.mit.edu) . unzipfile maketable8 . use maketable6 . 
merge 1:1 shortnam using maketable8 . keep if baseco==1 . order shortnam logpgp95 avexpr lat_abst logem4 edes1975 avelf, first . order indtime euro1900 democ1 cons1 democ00a cons00a, last Alternatively, load AJR data from our website (no manual download required): . clear . use https://statalasso.github.io/dta/AJR.dta Basic usage: . rlasso logpgp95 lat_abst edes1975 avelf temp* humid* steplow-oilres Heteroskedastic-robust penalty loadings: . rlasso logpgp95 lat_abst edes1975 avelf temp* humid* steplow-oilres, robust Partialling-out vs. non-penalization: . rlasso logpgp95 lat_abst edes1975 avelf temp* humid* steplow-oilres, partial(lat_abst) . rlasso logpgp95 lat_abst edes1975 avelf temp* humid* steplow-oilres, pnotpen(lat_abst) Request sup-score test (H0: all betas=0): . rlasso logpgp95 lat_abst edes1975 avelf temp* humid* steplow-oilres, supscore Examples using data from Angrist-Krueger (1991) Load AK data and rename variables (dataset needs to be in current directory). NB: this is a large dataset (330k observations) and estimations may take some time to run on some installations. . clear . (click to download asciiqob.zip from economics.mit.edu) . unzipfile asciiqob.zip . infix lnwage 1-9 edu 10-20 yob 21-31 qob 32-42 pob 43-53 using asciiqob.txt Alternatively, get data from our website source (no unzipping needed): . use https://statalasso.github.io/dta/AK91.dta xtset data by place of birth (state): . xtset pob State (place of birth) fixed effects; regressors are year of birth, quarter of birth and QOBxYOB. . rlasso edu i.yob# #i.qob, fe As above but explicit penalized state dummies and all categories (no base category) for all factor vars. Note that the (unpenalized) constant is reported. . rlasso edu ibn.yob# #ibn.qob ibn.pob State fixed effects; regressors are YOB, QOB and QOBxYOB; cluster on state. . rlasso edu i.yob# #i.qob, fe cluster(pob) Example using data from Belloni et al. (2015) Load dataset on eminent domain (available at journal website). . clear . import excel using CSExampleData.xlsx, first Settings used in Belloni et al. (2015) - results as in text discussion (p=147): . rlasso NumProCase Z* BA BL DF, robust lalt corrnum(0) maxpsiiter(100) c0(0.55) . di e(p) Settings used in Belloni et al. (2015) - results as in journal replication file (p=144): . rlasso NumProCase Z*, robust lalt corrnum(0) maxpsiiter(100) c0(0.55) . di e(p) Examples illustrating AC/HAC penalty loadingss . use http://fmwww.bc.edu/ec-p/data/wooldridge/phillips.dta . tsset year, yearly Autocorrelation-consistent (AC) penalty loadings; bandwidth=3; default kernel is Bartlett. . rlasso cinf L(0/10).unem, bw(3) Heteroskedastic- and autocorrelation-consistent (HAC) penalty loadings; bandwidth=5; kernel is quadratic-spectral. . 
rlasso cinf L(0/10).unem, bw(5) rob kernel(qs) Saved results rlasso saves the following in e(): scalars e(N) sample size e(N_clust) number of clusters in cluster-robust estimation; in the case of 2-way cluster-robust, e(N_clust)=min(e(N_clust1),e(N_clust2)) e(N_g) number of groups in fixed-effects model e(p) number of penalized regressors in model e(s) number of selected regressors e(s0) number of selected and unpenalized regressors including constant (if present) e(lambda0) penalty level excluding rmse (default = 2c*sqrt(N)*invnormal(1-gamma/(2*p))) e(lambda) lasso: penalty level including rmse (=lambda0*rmse); sqrt-lasso: lambda=lambda0 e(slambda) standardized lambda; equiv to lambda used on standardized data; lasso: slambda=lambda/SD(depvar); sqrt-lasso: slambda=lambda0 e(c) parameter in penalty level lambda e(gamma) parameter in penalty level lambda e(niter) number of iterations for shooting algorithm e(maxiter) max number of iterations for shooting algorithm e(npsiiter) number of iterations for loadings algorithm e(maxpsiiter) max iterations for loadings algorithm e(r2) R-sq for lasso estimation e(rmse) rmse using lasso resduals e(rmseOLS) rmse using post-lasso residuals e(pmse) minimized objective function (penalized mse, standard lasso only) e(prmse) minimized objective function (penalized rmse, sqrt-lasso only) e(cons) =1 if constant in model, =0 otherwise e(fe) =1 if fixed-effects model, =0 otherwise e(center) =1 if moments have been centered e(bw) (HAC/AC only) bandwidth used e(supscore) sup-score statistic e(supscore_p) sup-score p-value e(supscore_cv) sup-score critical value (asymptotic bound) macros e(cmd) rlasso e(cmdline) command line e(depvar) name of dependent variable e(varX) all regressors e(varXmodel) penalized regressors e(pnotpen) unpenalized regressors e(partial) partialled-out regressors e(selected) selected and penalized regressors e(selected0) all selected regressors including unpenalized and constant (if present) e(method) lasso or sqrt-lasso e(estimator) lasso, sqrt-lasso or post-lasso ols posted in e(b) e(robust) heteroskedastic-robust penalty loadings e(clustvar) variable defining clusters for cluster-robust penalty loadings; if two-way clustering is used, the variables are in e(clustvar1) and e(clustvar2) e(kernel) (HAC/AC only) kernel used e(ivar) variable defining groups for fixed-effects model matrices e(b) posted coefficient vector e(beta) lasso or sqrt-lasso coefficient vector e(betaOLS) post-lasso coefficient vector e(betaAll) full lasso or sqrt-lasso coefficient vector including omitted, factor base variables, etc. e(betaAllOLS) full post-lasso coefficient vector including omitted, factor base variables, etc. e(ePsi) estimated penalty loadings e(sPsi) standardized penalty loadings (vector of 1s in homoskedastic case functions e(sample) estimation sample References Acemoglu, D., Johnson, S. and Robinson, J.A. 2001. The colonial origins of comparative development: An empirical investigation. American Economic Review, 91(5):1369-1401. https://economics.mit.edu/files/4123 Ahrens, A., Aitkens, C., Dizten, J., Ersoy, E., Kohns, D. and M.E. Schaffer. 2020. A Theory-based Lasso for Time-Series Data. Invited paper for the International Conference of Econometrics of Vietnam, January 2020. Forthcoming in Studies in Computational Intelligence (Springer). Ahrens, A., Hansen, C.B. and M.E. Schaffer. 2020. lassopack: model selection and prediction with regularized regression in Stata. The Stata Journal, 20(1):176-235. 
https://journals.sagepub.com/doi/abs/10.1177/1536867X20909697. Working paper version: https://arxiv.org/abs/1901.05397. Angrist, J. and Kruger, A. 1991. Does compulsory school attendance affect schooling and earnings? Quarterly Journal of Economics 106(4):979-1014. http://www.jstor.org/stable/2937954 Belloni, A. and Chernozhukov, V. 2011. High-dimensional sparse econometric models: An introduction. In Alquier, P., Gautier E., and Stoltz, G. (eds.), Inverse problems and high-dimensional estimation. Lecture notes in statistics, vol. 203. Springer, Berlin, Heidelberg. https://arxiv.org/pdf/1106.5242.pdf Belloni, A., Chernozhukov, V. and Wang, L. 2011. Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 98:791-806. https://doi.org/10.1214/14-AOS1204 Belloni, A., Chen, D., Chernozhukov, V. and Hansen, C. 2012. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80(6):2369-2429. http://onlinelibrary.wiley.com/doi/10.3982/ECTA9626/abstract Belloni, A., Chernozhukov, V. and Hansen, C. 2013. Inference for high-dimensional sparse econometric models. In Advances in Economics and Econometrics: 10th World Congress, Vol. 3: Econometrics, Cambridge University Press: Cambridge, 245-295. http://arxiv.org/abs/1201.0220 Belloni, A., Chernozhukov, V. and Hansen, C. 2014. Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies 81:608-650. https://doi.org/10.1093/restud/rdt044 Belloni, A., Chernozhukov, V. and Hansen, C. 2015. High-dimensional methods and inference on structural and treatment effects. Journal of Economic Perspectives 28(2):29-50. http://www.aeaweb.org/articles.php?doi=10.1257/jep.28.2.29 Belloni, A., Chernozhukov, V., Hansen, C. and Kozbur, D. 2016. Inference in high dimensional panel models with an application to gun control. Journal of Business and Economic Statistics 34(4):590-605. http://amstat.tandfonline.com/doi/full/10.1080/07350015.2015.1102733 Belloni, A., Chernozhukov, V. and Wang, L. 2014. Pivotal estimation via square-root-lasso in nonparametric regression. Annals of Statistics 42(2):757-788. https://doi.org/10.1214/14-AOS1204 Chernozhukov, V., Chetverikov, D. and Kato, K. 2013. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Annals of Statistics 41(6):2786-2819. https://projecteuclid.org/euclid.aos/1387313390 Cameron, A.C., Gelbach, J.B. and D.L. Miller. Robust Inference with Multiway Clustering. Journal of Business \u0026amp; Economic Statistics 29(2):238-249. https://www.jstor.org/stable/25800796. Working paper version: NBER Technical Working Paper 327, http://www.nber.org/papers/t0327. Chernozhukov, V., Hardle, W.K., Huang, C. and W. Wang. 2018 (rev 2020). LASSO-driven inference in time and space. Working paper. https://arxiv.org/abs/1806.05081 Correia, S. 2016. FTOOLS: Stata module to provide alternatives to common Stata commands optimized for large datasets. https://ideas.repec.org/c/boc/bocode/s458213.html Friedman, J., Hastie, T., \u0026amp; Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1), 1\\9622. https://doi.org/10.18637/jss.v033.i01 Fu, W.J. 1998. Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics 7(3):397-416. http://www.tandfonline.com/doi/abs/10.1080/10618600.1998.10474784 Hastie, T., Tibshirani, R. and Friedman, J. 2009. 
The elements of statistical learning (2nd ed.). New York: Springer-Verlag. https://web.stanford.edu/~hastie/ElemStatLearn/ Spindler, M., Chernozhukov, V. and Hansen, C. 2016. High-dimensional metrics. https://cran.r-project.org/package=hdm. Thompson, S.B. 2011. Simple formulas for standard errors that cluster by both firm and time. Journal of Financial Economics 99(1):1-10. Working paper version: http://ssrn.com/abstract=914002. Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58(1):267-288. https://doi.org/10.2307/2346178 Yamada, H. 2017. The Frisch-Waugh-Lovell Theorem for the lasso and the ridge regression. Communications in Statistics - Theory and Methods 46(21):10897-10902. http://dx.doi.org/10.1080/03610926.2016.1252403 Website Please check our website https://statalasso.github.io/ for more information. Installation rlasso is part of the lassopack package. To get the latest stable version of lassopack from our website, check the installation instructions at https://statalasso.github.io/installation/. We update the stable website version more frequently than the SSC version. Earlier versions of lassopack are also available from the website. To verify that lassopack is correctly installed, click on or type whichpkg lassopack (which requires whichpkg to be installed; ssc install whichpkg). Acknowledgements Thanks to Alexandre Belloni for providing Matlab code for the square-root-lasso and to Sergio Correia for advice on the use of the FTOOLS package. Citation of rlasso rlasso is not an official Stata command. It is a free contribution to the research community, like a paper. Please cite it as such: Ahrens, A., Hansen, C.B., Schaffer, M.E. 2018 (updated 2020). LASSOPACK: Stata module for lasso, square-root lasso, elastic net, ridge, adaptive lasso estimation and cross-validation http://ideas.repec.org/c/boc/bocode/s458458.html Ahrens, A., Hansen, C.B. and M.E. Schaffer. 2020. lassopack: model selection and prediction with regularized regression in Stata. The Stata Journal, 20(1):176-235. https://journals.sagepub.com/doi/abs/10.1177/1536867X20909697. Working paper version: https://arxiv.org/abs/1901.05397. Authors Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland achim.ahrens@gess.ethz.ch Christian B. Hansen, University of Chicago, USA Christian.Hansen@chicagobooth.edu Mark E. Schaffer, Heriot-Watt University, UK m.e.schaffer@hw.ac.uk Also see Help: lasso2, cvlasso, lassologit, pdslasso, ivlasso (if installed) "},{"id":15,"href":"/docs/ddml/plm/","title":"Partial Linear Model (PLM)","section":"DDML","content":" Partially Linear Model # Preparations # We load the data, define global macros and set the seed.\n. use https://github.com/aahrens1/ddml/raw/master/data/sipp1991.dta, clear . global Y net_tfa . global D e401 . global X tw age inc fsize educ db marr twoearn pira hown . set seed 42 Step 1: Initialize DDML model # We next initialize the ddml estimation and select the model. partial refers to the partially linear model. The model will be stored on a Mata object with the default name \u0026ldquo;m0\u0026rdquo; unless otherwise specified using the mname(name) option.\nNumber of folds\nNote that we set the number of random folds to 2, so that the model runs quickly. The default is kfolds(5). We recommend to consider at least 5-10 folds and even more if your sample size is small. . 
ddml init partial, kfolds(2) Step 2: All machine learners # We add a supervised machine learners for estimating the conditional expectation \\(E[Y|X]\\) . We first add simple linear regression.\n. ddml E[Y|X]: reg $Y $X Learner Y1_reg added successfully. We can add more than one learner per reduced form equation. Here, we also add a random forest learner (implemented in pystacked).\n. ddml E[Y|X]: pystacked $Y $X, type(reg) method(rf) Learner Y2_pystacked added successfully. We do the same for the conditional expectation E[D|X].\n. ddml E[D|X]: reg $D $X Learner D1_reg added successfully. . ddml E[D|X]: pystacked $D $X, type(reg) method(rf) Learner D2_pystacked added successfully. Optionally, you can check if the learners have been added correctly.\n. ddml desc Model: partial, crossfit folds k=2, resamples r=1 Dependent variable (Y): net_tfa net_tfa learners: Y1_reg Y2_pystacked D equations (1): e401 e401 learners: D1_reg D2_pystacked Step 3: Cross-fitting # The learners are iteratively fitted on the training data. This step may take a while.\n. ddml crossfit Cross-fitting E[Y|X] equation: net_tfa Cross-fitting fold 1 2 ...completed cross-fitting Cross-fitting E[D|X] equation: e401 Cross-fitting fold 1 2 ...completed cross-fitting Step 4: Estimation # Finally, we obtain estimates of the coefficients of interest. Since we added two learners for each of our two reduced form equations, we get 4 point estimates. The result shown corresponds to the model with the lowest out-of-sample MSPE.\n. ddml estimate, robust DDML estimation results: spec r Y learner D learner b SE 1 1 Y1_reg D1_reg 5397.308(1130.901) 2 1 Y1_reg D2_pystacked 6707.514 (880.374) * 3 1 Y2_pystacked D1_reg 7044.822(1127.173) 4 1 Y2_pystacked D2_pystacked 6991.835 (755.805) * = minimum MSE specification for that resample. Min MSE DDML model, specification 3 y-E[y|X] = Y2_pystacked_1 Number of obs = 9915 D-E[D|X,Z]= D1_reg_1 ------------------------------------------------------------------------------ | Robust net_tfa | Coefficient std. err. z P\u0026gt;|z| [95% conf. interval] -------------+---------------------------------------------------------------- e401 | 7044.822 1127.173 6.25 0.000 4835.603 9254.042 ------------------------------------------------------------------------------ To retrieve the very first specification shown, you can type:\n. ddml estimate, robust spec(1) DDML estimation results: spec r Y learner D learner b SE 1 1 Y1_reg D1_reg 5397.308(1130.901) 2 1 Y1_reg D2_pystacked 6707.514 (880.374) * 3 1 Y2_pystacked D1_reg 7044.822(1127.173) 4 1 Y2_pystacked D2_pystacked 6991.835 (755.805) * = minimum MSE specification for that resample. DDML model, specification 1 y-E[y|X] = Y1_reg_1 Number of obs = 9915 D-E[D|X,Z]= D1_reg_1 ------------------------------------------------------------------------------ | Robust net_tfa | Coefficient std. err. z P\u0026gt;|z| [95% conf. interval] -------------+---------------------------------------------------------------- e401 | 5397.308 1130.901 4.77 0.000 3180.783 7613.834 ------------------------------------------------------------------------------ You could manually retrieve the same point estimate by typing:\n. reg Y1_reg D1_reg, nocons robust Linear regression Number of obs = 9,915 F(1, 9914) = 22.78 Prob \u0026gt; F = 0.0000 R-squared = 0.0037 Root MSE = 39626 ------------------------------------------------------------------------------ | Robust Y1_reg_1 | Coefficient std. err. t P\u0026gt;|t| [95% conf. 
interval] -------------+---------------------------------------------------------------- D1_reg_1 | 5397.308 1130.901 4.77 0.000 3180.512 7614.105 ------------------------------------------------------------------------------ or graphically:\n. twoway (scatter Y1_reg D1_reg) (lfit Y1_reg D1_reg) where Y1_reg and D1_reg are the orthogonalized versions of net_tfa and e401.\nTo describe the ddml model setup or results in detail, you can use ddml describe with the relevant option (sample, learners, crossfit, estimates), or just describe them all with the all option:\n. ddml describe, all "},{"id":16,"href":"/docs/ddml/plm2/","title":"PLM \u0026 Stacking","section":"DDML","content":" Partially linear model with Stacking # Stacking regression is a simple and powerful method for combining predictions from multiple learners. It is available in Stata via the pystacked package (see here). Below is an example with the partially linear model, but it can be used with any model supported by ddml.\nStep 1: Initialization # Preparation: use the data and globals as above. Use the name m1 for this new estimation, to distinguish it from the previous example that uses the default name m0. This enables having multiple estimations available for comparison. Also specify 5 cross-fitting repetitions.\n. set seed 42 . ddml init partial, kfolds(2) reps(5) mname(m1) Cross-fitting repetitions\nThe results of DDML depends on the exact cross-fit fold split. We recommend re-running the (final) model multiple times on different random folds; see options reps(integer). Step 2: Add learners # Add supervised machine learners for estimating conditional expectations. The first learner in the stacked ensemble is OLS. We also use cross-validated lasso, ridge and two random forests with different settings, which we save in the following macros:\n. global rflow max_features(5) min_samples_leaf(1) max_samples(.7) . global rfhigh max_features(5) min_samples_leaf(10) max_samples(.7) In each step, we add the mname(m1) option to ensure that the learners are not added to the m0 model which is still in memory. We also specify the names of the variables containing the estimated conditional expectations using the learner(varname) option. This avoids overwriting the variables created for the m0 model using default naming.\n. ddml E[Y|X], mname(m1) learner(Y_m1): pystacked $Y $X || /// \u0026gt; method(ols) || /// \u0026gt; method(lassocv) || /// \u0026gt; method(ridgecv) || /// \u0026gt; method(rf) opt($rflow) || /// \u0026gt; method(rf) opt($rfhigh), type(reg) Learner Y_m1 added successfully. . ddml E[D|X], mname(m1) learner(D_m1): pystacked $D $X || /// \u0026gt; method(ols) || /// \u0026gt; method(lassocv) || /// \u0026gt; method(ridgecv) || /// \u0026gt; method(rf) opt($rflow) || /// \u0026gt; method(rf) opt($rfhigh), type(reg) Learner D_m1 added successfully. Options\nNote: Options before \u0026ldquo;:\u0026rdquo; and after the first comma refer to ddml. Options that come after the final comma refer to the estimation command. Make sure to not confuse the two types of options. Check if learners were correctly added (output omitted):\n. ddml desc, mname(m1) learners Step 3/4: Cross-fitting and estimation # . ddml crossfit, mname(m1) Cross-fitting E[Y|X] equation: net_tfa Resample 1... Cross-fitting fold 1 2 ...completed cross-fitting Resample 2... Cross-fitting fold 1 2 ...completed cross-fitting Resample 3... Cross-fitting fold 1 2 ...completed cross-fitting Resample 4... Cross-fitting fold 1 2 ...completed cross-fitting Resample 5... 
Cross-fitting fold 1 2 ...completed cross-fitting Cross-fitting E[D|X] equation: e401 Resample 1... Cross-fitting fold 1 2 ...completed cross-fitting Resample 2... Cross-fitting fold 1 2 ...completed cross-fitting Resample 3... Cross-fitting fold 1 2 ...completed cross-fitting Resample 4... Cross-fitting fold 1 2 ...completed cross-fitting Resample 5... Cross-fitting fold 1 2 ...completed cross-fitting . ddml estimate, mname(m1) robust DDML estimation results: spec r Y learner D learner b SE * 1 1 Y_m1 D_m1 7386.929 (939.818) * 1 2 Y_m1 D_m1 6882.647 (901.494) * 1 3 Y_m1 D_m1 6532.074 (874.233) * 1 4 Y_m1 D_m1 6533.431 (948.284) * 1 5 Y_m1 D_m1 6671.850 (980.995) * = minimum MSE specification for that resample. Mean/med. Y learner D learner b SE mse mn [min-mse] [mse] 6801.386 (972.913) mse md [min-mse] [mse] 6671.850 (958.334) Median over min-mse specifications y-E[y|X] = Y_m1 Number of obs = 9915 D-E[D|X,Z]= D_m1 ------------------------------------------------------------------------------ | Robust net_tfa | Coefficient std. err. z P\u0026gt;|z| [95% conf. interval] -------------+---------------------------------------------------------------- e401 | 6671.85 958.3335 6.96 0.000 4793.55 8550.149 ------------------------------------------------------------------------------ Summary over 5 resamples: D eqn mean min p25 p50 p75 max e401 6801.3863 6532.0747 6533.4312 6671.8496 6882.6470 7386.9292 Examine the learner weights used by pystacked:\n. ddml extract, mname(m1) show(pystacked) "},{"id":17,"href":"/docs/ddml/","title":"DDML","section":"Docs","content":" DDML # The Stata package ddml implements Double/Debiased Machine Learning (DDML; Chernozhukov et al. 2018) for Stata. The three main features of the program:\nddml supports five different statistical models that allow to flexibly control for confounders: (1) the Partially Linear Model, (2) the Interactive Model (for binary treatment), (3) the Partially Linear IV Model, the (4) High-dimensional IV Model, and (5) the Interactive IV Model (for binary treatment and instrument).\nddml provides flexible multi-line syntax and short one-line syntax. The multi-line syntax offers a wide range of options, guides the user through the DDML algorithm step-by-step, and includes auxiliary programs for storing, loading and displaying additional information. We also provide a complementary one-line version called qddml (\u0026lsquo;quick\u0026rsquo; ddml), which uses a similar syntax as pdslasso and ivlasso.\nddml is designed to be used in combination with existing supervised machine learning programs available in or via Stata. The requirements for compatibility with ddml are minimal: Any eclass program with the Stata-typical reg y x syntax, support for if conditions and post-estimation predict is compatible with ddml.\nOur recommendation is to use ddml in combination with pystacked. While pystacked allows for fast estimation of popular supervised machine learners, the main advantages is its support for stacking regression and classification.\nReference # Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, James Robins, Double/debiased machine learning for treatment and structural parameters, The Econometrics Journal, Volume 21, Issue 1, 1 February 2018, Pages C1–C68, https://doi.org/10.1111/ectj.12097\n"},{"id":18,"href":"/docs/ddml/interactive/","title":"Interactive","section":"DDML","content":" Interactive Model # Preparations: we load the data, define global macros and set the seed.\n. 
webuse cattaneo2, clear (Excerpt from Cattaneo (2010) Journal of Econometrics 155: 138–154) . global Y bweight . global D mbsmoke . global X mage prenatal1 mmarried fbaby mage medu . set seed 42 Step 1: Initialization # We use 5 folds and 5 resamplings; that is, we estimate the model 5 times using randomly chosen folds.\n. ddml init interactive, kfolds(5) reps(5) Step 2: Adding learners # We need to estimate the conditional expectations of \\(E[Y|X,D=0]\\) , \\(E[Y|X,D=1]\\) and \\(E[D|X]\\) . The first two conditional expectations are added jointly.\nWe consider two supervised learners: linear regression and gradient boosted trees (implemented in pystacked). Note that we use gradient boosted regression trees for E[Y|X,D], but gradient boosted classification trees for E[D|X].\n. ddml E[Y|X,D]: reg $Y $X Learner Y1_reg added successfully. . ddml E[Y|X,D]: pystacked $Y $X, type(reg) method(gradboost) Learner Y2_pystacked added successfully. . ddml E[D|X]: logit $D $X Learner D1_logit added successfully. . ddml E[D|X]: pystacked $D $X, type(class) method(gradboost) Learner D2_pystacked added successfully. Step 3: Cross-fitting # . ddml crossfit Step 4: Estimation # In the final estimation step, we can estimate both the average treatment effect (the default) or the average treatment effect of the treated (atet).\n. ddml estimate DDML estimation results: spec r Y0 learner Y1 learner D learner b SE 1 1 Y1_reg Y1_reg D1_logit -232.439 (23.705) * 2 1 Y1_reg Y1_reg D2_pystacked -207.548 (32.276) 3 1 Y1_reg Y2_pystacked D1_logit -212.864 (24.366) 4 1 Y1_reg Y2_pystacked D2_pystacked -196.989 (30.886) 5 1 Y2_pystacked Y1_reg D1_logit -233.565 (23.707) 6 1 Y2_pystacked Y1_reg D2_pystacked -206.370 (32.298) 7 1 Y2_pystacked Y2_pystacked D1_logit -213.989 (24.372) 8 1 Y2_pystacked Y2_pystacked D2_pystacked -195.812 (30.885) 1 2 Y1_reg Y1_reg D1_logit -231.528 (23.962) * 2 2 Y1_reg Y1_reg D2_pystacked -212.120 (29.030) ... \u0026lt;-click or type ddml estimate, replay full to display full summary * = minimum MSE specification for that resample. Mean/med. Y0 learner Y1 learner D learner b SE mse mn [min-mse] [mse] [mse] -216.073 (30.660) mse md [min-mse] [mse] [mse] -216.073 (29.490) Median over min-mse specifications y-E[y|X,D=0] = Y1_reg Number of obs = 4642 y-E[y|X,D=1] = Y1_reg D-E[D|X] = D2_pystacked ------------------------------------------------------------------------------ | Robust bweight | Coefficient std. err. z P\u0026gt;|z| [95% conf. interval] -------------+---------------------------------------------------------------- mbsmoke | -216.0733 29.48997 -7.33 0.000 -273.8725 -158.274 ------------------------------------------------------------------------------ Warning: 5 resamples had propensity scores trimmed to lower limit .01. Summary over 5 resamples: D eqn mean min p25 p50 p75 max mbsmoke -216.0727 -227.9799 -216.6422 -216.0733 -212.1205 -207.5478 . ddml estimate, atet trim(0) DDML estimation results: spec r Y0 learner Y1 learner D learner b SE 1 1 Y1_reg Y1_reg D1_logit -219.785 (23.719) * 2 1 Y1_reg Y1_reg D2_pystacked -234.764 (24.667) 3 1 Y1_reg Y2_pystacked D1_logit -219.785 (23.719) 4 1 Y1_reg Y2_pystacked D2_pystacked -234.764 (24.667) 5 1 Y2_pystacked Y1_reg D1_logit -225.738 (24.082) 6 1 Y2_pystacked Y1_reg D2_pystacked -228.703 (25.699) 7 1 Y2_pystacked Y2_pystacked D1_logit -225.738 (24.082) 8 1 Y2_pystacked Y2_pystacked D2_pystacked -228.703 (25.699) 1 2 Y1_reg Y1_reg D1_logit -219.674 (23.696) * 2 2 Y1_reg Y1_reg D2_pystacked -231.267 (25.229) ... 
\u0026lt;-click or type ddml estimate, replay full to display full summary * = minimum MSE specification for that resample. Mean/med. Y0 learner Y1 learner D learner b SE mse mn [min-mse] [mse] [mse] -235.908 (24.954) mse md [min-mse] [mse] [mse] -234.764 (24.919) Median over min-mse specifications y-E[y|X,D=0] = Y1_reg Number of obs = 4642 y-E[y|X,D=1] = Y1_reg D-E[D|X] = D2_pystacked ------------------------------------------------------------------------------ | Robust bweight | Coefficient std. err. z P\u0026gt;|z| [95% conf. interval] -------------+---------------------------------------------------------------- mbsmoke | -234.7643 24.91944 -9.42 0.000 -283.6055 -185.9231 ------------------------------------------------------------------------------ Summary over 5 resamples: D eqn mean min p25 p50 p75 max mbsmoke -235.9077 -239.9554 -239.3867 -234.7643 -234.1649 -231.2672 Recall that we have specified 5 resampling iterations (reps(5)) By default, the median over the minimum-MSE specification per resampling iteration is shown. At the bottom, a table of summary statistics over resampling iterations is shown.\n"},{"id":19,"href":"/docs/pdslasso/pdslasso_panel/","title":"Panel FE","section":"PDSLASSO","content":" Panel FE and Clustering # pdslasso and ivlasso can also be applied to fixed effect panel models using the methodology of Belloni et al., 2014. Since the appropriate level of regularization depends on the error structure, we need to accommodate cluster dependence that is likely to be present in panel data. Ignoring cluster dependence would lead to a regularization level that is too low.\nFor demonstation, we consider the nlswork example data set:\n. webuse nlswork, clear . xtset idcode year . global controls age-south To estimate the effect of union membership on wages, we consider a large set of controls, including age, race, education, regional controls as well as industry and occupational codes. The total number of controls is 79.\nWe also include individual fixed effect using the fe option. (Note that the data is xtset by idcode.) Lastly, we cluster by individual to account for dependence over time.\n. pdslasso ln_w union ( c.($controls)##c.($controls) i.ind_code i.occ_code ), /// cluster(idcode) fe OLS using CHS lasso-orthogonalized vars (Std. Err. adjusted for 4132 clusters in idcode) ------------------------------------------------------------------------------ | Robust ln_wage | Coef. Std. Err. z P\u0026gt;|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- union | .0978002 .0096037 10.18 0.000 .0789773 .1166231 ------------------------------------------------------------------------------ OLS using CHS post-lasso-orthogonalized vars (Std. Err. adjusted for 4132 clusters in idcode) ------------------------------------------------------------------------------ | Robust ln_wage | Coef. Std. Err. z P\u0026gt;|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- union | .0881405 .0092774 9.50 0.000 .0699571 .1063239 ------------------------------------------------------------------------------ OLS with PDS-selected variables and full regressor set (Std. Err. adjusted for 4132 clusters in idcode) ------------------------------------------------------------------------------ | Robust ln_wage | Coef. Std. Err. z P\u0026gt;|z| [95% Conf. 
Interval] -------------+---------------------------------------------------------------- union | .0883666 .0092876 9.51 0.000 .0701633 .1065699 ------------------------------------------------------------------------------ Note that the output above is abbreviated. In total, 16 controls are selected.\n"},{"id":20,"href":"/docs/ddml/iv/","title":"Partial Linear IV","section":"DDML","content":" Partial Linear IV Model # Preparations # We load the data, define global macros and set the seed.\n. use https://statalasso.github.io/dta/AJR.dta, clear . global Y logpgp95 . global D avexpr . global Z logem4 . global X lat_abst edes1975 avelf temp* humid* steplow-oilres . set seed 42 Step 1: Initialization # Since the data set is very small, we consider 30 cross-fitting folds.\n. ddml init iv, kfolds(30) Step 2: Adding learners # The partially linear IV model has three conditional expectations: \\(E[Y|X]\\) , \\(E[D|X]\\) and \\(E[Z|X]\\) . For each reduced form equation, we add two learners: regress and rforest.\nWe need to add the option vtype(none) for rforest to work with ddml since rforest\u0026rsquo;s predict command doesn\u0026rsquo;t support variable types.\n. ddml E[Y|X]: reg $Y $X Learner Y1_reg added successfully. . ddml E[Y|X], vtype(none): rforest $Y $X, type(reg) Learner Y2_rforest added successfully. . ddml E[D|X]: reg $D $X Learner D1_reg added successfully. . ddml E[D|X], vtype(none): rforest $D $X, type(reg) Learner D2_rforest added successfully. . ddml E[Z|X]: reg $Z $X Learner Z1_reg added successfully. . ddml E[Z|X], vtype(none): rforest $Z $X, type(reg) Learner Z2_rforest added successfully. Step 3/4: Cross-fitting and estimation # . ddml crossfit Z equations (1): logem4 Cross-fitting E[Y|X] equation: logpgp95 Cross-fitting fold 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ...completed cross-fitting Cross-fitting E[D|X] equation: avexpr Cross-fitting fold 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ...completed cross-fitting Cross-fitting E[Z|X]: logem4 Cross-fitting fold 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ...completed cross-fitting Manual estimation # If you are curious what ddml does in the background:\n. ddml estimate m0, spec(8) rep(1) DDML estimation results: spec r Y learner D learner b SE Z learner 1 1 Y1_reg D1_reg 0.378 ( 0.085) Z1_reg 2 1 Y1_reg D1_reg -0.101 ( 0.827) Z2_rforest 3 1 Y1_reg D2_rforest 2.434 ( 2.243) Z1_reg 4 1 Y1_reg D2_rforest 0.052 ( 0.351) Z2_rforest 5 1 Y2_rforest D1_reg 0.122 ( 0.091) Z1_reg 6 1 Y2_rforest D1_reg -1.485 ( 2.455) Z2_rforest 7 1 Y2_rforest D2_rforest 0.784 ( 0.578) Z1_reg * 8 1 Y2_rforest D2_rforest 0.771 ( 0.202) Z2_rforest * = minimum MSE specification for that resample. Min MSE DDML model, specification 8 y-E[y|X] = Y2_rforest_1 Number of obs = 64 D-E[D|X,Z]= D2_rforest_1 Z-E[Z|X] = Z2_rforest_1 ------------------------------------------------------------------------------ logpgp95 | Coefficient Std. err. z P\u0026gt;|z| [95% conf. interval] -------------+---------------------------------------------------------------- avexpr | .770743 .2018379 3.82 0.000 .375148 1.166338 ------------------------------------------------------------------------------ . ivreg Y2_rf (D2_rf = Z2_rf), nocons Instrumental variables 2SLS regression Source | SS df MS Number of obs = 64 -------------+---------------------------------- F(1, 63) = . Model | -4.59844887 1 -4.59844887 Prob \u0026gt; F = . 
Residual | 39.7479425 63 .630919722 R-squared = . -------------+---------------------------------- Adj R-squared = . Total | 35.1494936 64 .549210838 Root MSE = .7943 ------------------------------------------------------------------------------ Y2_rforest_1 | Coefficient Std. err. t P\u0026gt;|t| [95% conf. interval] -------------+---------------------------------------------------------------- D2_rforest_1 | .770743 .2018379 3.82 0.000 .3674021 1.174084 ------------------------------------------------------------------------------ Instrumented: D2_rforest_1 Instruments: Z2_rforest_1 "},{"id":21,"href":"/docs/ddml/ivhd/","title":"Flexible IV","section":"DDML","content":" Flexible Partially Linear IV Model # Preparations # We load the data, define global macros and set the seed.\n. use https://statalasso.github.io/dta/BLP_CHS.dta, clear . global Y y . global D price . global X hpwt air mpd space . global Z Zbase* . set seed 42 Step 1: Initialization # We initialize the model.\n. ddml init ivhd Step 2: Add learners # We add learners for \\(E[Y|X]\\) in the usual way.\n. ddml E[Y|X]: reg $Y $X Learner Y1_reg added successfully. . ddml E[Y|X]: pystacked $Y $X, type(reg) Learner Y2_pystacked added successfully. There are some pecularities that we need to bear in mind when adding learners for \\(E[D|Z,X]\\) and \\(E[D|X]\\) . The reason for this is that the estimation of \\(E[D|X]\\) depends on the estimation of \\(E[D|X,Z]\\) . More precisely, we first obtain the fitted values \\(\\hat{D}=E[D|X,Z]\\) and fit these against \\(X\\) to estimate \\(E[\\hat{D}|X]\\) .\nWhen adding learners for \\(E[D|Z,X]\\) , we need to provide a name for each learners using learner(name).\n. ddml E[D|Z,X], learner(Dhat_reg): reg $D $X $Z Learner Dhat_reg added successfully. . ddml E[D|Z,X], learner(Dhat_pystacked): pystacked $D $X $Z, type(reg) Learner Dhat_pystacked added successfully. When adding learners for \\(E[D|X]\\) , we explicitly refer to the learner from the previous step (e.g., learner(Dhat_reg)) and also provide the name of the treatment variable (vname($D)). Finally, we use the placeholder {D} in place of the dependent variable.\n. ddml E[D|X], learner(Dhat_reg) vname($D): reg {D} $X Learner Dhat_reg_h added successfully. . ddml E[D|X], learner(Dhat_pystacked) vname($D): pystacked {D} $X, type(reg) Replacing existing learner Dhat_pystacked_h... Learner Dhat_pystacked_h added successfully. Step 3-4: Cross-fitting and estimation # That\u0026rsquo;s it. Now we can move to cross-fitting and estimation.\n. ddml crossfit Cross-fitting E[Y|X,Z] equation: y Cross-fitting fold 1 2 3 4 5 ...completed cross-fitting Cross-fitting E[D|X,Z] and E[D|X] equation: price Cross-fitting fold 1 2 3 4 5 ...completed cross-fitting . ddml estimate DDML estimation results: spec r Y learner D learner b SE DH learner 1 1 Y1_reg Dhat_reg -0.137 ( 0.011) Dhat_reg_h 2 1 Y1_reg Dhat_reg 0.363 ( 0.145) Dhat_pystac~h 3 1 Y1_reg Dhat_pystac~d -0.089 ( 0.005) Dhat_reg_h 4 1 Y1_reg Dhat_pystac~d -0.114 ( 0.009) Dhat_pystac~h 5 1 Y2_pystacked Dhat_reg -0.096 ( 0.010) Dhat_reg_h 6 1 Y2_pystacked Dhat_reg -0.208 ( 0.073) Dhat_pystac~h * 7 1 Y2_pystacked Dhat_pystac~d -0.042 ( 0.005) Dhat_reg_h 8 1 Y2_pystacked Dhat_pystac~d -0.098 ( 0.008) Dhat_pystac~h * = minimum MSE specification for that resample. Min MSE DDML model, specification 7 y-E[y|X] = Y2_pystacked_1 Number of obs = 2217 E[D|X,Z] = Dhat_pystacked_1 E[D|X] = Dhat_reg_h_1 Orthogonalised D = D - E[D|X]; optimal IV = E[D|X,Z] - E[D|X]. 
------------------------------------------------------------------------------ y | Coefficient Std. err. z P\u0026gt;|z| [95% conf. interval] -------------+---------------------------------------------------------------- price | -.0422251 .0046777 -9.03 0.000 -.0513933 -.033057 ------------------------------------------------------------------------------ Manual estimation # If you are curious what ddml does in the background:\n. ddml estimate m0, spec(8) rep(1) DDML estimation results: spec r Y learner D learner b SE DH learner 1 1 Y1_reg Dhat_reg -0.137 ( 0.011) Dhat_reg_h 2 1 Y1_reg Dhat_reg 0.363 ( 0.145) Dhat_pystac~h 3 1 Y1_reg Dhat_pystac~d -0.089 ( 0.005) Dhat_reg_h 4 1 Y1_reg Dhat_pystac~d -0.114 ( 0.009) Dhat_pystac~h 5 1 Y2_pystacked Dhat_reg -0.096 ( 0.010) Dhat_reg_h 6 1 Y2_pystacked Dhat_reg -0.208 ( 0.073) Dhat_pystac~h * 7 1 Y2_pystacked Dhat_pystac~d -0.042 ( 0.005) Dhat_reg_h 8 1 Y2_pystacked Dhat_pystac~d -0.098 ( 0.008) Dhat_pystac~h * = minimum MSE specification for that resample. DDML model, specification 8 y-E[y|X] = Y2_pystacked_1 Number of obs = 2217 E[D|X,Z] = Dhat_pystacked_1 E[D|X] = Dhat_pystacked_h_1 Orthogonalised D = D - E[D|X]; optimal IV = E[D|X,Z] - E[D|X]. ------------------------------------------------------------------------------ y | Coefficient Std. err. z P\u0026gt;|z| [95% conf. interval] -------------+---------------------------------------------------------------- price | -.0976524 .0079768 -12.24 0.000 -.1132866 -.0820181 ------------------------------------------------------------------------------ . gen Dtilde = $D - Dhat_pystacked_h_1 . gen Zopt = Dhat_pystacked_1 - Dhat_pystacked_h_1 . ivreg Y2_pystacked_1 (Dtilde=Zopt), nocons Instrumental variables 2SLS regression Source | SS df MS Number of obs = 2,217 -------------+---------------------------------- F(1, 2216) = . Model | 303.676955 1 303.676955 Prob \u0026gt; F = . Residual | 2283.27076 2,216 1.03035684 R-squared = . -------------+---------------------------------- Adj R-squared = . Total | 2586.94771 2,217 1.16686861 Root MSE = 1.0151 ------------------------------------------------------------------------------ Y2_pystack~1 | Coefficient Std. err. t P\u0026gt;|t| [95% conf. interval] -------------+---------------------------------------------------------------- Dtilde | -.0976524 .0079768 -12.24 0.000 -.1132952 -.0820096 ------------------------------------------------------------------------------ Instrumented: Dtilde Instruments: Zopt "},{"id":22,"href":"/docs/lassopack/cvlasso/","title":"Cross-validation","section":"LASSOPACK","content":" Cross-validation # In the course of cross-validation, the data is repeatedly partitioned into training and validation data. The model is fit to the training data and the validation data is used to calculate the prediction error. This in turn enables us to identify the values of \\(\\lambda\\) and \\(\\alpha\\) that optimize predictive performance (i.e., minimize the estimated mean-squared prediction error).\ncvlasso supports \\(K\\) -fold cross-validation and \\(h\\) -step ahead rolling cross-validation. The latter is intended for time-series or panel data with a large time dimension. \\(h\\) -step ahead rolling cross-validation was suggested by Rob H Hyndman in a blog post.\nK-fold cross-validation # We begin with 10-fold cross-validation (the default). 
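The cvlasso examples in this section reuse the prostate cancer data from the rlasso demonstration elsewhere in this documentation (the variable names lpsa, lcavol, lweight, etc. are the same). If the data are not already in memory, a minimal loading sketch is given below; it assumes the Stanford hosting URL used on the rlasso page is still reachable.
. clear
. insheet using https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data, tab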
If no fold variable is specified (which can be done using the foldvar() option), the data is randomly partitioned into \u0026ldquo;folds\u0026rdquo;.\nWe use seed(123) throughout this demonstration to allow reproducing the outputs below.\n. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) K-fold cross-validation with 10 folds. Elastic net with alpha=1. Fold 1 2 3 4 5 6 7 8 9 10 | Lambda MSPE st. dev. ----------+--------------------------------------------- 1| 163.62492 1.3162136 .13064798 2| 149.08894 1.2141972 .12282686 3| 135.84429 1.114079 .11387635 ... 17| 36.930468 .5827423 .06260056 ^ ... 27| 14.566138 .53408884 .05830419 * ... 100| .01636249 .54838029 .07390164 * lopt = the lambda that minimizes MSPE. Run model: cvlasso, lopt ^ lse = largest lambda for which MSPE is within one standard error of the minimal MSPE. Run model: cvlasso, lse Note that parts of the output have been omitted for the sake of brevity. The columns 2 to 4 show the value of \\(\\lambda\\) , the estimate of the mean-squared prediction error and the associated standard error.\nThe \\(\\lambda\\) value that minimizes the mean-squared prediction error is indicated by an asterisk (*). A hat (^) marks the largest \\(\\lambda\\) at which the MSPE is within one standard error of the minimal MSPE. We denote these by \\(\\lambda_{lopt}\\) and \\(\\lambda_{lse}\\) , respectively. The former is returned in e(lopt), the latter in e(lse).\n. di e(lopt) 14.566138 . di e(lse) 36.930468 Estimate the selected model # To estimate the full model with either \\(\\lambda_{lopt}\\) or \\(\\lambda_{lse}\\) , we can use lopt or lse. Internally, cvlasso calls lasso2 with either lambda(14.566138) or lambda(36.930468).\n. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, lopt seed(123) . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, lse seed(123) The same as above can be achieved using the replay syntax in two steps.\n. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) . cvlasso, lopt . cvlasso, lse If postest is specified, cvlasso posts the lasso2 estimation results.\n. cvlasso, lopt postest . ereturn list K-fold cross-validation over lambda and alpha # alpha() can be a scalar or list of elastic net parameters. Each \\(\\alpha\\) value must lie in the interval [0,1]. If alpha() is a list longer than one, cvlasso cross-validates over \\(\\lambda\\) and \\(\\alpha\\) .\n. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, /// alpha(0 0.1 0.5 1) seed(123) Cross-validation over alpha (0 .1 .5 1). alpha | lopt* Minimum MSPE ------------+---------------------------- 0.000 | 12.093063 .54348993 0.100 | 25.454739 .5418149 0.500 | 15.986318 .53499607 1.000 | 14.566138 .53408884 # * lambda value that minimizes MSPE for a given alpha # alpha value that minimizes MSPE The second column in the table indicates the value of \\(\\lambda\\) that minimizes the MSPE for a given value of \\(\\alpha\\) . A hash key (#) indicates that value of \\(\\alpha\\) that minimizes the overall MSPE.\nPlotting # We can plot the estimated mean-squared prediction error over \\(\\lambda\\) . Note that the plotting feature is not supported if we also cross-validate over \\(\\alpha\\) .\n. 
cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) plotcv This produces the following graph:\nThe two vertical lines indicate \\(\\lambda_{lopt}\\) and \\(\\lambda_{lse}\\) (dashed line).\nSimilar to lasso2, cvlasso allows to pass plotting options on to Stata\u0026rsquo;s line using plotopt().\nPrediction # The predict postestimation command allows to obtain predicted values and residuals for either e(lopt) or e(lse).\n. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) . cap drop xbhat1 . predict double xbhat1, lopt . cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, seed(123) . cap drop xbhat2 . predict double xbhat2, lse Store intermediate steps # cvlasso calls lasso2 internally. The saveest(string) allows to access intermediate estimation results.\n. cvlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, /// seed(123) nfolds(3) saveest(step) . estimates dir . estimates restore step1 . estimates replay step1 Note: EBIC and \\(R^2\\) are not calculated to speed up the computation.\nTime-series example using rolling h-step ahead cross-validation # Load airline passenger data:\n. webuse air2, clear There are 144 observations in the sample. origin() controls the sample range used for training and validation. In this example, origin(130) implies that data up to and including \\(t=130\\) are used for training in the first iteration. Data points \\(t=131,...,144\\) are successively used for validation.\n. cvlasso air L(1/12).air, rolling origin(130) Rolling forecasting cross-validation with 1-step ahead forecasts. Elastic net with alpha=1. Training from-to (validation point): 13-130 (131), 13-131 (132), 13-132 (133), 13-133 (134), \u0026gt; 13-134 (135), 13-135 (136), 13-136 (137), 13-137 (138), 13-138 (139), 13-139 (140), \u0026gt; 13-140 (141), 13-141 (142), 13-142 (143), 13-143 (144). The notation a-b (v) indicates that data a to b are used for estimation (training), and data point v is used for forecasting (validation). Note that the training dataset starts with \\(t=13\\) since 12 lags are used as predictors.\nThe \u0026ldquo;optimal\u0026rdquo; model includes lags 1, 11 and 12.\n. cvlasso, lopt Estimate lasso with lambda=315.16 (lopt). --------------------------------------------------- Selected | Lasso Post-est OLS ------------------+-------------------------------- air | L1. | 0.1534004 0.1610229 L11. | 0.0638066 0.0724006 L12. | 0.8422566 0.8374074 ------------------+-------------------------------- Partialled-out*| ------------------+-------------------------------- | _cons | 11.5075093 8.2797832 --------------------------------------------------- The option h() controls the forecasting horizon (default is h(1)).\n. cvlasso air L(1/12).air, rolling origin(130) h(2) Rolling forecasting cross-validation with 2-step ahead forecasts. Elastic net with alpha=1. Training from-to (validation point): 13-130 (132), 13-131 (133), 13-132 (134), 13-133 (135), \u0026gt; 13-134 (136), 13-135 (137), 13-136 (138), 13-137 (139), 13-138 (140), 13-139 (141), \u0026gt; 13-140 (142), 13-141 (143), 13-142 (144). In the above examples, the size of the training dataset increases by one data point each step. To keep the size of the training dataset fixed, specify fixedwindow.\n. cvlasso air L(1/12).air, rolling origin(130) fixedwindow Rolling forecasting cross-validation with 1-step ahead forecasts. Elastic net with alpha=1. 
Training from-to (validation point): 13-130 (131), 14-131 (132), 15-132 (133), 16-133 (134), \u0026gt; 17-134 (135), 18-135 (136), 19-136 (137), 20-137 (138), 21-138 (139), 22-139 (140), \u0026gt; 23-140 (141), 24-141 (142), 25-142 (143), 26-143 (144). Panel data example using rolling h-step ahead cross-validation # Rolling cross-validation can also be applied to panel data. For demonstration, load Grunfeld data.\n. webuse grunfeld, clear Apply 1-step ahead cross-validation.\n. cvlasso mvalue L(1/10).mvalue, rolling origin(1950) Rolling forecasting cross-validation with 1-step ahead forecasts. Elastic net with alpha=1. Training from-to (validation point): 1945-1950 (1951), 1945-1951 (1952), 1945-1952 (1953), \u0026gt; 1945-1953 (1954). The model selected by cross-validation:\n. cvlasso, lopt Estimate lasso with lambda=4828.76 (lopt). --------------------------------------------------- Selected | Lasso Post-est OLS ------------------+-------------------------------- mvalue | L1. | 0.7289970 0.7343915 L5. | 0.1181815 0.1239170 L7. | 0.0027785 0.0062233 L8. | 0.0613727 0.0647928 L9. | 0.1014168 0.1031103 ------------------+-------------------------------- Partialled-out*| ------------------+-------------------------------- | _cons | 42.6792365 21.8393696 --------------------------------------------------- More # Please check the help file for more information and examples.\n. help cvlasso "},{"id":23,"href":"/docs/ddml/interactiveiv/","title":"Interactive IV","section":"DDML","content":" Interactive IV Model # Preparations # We load the data, define global macros and set the seed.\n. use http://fmwww.bc.edu/repec/bocode/j/jtpa.dta,clear . global Y earnings . global D training . global Z assignmt . global X sex age married black hispanic . set seed 42 Step 1: Initialization # We initialize the model.\n. ddml init interactiveiv, kfolds(5) Step 2: Add learners # We again add two learners per reduced form equation.\n. ddml E[Y|X,Z]: reg $Y $X Learner Y1_reg added successfully. . ddml E[Y|X,Z]: pystacked $Y c.($X)# #c($X), type(reg) m(lassocv) Learner Y2_pystacked added successfully. . ddml E[D|X,Z]: logit $D $X Learner D1_logit added successfully. . ddml E[D|X,Z]: pystacked $D c.($X)# #c($X), type(class) m(lassocv) Learner D2_pystacked added successfully. . ddml E[Z|X]: logit $Z $X Learner Z1_logit added successfully. . ddml E[Z|X]: pystacked $Z c.($X)# #c($X), type(class) m(lassocv) Learner Z2_pystacked added successfully. Step 3: Cross-fitting and estimation. # . ddml crossfit Z equations (1): assignmt Cross-fitting E[Y|X,Z] equation: earnings Cross-fitting fold 1 2 3 4 5 ...completed cross-fitting Cross-fitting E[D|X,Z] equation: training Cross-fitting fold 1 2 3 4 5 ...completed cross-fitting Cross-fitting E[Z|X]: assignmt Cross-fitting fold 1 2 3 4 5 ...completed cross-fitting . 
ddml estimate DDML estimation results: spec r Y0 learner Y1 learner D0 learner D1 learner b SE Z learner 1 1 Y1_reg Y1_reg D1_logit D1_logit 1800.600 (513.492) Z1_logit 2 1 Y1_reg Y1_reg D1_logit D1_logit 1803.585 (515.113) Z2_pystacked 3 1 Y1_reg Y1_reg D1_logit D2_pystacked 1802.494 (513.966) Z1_logit 4 1 Y1_reg Y1_reg D1_logit D2_pystacked 1805.322 (515.543) Z2_pystacked 5 1 Y1_reg Y1_reg D2_pystacked D1_logit 1800.659 (513.506) Z1_logit 6 1 Y1_reg Y1_reg D2_pystacked D1_logit 1803.378 (515.051) Z2_pystacked 7 1 Y1_reg Y1_reg D2_pystacked D2_pystacked 1802.553 (513.980) Z1_logit 8 1 Y1_reg Y1_reg D2_pystacked D2_pystacked 1805.115 (515.481) Z2_pystacked 9 1 Y1_reg Y2_pystacked D1_logit D1_logit 1808.209 (512.654) Z1_logit 10 1 Y1_reg Y2_pystacked D1_logit D1_logit 1811.073 (514.270) Z2_pystacked ... \u0026lt;-click or type ddml estimate, replay full to display full summary * = minimum MSE specification for that resample. Min MSE DDML model, specification 28 y-E[y|X,D=0] = Y2_pystacked0_1 Number of obs = 11204 y-E[y|X,D=1] = Y2_pystacked1_1 D-E[D|X,Z=0] = D1_logit0_1 D-E[D|X,Z=1] = D2_pystacked1_1 Z-E[Z|X] = Z2_pystacked_1 ------------------------------------------------------------------------------ | Robust earnings | Coefficient std. err. z P\u0026gt;|z| [95% conf. interval] -------------+---------------------------------------------------------------- training | 1802.897 514.4563 3.50 0.000 794.5815 2811.213 ------------------------------------------------------------------------------ "},{"id":24,"href":"/docs/pystacked/regression/","title":"Regression","section":"PYSTACKED","content":" Stacking regression # First load the Boston housing data and split the data randomly in training and test sample:\n. insheet using /// https://statalasso.github.io/dta/housing.csv, /// clear comma . set seed 789 . gen train=runiform() . replace train=train\u0026gt;.75 We now consider a more complicated pystacked application with 5 base learners: linear regression, two versions of lasso with AIC-chosen penalty, random forest and gradient boosting:\n. pystacked medv crim-lstat if train, /// type(regress) /// methods(ols lassoic lassoic rf gradboost) /// pipe1(poly2) pipe3(poly2) cmdopt5(learning_rate(0.01) /// n_estimators(1000)) Stacking weights: --------------------------------------- Method | Weight -----------------+--------------------- ols | 0.0139419 lassoic | 0.0000000 lassoic | 0.3959649 rf | 0.5900932 gradboost | 0.0000000 In this example, we use the lasso twice\u0026mdash;once with and once without the poly2 pipeline. Indeed, nothing keeps us from using base learners multiple times. This way we can compare and combine different sets of options.\nNote the numbering of the pipe*() and cmdopt*() options: We apply the poly2 pipe to the first and third method (ols and lassoic). We also change the default learning rate and number of estimators for gradient boosting (the 5th estimator).\nThe weights determine how much each method contributes to the final stacking contribution. lassoic without poly2 receives a weight of 0, while lassoic with poly2 gets a positive weight.\nYou can verify that options are being passed on to scikit-learn correctly using, e.g., di e(pyopt1) after estimation. Alternative Syntax # The above syntax becomes, admittedly, a bit difficult to read, especially with many methods and many options. We offer an alternative syntax for easier use with many base learners:\n. 
pystacked medv crim-lstat || /// m(ols) pipe(poly2) || /// m(lassoic) || /// m(lassoic) pipe(poly2) || /// m(rf) || /// m(gradboost) opt(learning_rate(0.01) n_estimators(1000)) | /// if train , type(regress) Stacking weights: --------------------------------------- Method | Weight -----------------+--------------------- ols | 0.0000000 lassoic | 0.0000000 lassoic | 0.3696124 rf | 0.6303876 gradboost | 0.0000000 Changing the final estimator # The default final learner of pystacked is non-negative least squares (NNLS) where the are constrained to be non-negative and to sum to 1.\nHere, we switch to the \u0026ldquo;singlebest\u0026rdquo; approach which selects the base learner with the lowest RMSPE.\n. pystacked medv crim-lstat || /// m(ols) pipe(poly2) || /// m(lassoic) || /// m(lassoic) pipe(poly2) || /// m(rf) || /// m(gradboost) opt(learning_rate(0.01) n_estimators(1000)) /// if train, type(regress) finalest(singlebest) Stacking weights: --------------------------------------- Method | Weight -----------------+--------------------- ols | 0.0000000 lassoic | 0.0000000 lassoic | 0.0000000 rf | 0.0000000 gradboost | 1.0000000 Predictions # In addition to the stacking predicted values, we can also get the predicted values of each base learner using the basexb option (formerly called transform):\n. predict double yh, xb . predict double ybase, basexb . list yh ybase* if _n \u0026lt;= 5 +-----------------------------------------------------------------------+ | yh ybase1 ybase2 ybase3 ybase4 ybase5 | |-----------------------------------------------------------------------| 1. | 24.726908 22.301384 30.27872 25.451438 25.721 24.726908 | 2. | 21.993344 22.596584 25.133314 23.908178 21.356 21.993344 | 3. | 32.628795 29.987554 31.027162 33.898635 32.968001 32.628795 | 4. | 34.027947 30.977113 29.428614 32.00817 33.098001 34.027947 | 5. | 35.215694 34.034215 28.85696 32.311613 35.109001 35.215694 | +-----------------------------------------------------------------------+ Notice that the stacking predicted values are equal to the gradient boosting predicted values, since boosting was identified as the best learner.\nPlotting # pystacked also comes with plotting features. The graph option creates a scatter plot of predicted vs observed values for stacking and each base learner. There is no need to re-run the stacking estimation. You can also use pystacked with graph as a post-estimation command:\n. pystacked, graph holdout Here, we show the out-of-sample predicted values. To see the in-sample predicted values, simply omit the holdout option. Note that the holdout option won\u0026rsquo;t work if the estimation was run on the whole sample.\nRoot mean squared prediction error (RMSPE) # The table option allows to compare stacking weights with three types of RMSPE: in-sample RMSPE of learners fit on the whole training sample, RMSPE of cross-validated predictions, and out-of-sample RMSPE. As with the graph option, we can use table as a post-estimation command:\n. pystacked, table holdout Number of holdout observations: 361 RMSPE: In-Sample, CV, Holdout ----------------------------------------------------------------- Method | Weight In-Sample CV Holdout -----------------+----------------------------------------------- STACKING | . 
0.637 4.547 3.664 ols | 0.000 1.308 18.660 11.529 lassoic | 0.000 4.501 5.179 5.218 lassoic | 0.000 1.798 4.913 4.964 rf | 0.000 1.484 4.599 3.932 gradboost | 1.000 0.637 4.547 3.664 "},{"id":25,"href":"/docs/lassopack/rlasso/","title":"Rigorous lasso","section":"LASSOPACK","content":" Theory driven penalty # rlasso provides routines for estimating the coefficients of a lasso or square-root lasso regression with data-dependent, theory-driven penalization. The number of regressors, \\(p\\) , may be large and possibly greater than the number of observations, \\(N\\) . rlasso implements a version of the lasso that allows for heteroskedastic and clustered errors; see Belloni et al. (2012, 2016).\nWe start again with the prostate cancer data for demonstration.\n. clear . insheet using https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data, tab Homoskedastic lasso # The optimal penalization depends on whether the errors are homoskedastic, heteroskedastic or cluster-dependent.\nSimilar to regress, rlasso assumes homoskedasticity by default. Under homoskedasticity, the optimal penalty level is given by\n\\[\\lambda=\\sigma2c\\sqrt{N}\\Phi^{-1}(1-\\gamma/(2p)), \\] which guarantees that the \u0026ldquo;rigorous\u0026rdquo; lasso is well-behaved. The unobserved \\(\\sigma\\) is estimated using an iterative algorithm.\nTo run the lasso with theory-driven penalization, type:\n. rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45 --------------------------------------------------- Selected | Lasso Post-est OLS ------------------+-------------------------------- lcavol | 0.4400059 0.5258519 lweight | 0.2385063 0.6617699 svi | 0.3024128 0.6656665 _cons |* 0.9533782 -0.7771568 --------------------------------------------------- *Not penalized e(lambda) returns \\(\\lambda\\) , and e(lambda0) stores \\(\\lambda_0=\\lambda/\\hat{\\sigma}\\) , i.e., the penalty level excluding the standard deviation of the error.\n. di e(lambda) 44.984163 . di e(lambda0) 64.923165 Heteroskedastic lasso # To allow for heteroskedasticity, we specify the robust option.\n. rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, robust --------------------------------------------------- Selected | Lasso Post-est OLS ------------------+-------------------------------- lcavol | 0.4518205 0.5258519 lweight | 0.2047086 0.6617699 svi | 0.1995573 0.6656665 _cons |* 1.0823460 -0.7771568 --------------------------------------------------- *Not penalized The names of selected predictors are stored in e(selected) (without constant) and e(selected0) (with constant):\n. di e(selected0) lcavol lweight svi _cons . di e(selected) lcavol lweight svi Square-root lasso # With the sqrt-lasso of Belloni et al. (2011, 2014), the default penalty level is\n\\(\\lambda=c \\sqrt{N} \\Phi^{-1}(1-\\gamma/(2p)).\\) Note the difference by a factor of 2 compared to the standard lasso. More importantly, the optimal penalty level of the square-root lasso is independent of \\(\\sigma\\) , leading to a practical advantage.\nThe square-root lasso is available through the sqrt option.\n. rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, sqrt --------------------------------------------------- Selected | Sqrt-lasso Post-est OLS ------------------+-------------------------------- lcavol | 0.4293894 0.5258519 lweight | 0.1861616 0.6617699 svi | 0.2574895 0.6656665 _cons |* 1.1673922 -0.7771568 --------------------------------------------------- *Not penalized In this example, lasso and square-root lasso select the same variables. 
Thus the post-estimation OLS estimator, which is OLS using the variables selected, is the same in both cases.\nThe estimated penalty level is:\n. di e(lambda) 32.461583 The square-root lasso also allows for heteroskedastic errors:\n. rlasso lpsa lcavol lweight age lbph svi lcp gleason pgg45, sqrt robust --------------------------------------------------- Selected | Sqrt-lasso Post-est OLS ------------------+-------------------------------- lcavol | 0.4402037 0.5258519 lweight | 0.1329878 0.6617699 svi | 0.1264166 0.6656665 _cons |* 1.3741342 -0.7771568 --------------------------------------------------- *Not penalized Cluster-dependent errors # Both rigorous lasso and rigorous square-root lasso allow for within-panel correlation (based on Belloni et al., 2016, JBES). The fe option applies the within-transformation and cluster() specifies the cluster variable.\nNB: The two regressions below take a few minutes to run, and you might need to increase the maximum matsize using set matsize.\nIn this example, we interact the variable grade and age using Stata\u0026rsquo;s factor variable notation (see help factor variables).\n. webuse nlswork . xtset idcode . rlasso ln_w i.grade#i.age ttl_exp tenure not_smsa south, /// fe cluster(idcode) --------------------------------------------------- Selected | Lasso Post-est OLS ------------------+-------------------------------- grade#age | 12 18 | -0.1226071 -0.2087164 12 19 | -0.0481608 -0.1109979 12 20 | -0.0088640 -0.0627530 | ttl_exp | 0.0206773 0.0226526 tenure | 0.0107726 0.0123681 not_smsa | -0.0305386 -0.0957148 --------------------------------------------------- The results of cluster lasso and cluster square-root lasso are again similar:\n. rlasso ln_w i.grade#i.age ttl_exp tenure not_smsa south, /// sqrt fe cluster(idcode) Selected | Sqrt-lasso Post-est OLS ------------------+-------------------------------- grade#age | 12 18 | -0.1223057 -0.2087164 12 19 | -0.0479408 -0.1109979 12 20 | -0.0086753 -0.0627530 | ttl_exp | 0.0206704 0.0226526 tenure | 0.0107671 0.0123681 not_smsa | -0.0303104 -0.0957148 --------------------------------------------------- More # More information can be found in the help file:\nhelp rlasso "},{"id":26,"href":"/docs/about/","title":"About","section":"Docs","content":" Authors # Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland\nChristian B Hansen, University of Chicago, USA\nMark E Schaffer, Heriot-Watt University, UK\nThomas Wiemann, University of Chicago, USA\nIssues and questions # If you have encountered any issues or have questions, contact us via email, the issues section in the relevant Github repository or via Statalist. Since we do not check Statalist on a regular basis, please tag us or alert us to the post.\nThis page # The Stata ML Page is maintained by Achim Ahrens.\nLast updated: December 12, 2022\n"},{"id":27,"href":"/docs/pystacked/classification/","title":"Classification","section":"PYSTACKED","content":" Stacking classifier # Stacking can be applied in a similar way to classification problems. For demonstration, we consider the Spambase Data Set from the Machine Learning Repository. We load the data and shuffle the observations around since they are ordered by outcome.\n. insheet using /// https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data, /// clear comma . set seed 42 . gen uni=runiform() . sort uni Stacking classification works very similar to stacking regression. The example below is somewhat more complicated. 
Let\u0026rsquo;s go through it step-by-step:\nWe use 6 base learners: logit, random forest, gradient boosting and 3x neural nets. We apply the poly2 pipeline to the logistic regressor, which creates squares and interaction terms of the predictors. For both gradient boosting and random forest, we increase the number of classification trees to 1000. We consider three types of neural nets: (1) with one layer of 100 neurons (the default), (2) three layers with 50 neurons each, (3) one layer of 200 neurons. We use type(class) to specify that we consider a classification task. Finally, njobs(-1) switches parallelization on with all available CPUs. Please note that this might take a while to run.\n. pystacked v58 v1-v57 || /// \u0026gt; m(logit) pipe(poly2) || /// \u0026gt; m(rf) opt(n_estimators(1000)) || /// \u0026gt; m(gradboost) opt(n_estimators(1000)) || /// \u0026gt; m(nnet) || /// \u0026gt; m(nnet) opt(hidden_layer_sizes(50 50 50)) || /// \u0026gt; m(nnet) opt(hidden_layer_sizes(200)) || /// \u0026gt; if _n\u0026lt;=3000 , type(class) njobs(-1) Stacking weights: --------------------------------------- Method | Weight -----------------+--------------------- logit | 0.0016062 rf | 0.0762563 gradboost | 0.7524429 nnet | 0.0810773 nnet | 0.0574056 nnet | 0.0312117 Confusion matrix # After estimation, we can obtain the predicted class using predict. The predicted classes allow us to construct in-sample and out-of-sample confusion matrices:\n. predict spam, class . tab spam v58 if _n\u0026lt;=3000, cell | v58 spam | 0 1 | Total -----------+----------------------+---------- 0 | 1,792 2 | 1,794 | 59.73 0.07 | 59.80 -----------+----------------------+---------- 1 | 1 1,205 | 1,206 | 0.03 40.17 | 40.20 -----------+----------------------+---------- Total | 1,793 1,207 | 3,000 | 59.77 40.23 | 100.00 . tab spam v58 if _n\u0026gt;3000, cell | v58 spam | 0 1 | Total -----------+----------------------+---------- 0 | 962 43 | 1,005 | 60.09 2.69 | 62.77 -----------+----------------------+---------- 1 | 33 563 | 596 | 2.06 35.17 | 37.23 -----------+----------------------+---------- Total | 995 606 | 1,601 | 62.15 37.85 | 100.00 The table option makes this even easier. The table below shows the in-sample and out-of-sample classification errors for stacking and each base learner\u0026ndash;all in one table.\n. pystacked, table holdout Number of holdout observations: 1601 Confusion matrix: In-Sample, CV, Holdout ----------------------------------------------------------------------------- Method | Weight In-Sample CV Holdout | 0 1 0 1 0 1 -----------------+----------------------------------------------------------- STACKING 0 | . 1792 2 1728 76 962 43 STACKING 1 | . 1 1205 65 1131 33 563 logit 0 | 0.002 1077 68 1097 65 562 35 logit 1 | 0.002 716 1139 696 1142 433 571 rf 0 | 0.076 1792 0 1709 100 948 44 rf 1 | 0.076 1 1207 84 1107 47 562 gradboost 0 | 0.752 1792 2 1726 75 960 43 gradboost 1 | 0.752 1 1205 67 1132 35 563 nnet 0 | 0.081 1758 175 1671 113 957 100 nnet 1 | 0.081 35 1032 122 1094 38 506 nnet 0 | 0.057 1669 77 1650 125 910 55 nnet 1 | 0.057 124 1130 143 1082 85 551 nnet 0 | 0.031 1679 86 1654 116 911 54 nnet 1 | 0.031 114 1121 139 1091 84 552 ROC curve and AUC # pystacked supports ROC curves which allow to assess the classification performance for varying disrimination thresholds. The y-axis in an ROC plot corresponds to sensitivity (true positive rate) and the x-axis corresponds to 1-specificity (false positive rate). 
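If you prefer to inspect the holdout classification performance with Stata's built-in tools, a minimal sketch is shown below. It assumes that predict with the pr option returns the stacked predicted probabilities (check the pystacked help file for the exact list of predict options); Stata's roctab then traces the same sensitivity/specificity trade-off on the holdout sample used above (observations with _n>3000).
. predict double phat, pr
. roctab v58 phat if _n>3000, graph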
The Area Under the Curve (AUC) displayed below each ROC plot is a common evaluation metric for classification problems.\n. pystacked, graph(subtitle(Spam data)) /// lgraph(plotopts(msymbol(i) ylabel(0 1, format(%3.1f)))) /// holdout "},{"id":28,"href":"/docs/papers/","title":"Readings","section":"Docs","content":" pystacked # Ahrens, A., Hansen, C.B. and Schaffer, M.E., 2022. pystacked: Stacking generalization and machine learning in Stata. arXiv preprint arXiv:2208.10896.\npystacked implements stacked generalization (Wolpert, 1992) for regression and binary classification via Python\u0026rsquo;s scikit-learn. Stacking combines multiple supervised machine learners \u0026ndash; the \u0026ldquo;base\u0026rdquo; or \u0026ldquo;level-0\u0026rdquo; learners \u0026ndash; into a single learner. The currently supported base learners include regularized regression, random forest, gradient boosted trees, support vector machines, and feed-forward neural nets (multi-layer perceptron). pystacked can also be used as a \u0026lsquo;regular\u0026rsquo; machine learning program to fit a single base learner and, thus, provides an easy-to-use API for scikit-learn\u0026rsquo;s machine learning algorithms.\nLink to working paper\nBibtex file\nlassopack # Ahrens A, Hansen CB, Schaffer ME (2020). lassopack: Model selection and prediction with regularized regression in Stata. The Stata Journal. 20(1):176-235. doi:10.1177/1536867X20909697\nIn this article, we introduce lassopack, a suite of programs for regularized regression in Stata. lassopack implements lasso, square-root lasso, elastic net, ridge regression, adaptive lasso, and postestimation ordinary least squares. The methods are suitable for the high-dimensional setting, where the number of predictors p may be large and possibly greater than the number of observations, n. We offer three approaches for selecting the penalization (“tuning”) parameters: information criteria (implemented in lasso2), K-fold cross-validation and h-step-ahead rolling cross-validation for cross-section, panel, and time-series data (cvlasso), and theory-driven (“rigorous” or plugin) penalization for the lasso and square-root lasso for cross-section and panel data (rlasso). We discuss the theoretical framework and practical considerations for each approach. We also present Monte Carlo results to compare the performances of the penalization approaches.\nDownload earlier arXiv-Version \u0026ndash; Stata Journal paper\nFeel free to contact me (AA) if you have trouble accessing the SJ version.\nBibtex file\nSlides # Presentation slides from our presentation at the 2018 London Stata Conference are available here. Presentation slides for pystacked are here. Presentation slides for ddml are here. 
The last term in the objective function imposes a penalty on the absolute size of \\(\\boldsymbol{\\beta}\\) . The intercept \\(\\beta_0\\) is (by default) not penalized.\nlassologit implements the coordinate descent algorithm of Friedman, Hastie \u0026amp; Tibshirani (2010, Section 3). For further speed improvements, we also utilize the strong rule proposed in Tibshirani et al. (2012).\nLike lassopack, lassologit consists of three programs which correspond to three approaches for selecting the tuning parameter \\(\\lambda\\) :\nThe base program lassologit allows to select the tuning parameter as the value of \\(\\lambda\\) that minimizes either \\(AIC\\) , \\(BIC\\) , \\(AIC_c\\) or \\(EBIC\\) . cvlassologit supports \\(K\\) -fold cross-validation. \\(\\lambda\\) may be selected as the value that minimizes the estimated deviance or miss-classification rate. rlassologit implements theory-driven penalization for the logistic lasso (see e.g. Belloni, Chernozhukov \u0026amp; Wei, 2016). Installation\nLassologit has been integrated into lassopack after the first release. To get the latest lassologit version, simply install lassopack. "},{"id":30,"href":"/docs/pdslasso/ivlasso_help/","title":"Help file","section":"PDSLASSO","content":" ---------------------------------------------------------------------------------------------------------------------------------- help pdslasso, help ivlasso pdslasso v1.3 ---------------------------------------------------------------------------------------------------------------------------------- Title pdslasso and ivlasso -- Programs for post-selection and post-regularization OLS or IV estimation and inference Syntax pdslasso depvar regressors (hd_controls) [weight] [if exp] [in range] [ , partial(varlist) pnotpen(varlist) psolver(string) aset(varlist) post(method) robust cluster(varlist) bw(int) kernel(string) fe noftools rlasso[(name)] sqrt noisily loptions(options) olsoptions(options) noconstant ] ivlasso depvar regressors [(hd_controls)] (endog=instruments) [if exp] [in range] [ , partial(varlist) pnotpen(varlist) psolver(string) aset(varlist) post(method) robust cluster(varlist) bw(int) kernel(string) fe noftools rlasso[(name)] sqrt noisily loptions(options) ivoptions(options) first idstats sscset ssgamma(real) ssgridmin(real) ssgridmax(real) ssgridpoints(integer 100) ssgridmat(name) noconstant ] Note: pdslasso requires rlasso and ivreg2 to be installed; ivlasso also requires ranktest. See help rlasso, help ivreg2 and help ranktest or click on ssc install lassopack or ssc install ranktest to install. Note: the fe option will take advantage of the ftools package (if installed) for the fixed-effects transform; the speed gains using this package can be large. See help ftools or click on ssc install ftools to install. Note: ivlasso also supports the simpler pdslasso syntax. 
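For orientation, the two basic call patterns look as follows (a sketch only; y, d, x1-x100 and z1-z50 are placeholder variable names, not part of the package or of any dataset): . pdslasso y d (x1-x100), robust . ivlasso y (x1-x100) (d = z1-z50), robust idstats The first call selects controls from x1-x100 for estimating the effect of d on y; the second additionally selects instruments from z1-z50 for the endogenous regressor d. Full worked examples appear in the examples section below. 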
Options Description ---------------------------------------------------------------------------------------------------------------------------- partial(varlist) controls and instruments to be partialled-out prior to lasso estimation pnotpen(varlist) controls and instruments always included, not penalized by lasso aset(varlist) controls and instruments in amelioration set, always included in post-lasso post(method) pds, lasso or plasso; which estimation results are to be posted in e(b) and e(V) robust heteroskedastic-robust VCE; lasso penalty loadings account for heteroskedasticity cluster(varlist) cluster-robust VCE; lasso penalty loadings account for clustering; both standard (1-way) and 2-way clustering supported bw(int) HAC/AC VCE; lasso penalty loadings account for autocorrelation (AC) using bandwidth int; use with robust to account for both heteroskedasticity and autocorrelation (HAC) kernel(string) kernel used for HAC/AC penalty loadings (one of: bartlett, truncated, parzen, thann, thamm, daniell, tent, qs; default=bartlett) fe fixed-effects model (requires data to be xtset) noftools do not use FTOOLS package for fixed-effects transform (slower; rarely used) rlasso[(name)] store and display intermediate lasso and post-lasso results from rlasso with optional prefix name (if just rlasso is specified the default prefix is _ivlasso_ or _pdslasso_) sqrt use sqrt-lasso instead of standard lasso noisily display step-by-step intermediate rlasso estimation results loptions(options) lasso options specific to rlasso estimation; see help rlasso olsoptions(options) (pdslasso only) options specific to PDS OLS estimation of structural equation ivoptions(options) (ivlasso only) options specific to PDS OLS or IV estimation of structural equation first (ivlasso only) display and store first-stage results for 2SLS idstats (ivlasso only) request weak-identification statistics for 2SLS noconstant suppress constant from regression (cannot be used with aweights or pweights) psolver(string) override default solver used for partialling out (one of: qr, qrxx, lu, luxx, svd, svdxx, chol; default=qrxx) ---------------------------------------------------------------------------------------------------------------------------- Sup-score test Description (ivlasso only) ---------------------------------------------------------------------------------------------------------------------------- sscset request sup-score weak-identification-robust confidence set ssgamma(real) significance level for sup-score weak-identification-robust tests and confidence intervals (default=0.05, 5%) ssgridmin(real) minimum value for grid search for sup-score weak-identification-robust confidence intervals (default=grid centered at OLS estimate) ssgridmax(real) maximum value for grid search for sup-score weak-identification-robust confidence intervals (default=grid centered at OLS estimate) ssgridpoints(real) number of points in grid search for sup-score weak-identification-robust confidence intervals (default=100) ssgridmat(name) user-supplied Stata r x k matrix of r jointly hypothesized values for the k endogenous regressors to be tested using the sup-score test ssomitgrid(name) supress display of sup-score test results with user-supplied grid ssmethod(name) \"abound\" (default) = use conservative critical value (asymptotic bound) c*sqrt(N)*invnormal(1-gamma/(2p)); \"simulate\" = simulate distribution to obtain p-values for sup-score test; \"select\" = reject if rlasso selects any instruments 
---------------------------------------------------------------------------------------------------------------------------- Postestimation: predict [type] newvar [if] [in] [, resid xb ] pdslasso and ivlasso may be used with time-series or panel data, in which case the data must be tsset or xtset first; see help tsset or xtset. aweights and pweights are supported; see help weights. pweights is equivalent to aweights + robust. All varlists may contain time-series operators or factor variables; see help varlist. Contents Description Computational notes Examples of usage Saved results References Website Installation Acknowledgements Citation of pdslasso and ivlasso Description pdslasso and ivlasso are routines for estimating structural parameters in linear models with many controls and/or instruments. The routines use methods for estimating sparse high-dimensional models, specifically the lasso (Least Absolute Shrinkage and Selection Operator, Tibshirani 1996) and the square-root-lasso (Belloni et al. 2011, 2014). pdslasso is used for the case where a researcher has an outcome variable y, a structural or causal variable of interest d, and a large set of potential control variables x1, x2, x3, .... The usage in this case is: pdslasso y d (x1 x2 x3 ...) pdslasso accepts multiple causal variables, e.g.: pdslasso y d1 d2 (x1 x2 x3 ...) Important: The high-dimensional controls must be included within the parentheses (...). If this is not done, they are treated as causal rather than as controls. The problem the researcher faces is that the \"right\" set of controls is not known. In traditional practice, this presents her with a difficult choice: use too few controls, or the wrong ones, and omitted variable bias will be present; use too many, and the model will suffer from overfitting. The methods implemented in pdslasso address this problem by selecting enough controls to address the former problem but not so many as to introduce the latter. ivlasso is used for the case where a researcher has an endogenous causal variable of interest e, and a large set of potential instruments {it:z1, z2, z3, ...). The usage in this case is: ivlasso y (e = z1 z2 z3 ...) ivlasso accepts multiple causal variables, e.g.: pdslasso y (e1 e2 = z1 z2 z3 ...) ivlasso also allows combinations of exogenous and endogenous causal variables (d, e) and high-dimensional controls and instruments (x, z), e.g.: pdslasso y d (x1 x2 x3 ...) (e = z1 z2 z3 ...) Two approaches are implemented in pdslasso and ivlasso: 1. The \"post-double-selection\" (PDS) methodology of Belloni et al. (2012, 2013, 2014, 2015, 2016), denoted \"PDS methodology\" below. 2. The \"post-regularization\" (or \"double-orthogonalization\") methodology of Chernozhukov, Hansen and Spindler (2015), denoted \"CHS methodology\" below. The implemention of these methods in pdslasso and ivlasso uses the separate Stata program rlasso, which provides lasso and sqrt-lasso estimation with data-driven penalization; see rlasso for details. For an overview of rlasso and the theory behind it, see Ahrens et al. (2020) The PDS methodology uses the lasso estimator to select the controls. Specifically, the lasso is used twice: (1) estimate a lasso regression with y as the dependent variable and the control variables x1, x2, x3, ... as regressors; (2) estimate a lasso regression with d as the dependent variable and again the control variables x1, x2, x3, ... as regressors. The lasso estimator achieves a sparse solution, i.e., most coefficients are set to zero. 
The final choice of control variables to include in the OLS regression of y on d is the union of the controls selected in steps (1) and (2), hence the name \"post-double selection\" for the methodology. The PDS methodology can be employed to select instruments as well as controls in instrumental variables estimation. The CHS methodology is closely related. Instead of using the lasso-selected controls and instruments in a post-regularization OLS or IV estimation, the selected variables are used to construct orthogonalized versions of the dependent variable, the exogenous and/or endogenous causal variables of interest and to construct optimal instruments from the lasso-selected IVs. The orthogonalized versions are based either on the lasso or post-lasso estimated coefficients; the post-lasso is OLS applied to lasso-selected variables. See Chernozhukov et al. (2015) for details. The set of variables selected by the lasso and used in the OLS post-lasso estimation and in the PDS structural estimation can be augmented by variables that were penalized but not selected by the lasso. The penalized variables that are used in this way to augment the post-lasso and PDS estimations are called the \"amelioration set\" and can be specified with the aset(varlist) option. This option affects only the CHS post-lasso-based and PDS estimations; the CHS lasso-based orthogonalized variables are unaffected. See Chernozhukov et al. (2014) for details. pdslasso and ivlasso report the PDS-based and the two (lasso and post-lasso) CHS-based estimations. If the sqrt option is specified, instead of the lasso the sqrt-lasso estimator is used; see rlasso for further details and references. If the IV model is weakly identified (the instruments are only weakly correlated with the endogenous regressors) Belloni et al. (2012, 2013) suggest using weak-identification-robust hypothesis tests and confidence sets based on the Chernozhukov et al. (2013) sup-score test. The intuition behind the sup-score test is similar to that of the Anderson-Rubin (1949) test. Consider the simplest case (a single endogenous regressor d and no exogenous regressors or controls) where the null hypothesis is that the coefficient on d is H0:beta=b0. If the null is true, then the structural residual is simply e=y-b0*d. Under the additional assumption that the instruments are valid (orthogonal to the true disturbance), they should be uncorrelated with e. The sup-score tests reported by ivlasso are in effect high-dimensional versions of the Anderson-Rubin test. The test is implemented in rlasso; see help rlasso for details. Specifically, ivlasso reports sup-score tests of statistical significance of the instruments where the dependent variable is e=y-b0*d, the instruments are regressors, and b0 is a hypothesized value of the coefficient on d; a large test statistic indicates rejection of the null H0:beta=b0. The default is to use a conservative (asymptotic bound) critical value as suggested by Belloni et al. (2012, 2013) (option ssmethod(abound)). Alternative methods are to use p-values obtained by simulation via a multiplier bootstrap (option ssmethod(simulate)), or to estimate a lasso regression with the instruments as regressors, and if (no) instruments are selected we (fail to) reject the null H0:beta=b0 at the gamma significance level (option ssmethod(select)). A 100*(1-gamma)% sup-score-based confidence set can be constructed by a grid search over the range of hypothesized values of beta. 
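To fix ideas, the logic of the select method can be mimicked by hand for a single hypothesized value (a sketch only; y, d, z1-z50 and the value 1.5 are placeholders, not part of the package): . gen double e0 = y - 1.5*d . rlasso e0 z1-z50, robust . di e(selected) If rlasso selects no instruments, the null H0:beta=1.5 is not rejected under ssmethod(select); repeating this over a grid of hypothesized values traces out the corresponding confidence set. 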
ivlasso reports the result of the sup-score test of the null H0:beta=0 with the idstats option, and in addition, for the single endogenous regressor case only, reports sup-score confidence sets with the sscset option. For the multiple-endogenous regressor case, sets of jointly hypothesized values for the components of beta can be tested using the ssgridmat(name) option. The matrix provided in the option should be an r x k Stata matrix, where each row contains a set of values that together specify a null hypothesis for the coefficients of the k endogenous regressors. This option allows the user to specify a grid search in multiple dimensions. Computational notes The various options available for the underlying calls to rlasso can be controlled via the option loptions(rlasso option list). The rlasso option center, to center moments in heteroskedastic and cluster-robust loadings, will be a commonly-employed option. This can be specified by lopt(center). Another rlasso option that may often be used is to \"pre-standardize\" the data to have unit variance prior to computing the lasso coefficients with the prestd option. This is a computational alternative to the rlasso default of standardizing \"on the fly\" (i.e., incorporating the standardization into the lasso penalty loadings). This is specified by lopt(prestd). The results are equivalent in theory. The prestd option can lead to improved numerical precision or more stable results in the case of difficult problems; the cost is (a typically small) computation time required to standardize. rlasso implements a version of the lasso with data-dependent penalization and, for the heteroskedastic and clustered cases, regressor-specific penalty loadings; see rlasso for details. Note that specification of robust or cluster(.) as options to pdslasso or ivlasso automatically implies the use of robust or cluster-robust lasso penalty loadings. Penalty loadings and VCE type can be separately controlled via the olsoptions(.) (for pdslasso) or ivoptions(.) (for ivlasso) vs. loptions(rlasso option list); for example, olsoptions(cluster(clustvar)) + loptions(robust) would use heteroskedastic-robust penalty loadings for the lasso estimations and a cluster-robust covariance estimator for the PDS and CHS estimations of the structural equation. Either the partial(varlist) option or the pnotpen(varlist) option can be used for variables that should not be penalized by the lasso. By the Frisch-Waugh-Lovell Theorem for the lasso (Yamada 2017), the estimated lasso coefficients are the same in theory whether the unpenalized regressors are partialled-out or given zero penalty loadings, so long as the same penalty loadings are used for the penalized regressors in both cases. Although the options are equivalent in theory, numerical results can differ in practice because of the different calculation methods used; see rlasso for further details. The constant, if present, is always unpenalized or partialled-out. By default the constant (if present) is not penalized if there are no regressors being partialled out; this is equivalent to mean-centering prior to estimation. The exception to this is if aweights or pweights are specified, in which case the constant is partialled-out. The partial(varlist) option always partials out the constant (if present) along with the variables specified in varlist; to partial out just the constant, specify partial(_cons). Partialling-out of controls is done by ivlasso; partialling-out of instruments is done in the lasso estimation by rlasso. 
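As a concrete check of this equivalence (a sketch; y, d and x1-x100 are placeholder variable names), the following two calls should deliver the same point estimate for d up to numerical precision: . pdslasso y d (x1-x100), partial(x1 x2) robust . pdslasso y d (x1-x100), pnotpen(x1 x2) robust Any small discrepancies reflect the different calculation routes described above. 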
Partialling-out is implemented in Mata using one of Mata's solvers. In cases where the variables to be partialled out are collinear or nearly so, different solvers may generate different results. Users may wish to check the stability of their results in such cases. The psolver(.) option can be used to specify the Mata solver used. The default behavior for solving AX=B for X is to use the QR decomposition applied to (A'A) and (A'B), i.e., qrsolve((A'A),(A'B)), abbreviated qrxx. Available options are qr, qrxx, lu, luxx, svd, svdxx, where, e.g., svd indicates using svsolve(A,B) and svdxx indicates using svsolve((A'A),(A'B)). pdslasso/ivlasso will warn if collinear variables are dropped when partialling out. The lasso and sqrt-lasso estimations are obtained via numerical methods (coordinate descent). Results can be unstable for difficult problems (e.g., if the scaling of variables covers a wide range of magnitudes). Using variables that are all measured on a similar scale will help (as usual). Partialling-out variables is usually preferable to specifying them as unpenalized. See rlasso for discussion of the various options for controlling the numerical methods used. The sup-score-based tests reported by ivlasso come in three versions: (a) using lasso-orthogonalized variables, where the variables have first been orthogonalized with respect to the high-dimensional controls using the lasso; (b) using post-lasso-orthogonalized variables; (c) using the variables without any orthogonalization. The orthogonalizations use the same lasso settings as in the main estimation. After orthgonalization, e~ = y~ - b0*d~ is constructed (where a tilde indicates an orthogonalized variable), and then the sup-score test is conducted using e~ and the instruments. Versions (a) and (b) are not reported if there are no high-dimensional controls. Version (c) is available if there are high-dimensional controls but only if the method(select) option is used. The sup-score-based tests are not available if the specification also includes either exogenous causal regressors or unpenalized instruments. For large datasets, obtaining the p-value for the sup-score test by simulation (multiplier bootstrap, ssmethod(simulate) option) can be time-consuming. In such cases, using the default method of a conservative (asymptotic bound) critical value (ssmethod(abound) option) will be much faster. The grid search to construct the sup-score confidence set can be controlled by the ssgridmin, ssgridmax and ssgridpoints options. If these options are not specified by the user, a 100-point grid centered on the OLS estimator is used. The fe fixed-effects option is equivalent to (but computationally faster and more accurate than) specifying unpenalized panel-specific dummies. The fixed-effects (\"within\") transformation also removes the constant as well as the fixed effects. The panel variable used by the fe option is the panel variable set by xtset. rlasso, like the lasso in general, accommodates possibly perfectly-collinear sets of regressors. Stata's factor variables are supported by rlasso. Users therefore have the option of specifying as high-dimensional controls or instruments one or more complete sets of factor variables or interactions with no base levels using the ibn prefix. This can be interpreted as allowing the lasso to choose the members of the base category. For a detailed discussion of an R implementation of this methodology, see Spindler et al. (2016). 
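For instance, a complete set of category dummies and interactions can be offered to the lasso as high-dimensional controls (a sketch using Stata's auto dataset; the specification is purely illustrative): . sysuse auto, clear . pdslasso price mpg (ibn.rep78 ibn.foreign#ibn.rep78 headroom-gear_ratio), robust Here the lasso in effect chooses which categories of rep78, and which interactions with foreign, to retain as controls. 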
Examples using data from Acemoglu-Johnson-Robinson (2001) Load and reorder AJR data for Table 6 and Table 8 (datasets need to be in current directory). . clear . (click to download maketable6.zip from economics.mit.edu) . unzipfile maketable6 . (click to download maketable8.zip from economics.mit.edu) . unzipfile maketable8 . use maketable6 . merge 1:1 shortnam using maketable8 . keep if baseco==1 . order shortnam logpgp95 avexpr lat_abst logem4 edes1975 avelf, first . order indtime euro1900 democ1 cons1 democ00a cons00a, last Alternatively, load AJR data from our website (no manual download required): . clear . use https://statalasso.github.io/dta/AJR.dta Examples with exogenous regressors: Replicate OLS results in Panel C, col. 9. . reg logpgp95 avexpr lat_abst edes1975 avelf temp* humid* steplow-oilres Basic usage: select from high-dim controls. . pdslasso logpgp95 avexpr (lat_abst edes1975 avelf temp* humid* steplow-oilres) As above, hetoroskedastic-robust. . pdslasso logpgp95 avexpr (lat_abst edes1975 avelf temp* humid* steplow-oilres), rob Specify that latitude is an unpenalized control to be partialled out. . pdslasso logpgp95 avexpr (lat_abst edes1975 avelf temp* humid* steplow-oilres), partial(lat_abst) Specify that latitude is an unpenalized control using the notpen option (equivalent). . pdslasso logpgp95 avexpr (lat_abst edes1975 avelf temp* humid* steplow-oilres), pnotpen(lat_abst) Specify that latitude is in the amelioration set. . pdslasso logpgp95 avexpr (lat_abst edes1975 avelf temp* humid* steplow-oilres), aset(lat_abst) Example with endogenous regressor, high-dimensional controls and low-dimensional instrument: Replicate IV results in Panels A \u0026amp; B, col. 9. . ivreg logpgp95 (avexpr=logem4) lat_abst edes1975 avelf temp* humid* steplow-oilres, first Select controls; specify that logem4 is an unpenalized instrument to be partialled out. . ivlasso logpgp95 (avexpr=logem4) (lat_abst edes1975 avelf temp* humid* steplow-oilres), partial(logem4) Example with endogenous regressor and high-dimensional instruments and controls: Select controls and instruments; specify that logem4 is an unpenalized instrument and lat_abst is an unpenalized control; request weak identification stats and first-stage results. . ivlasso logpgp95 (lat_abst edes1975 avelf temp* humid* steplow-oilres) (avexpr=logem4 euro1900-cons00a), partial(logem4 lat_abst) idstats first Replay first-stage estimation. (Can also use est restore to make this the current estimation results.) . est replay _ivlasso_avexpr Select controls and instruments; specify that lat_abst is an unpenalized control; request weak identification stats and sup-score confidence sets. . ivlasso logpgp95 (lat_abst edes1975 avelf temp* humid* steplow-oilres) (avexpr=logem4 euro1900-cons00a), partial(lat_abst) idstats sscset As above but heteroskedastic-robust and use grid options to control grid search and test level; also set seed in rlasso options to make multiplier-bootstrap p-values replicable. . ivlasso logpgp95 (lat_abst edes1975 avelf temp* humid* steplow-oilres) (avexpr=logem4 euro1900-cons00a), partial(lat_abst) rob idstats sscset ssgridmin(0) ssgridmax(2) ssgamma(0.1) lopt(seed(1)) Examples using data from Angrist-Krueger (1991) Load AK data and rename variables (dataset needs to be in current directory). NB: this is a large dataset (330k observations) and estimations may take some time to run on some installations. . clear . (click to download asciiqob.zip from economics.mit.edu) . unzipfile asciiqob.zip . 
infix lnwage 1-9 edu 10-20 yob 21-31 qob 32-42 pob 43-53 using asciiqob.txt Alternative source (no unzipping needed): . use https://statalasso.github.io/dta/AK91.dta xtset data by place of birth (state): . xtset pob Table VII (1930-39) col 2. Year and state of birth = yob \u0026amp; pob. . ivregress 2sls lnwage i.pob i.yob (edu=i.qob i.yob#i.qob i.pob#i.qob) Fixed effects; select year controls and IVs; IVs are QOB and QOBxYOB. . ivlasso lnwage (i.yob) (edu=i.qob i.yob#i.qob), fe Fixed effects; select year controls and IVs; IVs are QOB, QOBxYOB, QOBxSOB. . ivlasso lnwage (i.yob) (edu=i.qob i.yob#i.qob i.pob#i.qob), fe All dummies \u0026amp; interactions incl. base levels. . ivlasso lnwage (i.yob) (edu=ibn.qob ibn.yob#ibn.qob ibn.pob#ibn.qob), fe Example using data from Belloni et al. (2015) Load dataset on eminent domain (available at journal website). . clear . import excel using https://statalasso.github.io/dta/CSExampleData.xlsx, first Settings used in Belloni et al. (2015) - results as in journal replication file (not text) (Includes use of undocumented rlasso option c0(real) to control initial penalty loadings.) Store rlasso intermediate results for replay later. . ivlasso CSIndex (NumProCase = Z*), nocons robust rlasso lopt(lalt corrnum(0) maxpsiiter(100) c0(0.55)) . estimates replay _ivlasso_step5_NumProCase Saved results ivlasso saves the following in e(): scalars e(N) sample size e(xhighdim_ct) number of all high-dimensional controls e(zhighdim_ct) number of all high-dimensional instruments e(N_clust) number of clusters in cluster-robust estimation; in the case of 2-way cluster-robust, e(N_clust)=min(e(N_clust1),e(N_clust2)) e(N_g) number of groups in fixed-effects model e(bw) (HAC/AC only) bandwidth used e(ss_gamma) significance level in sup-score tests and CIs e(ss_level) test level in % in sup-score tests and CIs (=100*(1-gamma)) e(ss_gridmin) min grid point in sup-score CI e(ss_gridmax) max grid point in sup-score CI e(ss_gridpoints) number of grid points in sup-score CI macros e(cmd) pdslasso or ivlasso e(depvar) name of dependent variable e(dexog) name(s) of exogenous structural variable(s) e(dendog) name(s) endogenous structural variable(s) e(xhighdim) names of high-dimensional control variables e(zhighdim) names of high-dimensional instruments e(method) lasso or sqrt-lasso e(kernel) (HAC/AC only) kernel used e(ss_null) result of sup-score test (reject/fail to reject) e(ss_null_l) result of lasso-orthogonalized sup-score test (reject/fail to reject) e(ss_null_pl) result of post-lasso-orthogonalized sup-score test (reject/fail to reject) e(ss_cset) confidence interval for sup-score test e(ss_cset_l) confidence interval for lasso-orthogonalized sup-score test e(ss_cset_pl) confidence interval for post-lasso-orthogonalized sup-score test e(ss_method) simulate, abound or select matrices e(b) posted coefficient vector e(V) posted variance-covariance matrix e(beta_pds) PDS coefficient vector e(V_pds) PDS variance-covariance matrix e(beta_lasso) CHS lasso-based coefficient vector e(V_lasso) CHS lasso-based variance-covariance matrix e(beta_plasso) CHS post-lasso-based coefficient vector e(V_plasso) CHS post-lasso-based variance-covariance matrix e(ss_citable) sup-score test results used to construct confidence sets e(ss_gridmat) sup-score test results using user-specified grid functions e(sample) References Ahrens, A., Hansen, C.B. and M.E. Schaffer. 2020. lassopack: model selection and prediction with regularized regression in Stata. The Stata Journal, 20(1):176-235. 
https://journals.sagepub.com/doi/abs/10.1177/1536867X20909697. Working paper version: https://arxiv.org/abs/1901.05397. Anderson, T. W. and Rubin, H. 1949. Estimation of the Parameters of Single Equation in a Complete System of Stochastic Equations. Annals of Mathematical Statistics 20:46-63. https://projecteuclid.org/euclid.aoms/1177730090 Angrist, J. and Kruger, A. 1991. Does compulsory school attendance affect schooling and earnings? Quarterly Journal of Economics 106(4):979-1014. http://www.jstor.org/stable/2937954 Belloni, A., Chernozhukov, V. and Wang, L. 2011. Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 98:791-806. https://doi.org/10.1214/14-AOS1204 Belloni, A., Chen, D., Chernozhukov, V. and Hansen, C. 2012. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80(6):2369-2429. http://onlinelibrary.wiley.com/doi/10.3982/ECTA9626/abstract Belloni, A., Chernozhukov, V. and Hansen, C. 2013. Inference for high-dimensional sparse econometric models. In Advances in Economics and Econometrics: 10th World Congress, Vol. 3: Econometrics, Cambridge University Press: Cambridge, 245-295. http://arxiv.org/abs/1201.0220 Belloni, A., Chernozhukov, V. and Hansen, C. 2014. Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies 81:608-650. https://doi.org/10.1093/restud/rdt044 Belloni, A., Chernozhukov, V. and Hansen, C. 2015. High-dimensional methods and inference on structural and treatment effects. Journal of Economic Perspectives 28(2):29-50. http://www.aeaweb.org/articles.php?doi=10.1257/jep.28.2.29 Belloni, A., Chernozhukov, V., Hansen, C. and Kozbur, D. 2016. Inference in High Dimensional Panel Models with an Application to Gun Control. Journal of Business and Economic Statistics 34(4):590-605. http://amstat.tandfonline.com/doi/full/10.1080/07350015.2015.1102733 Belloni, A., Chernozhukov, V. and Wang, L. 2014. Pivotal estimation via square-root-lasso in nonparametric regression. Annals of Statistics 42(2):757-788. https://doi.org/10.1214/14-AOS1204 Chernozhukov, V., Chetverikov, D. and Kato, K. 2013. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Annals of Statistics 41(6):2786-2819. https://projecteuclid.org/euclid.aos/1387313390 Chernozhukov, V. Hansen, C., and Spindler, M. 2015. Post-selection and post-regularization inference in linear models with many controls and instruments. American Economic Review: Papers \u0026amp; Proceedings 105(5):486-490. http://www.aeaweb.org/articles.php?doi=10.1257/aer.p20151022 Correia, S. 2016. FTOOLS: Stata module to provide alternatives to common Stata commands optimized for large datasets. https://ideas.repec.org/c/boc/bocode/s458213.html Spindler, M., Chernozhukov, V. and Hansen, C. 2016. High-dimensional metrics. https://cran.r-project.org/package=hdm. Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58(1):267-288. https://doi.org/10.2307/2346178 Yamada, H. 2017. The Frisch-Waugh-Lovell Theorem for the lasso and the ridge regression. Communications in Statistics - Theory and Methods 46(21):10897-10902. http://dx.doi.org/10.1080/03610926.2016.1252403 Website Please check our website https://statalasso.github.io/ for more information. Installation pdslasso/ivlasso require installation of the lassopack package. 
To get the latest stable versions of lassopack and pdslasso/ivlasso from our website, check the installation instructions at https://statalasso.github.io/installation/. We update the website versions more frequently than the SSC version. Earlier versions of these programs are also available from the website. To verify that pdslasso is correctly installed, click on or type whichpkg pdslasso (which requires whichpkg to be installed; ssc install whichpkg). Acknowledgements Thanks to Sergio Correia for advice on the use of the FTOOLS package. Citation of pdslasso and ivlasso pdslasso and ivlasso are not official Stata commands. They are free contributions to the research community, like a paper. Please cite it as such: Ahrens, A., Hansen, C.B., Schaffer, M.E. 2018 (updated 2020). pdslasso and ivlasso: Progams for post-selection and post-regularization OLS or IV estimation and inference. http://ideas.repec.org/c/boc/bocode/s458459.html Authors Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland achim.ahrens@gess.ethz.ch Christian B. Hansen, University of Chicago, USA Christian.Hansen@chicagobooth.edu Mark E. Schaffer, Heriot-Watt University, UK m.e.schaffer@hw.ac.uk Also see Help: rlasso, lasso2, cvlasso (if installed) "},{"id":31,"href":"/docs/pystacked/parallel/","title":"Parallelization","section":"PYSTACKED","content":" Parallelization # pystacked can be run in parallel, even without a StataMP license.\npystacked can be parallelized at the level of the base learners or at the stacking level (to speed up the cross-validation process). Example 1 below uses no parallelization (the default). Example 2 parallelizes the random forest base learner. Example 3 parallelizes at the top level.\n. insheet using /// \u0026gt; https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data, /// \u0026gt; clear comma . set seed 42 . gen uni=runiform() . sort uni . timer on 1 . pystacked v58 v1-v57, type(class) methods(rf gradboost nnet) /// \u0026gt; cmdopt1(n_estimators(1000)) Stacking weights: --------------------------------------- Method | Weight -----------------+--------------------- rf | 0.3698014 gradboost | 0.5437376 nnet | 0.0864610 . timer off 1 . timer on 2 . pystacked v58 v1-v57, type(class) methods(rf gradboost nnet) /// \u0026gt; cmdopt1(n_jobs(-1) n_estimators(1000)) Stacking weights: --------------------------------------- Method | Weight -----------------+--------------------- rf | 0.3293277 gradboost | 0.5661072 nnet | 0.1045651 . timer off 2 . timer on 3 . pystacked v58 v1-v57, type(class) methods(rf gradboost nnet) /// \u0026gt; cmdopt1(n_estimators(1000)) njobs(-1) Stacking weights: --------------------------------------- Method | Weight -----------------+--------------------- rf | 0.3514905 gradboost | 0.5024690 nnet | 0.1460405 . timer off 3 . timer list 1: 196.95 / 1 = 196.9510 2: 30.01 / 1 = 30.0140 3: 83.05 / 1 = 83.0450 Which method is faster depends on the choice and number of base learners and number of folds. In this example, parallelizing the random forest is the fastest approach since we fit many trees independently.\nn_jobs(-1) uses all available cores. If you don\u0026rsquo;t want to use all CPUs, you can use, for example, n_jobs(4) to ask for 4 CPUs; see also the scikit-learn documentation. n_jobs(-2) asks for all cores minus 1.\nYou can change the backend used for parallelization using backend(); the default is \u0026rsquo;loky\u0026rsquo; under Linux/MacOS and \u0026rsquo;threading\u0026rsquo; under Windows. See here for more information. 
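For instance, to parallelize at the stacking level with all CPUs minus one and to override the default backend (a sketch; the learner settings are arbitrary): . pystacked v58 v1-v57, type(class) methods(rf gradboost) cmdopt1(n_estimators(500)) njobs(-2) backend(threading) 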
"},{"id":32,"href":"/docs/ddml/help/","title":"Help file","section":"DDML","content":" ------------------------------------------------------------------------------------------------------------------------ help ddml v0.5 ------------------------------------------------------------------------------------------------------------------------ Title ddml -- Stata package for Double Debiased Machine Learning ddml implements algorithms for causal inference aided by supervised machine learning as proposed in Double/debiased machine learning for treatment and structural parameters (Econometrics Journal, 2018). Five different models are supported, allowing for binary or continous treatment variables and endogeneity, high-dimensional controls and/or instrumental variables. ddml supports a variety of different ML programs, including but not limited to lassopack and pystacked. The package includes the wrapper program qddml, which uses a simplified one-line syntax, but offers less flexibility. qddml relies on crossfit, which can be used as a standalone program. Please check the examples provided at the end of the help file. Syntax Estimation with ddml proceeds in four steps. Step 1. Initialize ddml and select model: ddml init model [if] [in] [ , mname(name) kfolds(integer) fcluster(varname) foldvar(varlist) reps(integer) norandom tabfold vars(varlist) ] where model is either partial, iv, interactive, fiv, interactiveiv; see model descriptions. Step 2. Add supervised ML programs for estimating conditional expectations: ddml eq [ , mname(name) vname(varname) learner(varname) vtype(string) predopt(string) ] : command depvar vars [ , cmdopt ] where, depending on model chosen in Step 1, eq is either E[Y|X] E[Y|D,X] E[Y|X,Z] E[D|X] E[D|X,Z] E[Z|X]. command is a supported supervised ML program (e.g. pystacked or cvlasso). See supported programs. Note: Options before \":\" and after the first comma refer to ddml. Options that come after the final comma refer to the estimation command. Step 3. Cross-fitting: ddml crossfit [ , mname(name) shortstack ] This step implements the cross-fitting algorithm. Each learner is fitted iteratively on training folds and out-of-sample predicted values are obtained. Step 4. Estimate causal effects: ddml estimate [ , mname(name) robust cluster(varname) vce(type) att trim(real) ] The ddml estimate command returns treatment effect estimates for all combination of learners added in Step 2. Optional. Report/post selected results: ddml estimate [ , replay mname(name) spec(integer or string) rep(integer or string) fulltable notable allest ] Auxiliary sub-programs: Download latest ddml from Github: ddml update Report information about ddml model: ddml desc [ , mname(name) learners crossfit estimates sample all ] Export results in csv format: ddml export [ , mname(name) fname(name) ] Retrieve information from ddml: ddml extract [ object_name , mname(name) show(display_item) ename(name) vname(varname) stata keys key1(string) key2(string) key3(string) subkey1(string) subkey2(string) ] display_item can be mse, n or pystacked. ddml stores many internal results on associative arrays. These can be retrieved using the different key options. See ddml extract for details. Drop the ddml estimation mname and all associated variables: ddml drop mname Options init options Description ------------------------------------------------------------------------------------------------------------------ mname(name) name of the DDML model. Allows to run multiple DDML models simultaneously. Defaults to m0. 
kfolds(integer) number of cross-fitting folds. The default is 5. fcluster(varname) cluster identifiers for cluster randomization of random folds. foldvar(varlist) integer variable with user-specified cross-fitting folds (one per cross-fitting repetition). norandom use observations in existing order instead of randomizing before splitting into folds; if multiple resamples, applies to first resample only; ignored if user-defined fold variables are provided in foldvar(varlist). reps(integer) number of re-sampling iterations, i.e., how often the cross-fitting procedure is repeated on randomly generated folds. tabfold prints a table with frequency of observations by fold. ------------------------------------------------------------------------------------------------------------------ Equation options Description ------------------------------------------------------------------------------------------------------------------ mname(name) name of the DDML model. Defaults to m0. vname(varname) name of the dependent variable in the reduced form estimation. This is usually inferred from the command line but is mandatory for the fiv model. learner(varname) name of the variable to be created. vtype(string) variable type of the variable to be created. Defaults to double. none can be used to leave the type field blank (this is required when using ddml with rforest.) predopt(string) predict option to be used to get predicted values. Typical values could be xb or pr. Default is blank. ------------------------------------------------------------------------------------------------------------------ Cross-fitting Description ------------------------------------------------------------------------------------------------------------------ mname(name) name of the DDML model. Defaults to m0. shortstack asks for short-stacking to be used. Short-stacking runs contrained non-negative least squares on the cross-fitted predicted values to obtain a weighted average of several base learners. ------------------------------------------------------------------------------------------------------------------ Estimation Description ------------------------------------------------------------------------------------------------------------------ mname(name) name of the DDML model. Defaults to m0. spec(integer/string) select specification (specification number, \"mse\" or \"ss\") rep(integer/string) select resampling iteration (resample number, \"mn\" or \"md\") robust report SEs that are robust to the presence of arbitrary heteroskedasticity. cluster(varname) select cluster-robust variance-covariance estimator. vce(type) select variance-covariance estimator, see here trim(real) trimming of propensity scores. The default is 0.01 (that is, values below 0.01 and above 0.99 are set to 0.01 and 0.99, respectively). ------------------------------------------------------------------------------------------------------------------ Auxiliary Description ------------------------------------------------------------------------------------------------------------------ mname(name) name of the DDML model. Defaults to m0. ------------------------------------------------------------------------------------------------------------------ Models This section provides an overview of supported models. Throughout we use Y to denote the outcome variable, X to denote confounders, Z to denote instrumental variable(s), and D to denote the treatment variable(s) of interest. 
Partial linear model [partial] Y = a.D + g(X) + U D = m(X) + V where the aim is to estimate a while controlling for X. To this end, we estimate the conditional expectations E[Y|X] and E[D|X] using a supervised machine learner. Interactive model [interactive] Y = g(X,D) + U D = m(X) + V which relaxes the assumption that X and D are separable. D is a binary treatment variable. We estimate the conditional expectations E[D|X], as well as E[Y|X,D=0] and E[Y|X,D=1] (jointly added using ddml E[Y|X,D]). Partial linear IV model [iv] Y = a.D + g(X) + U Z = m(X) + V where the aim is to estimate a. We estimate the conditional expectations E[Y|X], E[D|X] and E[Z|X] using a supervised machine learner. Interactive IV model [interactiveiv] Y = g(Z,X) + U D = h(Z,X) + V Z = m(X) + E where the aim is to estimate the local average treatment effect. We estimate, using a supervised machine learner, the following conditional expectations: E[Y|X,Z=0] and E[Y|X,Z=1] (jointly added using ddml E[Y|X,Z]); E[D|X,Z=0] and E[D|X,Z=1] (jointly added using ddml E[D|X,Z]); E[Z|X]. Flexible Partially Linear IV model [fiv] Y = a.D + g(X) + U D = m(Z) + g(X) + V where the estimand of interest is a. We estimate the conditional expectations E[Y|X], E[D^|X] and D^:=E[D|Z,X] using a supervised machine learner. The instrument is then formed as D^-E^[D^|X] where E^[D^|X] denotes the estimate of E[D^|X]. Note: \"{D}\" is a placeholder that is used because the last step (estimation of E[D|X]) uses the fitted values from estimating E[D|X,Z]. Please see the example section below. Compatible programs ddml is compatible with a large set of user-written Stata commands. It has been tested with - lassopack for regularized regression (see lasso2, cvlasso, rlasso). - the pystacked package (see pystacked). Note that pystacked requires Stata 16. - rforest by Zou \u0026amp; Schonlau. Note that rforest requires the option vtype(none). - svmachines by Guenther \u0026amp; Schonlau. Beyond these, it is compatible with any Stata program that - uses the standard \"reg y x\" syntax, - supports if-conditions, - and comes with predict post-estimation programs. Examples Below we demonstrate the use of ddml for each of the 5 models supported. Note that estimation models are chosen for demonstration purposes only and kept simple to allow you to run the code quickly. Partially linear model I. Preparations: we load the data, define global macros and set the seed. . use https://github.com/aahrens1/ddml/raw/master/data/sipp1991.dta, clear . global Y net_tfa . global D e401 . global X tw age inc fsize educ db marr twoearn pira hown . set seed 42 We next initialize the ddml estimation and select the model. partial refers to the partially linear model. The model will be stored in a Mata object with the default name \"m0\" unless otherwise specified using the mname(name) option. Note that we set the number of random folds to 2, so that the model runs quickly. The default is kfolds(5). We recommend considering at least 5-10 folds and even more if your sample size is small. Note also that we recommend re-running the model multiple times on different random folds; see options reps(integer). . ddml init partial, kfolds(2) We add supervised machine learners for estimating the conditional expectation E[Y|X]. We first add simple linear regression. . ddml E[Y|X]: reg $Y $X We can add more than one learner per reduced form equation. Here, we also add a random forest learner (implemented in pystacked). . 
ddml E[Y|X]: pystacked $Y $X, type(reg) method(rf) We do the same for the conditional expectation E[D|X]. . ddml E[D|X]: reg $D $X . ddml E[D|X]: pystacked $D $X, type(reg) method(rf) Optionally, you can check if the learners have been added correctly. . ddml desc Cross-fitting. The learners are iteratively fitted on the training data. This step may take a while. . ddml crossfit Finally, we obtain estimates of the coefficients of interest. Since we added two learners for each of our two reduced form equations, we get 4 point estimates. The result shown corresponds to the model with the lowest out-of-sample MSPE. . ddml estimate, robust To retrieve the very first specification shown, you can type: . ddml estimate, robust spec(1) replay You could manually retrieve the same point estimate by typing: . reg Y1_reg D1_reg, nocons robust or graphically: . twoway (scatter Y1_reg D1_reg) (lfit Y1_reg D1_reg) where Y1_reg and D1_reg are the orthogonalized versions of net_tfa and e401. To describe the ddml model setup or results in detail, you can use ddml describe with the relevant option (sample, learners, crossfit, estimates), or just describe them all with the all option: . ddml describe, all Partially linear model II. Stacking regression using pystacked. Stacking regression is a simple and powerful method for combining predictions from multiple learners. It is available in Stata via the pystacked package. Below is an example with the partially linear model, but it can be used with any model supported by ddml. Preparation: use the data and globals as above. Use the name m1 for this new estimation, to distinguish it from the previous example that uses the default name m0. This enables having multiple estimations available for comparison. Also specify 5 resamplings. . set seed 42 . ddml init partial, kfolds(2) reps(5) mname(m1) Add supervised machine learners for estimating conditional expectations. The first learner in the stacked ensemble is OLS. We also use cross-validated lasso, ridge and two random forests with different settings, which we save in the following macros: . global rflow max_features(5) min_samples_leaf(1) max_samples(.7) . global rfhigh max_features(5) min_samples_leaf(10) max_samples(.7) In each step, we add the mname(m1) option to ensure that the learners are not added to the m0 model which is still in memory. We also specify the names of the variables containing the estimated conditional expectations using the learner(varname) option. This avoids overwriting the variables created for the m0 model using default naming. . ddml E[Y|X], mname(m1) learner(Y_m1): pystacked $Y $X || method(ols) || method(lassocv) || method(ridgecv) || method(rf) opt($rflow) || method(rf) opt($rfhigh), type(reg) . ddml E[D|X], mname(m1) learner(D_m1): pystacked $D $X || method(ols) || method(lassocv) || method(ridgecv) || method(rf) opt($rflow) || method(rf) opt($rfhigh), type(reg) Note: Options before \":\" and after the first comma refer to ddml. Options that come after the final comma refer to the estimation command. Make sure to not confuse the two types of options. Check if learners were correctly added: . ddml desc, mname(m1) learners Cross-fitting and estimation. . ddml crossfit, mname(m1) . ddml estimate, mname(m1) robust Examine the learner weights used by pystacked. . ddml extract, mname(m1) show(pystacked) We can compare the effects with the first ddml model (if you have run the first example above). . ddml estimate, mname(m0) replay Partially linear IV model. 
Preparations: we load the data, define global macros and set the seed. . use https://statalasso.github.io/dta/AJR.dta, clear . global Y logpgp95 . global D avexpr . global Z logem4 . global X lat_abst edes1975 avelf temp* humid* steplow-oilres . set seed 42 Preparations: we load the data, define global macros and set the seed. Since the data set is very small, we consider 30 cross-fitting folds. . ddml init iv, kfolds(30) The partially linear IV model has three conditional expectations: E[Y|X], E[D|X] and E[Z|X]. For each reduced form equation, we add two learners: regress and rforest. We need to add the option vtype(none) for rforest to work with ddml since rforest's predict command doesn't support variable types. . ddml E[Y|X]: reg $Y $X . ddml E[Y|X], vtype(none): rforest $Y $X, type(reg) . ddml E[D|X]: reg $D $X . ddml E[D|X], vtype(none): rforest $D $X, type(reg) . ddml E[Z|X]: reg $Z $X . ddml E[Z|X], vtype(none): rforest $Z $X, type(reg) Cross-fitting and estimation. . ddml crossfit . ddml estimate, robust If you are curious what ddml does in the background: . ddml estimate m0, spec(8) rep(1) . ivreg Y2_rf (D2_rf = Z2_rf), nocons Interactive model--ATE and ATET estimation. Preparations: we load the data, define global macros and set the seed. . webuse cattaneo2, clear . global Y bweight . global D mbsmoke . global X mage prenatal1 mmarried fbaby mage medu . set seed 42 We use 5 folds and 5 resamplings; that is, we estimate the model 5 times using randomly chosen folds. . ddml init interactive, kfolds(5) reps(5) We need to estimate the conditional expectations of E[Y|X,D=0], E[Y|X,D=1] and E[D|X]. The first two conditional expectations are added jointly. We consider two supervised learners: linear regression and gradient boosted trees (implemented in pystacked). Note that we use gradient boosted regression trees for E[Y|X,D], but gradient boosted classification trees for E[D|X]. . ddml E[Y|X,D]: reg $Y $X . ddml E[Y|X,D]: pystacked $Y $X, type(reg) method(gradboost) . ddml E[D|X]: logit $D $X . ddml E[D|X]: pystacked $D $X, type(class) method(gradboost) Cross-fitting: . ddml crossfit In the final estimation step, we can estimate both the average treatment effect (the default) or the average treatment effect of the treated (atet). . ddml estimate . ddml estimate, atet trim(0) Recall that we have specified 5 resampling iterations (reps(5)) By default, the median over the minimum-MSE specification per resampling iteration is shown. At the bottom, a table of summary statistics over resampling iterations is shown. Interactive IV model--LATE estimation. Preparations: we load the data, define global macros and set the seed. . use http://fmwww.bc.edu/repec/bocode/j/jtpa.dta,clear . global Y earnings . global D training . global Z assignmt . global X sex age married black hispanic . set seed 42 We initialize the model. . ddml init interactiveiv, kfolds(5) We again add two learners per reduced form equation. . ddml E[Y|X,Z]: reg $Y $X . ddml E[Y|X,Z]: pystacked $Y c.($X)# #c($X), type(reg) m(lassocv) . ddml E[D|X,Z]: logit $D $X . ddml E[D|X,Z]: pystacked $D c.($X)# #c($X), type(class) m(lassocv) . ddml E[Z|X]: logit $Z $X . ddml E[Z|X]: pystacked $Z c.($X)# #c($X), type(class) m(lassocv) Cross-fitting and estimation. . ddml crossfit . ddml estimate Flexible Partially Linear IV model. Preparations: we load the data, define global macros and set the seed. . use https://github.com/aahrens1/ddml/raw/master/data/BLP.dta, clear . global Y share . global D price . global X hpwt air mpd space . 
global Z sum* . set seed 42 We initialize the model. . ddml init fiv We add learners for E[Y|X] in the usual way. . ddml E[Y|X]: reg $Y $X . ddml E[Y|X]: pystacked $Y $X, type(reg) There are some peculiarities that we need to bear in mind when adding learners for E[D|Z,X] and E[D|X]. The reason for this is that the estimation of E[D|X] depends on the estimation of E[D|X,Z]. More precisely, we first obtain the fitted values D^=E[D|X,Z] and fit these against X to estimate E[D^|X]. When adding learners for E[D|Z,X], we need to provide a name for each learner using learner(name). . ddml E[D|Z,X], learner(Dhat_reg): reg $D $X $Z . ddml E[D|Z,X], learner(Dhat_pystacked): pystacked $D $X $Z, type(reg) When adding learners for E[D|X], we explicitly refer to the learner from the previous step (e.g., learner(Dhat_reg)) and also provide the name of the treatment variable (vname($D)). Finally, we use the placeholder {D} in place of the dependent variable. . ddml E[D|X], learner(Dhat_reg) vname($D): reg {D} $X . ddml E[D|X], learner(Dhat_pystacked) vname($D): pystacked {D} $X, type(reg) That's it. Now we can move to cross-fitting and estimation. . ddml crossfit . ddml estimate If you are curious what ddml does in the background: . ddml estimate m0, spec(8) rep(1) . gen Dtilde = $D - Dhat_pystacked_h_1 . gen Zopt = Dhat_pystacked_1 - Dhat_pystacked_h_1 . ivreg Y2_pystacked_1 (Dtilde=Zopt), nocons References Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018), Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68. https://doi.org/10.1111/ectj.12097 Installation To get the latest stable version of ddml from our website, check the installation instructions at https://statalasso.github.io/installation/. We update the stable website version more frequently than the SSC version. To verify that ddml is correctly installed, click on or type whichpkg ddml (which requires whichpkg to be installed; ssc install whichpkg). Authors Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland achim.ahrens@gess.ethz.ch Christian B. Hansen, University of Chicago, USA Christian.Hansen@chicagobooth.edu Mark E Schaffer, Heriot-Watt University, UK m.e.schaffer@hw.ac.uk Thomas Wiemann, University of Chicago, USA wiemann@uchicago.edu Also see (if installed) Help: lasso2, cvlasso, rlasso, ivlasso, pdslasso, pystacked. "},{"id":33,"href":"/docs/pystacked/help/","title":"Help file","section":"PYSTACKED","content":" ------------------------------------------------------------------------------------------------------------------------ help pystacked v0.4.8 ------------------------------------------------------------------------------------------------------------------------ Title pystacked -- Stata program for Stacking Regression Overview pystacked implements stacking regression (Wolpert, 1992) via scikit-learn's sklearn.ensemble.StackingRegressor and sklearn.ensemble.StackingClassifier. Stacking is a way of combining multiple supervised machine learners (the \"base\" or \"level-0\" learners) into a meta learner. The currently supported base learners are linear regression, logit, lasso, ridge, elastic net, (linear) support vector machines, gradient boosting, and neural nets (MLP). pystacked can also be used with a single base learner and, thus, provides an easy-to-use API for scikit-learn's machine learning algorithms. pystacked requires Stata 16 or higher, a Python installation and scikit-learn (0.24 or higher). 
pystacked has been tested with scikit-learn 0.24, 1.0, 1.1.0, 1.1.1 and 1.1.2. See here and here for how to set up Python for Stata on your system. Contents Syntax overview Syntax 1 Syntax 2 Other options Postestimation and prediction options Stacking Supported base learners Base learners: Options Pipelines Learner-specific predictors Example Stacking Regression Example Stacking Classification Installation Misc (references, contact, etc.) Syntax overview There are two alternative syntaxes. The first syntax is: pystacked depvar predictors [if exp] [in range] [, methods(string) cmdopt1(string) cmdopt2(string) ... pipe1(string) pipe2(string) ... xvars1(varlist) xvars2(varlist) ... otheropts ] The second syntax is: pystacked depvar predictors || method(string) opt(string) pipeline(string) xvars(varlist) || method(string) opt(string) pipeline(string) xvars(varlist) || ... || [if exp] [in range] [, otheropts ] The first syntax uses methods(string) to select base learners, where string is a list of base learners. Options are passed on to base learners via cmdopt1(string), cmdopt2(string) to cmdopt10(string). That is, up to 10 base learners can be specified and options are passed on in the order in which they appear in methods(string) (see Command options). Likewise, the pipe*(string) option can be used for pre-processing predictors within Python on the fly (see Pipelines). Furthermore, xvars*(varlist) allows to specify a learner-specific varlist of predictors. The second syntax imposes no limit on the number of base learners (aside from the increasing computational complexity). Base learners are added before the comma using method(string) together with opt(string) and separated by \"||\". Syntax 1 Option Description ------------------------------------------------------------------------------------------------------------------ methods(string) a list of base learners, defaults to \"ols lassocv gradboost\" for regression and \"logit lassocv gradboost\" for classification; see Base learners. cmdopt*(string) options passed to the base learners, see Command options. pipe*(string) pipelines passed to the base learners, see Pipelines. Regularized linear learners use the stdscaler pipeline by default, which standardizes the predictors. To suppress this, use nostdscaler. For other learners, there is no default pipeline. xvars*(varlist) overwrites the default list of predictors. That is, you can specify learner-specific lists of predictors. See here. ------------------------------------------------------------------------------------------------------------------ Note: * is replaced with 1 to 10. The number refers to the order given in methods(string). Syntax 2 Option Description ------------------------------------------------------------------------------------------------------------------ method(string) a base learner, see Base learners. opt(string) options, see Command options. pipeline(string) pipelines applied to the predictors, see Pipelines. pipelines passed to the base learners, see Pipelines. Regularized linear learners use the stdscaler pipeline by default, which standardizes the predictors. To suppress this, use nostdscaler. For other learners, there is no default pipeline. xvars(varlist) overwrites the default list of predictors. That is, you can specify learner-specific lists of predictors. See here. 
------------------------------------------------------------------------------------------------------------------ Other options Option Description ------------------------------------------------------------------------------------------------------------------ type(string) reg(ress) for regression problems or class(ify) for classification problems. The default is regression. finalest(string) final estimator used to combine base learners. The default is non-negative least squares without an intercept and the additional constraint that weights sum to 1 (nnls1). Alternatives are nnls0 (non-negative least squares without intercept without the sum-to-one constraint), singlebest (use base learner with minimum MSE), ols (ordinary least squares) or ridge for (logistic) ridge, which is the sklearn default. For more information, see here. nosavepred do not save predicted values (do not use if predict is used after estimation) nosavebasexb do not save predicted values of each base learner (do not use if predict with basexb is used after estimation) njobs(int) number of jobs for parallel computing. The default is 0 (no parallelization), -1 uses all available CPUs, -2 uses all CPUs minus 1. backend(string) joblib backend used for parallelization; the default is 'loky' under Linux/MacOS and 'threading' under Windows. See here for more information. folds(int) number of folds used for cross-validation (not relevant for voting); default is 5. Ignored if foldvar(varname) if specified. foldvar(varname) integer fold variable for cross-validation. bfolds(int) number of folds used for base learners that use cross-validation (e.g. lassocv); default is 5. norandom folds are created using the ordering of the data. noshuffle cross-validation folds for base learners that use cross-validation (e.g. lassocv) are based on ordering of the data. sparse converts predictor matrix to a sparse matrix. This will only lead to speed improvements if the predictor matrix is sufficiently sparse. Not all learners support sparse matrices and not all learners will benefit from sparse matrices in the same way. You can also use the sparse pipeline to use sparse matrices for some learners, but not for others. pyseed(int) set the Python seed. Note that since pystacked uses Python, we also need to set the Python seed to ensure replicability. Three options: 1) pyseed(-1) draws a number between 0 and 10^8 in Stata which is then used as a Python seed. This way, you only need to deal with the Stata seed. For example, set seed 42 is sufficient, as the Python seed is generated automatically. 2) Setting pyseed(x) with any positive integer x allows to control the Python seed directly. 3) pyseed(0) sets the seed to None in Python. The default is pyseed(-1). ------------------------------------------------------------------------------------------------------------------ Voting Description ------------------------------------------------------------------------------------------------------------------ voting use voting regressor (ensemble.VotingRegressor) or voting classifier (ensemble.VotingClassifier); see here for a brief explanation. votetype(string) type of voting classifier: hard (default) or soft voteweights(numlist) positive weights used for voting regression/classification. The length of numlist should be the number of base learners - 1. The last weight is calculated to ensure that sum(weights)=1. 
------------------------------------------------------------------------------------------------------------------ Postestimation and prediction options Postestimation tables After estimation, pystacked can report a table of in-sample (both cross-validated and full-sample-refitted) and, optionally, out-of-sample (holdout sample) performance for both the stacking regression and the base learners. For regression problems, the table reports the root MSPE (mean squared prediction error); for classification problems, a confusion matrix is reported. The default holdout sample used for out-of-sample performance with the holdout option is all observations not included in the estimation. Alternatively, the user can specify the holdout sample explicitly using the syntax holdout(varname). The table can be requested postestimation as below, or as part of the pystacked estimation command. Table syntax: pystacked [, table holdout[(varname)] ] Postestimation graphs pystacked can also report graphs of in-sample and, optionally, out-of-sample (holdout sample) performance for both the stacking regression and the base learners. For regression problems, the graphs compare predicted vs actual values of depvar. For classification problems, the default is to report ROC curves; optionally, histograms of predicted probabilities are reported. As with the table option, the default holdout sample used for out-of-sample performance is all observations not included in the estimation, but the user can instead specify the holdout sample explicitly. The graphs can be requested postestimation as below, or as part of the pystacked estimation command. The graph option on its own reports the graphs using pystacked's default settings. Because graphs are produced using Stata's twoway, roctab and histogram commands, the user can control how the combined graph (graph(options)) and the individual learner graphs (lgraph(options)) appear by passing options to these commands. Graph syntax: pystacked [, graph[(options)] lgraph[(options)] histogram holdout[(varname)] ] Prediction To get stacking predicted values: predict type newname [if exp] [in range] [, pr class xb resid ] To get fitted values for each base learner: predict type stub [if exp] [in range] [, basexb cvalid ] Option Description ------------------------------------------------------------------------------------------------------------------ xb predicted value; the default for regression pr predicted probability; the default for classification class predicted class resid residuals basexb predicted values for each base learner (default = use base learners re-fitted on full estimation sample) cvalid cross-validated predicted values. Currently only supported if combined with basexb. ------------------------------------------------------------------------------------------------------------------ Note: Predicted values (in- and out-of-sample) are calculated and stored in Python memory when pystacked is run. predict pulls the predicted values from Python memory and saves them in Stata memory. This means that no changes should be made to the data in Stata memory between the pystacked call and the predict call. If changes to the data set are made, predict will return an error. Stacking Stacking is a way of combining cross-validated predictions from multiple base (\"level-0\") learners into a final prediction. A final estimator (\"level-1\") is used to combine the base predictions. 
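In other words, the final stacking prediction is a weighted combination of the base learners' predictions. As a rough illustration (a sketch, not pystacked's internal code; the variable names and weights below are made up for this example), suppose predict with the basexb cvalid options stores the cross-validated predictions of three base learners as yh1, yh2 and yh3 (one new variable per learner), and suppose the final estimator has assigned them weights 0.2, 0.5 and 0.3. A stacked prediction could then be formed by hand along these lines: . predict double yh, basexb cvalid . gen double yhat_stacking = 0.2*yh1 + 0.5*yh2 + 0.3*yh3 // weights are illustrative; pystacked estimates them for you pystacked's own predictions apply the weights it estimates in the same way, using the base learners re-fitted on the full estimation sample (see the prediction options above). The following paragraphs describe how these weights are chosen.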
The default final estimator for stacking regression is non-negative least squares (NNLS) without an intercept and with the constraint that weights sum to one. Note that in this respect we deviate from the scikit-learn default and follow the recommendation in Breiman (1996) and Hastie et al. (2009, p. 290). The scikit-learn defaults for the final estimator are ridge regression for stacking regression and logistic ridge for classification tasks. To use the scikit-learn default, use finalest(ridge). pystacked also supports ordinary (unconstrained) least squares as the final estimator (finalest(ols)). Finally, singlebest uses the single base learner that exhibits the smallest cross-validated mean squared error. An alternative to stacking is voting. Voting regression uses the weighted average of base learners to form predictions. By default, the unweighted average is used, but the user can specify weights using voteweights(numlist). The voting classifier uses a majority rule by default (hard voting). An alternative is soft voting, where the (weighted) probabilities are used to form the final prediction. Supported base learners The following base learners are supported: Base learners ols Linear regression (regression only) logit Logistic regression (classification only) lassoic Lasso with penalty chosen by AIC/BIC (regression only) lassocv Lasso with cross-validated penalty ridgecv Ridge with cross-validated penalty elasticcv Elastic net with cross-validated penalty svm Support vector machines gradboost Gradient boosting rf Random forest linsvm Linear SVM nnet Neural net The base learners can be chosen using the methods(lassocv gradboost nnet) (Syntax 1) or method(string) options (Syntax 2). Please see the links in the next section for more information on each method. Base learners: Options Options can be passed to the base learners via cmdopt*(string) (Syntax 1) or opt(string) (Syntax 2). The defaults are adopted from scikit-learn. To see the default options of each base learner, simply click on the \"Show options\" links below. To see which alternative settings are allowed, please see the scikit-learn documentation linked below. We strongly recommend that you read the scikit-learn documentation carefully. The option showoptions shows the options passed on to Python. We recommend verifying that options have been passed to Python as intended. 
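For instance (a hedged sketch reusing the Boston housing data from the example further below, with a hypothetical option choice), one might cap the tree depth of the gradient booster, print the options sent to Python with showoptions, and inspect the stored option string afterwards: . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv gradboost) cmdopt3(max_depth(3)) showoptions . di e(pyopt3) The links below point to the scikit-learn documentation for each supported learner.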
Linear regression Methods: ols Type: reg Documentation: linear_model.LinearRegression Show options Logistic regression Methods: logit Type: class Documentation: linear_model.LogisticRegression Show options Penalized regression with information criteria Methods: lassoic Type: reg Documentation: linear_model.LassoLarsIC Show options Penalized regression with cross-validation Methods: lassocv, ridgecv and elasticcv Type: reg Documentation: linear_model.ElasticNetCV Show lasso options Show ridge options Show elastic net options Penalized logistic regression with cross-validation Methods: lassocv, ridgecv and elasticcv Type: class Documentation: linear_model.LogisticRegressionCV Show lasso options Show ridge options Show elastic net options Random forest classifier Method: rf Type: class Documentation: ensemble.RandomForestClassifier Show options Random forest regressor Method: rf Type: reg Documentation: ensemble.RandomForestRegressor Show options Gradient boosted regression trees Method: gradboost Type: reg Documentation: ensemble.GradientBoostingRegressor Show options Linear SVM (SVR) Method: linsvm Type: reg Documentation: svm.LinearSVR Show options SVM (SVR) Method: svm Type: reg Documentation: svm.SVR Show options SVM (SVC) Method: svm Type: class Documentation: svm.SVC Show options Neural net classifier (Multi-layer Perceptron) Method: nnet Type: class Documentation: sklearn.neural_network.MLPClassifier Show options Neural net regressor (Multi-layer Perceptron) Method: nnet Type: reg Documentation: sklearn.neural_network.MLPRegressor Show options Learner-specific predictors By default, pystacked uses the same set of predictors for all base learners. This is often not desirable. For example, when using linear machine learners such as the lasso, it is recommended to create interactions. There are two methods to allow for learner-specific sets of predictors: 1) Pipelines, discussed in the next section, can be used to create polynomials on the fly. 2) The xvars*(varlist) option allows you to specify predictors for a specific learner. If xvars*(varlist) is missing for a specific learner, the default predictor list is used. Pipelines Scikit-learn uses pipelines to pre-process input data on the fly. Pipelines can be used to impute missing observations or to create transformations of predictors such as interactions and polynomials. The following pipelines are currently supported: Pipelines stdscaler StandardScaler() stdscaler0 StandardScaler(with_mean=False) sparse SparseTransformer() onehot OneHotEncoder() minmaxscaler MinMaxScaler() medianimputer SimpleImputer(strategy='median') knnimputer KNNImputer() poly2 PolynomialFeatures(degree=2) poly3 PolynomialFeatures(degree=3) Pipelines can be passed to the base learners via pipe*(string) (Syntax 1) or pipeline(string) (Syntax 2). stdscaler0 is intended for sparse matrices, since stdscaler will make a sparse matrix dense. Example using Boston Housing data (Harrison and Rubinfeld, 1978) Data set The data set is available from the UCI Machine Learning Repository. The following variables are included in the data set of 506 observations: Predictors CRIM per capita crime rate by town ZN proportion of residential land zoned for lots over 25,000 sq.ft. 
INDUS proportion of non-retail business acres per town CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) NOX nitric oxides concentration (parts per 10 million) RM average number of rooms per dwelling AGE proportion of owner-occupied units built prior to 1940 DIS weighted distances to five Boston employment centres RAD index of accessibility to radial highways TAX full-value property-tax rate per $10,000 PTRATIO pupil-teacher ratio by town B 1000(Bk - 0.63)^2 where Bk is the proportion Black by town LSTAT % lower status of the population Outcome MEDV Median value of owner-occupied homes in $1000's Getting started Load housing data. . insheet using https://statalasso.github.io/dta/housing.csv, clear Stacking regression with lasso, random forest and gradient boosting. . pystacked medv crim-lstat, type(regress) pyseed(123) methods(lassocv rf gradboost) The weights determine how much each base learner contributes to the final stacking prediction. Request the root MSPE table (in-sample only): . pystacked, table Re-estimate using the first 400 observations, and request the root MSPE table. RMSPEs for both in-sample (both refitted and cross-validated) and the default holdout sample (all unused observations) are reported.: . pystacked medv crim-lstat if _n\u0026lt;=400, type(regress) pyseed(123) methods(lassocv rf gradboost) . pystacked, table holdout Graph predicted vs actual for the holdout sample: . pystacked, graph holdout Storing the predicted values: . predict double yhat, xb Storing the cross-validated predicted values: . predict double yhat_cv, xb cvalid We can also save the predicted values of each base learner: . predict double yhat, basexb Learner-specific predictors (Syntax 1) pystacked allows the use of different sets of predictors for each base learners. For example, linear estimators might perform better if interactions are provided as inputs. Here, we use interactions and 2nd-order polynomials for the lasso, but not for the other base learners. . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) xvars2(c.(crim-lstat)# #c.(crim-lstat)) The same can be achieved using pipelines which create polynomials on-the-fly in Python. . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe2(poly2) Learner-specific predictors (Syntax 2) We demonstrate the same using the alternative syntax, which is often more handy: . pystacked medv crim-lstat || m(ols) || m(lassocv) xvars(c.(crim-lstat)# #c.(crim-lstat)) || m(rf) || , type(regress) pyseed(123) . pystacked medv crim-lstat || m(ols) || m(lassocv) pipe(poly2) || m(rf) || , type(regress) pyseed(123) Options of base learners (Syntax 1) We can pass options to the base learners using cmdopt*(string). In this example, we change the maximum tree depth for the random forest. Since random forest is the third base learner, we use cmdopt3(max_depth(3)). . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe1(poly2) pipe2(poly2) cmdopt3(max_depth(3)) You can verify that the option has been passed to Python correctly: . di e(pyopt3) Options of base learners (Syntax 2) The same results as above can be achieved using the alternative syntax, which imposes no limit on the number of base learners. . pystacked medv crim-lstat || m(ols) pipe(poly2) || m(lassocv) pipe(poly2) || m(rf) opt(max_depth(3)) , type(regress) pyseed(123) Single base learners You can use pystacked with a single base learner. In this example, we are using a conventional random forest: . 
pystacked medv crim-lstat, type(regress) pyseed(123) methods(rf) Voting You can also use pre-defined weights. Here, we assign weights of 0.5 to OLS, .1 to the lasso and, implicitly, .4 to the random forest. . pystacked medv crim-lstat, type(regress) pyseed(123) methods(ols lassocv rf) pipe1(poly2) pipe2(poly2) voting voteweights(.5 .1) Classification Example using Spam data Data set For demonstration we consider the Spambase Data Set from the UCI Machine Learning Repository. The data includes 4,601 observations and 57 variables. The aim is to predict whether an email is spam (i.e., unsolicited commercial e-mail) or not. Each observation corresponds to one email. Predictors v1-v48 percentage of words in the e-mail that match a specific word, i.e. 100 * (number of times the word appears in the e-mail) divided by total number of words in e-mail. To see which word each predictor corresponds to, see link below. v49-v54 percentage of characters in the e-mail that match a specific character, i.e. 100 * (number of times the character appears in the e-mail) divided by total number of characters in e-mail. To see which character each predictor corresponds to, see link below. v55 average length of uninterrupted sequences of capital letters v56 length of longest uninterrupted sequence of capital letters v57 total number of capital letters in the e-mail Outcome v58 denotes whether the e-mail was considered spam (1) or not (0). For more information about the data see https://archive.ics.uci.edu/ml/datasets/spambase. Load spam data. . insheet using https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data, clear comma Throughout this example, we add the option njobs(4), which enables parallelization with 4 cores. We consider three base learners: logit, random forest and gradient boosting: . pystacked v58 v1-v57, type(class) pyseed(123) methods(logit rf gradboost) njobs(4) pipe1(poly2) Out-of-sample classification. As the data is ordered by outcome, we first shuffle the data randomly. . set seed 42 . gen u = runiform() . sort u Estimation on the first 2000 observations. . pystacked v58 v1-v57 if _n\u0026lt;=2000, type(class) pyseed(123) methods(logit rf gradboost) njobs(4) pipe1(poly2) We can get both the predicted class and the predicted probabilities: . predict spam, class . predict spam_p, pr Confusion matrices, first in-sample only and then both in- and out-of-sample. . pystacked, table . pystacked, table holdout Confusion matrix for a specified holdout sample. . gen h = _n\u0026gt;3000 . pystacked, table holdout(h) ROC curves for the default holdout sample. Specify a subtitle for the combined graph. . pystacked, graph(subtitle(Spam data)) holdout Predicted probabilities (hist option) for the default holdout sample. Specify number of bins for the individual learner graphs. . pystacked, graph hist lgraph(bin(20)) holdout Installation pystacked requires at least Stata 16 (or higher), a Python installation and scikit-learn (0.24 or higher). See this help file, this Stata blog entry and this Youtube video for how to set up Python on your system. Installing Anaconda is in most cases the easiest way of installing Python including all required packages. You can check your scikit-learn version using: . python: import sklearn . python: sklearn.__version__ Updating scikit-learn: If you use Anaconda, update scikit-learn through your Anaconda Python distribution. Make sure that you have linked Stata with the correct Python installation using python query. 
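For example (a minimal sketch; the path below is only a placeholder and will differ on your machine), you can display the Python executable Stata is currently linked to with python query and, if needed, point Stata to a different installation with python set exec: . python query . python set exec /usr/local/bin/python3.9, permanently A restart of Stata may be needed before the change takes effect.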
If you use pip, you can update scikit-learn by typing \"\u0026lt;Python path\u0026gt; -m pip install -U scikit-learn\" into the terminal, or directly in Stata: . shell \u0026lt;Python path\u0026gt; -m pip install -U scikit-learn Note that you might need to restart Stata for changes to your Python installation to take effect. For further information, see https://scikit-learn.org/stable/install.html. To install/update pystacked, type . net install pystacked, from(https://raw.githubusercontent.com/aahrens1/pystacked/main) replace References Breiman, L. (1996). Stacked regressions. Machine Learning, 24(1), 49-64. Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics \u0026amp; Management, 5, 81-102. Hastie, T., Tibshirani, R. and Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer Science \u0026amp; Business Media. Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241-259. https://doi.org/10.1016/S0893-6080(05)80023-1 Contact If you encounter an error, contact us via email. If you have a question, you can also post on Statalist (please tag @Achim Ahrens). Acknowledgements pystacked took some inspiration from Michael Droste's pylearn, which implements other scikit-learn programs for Stata. Thanks to Jan Ditzen for testing an early version of the program. We also thank Brigham Frandsen and Marco Alfano for feedback. All remaining errors are our own. Citation Please also cite scikit-learn; see https://scikit-learn.org/stable/about.html. Authors Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland achim.ahrens@gess.ethz.ch Christian B. Hansen, University of Chicago, USA Christian.Hansen@chicagobooth.edu Mark E Schaffer, Heriot-Watt University, UK m.e.schaffer@hw.ac.uk "},{"id":34,"href":"/docs/lassopack/lasso2_replication/","title":"Comparison glmnet","section":"LASSOPACK","content":" Replication of glmnet and StataCorp\u0026rsquo;s lasso # Use Stata\u0026rsquo;s auto dataset with missing data dropped. The variable price1000 is used to illustrate scaling effects.\n. sysuse auto, clear . drop if rep78==. . gen double price1000 = price/1000 Replication of glmnet # To load the data into R for comparison with glmnet, use the following commands. The packages haven and tidyr need to be installed.\n\u0026gt; auto \u0026lt;- haven::read_dta(\u0026quot;http://www.stata-press.com/data/r9/auto.dta\u0026quot;) \u0026gt; auto \u0026lt;- tidyr::drop_na(auto) \u0026gt; n \u0026lt;- nrow(auto) \u0026gt; price \u0026lt;- auto$price \u0026gt; X \u0026lt;- auto[, c(\u0026quot;mpg\u0026quot;, \u0026quot;rep78\u0026quot;, \u0026quot;headroom\u0026quot;, \u0026quot;trunk\u0026quot;, \u0026quot;weight\u0026quot;, \u0026quot;length\u0026quot;, \u0026quot;turn\u0026quot;, \u0026quot;displacement\u0026quot;, \u0026quot;gear_ratio\u0026quot;, \u0026quot;foreign\u0026quot;)] \u0026gt; X$foreign \u0026lt;- as.integer(X$foreign) \u0026gt; X \u0026lt;- as.matrix(X) Replication of StataCorp\u0026rsquo;s lasso # Replication of StataCorp\u0026rsquo;s lasso and elasticnet requires only the rescaling of lambda by 2N. N=69; so the lasso2 lambda becomes 138000/(2x69) = 1000.\n. lasso2 price mpg-foreign, lambda(138000) . lasso linear price mpg-foreign, grid(1, min(1000)) . lassoselect lambda = 1000 . lassocoef, display(coef, penalized) . lasso2 price mpg-foreign, alpha(0.6) lambda(138000) . elasticnet linear price mpg-foreign, alphas(0.6) grid(1, min(1000)) . lassoselect alpha = 0.6 lambda = 1000 . 
lassocoef, display(coef, penalized) Notes on invariance and objective function # glmnet uses the same definition of the lasso L1 penalty as StataCorp\u0026rsquo;s lasso, so lasso2\u0026rsquo;s default parameterization again requires only rescaling by 2N. When the lglmnet option is used, the L1 penalty should be provided using the glmnet definition. To estimate in R, load glmnet with library(\u0026quot;glmnet\u0026quot;) and use the following command:\n\u0026gt; r\u0026lt;-glmnet(X,price,alpha=1,lambda=1000,thresh=1e-15) To achieve the same results with lasso2:\n. lasso2 price mpg-foreign, lambda(138000) . lasso2 price mpg-foreign, lambda(1000) lglmnet The R code below uses glmnet to estimate an elastic net model. lasso2 with the lglmnet option will replicate it.\n\u0026gt; r\u0026lt;-glmnet(X,price,alpha=0.6,lambda=1000,thresh=1e-15) . lasso2 price mpg-foreign, alpha(0.6) lambda(1000) lglmnet lasso2\u0026rsquo;s default parameterization of the elastic net (like StataCorp\u0026rsquo;s elasticnet) is not invariant to scaling:\n. lasso2 price mpg-foreign, alpha(0.6) lambda(138000) . lasso2 price1000 mpg-foreign, alpha(0.6) lambda(138) When lasso2 uses the glmnet parameterization of the elastic net via the lglmnet option, results are invariant to scaling: the only difference is that the coefficients change by the same factor of proportionality as the dependent variable.\n. lasso2 price mpg-foreign, alpha(0.6) lambda(1000) lglmnet . lasso2 price1000 mpg-foreign, alpha(0.6) lambda(1) lglmnet The reason that the default lasso2 parameterization is (like StataCorp\u0026rsquo;s) not invariant to scaling is that the penalty on the L2 norm is affected by scaling, and this in turn affects the relative weights on the L1 and L2 penalties. The example below shows how to reparameterize so that the default lasso2 parameterization for the elastic net replicates the glmnet parameterization. The example uses the scaling above, where the dependent variable is price1000 and the glmnet lambda=1.\nNote: The large-sample standard deviation of price1000 is equal to 2.8912586.\n. qui sum price1000 . di r(sd) * 1/sqrt( r(N)/(r(N)-1)) The lasso2 alpha = alpha(lglmnet)*SD(y) / (1-alpha(lglmnet) + alpha(lglmnet)*SD(y)). In this example, alpha = 0.81262488.\n. di (0.6*2.8912586)/( 1-0.6 + 0.6*2.8912586) The lasso2 lambda = 2N*lambda(lglmnet) * (alpha(lglmnet) + (1-alpha(lglmnet))/SD(y)). In this example, lambda = 101.89203.\n. di 2*69*( 0.6 + (1-0.6)/2.8912586) lasso2 using the glmnet parameterization and then replicated using the lasso2/StataCorp parameterization:\n. lasso2 price1000 mpg-foreign, alpha(0.6) lambda(1) lglmnet . lasso2 price1000 mpg-foreign, alpha(.81262488) lambda(101.89203) "},{"id":35,"href":"/docs/ddml/installation/","title":"Installation","section":"DDML","content":" Installation # You can get the latest versions from github:\nnet install ddml, from(https://raw.githubusercontent.com/aahrens1/ddml/master) Please check for updates on a regular basis.\nOffline installation # If you want to use ddml in an offline environment, we recommend downloading the packages from the Github repositories. The links to repositories are above; click the green button \u0026ldquo;Code\u0026rdquo; and \u0026ldquo;Download ZIP\u0026rdquo;. 
Then run net install as above but from() should refer to the downloaded and unzipped repository folder.\n"},{"id":36,"href":"/docs/pdslasso/installation/","title":"Installation","section":"PDSLASSO","content":" SSC version # You can get pdslasso from SSC:\nssc install pdslasso Note that pdslasso requires lassopack to be installed.\nAdd replace to overwrite existing versions of the packages.\nGithub installation # Please note that we update the SSC versions less frequently. You can get the latest versions from github:\nnet install pdslasso, /// from(\u0026quot;https://raw.githubusercontent.com/statalasso/pdslasso/master/\u0026quot;) Please check for updates on a regular basis.\nInstalling old versions: # We keep old versions of lassopack and pdslasso on github to facilitate reproducibility. For example, to install version 1.3 of pdslasso, simply use\nnet install pdslasso, /// from(\u0026quot;https://raw.githubusercontent.com/statalasso/pdslasso/master/pdslasso_v13\u0026quot;) Check out our github repository here to see which old versions are available.\nOffline installation # If you want to use pdslasso in an offline environment, we recommend downloading the packages from the Github repositories. The links to repositories are above; click the green button \u0026ldquo;Code\u0026rdquo; and \u0026ldquo;Download ZIP\u0026rdquo;. Then run net install as above but from() should refer to the downloaded and unzipped repository folder.\nVerify installation # To check that the packages were installed correctly, type e.g.\nwhichpkg pdslasso which requires the user-written package whichpkg.\nWe recommend adding this to your log files to facilitate reproducibility.\n"},{"id":37,"href":"/docs/pdslasso/pdslasso_cite/","title":"Citation","section":"PDSLASSO","content":" Citation # pdslasso and ivlasso are not official Stata commands. They are free contributions to the research community, like a paper.\nPlease cite them as such:\nAhrens, A., Hansen, C.B., Schaffer, M.E. 2018. pdslasso and ivlasso: Programs for post-selection and post-regularization OLS or IV estimation and inference. http://ideas.repec.org/c/boc/bocode/s458459.html\nBibtex file\n"},{"id":38,"href":"/docs/lassopack/installation/","title":"Installation","section":"LASSOPACK","content":" SSC version # You can get lassopack from SSC:\nssc install lassopack Add replace to overwrite existing versions of the packages.\nGithub installation # Please note that we update the SSC versions less frequently. You can get the latest versions from github:\nnet install lassopack, /// from(\u0026quot;https://raw.githubusercontent.com/statalasso/lassopack/master/\u0026quot;) Please check for updates on a regular basis.\nInstalling old versions: # We keep old versions of lassopack on github to facilitate reproducibility. For example, to install version 1.2 of lassopack, simply use\nnet install lassopack, /// from(\u0026quot;https://raw.githubusercontent.com/statalasso/lassopack/master/lassopack_v12\u0026quot;) Check out our github repository here to see which old versions are available.\nOffline installation # If you want to use lassopack in an offline environment, we recommend downloading the packages from the Github repositories. The links to repositories are above; click the green button \u0026ldquo;Code\u0026rdquo; and \u0026ldquo;Download ZIP\u0026rdquo;. 
Then run net install as above but from() should refer to the downloaded and unzipped repository folder.\nVerify installation # To check that the packages were installed correctly, type e.g.\nwhichpkg lassopack which requires the user-written package whichpkg.\nWe recommend to add this to your log files to facilitate reproducibility.\n"},{"id":39,"href":"/docs/pystacked/installation/","title":"Installation","section":"PYSTACKED","content":" Installation # You can get the lastest versions from Github:\nnet install pystacked, /// from(https://raw.githubusercontent.com/aahrens1/pystacked/main) replace Please check for updates on a regular basis.\npystacked requires at least Stata 16 (or higher), a Python installation and scikit-learn (0.24 or higher). Python and scikit-learn are available for free. You can also install from SSC, but note that we update the SSC version less regularly:\nssc install pystacked Install old versions # To install an old version:\nnet install pystacked, from(https://raw.githubusercontent.com/aahrens1/pystacked/vXXX) replace where XXX is replaced with one of the archived branches.\nSetting up Python # If you haven\u0026rsquo;t set up Python for Stata, type help python and check this Stata blog post for how to set up Python for Stata on your system.\nIn short, you can either install Python manually (e.g. from www.python.org/) or use a distribution such as Anaconda. Anaconda is in most cases the easier method.\nAfter you have installed Python, you might also need to tell Stata where your Python installation is located. You can do this using python set exec. Note that you will usually have more than one Python installation on your system, since Python is shipped with all common operating systems (yet, usually an old version). python search will show all Python installations Stata can find.\nUpdating scitkit-learn # pystacked requires scikit-learn 0.24 or higher. You can check your scikit-learn version using:\n. python: import sklearn . python: sklearn.__version__ If you use Anaconda, you can use the Anaconda Navigator (or the conda command line tool) to update packages. Otherwise you can use pip (see here).\nFor example, if your Python installation is located in /usr/local/bin/python3.9, you could update scikit-learn by typing\n/usr/local/bin/python3.9 -m pip install -U scikit-learn into the terminal, or directly in Stata (restart required):\n. shell /usr/local/bin/python3.9 -m pip install -U scikit-learn Offline installation # If you want to use pystacked in an offline environment, we recommend to download the packages from the Github repository. Click the green button \u0026ldquo;Code\u0026rdquo; and \u0026ldquo;Download ZIP\u0026rdquo;. Then run net install as above but from() should refer to the downloaded and unzipped repository folder.\nVerify installation # To check that the packages were installed correctly, type e.g.\nwhichpkg pystacked which requires the user-written package whichpkg.\nWe recommend to add this to your log files to facilitate reproducibility.\n"},{"id":40,"href":"/docs/lassopack/lassopack_cite/","title":"Citation","section":"LASSOPACK","content":" Citation # lassopack is not an official Stata package. It is a free contribution to the research community, like a paper.\nPlease cite it as such:\nAhrens, A., Hansen, C.B., Schaffer, M.E. 2018. LASSOPACK: Stata module for lasso, square-root lasso, elastic net, ridge, adaptive lasso estimation and cross-validation. 
http://ideas.repec.org/c/boc/bocode/s458458.html\nBibtex file\nAhrens A, Hansen CB, Schaffer ME (2020). lassopack: Model selection and prediction with regularized regression in Stata. The Stata Journal. 20(1):176-235. doi:10.1177/1536867X20909697\nBibtex file\n"},{"id":41,"href":"/docs/pystacked/citation/","title":"Citation","section":"PYSTACKED","content":" Citation # pystacked is not an official Stata command. It\u0026rsquo;s a free contribution to the research community, like a paper. Please cite it as such.\nTo cite our working paper:\n@misc{https://doi.org/10.48550/arxiv.2208.10896, doi = {10.48550/ARXIV.2208.10896}, url = {https://arxiv.org/abs/2208.10896}, author = {Ahrens, Achim and Hansen, Christian B. and Schaffer, Mark E.}, keywords = {Econometrics (econ.EM), Machine Learning (stat.ML), FOS: Economics and business, FOS: Economics and business, FOS: Computer and information sciences, FOS: Computer and information sciences}, title = {pystacked: Stacking generalization and machine learning in Stata}, publisher = {arXiv}, year = {2022}, copyright = {Creative Commons Attribution Share Alike 4.0 International} } Also cite scikit-learn as explained here:\n@article{scikit-learn, title={Scikit-learn: Machine Learning in {P}ython}, author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.}, journal={Journal of Machine Learning Research}, volume={12}, pages={2825--2830}, year={2011} } "},{"id":42,"href":"/docs/lassopack/lassologit/lassologit_demo/","title":"Example using Spam data","section":"Lassologit","content":" Logistic Lasso: Spam data # For demonstration we consider the Spambase Data Set from the Machine Learning Repository. The data set includes 4,601 observations and 57 variables. The aim is to predict if an email is spam (i.e., unsolicited commercial e-mail) or not. Each observation corresponds to one email.\nPredictors v1-v48 percentage of words in the e-mail that match a specific word, i.e. 100 * (number of times the word appears in the e-mail) divided by total number of words in e-mail. To see which word each predictor corresponds to, see link below. v49-v54 percentage of characters in the e-mail that match a specific character, i.e. 100 * (number of times the character appears in the e-mail) divided by total number of characters in e-mail. To see which character each predictor corresponds to, see link below. v55 average length of uninterrupted sequences of capital letters v56 length of longest uninterrupted sequence of capital letters v57 total number of capital letters in the e-mail Outcome v58 denotes whether the e-mail was considered spam (1) or not (0). . insheet using https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data, clear comma Introduction to lassologit # The basic syntax for lassologit is to specify the dependent variable followed by a list of predictors:\n. lassologit v58 v1-v57 The output of lassologit shows the penalty levels (lambda), the number of predictors included (s), the \\(\\ell_1\\) norm, one information criterion ( \\(EBIC\\) by default), McFadden\u0026rsquo;s Pseudo- \\(R^2\\) and, in the last column, which predictors are included/removed from the model.\nBy default, one line per knot is shown. 
Knots are points at which predictors enter or leave the model.\nTo obtain the logistic lasso estimate for a user-specified scalar lambda or a list of lambdas, the lambda(numlist) option can be used. Note that output and the objects stored in e() depend on whether lambda is only one value or a list of more than one value.\nInformation criteria # To estimate the model selected by one of the information criteria, use the lic() option:\n. lassologit v58 v1-v57 . lassologit, lic(ebic) . lassologit, lic(aicc) In the above example, we use the replay syntax that works similar to a post-estimation command. lassologit reports the logistic lasso estimates and the post-logit estimates (from applying logit estimation to the model selected by the logitistic lasso) for the value of lambda selected by the specified information criterion.\nNB: lic() does not change the estimation results in memory. The advantage is that this way lic() can be used multiple times to compare results without that we need to re-estimate the model.\nTo store the model selected by one of the information criteria, add postresults:\n. lassologit, lic(ebic) postresults Cross-validation with cvlassologit # cvlassologit implements \\(K\\) -fold cross-validation where the data is by default randomly partitioned.\nHere, we use \\(K=3\\) and seed(123) to set the seed for reproducibility. (Be patient, this takes a minute.)\n. cvlassologit v58 v1-v57, nfolds(3) seed(123) The output shows the prediction performance measured by deviance for each \\(\\lambda\\) value. To estimate the model selected by cross-validation we can specify lopt or lse using the replay syntax.\n. cvlassologit, lopt . cvlassologit, lse Rigorous penalization with rlassologit # Lastly, we consider the logistic lasso with rigorous penalization:\n. rlassologit v58 v1-v57 rlassologit displays the logistic lasso solution and the post-logit solution.\nThe rigorous lambda is returned in e(lambda) and, in this example, is equal to 79.207801.\n. di e(lambda) We get the same result when specifying the rigorous lambda manually using the lambda() option of lassologit:\n. lassologit v58 v1-v57, lambda(79.207801) Prediction # After selecting a model, we can use predict to obtain predicted probabilities or linear predictions.\nFirst, we select a model using lic() in combination with postresults as above:\n. lassologit v58 v1-v57 . lassologit, lic(ebic) postresults Then, we use predict:\n. predict double phat, pr . predict double xbhat, xb pr saves the predicted probability of success and xb saves the linear predicted values.\nNote that the use of postresults is required. Without postresults the results of the estimation with the selected penalty level are not stored.\nThe approach for cvlassologit is very similar:\n. cvlassologit v58 v1-v57 . cvlassologit, lopt postresults . predict double phat, pr In the case of rlassologit, we don\u0026rsquo;t need to select a specific penalty level and we also don\u0026rsquo;t need to specify postresults.\n. rlassologit v58 v1-v57 . predict double phat, pr Assessing prediction accuracy with holdout() # We can leave one partition of the data out of the estimation sample and check the accuracy of prediction using the holdout(varname) option.\nWe first define a binary holdout variable:\n. gen myholdout = (_n\u0026gt;4500) There are 4,601 observations in the sample, and we exclude observations 4,501 to 4,601 from the estimation. These observations are used to assess classification accuracy. 
The holdout variable should be set to 1 for all observations that we want to use for assessing classification accuracy.\n. lassologit v58 v1-v57, holdout(myholdout) . mat list e(loss) . rlassologit v58 v1-v57, holdout(myholdout) . mat list e(loss) The loss measure is returned in e(loss). As with cross-validation, deviance is used by default. lossmeasure(class) will return the average number of misclassifications. Plotting with lassologit # lassologit supports plotting of the coefficient path over \(\lambda\). Here, we create the plot using the replay syntax, but the same can be achieved in one line:\n. lassologit v58 v1-v57 . lassologit, plotpath(lambda) plotvar(v1-v5) plotlabel plotopt(legend(off)) In the above example, we use the following settings: plotpath(lambda) plots estimates against lambda. plotvar(v1-v5) restricts the set of variables plotted to v1-v5 (to avoid cluttering the graph). plotlabel puts variable labels next to the lines. plotopt(legend(off)) turns the legend off.\nPlotting with cvlassologit # The plotcv option creates a graph of the estimated loss as a function of lambda:\n. cvlassologit v58 v1-v57, nfolds(3) seed(123) . cvlassologit v58 v1-v57, plotcv The vertical solid red line indicates the value of lambda that minimizes the loss function. The dashed red line corresponds to the largest lambda for which the loss is within one standard error of the minimum loss. More # More information can be found in the help file:\nhelp lassologit "}]