{smcl}
{* *! version 0.0.3 01mar2024}{...}
{vieweralsosee "[R] predict" "mansection R predict"}{...}
{vieweralsosee "[R] estat classification" "mansection R estat_classification"}{...}
{vieweralsosee "[P] creturn" "mansection P creturn"}{...}
{vieweralsosee "" "--"}{...}
{viewerjumpto "Overview" "crossvalidate##overview"}{...}
{viewerjumpto "Commands" "crossvalidate##cmds"}{...}
{viewerjumpto "Additional Information" "crossvalidate##additional"}{...}
{viewerjumpto "Contact" "crossvalidate##contact"}{...}
{title:Cross-Validation in Stata}

{marker overview}{title:Overview}

{pstd}
The crossvalidate package includes several commands and a Mata library that 
provide a range of cross-validation techniques that can be used with 
any {help program:eclass} Stata estimation command.  For the majority of users, 
the prefix commands (see {help xv} or {help xvloo}) should meet your 
needs.  On what we believe will be rare occasions, a user may need 
a bit more control over the process.  In those cases, the lower-level commands 
provide a way for users to avoid programming the entire cross-validation process 
while retaining the benefits that these commands provide.

{pstd}
{bf:IMPORTANT!!!} If you intend to use only the lower-level commands, you will 
need to call {help libxv} first.  This compiles the Mata source code into libxv 
on your machine.  If you are using either of the prefix commands {help xv} or 
{help xvloo}, they will handle this step for you.  However, if you intend to use 
metric functions that you define yourself, you should call {help libxv} first, 
then define your function, and then call {help xv} or {help xvloo}.  The order 
matters because the contents of Mata are cleared before the library is compiled, 
which ensures that libxv contains only the functions that should be included in 
the library.  A function defined before that point would be cleared as well.

{pstd}
This help file provides an overview of the commands included in the crossvalidate 
package.  We leave detailed information to the documentation for each of the 
individual commands.

{marker cmds}{title:Commands}

{synoptset 15 tabbed}{...}
{synoptline}
{synopthdr:Command Name}
{synoptline}
{syntab:Prefix Commands}
{synopt :{opt {help xv}}}Cross-Validation{p_end}
{synopt :{opt {help xvloo}}}Leave-One-Out Cross-Validation{p_end}
{syntab:Lower Level Commands}
{synopt :{opt {help splitit}}}Splits the dataset into train/test or train/validation/test splits{p_end}
{synopt :{opt {help fitit}}}Calls the estimation command on the appropriate split{p_end}
{synopt :{opt {help predictit}}}Predicts the outcome on the appropriate split{p_end}
{synopt :{opt {help validateit}}}Computes the validation/test metric of your choosing{p_end}
{syntab:Utility Commands}
{synopt :{opt {help classify}}}Used to manage predicted class identifiers for classification models{p_end}
{synopt :{opt {help cmdmod}}}Used for metaprogramming tasks in commands above{p_end}
{synopt :{opt {help state}}}Retrieves current settings and binds to the dataset{p_end}
{synoptline}

{dlgtab:Prefix Commands}

{phang}
{help xv} is a prefix command that should address the majority of use cases for 
cross-validation.  Use the prefix and provide the required arguments, then write 
the estimation command you would use to fit your model under normal circumstances.  
The command will handle splitting the data, fitting the model to the appropriate 
subsets of data, generating the predicted values, and computing the quantities 
of interest that describe the quality of the results.  You can create simple 
train/test and train/validation/test splits with or without K-Folds, using 
simple random sampling or clustered sampling (including sampling of panel units).
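
{pstd}
As a rough illustration of the steps the prefix automates, the following sketch 
performs a simple train/test split by hand using only built-in Stata commands.  
It does not use the syntax of {help xv}.  The variable names (split, pred, 
sqerr), the 80/20 proportion, the seed, and the choice of mean squared error are 
all assumptions made for the example.

{phang2}{cmd:. sysuse auto, clear}{p_end}
{phang2}{cmd:. set seed 12345}{p_end}
{phang2}{cmd:. generate byte split = cond(runiform() < 0.8, 1, 2)}{p_end}
{phang2}{cmd:. regress price mpg weight if split == 1}{p_end}
{phang2}{cmd:. predict double pred if split == 2}{p_end}
{phang2}{cmd:. generate double sqerr = (price - pred)^2}{p_end}
{phang2}{cmd:. summarize sqerr if split == 2, meanonly}{p_end}
{phang2}{cmd:. display "Test MSE = " r(mean)}{p_end}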

{phang}
{help xvloo} is also a prefix command but is used to perform leave-one-out (LOO) 
cross-validation.  LOO can be thought of as a special case of K-Fold 
cross-validation where K is equal to the number of observations, or clusters, in 
the training set; another way to think of this is using a jackknife for 
cross-validation.  Because the model must be fit once for each observation or 
cluster, we strongly recommend using this command only when 
working with smaller sample sizes.  Additionally, if the number of observations 
in your dataset plus the number of variables in the dataset plus 2 is greater 
than the number of variables your version of Stata can support, you will not be 
able to use this prefix.
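
{pstd}
One way to check the limit described above before calling the prefix is to 
compare the relevant {help creturn} values.  This is a sketch that assumes the 
dataset of interest is already in memory and that c(maxvar) reflects the limit 
that applies to your session.  The first command displays the quantity described 
above, the second displays the current maximum number of variables, and the 
third displays 1 if the limit would be exceeded and 0 otherwise.

{phang2}{cmd:. display c(N) + c(k) + 2}{p_end}
{phang2}{cmd:. display c(maxvar)}{p_end}
{phang2}{cmd:. display (c(N) + c(k) + 2) > c(maxvar)}{p_end}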

{dlgtab:Lower Level Commands}

{phang}
{help splitit} is a command called by the prefix commands to create the splits 
in the data in memory.  As mentioned above, you can create train/test and 
train/validation/test splits with or without K-Folds, using simple random 
sampling or clustered sampling (which includes sampling panel units).  This 
command generates a new variable that identifies the splits in the dataset; that 
variable must be passed to the subsequent commands below.
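
{pstd}
To make the idea of a split identifier concrete, the sketch below builds a 
five-fold identifier by hand with built-in Stata commands and simple random 
sampling.  It is not the algorithm used by {help splitit}.  The variable names 
(u, fold), the seed, and the number of folds are assumptions made for the example.

{phang2}{cmd:. sysuse auto, clear}{p_end}
{phang2}{cmd:. set seed 12345}{p_end}
{phang2}{cmd:. generate double u = runiform()}{p_end}
{phang2}{cmd:. sort u}{p_end}
{phang2}{cmd:. generate byte fold = mod(_n - 1, 5) + 1}{p_end}
{phang2}{cmd:. tabulate fold}{p_end}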

{phang}
{help fitit} is a command called by the prefix commands to update and execute 
the user-supplied estimation command.  The "update" made by this command is the 
insertion, or modification, of an if expression that is used to ensure that the 
estimation command you passed (either as an argument to this command or via the 
prefix) is executed for the subset of data you intended.  When used with K-Fold 
cross-validation this command will also fit the model to the entire training set 
in addition to each of the K-Folds, unless you tell it otherwise.

{phang}
{help predictit} is a command called by the prefix commands to manage and 
generate the predicted values based on the previously fitted model.  In the case 
of K-Fold cross-validation, it ensures all the predicted values are stored in a 
single variable with an appropriate storage type (double precision for continuous 
outcomes and byte for categorical outcomes).  Like {help fitit}, this command 
will also generate predictions based on the model fitted to the entire training 
set when using K-Fold cross-validation unless you tell it otherwise.

{phang}
{help validateit} is the last command called by the prefix commands and is used 
to compute the validation/test metric of your choosing.  We've included a 
selection of metrics in the Mata library distributed with this package and they 
are listed in the help file for {help validateit}.  Additionally, if there is a 
validation metric that we have not implemented, you may be able to use it by 
defining a Mata function that follows our function signature requirements and 
passing the name of that function to the appropriate option.

{dlgtab:Utility Commands}

{phang}
{help classify} is a utility called by the {help predictit} command when fitting 
classification models.  This utility ensures that class identifiers are returned 
as the predicted values for binomial, multinomial, and ordinal outcomes.
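
{pstd}
For a binary outcome, the idea of returning class identifiers instead of 
predicted probabilities can be sketched with built-in Stata commands as shown 
below.  This is not necessarily how {help classify} is implemented.  The 0.5 
cutoff and the variable names (pr, predclass) are assumptions made for the 
example.

{phang2}{cmd:. sysuse auto, clear}{p_end}
{phang2}{cmd:. logit foreign mpg weight}{p_end}
{phang2}{cmd:. predict double pr, pr}{p_end}
{phang2}{cmd:. generate byte predclass = pr >= 0.5 if !missing(pr)}{p_end}
{phang2}{cmd:. tabulate foreign predclass}{p_end}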

{phang}
{help cmdmod} is a utility called by {help fitit} and possibly {help predictit} 
to create the updated estimation command string and if expression for prediction.

{phang}
{help state} is a utility called by the {help xv} and {help xvloo} commands, 
when requested via an option, to bind information about the current state of the 
computer and pseudo-random number generator to the dataset.


{marker additional}{...}
{title:Additional Information}
{p 4 4 8}If you have questions, comments, or find bugs, please submit an issue in the {browse "https://github.com/wbuchanan/crossvalidate":crossvalidate GitHub repository}.{p_end}


{marker contact}{...}
{title:Contact}
{p 4 4 8}William R. Buchanan, Ph.D.{p_end}
{p 4 4 8}Sr. Research Scientist, SAG Corporation{p_end}
{p 4 4 8}{browse "https://www.sagcorp.com":SAG Corporation}{p_end}
{p 4 4 8}wbuchanan at sagcorp [dot] com{p_end}

{p 4 4 8}Steven D. Brownell, Ph.D.{p_end}
{p 4 4 8}Economist, SAG Corporation{p_end}
{p 4 4 8}{browse "https://www.sagcorp.com":SAG Corporation}{p_end}
{p 4 4 8}sbrownell at sagcorp [dot] com{p_end}