-
Notifications
You must be signed in to change notification settings - Fork 1
/
splitit.sthlp
145 lines (120 loc) · 6.03 KB
/
splitit.sthlp
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
{smcl}
{* *! version 0.0.7 06mar2024}{...}
{viewerjumpto "Syntax" "splitit##syntax"}{...}
{viewerjumpto "Description" "splitit##description"}{...}
{viewerjumpto "Options" "splitit##options"}{...}
{viewerjumpto "Examples" "splitit##examples"}{...}
{viewerjumpto "Returned Values" "splitit##retvals"}{...}
{viewerjumpto "Additional Information" "splitit##additional"}{...}
{viewerjumpto "Contact" "splitit##contact"}{...}
{title:Dataset Splitting and Folding for Cross-Validation in Stata}
{marker syntax}{...}
{title:Syntax}
{p 8 32 2}
{cmd:splitit} # [#] {ifin} [{cmd:,} {cmdab:u:id(}{it:varlist}{cmd:)}
{cmdab:tp:oint(}{it:string asis}{cmd:)} {cmdab:k:fold(}{it:integer}{cmd:)}
{cmdab:spl:it(}{it:string asis}{cmd:)} {cmd:loo}]{p_end}
{synoptset 15 tabbed}{...}
{synoptline}
{synopthdr}
{synoptline}
{syntab:Arguments}
{synopt :{opt #}}The proportion of the data set to allocate to the training set.{p_end}
{synopt :{it:{opt [#]}}}The proportion of the data set to allocate to the validation set.{p_end}
{syntab:Options}
{synopt :{opt u:id}}a variable list for clustered sampling/splitting{p_end}
{synopt :{opt tp:oint}}a numeric, td(), tc(), or tC() value{p_end}
{synopt :{opt k:fold}}the number of K-Folds to create in the training set; default is {cmd:kfold(1)}{p_end}
{synopt :{opt spl:it}}a new variable name; default is {cmd:split(_xvsplit)}{p_end}
{synopt :{opt loo}}is used only for Leave-One-Out cross-validation{p_end}
{synoptline}
{marker description}{...}
{title:Description}
INCLUDE help xvphase-split
{marker options}{...}
{title:Options}
{phang}
{opt u:id} accepts a variable list for clustered sampling/splitting. When an
argument is passed to this parameter entire clusters will be split into the
respective training and validation and/or training sets. When this option is
used with {opt tp:oint} for {help xtset} data, the panel variable must be nested
within the clusters defined by {opt u:id}.
{phang}
{hi:IMPORTANT!!!} the order of the {help varlist} passed to {opt u:id} is
assumed to follow the hierarchy of the nesting in the data. Ensure that the
{help varlist} passed to this option follows the same convention as used with
commands like {help mixed}.
{phang}
{opt tpoint} a time point delimiting the training split from it's corresponding
forecastting split. This can also be accomplished by passing the appropriate if
expression in your estimation command. Use of this option will result in an
additional variable with the suffix {it:xv4} being created to identify the
forecasting set associated with each split/K-Fold. This is to ensure that
the forecasting period data will not affect the model training.
{phang}
{opt k:fold} is used to specify the number of K-Folds to create in the training
set.
{phang}
{opt spl:it} is used to specify the name of a new variable that will store the
identifiers for the splits in the data. If no argument is passed to this option
{cmd splitit} will store the result in a variable named {it:_xvsplit}.
{phang}
{opt loo} is an option used to alter the underlying logic used to compute the
validation metrics/monitors. Since Leave-One-Out cross-validation is a special
case of K-Fold CV the model fitting and predictions are generated in a manner
consistent with all other K-Fold cases. However, with only a single unit in
each K-Fold in the LOO case computing validation metrics requires different
treatment. In the LOO case the validation metric/monitors are computed using
the predicted and observed values for all of the training set units in aggregate.
{marker examples}{...}
{title:Examples}
{p 4 4 2}Load example dataset{p_end}
{p 8 4 2}{stata sysuse auto, clear}{p_end}
{p 4 4 2}Simple Random Sampling{p_end}
{p 6 4 2}Test/Train Split{p_end}
{p 8 4 2}{stata splitit .8, ret(splitvar)}{p_end}
{p 6 4 2}Train/Validation/Test Split{p_end}
{p 8 4 2}{stata splitit .6 .2, ret(splitvar)}{p_end}
{p 4 4 2}K-Fold Simple Random Sampling{p_end}
{p 6 4 2}Test/Train Split{p_end}
{p 8 4 2}{stata splitit .8, ret(splitvar) k(5)}{p_end}
{p 6 4 2}Train/Validation/Test Split{p_end}
{p 8 4 2}{stata splitit .6 .2, ret(splitvar) k(5)}{p_end}
{p 4 4 2}Clustered Random Sampling{p_end}
{p 6 4 2}Test/Train Split{p_end}
{p 8 4 2}{stata splitit .8, ret(splitvar) uid(foreign)}{p_end}
{p 6 4 2}Train/Validation/Test Split{p_end}
{p 8 4 2}{stata splitit .6 .2, ret(splitvar) uid(foreign)}{p_end}
{p 4 4 2}K-Fold Clustered Random Sampling{p_end}
{p 6 4 2}Test/Train Split{p_end}
{p 8 4 2}{stata splitit .8, ret(splitvar) k(5) uid(foreign)}{p_end}
{p 6 4 2}Train/Validation/Test Split{p_end}
{p 8 4 2}{stata splitit .6 .2, ret(splitvar) k(5) uid(foreign)}{p_end}
{marker retvals}{...}
{title:Returned Values}
{p 4 4 8}The following lists the names of the r-macros and their contents.{p_end}
{synoptset 25 tabbed}{...}
{synoptline}
{synopthdr}
{synoptline}
{synopt :{cmd:r(stype)}}the split method{p_end}
{synopt :{cmd:r(flavor)}}the sampling method{p_end}
{synopt :{cmd:r(splitter)}}the variable containing the sample split identifiers{p_end}
{synopt :{cmd:r(forecastset)}}the variable containing the sample split identifiers for the forecast sample{p_end}
{synopt :{cmd:r(training)}}the value(s) of the splitter variable that identify the training set(s){p_end}
{synopt :{cmd:r(validation)}}the value of the splitter variable that identifies the validation set{p_end}
{synopt :{cmd:r(testing)}}the value of the splitter variable that identifies the test set{p_end}
{synoptline}
{marker additional}{...}
{title:Additional Information}
{p 4 4 8}If you have questions, comments, or find bugs, please submit an issue in the {browse "https://github.com/wbuchanan/crossvalidate":crossvalidate GitHub repository}.{p_end}
{marker contact}{...}
{title:Contact}
{p 4 4 8}William R. Buchanan, Ph.D.{p_end}
{p 4 4 8}Sr. Research Scientist, SAG Corporation{p_end}
{p 4 4 8}{browse "https://www.sagcorp.com":SAG Corporation}{p_end}
{p 4 4 8}wbuchanan at sagcorp [dot] com{p_end}
{p 4 4 8}Steven D. Brownell, Ph.D.{p_end}
{p 4 4 8}Economist, SAG Corporation{p_end}
{p 4 4 8}{browse "https://www.sagcorp.com":SAG Corporation}{p_end}
{p 4 4 8}sbrownell at sagcorp [dot] com{p_end}