DESCRIPTION, Readme, and vignettes updated
sokbae committed Nov 6, 2023
1 parent 8fe1c41 commit 1a8aa7f
Showing 16 changed files with 314 additions and 33 deletions.
Binary file modified .DS_Store
13 changes: 8 additions & 5 deletions DESCRIPTION
@@ -1,17 +1,20 @@
Package: SGDinference
Type: Package
Title: Inference with Stochastic (sub-)Gradient Descent
Version: 0.0.1
Version: 0.1.0
Authors@R: c(
person("Sokbae", "Lee", email = "sl3841@columbia.edu", role = "aut"),
person("Yuan", "Liao", email = "yuan.liao@rutgers.edu", role = "aut"),
person("Myung Hwan", "Seo", email = "myunghseo@snu.ac.kr", role = "aut"),
person("Youngki", "Shin", email = "shiny11@mcmaster.ca", role = c("aut", "cre")))
Description: The package provides estimation and inference methods for large-scale mean and quantile regression models via stochastic (sub-)gradient descent (S-subGD) algorithms.
The inference procedure handles cross-sectional data sequentially:
(i) updating the parameter estimate with each incoming "new observation",
(ii) aggregating it as a Polyak-Ruppert average, and
(iii) computing an asympotically pivotal statistic for inference through random scaling.
The inference procedure handles cross-sectional data sequentially:
(i) updating the parameter estimate with each incoming "new observation",
(ii) aggregating it as a Polyak-Ruppert average, and
(iii) computing an asymptotically pivotal statistic for inference through random scaling.
The methodology used in the SGDinference package is described in detail in the following papers:
(i) Lee, S., Liao, Y., Seo, M.H. and Shin, Y., 2022. Fast and robust online inference with stochastic gradient descent via random scaling. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 7, pp. 7381-7389). <https://doi.org/10.1609/aaai.v36i7.20701>.
(ii) Lee, S., Liao, Y., Seo, M.H. and Shin, Y., 2023. Fast Inference for Quantile Regression with Tens of Millions of Observations. arXiv:2209.14502 [econ.EM] <https://doi.org/10.48550/arXiv.2209.14502>.
License: GPL-3
Imports:
stats,
2 changes: 1 addition & 1 deletion R/Census2000.R
@@ -2,7 +2,7 @@
#'
#' The Census2000 dataset
#' from Acemoglu and Autor (2011) consists of observations on 26,120 nonwhite, female workers.
#' This small dataset is contructed from "microwage2000_ext.dta" at
#' This small dataset is constructed from "microwage2000_ext.dta" at
#' \url{https://economics.mit.edu/people/faculty/david-h-autor/data-archive}.
#' Specifically, observations are dropped if hourly wages are missing or
#' years of education are smaller than 6.
5 changes: 3 additions & 2 deletions R/SGDinference.R
@@ -1,9 +1,10 @@
#' SGDinference
#'
#' The SGDinference package provides estimation and inference methods for large-scale mean and quantile regression models via stochastic (sub-)gradient descent (S-subGD) algorithms. The inference procedure handles cross-sectional data sequentially:
#' The SGDinference package provides estimation and inference methods for large-scale mean and quantile regression models via stochastic (sub-)gradient descent (S-subGD) algorithms.
#' The inference procedure handles cross-sectional data sequentially:
#' (i) updating the parameter estimate with each incoming "new observation",
#' (ii) aggregating it as a Polyak-Ruppert average, and
#' (iii) computing an asympotically pivotal statistic for inference through random scaling.
#' (iii) computing an asymptotically pivotal statistic for inference through random scaling.
#'
#' @docType package
#' @author Sokbae Lee, Yuan Liao, Myung Hwan Seo, Youngki Shin
2 changes: 1 addition & 1 deletion R/sgd_qr.R
@@ -4,7 +4,7 @@
#'
#' @param formula formula. The response is on the left of a ~ operator. The terms are on the right of a ~ operator, separated by a + operator.
#' @param data an optional data frame containing variables in the model.
#' @param gamma_0 numeric. A tuning parameter for the learning rate (gamma_0 x t ^ alpha). Default is NULL and it is determined by the adaptive method in Chet et al. (2023).
#' @param gamma_0 numeric. A tuning parameter for the learning rate (gamma_0 x t ^ alpha). Default is NULL and it is determined by the adaptive method in Lee et al. (2023).
#' @param alpha numeric. A tuning parameter for the learning rate (gamma_0 x t ^ alpha). Default is 0.501.
#' @param burn numeric. A tuning parameter for "burn-in" observations.
#' We burn-in up to (burn-1) observations and use observations from (burn) for estimation. Default is 1, i.e. no burn-in.
2 changes: 1 addition & 1 deletion R/sgdi_qr.R
@@ -4,7 +4,7 @@
#'
#' @param formula formula. The response is on the left of a ~ operator. The terms are on the right of a ~ operator, separated by a + operator.
#' @param data an optional data frame containing variables in the model.
#' @param gamma_0 numeric. A tuning parameter for the learning rate (gamma_0 x t ^ alpha). Default is NULL and it is determined by the adaptive method in Chet et al. (2023).
#' @param gamma_0 numeric. A tuning parameter for the learning rate (gamma_0 x t ^ alpha). Default is NULL and it is determined by the adaptive method in Lee et al. (2023).
#' @param alpha numeric. A tuning parameter for the learning rate (gamma_0 x t ^ alpha). Default is 0.501.
#' @param burn numeric. A tuning parameter for "burn-in" observations.
#' We burn-in up to (burn-1) observations and use observations from (burn) for estimation. Default is 1, i.e. no burn-in.
102 changes: 101 additions & 1 deletion README.Rmd
@@ -24,7 +24,7 @@ __SGDinference__ is an R package that provides estimation and inference methods

(i) updating the parameter estimate with each incoming "new observation",
(ii) aggregating it as a Polyak-Ruppert average, and
(iii) computing an asympotically pivotal statistic for inference through random scaling.
(iii) computing an asymptotically pivotal statistic for inference through random scaling.

The methodology used in the SGDinference package is described in detail in the following papers:

@@ -41,3 +41,103 @@ You can install the development version from [GitHub](https://github.com/) with
# install.packages("devtools") # if you have not installed "devtools" package
devtools::install_github("SGDinference-Lab/SGDinference")
```

We begin by calling the SGDinference package.

```{r setup}
library(SGDinference)
```


## Case Study: Estimating the Mincer Equation

To illustrate the usefulness of the package, we use a small dataset included in the package.
Specifically, the _Census2000_ dataset from Acemoglu and Autor (2011) consists of observations on 26,120 nonwhite, female workers. This small dataset is constructed from "microwage2000_ext.dta" at
<https://economics.mit.edu/people/faculty/david-h-autor/data-archive>.
Observations are dropped if hourly wages are missing or years of education are smaller than 6.
Then, a 5 percent random sample is drawn to make the dataset small.
The following three variables are included:

- ln_hrwage: log hourly wages
- edyrs: years of education
- exp: years of potential experience

We now define the variables.

```{r}
y = Census2000$ln_hrwage
edu = Census2000$edyrs
exp = Census2000$exp
exp2 = exp^2/100
```

As a benchmark, we first estimate the Mincer equation and report the point estimates and their 95% heteroskedasticity-robust confidence intervals.

```{r}
mincer = lm(y ~ edu + exp + exp2)
inference = lmtest::coefci(mincer, df = Inf,
                           vcov = sandwich::vcovHC)
results = cbind(mincer$coefficients,inference)
colnames(results)[1] = "estimate"
print(results)
```


### Estimating the Mean Regression Model Using SGD

We now estimate the same model using SGD.

```{r}
mincer_sgd = sgdi_lm(y ~ edu + exp + exp2)
print(mincer_sgd)
```
It can be seen that the estimation results are similar between the two methods.
There is a separate command that computes only the estimates, without confidence intervals.

```{r}
mincer_sgd = sgd_lm(y ~ edu + exp + exp2)
print(mincer_sgd)
```

We compare the execution times between the two versions and find that there is not much difference in this simple example. By construction, it takes more time to conduct inference via `sgdi_lm`.

```{r}
library(microbenchmark)
res <- microbenchmark(sgd_lm(y ~ edu + exp + exp2),
                      sgdi_lm(y ~ edu + exp + exp2),
                      times=100L)
print(res)
```
To plot the SGD path, we first construct an SGD path for the return-to-education coefficient.
```{r}
mincer_sgd_path = sgdi_lm(y ~ edu + exp + exp2, path = TRUE, path_index = 2)
```

Then, we can plot the SGD path.

```{r}
plot(mincer_sgd_path$path_coefficients, ylab="Return to Education", xlab="Steps")
```

To observe the initial part of the path, we now plot only the first 2,000 steps.

```{r}
plot(mincer_sgd_path$path_coefficients[1:2000], ylab="Return to Education", xlab="Steps")
print(c("2000th step", mincer_sgd_path$path_coefficients[2000]))
print(c("Final Estimate", mincer_sgd_path$coefficients[2]))
```

It can be seen that the SGD path has almost converged after only 2,000 steps, less than 10% of the sample size.

## What else the package can do

See the vignette for the quantile regression example.

# References

Acemoglu, D. and Autor, D., 2011. Skills, tasks and technologies: Implications for employment and earnings. In _Handbook of labor economics_ (Vol. 4, pp. 1043-1171). Elsevier.

Lee, S., Liao, Y., Seo, M.H. and Shin, Y., 2022. Fast and robust online inference with stochastic gradient descent via random scaling. In _Proceedings of the AAAI Conference on Artificial Intelligence_ (Vol. 36, No. 7, pp. 7381-7389).
<https://doi.org/10.1609/aaai.v36i7.20701>.

Lee, S., Liao, Y., Seo, M.H. and Shin, Y., 2023. Fast Inference for Quantile Regression with Tens of Millions of Observations. arXiv:2209.14502 [econ.EM] <https://doi.org/10.48550/arXiv.2209.14502>.
163 changes: 162 additions & 1 deletion README.md
@@ -17,7 +17,7 @@ procedure handles cross-sectional data sequentially:
1) updating the parameter estimate with each incoming “new
observation”,
2) aggregating it as a Polyak-Ruppert average, and
3) computing an asympotically pivotal statistic for inference through
3) computing an asymptotically pivotal statistic for inference through
random scaling.
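For intuition, the three steps above can be sketched in a few lines of code. The following Python snippet is an illustrative sketch only: the function name `sgd_lm_rs`, the choice `gamma_0 = 0.5`, and the use of 6.747 as the 95% critical value of the random-scaling pivotal statistic are assumptions made here for illustration, not the package's internals.

```python
import numpy as np

def sgd_lm_rs(x, y, gamma_0=0.5, alpha=0.501, crit=6.747):
    """Sketch of SGD mean regression with Polyak-Ruppert averaging and
    random-scaling inference. Illustrative only, not the package's code."""
    n, d = x.shape
    beta = np.zeros(d)        # current SGD iterate
    s = np.zeros(d)           # running sum of iterates
    A = np.zeros((d, d))      # accumulates t^2 * bar_t bar_t'
    b = np.zeros(d)           # accumulates t^2 * bar_t
    c = 0.0                   # accumulates t^2
    for t in range(1, n + 1):
        lr = gamma_0 * t ** (-alpha)                    # diminishing learning rate
        grad = (x[t - 1] @ beta - y[t - 1]) * x[t - 1]  # squared-loss gradient
        beta = beta - lr * grad                 # (i) update with the new observation
        s += beta
        bar = s / t                             # (ii) Polyak-Ruppert average
        A += t * t * np.outer(bar, bar)
        b += t * t * bar
        c += t * t
    # (iii) random scaling: V = n^-2 * sum_t t^2 (bar_t - bar_n)(bar_t - bar_n)'
    V = (A - np.outer(b, bar) - np.outer(bar, b) + c * np.outer(bar, bar)) / n ** 2
    half = crit * np.sqrt(np.diag(V) / n)       # 95% CI half-width (assumed critical value)
    return bar, bar - half, bar + half

# Simulated example: y = 1 + 2 * z + noise
rng = np.random.default_rng(0)
n_obs = 50_000
x_sim = np.column_stack([np.ones(n_obs), rng.normal(size=n_obs)])
y_sim = x_sim @ np.array([1.0, 2.0]) + rng.normal(size=n_obs)
est, lo, hi = sgd_lm_rs(x_sim, y_sim)
```

The running accumulators `A`, `b`, and `c` let the random-scaling covariance be formed in a single pass over the data, mirroring the online character of the method.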

The methodology used in the SGDinference package is described in detail
@@ -43,3 +43,164 @@ You can install the development version from
# install.packages("devtools") # if you have not installed "devtools" package
devtools::install_github("SGDinference-Lab/SGDinference")
```

We begin by calling the SGDinference package.

``` r
library(SGDinference)
```

## Case Study: Estimating the Mincer Equation

To illustrate the usefulness of the package, we use a small dataset
included in the package. Specifically, the *Census2000* dataset from
Acemoglu and Autor (2011) consists of observations on 26,120 nonwhite,
female workers. This small dataset is constructed from
“microwage2000_ext.dta” at
<https://economics.mit.edu/people/faculty/david-h-autor/data-archive>.
Observations are dropped if hourly wages are missing or years of
education are smaller than 6. Then, a 5 percent random sample is drawn
to make the dataset small. The following three variables are included:

- ln_hrwage: log hourly wages
- edyrs: years of education
- exp: years of potential experience

We now define the variables.

``` r
y = Census2000$ln_hrwage
edu = Census2000$edyrs
exp = Census2000$exp
exp2 = exp^2/100
```

As a benchmark, we first estimate the Mincer equation and report the
point estimates and their 95% heteroskedasticity-robust confidence
intervals.

``` r
mincer = lm(y ~ edu + exp + exp2)
inference = lmtest::coefci(mincer, df = Inf,
                           vcov = sandwich::vcovHC)
results = cbind(mincer$coefficients,inference)
colnames(results)[1] = "estimate"
print(results)
#> estimate 2.5 % 97.5 %
#> (Intercept) 0.58114741 0.52705757 0.63523726
#> edu 0.12710477 0.12329983 0.13090971
#> exp 0.03108721 0.02877637 0.03339806
#> exp2 -0.04498841 -0.05070846 -0.03926835
```

### Estimating the Mean Regression Model Using SGD

We now estimate the same model using SGD.

``` r
mincer_sgd = sgdi_lm(y ~ edu + exp + exp2)
print(mincer_sgd)
#> Call:
#> sgdi_lm(formula = y ~ edu + exp + exp2)
#>
#> Coefficients:
#> Coefficient CI.Lower CI.Upper
#> (Intercept) 0.58692678 0.51821551 0.65563805
#> edu 0.12652414 0.12289664 0.13015163
#> exp 0.03153344 0.02785877 0.03520811
#> exp2 -0.04603275 -0.05576062 -0.03630488
#>
#> Significance Level: 95 %
```

It can be seen that the estimation results are similar between the two
methods. There is a separate command that computes only the estimates,
without confidence intervals.

``` r
mincer_sgd = sgd_lm(y ~ edu + exp + exp2)
print(mincer_sgd)
#> Call:
#> sgd_lm(formula = y ~ edu + exp + exp2)
#>
#> Coefficients:
#> Coefficient
#> (Intercept) 0.58400539
#> edu 0.12674866
#> exp 0.03151793
#> exp2 -0.04593379
```

We compare the execution times between the two versions and find that
there is not much difference in this simple example. By construction, it
takes more time to conduct inference via `sgdi_lm`.

``` r
library(microbenchmark)
res <- microbenchmark(sgd_lm(y ~ edu + exp + exp2),
                      sgdi_lm(y ~ edu + exp + exp2),
                      times=100L)
#> Warning in microbenchmark(sgd_lm(y ~ edu + exp + exp2), sgdi_lm(y ~ edu + :
#> less accurate nanosecond times to avoid potential integer overflows
print(res)
#> Unit: milliseconds
#> expr min lq mean median uq
#> sgd_lm(y ~ edu + exp + exp2) 3.425058 3.784853 4.319049 3.893012 4.167773
#> sgdi_lm(y ~ edu + exp + exp2) 4.163099 4.520229 5.106551 4.622975 4.875515
#> max neval
#> 8.647433 100
#> 11.408947 100
```

To plot the SGD path, we first construct an SGD path for the
return-to-education coefficient.

``` r
mincer_sgd_path = sgdi_lm(y ~ edu + exp + exp2, path = TRUE, path_index = 2)
```

Then, we can plot the SGD path.

``` r
plot(mincer_sgd_path$path_coefficients, ylab="Return to Education", xlab="Steps")
```

<img src="man/figures/README-unnamed-chunk-8-1.png" width="100%" />

To observe the initial part of the path, we now plot only the first 2,000 steps.

``` r
plot(mincer_sgd_path$path_coefficients[1:2000], ylab="Return to Education", xlab="Steps")
```

<img src="man/figures/README-unnamed-chunk-9-1.png" width="100%" />

``` r
print(c("2000th step", mincer_sgd_path$path_coefficients[2000]))
#> [1] "2000th step" "0.123913138474636"
print(c("Final Estimate", mincer_sgd_path$coefficients[2]))
#> [1] "Final Estimate" "0.126641059670094"
```

It can be seen that the SGD path has almost converged after only 2,000
steps, less than 10% of the sample size.

## What else the package can do

See the vignette for the quantile regression example.
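For the quantile case, the substantive change relative to the mean-regression updates is that the squared-loss gradient is replaced by a subgradient of the check loss. The following Python sketch is illustrative only: the function name `subgd_quantile` and the tuning choices are assumptions made here, not the package's `sgd_qr`/`sgdi_qr` implementation.

```python
import numpy as np

def subgd_quantile(x, y, tau=0.5, gamma_0=0.5, alpha=0.501):
    """Sketch of S-subGD quantile regression with Polyak-Ruppert averaging.
    Illustrative only; see the package's sgd_qr/sgdi_qr for the real method."""
    n, d = x.shape
    beta = np.zeros(d)                      # current iterate
    s = np.zeros(d)                         # running sum of iterates
    for t in range(1, n + 1):
        lr = gamma_0 * t ** (-alpha)        # same diminishing step size
        resid = y[t - 1] - x[t - 1] @ beta
        # subgradient of the check loss rho_tau(u) = u * (tau - 1{u < 0})
        sub = -(tau - float(resid < 0.0)) * x[t - 1]
        beta = beta - lr * sub
        s += beta
    return s / n                            # Polyak-Ruppert average

# Simulated example with symmetric noise, so the median line is y = 1 + 2 * z
rng = np.random.default_rng(1)
n_obs = 100_000
x_sim = np.column_stack([np.ones(n_obs), rng.normal(size=n_obs)])
y_sim = x_sim @ np.array([1.0, 2.0]) + rng.normal(size=n_obs)
beta_hat = subgd_quantile(x_sim, y_sim, tau=0.5)
```

Because the check loss is not differentiable at zero, each update uses a subgradient, which is why the method is called stochastic sub-gradient descent.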

# References

Acemoglu, D. and Autor, D., 2011. Skills, tasks and technologies:
Implications for employment and earnings. In *Handbook of labor
economics* (Vol. 4, pp. 1043-1171). Elsevier.

Lee, S., Liao, Y., Seo, M.H. and Shin, Y., 2022. Fast and robust online
inference with stochastic gradient descent via random scaling. In
*Proceedings of the AAAI Conference on Artificial Intelligence* (Vol.
36, No. 7, pp. 7381-7389). <https://doi.org/10.1609/aaai.v36i7.20701>.

Lee, S., Liao, Y., Seo, M.H. and Shin, Y., 2023. Fast Inference for
Quantile Regression with Tens of Millions of Observations.
arXiv:2209.14502 \[econ.EM\]
<https://doi.org/10.48550/arXiv.2209.14502>.
2 changes: 1 addition & 1 deletion man/Census2000.Rd


5 changes: 3 additions & 2 deletions man/SGDinference.Rd


Binary file added man/figures/README-unnamed-chunk-8-1.png
Binary file added man/figures/README-unnamed-chunk-9-1.png
2 changes: 1 addition & 1 deletion man/sgd_qr.Rd


2 changes: 1 addition & 1 deletion man/sgdi_qr.Rd


