Merge pull request #587 from SebKrantz/master

Update
SebKrantz · Jun 1, 2024 · 9029692
2 parents e2a32f9 + 48b0fc1
Showing 60 changed files with 10,719 additions and 760 deletions.
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -26,3 +26,8 @@ man/figures
 _cache$
 _snaps
 ^CITATION\.cff$
+^\.DS_Store$
+^revdep$
+\.orig$
+
+
diff --git a/CITATION.cff b/CITATION.cff
@@ -8,7 +8,7 @@ message: 'To cite package "collapse" in publications use:'
 type: software
 license: GPL-2.0-or-later
 title: 'collapse: Advanced and Fast Data Transformation'
-version: 2.0.14
+version: 2.0.15
 abstract: A C/C++ based package for advanced data transformation and statistical computing
   in R that is extremely fast, class-agnostic, robust and programmer friendly. Core
   functionality includes a rich set of S3 generic grouped and weighted statistical
@@ -21,8 +21,8 @@ abstract: A C/C++ based package for advanced data transformation and statistical
   statistics, powerful tools to work with nested data, fast data object conversions,
   functions for memory efficient R programming, and helpers to effectively deal with
   variable labels, attributes, and missing data. It is well integrated with base R
-  classes, 'dplyr'/'tibble', 'data.table', 'sf', 'plm' (panel-series and data frames),
-  and 'xts'/'zoo'.
+  classes, 'dplyr'/'tibble', 'data.table', 'sf', 'units', 'plm' (panel-series and
+  data frames), and 'xts'/'zoo'.
 authors:
 - family-names: Krantz
   given-names: Sebastian
@@ -42,7 +42,7 @@ preferred-citation:
 repository: https://CRAN.R-project.org/package=collapse
 repository-code: https://github.com/SebKrantz/collapse
 url: https://sebkrantz.github.io/collapse/
-date-released: '2024-04-30'
+date-released: '2024-05-30'
 contact:
 - family-names: Krantz
   given-names: Sebastian
@@ -73,7 +73,7 @@ references:
   - family-names: Krantz
     given-names: Sebastian
   year: '2024'
-  notes: R package version 2.0.14
+  notes: R package version 2.0.15
   doi: 10.5281/zenodo.8433090
   url: https://sebkrantz.github.io/collapse/
 - type: software

diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: collapse
 Title: Advanced and Fast Data Transformation
-Version: 2.0.14
-Date: 2024-04-30
+Version: 2.0.15
+Date: 2024-05-30
 Authors@R: c(
            person("Sebastian", "Krantz", role = c("aut", "cre"), 
                   email = "sebastian.krantz@graduateinstitute.ch", 
@@ -28,7 +28,7 @@ Description: A C/C++ based package for advanced data transformation and
     (grouped, weighted) summary statistics, powerful tools to work with nested data, 
     fast data object conversions, functions for memory efficient R programming, and 
     helpers to effectively deal with variable labels, attributes, and missing data. 
-    It is well integrated with base R classes, 'dplyr'/'tibble', 'data.table', 'sf', 
+    It is well integrated with base R classes, 'dplyr'/'tibble', 'data.table', 'sf', 'units', 
     'plm' (panel-series and data frames), and 'xts'/'zoo'.
 URL: https://sebkrantz.github.io/collapse/,  
      https://github.com/SebKrantz/collapse,

diff --git a/NAMESPACE b/NAMESPACE
@@ -405,6 +405,7 @@ importFrom("stats", "as.formula", "complete.cases", "cor", "cov", "var", "pt",
  export(fncol)
  export(fdim)
  export(as_numeric_factor)
+ export(as_integer_factor)
  export(as_character_factor)
  export(as.numeric_factor)
  export(as.character_factor)

diff --git a/NEWS.md b/NEWS.md
@@ -1,5 +1,13 @@
+# collapse 2.0.15
+
+* `pivot()` has new arguments `FUN = "last"` and `FUN.args = NULL`, allowing wide and recast pivots with aggregation (default last value as before). `FUN` currently supports a single function returning a scalar value. *Fast Statistical Functions* receive vectorized execution. `FUN.args` can be used to supply a list of function arguments, including data-length arguments such as weights. There are also a couple of internal functions callable using function strings: `"first"`, `"last"`, `"count"`, `"sum"`, `"mean"`, `"min"`, or `"max"`. These are built into the reshaping C-code and thus extremely fast. Thanks @AdrianAntico for the request (#582).
+
 # collapse 2.0.14
 
+* Updated '*collapse* and *sf*' vignette to reflect the recent support for *units* objects, and added a few more examples.
+
+* Fixed a bug in `join()` where a full join silently became a left join if there are no matches between the tables (#574). Thanks @D3SL for reporting. 
+
 * Added function `group_by_vars()`: A standard evaluation version of `fgroup_by()` that is slimmer and safer for programming, e.g. `data |> group_by_vars(ind1) |> collapg(custom = list(fmean = ind2, fsum = ind3))`. Or, using *magrittr*: 
 ```r 
 library(magrittr)
@@ -15,8 +23,13 @@ data %>%
 }
 ```
 
+* Added function `as_integer_factor()` to turn factors/factor columns into integer vectors. `as_numeric_factor()` already exists, but is memory inefficient for most factors where levels can be integers. 
+
+* `join()` now internally checks if the rows of the joined datasets match exactly. This check, using `identical(m, seq_row(y))`, is inexpensive, but, if `TRUE`, saves a full subset and deep copy of `y`. Thus `join()` now inherits the intelligence already present in functions like `fsubset()`, `roworder()` and `funique()` - a key for efficient data manipulation is simply doing less.  
+
 * In `join()`, if `attr = TRUE`, the `count` option to `fmatch()` is always invoked, so that the attribute attached always has the same form, regardless of `verbose` or `validate` settings. 
 
+
 * `roworder[v]()` has optional setting `verbose = 2L` to indicate if `x` is already sorted, making the call to `roworder[v]()` redundant. 
 
 # collapse 2.0.13

diff --git a/R/global_macros.R b/R/global_macros.R
@@ -105,7 +105,7 @@ get_collapse <- function(opts = NULL) if(is.null(opts)) as.list(.op) else if(len
                             "%r-%", "%r*%", "%r/%", "%r+%", "%rr%", "add_stub", "add_vars",
                             "add_vars<-", "all_funs", "all_identical", "all_obj_equal", "allNA",
                             "alloc", "allv", "any_duplicated", "anyv", "as_character_factor",
-                            "as_factor_GRP", "as_factor_qG", "as_numeric_factor", "as.character_factor",
+                            "as_factor_GRP", "as_factor_qG", "as_numeric_factor", "as_integer_factor", "as.character_factor",
                             "as.factor_GRP", "as.factor_qG", "as.numeric_factor", "atomic_elem",
                             "atomic_elem<-", "av", "av<-", "B", "BY", "BY.data.frame", "BY.default",
                             "BY.matrix", "cat_vars", "cat_vars<-", "char_vars", "char_vars<-",
@@ -177,7 +177,7 @@ get_collapse <- function(opts = NULL) if(is.null(opts)) as.list(.op) else if(len
 .COLLAPSE_ALL <- sort(unique(c("%-=%", "%!=%", "%!iin%", "%!in%", "%*=%", "%/=%", "%+=%", "%=%", "%==%", "%c-%", "%c*%", "%c/%", "%c+%",
                                "%cr%", "%iin%", "%r-%", "%r*%", "%r/%", "%r+%", "%rr%", "add_stub", "add_vars", "add_vars<-", "all_funs",
                                "all_identical", "all_obj_equal", "allNA", "alloc", "allv", "any_duplicated", "anyv", "as_character_factor",
-                               "as_factor_GRP", "as_factor_qG", "as_numeric_factor", "atomic_elem", "atomic_elem<-", "av", "av<-", "B", "BY",
+                               "as_factor_GRP", "as_factor_qG", "as_numeric_factor", "as_integer_factor", "atomic_elem", "atomic_elem<-", "av", "av<-", "B", "BY",
                                "cat_vars", "cat_vars<-", "char_vars", "char_vars<-", "cinv", "ckmatch", "collap", "collapg", "collapv", "colorder",
                                "colorderv", "copyAttrib", "copyMostAttrib", "copyv", "D", "dapply", "date_vars", "Date_vars", "date_vars<-",
                                "Date_vars<-", "descr", "Dlog", "fact_vars", "fact_vars<-", "fbetween", "fcompute", "fcomputev", "fcount",

diff --git a/R/join.R b/R/join.R
@@ -180,7 +180,7 @@ join <- function(x, y,
   # Core: do the joins
   res <- switch(how,
     left = {
-      y_res <- .Call(C_subsetDT, y, m, seq_along(y)[-iyon], if(count) attr(m, "N.nomatch") else TRUE)
+      y_res <- if(identical(unattrib(m), seq_row(y))) y[-iyon] else .Call(C_subsetDT, y, m, seq_along(y)[-iyon], if(count) attr(m, "N.nomatch") else TRUE)
       c(x, y_res)
     },
     inner = {
@@ -193,7 +193,7 @@ join <- function(x, y,
         # if(length(rn)) ax[["row.names"]] <- if(is.numeric(rn) || is.null(rn) || rn[1L] == "1")
         #             .set_row_names(length(x_ind)) else Csv(rn, x_ind)
       }
-      y_res <- .Call(C_subsetDT, y, m, seq_along(y)[-iyon], FALSE)
+      y_res <- if(identical(unattrib(m), seq_row(y))) y[-iyon] else .Call(C_subsetDT, y, m, seq_along(y)[-iyon], FALSE)
       c(x, y_res)
     },
     full = {
@@ -209,7 +209,7 @@ join <- function(x, y,
         }
       }
       if(cond) { # TODO: special case ? 1 distinct value etc.??
-        tind <- seq_len(tsize)[-um] # TODO: Table may not be unique.
+        tind <- if(length(um)) seq_len(tsize)[-um] else seq_len(tsize) # TODO: Table may not be unique.
         res_nrow <- length(m) + length(tind)
         x_res <- .Call(C_subsetDT, x, seq_len(res_nrow), seq_along(x)[-ixon], TRUE)  # Need check here because oversize indices !!
         y_res <- .Call(C_subsetDT, y, vec(list(m, tind)), seq_along(y)[-iyon], TRUE) # Need check here because oversize indices !!
@@ -225,12 +225,12 @@ join <- function(x, y,
         }
       } else { # If all elements of table are matched, this is simply a left join
         how <- if(multiple == 2L) "left_setrn" else "left"
-        y_res <- .Call(C_subsetDT, y, m, seq_along(y)[-iyon], if(count) attr(m, "N.nomatch") else TRUE) # anyNA(um) ??
+        y_res <- if(identical(unattrib(m), seq_row(y))) y[-iyon] else .Call(C_subsetDT, y, m, seq_along(y)[-iyon], if(count) attr(m, "N.nomatch") else TRUE) # anyNA(um) ??
         c(x, y_res)
       }
     },
     right = {
-      x_res <- .Call(C_subsetDT, x, m, seq_along(x)[-ixon], if(count) attr(m, "N.nomatch") else TRUE)
+      x_res <- if(identical(unattrib(m), seq_row(x))) x[-ixon] else .Call(C_subsetDT, x, m, seq_along(x)[-ixon], if(count) attr(m, "N.nomatch") else TRUE)
       # if(length(ax[["row.names"]])) ax[["row.names"]] <- .set_row_names(length(m))
       y_on <- y[iyon]
       names(y_on) <- xon
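
The `tind` change in the `full` branch guards against a base R indexing pitfall: negating an empty index vector selects nothing rather than everything, which is consistent with the full-join bug described in NEWS.md (#574). A minimal base R sketch of the pitfall follows; `tsize` and `um` are stand-in values chosen for illustration, not the package's actual data.

```r
# Stand-ins for the variables in the diff: table size and matched rows of y.
tsize <- 5L
um <- integer(0)   # assumption: no row of y was matched

# Old expression: -integer(0) is still integer(0), so nothing is selected.
old_tind <- seq_len(tsize)[-um]

# New expression: fall back to all rows when there is nothing to drop.
new_tind <- if (length(um)) seq_len(tsize)[-um] else seq_len(tsize)

length(old_tind)  # 0: the unmatched rows of y would be lost
length(new_tind)  # 5: all rows of y remain available to append
```

With the old expression, `length(tind)` is zero exactly when no rows match, so the result has only the rows of `x`.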

diff --git a/R/pivot.R b/R/pivot.R
@@ -100,6 +100,16 @@ add_labels <- function(l, labs) {
   .Call(C_setvlabels, l, "label", labs, NULL)
 }
 
+apply_external_FUN <- function(data, g, FUN, args, name) {
+  FUN <- match.fun(FUN)
+  if(is.null(args)) {
+    if(any(name == .FAST_STAT_FUN)) return(FUN(data, g = g, TRA = "fill"))
+    return(TRA(data, BY(data, g, FUN, use.g.names = FALSE, reorder = FALSE), "fill", g))
+  }
+  if(any(name == .FAST_STAT_FUN)) return(do.call(FUN, c(list(x = data, g = g, TRA = "fill"), args)))
+  TRA(data, do.call(BY, c(list(x = data, g = g, FUN = FUN, use.g.names = FALSE, reorder = FALSE), args)), "fill", g)
+}
+
 # TODO: Think about: values could be list input, names only atomic. that would make more sense...
 # Or: allow for both options... needs to be consistent with "labels" though...
 
@@ -133,6 +143,8 @@ pivot <- function(data,
                   na.rm = FALSE,
                   factor = c("names", "labels"),
                   check.dups = FALSE,
+                  FUN = "last",
+                  FUN.args = NULL,
                   nthreads = .op[["nthreads"]],
                   fill = NULL, # Fill is for pivot_wider
                   drop = TRUE, # Same as with dcast()
@@ -293,13 +305,21 @@ pivot <- function(data,
         if(length(values) > 1L) { # Multiple columns, as in dcast... TODO: check pivot_wider
           namv <- names(data)[values]
           attributes(data) <- NULL
-          value_cols <- lapply(data[values], function(x) .Call(C_pivot_wide, g, g_v, x, fill, nthreads))
+          if(!is.character(FUN)) {
+            data[values] <- apply_external_FUN(data[values], group(list(g, g_v)), FUN, FUN.args, l1orlst(as.character(substitute(FUN))))
+            FUN <- "last"
+          }
+          value_cols <- lapply(data[values], function(x) .Call(C_pivot_wide, g, g_v, x, fill, nthreads, FUN, na.rm))
           if(length(labels)) value_cols <- lapply(value_cols, add_labels, labels)
           value_cols <- unlist(if(transpose[1L]) t_list2(value_cols) else value_cols, FALSE, FALSE)
           namv_res <- if(transpose[2L]) t(outer(names, namv, paste, sep = "_")) else outer(namv, names, paste, sep = "_")
           names(value_cols) <- if(transpose[1L]) namv_res else t(namv_res)
         } else {
-          value_cols <- .Call(C_pivot_wide, g, g_v, data[[values]], fill, nthreads)
+          if(!is.character(FUN)) {
+            data[[values]] <- apply_external_FUN(data[[values]], group(list(g, g_v)), FUN, FUN.args, l1orlst(as.character(substitute(FUN))))
+            FUN <- "last"
+          }
+          value_cols <- .Call(C_pivot_wide, g, g_v, data[[values]], fill, nthreads, FUN, na.rm)
           names(value_cols) <- names
           if(length(labels)) vlabels(value_cols) <- labels
         }
@@ -375,7 +395,11 @@ pivot <- function(data,
           namv <- names(vd)
           attributes(vd) <- NULL
         }
-        value_cols <- lapply(vd, function(x) .Call(C_pivot_wide, g, g_v, x, fill, nthreads))
+        if(!is.character(FUN)) {
+          vd <- apply_external_FUN(vd, group(list(g, g_v)), FUN, FUN.args, l1orlst(as.character(substitute(FUN))))
+          FUN <- "last"
+        }
+        value_cols <- lapply(vd, function(x) .Call(C_pivot_wide, g, g_v, x, fill, nthreads, FUN, na.rm))
         if(length(id_cols)) id_cols <- .Call(C_rbindlist, alloc(id_cols, length(value_cols)), FALSE, FALSE, NULL)
         value_cols <- .Call(C_rbindlist, value_cols, FALSE, FALSE, names[[2L]]) # Final column is "variable" name
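
For a non-character `FUN`, the code above appears to work in two steps: `apply_external_FUN()` replaces each value by its group statistic (`TRA = "fill"`), and the C routine is then called with `FUN` reset to `"last"`. A rough base R analogue of that two-step idea is sketched below, using `ave()` and `tapply()` in place of the package internals. The data frame `d` and its columns are made up for illustration.

```r
# Illustrative data: two ids, two keys, duplicate (id, key) cells.
d <- data.frame(id  = c(1, 1, 1, 2, 2, 2),
                key = c("a", "a", "b", "a", "a", "b"),
                val = c(1, 3, 5, 2, 4, 6))

# Step 1: overwrite each value with its (id, key) group statistic.
d$val_filled <- ave(d$val, d$id, d$key, FUN = median)

# Step 2: an ordinary "last value" wide pivot now yields the aggregate.
wide <- tapply(d$val_filled, list(d$id, d$key), function(v) v[length(v)])
wide
```

Because every row in a cell already holds the aggregate after step 1, taking the last row per cell in step 2 returns the aggregate without a second aggregation pass.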
 

diff --git a/R/small_helper.R b/R/small_helper.R
@@ -501,6 +501,16 @@ as_numeric_factor <- function(X, keep.attr = TRUE) {
   res
 }
 
+as_integer_factor <- function(X, keep.attr = TRUE) {
+  if(is.atomic(X)) if(keep.attr) return(ffka(X, as.integer)) else
+    return(as.integer(attr(X, "levels"))[X])
+  res <- duplAttributes(lapply(unattrib(X),
+    if(keep.attr) (function(y) if(is.factor(y)) ffka(y, as.integer) else y) else
+                  (function(y) if(is.factor(y)) as.integer(attr(y, "levels"))[y] else y)), X)
+  if(inherits(X, "data.table")) return(alc(res))
+  res
+}
+
 as_character_factor <- function(X, keep.attr = TRUE) {
   if(is.atomic(X)) if(keep.attr) return(ffka(X, tochar)) else
     return(as.character.factor(X))
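
The core of the `keep.attr = FALSE` branch in `as_integer_factor()`, `as.integer(attr(X, "levels"))[X]`, converts the levels once and then indexes them by the factor's integer codes. A small base R sketch of that idiom, contrasted with calling `as.integer()` on the factor itself, is shown below. The vector `f` is illustrative only.

```r
f <- factor(c("10", "20", "10", "30"))

# Calling as.integer() on the factor returns its internal codes.
codes <- as.integer(f)              # 1 2 1 3

# Converting the levels once and indexing by the factor recovers the values.
values <- as.integer(levels(f))[f]  # 10 20 10 30
```

Only the levels are converted, so the conversion cost scales with the number of distinct levels rather than the length of the vector.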

diff --git a/README.md b/README.md
@@ -11,7 +11,7 @@
  [![Conda Downloads](https://img.shields.io/conda/dn/conda-forge/r-collapse.svg)](https://anaconda.org/conda-forge/r-collapse)
 [![Codecov test coverage](https://codecov.io/gh/SebKrantz/collapse/branch/master/graph/badge.svg)](https://app.codecov.io/gh/SebKrantz/collapse?branch=master)
 [![minimal R version](https://img.shields.io/badge/R%3E%3D-3.3.0-6666ff.svg)](https://cran.r-project.org/)
-[![status](https://tinyverse.netlify.com/badge/collapse)](https://CRAN.R-project.org/package=collapse)
+[![dependencies](https://tinyverse.netlify.app/badge/collapse)](https://CRAN.R-project.org/package=collapse)
 [![DOI](https://zenodo.org/badge/172910283.svg)](https://zenodo.org/badge/latestdoi/172910283)
 [![arXiv](https://img.shields.io/badge/arXiv-2403.05038-0969DA.svg)](https://arxiv.org/abs/2403.05038)
 <!-- badges: end -->
@@ -21,7 +21,7 @@
 * To facilitate complex data transformation, exploration and computing tasks in R.
 * To help make R code fast, flexible, parsimonious and programmer friendly. 
 
-It further implements a [class-agnostic approach to R programming](https://sebkrantz.github.io/collapse/articles/collapse_object_handling.html), supporting base R, *tibble*, *grouped_df* (*tidyverse*), *data.table*, *sf*, *pseries*, *pdata.frame* (*plm*), and preserving many others (e.g. *units*, *xts*/*zoo*, *tsibble*). 
+It further implements a [class-agnostic approach to R programming](https://sebkrantz.github.io/collapse/articles/collapse_object_handling.html), supporting base R, *tibble*, *grouped_df* (*tidyverse*), *data.table*, *sf*, *units*, *pseries*, *pdata.frame* (*plm*), and *xts*/*zoo*. 
 
 **Key Features:**
 

diff --git a/_pkgdown.yml b/_pkgdown.yml
@@ -210,6 +210,7 @@ articles:
   contents:
   - collapse_documentation
   - collapse_for_tidyverse_users
+  - collapse_and_sf
   - collapse_object_handling
 - title: Legacy (Pre v1.7)
   desc: Vignettes that cover functionality of versions <1.7. These
@@ -219,5 +220,4 @@ articles:
   - collapse_and_dplyr
   - collapse_and_data.table
   - collapse_and_plm
-  - collapse_and_sf
 
diff --git a/man/collapse-documentation.Rd b/man/collapse-documentation.Rd
@@ -26,7 +26,7 @@ The following table fully summarizes the contents of \emph{\link{collapse}}. The
 \link[=fast-data-manipulation]{Fast Data Manipulation} \tab\tab Fast and flexible select, subset, summarise, mutate/transform, sort/reorder, combine, join, reshape, rename and relabel data. Some functions modify by reference and/or allow assignment. In addition a set of (standard evaluation) functions for fast selecting, replacing or adding data frame columns, including shortcuts to select and replace variables by data type.
 \tab\tab \code{\link[=fselect]{fselect(<-)}}, \code{\link[=fsubset]{fsubset/ss}}, \code{\link{fsummarise}}, \code{\link{fmutate}}, \code{\link{across}}, \code{\link[=ftransform]{(f/set)transform(v)(<-)}}, \code{\link[=fcompute]{fcompute(v)}}, \code{\link[=roworder]{roworder(v)}}, \code{\link[=colorder]{colorder(v)}}, \code{\link{rowbind}}, \code{\link{join}}, \code{\link{pivot}}, \code{\link[=frename]{(f/set)rename}}, \code{\link[=relabel]{(set)relabel}}, \code{\link[=get_vars]{get_vars(<-)}}, \code{\link[=add_vars]{add_vars(<-)}}, \code{\link[=num_vars]{num_vars(<-)}}, \code{\link[=cat_vars]{cat_vars(<-)}}, \code{\link[=char_vars]{char_vars(<-)}}, \code{\link[=fact_vars]{fact_vars(<-)}}, \code{\link[=logi_vars]{logi_vars(<-)}}, \code{\link[=date_vars]{date_vars(<-)}} \cr \cr \cr
 
-\link[=quick-conversion]{Quick Data Conversion} \tab\tab Quick conversions: data.frame <> data.table <> tibble <> matrix (row- or column-wise) <> list | array > matrix, data.frame, data.table, tibble | vector > factor, matrix, data.frame, data.table, tibble; and converting factors / all factor columns. \tab\tab \code{qDF}, \code{qDT}, \code{qTBL}, \code{qM}, \code{qF}, \code{mrtl}, \code{mctl}, \code{as_numeric_factor}, \code{as_character_factor} \cr \cr \cr
+\link[=quick-conversion]{Quick Data Conversion} \tab\tab Quick conversions: data.frame <> data.table <> tibble <> matrix (row- or column-wise) <> list | array > matrix, data.frame, data.table, tibble | vector > factor, matrix, data.frame, data.table, tibble; and converting factors / all factor columns. \tab\tab \code{qDF}, \code{qDT}, \code{qTBL}, \code{qM}, \code{qF}, \code{mrtl}, \code{mctl}, \code{as_numeric_factor}, \code{as_integer_factor}, \code{as_character_factor} \cr \cr \cr
 
 \link[=advanced-aggregation]{Advanced Data Aggregation} \tab\tab Fast and easy (weighted and parallelized) aggregation of multi-type data, with different functions applied to numeric and categorical variables. Custom specifications allow mappings of functions to variables + renaming. \tab\tab \code{collap(v/g)} \cr \cr \cr
 

diff --git a/man/collapse-package.Rd b/man/collapse-package.Rd
@@ -12,7 +12,7 @@ Advanced and Fast Data Transformation
 \item To help make R code fast, flexible, parsimonious and programmer friendly. % \emph{collapse} is a fast  %to facilitate (advanced) data manipulation in R   % To achieve the latter,
 % collapse provides a broad set.. -> Nah, its not a misc package
 }
-It is made compatible with the \emph{tidyverse}, \emph{data.table}, \emph{sf} and the \emph{plm} approach to panel data, and non-destructively handles other classes such as \emph{xts}.
+It is made compatible with the \emph{tidyverse}, \emph{data.table}, \emph{sf}, \emph{units}, \emph{xts/zoo}, and the \emph{plm} approach to panel data.
 
 }
 \section{Getting Started}{