Merge pull request #587 from SebKrantz/master
Update
SebKrantz authored Jun 1, 2024
2 parents e2a32f9 + 48b0fc1 commit 9029692
Showing 60 changed files with 10,719 additions and 760 deletions.
5 changes: 5 additions & 0 deletions .Rbuildignore
@@ -26,3 +26,8 @@ man/figures
_cache$
_snaps
^CITATION\.cff$
+^\.DS_Store$
+^revdep$
+\.orig$


10 changes: 5 additions & 5 deletions CITATION.cff
@@ -8,7 +8,7 @@ message: 'To cite package "collapse" in publications use:'
type: software
license: GPL-2.0-or-later
title: 'collapse: Advanced and Fast Data Transformation'
-version: 2.0.14
+version: 2.0.15
abstract: A C/C++ based package for advanced data transformation and statistical computing
in R that is extremely fast, class-agnostic, robust and programmer friendly. Core
functionality includes a rich set of S3 generic grouped and weighted statistical
@@ -21,8 +21,8 @@ abstract: A C/C++ based package for advanced data transformation and statistical
statistics, powerful tools to work with nested data, fast data object conversions,
functions for memory efficient R programming, and helpers to effectively deal with
variable labels, attributes, and missing data. It is well integrated with base R
-classes, 'dplyr'/'tibble', 'data.table', 'sf', 'plm' (panel-series and data frames),
-and 'xts'/'zoo'.
+classes, 'dplyr'/'tibble', 'data.table', 'sf', 'units', 'plm' (panel-series and
+data frames), and 'xts'/'zoo'.
authors:
- family-names: Krantz
given-names: Sebastian
@@ -42,7 +42,7 @@ preferred-citation:
repository: https://CRAN.R-project.org/package=collapse
repository-code: https://github.com/SebKrantz/collapse
url: https://sebkrantz.github.io/collapse/
-date-released: '2024-04-30'
+date-released: '2024-05-30'
contact:
- family-names: Krantz
given-names: Sebastian
@@ -73,7 +73,7 @@ references:
- family-names: Krantz
given-names: Sebastian
year: '2024'
-notes: R package version 2.0.14
+notes: R package version 2.0.15
doi: 10.5281/zenodo.8433090
url: https://sebkrantz.github.io/collapse/
- type: software
6 changes: 3 additions & 3 deletions DESCRIPTION
@@ -1,7 +1,7 @@
Package: collapse
Title: Advanced and Fast Data Transformation
-Version: 2.0.14
-Date: 2024-04-30
+Version: 2.0.15
+Date: 2024-05-30
Authors@R: c(
person("Sebastian", "Krantz", role = c("aut", "cre"),
email = "sebastian.krantz@graduateinstitute.ch",
@@ -28,7 +28,7 @@ Description: A C/C++ based package for advanced data transformation and
(grouped, weighted) summary statistics, powerful tools to work with nested data,
fast data object conversions, functions for memory efficient R programming, and
helpers to effectively deal with variable labels, attributes, and missing data.
-It is well integrated with base R classes, 'dplyr'/'tibble', 'data.table', 'sf',
+It is well integrated with base R classes, 'dplyr'/'tibble', 'data.table', 'sf', 'units',
'plm' (panel-series and data frames), and 'xts'/'zoo'.
URL: https://sebkrantz.github.io/collapse/,
https://github.com/SebKrantz/collapse,
1 change: 1 addition & 0 deletions NAMESPACE
@@ -405,6 +405,7 @@ importFrom("stats", "as.formula", "complete.cases", "cor", "cov", "var", "pt",
export(fncol)
export(fdim)
export(as_numeric_factor)
+export(as_integer_factor)
export(as_character_factor)
export(as.numeric_factor)
export(as.character_factor)
13 changes: 13 additions & 0 deletions NEWS.md
@@ -1,5 +1,13 @@
+# collapse 2.0.15
+
+* `pivot()` has new arguments `FUN = "last"` and `FUN.args = NULL`, allowing wide and recast pivots with aggregation (default last value as before). `FUN` currently supports a single function returning a scalar value. *Fast Statistical Functions* receive vectorized execution. `FUN.args` can be used to supply a list of function arguments, including data-length arguments such as weights. There are also a couple of internal functions callable using function strings: `"first"`, `"last"`, `"count"`, `"sum"`, `"mean"`, `"min"`, or `"max"`. These are built into the reshaping C-code and thus extremely fast. Thanks @AdrianAntico for the request (#582).
+
# collapse 2.0.14

+* Updated '*collapse* and *sf*' vignette to reflect the recent support for *units* objects, and added a few more examples.
+
+* Fixed a bug in `join()` where a full join silently became a left join if there are no matches between the tables (#574). Thanks @D3SL for reporting.
+
* Added function `group_by_vars()`: A standard evaluation version of `fgroup_by()` that is slimmer and safer for programming, e.g. `data |> group_by_vars(ind1) |> collapg(custom = list(fmean = ind2, fsum = ind3))`. Or, using *magrittr*:
```r
library(magrittr)
@@ -15,8 +23,13 @@ data %>%
}
```

+* Added function `as_integer_factor()` to turn factors/factor columns into integer vectors. `as_numeric_factor()` already exists, but is memory inefficient for most factors where levels can be integers.
+
+* `join()` now internally checks if the rows of the joined datasets match exactly. This check, using `identical(m, seq_row(y))`, is inexpensive, but, if `TRUE`, saves a full subset and deep copy of `y`. Thus `join()` now inherits the intelligence already present in functions like `fsubset()`, `roworder()` and `funique()` - a key for efficient data manipulation is simply doing less.
+
* In `join()`, if `attr = TRUE`, the `count` option to `fmatch()` is always invoked, so that the attribute attached always has the same form, regardless of `verbose` or `validate` settings.


* `roworder[v]()` has optional setting `verbose = 2L` to indicate if `x` is already sorted, making the call to `roworder[v]()` redundant.

# collapse 2.0.13
4 changes: 2 additions & 2 deletions R/global_macros.R
@@ -105,7 +105,7 @@ get_collapse <- function(opts = NULL) if(is.null(opts)) as.list(.op) else if(len
"%r-%", "%r*%", "%r/%", "%r+%", "%rr%", "add_stub", "add_vars",
"add_vars<-", "all_funs", "all_identical", "all_obj_equal", "allNA",
"alloc", "allv", "any_duplicated", "anyv", "as_character_factor",
-"as_factor_GRP", "as_factor_qG", "as_numeric_factor", "as.character_factor",
+"as_factor_GRP", "as_factor_qG", "as_numeric_factor", "as_integer_factor", "as.character_factor",
"as.factor_GRP", "as.factor_qG", "as.numeric_factor", "atomic_elem",
"atomic_elem<-", "av", "av<-", "B", "BY", "BY.data.frame", "BY.default",
"BY.matrix", "cat_vars", "cat_vars<-", "char_vars", "char_vars<-",
@@ -177,7 +177,7 @@ get_collapse <- function(opts = NULL) if(is.null(opts)) as.list(.op) else if(len
.COLLAPSE_ALL <- sort(unique(c("%-=%", "%!=%", "%!iin%", "%!in%", "%*=%", "%/=%", "%+=%", "%=%", "%==%", "%c-%", "%c*%", "%c/%", "%c+%",
"%cr%", "%iin%", "%r-%", "%r*%", "%r/%", "%r+%", "%rr%", "add_stub", "add_vars", "add_vars<-", "all_funs",
"all_identical", "all_obj_equal", "allNA", "alloc", "allv", "any_duplicated", "anyv", "as_character_factor",
-"as_factor_GRP", "as_factor_qG", "as_numeric_factor", "atomic_elem", "atomic_elem<-", "av", "av<-", "B", "BY",
+"as_factor_GRP", "as_factor_qG", "as_numeric_factor", "as_integer_factor", "atomic_elem", "atomic_elem<-", "av", "av<-", "B", "BY",
"cat_vars", "cat_vars<-", "char_vars", "char_vars<-", "cinv", "ckmatch", "collap", "collapg", "collapv", "colorder",
"colorderv", "copyAttrib", "copyMostAttrib", "copyv", "D", "dapply", "date_vars", "Date_vars", "date_vars<-",
"Date_vars<-", "descr", "Dlog", "fact_vars", "fact_vars<-", "fbetween", "fcompute", "fcomputev", "fcount",
10 changes: 5 additions & 5 deletions R/join.R
@@ -180,7 +180,7 @@ join <- function(x, y,
# Core: do the joins
res <- switch(how,
left = {
-y_res <- .Call(C_subsetDT, y, m, seq_along(y)[-iyon], if(count) attr(m, "N.nomatch") else TRUE)
+y_res <- if(identical(unattrib(m), seq_row(y))) y[-iyon] else .Call(C_subsetDT, y, m, seq_along(y)[-iyon], if(count) attr(m, "N.nomatch") else TRUE)
c(x, y_res)
},
inner = {
@@ -193,7 +193,7 @@ join <- function(x, y,
# if(length(rn)) ax[["row.names"]] <- if(is.numeric(rn) || is.null(rn) || rn[1L] == "1")
# .set_row_names(length(x_ind)) else Csv(rn, x_ind)
}
-y_res <- .Call(C_subsetDT, y, m, seq_along(y)[-iyon], FALSE)
+y_res <- if(identical(unattrib(m), seq_row(y))) y[-iyon] else .Call(C_subsetDT, y, m, seq_along(y)[-iyon], FALSE)
c(x, y_res)
},
full = {
@@ -209,7 +209,7 @@ join <- function(x, y,
}
}
if(cond) { # TODO: special case ? 1 distinct value etc.??
-tind <- seq_len(tsize)[-um] # TODO: Table may not be unique.
+tind <- if(length(um)) seq_len(tsize)[-um] else seq_len(tsize) # TODO: Table may not be unique.
res_nrow <- length(m) + length(tind)
x_res <- .Call(C_subsetDT, x, seq_len(res_nrow), seq_along(x)[-ixon], TRUE) # Need check here because oversize indices !!
y_res <- .Call(C_subsetDT, y, vec(list(m, tind)), seq_along(y)[-iyon], TRUE) # Need check here because oversize indices !!
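Note on the `tind` change in this hunk: in R, negative indexing with a zero-length vector selects nothing rather than everything. That seems to be what let a full join with no matches degrade into a left join (the #574 entry in NEWS.md). A minimal base R sketch, where the values of `tsize` and `um` are stand-ins chosen for illustration and not the actual objects inside `join()`:

```r
# Stand-in values: 5 rows in the table, none of them matched.
tsize <- 5L
um <- integer(0)

# Old expression: -integer(0) is still integer(0), so nothing is selected.
naive <- seq_len(tsize)[-um]

# Guarded expression as in the new line: keep all rows when um is empty.
tind <- if (length(um)) seq_len(tsize)[-um] else seq_len(tsize)

print(naive)
print(tind)
```

With the guard, the unmatched rows are all retained when there are no matches at all, which is the case the old expression dropped.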
@@ -225,12 +225,12 @@ join <- function(x, y,
}
} else { # If all elements of table are matched, this is simply a left join
how <- if(multiple == 2L) "left_setrn" else "left"
-y_res <- .Call(C_subsetDT, y, m, seq_along(y)[-iyon], if(count) attr(m, "N.nomatch") else TRUE) # anyNA(um) ??
+y_res <- if(identical(unattrib(m), seq_row(y))) y[-iyon] else .Call(C_subsetDT, y, m, seq_along(y)[-iyon], if(count) attr(m, "N.nomatch") else TRUE) # anyNA(um) ??
c(x, y_res)
}
},
right = {
-x_res <- .Call(C_subsetDT, x, m, seq_along(x)[-ixon], if(count) attr(m, "N.nomatch") else TRUE)
+x_res <- if(identical(unattrib(m), seq_row(x))) x[-ixon] else .Call(C_subsetDT, x, m, seq_along(x)[-ixon], if(count) attr(m, "N.nomatch") else TRUE)
# if(length(ax[["row.names"]])) ax[["row.names"]] <- .set_row_names(length(m))
y_on <- y[iyon]
names(y_on) <- xon
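The `identical(unattrib(m), seq_row(y))` shortcut added in these branches can be illustrated in base R. The sketch below is only an analogue: it assumes `match()` as a stand-in for `fmatch()` and `seq_len(nrow(y))` for `seq_row(y)`, and the two data frames are made up.

```r
# Two made-up tables whose keys line up row by row.
x <- data.frame(id = 1:4, a = letters[1:4])
y <- data.frame(id = 1:4, b = c(2.5, 3.5, 4.5, 5.5))

# Row of y matched by each row of x.
m <- match(x$id, y$id)

# If the match vector is exactly 1..nrow(y), the rows already align,
# so y can be used as is and the row subset (a full copy) is skipped.
rows_align <- identical(m, seq_len(nrow(y)))
y_res <- if (rows_align) y[-1L] else y[m, -1L, drop = FALSE]

res <- cbind(x, y_res)
print(res)
```

`identical()` also compares attributes, which is presumably why the diff strips them from `m` with `unattrib()` before the comparison.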
30 changes: 27 additions & 3 deletions R/pivot.R
@@ -100,6 +100,16 @@ add_labels <- function(l, labs) {
.Call(C_setvlabels, l, "label", labs, NULL)
}

+apply_external_FUN <- function(data, g, FUN, args, name) {
+FUN <- match.fun(FUN)
+if(is.null(args)) {
+if(any(name == .FAST_STAT_FUN)) return(FUN(data, g = g, TRA = "fill"))
+return(TRA(data, BY(data, g, FUN, use.g.names = FALSE, reorder = FALSE), "fill", g))
+}
+if(any(name == .FAST_STAT_FUN)) return(do.call(FUN, c(list(x = data, g = g, TRA = "fill"), args)))
+TRA(data, do.call(BY, c(list(x = data, g = g, FUN = FUN, use.g.names = FALSE, reorder = FALSE), args)), "fill", g)
+}

# TODO: Think about: values could be list input, names only atomic. that would make more sense...
# Or: allow for both options... needs to be consistent with "labels" though...
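As far as this diff shows, `apply_external_FUN()` replaces every value with its group statistic (the `"fill"` transformation), after which the callers reset `FUN` to `"last"` so that the C code picks one value per cell. A rough base R analogue of that two-step idea, using `ave()` and `tapply()` as stand-ins for the collapse internals and a made-up data frame:

```r
# Made-up long-format data with duplicate (id, var) combinations.
d <- data.frame(id  = c(1, 1, 1, 2, 2),
                var = c("a", "a", "b", "a", "b"),
                val = c(1, 2, 3, 4, 5))

# Step 1: fill each (id, var) group with its statistic, here the median.
filled <- ave(d$val, d$id, d$var, FUN = median)

# Step 2: pivot wide, keeping the last value found in each cell.
last_val <- function(v) v[length(v)]
wide_two_step <- tapply(filled, list(d$id, d$var), last_val)

# Direct aggregation, for comparison with the two-step result.
wide_direct <- tapply(d$val, list(d$id, d$var), median)

print(wide_two_step)
```

Because every element of a group holds the same filled value, taking the last one is equivalent to aggregating directly.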

@@ -133,6 +143,8 @@ pivot <- function(data,
na.rm = FALSE,
factor = c("names", "labels"),
check.dups = FALSE,
+FUN = "last",
+FUN.args = NULL,
nthreads = .op[["nthreads"]],
fill = NULL, # Fill is for pivot_wider
drop = TRUE, # Same as with dcast()
@@ -293,13 +305,21 @@ pivot <- function(data,
if(length(values) > 1L) { # Multiple columns, as in dcast... TODO: check pivot_wider
namv <- names(data)[values]
attributes(data) <- NULL
-value_cols <- lapply(data[values], function(x) .Call(C_pivot_wide, g, g_v, x, fill, nthreads))
+if(!is.character(FUN)) {
+data[values] <- apply_external_FUN(data[values], group(list(g, g_v)), FUN, FUN.args, l1orlst(as.character(substitute(FUN))))
+FUN <- "last"
+}
+value_cols <- lapply(data[values], function(x) .Call(C_pivot_wide, g, g_v, x, fill, nthreads, FUN, na.rm))
if(length(labels)) value_cols <- lapply(value_cols, add_labels, labels)
value_cols <- unlist(if(transpose[1L]) t_list2(value_cols) else value_cols, FALSE, FALSE)
namv_res <- if(transpose[2L]) t(outer(names, namv, paste, sep = "_")) else outer(namv, names, paste, sep = "_")
names(value_cols) <- if(transpose[1L]) namv_res else t(namv_res)
} else {
-value_cols <- .Call(C_pivot_wide, g, g_v, data[[values]], fill, nthreads)
+if(!is.character(FUN)) {
+data[[values]] <- apply_external_FUN(data[[values]], group(list(g, g_v)), FUN, FUN.args, l1orlst(as.character(substitute(FUN))))
+FUN <- "last"
+}
+value_cols <- .Call(C_pivot_wide, g, g_v, data[[values]], fill, nthreads, FUN, na.rm)
names(value_cols) <- names
if(length(labels)) vlabels(value_cols) <- labels
}
@@ -375,7 +395,11 @@ pivot <- function(data,
namv <- names(vd)
attributes(vd) <- NULL
}
-value_cols <- lapply(vd, function(x) .Call(C_pivot_wide, g, g_v, x, fill, nthreads))
+if(!is.character(FUN)) {
+vd <- apply_external_FUN(vd, group(list(g, g_v)), FUN, FUN.args, l1orlst(as.character(substitute(FUN))))
+FUN <- "last"
+}
+value_cols <- lapply(vd, function(x) .Call(C_pivot_wide, g, g_v, x, fill, nthreads, FUN, na.rm))
if(length(id_cols)) id_cols <- .Call(C_rbindlist, alloc(id_cols, length(value_cols)), FALSE, FALSE, NULL)
value_cols <- .Call(C_rbindlist, value_cols, FALSE, FALSE, names[[2L]]) # Final column is "variable" name
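A usage sketch for the new `FUN` argument may help when reading the changes above. It assumes the documented `ids`/`names`/`values`/`how` arguments of `pivot()` and a made-up data frame; the `tapply()` call is a base R reference for the expected aggregation, not part of collapse. The `pivot()` call only runs if collapse 2.0.15 or later is installed.

```r
# Made-up long-format data with duplicate (id, var) combinations.
d <- data.frame(id  = c(1, 1, 1, 2, 2),
                var = c("a", "a", "b", "a", "b"),
                val = c(1, 2, 3, 4, 5))

# Base R reference: sum of val per (id, var) cell, laid out wide.
expected <- tapply(d$val, list(d$id, d$var), sum)
print(expected)

# The same request through pivot(), guarded by a version check.
if (requireNamespace("collapse", quietly = TRUE) &&
    utils::packageVersion("collapse") >= "2.0.15") {
  wide <- collapse::pivot(d, ids = "id", names = "var", values = "val",
                          how = "wider", FUN = "sum")
  print(wide)
}
```

According to the NEWS entry, string values such as `"sum"` are handled inside the reshaping C code, whereas a function object is routed through `apply_external_FUN()` first.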

10 changes: 10 additions & 0 deletions R/small_helper.R
@@ -501,6 +501,16 @@ as_numeric_factor <- function(X, keep.attr = TRUE) {
res
}

+as_integer_factor <- function(X, keep.attr = TRUE) {
+if(is.atomic(X)) if(keep.attr) return(ffka(X, as.integer)) else
+return(as.integer(attr(X, "levels"))[X])
+res <- duplAttributes(lapply(unattrib(X),
+if(keep.attr) (function(y) if(is.factor(y)) ffka(y, as.integer) else y) else
+(function(y) if(is.factor(y)) as.integer(attr(y, "levels"))[y] else y)), X)
+if(inherits(X, "data.table")) return(alc(res))
+res
+}

as_character_factor <- function(X, keep.attr = TRUE) {
if(is.atomic(X)) if(keep.attr) return(ffka(X, tochar)) else
return(as.character.factor(X))
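The `keep.attr = FALSE` branch of `as_integer_factor()` converts the levels once and then indexes them by the factor itself. In base R terms, with a made-up factor, the idiom looks roughly like this:

```r
# Made-up factor whose levels are integer-like strings.
f <- factor(c("10", "20", "10", "30"))

# Convert the levels once, then expand them via the factor's integer codes.
int_vals <- as.integer(levels(f))[f]

# The same idiom with as.numeric() yields doubles instead of integers.
num_vals <- as.numeric(levels(f))[f]

# Note: as.integer(f) returns the internal codes, not the level values.
codes <- as.integer(f)

print(int_vals)
print(typeof(int_vals))
print(codes)
```

R stores integers in 4 bytes and doubles in 8, which is presumably the memory inefficiency the NEWS entry attributes to `as_numeric_factor()` when the levels are integer-valued.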
4 changes: 2 additions & 2 deletions README.md
@@ -11,7 +11,7 @@
[![Conda Downloads](https://img.shields.io/conda/dn/conda-forge/r-collapse.svg)](https://anaconda.org/conda-forge/r-collapse)
[![Codecov test coverage](https://codecov.io/gh/SebKrantz/collapse/branch/master/graph/badge.svg)](https://app.codecov.io/gh/SebKrantz/collapse?branch=master)
[![minimal R version](https://img.shields.io/badge/R%3E%3D-3.3.0-6666ff.svg)](https://cran.r-project.org/)
-[![status](https://tinyverse.netlify.com/badge/collapse)](https://CRAN.R-project.org/package=collapse)
+[![dependencies](https://tinyverse.netlify.app/badge/collapse)](https://CRAN.R-project.org/package=collapse)
[![DOI](https://zenodo.org/badge/172910283.svg)](https://zenodo.org/badge/latestdoi/172910283)
[![arXiv](https://img.shields.io/badge/arXiv-2403.05038-0969DA.svg)](https://arxiv.org/abs/2403.05038)
<!-- badges: end -->
@@ -21,7 +21,7 @@
* To facilitate complex data transformation, exploration and computing tasks in R.
* To help make R code fast, flexible, parsimonious and programmer friendly.

-It further implements a [class-agnostic approach to R programming](https://sebkrantz.github.io/collapse/articles/collapse_object_handling.html), supporting base R, *tibble*, *grouped_df* (*tidyverse*), *data.table*, *sf*, *pseries*, *pdata.frame* (*plm*), and preserving many others (e.g. *units*, *xts*/*zoo*, *tsibble*).
+It further implements a [class-agnostic approach to R programming](https://sebkrantz.github.io/collapse/articles/collapse_object_handling.html), supporting base R, *tibble*, *grouped_df* (*tidyverse*), *data.table*, *sf*, *units*, *pseries*, *pdata.frame* (*plm*), and *xts*/*zoo*.

**Key Features:**

2 changes: 1 addition & 1 deletion _pkgdown.yml
@@ -210,6 +210,7 @@ articles:
contents:
- collapse_documentation
- collapse_for_tidyverse_users
+- collapse_and_sf
- collapse_object_handling
- title: Legacy (Pre v1.7)
desc: Vignettes that cover functionality of versions <1.7. These
@@ -219,5 +220,4 @@ articles:
- collapse_and_dplyr
- collapse_and_data.table
- collapse_and_plm
-- collapse_and_sf

2 changes: 1 addition & 1 deletion man/collapse-documentation.Rd
@@ -26,7 +26,7 @@ The following table fully summarizes the contents of \emph{\link{collapse}}. The
\link[=fast-data-manipulation]{Fast Data Manipulation} \tab\tab Fast and flexible select, subset, summarise, mutate/transform, sort/reorder, combine, join, reshape, rename and relabel data. Some functions modify by reference and/or allow assignment. In addition a set of (standard evaluation) functions for fast selecting, replacing or adding data frame columns, including shortcuts to select and replace variables by data type.
\tab\tab \code{\link[=fselect]{fselect(<-)}}, \code{\link[=fsubset]{fsubset/ss}}, \code{\link{fsummarise}}, \code{\link{fmutate}}, \code{\link{across}}, \code{\link[=ftransform]{(f/set)transform(v)(<-)}}, \code{\link[=fcompute]{fcompute(v)}}, \code{\link[=roworder]{roworder(v)}}, \code{\link[=colorder]{colorder(v)}}, \code{\link{rowbind}}, \code{\link{join}}, \code{\link{pivot}}, \code{\link[=frename]{(f/set)rename}}, \code{\link[=relabel]{(set)relabel}}, \code{\link[=get_vars]{get_vars(<-)}}, \code{\link[=add_vars]{add_vars(<-)}}, \code{\link[=num_vars]{num_vars(<-)}}, \code{\link[=cat_vars]{cat_vars(<-)}}, \code{\link[=char_vars]{char_vars(<-)}}, \code{\link[=fact_vars]{fact_vars(<-)}}, \code{\link[=logi_vars]{logi_vars(<-)}}, \code{\link[=date_vars]{date_vars(<-)}} \cr \cr \cr
-\link[=quick-conversion]{Quick Data Conversion} \tab\tab Quick conversions: data.frame <> data.table <> tibble <> matrix (row- or column-wise) <> list | array > matrix, data.frame, data.table, tibble | vector > factor, matrix, data.frame, data.table, tibble; and converting factors / all factor columns. \tab\tab \code{qDF}, \code{qDT}, \code{qTBL}, \code{qM}, \code{qF}, \code{mrtl}, \code{mctl}, \code{as_numeric_factor}, \code{as_character_factor} \cr \cr \cr
+\link[=quick-conversion]{Quick Data Conversion} \tab\tab Quick conversions: data.frame <> data.table <> tibble <> matrix (row- or column-wise) <> list | array > matrix, data.frame, data.table, tibble | vector > factor, matrix, data.frame, data.table, tibble; and converting factors / all factor columns. \tab\tab \code{qDF}, \code{qDT}, \code{qTBL}, \code{qM}, \code{qF}, \code{mrtl}, \code{mctl}, \code{as_numeric_factor}, \code{as_integer_factor}, \code{as_character_factor} \cr \cr \cr
\link[=advanced-aggregation]{Advanced Data Aggregation} \tab\tab Fast and easy (weighted and parallelized) aggregation of multi-type data, with different functions applied to numeric and categorical variables. Custom specifications allow mappings of functions to variables + renaming. \tab\tab \code{collap(v/g)} \cr \cr \cr
2 changes: 1 addition & 1 deletion man/collapse-package.Rd
@@ -12,7 +12,7 @@ Advanced and Fast Data Transformation
\item To help make R code fast, flexible, parsimonious and programmer friendly. % \emph{collapse} is a fast %to facilitate (advanced) data manipulation in R % To achieve the latter,
% collapse provides a broad set.. -> Nah, its not a misc package
}
-It is made compatible with the \emph{tidyverse}, \emph{data.table}, \emph{sf} and the \emph{plm} approach to panel data, and non-destructively handles other classes such as \emph{xts}.
+It is made compatible with the \emph{tidyverse}, \emph{data.table}, \emph{sf}, \emph{units}, \emph{xts/zoo}, and the \emph{plm} approach to panel data.

}
\section{Getting Started}{
Expand Down