Releases · LudvigOlsen/groupdata2

groupdata2 2.0.0

This version introduces collapse_groups() and friends, as well as summarize_balances() and ranked_balances(). It also improves numerical balancing in fold(), which breaks reproducibility with earlier versions.

Changes

Breaking: The numerical balancing (num_col) in fold() gets multiple improvements. This breaks reproducibility in some contexts.
- Fixes bug with selection of groups to redistribute when extreme_pairing_levels > 1. The groupings were likely to be fine, but the fix should give better groupings on average.
- When possible, it redistributes the smallest and/or largest group if they are 1 standard deviation from the second smallest/largest group to avoid imbalances due to very small/large scores.
- Adds use of extreme triplet grouping when too few grouping columns are created with extreme pairing. This can lead to an increase in the number of created fold columns. In some cases, these groupings may be more balanced than with extreme pairing, but on average extreme pairing leads to more balanced groupings. See rearrr::triplet_extremes() for more on extreme triplet grouping.
- Adds argument use_of_triplets in fold() to allow using extreme triplet grouping instead of extreme pairing or disabling it completely.
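The options above can be sketched as follows. This is a minimal illustration, not taken from the release itself: the toy data is hypothetical, and the values "fill", "instead" and "never" for use_of_triplets are assumed from the package documentation, so check ?fold before relying on them.

```r
library(groupdata2)

set.seed(1)

# Hypothetical toy data: 30 participants with a numeric score
df <- data.frame(
  participant = factor(1:30),
  score = round(rnorm(30, mean = 50, sd = 10))
)

# Assumed default: extreme pairing, with triplets only used as a fallback
folded_fill <- fold(df, k = 3, num_col = "score", use_of_triplets = "fill")

# Assumed value for using extreme triplet grouping instead of extreme pairing
folded_instead <- fold(df, k = 3, num_col = "score", use_of_triplets = "instead")

# Assumed value for disabling extreme triplet grouping completely
folded_never <- fold(df, k = 3, num_col = "score", use_of_triplets = "never")

# Compare the mean score per fold for one of the settings
aggregate(score ~ .folds, data = as.data.frame(folded_fill), FUN = mean)
```

Because the balancing changed in this version, the fold assignments from the same seed will generally differ from those produced by earlier versions.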
Adds collapse_groups() for collapsing a set of existing groups into a smaller set of groups. Can balance the new groups by size and by numeric, categorical and ID columns. The more of these you balance at a time, the less balanced each will tend to be. Compare settings by summarizing the balances with summarize_balances() afterwards. For creating the most balanced groups, enable auto_tune.
Adds collapse_groups_by_size(), collapse_groups_by_numeric(), collapse_groups_by_levels(), and collapse_groups_by_ids(). These are wrappers of collapse_groups() for a simplified interface.
Adds summarize_balances() for inspecting the balance of numeric, categorical, and ID columns in-and-between groups.
Adds ranked_balances() for extracting the across-group standard deviations of balances from the output of summarize_balances(). The standard deviations are a measure of how balanced a split is.
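A minimal sketch of how these functions might be combined, assuming a hypothetical dataset with 12 existing groups; the column names and data are made up for illustration, and ".coll_groups" is assumed to be the default name of the new grouping column.

```r
library(groupdata2)

set.seed(42)

# Hypothetical data: 12 existing groups (classes) with 10 rows each
df <- data.frame(
  class = factor(rep(1:12, each = 10)),
  age = sample(18:65, 120, replace = TRUE),
  diagnosis = factor(sample(c("a", "b"), 120, replace = TRUE))
)

# Collapse the 12 classes into 3 new groups, balancing group size,
# a numeric column and a categorical column at the same time
collapsed <- collapse_groups(
  data = df,
  n = 3,
  group_cols = "class",
  num_cols = "age",
  cat_cols = "diagnosis",
  balance_size = TRUE
)

# Summarize the balance of the new groups
balances <- summarize_balances(
  data = collapsed,
  group_cols = ".coll_groups",
  num_cols = "age",
  cat_cols = "diagnosis"
)

# Across-group standard deviations; lower values indicate a more balanced split
ranked_balances(balances)
```

Rows that shared a class before collapsing stay together in the same new group; only whole classes are moved around.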
Adds "every" method to grouping functions. Groups every n-th data point together.
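A small sketch of calling the "every" method through group(), using a hypothetical ten-row data frame. The resulting assignment is printed rather than stated here, so the exact grouping can be checked directly.

```r
library(groupdata2)

# Hypothetical toy data with ten rows
df <- data.frame(x = 1:10)

# Apply the "every" method with n = 3
grouped <- group(df, n = 3, method = "every")

# Inspect which rows ended up in which group
as.data.frame(grouped)

# Number of rows per group
table(grouped$.groups)
```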
Prepares package's tests for checkmate 2.1.0.

03 Jul 13:03

LudvigOlsen

v1.5.0

d28b0c6

groupdata2 1.5.0

Breaking: Rewrites large parts of the numerical balancing engine used in fold() and partition(). This produces different groups in some cases. Outsources extreme pairing to rearrr::pair_extremes(). Now uses hierarchical shuffling (rearrr::shuffle_hierarchy()) in partition() and some cases of fold() (relevant when extreme_pairing_levels > 1).
If you need reproducibility, the last version prior to this breaking change can be installed with devtools::install_github("ludvigolsen/groupdata2@v1.4.2").
Imports rearrr for use in numerical balancing.
Minor improvements to vignettes.

19 Jun 20:03

LudvigOlsen

v1.4.2

5654ef4

groupdata2 1.4.2

Improves documentation for core grouping functions.

06 Mar 20:58

LudvigOlsen

v1.4.1

aa3e8f9

groupdata2 1.4.1

Adds summarize_group_cols() for finding the number of groups per fold column along with statistics about the number of rows per group.
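As a rough illustration, summarize_group_cols() could be applied to several fold columns created by fold(). The data below is hypothetical, and the fold columns are looked up by name pattern rather than assumed.

```r
library(groupdata2)

set.seed(1)

# Hypothetical toy data
df <- data.frame(
  id = factor(1:20),
  score = rnorm(20)
)

# Create multiple fold columns
folded <- fold(df, k = 4, num_fold_cols = 3)
folded <- dplyr::ungroup(folded)

# Find the created fold columns by their name prefix
fold_cols <- grep("^\\.folds", colnames(folded), value = TRUE)

# Number of groups per fold column and statistics on rows per group
summarize_group_cols(data = folded, group_cols = fold_cols)
```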
Breaking: Fixes internal sorting of fold columns. This sometimes changes the order of fold columns, compared to the previous version.
Adds tidyr as a required dependency. Previously, it was suggested.

20 Feb 16:11

LudvigOlsen

v1.4.0

95ca697

groupdata2 1.4.0

Breaking: In fold(), the k argument can now be a multi-element vector with one k (number of folds) per fold column. This functionality required a minor rewrite, which is why fold column names may be swapped compared to previous versions.
Bug fix: In fold() and partition(), specifying multiple cat_col columns together with num_col in the same call would fail. This now works.
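Both changes in this release can be sketched with a hypothetical dataset; the column names below are illustrative only.

```r
library(groupdata2)

set.seed(7)

# Hypothetical data: 48 participants, two categorical columns, one numeric
df <- data.frame(
  participant = factor(1:48),
  diagnosis = factor(rep(c("a", "b"), each = 24)),
  sex = factor(rep(c("f", "m"), times = 24)),
  age = sample(20:60, 48, replace = TRUE)
)

# One k per fold column: one column with 3 folds and one with 4 folds
folded <- fold(df, k = c(3, 4), num_fold_cols = 2)

# Check how many folds each created fold column contains
sapply(
  as.data.frame(folded)[, grep("^\\.folds", colnames(folded)), drop = FALSE],
  function(col) length(unique(col))
)

# Multiple cat_col columns combined with num_col in the same call
folded_balanced <- fold(
  df,
  k = 3,
  cat_col = c("diagnosis", "sex"),
  num_col = "age"
)
```

Since the fold column names may not follow the order of k, it is safer to check the number of folds per column than to rely on the column name.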

15 Jun 15:55

LudvigOlsen

v1.3.0

56062bd

groupdata2 1.3.0

Breaking: The following functions now work with grouped data.frames (meaning that they are applied group-wise): fold(), partition(), group(), group_factor(), splt(), balance(), upsample(), downsample(), differs_from_previous(), and find_missing_starts(). A message is generated once per session, when the input is grouped, to help users understand why their code is breaking.
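The difference between ungrouped and grouped input can be sketched with group(). The data and column names here are hypothetical.

```r
library(groupdata2)
library(dplyr)

# Hypothetical data: two sessions with six rows each
df <- data.frame(
  session = rep(c("s1", "s2"), each = 6),
  value = 1:12
)

# Ungrouped input: all 12 rows are split into 2 groups
ungrouped <- group(df, n = 2, method = "n_dist")

# Grouped input: group() is applied separately within each session
by_session <- df %>%
  group_by(session) %>%
  group(n = 2, method = "n_dist")

# Compare the two results
as.data.frame(ungrouped)
as.data.frame(by_session)
```

Code written for earlier versions that passed a grouped data.frame and expected it to be treated as a whole may need a dplyr::ungroup() call before the groupdata2 function.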

06 Jun 18:42

LudvigOlsen

v1.2.1

e2a2360

groupdata2 1.2.1

checkmate compatibility.
Small speed-up of the n_dist grouping method.

