Releases: tidyverse/dplyr

dplyr 1.1.4

17 Nov 18:12
  • join_by() now allows its helper functions to be namespaced with dplyr::,
    like join_by(dplyr::between(x, lower, upper)) (#6838).

  • left_join() and friends now return a specialized error message if they
    detect that your join would return more rows than dplyr can handle (#6912).

  • slice_*() now throw the correct error if you forget to name n while also
    prefixing the call with dplyr:: (#6946).

  • dplyr_reconstruct()'s default method has been rewritten to avoid
    materializing duckplyr queries too early (#6947).

  • Updated the storms data to include 2022 data (#6937, @steveharoz).

  • Updated the starwars data to use a new API, because the old one is defunct.
    There are very minor changes to the data itself (#6938, @steveharoz).
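
As a minimal sketch of the namespaced join_by() helper from the first bullet of this release, assuming dplyr 1.1.4 or later is installed. The data frames and column names are made up for illustration:

```r
library(dplyr)

# Hypothetical data: values and the ranges they may fall into.
x <- tibble(id = 1:3, value = c(5, 15, 25))
y <- tibble(lower = c(0, 10), upper = c(9, 19), label = c("low", "mid"))

# The overlap helper is written with an explicit dplyr:: prefix.
out <- left_join(x, y, by = join_by(dplyr::between(value, lower, upper)))

# value 25 falls in no range, so its label is NA.
print(out)
```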

dplyr 1.1.3

05 Sep 14:36
  • mutate_each() and summarise_each() now throw correct deprecation messages
    (#6869).

  • setequal() now requires the input data frames to be compatible, similar to
    the other set methods like setdiff() or intersect() (#6786).

dplyr 1.1.2

20 Apr 16:58
  • count() better documents that it has a .drop argument (#6820).

  • Fixed tests to maintain compatibility with the next version of waldo (#6823).

  • Joins better handle key columns with all NAs (#6804).

dplyr 1.1.1

22 Mar 13:19
  • Mutating joins now warn about multiple matches much less often. Previously,
    a warning was thrown when a one-to-many or many-to-many relationship was
    detected between the keys of x and y. Now it is only thrown for a
    many-to-many relationship, which is much rarer and much more dangerous than
    one-to-many because it can result in a Cartesian explosion in the number of
    rows returned from the join (#6731, #6717).

    We've accomplished this in two steps:

    • multiple now defaults to "all", and the options of "error" and
      "warning" are now deprecated in favor of using relationship (see below).
      We are using an accelerated deprecation process for these two options
      because they've only been available for a few weeks, and relationship is
      a clearly superior alternative.

    • The mutating joins gain a new relationship argument, allowing you to
      optionally enforce one of the following relationship constraints between the
      keys of x and y: "one-to-one", "one-to-many", "many-to-one", or
      "many-to-many".

      For example, "many-to-one" enforces that each row in x can match at
      most 1 row in y. If a row in x matches >1 rows in y, an error is
      thrown. This option serves as the replacement for multiple = "error".

      The default behavior of relationship doesn't assume that there is any
      relationship between x and y. However, for equality joins it will check
      for the presence of a many-to-many relationship, and will warn if it detects
      one.

    This change unfortunately does mean that if you have set multiple = "all" to
    avoid a warning and you happened to be doing a many-to-many style join, then
    you will need to replace multiple = "all" with
    relationship = "many-to-many" to silence the new warning, but we believe
    this should be rare since many-to-many relationships are fairly uncommon.

  • Fixed a major performance regression in case_when(). It is still a little
    slower than in dplyr 1.0.10, but we plan to improve this further in the future
    (#6674).

  • Fixed a performance regression related to nth(), first(), and last()
    (#6682).

  • Fixed an issue where expressions involving infix operators had an abnormally
    large amount of overhead (#6681).

  • group_data() on ungrouped data frames is faster (#6736).

  • n() is a little faster when there are many groups (#6727).

  • pick() now returns a 1 row, 0 column tibble when ... evaluates to an
    empty selection. This makes it more compatible with tidyverse recycling
    rules in some edge cases (#6685).

  • if_else() and case_when() again accept logical conditions that have
    attributes (#6678).

  • arrange() can once again sort the numeric_version type from base R
    (#6680).

  • slice_sample() now works when the input has a column named replace.
    slice_min() and slice_max() now work when the input has columns named
    na_rm or with_ties (#6725).

  • nth() now errors informatively if n is NA (#6682).

  • Joins now throw a more informative error when y doesn't have the same
    source as x (#6798).

  • All major dplyr verbs now throw an informative error message if the input
    data frame contains a column named NA or "" (#6758).

  • Deprecation warnings thrown by filter() now mention the correct package
    where the problem originated from (#6679).

  • Fixed an issue where using <- within a grouped mutate() or summarise()
    could cross contaminate other groups (#6666).

  • The compatibility vignette has been replaced with a more general vignette on
    using dplyr in packages, vignette("in-packages") (#6702).

  • The developer documentation in ?dplyr_extending has been refreshed and
    brought up to date with all changes made in 1.1.0 (#6695).

  • rename_with() now includes an example of using paste0(recycle0 = TRUE) to
    correctly handle empty selections (#6688).

  • R >= 3.5.0 is now explicitly required. This is in line with the tidyverse
    policy of supporting the 5 most recent versions of R.
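
The relationship argument described in the first bullet of this release can be sketched as follows, assuming dplyr 1.1.1 or later. The orders and customers tables are hypothetical:

```r
library(dplyr)

# Hypothetical data: several orders may point at the same customer.
orders <- tibble(order_id = 1:3, customer = c("a", "b", "a"))
customers <- tibble(customer = c("a", "b"), region = c("north", "south"))

# Each order matches at most one customer, so "many-to-one" holds.
joined <- left_join(orders, customers, by = "customer",
                    relationship = "many-to-one")

# A duplicated key in y breaks the constraint and raises an error.
customers_dup <- tibble(
  customer = c("a", "a", "b"),
  region = c("north", "east", "south")
)
violation <- tryCatch(
  left_join(orders, customers_dup, by = "customer",
            relationship = "many-to-one"),
  error = function(e) "error"
)

# Declaring the many-to-many relationship explicitly silences the warning.
expanded <- left_join(orders, customers_dup, by = "customer",
                      relationship = "many-to-many")
```

Stating the expected relationship up front turns a silent row explosion into an immediate error at the join.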

dplyr 1.1.0

30 Jan 15:12

New features

  • .by/by is an experimental alternative to group_by() that supports
    per-operation grouping for mutate(), summarise(), filter(), and the
    slice() family (#6528).

    Rather than:

    starwars %>%
      group_by(species, homeworld) %>%
      summarise(mean_height = mean(height))
    

    You can now write:

    starwars %>%
      summarise(
        mean_height = mean(height),
        .by = c(species, homeworld)
      )
    

    The main benefit is that .by only affects a single operation. In the
    example above, an ungrouped data frame went into the summarise() call, so
    an ungrouped data frame will come out; with .by, you never need to
    remember to ungroup() afterwards and you never need to use the .groups
    argument.

    Additionally, using summarise() with .by will never sort the results by
    the group key, unlike with group_by(). Instead, the results are returned
    using the existing ordering of the groups from the original data. We feel this
    is more predictable, better maintains any ordering you might have already
    applied with a previous call to arrange(), and provides a way to maintain
    the current ordering without having to resort to factors.

    This feature was inspired by data.table, where the equivalent syntax looks
    like:

    starwars[, .(mean_height = mean(height)), by = .(species, homeworld)]
    

    with_groups() is superseded in favor of .by (#6582).

  • reframe() is a new experimental verb that creates a new data frame by
    applying functions to columns of an existing data frame. It is very similar to
    summarise(), with two big differences:

    • reframe() can return an arbitrary number of rows per group, while
      summarise() reduces each group down to a single row.

    • reframe() always returns an ungrouped data frame, while summarise()
      might return a grouped or rowwise data frame, depending on the scenario.

    reframe() has been added in response to valid concern from the community
    that allowing summarise() to return any number of rows per group increases
    the chance for accidental bugs. We still feel that this is a powerful
    technique, and is a principled replacement for do(), so we have moved these
    features to reframe() (#6382).

  • group_by() now uses a new algorithm for computing groups. It is often faster
    than the previous approach (especially when there are many groups), and in
    most cases there should be no changes. The one exception is with character
    vectors, see the C locale news bullet below for more details (#4406, #6297).

  • arrange() now uses a faster algorithm for sorting character vectors, which
    is heavily inspired by data.table's forder(). See the C locale news bullet
    below for more details (#4962).

  • Joins have been completely overhauled to enable more flexible join operations
    and provide more tools for quality control. Many of these changes are inspired
    by data.table's join syntax (#5914, #5661, #5413, #2240).

    • A join specification can now be created through join_by(). This allows
      you to specify both the left and right hand side of a join using unquoted
      column names, such as join_by(sale_date == commercial_date). Join
      specifications can be supplied to any *_join() function as the by
      argument.

    • Join specifications allow for new types of joins:

      • Equality joins: The most common join, specified by ==. For example,
        join_by(sale_date == commercial_date).

      • Inequality joins: For joining on inequalities, i.e. >=, >, <, and
        <=. For example, use join_by(sale_date >= commercial_date) to find
        every commercial that aired before a particular sale.

      • Rolling joins: For "rolling" the closest match forward or backwards when
        there isn't an exact match, specified by using the rolling helper,
        closest(). For example,
        join_by(closest(sale_date >= commercial_date)) to find only the most
        recent commercial that aired before a particular sale.

      • Overlap joins: For detecting overlaps between sets of columns, specified
        by using one of the overlap helpers: between(), within(), or
        overlaps(). For example, use
        join_by(between(commercial_date, sale_date_lower, sale_date)) to
        find commercials that aired before a particular sale, as long as they
        occurred after some lower bound, such as 40 days before the sale was made.

      Note that you cannot use arbitrary expressions in the join conditions, like
      join_by(sale_date - 40 >= commercial_date). Instead, use mutate() to
      create a new column containing the result of sale_date - 40 and refer
      to that by name in join_by().

    • multiple is a new argument for controlling what happens when a row
      in x matches multiple rows in y. For equality joins and rolling joins,
      where this is usually surprising, this defaults to signalling a "warning",
      but still returns all of the matches. For inequality joins, where multiple
      matches are usually expected, this defaults to returning "all" of the
      matches. You can also return only the "first" or "last" match, "any"
      of the matches, or you can "error".

    • keep now defaults to NULL rather than FALSE. NULL implies
      keep = FALSE for equality conditions, but keep = TRUE for inequality
      conditions, since you generally want to preserve both sides of an
      inequality join.

    • unmatched is a new argument for controlling what happens when a row
      would be dropped because it doesn't have a match. For backwards
      compatibility, the default is "drop", but you can also choose to
      "error" if dropped rows would be surprising.

  • across() gains an experimental .unpack argument to optionally unpack
    (as in, tidyr::unpack()) data frames returned by functions in .fns
    (#6360).

  • consecutive_id() for creating groups based on contiguous runs of the
    same values, like data.table::rleid() (#1534).

  • case_match() is a "vectorised switch" variant of case_when() that matches
    on values rather than logical expressions. It is like a SQL "simple"
    CASE WHEN statement, whereas case_when() is like a SQL "searched"
    CASE WHEN statement (#6328).

  • cross_join() is a more explicit and slightly more correct replacement for
    using by = character() during a join (#6604).

  • pick() makes it easy to access a subset of columns from the current group.
    pick() is intended as a replacement for across(.fns = NULL), cur_data(),
    and cur_data_all(). We feel that pick() is a much more evocative name when
    you are just trying to select a subset of columns from your data (#6204).

  • symdiff() computes the symmetric difference (#4811).
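
A few of the new features above can be tried in one short session. This is a minimal sketch, assuming dplyr 1.1.0 or later; the sales and commercials tables reuse the column names from the join_by() bullets, but the data itself is made up:

```r
library(dplyr)

# Hypothetical sales and commercials.
sales <- tibble(
  sale_id = 1:2,
  sale_date = as.Date(c("2023-01-10", "2023-02-01"))
)
commercials <- tibble(
  commercial_id = 1:3,
  commercial_date = as.Date(c("2023-01-01", "2023-01-05", "2023-01-20"))
)

# Inequality join: every commercial that aired on or before each sale.
all_prior <- inner_join(
  sales, commercials,
  by = join_by(sale_date >= commercial_date)
)

# Rolling join: only the most recent commercial on or before each sale.
most_recent <- inner_join(
  sales, commercials,
  by = join_by(closest(sale_date >= commercial_date))
)

# reframe(): any number of rows per group, result is always ungrouped.
df <- tibble(g = c("a", "a", "b"), x = c(1, 3, 10))
ranges <- reframe(df, bound = range(x), .by = g)

# case_match(): match on values rather than logical conditions.
labels <- case_match(
  c("a", "b", "z"),
  "a" ~ "first",
  "b" ~ "second",
  .default = "other"
)

# consecutive_id(): a new id each time the value changes.
runs <- consecutive_id(c("x", "x", "y", "x"))
```

The rolling join keeps one commercial per sale, while the plain inequality join keeps every earlier commercial.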

Lifecycle changes

Breaking changes

  • arrange() and group_by() now use the C locale, not the system locale,
    when ordering or grouping character vectors. This brings substantial
    performance improvements, increases reproducibility across R sessions, makes
    dplyr more consistent with data.table, and we believe it should affect little
    existing code. If it does affect your code, you can use
    options(dplyr.legacy_locale = TRUE) to quickly revert to the previous
    behavior. However, in general, we instead recommend that you use the new
    .locale argument to precisely specify the desired locale. For a full
    explanation, please read the associated grouping and ordering tidyups.

  • bench_tbls(), compare_tbls(), compare_tbls2(), eval_tbls(),
    eval_tbls2(), location() and changes(), deprecated in 1.0.0, are now
    defunct (#6387).

  • frame_data(), data_frame_(), lst_() and tbl_sum() are no longer
    re-exported from tibble (#6276, #6277, #6278, #6284).

  • select_vars(), rename_vars(), select_var() and current_vars(),
    deprecated in 0.8.4, are now defunct (#6387).
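
The effect of the C locale change on arrange() can be seen with a small sketch, assuming dplyr 1.1.0 or later and that options(dplyr.legacy_locale = TRUE) has not been set. The data is made up for illustration:

```r
library(dplyr)

df <- tibble(x = c("a", "B", "c"))

# In the C locale, uppercase letters sort before lowercase ones.
c_sorted <- arrange(df, x)

# The same result, with the locale stated explicitly.
c_explicit <- arrange(df, x, .locale = "C")

# A non-C locale such as .locale = "en" needs the stringi package,
# so it is not run here.
print(c_sorted$x)
```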

Newly deprecated

  • across(), c_across(), if_any(), and if_all() now require the
    .cols and .fns arguments. In general, we now recommend that you use
    pick() instead of an empty across() call or across() with no .fns
    (e.g. across(c(x, y))) (#6523).

    • Relying on the previous default of .cols = everything() is deprecated.
      We have skipped the soft-deprecation stage in this case, because indirect
      usage of across() and friends in this way is rare.

    • Relying on the previous default of .fns = NULL is not yet formally
      soft-deprecated, because there was no good alternative until now, but it is
      discouraged and will be soft-deprecated in the next minor release.

  • Passing ... to across() is soft-deprecated because it's ambiguous when
    those arguments are evaluated. Now, instead of (e.g.)
    across(a:b, mean, na.rm = TRUE) you should write
    across(a:b, ~ mean(.x, na.rm = TRUE)) (#6073).

  • all_equal() is deprecated. We've advised against it for some time, and
    we explicitly recommend you use all.equal(), manually reordering the rows
    and columns as needed (#6324).

  • cur_data() and cur_data_all() are soft-deprecated in favour of
    pick() (#6204).

  • Using by = character() to perform a cross join is now soft-deprecated in
    favor of cross_join() (#6604).

  • filter()ing with a 1-column matrix is deprecated (#6091).

  • progress_estimate() is deprecated for all uses (#6387).

  • Using `su...


dplyr 1.0.10

01 Sep 11:35

Hot patch release to resolve R CMD check failures.

dplyr 1.0.9

28 Apr 14:23
a6c1417
  • New rows_append() which works like rows_insert() but ignores keys and
    allows you to insert arbitrary rows with a guarantee that the type of x
    won't change (#6249, thanks to @krlmlr for the implementation and @mgirlich
    for the idea).

  • The rows_*() functions no longer require that the key values in x uniquely
    identify each row. Additionally, rows_insert() and rows_delete() no
    longer require that the key values in y uniquely identify each row. Relaxing
    this restriction should make these functions more practically useful for
    data frames, and alternative backends can enforce this in other ways as needed
    (i.e. through primary keys) (#5553).

  • rows_insert() gained a new conflict argument allowing you greater control
    over rows in y with keys that conflict with keys in x. A conflict arises
    if a key in y already exists in x. By default, a conflict results in an
    error, but you can now also "ignore" these y rows. This is very similar to
    the ON CONFLICT DO NOTHING command from SQL (#5588, with helpful additions
    from @mgirlich and @krlmlr).

  • rows_update(), rows_patch(), and rows_delete() gained a new unmatched
    argument allowing you greater control over rows in y with keys that are
    unmatched by the keys in x. By default, an unmatched key results in an
    error, but you can now also "ignore" these y rows (#5984, #5699).

  • rows_delete() no longer requires that the columns of y be a strict subset
    of x. Only the columns specified through by will be utilized from y,
    all others will be dropped with a message.

  • The rows_*() functions now always retain the column types of x. This
    behavior was documented, but previously wasn't being applied correctly
    (#6240).

  • The rows_*() functions now fail elegantly if y is a zero column data frame
    and by isn't specified (#6179).
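
The new conflict and unmatched arguments, together with rows_append(), can be sketched like this, assuming dplyr 1.0.9 or later. The x and y data frames are hypothetical:

```r
library(dplyr)

x <- tibble(id = 1:2, value = c("a", "b"))
y <- tibble(id = 2:3, value = c("B", "c"))

# id 2 already exists in x, so that y row is skipped; id 3 is inserted.
inserted <- rows_insert(x, y, by = "id", conflict = "ignore")

# id 3 has no match in x, so that y row is skipped; id 2 is updated.
updated <- rows_update(x, y, by = "id", unmatched = "ignore")

# rows_append() ignores keys and simply adds every row of y.
appended <- rows_append(x, y)
```

Without conflict = "ignore" or unmatched = "ignore", the first two calls would fail with an error, which is the default described above.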

dplyr 1.0.8

08 Feb 09:41
  • Better display of error messages thanks to rlang 1.0.0.

  • mutate(.keep = "none") is no longer identical to transmute().
    transmute() has not been changed, and completely ignores the column ordering
    of the existing data, instead relying on the ordering of expressions
    supplied through .... mutate(.keep = "none") has been changed to ensure
    that pre-existing columns are never moved, which aligns more closely with the
    other .keep options (#6086).

  • filter() forbids matrix results (#5973) and warns about data frame
    results, especially data frames created from across() with a hint
    to use if_any() or if_all().

  • slice() helpers (slice_head(), slice_tail(), slice_min(), slice_max())
    now accept negative values for n and prop (#5961).

  • slice() now indicates which group produces an error (#5931).

  • cur_data() and cur_data_all() don't simplify list columns in rowwise data frames (#5901).

  • dplyr now uses rlang::check_installed() to prompt you whether to install
    required packages that are missing.

  • storms data updated to 2020 (@steveharoz, #5899).

  • coalesce() accepts 1-D arrays (#5557).

  • The deprecated trunc_mat() is no longer reexported from dplyr (#6141).
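
The difference between mutate(.keep = "none") and transmute() described above comes down to column order. A minimal sketch, assuming dplyr 1.0.8 or later, with a made-up data frame:

```r
library(dplyr)

df <- tibble(x = 1, y = 2)

# transmute() follows the order of the expressions: y, then x.
by_expression <- transmute(df, y = y * 10, x = x * 10)

# mutate(.keep = "none") leaves existing columns where they were: x, then y.
by_position <- mutate(df, y = y * 10, x = x * 10, .keep = "none")

print(names(by_expression))
print(names(by_position))
```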

dplyr 1.0.7

19 Jun 09:03
  • across() uses the formula environment when inlining lambda-formulas (#5886).

  • summarise.rowwise_df() is quiet when the result is ungrouped (#5875).

  • c_across() and across() key deparsing is no longer confused by long calls (#5883).

  • across() handles named selections (#5207).

dplyr 1.0.6

05 May 16:01
  • add_count() is now generic (#5837).

  • if_any() and if_all() abort when a predicate is mistakenly used as .cols= (#5732).

  • Multiple calls to if_any() and/or if_all() in the same expression are now
    properly disambiguated (#5782).

  • filter() now inlines if_any() and if_all() expressions. This greatly
    improves performance with grouped data frames.

  • Fixed behaviour of ... in top-level across() calls (#5813, #5832).

  • across() now inlines lambda-formulas. This is slightly more performant and
    will allow more optimisations in the future.

  • Fixed issue in bind_rows() causing lists to be incorrectly transformed as
    data frames (#5417, #5749).

  • select() no longer creates duplicate variables when renaming a variable
    to the same name as a grouping variable (#5841).

  • dplyr_col_select() keeps attributes for bare data frames (#5294, #5831).

  • Fixed quosure handling in dplyr::group_by() that caused issues with extra
    arguments (tidyverse/lubridate#959).

  • Removed the name argument from the compute() generic (@ianmcook, #5783).

  • row-wise data frames of 0 rows and list columns are supported again (#5804).
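
For context on the if_any() and if_all() bullets above, here is a minimal usage sketch inside filter(). The data frame is made up for illustration:

```r
library(dplyr)

df <- tibble(a = c(1, -1, 2), b = c(-1, -1, 3))

# Keep rows where at least one of a or b is positive.
any_positive <- filter(df, if_any(c(a, b), ~ .x > 0))

# Keep rows where both a and b are positive.
all_positive <- filter(df, if_all(c(a, b), ~ .x > 0))
```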