Releases · tidyverse/dplyr

17 Nov 18:12

DavisVaughan

v1.1.4

74de244

dplyr 1.1.4 Latest

join_by() now allows its helper functions to be namespaced with dplyr::,
like join_by(dplyr::between(x, lower, upper)) (#6838).
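As a rough illustration of the namespaced helper, the sketch below joins two made-up tibbles. The data and the names `x`, `y`, `value`, `lower`, `upper`, and `label` are assumptions for the example only, and it assumes dplyr 1.1.4 or later is installed.

```r
library(dplyr)

# Hypothetical data: measurements and the ranges they may fall into.
x <- tibble(id = 1:3, value = c(5, 15, 25))
y <- tibble(lower = c(0, 10), upper = c(9, 19), label = c("low", "mid"))

# The overlap helper is written with an explicit dplyr:: prefix.
out <- inner_join(x, y, by = join_by(dplyr::between(value, lower, upper)))

# value = 25 falls in no range, so an inner join drops that row.
out
```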
left_join() and friends now return a specialized error message if they
detect that your join would return more rows than dplyr can handle (#6912).
slice_*() now throw the correct error if you forget to name n while also
prefixing the call with dplyr:: (#6946).
dplyr_reconstruct()'s default method has been rewritten to avoid
materializing duckplyr queries too early (#6947).
Updated the storms data to include 2022 data (#6937, @steveharoz).
Updated the starwars data to use a new API, because the old one is defunct.
There are very minor changes to the data itself (#6938, @steveharoz).

Contributors

steveharoz

05 Sep 14:36

DavisVaughan

v1.1.3

b4ebddb

dplyr 1.1.3

mutate_each() and summarise_each() now throw correct deprecation messages
(#6869).
setequal() now requires the input data frames to be compatible, similar to
the other set methods like setdiff() or intersect() (#6786).

20 Apr 16:58

hadley

v1.1.2

92ace94

dplyr 1.1.2

count() better documents that it has a .drop argument (#6820).
Fixed tests to maintain compatibility with the next version of waldo (#6823).
Joins better handle key columns with all NAs (#6804).

22 Mar 13:19

hadley

v1.1.1

d2f79bb

dplyr 1.1.1

Mutating joins now warn about multiple matches much less often. At a high
level, a warning was previously being thrown when a one-to-many or
many-to-many relationship was detected between the keys of x and y, but is
now only thrown for a many-to-many relationship, which is much rarer and much
more dangerous than one-to-many because it can result in a Cartesian explosion
in the number of rows returned from the join (#6731, #6717).

We've accomplished this in two steps:
- multiple now defaults to "all", and the options of "error" and
  "warning" are now deprecated in favor of using relationship (see below).
  We are using an accelerated deprecation process for these two options
  because they've only been available for a few weeks, and relationship is
  a clearly superior alternative.
- The mutating joins gain a new relationship argument, allowing you to
  optionally enforce one of the following relationship constraints between the
  keys of x and y: "one-to-one", "one-to-many", "many-to-one", or
  "many-to-many".
  
  For example, "many-to-one" enforces that each row in x can match at
  most 1 row in y. If a row in x matches >1 rows in y, an error is
  thrown. This option serves as the replacement for multiple = "error".
  
  The default behavior of relationship doesn't assume that there is any
  relationship between x and y. However, for equality joins it will check
  for the presence of a many-to-many relationship, and will warn if it detects
  one.
This change unfortunately does mean that if you have set multiple = "all" to
avoid a warning and you happened to be doing a many-to-many style join, then
you will need to replace multiple = "all" with
relationship = "many-to-many" to silence the new warning, but we believe
this should be rare since many-to-many relationships are fairly uncommon.
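A minimal sketch of how the relationship argument might be used, assuming dplyr 1.1.1 or later. The `orders`, `customers`, and `dup_customers` tibbles are hypothetical data invented for this example.

```r
library(dplyr)

# Hypothetical data: several orders can belong to one customer.
orders <- tibble(order_id = 1:3, customer_id = c(1, 1, 2))
customers <- tibble(customer_id = c(1, 2), name = c("Ada", "Grace"))

# Enforce that each row of x matches at most one row of y.
ok <- left_join(orders, customers, by = join_by(customer_id),
                relationship = "many-to-one")

# A duplicated key in y violates "many-to-one", so the join errors.
dup_customers <- tibble(customer_id = c(1, 1, 2),
                        name = c("Ada", "Ada L.", "Grace"))
res <- tryCatch(
  left_join(orders, dup_customers, by = join_by(customer_id),
            relationship = "many-to-one"),
  error = function(e) "error"
)

# Declaring the many-to-many relationship explicitly silences the warning.
many <- left_join(orders, dup_customers, by = join_by(customer_id),
                  relationship = "many-to-many")
```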
Fixed a major performance regression in case_when(). It is still a little
slower than in dplyr 1.0.10, but we plan to improve this further in the future
(#6674).
Fixed a performance regression related to nth(), first(), and last()
(#6682).
Fixed an issue where expressions involving infix operators had an abnormally
large amount of overhead (#6681).
group_data() on ungrouped data frames is faster (#6736).
n() is a little faster when there are many groups (#6727).
pick() now returns a 1 row, 0 column tibble when ... evaluates to an
empty selection. This makes it more compatible with tidyverse recycling
rules in some edge cases (#6685).
if_else() and case_when() again accept logical conditions that have
attributes (#6678).
arrange() can once again sort the numeric_version type from base R
(#6680).
slice_sample() now works when the input has a column named replace.
slice_min() and slice_max() now work when the input has columns named
na_rm or with_ties (#6725).
nth() now errors informatively if n is NA (#6682).
Joins now throw a more informative error when y doesn't have the same
source as x (#6798).
All major dplyr verbs now throw an informative error message if the input
data frame contains a column named NA or "" (#6758).
Deprecation warnings thrown by filter() now mention the correct package
where the problem originated from (#6679).
Fixed an issue where using <- within a grouped mutate() or summarise()
could cross contaminate other groups (#6666).
The compatibility vignette has been replaced with a more general vignette on
using dplyr in packages, vignette("in-packages") (#6702).
The developer documentation in ?dplyr_extending has been refreshed and
brought up to date with all changes made in 1.1.0 (#6695).
rename_with() now includes an example of using paste0(recycle0 = TRUE) to
correctly handle empty selections (#6688).
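The empty-selection case can be sketched as follows. The tibble `df` is a made-up example, and `recycle0` is an argument of base R's `paste0()` that requires R 4.0.1 or later.

```r
library(dplyr)

# Hypothetical data with no column starting with "z".
df <- tibble(x = 1, y = 2)

# With an empty selection the renaming function receives character(0).
# recycle0 = TRUE makes paste0() return character(0) as well, instead of
# the length-1 string "prefix_".
empty <- rename_with(df, ~ paste0("prefix_", .x, recycle0 = TRUE),
                     starts_with("z"))

# A non-empty selection is renamed as usual.
renamed <- rename_with(df, ~ paste0("prefix_", .x, recycle0 = TRUE),
                       starts_with("x"))
```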
R >= 3.5.0 is now explicitly required. This is in line with the tidyverse
policy of supporting the 5 most recent versions of R.

30 Jan 15:12

hadley

v1.1.0

b67769f

dplyr 1.1.0

New features

.by/by is an experimental alternative to group_by() that supports
per-operation grouping for mutate(), summarise(), filter(), and the slice()
family (#6528).

Rather than:
```
starwars %>%
  group_by(species, homeworld) %>%
  summarise(mean_height = mean(height))
```
You can now write:
```
starwars %>%
  summarise(
    mean_height = mean(height),
    .by = c(species, homeworld)
  )
```
The main benefit is that .by only affects a single operation. In the example
above, an ungrouped data frame went into the summarise() call, so an
ungrouped data frame comes out; with .by, you never need to remember to
ungroup() afterwards and you never need to use the .groups argument.

Additionally, using summarise() with .by will never sort the results by
the group key, unlike with group_by(). Instead, the results are returned
using the existing ordering of the groups from the original data. We feel this
is more predictable, better maintains any ordering you might have already
applied with a previous call to arrange(), and provides a way to maintain
the current ordering without having to resort to factors.
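The ordering difference can be seen in a small sketch, assuming dplyr 1.1.0 or later. The tibble `df` and its columns `g` and `v` are invented for illustration.

```r
library(dplyr)

# Hypothetical data where group "b" appears before group "a".
df <- tibble(g = c("b", "a", "b", "a"), v = 1:4)

# .by keeps groups in order of first appearance and returns ungrouped data.
by_res <- summarise(df, total = sum(v), .by = g)

# group_by() sorts the result by the group key.
gb_res <- df %>%
  group_by(g) %>%
  summarise(total = sum(v))

by_res$g  # "b" "a"
gb_res$g  # "a" "b"
```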

This feature was inspired by data.table, where the equivalent syntax looks
like:
```
starwars[, .(mean_height = mean(height)), by = .(species, homeworld)]
```
with_groups() is superseded in favor of .by (#6582).
reframe() is a new experimental verb that creates a new data frame by
applying functions to columns of an existing data frame. It is very similar to
summarise(), with two big differences:
- reframe() can return an arbitrary number of rows per group, while
  summarise() reduces each group down to a single row.
- reframe() always returns an ungrouped data frame, while summarise()
  might return a grouped or rowwise data frame, depending on the scenario.
reframe() has been added in response to valid concern from the community
that allowing summarise() to return any number of rows per group increases
the chance for accidental bugs. We still feel that this is a powerful
technique, and is a principled replacement for do(), so we have moved these
features to reframe() (#6382).
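A minimal sketch of the difference, assuming dplyr 1.1.0 or later. The tibble `df` and the column names are hypothetical.

```r
library(dplyr)

# Hypothetical data with two groups.
df <- tibble(g = c("a", "a", "b", "b", "b"), x = c(1, 5, 2, 4, 9))

# range() returns two values, so each group contributes two rows.
out <- reframe(df, stat = c("min", "max"), value = range(x), .by = g)

# The result has 4 rows (2 per group) and is always ungrouped.
out
```

The same expression inside summarise() is deprecated as of 1.1.0, because summarise() is meant to return one row per group.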
group_by() now uses a new algorithm for computing groups. It is often faster
than the previous approach (especially when there are many groups), and in
most cases there should be no changes. The one exception is with character
vectors, see the C locale news bullet below for more details (#4406, #6297).
arrange() now uses a faster algorithm for sorting character vectors, which
is heavily inspired by data.table's forder(). See the C locale news bullet
below for more details (#4962).
Joins have been completely overhauled to enable more flexible join operations
and provide more tools for quality control. Many of these changes are inspired
by data.table's join syntax (#5914, #5661, #5413, #2240).
- A join specification can now be created through join_by(). This allows
  you to specify both the left and right hand side of a join using unquoted
  column names, such as join_by(sale_date == commercial_date). Join
  specifications can be supplied to any *_join() function as the by
  argument.
- Join specifications allow for new types of joins:
  - Equality joins: The most common join, specified by ==. For example,
    join_by(sale_date == commercial_date).
  - Inequality joins: For joining on inequalities, i.e. >=, >, <, and
    <=. For example, use join_by(sale_date >= commercial_date) to find
    every commercial that aired before a particular sale.
  - Rolling joins: For "rolling" the closest match forward or backwards when
    there isn't an exact match, specified by using the rolling helper,
    closest(). For example,
    join_by(closest(sale_date >= commercial_date)) to find only the most
    recent commercial that aired before a particular sale.
  - Overlap joins: For detecting overlaps between sets of columns, specified
    by using one of the overlap helpers: between(), within(), or
    overlaps(). For example, use
    join_by(between(commercial_date, sale_date_lower, sale_date)) to
    find commercials that aired before a particular sale, as long as they
    occurred after some lower bound, such as 40 days before the sale was made.
  Note that you cannot use arbitrary expressions in the join conditions, like
  join_by(sale_date - 40 >= commercial_date). Instead, use mutate() to
  create a new column containing the result of sale_date - 40 and refer
  to that by name in join_by().
- multiple is a new argument for controlling what happens when a row
  in x matches multiple rows in y. For equality joins and rolling joins,
  where this is usually surprising, this defaults to signalling a "warning",
  but still returns all of the matches. For inequality joins, where multiple
  matches are usually expected, this defaults to returning "all" of the
  matches. You can also return only the "first" or "last" match, "any"
  of the matches, or you can "error".
- keep now defaults to NULL rather than FALSE. NULL implies
  keep = FALSE for equality conditions, but keep = TRUE for inequality
  conditions, since you generally want to preserve both sides of an
  inequality join.
- unmatched is a new argument for controlling what happens when a row
  would be dropped because it doesn't have a match. For backwards
  compatibility, the default is "drop", but you can also choose to
  "error" if dropped rows would be surprising.
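The join types above can be sketched with the sale and commercial columns used in the examples. The `sales` and `commercials` tibbles, their ID columns, and the dates are made up for illustration; the sketch assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical data.
sales <- tibble(
  sale_id = 1:2,
  sale_date = as.Date(c("2023-03-10", "2023-04-20"))
)
commercials <- tibble(
  commercial_id = 1:3,
  commercial_date = as.Date(c("2023-01-05", "2023-03-01", "2023-04-15"))
)

# Inequality join: every commercial that aired on or before each sale.
ineq <- inner_join(sales, commercials,
                   by = join_by(sale_date >= commercial_date))

# Rolling join: only the most recent commercial before each sale.
roll <- inner_join(sales, commercials,
                   by = join_by(closest(sale_date >= commercial_date)))

# Overlap join: commercials within the 40 days before a sale. The lower
# bound is computed with mutate() first, because join_by() does not
# accept arbitrary expressions.
sales_window <- mutate(sales, sale_date_lower = sale_date - 40)
overlap <- inner_join(
  commercials, sales_window,
  by = join_by(between(commercial_date, sale_date_lower, sale_date))
)
```

Here the inequality join returns five rows, while the rolling and overlap joins each keep a single commercial per sale.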
across() gains an experimental .unpack argument to optionally unpack
(as in, tidyr::unpack()) data frames returned by functions in .fns
(#6360).
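As a hedged sketch of `.unpack`, the helper `min_max()` below is a hypothetical function that returns a one-row tibble; it is not part of dplyr. The example assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical data and helper.
df <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))
min_max <- function(x) tibble(min = min(x), max = max(x))

# Without .unpack, `a` and `b` would be data frame columns. With
# .unpack = TRUE they are expanded into ordinary columns named using
# the pattern "{outer}_{inner}".
out <- summarise(df, across(c(a, b), min_max, .unpack = TRUE))

names(out)  # "a_min" "a_max" "b_min" "b_max"
```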
consecutive_id() for creating groups based on contiguous runs of the
same values, like data.table::rleid() (#1534).
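For instance, on a made-up character vector (assuming dplyr 1.1.0 or later):

```r
library(dplyr)

# Hypothetical input: "a" appears in two separate runs.
x <- c("a", "a", "b", "a", "a", "c")

# Each contiguous run gets a new id, even if the value was seen before.
ids <- consecutive_id(x)
ids  # 1 1 2 3 3 4
```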
case_match() is a "vectorised switch" variant of case_when() that matches
on values rather than logical expressions. It is like a SQL "simple"
CASE WHEN statement, whereas case_when() is like a SQL "searched"
CASE WHEN statement (#6328).
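The two styles can be compared side by side in a small sketch. The vector `x` and the labels are invented for the example, which assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical input.
x <- c("a", "b", "c", "d")

# case_match() compares x against values on the left-hand side.
matched <- case_match(
  x,
  c("a", "b") ~ "first",
  "c" ~ "second",
  .default = "other"
)

# The equivalent case_when() needs a logical expression per branch.
searched <- case_when(
  x %in% c("a", "b") ~ "first",
  x == "c" ~ "second",
  .default = "other"
)

matched  # "first" "first" "second" "other"
```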
cross_join() is a more explicit and slightly more correct replacement for
using by = character() during a join (#6604).
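A short sketch with two hypothetical tibbles, assuming dplyr 1.1.0 or later:

```r
library(dplyr)

# Hypothetical data.
sizes <- tibble(size = c("S", "M"))
colors <- tibble(color = c("red", "blue", "green"))

# Every row of sizes is paired with every row of colors: 2 x 3 = 6 rows.
grid <- cross_join(sizes, colors)
```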
pick() makes it easy to access a subset of columns from the current group.
pick() is intended as a replacement for across(.fns = NULL), cur_data(),
and cur_data_all(). We feel that pick() is a much more evocative name when
you are just trying to select a subset of columns from your data (#6204).
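One possible use is passing a subset of columns to a function that expects a data frame. The tibble `df` below is a made-up example, and the sketch assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical data.
df <- tibble(g = c("a", "a", "b"), x = 1:3, y = 4:6)

# pick(x, y) returns a data frame of the selected columns, which
# rowSums() then sums row by row.
out <- mutate(df, row_total = rowSums(pick(x, y)))
```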
symdiff() computes the symmetric difference (#4811).

Lifecycle changes

Breaking changes

arrange() and group_by() now use the C locale, not the system locale,
when ordering or grouping character vectors. This brings substantial
performance improvements, increases reproducibility across R sessions, makes
dplyr more consistent with data.table, and we believe it should affect little
existing code. If it does affect your code, you can use
options(dplyr.legacy_locale = TRUE) to quickly revert to the previous
behavior. However, in general, we instead recommend that you use the new
.locale argument to precisely specify the desired locale. For a full
explanation, please read the associated grouping and ordering tidyups.
bench_tbls(), compare_tbls(), compare_tbls2(), eval_tbls(),
eval_tbls2(), location() and changes(), deprecated in 1.0.0, are now
defunct (#6387).
frame_data(), data_frame_(), lst_() and tbl_sum() are no longer
re-exported from tibble (#6276, #6277, #6278, #6284).
select_vars(), rename_vars(), select_var() and current_vars(),
deprecated in 0.8.4, are now defunct (#6387).

Newly deprecated

across(), c_across(), if_any(), and if_all() now require the
.cols and .fns arguments. In general, we now recommend that you use
pick() instead of an empty across() call or across() with no .fns
(e.g. across(c(x, y))) (#6523).
- Relying on the previous default of .cols = everything() is deprecated.
  We have skipped the soft-deprecation stage in this case, because indirect
  usage of across() and friends in this way is rare.
- Relying on the previous default of .fns = NULL is not yet formally
  soft-deprecated, because there was no good alternative until now, but it is
  discouraged and will be soft-deprecated in the next minor release.
Passing ... to across() is soft-deprecated because it's ambiguous when
those arguments are evaluated. Now, instead of (e.g.)
across(a:b, mean, na.rm = TRUE) you should write
across(a:b, ~ mean(.x, na.rm = TRUE)) (#6073).
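The recommended form can be sketched on a hypothetical tibble with missing values:

```r
library(dplyr)

# Hypothetical data.
df <- tibble(a = c(1, NA, 3), b = c(4, 5, NA))

# Extra arguments go inside the lambda rather than through `...`.
out <- summarise(df, across(a:b, ~ mean(.x, na.rm = TRUE)))

out$a  # 2
out$b  # 4.5
```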
all_equal() is deprecated. We've advised against it for some time, and
we explicitly recommend you use all.equal(), manually reordering the rows
and columns as needed (#6324).
cur_data() and cur_data_all() are soft-deprecated in favour of
pick() (#6204).
Using by = character() to perform a cross join is now soft-deprecated in
favor of cross_join() (#6604).
filter()ing with a 1-column matrix is deprecated (#6091).
progress_estimate() is deprecated for all uses (#6387).
Using `su...

Contributors

tnederlof, steveharoz, and eutwt

01 Sep 11:35

hadley

v1.0.10

1293789

dplyr 1.0.10

Hot patch release to resolve R CMD check failures.

28 Apr 14:23

DavisVaughan

v1.0.9

a6c1417

dplyr 1.0.9

New rows_append() which works like rows_insert() but ignores keys and
allows you to insert arbitrary rows with a guarantee that the type of x
won't change (#6249, thanks to @krlmlr for the implementation and @mgirlich
for the idea).
The rows_*() functions no longer require that the key values in x uniquely
identify each row. Additionally, rows_insert() and rows_delete() no
longer require that the key values in y uniquely identify each row. Relaxing
this restriction should make these functions more practically useful for
data frames, and alternative backends can enforce this in other ways as needed
(i.e. through primary keys) (#5553).
rows_insert() gained a new conflict argument allowing you greater control
over rows in y with keys that conflict with keys in x. A conflict arises
if a key in y already exists in x. By default, a conflict results in an
error, but you can now also "ignore" these y rows. This is very similar to
the ON CONFLICT DO NOTHING command from SQL (#5588, with helpful additions
from @mgirlich and @krlmlr).
rows_update(), rows_patch(), and rows_delete() gained a new unmatched
argument allowing you greater control over rows in y with keys that are
unmatched by the keys in x. By default, an unmatched key results in an
error, but you can now also "ignore" these y rows (#5984, #5699).
rows_delete() no longer requires that the columns of y be a strict subset
of x. Only the columns specified through by will be utilized from y,
all others will be dropped with a message.
The rows_*() functions now always retain the column types of x. This
behavior was documented, but previously wasn't being applied correctly
(#6240).
The rows_*() functions now fail elegantly if y is a zero column data frame
and by isn't specified (#6179).
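A rough sketch of the new rows_*() behaviour, assuming dplyr 1.0.9 or later. The tibbles `x` and `y` and their columns are hypothetical.

```r
library(dplyr)

# Hypothetical data: id 3 exists in both tables, id 4 only in y.
x <- tibble(id = 1:3, value = c("a", "b", "c"))
y <- tibble(id = c(3L, 4L), value = c("C", "d"))

# rows_append() ignores keys, so id 3 ends up in the result twice.
appended <- rows_append(x, y)

# conflict = "ignore" skips the y row whose key already exists in x.
inserted <- rows_insert(x, y, by = "id", conflict = "ignore")

# unmatched = "ignore" skips the y row whose key is missing from x.
updated <- rows_update(x, y, by = "id", unmatched = "ignore")
```

Without the "ignore" options, both the rows_insert() and rows_update() calls above would error.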

Contributors

krlmlr and mgirlich

08 Feb 09:41

romainfrancois

v1.0.8

46cf4e0

dplyr 1.0.8

Better display of error messages thanks to rlang 1.0.0.
mutate(.keep = "none") is no longer identical to transmute().
transmute() has not been changed, and completely ignores the column ordering
of the existing data, instead relying on the ordering of expressions
supplied through .... mutate(.keep = "none") has been changed to ensure
that pre-existing columns are never moved, which aligns more closely with the
other .keep options (#6086).
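The column-order difference can be sketched as follows, assuming dplyr 1.0.8 or later. The tibble `df` is made up for the example.

```r
library(dplyr)

# Hypothetical data.
df <- tibble(x = 1, y = 2)

# mutate(.keep = "none") leaves the pre-existing column x in place and
# appends the new column z after it.
kept <- mutate(df, z = x + y, x = x * 10, .keep = "none")

# transmute() follows the order of the expressions instead.
trans <- transmute(df, z = x + y, x = x * 10)

names(kept)   # "x" "z"
names(trans)  # "z" "x"
```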
filter() forbids matrix results (#5973) and warns about data frame
results, especially data frames created from across() with a hint
to use if_any() or if_all().
slice() helpers (slice_head(), slice_tail(), slice_min(), slice_max())
now accept negative values for n and prop (#5961).
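For example, on a hypothetical five-row tibble (assuming dplyr 1.0.8 or later), a negative n means "all but this many rows":

```r
library(dplyr)

# Hypothetical data.
df <- tibble(x = 1:5)

# All rows except the last 2.
head_res <- slice_head(df, n = -2)

# All rows except the first 2.
tail_res <- slice_tail(df, n = -2)

head_res$x  # 1 2 3
tail_res$x  # 3 4 5
```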
slice() now indicates which group produces an error (#5931).
cur_data() and cur_data_all() don't simplify list columns in rowwise data frames (#5901).
dplyr now uses rlang::check_installed() to prompt you whether to install
required packages that are missing.
storms data updated to 2020 (@steveharoz, #5899).
coalesce() accepts 1-D arrays (#5557).
The deprecated trunc_mat() is no longer reexported from dplyr (#6141).

Contributors

steveharoz

19 Jun 09:03

romainfrancois

v1.0.7

33d7782

dplyr 1.0.7

across() now uses the formula's environment when inlining lambda formulas (#5886).
summarise.rowwise_df() is quiet when the result is ungrouped (#5875).
c_across() and across() key deparsing is no longer confused by long calls (#5883).
across() handles named selections (#5207).

05 May 16:01

romainfrancois

v1.0.6

22def18

dplyr 1.0.6

add_count() is now generic (#5837).
if_any() and if_all() abort when a predicate is mistakenly used as .cols= (#5732).
Multiple calls to if_any() and/or if_all() in the same expression are now
properly disambiguated (#5782).
filter() now inlines if_any() and if_all() expressions. This greatly
improves performance with grouped data frames.
Fixed behaviour of ... in top-level across() calls (#5813, #5832).
across() now inlines lambda-formulas. This is slightly more performant and
will allow more optimisations in the future.
Fixed issue in bind_rows() causing lists to be incorrectly transformed as
data frames (#5417, #5749).
select() no longer creates duplicate variables when renaming a variable
to the same name as a grouping variable (#5841).
dplyr_col_select() keeps attributes for bare data frames (#5294, #5831).
Fixed quosure handling in dplyr::group_by() that caused issues with extra
arguments (tidyverse/lubridate#959).
Removed the name argument from the compute() generic (@ianmcook, #5783).
Row-wise data frames with 0 rows and list columns are supported again (#5804).
