Releases: tidyverse/dplyr
dplyr 1.1.4
- `join_by()` now allows its helper functions to be namespaced with `dplyr::`, like `join_by(dplyr::between(x, lower, upper))` (#6838).
- `left_join()` and friends now return a specialized error message if they detect that your join would return more rows than dplyr can handle (#6912).
- `slice_*()` now throw the correct error if you forget to name `n` while also prefixing the call with `dplyr::` (#6946).
- `dplyr_reconstruct()`'s default method has been rewritten to avoid materializing duckplyr queries too early (#6947).
- Updated the `storms` data to include 2022 data (#6937, @steveharoz).
- Updated the `starwars` data to use a new API, because the old one is defunct. There are very minor changes to the data itself (#6938, @steveharoz).
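
As an illustration of the namespaced `join_by()` helper described in this release, here is a minimal sketch. The `readings` and `windows` tables and their columns are hypothetical, invented only for this example, and it assumes dplyr 1.1.4 or later is installed.

```r
library(dplyr)

# Hypothetical data: match each reading to the window that contains it.
readings <- tibble(id = 1:3, x = c(2, 7, 12))
windows <- tibble(lower = c(0, 5), upper = c(4, 9), label = c("a", "b"))

# The overlap helper is written with an explicit `dplyr::` prefix.
matched <- inner_join(
  readings,
  windows,
  by = join_by(dplyr::between(x, lower, upper))
)

# Only readings that fall inside a window survive the inner join.
print(matched)
```

The reading with `x = 12` lies outside both windows, so an inner join drops it.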
dplyr 1.1.3
dplyr 1.1.2
dplyr 1.1.1
- Mutating joins now warn about multiple matches much less often. At a high level, a warning was previously being thrown when a one-to-many or many-to-many relationship was detected between the keys of `x` and `y`, but is now only thrown for a many-to-many relationship, which is much rarer and much more dangerous than one-to-many because it can result in a Cartesian explosion in the number of rows returned from the join (#6731, #6717).

  We've accomplished this in two steps:

  - `multiple` now defaults to `"all"`, and the options of `"error"` and `"warning"` are now deprecated in favor of using `relationship` (see below). We are using an accelerated deprecation process for these two options because they've only been available for a few weeks, and `relationship` is a clearly superior alternative.

  - The mutating joins gain a new `relationship` argument, allowing you to optionally enforce one of the following relationship constraints between the keys of `x` and `y`: `"one-to-one"`, `"one-to-many"`, `"many-to-one"`, or `"many-to-many"`.

    For example, `"many-to-one"` enforces that each row in `x` can match at most 1 row in `y`. If a row in `x` matches >1 rows in `y`, an error is thrown. This option serves as the replacement for `multiple = "error"`.

    The default behavior of `relationship` doesn't assume that there is any relationship between `x` and `y`. However, for equality joins it will check for the presence of a many-to-many relationship, and will warn if it detects one.

  This change unfortunately does mean that if you have set `multiple = "all"` to avoid a warning and you happened to be doing a many-to-many style join, then you will need to replace `multiple = "all"` with `relationship = "many-to-many"` to silence the new warning, but we believe this should be rare since many-to-many relationships are fairly uncommon.
- Fixed a major performance regression in `case_when()`. It is still a little slower than in dplyr 1.0.10, but we plan to improve this further in the future (#6674).
- Fixed a performance regression related to `nth()`, `first()`, and `last()` (#6682).
- Fixed an issue where expressions involving infix operators had an abnormally large amount of overhead (#6681).
- `group_data()` on ungrouped data frames is faster (#6736).
- `n()` is a little faster when there are many groups (#6727).
- `pick()` now returns a 1 row, 0 column tibble when `...` evaluates to an empty selection. This makes it more compatible with tidyverse recycling rules in some edge cases (#6685).
- `if_else()` and `case_when()` again accept logical conditions that have attributes (#6678).
- `arrange()` can once again sort the `numeric_version` type from base R (#6680).
- `slice_sample()` now works when the input has a column named `replace`. `slice_min()` and `slice_max()` now work when the input has columns named `na_rm` or `with_ties` (#6725).
- `nth()` now errors informatively if `n` is `NA` (#6682).
- Joins now throw a more informative error when `y` doesn't have the same source as `x` (#6798).
- All major dplyr verbs now throw an informative error message if the input data frame contains a column named `NA` or `""` (#6758).
- Deprecation warnings thrown by `filter()` now mention the correct package where the problem originated from (#6679).
- Fixed an issue where using `<-` within a grouped `mutate()` or `summarise()` could cross contaminate other groups (#6666).
- The compatibility vignette has been replaced with a more general vignette on using dplyr in packages, `vignette("in-packages")` (#6702).
- The developer documentation in `?dplyr_extending` has been refreshed and brought up to date with all changes made in 1.1.0 (#6695).
- `rename_with()` now includes an example of using `paste0(recycle0 = TRUE)` to correctly handle empty selections (#6688).
- R >= 3.5.0 is now explicitly required. This is in line with the tidyverse policy of supporting the 5 most recent versions of R.
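
The `relationship` argument introduced in this release can be sketched as follows. The `orders` and `customers` tables are hypothetical and exist only for illustration; the sketch assumes dplyr 1.1.1 or later.

```r
library(dplyr)

# Hypothetical data: many orders may point at one customer.
orders <- tibble(order_id = 1:3, customer = c("a", "b", "a"))
customers <- tibble(customer = c("a", "b"), city = c("Oslo", "Lima"))

# "many-to-one": each row of `orders` may match at most one row of `customers`.
ok <- left_join(
  orders, customers,
  by = join_by(customer),
  relationship = "many-to-one"
)

# A duplicated key in `y` violates the constraint, so the join errors.
customers_dup <- bind_rows(customers, tibble(customer = "a", city = "Rome"))
violated <- tryCatch(
  {
    left_join(
      orders, customers_dup,
      by = join_by(customer),
      relationship = "many-to-one"
    )
    FALSE
  },
  error = function(e) TRUE
)
```

Stating the expected relationship up front turns a silent row explosion into an immediate error, which is the quality-control use the notes above describe.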
dplyr 1.1.0
New features
- `.by`/`by` is an experimental alternative to `group_by()` that supports per-operation grouping for `mutate()`, `summarise()`, `filter()`, and the `slice()` family (#6528).

  Rather than:

      starwars %>%
        group_by(species, homeworld) %>%
        summarise(mean_height = mean(height))

  You can now write:

      starwars %>%
        summarise(
          mean_height = mean(height),
          .by = c(species, homeworld)
        )

  The main benefit is that `.by` only affects a single operation. In the example above, an ungrouped data frame went into the `summarise()` call, so an ungrouped data frame will come out; with `.by`, you never need to remember to `ungroup()` afterwards and you never need to use the `.groups` argument.

  Additionally, using `summarise()` with `.by` will never sort the results by the group key, unlike with `group_by()`. Instead, the results are returned using the existing ordering of the groups from the original data. We feel this is more predictable, better maintains any ordering you might have already applied with a previous call to `arrange()`, and provides a way to maintain the current ordering without having to resort to factors.

  This feature was inspired by data.table, where the equivalent syntax looks like:

      starwars[, .(mean_height = mean(height)), by = .(species, homeworld)]

  `with_groups()` is superseded in favor of `.by` (#6582).

- `reframe()` is a new experimental verb that creates a new data frame by applying functions to columns of an existing data frame. It is very similar to `summarise()`, with two big differences:

  - `reframe()` can return an arbitrary number of rows per group, while `summarise()` reduces each group down to a single row.
  - `reframe()` always returns an ungrouped data frame, while `summarise()` might return a grouped or rowwise data frame, depending on the scenario.

  `reframe()` has been added in response to valid concern from the community that allowing `summarise()` to return any number of rows per group increases the chance for accidental bugs. We still feel that this is a powerful technique, and is a principled replacement for `do()`, so we have moved these features to `reframe()` (#6382).
- `group_by()` now uses a new algorithm for computing groups. It is often faster than the previous approach (especially when there are many groups), and in most cases there should be no changes. The one exception is with character vectors, see the C locale news bullet below for more details (#4406, #6297).
- `arrange()` now uses a faster algorithm for sorting character vectors, which is heavily inspired by data.table's `forder()`. See the C locale news bullet below for more details (#4962).
- Joins have been completely overhauled to enable more flexible join operations and provide more tools for quality control. Many of these changes are inspired by data.table's join syntax (#5914, #5661, #5413, #2240).

  - A join specification can now be created through `join_by()`. This allows you to specify both the left and right hand side of a join using unquoted column names, such as `join_by(sale_date == commercial_date)`. Join specifications can be supplied to any `*_join()` function as the `by` argument.

  - Join specifications allow for new types of joins:

    - Equality joins: The most common join, specified by `==`. For example, `join_by(sale_date == commercial_date)`.
    - Inequality joins: For joining on inequalities, i.e. `>=`, `>`, `<`, and `<=`. For example, use `join_by(sale_date >= commercial_date)` to find every commercial that aired before a particular sale.
    - Rolling joins: For "rolling" the closest match forward or backwards when there isn't an exact match, specified by using the rolling helper, `closest()`. For example, use `join_by(closest(sale_date >= commercial_date))` to find only the most recent commercial that aired before a particular sale.
    - Overlap joins: For detecting overlaps between sets of columns, specified by using one of the overlap helpers: `between()`, `within()`, or `overlaps()`. For example, use `join_by(between(commercial_date, sale_date_lower, sale_date))` to find commercials that aired before a particular sale, as long as they occurred after some lower bound, such as 40 days before the sale was made.

    Note that you cannot use arbitrary expressions in the join conditions, like `join_by(sale_date - 40 >= commercial_date)`. Instead, use `mutate()` to create a new column containing the result of `sale_date - 40` and refer to that by name in `join_by()`.

  - `multiple` is a new argument for controlling what happens when a row in `x` matches multiple rows in `y`. For equality joins and rolling joins, where this is usually surprising, this defaults to signalling a `"warning"`, but still returns all of the matches. For inequality joins, where multiple matches are usually expected, this defaults to returning `"all"` of the matches. You can also return only the `"first"` or `"last"` match, `"any"` of the matches, or you can `"error"`.

  - `keep` now defaults to `NULL` rather than `FALSE`. `NULL` implies `keep = FALSE` for equality conditions, but `keep = TRUE` for inequality conditions, since you generally want to preserve both sides of an inequality join.

  - `unmatched` is a new argument for controlling what happens when a row would be dropped because it doesn't have a match. For backwards compatibility, the default is `"drop"`, but you can also choose to `"error"` if dropped rows would be surprising.
- `across()` gains an experimental `.unpack` argument to optionally unpack (as in, `tidyr::unpack()`) data frames returned by functions in `.fns` (#6360).
- `consecutive_id()` for creating groups based on contiguous runs of the same values, like `data.table::rleid()` (#1534).
- `case_match()` is a "vectorised switch" variant of `case_when()` that matches on values rather than logical expressions. It is like a SQL "simple" `CASE WHEN` statement, whereas `case_when()` is like a SQL "searched" `CASE WHEN` statement (#6328).
- `cross_join()` is a more explicit and slightly more correct replacement for using `by = character()` during a join (#6604).
- `pick()` makes it easy to access a subset of columns from the current group. `pick()` is intended as a replacement for `across(.fns = NULL)`, `cur_data()`, and `cur_data_all()`. We feel that `pick()` is a much more evocative name when you are just trying to select a subset of columns from your data (#6204).
- `symdiff()` computes the symmetric difference (#4811).
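
Several of the new features listed above can be tried together in a short sketch. The small tables below are hypothetical stand-ins (the `sale_date` and `commercial_date` column names are borrowed from the join bullets), and the code assumes dplyr 1.1.0 or later.

```r
library(dplyr)

df <- tibble(g = c("b", "b", "a"), x = c(1, 3, 10))

# Per-operation grouping: the result is ungrouped and keeps the order in
# which the groups first appear ("b" before "a"), rather than sorting them.
by_mean <- summarise(df, mean_x = mean(x), .by = g)

# reframe() may return any number of rows per group (here, two per group).
ranges <- reframe(df, bound = range(x), .by = g)

# Rolling join: for each sale, keep only the most recent earlier commercial.
sales <- tibble(sale_date = as.Date(c("2023-01-10", "2023-02-01")))
commercials <- tibble(
  commercial_date = as.Date(c("2023-01-01", "2023-01-05", "2023-01-20"))
)
recent <- left_join(
  sales, commercials,
  by = join_by(closest(sale_date >= commercial_date))
)

# consecutive_id() starts a new id every time the value changes.
runs <- consecutive_id(c("x", "x", "y", "x"))
```

Because the join condition is an inequality, `keep` defaults to retaining both `sale_date` and `commercial_date` in `recent`.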
Lifecycle changes
Breaking changes
- `arrange()` and `group_by()` now use the C locale, not the system locale, when ordering or grouping character vectors. This brings substantial performance improvements, increases reproducibility across R sessions, makes dplyr more consistent with data.table, and we believe it should affect little existing code. If it does affect your code, you can use `options(dplyr.legacy_locale = TRUE)` to quickly revert to the previous behavior. However, in general, we instead recommend that you use the new `.locale` argument to precisely specify the desired locale. For a full explanation please read the associated grouping and ordering tidyups.
- `bench_tbls()`, `compare_tbls()`, `compare_tbls2()`, `eval_tbls()`, `eval_tbls2()`, `location()` and `changes()`, deprecated in 1.0.0, are now defunct (#6387).
- `frame_data()`, `data_frame_()`, `lst_()` and `tbl_sum()` are no longer re-exported from tibble (#6276, #6277, #6278, #6284).
- `select_vars()`, `rename_vars()`, `select_var()` and `current_vars()`, deprecated in 0.8.4, are now defunct (#6387).
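
The effect of the C locale change on `arrange()` can be seen in a small sketch. The data is made up for illustration, and the code assumes `dplyr.legacy_locale` has not been set; the `.locale = "en"` call needs the stringi package, so it is guarded.

```r
library(dplyr)

df <- tibble(x = c("b", "A", "a", "B"))

# Default since 1.1.0: C locale, where uppercase letters sort before lowercase.
c_sorted <- arrange(df, x)$x

# The same result, requested explicitly.
c_explicit <- arrange(df, x, .locale = "C")$x

# A language-aware ordering can be requested through `.locale`
# (only attempted when stringi is available).
if (requireNamespace("stringi", quietly = TRUE)) {
  en_sorted <- arrange(df, x, .locale = "en")$x
}
```

Passing `.locale` per call keeps the choice visible in the code, which is why the notes recommend it over the global `dplyr.legacy_locale` option.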
Newly deprecated
- `across()`, `c_across()`, `if_any()`, and `if_all()` now require the `.cols` and `.fns` arguments. In general, we now recommend that you use `pick()` instead of an empty `across()` call or `across()` with no `.fns` (e.g. `across(c(x, y))`) (#6523).

  - Relying on the previous default of `.cols = everything()` is deprecated. We have skipped the soft-deprecation stage in this case, because indirect usage of `across()` and friends in this way is rare.
  - Relying on the previous default of `.fns = NULL` is not yet formally soft-deprecated, because there was no good alternative until now, but it is discouraged and will be soft-deprecated in the next minor release.

- Passing `...` to `across()` is soft-deprecated because it's ambiguous when those arguments are evaluated. Now, instead of (e.g.) `across(a:b, mean, na.rm = TRUE)` you should write `across(a:b, ~ mean(.x, na.rm = TRUE))` (#6073).
- `all_equal()` is deprecated. We've advised against it for some time, and we explicitly recommend you use `all.equal()`, manually reordering the rows and columns as needed (#6324).
- `cur_data()` and `cur_data_all()` are soft-deprecated in favour of `pick()` (#6204).
- Using `by = character()` to perform a cross join is now soft-deprecated in favor of `cross_join()` (#6604).
- `filter()`ing with a 1-column matrix is deprecated (#6091).
- `progress_estimate()` is deprecated for all uses (#6387).
- Using `su...
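
Two of the replacements recommended in the deprecation notes above can be sketched briefly. The data frame is hypothetical, and the code assumes dplyr 1.1.0 or later.

```r
library(dplyr)

df <- tibble(a = c(1, NA, 3), b = c(4, 5, NA))

# Instead of across(a:b, mean, na.rm = TRUE), pass extra arguments
# inside a lambda so it is clear when they are evaluated.
means <- summarise(df, across(a:b, ~ mean(.x, na.rm = TRUE)))

# Instead of by = character(), request a cross join explicitly.
pairs <- cross_join(tibble(x = 1:2), tibble(y = c("p", "q", "r")))
```

The cross join returns every combination of rows, so two rows joined with three rows yields six.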
dplyr 1.0.10
Hot patch release to resolve R CMD check failures.
dplyr 1.0.9
- New `rows_append()` which works like `rows_insert()` but ignores keys and allows you to insert arbitrary rows with a guarantee that the type of `x` won't change (#6249, thanks to @krlmlr for the implementation and @mgirlich for the idea).
- The `rows_*()` functions no longer require that the key values in `x` uniquely identify each row. Additionally, `rows_insert()` and `rows_delete()` no longer require that the key values in `y` uniquely identify each row. Relaxing this restriction should make these functions more practically useful for data frames, and alternative backends can enforce this in other ways as needed (i.e. through primary keys) (#5553).
- `rows_insert()` gained a new `conflict` argument allowing you greater control over rows in `y` with keys that conflict with keys in `x`. A conflict arises if a key in `y` already exists in `x`. By default, a conflict results in an error, but you can now also `"ignore"` these `y` rows. This is very similar to the `ON CONFLICT DO NOTHING` command from SQL (#5588, with helpful additions from @mgirlich and @krlmlr).
- `rows_update()`, `rows_patch()`, and `rows_delete()` gained a new `unmatched` argument allowing you greater control over rows in `y` with keys that are unmatched by the keys in `x`. By default, an unmatched key results in an error, but you can now also `"ignore"` these `y` rows (#5984, #5699).
- `rows_delete()` no longer requires that the columns of `y` be a strict subset of `x`. Only the columns specified through `by` will be utilized from `y`, all others will be dropped with a message.
- The `rows_*()` functions now always retain the column types of `x`. This behavior was documented, but previously wasn't being applied correctly (#6240).
- The `rows_*()` functions now fail elegantly if `y` is a zero column data frame and `by` isn't specified (#6179).
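
The new `conflict` and `unmatched` arguments, together with `rows_append()`, can be sketched as follows. The tables `x` and `y` and their contents are hypothetical; the code assumes dplyr 1.0.9 or later.

```r
library(dplyr)

x <- tibble(id = 1:2, val = c("a", "b"))
y <- tibble(id = 2:3, val = c("B", "c"))

# id 2 already exists in x: with conflict = "ignore" that row of y is skipped
# and only id 3 is inserted.
inserted <- rows_insert(x, y, by = "id", conflict = "ignore")

# id 3 does not exist in x: with unmatched = "ignore" it is skipped
# and only id 2 is updated.
updated <- rows_update(x, y, by = "id", unmatched = "ignore")

# rows_append() ignores keys entirely and simply adds every row of y.
appended <- rows_append(x, y)
```

With the default `"error"` setting, both the `rows_insert()` and `rows_update()` calls above would fail instead of skipping rows.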
dplyr 1.0.8
- Better display of error messages thanks to rlang 1.0.0.
- `mutate(.keep = "none")` is no longer identical to `transmute()`. `transmute()` has not been changed, and completely ignores the column ordering of the existing data, instead relying on the ordering of expressions supplied through `...`. `mutate(.keep = "none")` has been changed to ensure that pre-existing columns are never moved, which aligns more closely with the other `.keep` options (#6086).
- `filter()` forbids matrix results (#5973) and warns about data frame results, especially data frames created from `across()` with a hint to use `if_any()` or `if_all()`.
- `slice()` helpers (`slice_head()`, `slice_tail()`, `slice_min()`, `slice_max()`) now accept negative values for `n` and `prop` (#5961).
- `slice()` now indicates which group produces an error (#5931).
- `cur_data()` and `cur_data_all()` don't simplify list columns in rowwise data frames (#5901).
- dplyr now uses `rlang::check_installed()` to prompt you whether to install required packages that are missing.
- `storms` data updated to 2020 (@steveharoz, #5899).
- `coalesce()` accepts 1-D arrays (#5557).
- The deprecated `trunc_mat()` is no longer reexported from dplyr (#6141).
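
The difference between `mutate(.keep = "none")` and `transmute()`, and the negative `n` accepted by the `slice()` helpers, can be illustrated with a small sketch. The data is made up, and the code assumes dplyr 1.0.8 or later.

```r
library(dplyr)

df <- tibble(x = 1, y = 2)

# mutate(.keep = "none") leaves the pre-existing column `x` where it was
# and adds the new column `z` after it.
kept <- names(mutate(df, z = x + y, x = x * 10, .keep = "none"))

# transmute() follows the order of the expressions instead.
transmuted <- names(transmute(df, z = x + y, x = x * 10))

# A negative n drops rows from the opposite end: all rows except the last two.
head_rows <- slice_head(tibble(i = 1:5), n = -2)
```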
dplyr 1.0.7
dplyr 1.0.6
- `add_count()` is now generic (#5837).
- `if_any()` and `if_all()` abort when a predicate is mistakenly used as `.cols=` (#5732).
- Multiple calls to `if_any()` and/or `if_all()` in the same expression are now properly disambiguated (#5782).
- `filter()` now inlines `if_any()` and `if_all()` expressions. This greatly improves performance with grouped data frames.
- Fixed behaviour of `...` in top-level `across()` calls (#5813, #5832).
- `across()` now inlines lambda-formulas. This is slightly more performant and will allow more optimisations in the future.
- Fixed issue in `bind_rows()` causing lists to be incorrectly transformed as data frames (#5417, #5749).
- `select()` no longer creates duplicate variables when renaming a variable to the same name as a grouping variable (#5841).
- `dplyr_col_select()` keeps attributes for bare data frames (#5294, #5831).
- Fixed quosure handling in `dplyr::group_by()` that caused issues with extra arguments (tidyverse/lubridate#959).
- Removed the `name` argument from the `compute()` generic (@ianmcook, #5783).
- Row-wise data frames of 0 rows and list columns are supported again (#5804).
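
For readers unfamiliar with `if_any()` and `if_all()`, which several of the fixes above concern, here is a minimal sketch of their use inside `filter()`. The data frame is hypothetical and the code assumes a recent dplyr release.

```r
library(dplyr)

df <- tibble(a = c(1, -1, 3), b = c(2, 5, -4))

# Keep rows where every selected column is positive.
all_pos <- filter(df, if_all(c(a, b), ~ .x > 0))

# Keep rows where at least one selected column is positive.
any_pos <- filter(df, if_any(c(a, b), ~ .x > 0))
```

Only the first row has both values positive, while every row has at least one positive value.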