Skip to content

Latest commit

 

History

History
471 lines (377 loc) · 15.6 KB

README.md

File metadata and controls

471 lines (377 loc) · 15.6 KB

neatRanges

CRAN_Status_Badge R-CMD-check codecov lifecycle

The aim of neatRanges is to provide tools for working with date & timestamp ranges, namely:

  • Collapsing,
  • Partitioning,
  • Combining,
  • Expanding,
  • Filling the gaps between ranges.

It primarily uses data.table in order to speed up the operations. One of the functions - collapse_ranges - also uses Rcpp, thanks to the idea from Patrikios/customerRelationship.

You can install it from CRAN by install.packages('neatRanges').

Below is a quick overview of all functions, followed by a detailed description of each one of them.

Function Description Supports dates Supports timestamps
collapse_ranges Collapse consecutive ranges Yes Yes
partition_ranges Split a range into multiple ranges (by month or year) Months and years No
expand_* Expand a range Yes Yes
combine_ranges Combine ranges from multiple tables Yes Yes
fill_ranges Add missing ranges Yes Yes

collapse_ranges

This function can be useful when we want to collapse several consecutive date or timestamp ranges.

Let’s say we have a data frame of different financial actors and their credit ratings for different time ranges.

Now you can see that some of them are actually consecutive. Instead of 6 rows, we’d therefore like to get only 4.

    id rating start_date   end_date marketCapStart marketCapEnd
1 1111     A+ 2014-01-01 2014-12-31            7.2          7.4
2 1111     AA 2015-01-01 2015-12-31            7.3          7.5
3 1111     AA 2016-01-01 2016-03-01            7.9          8.0
4 2222     B- 2017-01-01 2017-01-31            6.5          6.9
5 2222     B- 2018-01-01 2018-12-31            6.7          6.5
6 2222     B- 2019-01-01 2020-02-01            6.8          7.0

We can do this as below. A bit of explanation of some of the arguments:

  • dimension defines whether we are dealing with dates or timestamps; defaults to dates;

  • max_gap defines what is considered as consecutive. By default, it is set to 0, this means no gap whatsoever. If dimension is date, this means 0 days; in case of timestamp, it refers to 0 seconds. A gap of 1 day / 1 second would be expressed as max_gap = 1L;

  • fmt defines the format. By default, it uses %Y-%m-%d for dates, and %Y-%m-%d %H:%M:%OS for timestamps. If your ranges are in another format, you need to modify that accordingly;

  • groups argument is optional;

  • startAttr and endAttr are attributes of every date/timestamp record (optional as arguments). Once the ranges are collapsed, each record will keep the startAttr from the very beginning of the range while the endAttr will contain the attributes from the very end of the range. It’s possible to include multiple columns both in startAttr as well as endAttr. Note that all of them will be - for safety reasons - converted to character beforehand.

df_collapsed <- collapse_ranges(df, 
                                groups = c("id", "rating"), 
                                start_var = "start_date", 
                                end_var = "end_date",
                                startAttr = "marketCapStart",
                                endAttr = "marketCapEnd",
                                max_gap = 0L,
                                fmt = "%Y-%m-%d",
                                dimension = "date"
                                )

df_collapsed
    id rating start_date   end_date marketCapStart marketCapEnd
1 1111     A+ 2014-01-01 2014-12-31            7.2          7.4
2 1111     AA 2015-01-01 2016-03-01            7.3            8
3 2222     B- 2017-01-01 2017-01-31            6.5          6.9
4 2222     B- 2018-01-01 2020-02-01            6.7            7

We can address timestamps in a similar way, only now we need to specify the dimension as timestamp.

Note that here two additional arguments are important:

  • tz defines the time zone - by default, it is set to UTC;

  • origin specifies the origin for indexing/converting the dates; by default, it is set to 1970-01-01.

Let’s say that now we’re dealing with individuals and the way they spend their time:

    id       diary          start_time            end_time
1 1111     reading 2014-01-01 14:00:00 2014-01-01 14:59:59
2 1111 watching TV 2014-01-01 15:00:00 2014-01-01 16:29:59
3 1111 watching TV 2014-01-01 16:30:00 2014-01-01 19:00:00
4 2222     working 2015-01-01 15:00:00 2015-01-01 15:59:59
5 2222     working 2015-01-01 17:00:00 2015-01-01 18:59:59
6 2222     working 2015-01-01 19:00:00 2015-01-01 21:00:00

If we don’t specify the format, it will throw a warning:

df_collapsed <- collapse_ranges(df, 
                                groups = c("id", "diary"), 
                                start_var = "start_time", 
                                end_var = "end_time", 
                                dimension = "timestamp",
                                tz = 'UTC',
                                origin = '1970-01-01'
                                )
Warning in collapse_ranges(df, groups = c("id", "diary"), start_var =
"start_time", : Dimension 'timestamp' selected but format unchanged. Will try to
convert to '%Y-%m-%d %H:%M:%OS' ..
df_collapsed
    id       diary          start_time            end_time
1 1111     reading 2014-01-01 14:00:00 2014-01-01 14:59:59
2 1111 watching TV 2014-01-01 15:00:00 2014-01-01 19:00:00
3 2222     working 2015-01-01 15:00:00 2015-01-01 15:59:59
4 2222     working 2015-01-01 17:00:00 2015-01-01 21:00:00

partition_ranges

This function allows users to further split their ranges.

Currently, this allows only splitting of Date formats (partitioning by either year or month).

Consider the following data frame:

  group      start        end
1     a 2017-05-01 2018-09-01
2     a 2019-04-03 2020-04-03
3     b 2011-03-03 2012-05-03
4     b 2014-05-07 2016-04-02
5     c 2017-02-01 2017-04-05

Partitioning by year (default mode) would look like:

part_by_year <- partition_ranges(df,
                                 start_var = "start",
                                 end_var = "end",
                                 partition_by = "year",
                                 vars_to_keep = "group"
                                 )

head(part_by_year)
  group      start        end
1     a 2017-05-01 2017-12-31
2     a 2018-01-01 2018-09-01
3     a 2019-04-03 2019-12-31
4     a 2020-01-01 2020-04-03
5     b 2011-03-03 2011-12-31
6     b 2012-01-01 2012-05-03

On the other hand, partitioning by month would take the following format:

part_by_month <- partition_ranges(df,
                                  start_var = "start",
                                  end_var = "end",
                                  partition_by = "month",
                                  vars_to_keep = "group"
                                  )

head(part_by_month)
  group      start        end
1     a 2017-05-01 2017-05-31
2     a 2017-06-01 2017-06-30
3     a 2017-07-01 2017-07-31
4     a 2017-08-01 2017-08-31
5     a 2017-09-01 2017-09-30
6     a 2017-10-01 2017-10-31

Note that the vars_to_keep argument is optional and basically specifies which columns you’d like to keep.

There is also fmt argument that specifies the Date format (set to %Y-%m-%d by default).

expand_* family

These are light-weight functions that allow the user to expand date/timestamp range into a column of elements of the sequence.

Let’s say we have a data frame with start and end which we would like to expand:

    id gender      start        end
1 1111      M 2018-01-01 2018-01-05
2 2222      F 2019-01-01 2019-01-07
3 3333      F 2020-01-01 2020-01-08

This can be expanded with calling the expand_dates function with the following arguments:

  • start_var and end_var as column names of our range;

  • name as the future name of our column. This is optional and defaults to ‘Expanded’;

  • vars_to_keep as indicators of which columns we’d like to keep when expanding (optional);

  • unit which defines the unit for expansion - defaults to each day in the sequence.

Other optional argument is also fmt where you can set the format of your dates - by default, it is %Y-%m-%d.

df_exp <- expand_dates(df,
                       start_var = "start",
                       end_var = "end",
                       name = "exp_seqs",
                       vars_to_keep = c("id", "gender"),
                       unit = "day"
                       )
head(df_exp)
    id gender   exp_seqs
1 1111      M 2018-01-01
2 1111      M 2018-01-02
3 1111      M 2018-01-03
4 1111      M 2018-01-04
5 1111      M 2018-01-05
6 2222      F 2019-01-01

You can also tackle timestamp formats in a similar way with expand_times.

The arguments are pretty much the same, except that unit defaults to hour, fmt to %Y-%m-%d %H:%M:%OS, and you can set an additional argument tz (time zone; defaults to UTC).

    id gender               start                 end
1 1111      M 2018-01-01 15:00:00 2018-01-01 18:30:00
2 2222      F 2019-01-01 14:00:00 2019-01-01 17:30:00
3 3333      F 2020-01-01 19:00:00 2020-01-02 02:00:00
df_exp <- expand_times(df,
                       start_var = "start",
                       end_var = "end",
                       name = "exp_seqs",
                       vars_to_keep = c("id", "gender"),
                       unit = "hour"
                       )
head(df_exp)
    id gender            exp_seqs
1 1111      M 2018-01-01 15:00:00
2 1111      M 2018-01-01 16:00:00
3 1111      M 2018-01-01 17:00:00
4 1111      M 2018-01-01 18:00:00
5 2222      F 2019-01-01 14:00:00
6 2222      F 2019-01-01 15:00:00

combine_ranges

This function is essentially a wrapper around collapse_ranges that allows you to combine ranges scattered around different tables into one table with non-redundant splits & rows.

Let’s say we have the following data frames df1, df2 and df3:

       start        end group infoScores
1 2010-01-01 2010-08-05     a          0
2 2012-06-01 2013-03-03     a          3
3 2014-10-15 2015-01-01     b          2
         end group      start score
1 2012-04-05     b 2009-01-15     8
2 2014-06-09     a 2012-07-08     2
3 2009-02-01     b 2008-01-01     3
         end group      start scoreInfo
1 2011-04-05     a 2010-02-03         1
2 2014-12-09     b 2014-07-08         2
3 2009-02-01     c 2008-01-01         3

Note the column names: the range names (start and end) as well as the grouping variables (group) are the same across all data frames. This is a requirement for the function to run properly. However - as you can see from the example -, there is no need for them to be in the same order.

The arguments are almost identical to those of collapse_ranges, except that at the beginning it is possible to pass as many data frames as you would like to combine. They need to be combined together in a list.

For instance, we can combine the above data frames as follows (we don’t need to specify the dimension as it defaults to date):

df <- combine_ranges(dfs = list(df1, df2, df3), 
                     start_var = "start", 
                     end_var = "end", 
                     groups = "group"
                     )

df
  group      start        end
1     a 2010-01-01 2011-04-05
2     a 2012-06-01 2014-06-09
3     b 2008-01-01 2012-04-05
4     b 2014-07-08 2015-01-01
5     c 2008-01-01 2009-02-01

As you have probably noticed, the output contains only the relevant columns: range variables & grouping variables. Note that the groups argument is optional. On the other hand, you can also use the startAttr and endAttr arguments, the same as in collapse_ranges.

fill_ranges

The function adds missing ranges to a table. It supports both Date and POSIXct formats; by default, it assumes the range columns are dates.

  group      start        end cost score
1     a 2007-01-01 2008-02-05  143    99
2     a 2010-06-02 2013-04-05  144    33
3     b 2009-04-05 2009-06-03  105    44
4     b 2012-08-01 2013-02-17  153    22
5     b 2019-03-19 2021-04-21  124    33
6     c 2020-01-05 2020-01-09  105   105
7     d 2014-01-01 2014-12-31  153   153
8     d 2015-01-01 2016-12-31  124   124

The arguments are almost identical to those of collapse_ranges.

By dimension argument you indicate whether your data frame contains date or timestamp values (defaults to dates). groups argument is optional.

The output based on the above data frame:

df <- fill_ranges(df, 
                  start_var = "start", 
                  end_var = "end", 
                  groups = "group"
                  )

df
   group      start        end cost score
1      a 2007-01-01 2008-02-05  143    99
2      a 2008-02-06 2010-06-01   NA    NA
3      a 2010-06-02 2013-04-05  144    33
4      b 2009-04-05 2009-06-03  105    44
5      b 2009-06-04 2012-07-31   NA    NA
6      b 2012-08-01 2013-02-17  153    22
7      b 2013-02-18 2019-03-18   NA    NA
8      b 2019-03-19 2021-04-21  124    33
9      c 2020-01-05 2020-01-09  105   105
10     d 2014-01-01 2014-12-31  153   153
11     d 2015-01-01 2016-12-31  124   124

As you can see, all the original variables are returned.

The rows corresponding to added ranges will by default have NA in the columns that are not ranges or grouping variables (in the above case cost & score variables).

You can change this behaviour by adjusting the fill parameter, like below:

df <- fill_ranges(df, 
                  groups = "group", 
                  start_var = "start", 
                  end_var = "end", 
                  fill = "cost = 0, score = Missing"
                  )

df
   group      start        end cost   score
1      a 2007-01-01 2008-02-05  143      99
2      a 2008-02-06 2010-06-01    0 Missing
3      a 2010-06-02 2013-04-05  144      33
4      b 2009-04-05 2009-06-03  105      44
5      b 2009-06-04 2012-07-31    0 Missing
6      b 2012-08-01 2013-02-17  153      22
7      b 2013-02-18 2019-03-18    0 Missing
8      b 2019-03-19 2021-04-21  124      33
9      c 2020-01-05 2020-01-09  105     105
10     d 2014-01-01 2014-12-31  153     153
11     d 2015-01-01 2016-12-31  124     124

Note that this feature is somewhat experimental & currently the columns to be filled are - for safety reasons - automatically converted to character. Additional checks & more flexibility will be added in the future.