---
output:
  github_document:
    html_preview: false
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
# Reference
`polars` provides a large number of functions for numerous data types and this
can sometimes be a bit overwhelming. Overall, you should be able to do anything
you want with `polars` by specifying the **data structure** you want to use and
then by applying **expressions** in a particular **context**.
## Data structure
As explained in some vignettes, one of `polars`' biggest strengths is the ability
to choose between eager and lazy evaluation, which require a `DataFrame` and a
`LazyFrame` respectively (with their counterparts `GroupBy` and `LazyGroupBy`
for grouped data).
We can apply functions directly on a `DataFrame` or `LazyFrame`, such as `rename()`
or `drop()`. Most functions that can be applied to `DataFrame`s can also be used
on `LazyFrame`s, but some are specific to one or the other. For example:
* `$equals()` exists for `DataFrame` but not for `LazyFrame`;
* `$collect()` executes a lazy query, which means it can only be applied to
  a `LazyFrame`.
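
As a minimal sketch of the difference, the chunk below runs the same filter eagerly and lazily. It assumes `polars` is installed and that `$lazy()` is available to turn a `DataFrame` into a `LazyFrame`; the variable names are illustrative only.

```{r}
library(polars)

# eager: the filter runs immediately and returns a DataFrame
eager_res = pl$DataFrame(mtcars)$filter(pl$col("cyl") == 4)

# lazy: the query is only recorded until $collect() executes it
# (assumes $lazy() converts a DataFrame into a LazyFrame)
lazy_query = pl$DataFrame(mtcars)$lazy()$filter(pl$col("cyl") == 4)
lazy_res = lazy_query$collect()
```

Both `eager_res` and `lazy_res` should hold the same rows; only the moment at which the work is done differs.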
Another common data structure is the `Series`, which can be considered the
equivalent of an R vector in `polars`' world. A `DataFrame` can therefore be seen
as a list of `Series`, much like an R `data.frame` is a list of vectors.
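
The relationship between a `DataFrame` and its `Series` can be sketched as follows. This assumes the `$get_columns()` and `$get_column()` methods are available in the installed version of `polars`; `df_small` and the other names are hypothetical.

```{r}
library(polars)

df_small = pl$DataFrame(a = 1:3, b = c("x", "y", "z"))

# assumed to return the columns as an R list of Series
cols = df_small$get_columns()

# assumed to extract a single column as a Series, by name
a_series = df_small$get_column("a")
```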
Operations on `DataFrame` or `LazyFrame` are useful, but many more operations
can be applied on columns themselves by using various **expressions** in different
**contexts**.
## Contexts
A context is simply the type of data modification being performed. There are three
types of contexts:
* select and modify columns with `select()` and `with_columns()`;
* filter rows with `filter()`;
* group and aggregate rows with `group_by()` and `agg()`.
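
The three contexts listed above can be sketched on `mtcars` as follows. This is an illustration only: `cars` and the column name `mpg_plus_one` are hypothetical, and `$alias()` is used here to name the derived column.

```{r}
library(polars)

cars = pl$DataFrame(mtcars)

# selection context: keep "mpg" and derive a second column from it
selected = cars$select(
  pl$col("mpg"),
  (pl$col("mpg") + 1)$alias("mpg_plus_one")
)

# filter context: keep only the rows where the condition holds
filtered = cars$filter(pl$col("mpg") > 30)

# group-by / aggregation context: one row per value of "cyl"
aggregated = cars$group_by(pl$col("cyl"))$agg(
  pl$col("mpg")$mean()
)
```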
Inside each context, you can use various **expressions** (a.k.a. `Expr`). Some
expressions cannot be used in some contexts. For example, in `with_columns()`,
you can only apply expressions that return either as many values as the
`DataFrame` has rows or a single value that will be duplicated on all rows:
```{r}
test = pl$DataFrame(mtcars)
```
```{r}
# this works
test$with_columns(
  pl$col("mpg") + 1
)
```
```r
# this doesn't work because it returns only 2 values, while mtcars has 32 rows.
test$with_columns(
  pl$col("mpg")$slice(0, 2)
)
```
By contrast, in an `agg()` context, any number of return values is possible, as
they are returned in a list, and only the new columns and the grouping columns
are returned.
```{r}
test$group_by(pl$col("cyl"))$agg(
  pl$col("mpg"), # varying number of values per group
  pl$col("mpg")$slice(0, 2)$name$suffix("_sliced"), # two values per group
  # aggregated to a single value, which is implicitly unpacked from the list
  pl$col("mpg")$sum()$name$suffix("_summed")
)
```
## Expressions
Expressions are the building blocks that give all the flexibility we need to
modify or create new columns.
Two important expression starters are `pl$col()` (names a column in the context)
and `pl$lit()` (wraps a literal value or vector/series in an `Expr`). Most other
expression starters are syntactic sugar derived from them, e.g. `pl$sum(_)` is
actually `pl$col(_)$sum()`.
Expressions can be chained with more than 170 expression methods, such as `$sum()`,
which aggregates a column by summing its values.
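
As a small sketch of the two starters and of the sugar described above, the chunk below computes the same sum in both ways and adds a literal column. The names `df_a`, `via_col`, `via_sugar` and `ten` are hypothetical, and the equivalence of `pl$sum()` and `pl$col()$sum()` is taken from the description above rather than verified against a specific version.

```{r}
library(polars)

df_a = pl$DataFrame(a = 1:4)

# the same aggregation written with pl$col() and with the pl$sum() shortcut
sums = df_a$select(
  pl$col("a")$sum()$alias("via_col"),
  pl$sum("a")$alias("via_sugar")
)

# pl$lit() wraps a literal value, here duplicated on all rows
with_lit = df_a$with_columns(
  pl$lit(10)$alias("ten")
)
```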
```{r}
# three examples of starting, chaining and combining expressions
pl$DataFrame(a = 1:4)$with_columns(
  # compute the sine of the cosine of column "a"
  a_cos_sin = pl$col("a")$cos()$sin(),
  # standardize the values of column "a"
  a_stand = (pl$col("a") - pl$col("a")$mean()) / pl$col("a")$std(),
  # take 1:3, sum it, then multiply by two
  lit_sum_times_two = pl$lit(1:3)$sum() * 2L
)
```
Some methods share a common name but their behavior might be very different
depending on the input type. For example, `$decode()` doesn't do the same thing
when it is applied on binary data or on string data.
To distinguish those usages and to check the validity of a query, `polars` stores
methods in subnamespaces. For each datatype other than numeric (floats and
integers), there is a subnamespace containing the available methods: `dt`
(datetime), `list` (list), `str` (strings), `struct` (structs), `cat`
(categoricals) and `bin` (binary). As a side note, there is also a more exotic
subnamespace called `meta`, which is rarely needed and is used to manipulate the
expressions themselves. Each subsection in the "Expressions" section lists all
operations available for a specific subnamespace.
As a concrete example for `dt`, suppose we have a column containing dates and
we want to extract the year from these dates:
```{r}
# Create the DataFrame
df = pl$DataFrame(
  date = pl$date_range(
    as.Date("2020-01-01"),
    as.Date("2023-01-02"),
    interval = "1y"
  )
)
df
```
The function `year()` only makes sense for date-time data, so we look for functions
in the `dt` subnamespace (for **d**ate-**t**ime):
```{r}
df$with_columns(
  year = pl$col("date")$dt$year()
)
```
Similarly, to convert a string column to uppercase, we use the `str` prefix
before using `to_uppercase()`:
```{r}
# Create the DataFrame and add an uppercase version of column "foo"
pl$DataFrame(foo = c("jake", "mary", "john peter"))$
  with_columns(upper = pl$col("foo")$str$to_uppercase())
```