reference_home.Rmd

---
output:
  github_document:
    html_preview: false
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Reference

`polars` provides a large number of functions for numerous data types and this
can sometimes be a bit overwhelming. Overall, you should be able to do anything
you want with `polars` by specifying the **data structure** you want to use and
then by applying **expressions** in a particular **context**.

## Data structure

As explained in some vignettes, one of `polars` biggest strengths is the ability
to choose between eager and lazy evaluation, that require respectively a
`DataFrame` and a `LazyFrame` (with their counterparts `GroupBy` and `LazyGroupBy`
for grouped data).

We can apply functions directly on a `DataFrame` or `LazyFrame`, such as `rename()`
or `drop()`. Most functions that can be applied to `DataFrame`s can also be used
on `LazyFrame`s, but some are specific to one or the other. For example:

* `$equals()` exists for `DataFrame` but not for `LazyFrame`;
* `$collect()` executes a lazy query, which means it can only be applied on
  a `LazyFrame`.

Another common data structure is the `Series`, which can be considered as the
equivalent of R vectors in `polars`' world. Therefore, a `DataFrame` is a list of
`Series`.

Operations on `DataFrame` or `LazyFrame` are useful, but many more operations
can be applied on columns themselves by using various **expressions** in different
**contexts**.

## Contexts

A context simply is the type of data modification that is done. There are 3 types
of contexts:

* select and modify columns with `select()` and `with_columns()`;
* filter rows with `filter()`;
* group and aggregate rows with `group_by()` and `agg()`

Inside each context, you can use various **expressions** (aka. `Expr`). Some
expressions cannot be used in some contexts. For example, in `with_columns()`,
you can only apply expressions that return either the same number of values or a
single value that will be duplicated on all rows:

```{r}
test = pl$DataFrame(mtcars)
```

```{r}
# this works
test$with_columns(
  pl$col("mpg") + 1
)
```

```r
# this doesn't work because it returns only 2 values, while mtcars has 32 rows.
test$with_columns(
  pl$col("mpg")$slice(0, 2)
)
```

By contrast, in an `agg()` context, any number of return values are possible, as
they are returned in a list, and only the new columns or the grouping columns
are returned.

```{r}
test$group_by(pl$col("cyl"))$agg(
  pl$col("mpg"), # varying number of values
  pl$col("mpg")$slice(0, 2)$name$suffix("_sliced"), # two values
  # aggregated to one value and implicitly unpacks list
  pl$col("mpg")$sum()$name$suffix("_summed")
)
```

## Expressions

Expressions are the building blocks that give all the flexibility we need to
modify or create new columns.

Two important expressions starters are `pl$col()` (names a column in the context)
and `pl$lit()` (wraps a literal value or vector/series in an Expr). Most other
expression starters are syntactic sugar derived from thereof, e.g. `pl$sum(_)` is
actually `pl$col(_)$sum()`.

Expressions can be chained with more than 170 expression methods such as `$sum()`
which aggregates e.g. the column with summing.

```{r}
# two examples of starting, chaining and combining expressions
pl$DataFrame(a = 1:4)$with_columns(
  # compute the cosine of column "a"
  a_cos = pl$col("a")$cos()$sin(),
  # standardize the values of column "a"
  a_stand = (pl$col("a") - pl$col("a")$mean()) / pl$col("a")$std(),
  # take 1:3, name it, then sum, then multiply by two
  lit_sum_add_two = pl$lit(1:3)$sum() * 2L
)
```

Some methods share a common name but their behavior might be very different
depending on the input type. For example, `$decode()` doesn't do the same thing
when it is applied on binary data or on string data.

To be able to distinguish those usages and to check the validity of a query,
`polars` stores methods in subnamespaces. For each datatype other than numeric
(floats and integers), there is a subnamespace containing the available methods:
`dt` (datetime), `list` (list), `str` (strings), `struct` (structs), `cat`
(categoricals) and `bin` (binary). As a sidenote, there is also an exotic
subnamespace called `meta` which is rarely used to manipulate the expressions
themselves. Each subsection in the "Expressions" section lists all operations
available for a specific subnamespace.

For a concrete example for `dt`, suppose we have a column containing dates and
that we want to extract the year from these dates:

```{r}
# Create the DataFrame
df = pl$DataFrame(
  date = pl$date_range(
    as.Date("2020-01-01"),
    as.Date("2023-01-02"),
    interval = "1y"
  )
)
df
```

The function `year()` only makes sense for date-time data, so we look for functions
in the `dt` subnamespace (for **d**ate-**t**ime):

```{r}
df$with_columns(
  year = pl$col("date")$dt$year()
)
```

Similarly, to convert a string column to uppercase, we use the `str` prefix
before using `to_uppercase()`:

```{r}
# Create the DataFrame
pl$DataFrame(foo = c("jake", "mary", "john peter"))$
  with_columns(upper = pl$col("foo")$str$to_uppercase())
```