Rewrite of r-polars
I have been developing r-polars for over a year, and I felt that a significant rewrite was necessary. r-polars is a clone of py-polars / polars-python, but the package structure is currently quite different. Therefore, it was difficult to keep up with frequent updates.
I thought that now, around the release of Python Polars 1.0.0, is a good time for a complete rewrite, so I decided to try it.
There are several reasons to rewrite r-polars on both the Rust and R sides.
- Appropriate file division. Due to the limitations of extendr, it is not possible to place multiple impl blocks. (extendr/extendr#538)
- Error handling. There is a lot of custom code to use the Result type with extendr, which is quite different from other packages based on extendr. (extendr/extendr#650)
- Simplify type conversion. The code is difficult to follow because it
uses a macro called
robj_to
for type conversion (at least in rust-analyzer).
About 1 and 2, I expect that switching from extendr to savvy will improve the situation.
For 3, in py-polars and nodejs-polars, a thin Wrap
struct wraps other
types and processes them with standard From
traits etc., which I think
makes the code cleaner.
- The structure of classes. In py-polars, the strategy is that classes
defined on the Rust side (e.g.,
PyDataFrame
) are wrapped by classes defined on the Python side (e.g.,DataFrame
). In r-polars, a complex strategy is adopted to update classes created by Rust side/extendr (e.g.,RPolarsDataFrame
) with a lot of custom code. (This is also related to the fact that extendr makes associated functions of Rust structs members of R classes. savvy does not mix associated functions and methods.) - S3 methods first. This is also related to the Rust side, in the
current r-polars, generic functions like
as_polars_series
were added later, so there are several places where type conversion from R to Polars is done on the Rust side, making it difficult to understand where the type conversion is done. If type conversion from R to Polars is done with two generic functions,as_polars_series
andas_polars_expr
, the code will be much simpler and customization from the R side will be possible. - Error handling. Currently, r-polars has its own Result type on the R
side, and error handling is done through it. The backtrace generated
that is quite easy to understand, but it is not necessarily easy to
use when using polars internally in other packages, such as
testthat::expect_error()
. - Based on
rlang
. Currently, r-polars has no R package dependencies. This is great, but that includes a degraded copy oflist2()
instead of the convenient functions in therlang
package.rlang
is a lightweight R package, and I feel that it is more beneficial to depend on the convenient functions ofrlang
than to stick to no dependencies.
1 and 3 are also related to the fact that it is built with extendr, and
it seems that switching to savvy is appropriate here as well. If we
abandon the current Result type on the R side, it is natural to use
rlang
for error handling, so from that perspective, it is reasonable
to depend on rlang
in 4.
The directory structure on the Rust side is a complete copy of py-polars. The structure of R classes is also the same as py-polars.
The basic classes such as DataFrame
, Series
, Expr
, and LazyFrame
have been implemented, and some functions work correctly.
df <- pl$DataFrame(
A = 1:5,
fruits = c("banana", "banana", "apple", "apple", "banana"),
B = 5:1,
cars = c("beetle", "audi", "beetle", "beetle", "beetle"),
)
df$sort("fruits")$select(
"fruits",
"cars",
pl$lit("fruits")$alias("literal_string_fruits"),
pl$col("B")$filter(pl$col("cars") == "beetle")$sum(),
pl$col("A")$filter(pl$col("B") > 2)$sum()$over("cars")$alias("sum_A_by_cars"),
pl$col("A")$sum()$over("fruits")$alias("sum_A_by_fruits"),
pl$col("A")$reverse()$over("fruits")$alias("rev_A_by_fruits"),
pl$col("A")$sort_by("B")$over("fruits")$alias("sort_A_by_B_by_fruits"),
)
#> shape: (5, 8)
#> ┌────────┬────────┬───────────────────────┬─────┬───────────────┬─────────────────┬─────────────────┬───────────────────────┐
#> │ fruits ┆ cars ┆ literal_string_fruits ┆ B ┆ sum_A_by_cars ┆ sum_A_by_fruits ┆ rev_A_by_fruits ┆ sort_A_by_B_by_fruits │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ str ┆ str ┆ str ┆ i32 ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
#> ╞════════╪════════╪═══════════════════════╪═════╪═══════════════╪═════════════════╪═════════════════╪═══════════════════════╡
#> │ apple ┆ beetle ┆ fruits ┆ 11 ┆ 4 ┆ 7 ┆ 4 ┆ 4 │
#> │ apple ┆ beetle ┆ fruits ┆ 11 ┆ 4 ┆ 7 ┆ 3 ┆ 3 │
#> │ banana ┆ beetle ┆ fruits ┆ 11 ┆ 4 ┆ 8 ┆ 5 ┆ 5 │
#> │ banana ┆ audi ┆ fruits ┆ 11 ┆ 2 ┆ 8 ┆ 2 ┆ 2 │
#> │ banana ┆ beetle ┆ fruits ┆ 11 ┆ 4 ┆ 8 ┆ 1 ┆ 1 │
#> └────────┴────────┴───────────────────────┴─────┴───────────────┴─────────────────┴─────────────────┴───────────────────────┘
Errors is displayed in a way that is not as bad. (Thanks, @etiennebacher)
# Error from the Rust side
pl$DataFrame(a = "a")$cast(a = pl$Int8)
#> Error:
#> ! Evaluation failed in `$cast()`.
#> Caused by error:
#> ! Evaluation failed in `$collect()`.
#> Caused by error:
#> ! Invalid operation: conversion from `str` to `i8` failed in column 'a' for 1 out of 1 values: ["a"]
# Error from the R side
pl$DataFrame(a = "a")$cast(a = integer)
#> Error:
#> ! Evaluation failed in `$cast()`.
#> Caused by error:
#> ! Evaluation failed in `$cast()`.
#> Caused by error:
#> ! Dynamic dots `...` must be polars data types, got function
The functionality to dispatch the methods of Expr
to Series
has also
been implemented.
s <- as_polars_series(mtcars)
s$struct$field |>
body()
#> {
#> wrap({
#> expr <- do.call(fn, as.list(match.call()[-1]), envir = parent.frame())
#> wrap(`_s`)$to_frame()$select(expr)$to_series()
#> })
#> }
s$struct$field("am")
#> shape: (32,)
#> Series: 'am' [f64]
#> [
#> 1.0
#> 1.0
#> 1.0
#> 0.0
#> 0.0
#> …
#> 1.0
#> 1.0
#> 1.0
#> 1.0
#> 1.0
#> ]
Due to the changes in the package structure, it is now possible to add namespaces, which was not possible with the current r-polars.
math_shortcuts <- function(s) {
# Create a new environment to store the methods
self <- new.env(parent = emptyenv())
# Store the series
self$`_s` <- s
# Add methods
self$square <- function() self$`_s` * self$`_s`
self$cube <- function() self$`_s` * self$`_s` * self$`_s`
# Set the class
class(self) <- c("polars_namespace_series", "polars_object")
# Return the environment
self
}
pl$api$register_series_namespace("math", math_shortcuts)
s <- as_polars_series(c(1.5, 31, 42, 64.5))
s$math$square()$rename("s^2")
#> shape: (4,)
#> Series: 's^2' [f64]
#> [
#> 2.25
#> 961.0
#> 1764.0
#> 4160.25
#> ]
It is now possible to have different bindings for each instance.
# The `fields` binding is only available for `Struct`.
pl$Struct(a = pl$Int32)$fields
#> $a
#> Int32
pl$Int32$fields
#> Error in `pl$Int32$fields`:
#> ! $ - syntax error: `fields` is not a member of this polars object
Due to the changes in the R class structure, the methods are now dynamically added by a loop each time an R class is built. So I’m worried that the performance will degrade after a large number of methods are available. However, it is difficult to compare this at the moment.
I would like to check if it is possible to implement a process like
map_elements
that calls the R from the Rust side.