Create DataFrame from Expression.evaluate(), but out of core #2105

NickCrews · 2022-06-24T20:22:40Z

Hi! Thanks for your help.

I need to take the element-wise cartesian product of two columns of lists. I think this is one case of a more general problem: transforming a DF in a way that changes the number of rows. I have something working, but only for DFs that fit in memory.

Starting with:

	x	y
0	[1 2]	[11 12]
1	[3 4]	[13 14]

My goal is to get:

	x	y
0	1	11
1	1	12
2	2	11
3	2	12
4	3	13
5	3	14
6	4	13
7	4	14
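For reference, the target transformation can be pinned down in plain Python. This is only a minimal sketch of the intended semantics on in-memory lists, not a vaex or pyarrow implementation; the name `cartesian_rows` is made up for illustration.

```python
from itertools import product


def cartesian_rows(xs, ys):
    """Yield (x, y) pairs: for each input row, the cartesian product
    of that row's x list with that row's y list."""
    for x_list, y_list in zip(xs, ys):
        yield from product(x_list, y_list)


rows = list(cartesian_rows([[1, 2], [3, 4]], [[11, 12], [13, 14]]))
for x, y in rows:
    print(x, y)
```

Each input row with list lengths m and n contributes m * n output rows, which is why the output cannot be attached back to the original DF as a column.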

Here is an implementation that works:

import pyarrow as pa
import pyarrow.compute as pc
import vaex
from vaex.dataframe import DataFrame
from vaex.expression import Expression

# Don't worry about understanding this, it's just here to make it work
def explode_table(table: pa.Table, column: str) -> pa.Table:
    """Analogous to pandas.DataFrame.explode()
    
    https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html
    """
    null_filled = pc.fill_null(table[column], [None])
    flattened = pc.list_flatten(null_filled)
    other_columns = list(table.schema.names)
    other_columns.remove(column)
    if len(other_columns) == 0:
        return pa.table({column: flattened})
    else:
        indices = pc.list_parent_indices(null_filled)
        result = table.select(other_columns).take(indices)
        result = result.append_column(
            pa.field(column, table.schema.field(column).type.value_type),
            flattened,
        )
        return result

def cartesian_product_chunk(left: pa.ListArray, right: pa.ListArray) -> pa.Array:
    result: pa.Table = pa.table(
        {
            "left": left,
            "right": right,
        }
    )
    result = explode_table(result, "left")
    result = explode_table(result, "right")
    # Result is a pa.Table, but when calling vaex.DataFrame.apply, our
    # function can only return a 1-D array, because the result of apply is
    # supposed to be an expression.
    # So, turn this pa.Table into a pa.StructArray. Should be minimal copying.
    # We must call .combine_chunks() because .from_arrays() expects pa.Arrays,
    # not pa.ChunkedArrays.
    x = result["left"].combine_chunks()
    y = result["right"].combine_chunks()
    return pa.StructArray.from_arrays((x, y), names=('x', 'y'))

def cartesian_product(df: DataFrame, left: str, right: str) -> Expression:
    return df.apply(cartesian_product_chunk, [left, right], vectorize=True)

df = vaex.from_arrays(
    x=pa.array([[1, 2], [3, 4]]),
    y=pa.array([[11, 12], [13, 14]]),
)
df2 = vaex.from_arrays(
    carted=cartesian_product(df, "x", "y").evaluate()
)
# Unpack the 1-D array back into separate columns
df2["x"] = df2["carted"].struct.get("x")
df2["y"] = df2["carted"].struct.get("y")
df2

	carted	x	y
0	{'x': 1, 'y': 11}	1	11
1	{'x': 1, 'y': 12}	1	12
2	{'x': 2, 'y': 11}	2	11
3	{'x': 2, 'y': 12}	2	12
4	{'x': 3, 'y': 13}	3	13
5	{'x': 3, 'y': 14}	3	14
6	{'x': 4, 'y': 13}	4	13
7	{'x': 4, 'y': 14}	4	14

The issue is that I have to call Expression.evaluate(), which materializes the entire result in memory. I can't add the expression back into the original dataframe, because cartesian_product() increases the number of rows. I think the same problem arises if rows are thrown out, or if rows are transposed, or more generally whenever there isn't a 1:1 mapping between input rows and output rows.

Is there a way to create a new DF by evaluating the expression in chunks? I'd rather not, but I could evaluate it in chunks, stream the results to a file using pyarrow, and then read that file back in with vaex. It seems like a vaex.from_expression() constructor could do this. I'm assuming it doesn't exist because it goes against the design principles: it would encourage people to eagerly evaluate expressions all the time, which is not the style of vaex?

maartenbreddels · 2022-07-26T15:48:11Z

Maintainer

I'm not 100% sure how to attack this issue, but I think this can be lazily implemented using a column, see:
https://github.com/vaexio/vaex/blob/master/packages/vaex-core/vaex/column.py

which is used for vaex.vrange, and several other places.
So I think we could have an 'explode' using the column classes as a first step, does that make sense?
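To make the lazy idea concrete: an exploded column does not need the flattened values up front, only a mapping from each output row back to its parent row and position within the parent's list. The following plain-Python sketch shows that mapping with cumulative offsets and a binary search. It is not vaex's column class and does not use its interface; `LazyExplodeIndex` is a hypothetical name. Unlike pandas' explode, it skips empty lists instead of emitting a null row.

```python
from bisect import bisect_right
from itertools import accumulate


class LazyExplodeIndex:
    """Map an exploded row number to (parent_row, position_in_list)
    using only the list lengths, without materializing any values."""

    def __init__(self, lengths):
        # offsets[i] is the first output row that belongs to parent row i
        self.offsets = [0, *accumulate(lengths)]

    def __len__(self):
        return self.offsets[-1]

    def locate(self, row):
        if not 0 <= row < len(self):
            raise IndexError(row)
        parent = bisect_right(self.offsets, row) - 1
        return parent, row - self.offsets[parent]


index = LazyExplodeIndex([2, 0, 3])
located = [index.locate(i) for i in range(len(index))]
print(located)
```

A chunk of output rows can then be produced on demand by looking up only the parents that the chunk touches.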

1 reply

NickCrews Jul 26, 2022
Author

You're suggesting adding explode to the DF API? As in, I could just do:

df = vaex.from_arrays(
    x=pa.array([[1, 2], [3, 4]]),
    y=pa.array([[11, 12], [13, 14]]),
)
df2 = df.explode("x").explode("y")

and be done? That would be great for my use case.

It doesn't solve the larger, more general use case that this is an example of, which is "transform a DF in a way that modifies the number of rows, using apply to operate in chunks". Do you want to tackle that use case? I think it is related to the generalized groupby-apply problem.
