
test speed improvements on agg functions by modifying scan_parquet() arguments and using collect_async() #4

Open
claireboyd opened this issue Feb 19, 2024 · 1 comment


@claireboyd
Collaborator

In thinking through opportunities to improve the speed of the aggregation functions (e.g. rank) on large parquet files, two real opportunities emerged: modifying the scan_parquet() arguments and switching from collect() to collect_async().

Tests on 1.1MB and 1.2GB parquet files produced a few key takeaways (see all data collected from the tests in the summary table):

  • In general, collect_async() saw larger performance gains on the larger file (notably when using parallel="auto")
  • collect_async() appears to work best when one or both of use_statistics and hive_partitioning are turned on

The current implementation uses collect() with parallel='auto', use_statistics=True, hive_partitioning=True. This corresponds most closely to the first row of the summary table (speed 762), which serves as the benchmark for the recommendations below.

Here are 3 recommended tests to try based on the 1.2GB run:

  • collect(), parallel='row_groups', use_statistics=False, hive_partitioning=False (~15% improvement over benchmark)
  • collect_async(), parallel='row_groups', use_statistics=True, hive_partitioning=True (~15% improvement over benchmark)
  • collect(), parallel='columns', use_statistics=False, hive_partitioning=True (~13% improvement over benchmark)
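For convenience, the three combinations above can be written as keyword dicts and unpacked into scan_parquet() with **kwargs. This is just a sketch: the PARAMS_test1/PARAMS_test3 names are illustrative (only PARAMS_test2 appears in the code later in this thread).

```python
# Illustrative dicts mirroring the three recommended tests above;
# each can be unpacked into pl.scan_parquet(..., **params).
PARAMS_test1 = {"parallel": "row_groups", "use_statistics": False, "hive_partitioning": False}
PARAMS_test2 = {"parallel": "row_groups", "use_statistics": True, "hive_partitioning": True}
PARAMS_test3 = {"parallel": "columns", "use_statistics": False, "hive_partitioning": True}
```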
@claireboyd
Collaborator Author

@nmarchio Here's an example of how collect_async works (it needs to be wrapped in a coroutine and run with asyncio instead of piped into the process). The code chunk below uses the params for the second recommended test above:

import asyncio
from pathlib import Path

import polars as pl

INPUT_FILENAME = "<filepath to parquet file>"
PARAMS_test2 = {'parallel': "row_groups", 'use_statistics': True, 'hive_partitioning': True}

async def test_collect_async(input_dir, **kwargs):
    # Build the lazy query, then await the async collection.
    return await (
        pl.scan_parquet(Path(input_dir), low_memory=True, **kwargs)
        .with_columns([
            # REPLACE COL NAMES HERE FOR THE RELEVANT OPERATION
            pl.col("AssdTotalValue")
            .rank(method="random", descending=True, seed=1)
            .over(['SaleRecordingYear', "county_code"])
            .alias("highestvalbycountybyyear"),
        ])
        .collect_async(streaming=True, simplify_expression=True)
    )

# Returns a DataFrame (not a LazyFrame)
df = asyncio.run(test_collect_async(INPUT_FILENAME, **PARAMS_test2))
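To compare the parameter combinations, the coroutine can be wrapped in a small timing helper. This is a standard-library sketch: the _fake_collect coroutine is a hypothetical stand-in so the example runs without a parquet file; in practice you would pass test_collect_async(INPUT_FILENAME, **PARAMS_test2) instead.

```python
import asyncio
import time

async def timed(coro):
    """Await a coroutine and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start

# Hypothetical stand-in coroutine so the sketch is runnable;
# substitute the real test_collect_async(...) call in practice.
async def _fake_collect():
    await asyncio.sleep(0.01)
    return "df"

result, elapsed = asyncio.run(timed(_fake_collect()))
print(f"collected in {elapsed:.3f}s")
```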
