Handle non-capturing groups in regex transforms (`partial-train/*.parquet`). #774

marcenacp · 2024-11-29T12:37:44Z

This PR was prompted by Hugging Face's regex pattern: `default/(?:partial-)?(train|test)/.+parquet$`.

In the parquet branch, when a dataset has not been entirely converted to parquet, the split folder may carry a `partial-` prefix (for example `partial-train`), so we have to handle those files as well.

The difficulty is that transform.regex is a regular expression while includes is a glob pattern, so we have to convert from one to the other.
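To make the conversion concrete, here is a minimal sketch of how such a regex could be expanded into glob patterns once the capturing group is pinned to a filter value. This is not the code from this PR: `regex_to_globs` is a hypothetical helper, and it assumes the regex contains only literal text, one capturing group, optional non-capturing groups with a literal body, and `.+`/`.*` wildcards.

```python
import fnmatch
import itertools
import re

# Optional non-capturing group with a literal body, e.g. "(?:partial-)?".
_OPTIONAL_GROUP = re.compile(r"\(\?:([^()|]*)\)\?")
# Plain capturing group, e.g. "(train|test)"; "(?:...)" is skipped by the lookahead.
_CAPTURING_GROUP = re.compile(r"\((?!\?)[^()]*\)")


def regex_to_globs(regex: str, value: str) -> list[str]:
    """Hypothetical helper: expands `regex` into globs for one captured value."""
    # Pin the first capturing group to the filtered value (e.g. "train").
    pattern = _CAPTURING_GROUP.sub(lambda _: value, regex, count=1)
    # Globs are anchored implicitly, and "*" stands in for ".+" / ".*".
    # Note that "*" also matches the empty string, so this is slightly more
    # permissive than ".+".
    pattern = pattern.rstrip("$").replace(".+", "*").replace(".*", "*")
    # re.split with one group alternates: literal, optional body, literal, ...
    parts = _OPTIONAL_GROUP.split(pattern)
    literals, optionals = parts[0::2], parts[1::2]
    globs = []
    # Each optional group is either absent ("") or present (its body).
    for choice in itertools.product(*[("", body) for body in optionals]):
        pieces = [literals[0]]
        for chosen, literal in zip(choice, literals[1:]):
            pieces += [chosen, literal]
        globs.append("".join(pieces))
    return globs


globs = regex_to_globs("default/(?:partial-)?(train|test)/.+parquet$", "train")
print(globs)  # ['default/train/*parquet', 'default/partial-train/*parquet']
print(any(fnmatch.fnmatch("default/partial-train/0000.parquet", g) for g in globs))
```

Enumerating both variants of each optional group keeps every resulting pattern a plain glob, at the cost of producing 2^n patterns for n optional groups.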

Before this PR, the following command fails:

mlcroissant load --jsonld https://huggingface.co/api/datasets/mlfoundations/dclm-baseline-1.0-parquet/croissant --record_set default --num_records 1 --debug --filters '{"default/split": "train"}'

After this PR, the command succeeds, although there is another regression: we download all parquet files before doing the join instead of joining file by file.

github-actions · 2024-11-29T12:37:56Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

ccl-core

Very nice! Thank you :)

marcenacp requested a review from a team as a code owner November 29, 2024 12:37

marcenacp force-pushed the feature/efficient-filtering-2 branch from 58fff40 to f0ef1aa Compare November 29, 2024 12:37

marcenacp changed the title ~~Handle non-capturing groups in regex transforms.~~ Handle non-capturing groups in regex transforms (partial-train/*.parquet). Nov 29, 2024

marcenacp force-pushed the feature/efficient-filtering-2 branch from f0ef1aa to 0cbd32f Compare November 29, 2024 12:45

Handle non-capturing groups in regex transforms.

2a5ee7e

marcenacp force-pushed the feature/efficient-filtering-2 branch from 0cbd32f to 2a5ee7e Compare November 29, 2024 12:53

ccl-core approved these changes Nov 29, 2024

marcenacp merged commit d1e81bd into main Nov 29, 2024
12 checks passed

marcenacp deleted the feature/efficient-filtering-2 branch November 29, 2024 13:24

github-actions bot locked and limited conversation to collaborators Nov 29, 2024
