Handle non-capturing groups in regex transforms (partial-train/*.parquet). #774

Merged
merged 1 commit into from
Nov 29, 2024

Conversation

@marcenacp marcenacp commented Nov 29, 2024

The initial goal of this PR is to support Hugging Face's regex pattern: "default/(?:partial-)?(train|test)/.+parquet$".

In the parquet branch, when a dataset has not been entirely converted to parquet, the split folder may carry a partial- prefix (for example partial-train), so those files have to be handled too.
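To illustrate how the non-capturing group behaves, here is a minimal sketch using only Python's standard `re` module. The helper name `split_of` and the file paths are hypothetical and are not taken from mlcroissant; only the pattern itself comes from the description above.

```python
import re

# Pattern quoted in the PR description. `(?:partial-)?` is a non-capturing,
# optional group, so the only capture group is the split name.
PATTERN = re.compile(r"default/(?:partial-)?(train|test)/.+parquet$")


def split_of(path):
    """Return the split captured from `path`, or None if it does not match.

    Hypothetical helper for illustration only.
    """
    match = PATTERN.match(path)
    return match.group(1) if match else None


# Hypothetical paths: both layouts resolve to the same split.
print(split_of("default/train/0000.parquet"))
print(split_of("default/partial-train/0000.parquet"))
```

Because the `partial-` prefix sits in a non-capturing group, `group(1)` is the split name in both cases, so a filter on the split does not need to know which folder layout was used.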

The difficulty is that transform.regex is a regular expression while includes is a glob pattern, so we have to convert from one to the other.
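One way to picture the conversion is the rough sketch below. It is not the code from this PR: `regex_to_glob` is a hypothetical helper that only handles simple, non-nested groups, and it assumes that a glob broader than the regex is acceptable because the regex is applied again afterwards.

```python
import fnmatch
import re


def regex_to_glob(regex):
    """Approximate a simple path regex with a broader glob pattern.

    Hypothetical helper: handles only non-nested groups, `.+`/`.*`,
    escaped dots, and the `^`/`$` anchors.
    """
    glob = regex.lstrip("^").rstrip("$")
    # Any group, capturing or not, with an optional quantifier -> wildcard.
    glob = re.sub(r"\((?:\?:)?[^()]*\)[?*+]?", "*", glob)
    # `.+` and `.*` -> wildcard.
    glob = re.sub(r"\.[+*]", "*", glob)
    # An escaped dot is a literal dot in a glob.
    glob = glob.replace(r"\.", ".")
    # Collapse runs of wildcards into a single one.
    return re.sub(r"\*+", "*", glob)


REGEX = r"default/(?:partial-)?(train|test)/.+parquet$"
GLOB = regex_to_glob(REGEX)

# Hypothetical file listing: the glob selects candidates, the regex refines.
paths = [
    "default/train/0000.parquet",
    "default/partial-train/0000.parquet",
    "default/validation/0000.parquet",
    "README.md",
]
candidates = [p for p in paths if fnmatch.fnmatchcase(p, GLOB)]
matched = [p for p in candidates if re.match(REGEX, p)]
```

The glob deliberately over-matches (it also accepts `default/validation/0000.parquet` here), which is why the sketch still runs the original regex on the candidates.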

Before this PR, the following command fails:

mlcroissant load --jsonld https://huggingface.co/api/datasets/mlfoundations/dclm-baseline-1.0-parquet/croissant --record_set default --num_records 1 --debug --filters '{"default/split": "train"}'

After this PR, the command succeeds, although there is another regression: all parquet files are downloaded before the join, instead of being joined file by file.

@marcenacp marcenacp requested a review from a team as a code owner November 29, 2024 12:37

github-actions bot commented Nov 29, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@marcenacp marcenacp changed the title Handle non-capturing groups in regex transforms. Handle non-capturing groups in regex transforms (partial-train/*.parquet). Nov 29, 2024

@ccl-core ccl-core left a comment

Very nice! Thank you :)

@marcenacp marcenacp merged commit d1e81bd into main Nov 29, 2024
12 checks passed
@marcenacp marcenacp deleted the feature/efficient-filtering-2 branch November 29, 2024 13:24
@github-actions github-actions bot locked and limited conversation to collaborators Nov 29, 2024