Drop all useless operations when we filter on a field - so we know its value in advance. #775

Merged
marcenacp merged 2 commits into main from feature/efficient-filtering-3
Nov 29, 2024

Conversation

marcenacp
Contributor

That way, we can:

  • download parquet 1
  • yield examples parquet 1
  • download parquet 2
  • etc.

Instead of:

  • downloading all parquets
  • yielding all examples from all parquets
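
The two orderings above can be sketched roughly as follows. This is only an illustration, not mlcroissant's actual code: `eager_examples`, `lazy_examples`, `download`, `read` and `downloads_before_first_example` are hypothetical names, and the "download" is a stub that just records that it was called.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Example = Tuple[str, int]


def eager_examples(
    urls: Iterable[str],
    download: Callable[[str], str],
    read: Callable[[str], Iterator[Example]],
) -> Iterator[Example]:
    """Old ordering: download every file first, then yield all examples."""
    paths = [download(url) for url in urls]
    for path in paths:
        yield from read(path)


def lazy_examples(
    urls: Iterable[str],
    download: Callable[[str], str],
    read: Callable[[str], Iterator[Example]],
) -> Iterator[Example]:
    """New ordering: download one file, yield its examples, then move on."""
    for url in urls:
        yield from read(download(url))


def downloads_before_first_example(strategy, urls: List[str]) -> int:
    """Counts how many (stubbed) downloads happen before the first example."""
    downloaded: List[str] = []

    def download(url: str) -> str:
        # Stub: a real implementation would fetch the parquet file here.
        downloaded.append(url)
        return url

    def read(path: str) -> Iterator[Example]:
        # Stub: pretend every file contains two examples.
        for i in range(2):
            yield (path, i)

    next(strategy(urls, download, read))
    return len(downloaded)


if __name__ == "__main__":
    files = ["parquet-1", "parquet-2", "parquet-3"]
    print("eager:", downloads_before_first_example(eager_examples, files))
    print("lazy: ", downloads_before_first_example(lazy_examples, files))
```

With the lazy ordering, a caller that only needs the first few records never triggers the remaining downloads, which is why a regression would show up as a timeout on a large dataset.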

I also added a non-hermetic test that should time out if this feature regresses.

@marcenacp marcenacp requested a review from a team as a code owner November 29, 2024 14:00

github-actions bot commented Nov 29, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@ccl-core ccl-core self-requested a review November 29, 2024 14:57
Contributor

@ccl-core ccl-core left a comment

Thanks!

@marcenacp marcenacp changed the title Drop all useless operations when we use filtering on a field - so we know its value in advance. Drop all useless operations when we filter on a field - so we know its value in advance. Nov 29, 2024
@@ -248,6 +253,15 @@ def test_nonhermetic_loading(version, dataset_name, record_set_name, num_records
["huggingface-c4/metadata.json", "data", 1, {"data/variant": "en"}],
["huggingface-levanti/metadata.json", "levanti_train", 10, None],
["huggingface-open-hermes/metadata.json", "default", 3, None],
# This dataset will timeout if the following feature is broken: mlcroissant
Contributor

Shall we add a meaningful error message somewhere?

Contributor Author

I would rather keep it simple with a comment. Otherwise it means intercepting pytest's timeout, creating our own timeout, etc.
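
For context, "creating our own timeout" with a meaningful error message could look roughly like the sketch below. It is not part of this PR: `run_with_timeout` is a hypothetical helper built on the standard library, and the worker thread is not cancelled when the deadline passes, which is part of why a plain comment is simpler.

```python
import concurrent.futures
import time


def run_with_timeout(fn, timeout_s: float, message: str):
    """Runs fn() in a worker thread and fails with `message` on timeout.

    Hypothetical helper for illustration only. The worker thread keeps
    running in the background after a timeout; it is not interrupted.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise AssertionError(message) from None
    finally:
        # Do not block on the still-running worker.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    try:
        run_with_timeout(
            lambda: time.sleep(0.3),
            timeout_s=0.05,
            message="Loading timed out: filtering may be downloading every file.",
        )
    except AssertionError as error:
        print(error)
```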

@marcenacp marcenacp merged commit b11924c into main Nov 29, 2024
12 checks passed
@marcenacp marcenacp deleted the feature/efficient-filtering-3 branch November 29, 2024 15:15
@github-actions github-actions bot locked and limited conversation to collaborators Nov 29, 2024
3 participants