Bug computing multiple predictions with the same classifier type #69

Closed
sarahyurick opened this issue Aug 7, 2024 · 1 comment

Comments

Collaborator

sarahyurick commented Aug 7, 2024

I'm currently working on NVIDIA/NeMo-Curator#173 for NeMo Curator, which uses a multifold quality classifier to produce text quality predictions and their probabilities. The goal is to compute a separate set of probabilities per model fold and average them into a final prediction. However, only the results from the first quality model in the pipeline are being saved, even though each model writes to differently named columns. See this notebook for an example.

@VibhuJawa suggested that the bug might be caused by CrossFit modifying the same internal flag in the Dask DataFrame. Calling persist() on the Dask DataFrames produces the correct results, but as I understand it this isn't desirable: persist() keeps the computed partitions in memory, and the intended use is to read, modify, and write very large JSONL files.
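
For reference, here is a minimal sketch of the fold-averaging step described above, written in plain Python rather than Dask or CrossFit so it runs on its own. The label names, the `average_folds` helper, and the probabilities are all hypothetical and only illustrate the intended arithmetic, not the actual NeMo Curator implementation.

```python
from statistics import fmean

# Hypothetical label set; the real classifier's labels may differ.
LABELS = ["High", "Medium", "Low"]


def average_folds(fold_probs):
    """Average per-fold class probabilities for one document.

    fold_probs: list of per-fold probability lists, one entry per label.
    Returns (averaged_probs, predicted_label).
    """
    if not fold_probs:
        raise ValueError("need at least one fold")
    # zip(*fold_probs) groups the probabilities of each class across folds.
    avg = [fmean(class_probs) for class_probs in zip(*fold_probs)]
    # The final prediction is the label with the highest averaged probability.
    pred = LABELS[max(range(len(avg)), key=avg.__getitem__)]
    return avg, pred


# Three folds scoring the same document (made-up numbers).
folds = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.7, 0.2, 0.1],
]
avg_probs, label = average_folds(folds)
print(avg_probs, label)
```

If only one fold's probabilities reach this step, the average is just that fold's output, which would be consistent with the symptom of only the first model's results being saved.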

@sarahyurick sarahyurick changed the title Bug computing multiple predictions of the same classifier type Bug computing multiple predictions with the same classifier type Aug 7, 2024
Collaborator Author

sarahyurick commented Nov 10, 2024

Closed by #99.
