I'm currently working on NVIDIA/NeMo-Curator#173 for NeMo Curator, which uses a multifold quality classifier to generate text quality predictions and their probabilities. The goal is to generate a separate probability per model fold and average them into a final prediction. However, I'm finding that only the results from the first quality model used in the pipeline are being saved, even though each fold writes to a differently named column. See this notebook for an example.
@VibhuJawa suggested that the bug might be caused by CrossFit modifying the same internal flag on the Dask DataFrame. Calling `persist()` on the Dask DataFrames produces the correct results, but from my understanding this isn't desirable here, since persisting materializes the data and the intended use case is reading, modifying, and writing very large JSONL files out-of-core.
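For context, here is a minimal sketch of the multifold pattern described above using plain Dask. The fold count, column names, and `predict_fold` helper are hypothetical stand-ins, not the actual NeMo Curator / CrossFit API; the `persist()` call marks the workaround that currently makes the results come out correct:

```python
import dask.dataframe as dd
import pandas as pd

NUM_FOLDS = 5  # hypothetical fold count for illustration

def predict_fold(df: pd.DataFrame, fold: int) -> pd.DataFrame:
    # Hypothetical stand-in for one fold's classifier; a real pipeline
    # would run that fold's model here instead of a dummy score.
    df = df.copy()
    df[f"quality_prob_fold_{fold}"] = 0.5  # placeholder probability
    return df

ddf = dd.read_json("input.jsonl", lines=True, blocksize="256MB")

for fold in range(NUM_FOLDS):
    # Each fold is expected to add its own, differently named column.
    ddf = ddf.map_partitions(predict_fold, fold)
    # Workaround mentioned above: persisting between folds yields correct
    # results, at the cost of materializing the whole frame in memory.
    ddf = ddf.persist()

# Average the per-fold probabilities into a final prediction.
prob_cols = [f"quality_prob_fold_{f}" for f in range(NUM_FOLDS)]
ddf["quality_prob_mean"] = ddf[prob_cols].mean(axis=1)
ddf.to_json("output", orient="records", lines=True)
```

Without the `persist()` inside the loop, the reported behavior is that only the first fold's column survives to the output, which is what makes the bug look like shared internal state on the DataFrame rather than a column-naming collision.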
sarahyurick changed the title from "Bug computing multiple predictions of the same classifier type" to "Bug computing multiple predictions with the same classifier type" on Aug 7, 2024.