CroissantBuilder does not work on Windows machines #5546

Open
zwouter opened this issue Aug 6, 2024 · 6 comments

@zwouter
zwouter commented Aug 6, 2024

Short description
A simple CroissantBuilder snippet that loads a dataset in the Croissant format only seems to work on Linux.
The code snippet below correctly downloads and prepares a dataset on Colab or WSL, but results in an error on Windows. Everything was tested in a clean virtual environment.

Environment information

  • Operating System: Windows 11

  • Python version: 3.11.1

  • tensorflow-datasets/tfds-nightly version: tfds-nightly 4.9.6.dev202408050044

  • tensorflow/tf-nightly version: tensorflow 2.17.0

  • Does the issue still exist with the last tfds-nightly package (pip install --upgrade tfds-nightly)?
    Yes
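
The version details listed above can be gathered with a short script, which makes it easier to compare the Windows, WSL, and Colab environments side by side. This is only a convenience sketch using the standard library; the package names are the ones mentioned in this report, and the helper names `installed_version` and `environment_report` are made up for illustration.

```python
import platform
from importlib import metadata

# Package names taken from the environment list above; adjust as needed.
PACKAGES = ("mlcroissant", "tfds-nightly", "tensorflow")


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=PACKAGES):
    """Collect OS, Python, and package versions into a plain dict."""
    report = {
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
    }
    for package in packages:
        report[package] = installed_version(package) or "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```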

Reproduction instructions

import mlcroissant as mlc
import tensorflow_datasets as tfds

url = "https://huggingface.co/api/datasets/fashion_mnist/croissant"
builder = tfds.core.dataset_builders.CroissantBuilder(jsonld=url, file_format='array_record')
builder.download_and_prepare()

Link to logs
https://pastebin.com/fRrfn8jj

Expected behavior
A dataset builder is prepared such that I can use .as_data_source() later.

@zwouter zwouter added the bug Something isn't working label Aug 6, 2024
@marcenacp
Collaborator

marcenacp commented Aug 7, 2024

Hey @zwouter, thanks a lot for opening the issue!

I don't have access to a Windows machine. Can you help us investigate? From the logs, it seems to come from mlc not yielding any example from the default split:

AssertionError: Failed to finalize writing of split "default"No examples were yielded.

For some reason, it tries to load the default split, but from the JSON-LD it seems that only the fashion_mnist split works.

  • Do you have the latest versions of both mlcroissant and tfds-nightly installed?
  • Can you confirm that the following code yields an example on Windows?
import mlcroissant as mlc
url = "http://huggingface.co/api/datasets/fashion_mnist/croissant"
ds = mlc.Dataset(url)
for x in ds.records(record_set="fashion_mnist"):
  print(x)

Thanks!

@zwouter
Author

zwouter commented Aug 7, 2024

Hi @marcenacp, thanks for the reply!

I have the latest versions of mlcroissant and tfds-nightly installed; I created a new virtual environment yesterday to test this.

That piece of code does not print anything when I run it on Windows.

@marcenacp
Collaborator

Weird!

Can you please try to delete all local caches? (Caches are located in ~/.cache/croissant for Croissant and ~/tensorflow_datasets for TFDS)
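
For reference, a minimal sketch of clearing both caches from Python, which avoids shell differences between Windows and Linux. It assumes the default locations named above, resolved against the user's home directory; if a custom data directory was configured, the paths would differ. The function name `clear_caches` is hypothetical.

```python
import shutil
from pathlib import Path

# Cache locations as given in the comment above, relative to the home directory.
# On Windows, Path.home() usually resolves to the user profile folder.
CACHE_SUBDIRS = (Path(".cache") / "croissant", Path("tensorflow_datasets"))


def clear_caches(home=None, subdirs=CACHE_SUBDIRS):
    """Delete each cache directory found under `home`; return the ones removed."""
    home = Path(home) if home is not None else Path.home()
    removed = []
    for subdir in subdirs:
        target = home / subdir
        if target.is_dir():
            shutil.rmtree(target)  # irreversible: removes the whole directory tree
            removed.append(target)
    return removed
```

Calling `clear_caches()` with no arguments acts on the real home directory, so downloaded datasets will have to be fetched again afterwards.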

@zwouter
Author

zwouter commented Aug 7, 2024

Yes, I just deleted the relevant caches; same results.

@marcenacp
Collaborator

marcenacp commented Aug 7, 2024

Sorry, that was a blind guess, as I cannot reproduce what happens on Windows. Could you please help us understand why the following snippet doesn't print anything?

import mlcroissant as mlc
url = "http://huggingface.co/api/datasets/fashion_mnist/croissant"
ds = mlc.Dataset(url)
for x in ds.records(record_set="fashion_mnist"):
  print(x)
  break

You can install mlcroissant in dev mode:

pip uninstall mlcroissant
git clone https://github.com/mlcommons/croissant
cd croissant/python/mlcroissant
pip install -e .[dev]

Adding prints or breakpoints in records and in the functions it calls should help you find something. The potential culprits could be:

  • Is a RecordSet even recognized (source)?
  • Is the data correctly downloaded and read (source)?
  • Are files properly filtered (source)? Maybe glob patterns work differently on Windows.
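
To make the last hypothesis concrete, here is a small sketch of one way a pattern match can differ between platforms. It is not confirmed to be the cause of this issue, and it does not use mlcroissant's own filtering code; the pattern and file name are hypothetical. It relies only on the standard library, where `PureWindowsPath` renders paths with backslashes and `fnmatchcase` compares strings without normalizing separators.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical include pattern and file name, chosen for illustration only.
PATTERN = "fashion_mnist/*.parquet"
RELATIVE_FILE = "fashion_mnist/train-00000.parquet"

# The same relative path rendered with each platform's separator.
posix_str = str(PurePosixPath(RELATIVE_FILE))      # forward slashes
windows_str = str(PureWindowsPath(RELATIVE_FILE))  # backslashes

# fnmatchcase compares the strings literally, without normalizing separators.
matches_posix = fnmatchcase(posix_str, PATTERN)
matches_windows = fnmatchcase(windows_str, PATTERN)

# Converting to POSIX form before matching makes both platforms agree.
matches_normalized = fnmatchcase(PureWindowsPath(RELATIVE_FILE).as_posix(), PATTERN)

print(matches_posix, matches_windows, matches_normalized)  # True False True
```

If something similar happens during file filtering, every file would be rejected on Windows and no records would be yielded, which would be consistent with the symptom reported here.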

For each source I gave you, you could debug the input/output to follow the data flow.
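
One generic way to follow the data flow is to wrap each intermediate iterable in a counting generator and see at which stage the item count drops to zero. The sketch below is plain Python and independent of mlcroissant; the helper name `trace` and the stage names are hypothetical, and applying it inside the library would mean editing the dev-mode install.

```python
def trace(stage, items, log=print):
    """Yield items unchanged, then report how many passed through `stage`."""
    count = 0
    try:
        for item in items:
            count += 1
            yield item
    finally:
        # Runs on exhaustion and also when the consumer stops early.
        log(f"{stage}: {count} item(s)")


if __name__ == "__main__":
    # Hypothetical file names standing in for downloaded dataset files.
    files = ["train-00000.parquet", "test-00000.parquet", "README.md"]
    listed = trace("listed", files)
    # A filter that wrongly rejects everything, to show where the flow dries up.
    filtered = trace("filtered", (f for f in listed if f.endswith(".csv")))
    records = list(filtered)
```

The first stage that reports zero items is the place to look more closely.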

Thanks in advance for your help and contribution!

@zwouter
Author

zwouter commented Aug 8, 2024

No problem, I'm happy with any help I can get :)

And thanks for the resources! Unfortunately, I don't think I have the time to completely debug this right now.
I might look into it further if I find some spare time in the coming weeks.

@marcenacp marcenacp assigned marcenacp and zwouter and unassigned marcenacp Aug 8, 2024