Can't load reddit_tifu: NonMatchingChecksumError #5729

Open
bermeitinger-b opened this issue Nov 8, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@bermeitinger-b

Short description
reddit_tifu cannot be loaded because the downloaded file's checksum does not match the registered one.

Environment information

  • Operating System: Colab and Linux

  • Python version: 3.10, 3.11, 3.12

  • tensorflow-datasets/tfds-nightly version: tfds==4.9.6 and tfds-nightly==4.9.7+nightly

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)? yes

Reproduction instructions

import tensorflow_datasets as tfds
ds = tfds.load("reddit_tifu/short", split="train", as_supervised=True)

https://colab.research.google.com/drive/12x9Ch4u-eb5bzYrEW4FM65zbrvRc83Zn?usp=sharing

Link to logs

NonMatchingChecksumError                  Traceback (most recent call last)

<ipython-input-3-257e87e56f6f> in <cell line: 1>()
----> 1 ds = tfds.load("reddit_tifu/short", split="train", as_supervised=True)

19 frames

/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/download/download_manager.py in _register_or_validate_checksums(self, url, url_info, path)
    519             'https://www.tensorflow.org/datasets/overview#fixing_nonmatchingchecksumerror'
    520         )
--> 521         raise NonMatchingChecksumError(msg)
    522 
    523   def _is_checksum_registered(self, url: str) -> bool:

NonMatchingChecksumError: Artifact https://drive.google.com/uc?export=download&id=1ffWfITKFMJeqjT8loC8aiCLRNJpc_XnF, downloaded to /root/tensorflow_datasets/downloads/reddit_tifu/ucexport_download_id_1ffWfITKFMJeqjT8loC8a_XnFKBkIaeZ0glhNGWombXK7QgN8uq9nDHA-eFtk8ZIIqCA.tmp.aac8a1d89e2b4836802068de76df5ab1/download, has wrong checksum:
* Expected: UrlInfo(size=639.54 MiB, checksum='f175cafe348e0521c2424cd419c934d10c6af613ed8cbe8eaa8cfbaa06377f1a', filename='tifu_all_tokenized_and_filtered.json')
* Got: UrlInfo(size=2.39 KiB, checksum='f9f5d613a6fb71a51c9e1b9622c61b61eea8caf15066bf80007ec8b10b28992e', filename='download')
To debug, see: https://www.tensorflow.org/datasets/overview#fixing_nonmatchingchecksumerror

Expected behavior
The dataset should load.

bermeitinger-b added the bug label Nov 8, 2024
@fylux
Collaborator

fylux commented Nov 11, 2024

It seems this data is stored in Google Drive, and Google Drive now shows a page before the download stating that it cannot scan the file for viruses. As a result, TFDS downloads the HTML warning page instead of the data, which would explain why the file it got is only 2.39 KiB rather than the expected 639.54 MiB.
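One way to confirm this diagnosis is to look at the first bytes of the file left in the downloads directory. The sketch below is a minimal, stdlib-only check; the helper name `looks_like_html` and the 512-byte sniff window are illustrative choices and not part of TFDS.

```python
from pathlib import Path


def looks_like_html(path, sniff_bytes=512):
    """Return True if the file at `path` appears to start with an HTML document."""
    # Read only the head of the file; the real dataset is a large JSON file.
    with Path(path).open("rb") as f:
        head = f.read(sniff_bytes)
    # Drop an optional UTF-8 BOM and leading whitespace, then compare
    # case-insensitively against common HTML openers.
    head = head.removeprefix(b"\xef\xbb\xbf").lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")
```

If the diagnosis is right, calling this on the `download` file named in the error message should return `True`, while a genuine JSON file should return `False`.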

There are 2 potential solutions:

  • Store the data somewhere designed for public programmatic download, such as GCS or Hugging Face
  • Mark this dataset (and other Drive-hosted datasets) as manual download, so that TFDS does not try to download the data automatically
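For the manual-download route, someone who fetches the file through a browser may want to verify it before pointing TFDS at it. The following is a minimal sketch, assuming the `checksum` field in the log is a SHA-256 hex digest (it is 64 hex characters long, which is consistent with that). `sha256_of_file` and `matches_expected` are hypothetical helper names, not TFDS functions.

```python
import hashlib

# Expected checksum copied from the NonMatchingChecksumError message in the logs.
EXPECTED_SHA256 = "f175cafe348e0521c2424cd419c934d10c6af613ed8cbe8eaa8cfbaa06377f1a"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's digest equals the expected checksum."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("tifu_all_tokenized_and_filtered.json")` could be run on the manually downloaded file; the filename is the one shown in the expected `UrlInfo` in the log, and the local path is a placeholder.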
