
Unable to load model in offline mode using local files #968

Closed
yarden4998 opened this issue Oct 22, 2024 · 5 comments

Comments

@yarden4998

Description:

I'm attempting to load a pretrained model (ViT-B-16-SigLIP-i18n-256) entirely in offline mode within a Docker container and AWS Lambda environment. Despite setting the appropriate environment variables for offline mode, the system still attempts to reach Hugging Face Hub, leading to the following error:
"We couldn't connect to 'https://huggingface.co/' to load this file, couldn't find it in the cached files and it looks like timm/ViT-B-16-SigLIP-i18n-256 is not the path to a directory containing a file named config.json.\nCheckout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'."

The open_clip functions I have used:

from open_clip import create_model_from_pretrained, get_tokenizer
model, preprocessor = create_model_from_pretrained(model_name="ViT-B-16-SigLIP-i18n-256", pretrained=model_bin_file_path)
tokenizer = get_tokenizer("ViT-B-16-SigLIP-i18n-256")

Steps Taken:

  • Set the following environment variables in the Dockerfile:
HF_DATASETS_OFFLINE="1"
TRANSFORMERS_OFFLINE="1"
HF_HUB_OFFLINE="1"
HF_HUB_CACHE=my_cache_dir
  • Tried to define these in my Lambda handler (see the sketch after this list):
    os.environ['HF_HUB_CACHE'] = my_cache_dir
    os.environ["HF_DATASETS_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"
    os.environ['HF_HUB_OFFLINE'] = "1"
  • Ensured that the following files are present in my cache directory in the correct structure:
    open_clip_config.json
    open_clip_pytorch_model.bin
    special_tokens_map.json
    tokenizer.json
    tokenizer_config.json
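
For reference, a minimal sketch of how this handler-level setup could be arranged, assuming the environment variables have to be in place before open_clip (and anything that imports huggingface_hub) is imported; all paths below are illustrative, not the actual ones used:

import os

# Set offline/cache variables before importing open_clip, since huggingface_hub and
# transformers typically read them at import time. Paths are illustrative assumptions.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"
os.environ["HF_HUB_CACHE"] = "/opt/hf_cache"  # hypothetical cache dir baked into the image

from open_clip import create_model_from_pretrained, get_tokenizer

# Load once per container, outside the Lambda handler.
model, preprocessor = create_model_from_pretrained(
    model_name="ViT-B-16-SigLIP-i18n-256",
    pretrained="/opt/hf_cache/open_clip_pytorch_model.bin",  # hypothetical local checkpoint path
)
tokenizer = get_tokenizer("ViT-B-16-SigLIP-i18n-256")

def handler(event, context):
    # run inference with model / tokenizer here
    return {"statusCode": 200}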

Environment:

  • Docker environment running a Lambda function
  • transformers version: 4.45.0
  • open_clip version: 2.26.1

Could you please provide guidance on why the offline mode isn’t functioning as expected, or if there are additional steps required to force the use of local files only?

@rwightman
Collaborator

rwightman commented Oct 22, 2024

So, I didn't mess around with changing the cache dir or anything, but I downloaded the model once using the default cache setup, unplugged my internet connection, and ran again with HF_HUB_OFFLINE=1, and it works fine; it loads everything from cache. I didn't bother using pretrained='/path/to/checkpoint' since the tokenizer for this model needs to use the cache.

Contents related to the model end up under /mycachedir/hub/models--timm--ViT-B-16-SigLIP-i18n-256/ ... but it's not a simple set of files; there are snapshot ids, refs, etc.

If you pre-load the cache and then copy it as-is into the Docker container, and make sure the base HF cache dir (not just a dir with the model files) matches that, I feel it should work.
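
If it helps, one way to pre-load the cache like this is with huggingface_hub's snapshot_download; a minimal sketch, assuming ./hf_cache is the directory that then gets copied into the image and pointed at via HF_HUB_CACHE (the path is only an example):

from huggingface_hub import snapshot_download

# Downloads the full repo (refs/, snapshots/, blobs/) into
# ./hf_cache/models--timm--ViT-B-16-SigLIP-i18n-256/, preserving the hub cache layout.
snapshot_download(
    repo_id="timm/ViT-B-16-SigLIP-i18n-256",
    cache_dir="./hf_cache",
)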

@rwightman
Collaborator

Also, here's an approach that doesn't use the global cache dir but specifies a local one for this specific model instantiation:

Doing this in a Python console, first with a connection, then without, works for me. Though it's possible some other cache hit against the global cache happened that I didn't notice; I didn't do it in isolation as it would be in a container.

NOTE: the tokenizer for SigLIP and other models where the tokenizer is on the Hub needs a cache_dir arg too.

TODO: I need to make the tokenizer cache_dir arg explicit and more noticeable; right now it's implicit, passed through kwargs to the underlying HF tokenizer wrapper (and it errors out on the models that don't pass through and need any tokenizer files).

import os
os.environ['HF_HUB_OFFLINE'] = '1'

import open_clip
mm = open_clip.create_model_and_transforms('ViT-B-16-SigLIP-i18n-256', pretrained='webli', cache_dir='./cc')
tok = open_clip.get_tokenizer('ViT-B-16-SigLIP-i18n-256', cache_dir='./cc')
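
(Note: create_model_and_transforms returns a (model, preprocess_train, preprocess_val) tuple, so mm above holds all three rather than just the model.)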

@yarden4998
Author

> So, I didn't mess around with changing the cache dir or anything, but I downloaded the model once using the default cache setup, unplugged my internet connection, and ran again with HF_HUB_OFFLINE=1, and it works fine; it loads everything from cache. I didn't bother using pretrained='/path/to/checkpoint' since the tokenizer for this model needs to use the cache.
>
> Contents related to the model end up under /mycachedir/hub/models--timm--ViT-B-16-SigLIP-i18n-256/ ... but it's not a simple set of files; there are snapshot ids, refs, etc.
>
> If you pre-load the cache and then copy it as-is into the Docker container, and make sure the base HF cache dir (not just a dir with the model files) matches that, I feel it should work.

As you were saying, running it locally after unplugging my internet connection, and after first downloading the models, works for me as well.
When I do have an internet connection, running the same code with os.environ['HF_HUB_OFFLINE'] = '1' leads to an HTTP request to the Hugging Face Hub. Why is that?
For the Docker container, I did pre-load the cache and then copy the models--timm--ViT-B-16-SigLIP-i18n-256 dir with all its subfolders and files into a dir named model. I then set the model dir as HF_HUB_CACHE and got the error above (while running an AWS Lambda function).

> Also, here's an approach that doesn't use the global cache dir but specifies a local one for this specific model instantiation:
>
> Doing this in a Python console, first with a connection, then without, works for me. Though it's possible some other cache hit against the global cache happened that I didn't notice; I didn't do it in isolation as it would be in a container.
>
> NOTE: the tokenizer for SigLIP and other models where the tokenizer is on the Hub needs a cache_dir arg too.
>
> TODO: I need to make the tokenizer cache_dir arg explicit and more noticeable; right now it's implicit, passed through kwargs to the underlying HF tokenizer wrapper (and it errors out on the models that don't pass through and need any tokenizer files).
>
> import os
> os.environ['HF_HUB_OFFLINE'] = '1'
>
> import open_clip
> mm = open_clip.create_model_and_transforms('ViT-B-16-SigLIP-i18n-256', pretrained='webli', cache_dir='./cc')
> tok = open_clip.get_tokenizer('ViT-B-16-SigLIP-i18n-256', cache_dir='./cc')

About that part, if I understand correctly, today we are not able to pass the cache_dir param to the get_tokenizer function, and it eventually takes the cache dir I have set as HF_HUB_CACHE?

@rwightman
Collaborator

@yarden4998 No, the cache_dir argument for get_tokenizer does work right now for models that pass through to a HF tokenizer, like this model; when I tried it above, I verified both the tokenizer and model files ended up in the ./cc folder. It's just not clear that it does so. If you don't do this, it will try to load the tokenizer from the default cache dir, and if that fails it will get stuck.
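
For context, a rough sketch of what that pass-through amounts to for a Hub-hosted tokenizer like this one, assuming it ends up calling AutoTokenizer.from_pretrained under the hood (an illustration, not open_clip's actual source):

from transformers import AutoTokenizer

# Roughly what the cache_dir kwarg ends up doing once it reaches the HF tokenizer wrapper:
# tokenizer files are resolved from (and downloaded into) the given cache dir.
tok = AutoTokenizer.from_pretrained("timm/ViT-B-16-SigLIP-i18n-256", cache_dir="./cc")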

I am not sure why there would still be any request when HF_HUB_OFFLINE is set; it must be non-blocking, as I don't see any hangs when my network connection is killed.

@rwightman
Collaborator

@yarden4998 I made a few cache_dir related improvements (and fixed one instance where it was missing for getting the model config); they can be tried on this branch: #970
