Openai-L-14-336 #19

Open
LKELN opened this issue Dec 9, 2024 · 4 comments

LKELN commented Dec 9, 2024

Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')
text_features = l2v.encode(captions, convert_to_tensor=True).to("cuda")
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.get_image_features(input_pixels)
    text_features = model.get_text_features(text_features)

    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)

When I try to use LLM2CLIP-Openai-L-14-336, this error appears. Can you fix it?
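
A minimal sketch of the failure mode, assuming the loaded object is a stock transformers CLIPModel: its get_text_features expects tokenized input_ids (Long indices) that are looked up in the token Embedding layer, so passing the float tensor returned by l2v.encode trips the embedding lookup:

import torch
import torch.nn as nn

# Same vocabulary/width as the token_embedding of a CLIP text tower.
token_embedding = nn.Embedding(49408, 512)

input_ids = torch.tensor([[49406, 320, 1929, 49407]])  # Long token indices: OK
print(token_embedding(input_ids).shape)                # torch.Size([1, 4, 512])

float_features = torch.randn(2, 4096)                  # e.g. LLM sentence embeddings
token_embedding(float_features)                        # raises the same "Expected tensor for
                                                       # argument #1 'indices' ..." error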

LKELN (Author) commented Dec 9, 2024

The model structure loaded by your code is:
CLIPModel(
  (text_model): CLIPTextTransformer(
    (embeddings): CLIPTextEmbeddings(
      (token_embedding): Embedding(49408, 512)
      (position_embedding): Embedding(77, 512)
    )
    (encoder): CLIPEncoder(
      (layers): ModuleList(
        (0-11): 12 x CLIPEncoderLayer(
          (self_attn): CLIPSdpaAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
    (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (vision_model): CLIPVisionTransformer(
    (embeddings): CLIPVisionEmbeddings(
      (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
      (position_embedding): Embedding(577, 1024)
    )
    (pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (encoder): CLIPEncoder(
      (layers): ModuleList(
        (0-23): 24 x CLIPEncoderLayer(
          (self_attn): CLIPSdpaAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (layer_norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=1024, out_features=4096, bias=True)
            (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          )
          (layer_norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
    (post_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (visual_projection): Linear(in_features=1024, out_features=1280, bias=False)
  (text_projection): Linear(in_features=512, out_features=1280, bias=False)
)
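
A quick way to confirm which class was actually instantiated is to inspect its name and module; a minimal sketch, assuming `model` is the object loaded above:

print(type(model).__name__)                           # "CLIPModel" -> the stock transformers class
print(type(model).__module__)                         # custom checkpoint code would live under a "transformers_modules..." module
print(getattr(model.config, "architectures", None))   # architectures recorded in the checkpoint's config.json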

raytrun (Collaborator) commented Dec 13, 2024

Could you provide more complete code? The example given here runs correctly.

LKELN (Author) commented Dec 17, 2024

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

processor = CLIPImageProcessor.from_pretrained("/group/40048/keningliu/tools/models/clip-vit-large-patch14-336")
model_name_or_path = "/group/40048/keningliu/tools/models/LLM2CLIP-Openai-L-14-336"  # or /path/to/local/LLM2CLIP-Openai-L-14-336
model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=False).to('cuda').eval()
print(model)
print(type(model).__mro__)
captions = ["a diagram", "a dog", "a cat"]
image_path = "/group/40048/keningliu/tools/FlagData/pipeline.png"

image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')
text_features = l2v.encode(captions, convert_to_tensor=True).to('cuda')  # l2v: the LLM-based text encoder (LLM2Vec), initialized separately
print(model)
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.get_image_features(input_pixels)
    text_features = model.get_text_features(text_features)

    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
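
One difference worth checking against the usage example: the snippet above passes trust_remote_code=False. If the checkpoint ships a custom model class via auto_map in its config.json, AutoModel only instantiates that class (and its extra weights) when remote code is allowed, and may otherwise fall back to the plain CLIPModel shown earlier. A minimal sketch of the load call with remote code enabled (path is a placeholder):

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "/path/to/local/LLM2CLIP-Openai-L-14-336",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # allow the repo's custom modeling code to be used
).to("cuda").eval()

print(type(model))  # should no longer be the stock transformers CLIPModel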

LKELN (Author) commented Dec 17, 2024

I found that when loading the model, the weight files mentioned in your paper were not loaded. Instead, only the CLIP structure was loaded, which makes it impossible to conduct inference.
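
A minimal diagnostic sketch for this, assuming a local copy of the checkpoint (placeholder path): ask transformers to report which checkpoint weights were left unused and which model parameters were newly initialized. Large lists here would confirm that the checkpoint's weights did not end up in the instantiated model.

import torch
from transformers import AutoModel

model, loading_info = AutoModel.from_pretrained(
    "/path/to/local/LLM2CLIP-Openai-L-14-336",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    output_loading_info=True,   # returns (model, dict of missing/unexpected/mismatched keys)
)
print("missing keys:   ", len(loading_info["missing_keys"]))
print("unexpected keys:", len(loading_info["unexpected_keys"]))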
