arXiv | Colab | Documentation | Hugging Face
Top 30 predictions with probabilities from our model on the image of "The Legend of Zelda: Tears of the Kingdom" 1.
This is the official PyTorch implementation for the paper Object Recognition as Next Token Prediction accepted at CVPR 2024 (Highlight).
@inproceedings{nxtp,
title = {{Object Recognition as Next Token Prediction}},
author = {Kaiyu Yue and Bor-Chun Chen and Jonas Geiping and Hengduo Li and Tom Goldstein and Ser-Nam Lim},
booktitle = {Computer Vision and Pattern Recognition Conference (CVPR)},
year = {2024}
}
May 26, 2024
- add ImageNet experiments: see src/imagenet
- visualize attention maps in decoder layers during inference: see examples
Mar 17, 2024
- release the best 1.78B model trained on G70M
- export onnx models: docs/onnx-export
Mar 03, 2024
- add examples with top-20 predictions to this readme
- add CLIP ViT-L/14 as the textual embedding model in the evaluation metric (Table A.8 of the paper)
This project delves into a fundamental problem in computer vision, object recognition: translating an image into object labels.
Linear models (such as ResNet) and contrastive models (such as CLIP) require predefined labels before inference, limiting their flexibility in real-world applications.
We extend W, the label embeddings that such models fix before inference, to cover the entire textual space by using the token embeddings of a language model such as LLaMA (32K tokens). Our model then predicts labels in a real-open manner through auto-regression.
Additionally, our one-shot sampling technique enables efficient large-scale discriminative predictions, such as the top-100 labels.
The released models have 1.78B parameters. Truncating the decoder to a single transformer block gives a 0.77B-parameter model that still achieves competitive performance (Table 3 in the paper).
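The one-shot sampling idea mentioned above can be illustrated with a toy example. This is a minimal sketch of one simplified reading, not the repository's implementation: the initial tokens of all labels are picked together from a single next-token distribution, and each one is then extended independently. The lookup table `TOY_NEXT`, the helpers `next_token_probs` and `one_shot_topk`, the `<end>` delimiter, greedy continuation, and scoring a label by the product of its token probabilities are all assumptions made for illustration.

```python
# Hypothetical toy "decoder": maps a token prefix to a next-token distribution.
# A real model would condition on image embeddings; these numbers are made up.
TOY_NEXT = {
    (): {"cof": 0.5, "shop": 0.3, "cup": 0.2},
    ("cof",): {"fee": 0.9, "<end>": 0.1},
    ("cof", "fee"): {"<end>": 1.0},
    ("shop",): {"<end>": 1.0},
    ("cup",): {"<end>": 0.8, "s": 0.2},
    ("cup", "s"): {"<end>": 1.0},
}


def next_token_probs(prefix):
    """Return the toy next-token distribution for a token prefix."""
    return TOY_NEXT[tuple(prefix)]


def one_shot_topk(k, max_len=4):
    """Pick k initial tokens from one distribution, then extend each greedily."""
    first = next_token_probs([])
    starts = sorted(first, key=first.get, reverse=True)[:k]
    results = []
    for tok in starts:
        tokens, prob = [tok], first[tok]
        while len(tokens) < max_len:
            dist = next_token_probs(tokens)
            nxt = max(dist, key=dist.get)
            if nxt == "<end>":  # delimiter closes the label and is not scored
                break
            prob *= dist[nxt]
            tokens.append(nxt)
        results.append(("".join(tokens), prob))
    # Rank labels by the product of their token probabilities (an assumption).
    return sorted(results, key=lambda r: r[1], reverse=True)


print(one_shot_topk(2))
```

In this toy setting the number of labels is set by `k` alone, which is why asking for a larger top-k only widens the initial selection step.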
Image credit (footnote) | Top-20 Predictions (prob - label) | Attention Map |
---|---|---|
1 | 0.13949 - legend, 0.12399 - sky, 0.04723 - cloud, 0.04642 - game, 0.04500 - screenshot, 0.03189 - top, 0.03024 - mountain, 0.02262 - cliff, 0.01790 - world, 0.01483 - wii, 0.01440 - video, 0.01310 - breath, 0.01087 - zeo, 0.00982 - zelda, 0.00959 - character, 0.00865 - rock, 0.00816 - link, 0.00788 - island, 0.00624 - adventure, 0.00591 - woman | decoder: layer 0, head 25 |
2 | 0.23237 - rocket, 0.10435 - launch, 0.06144 - soyuz, 0.04314 - space, 0.03541 - smoke, 0.03249 - sky, 0.01971 - shuttle, 0.01566 - tower, 0.01551 - paris, 0.01229 - cloud, 0.01067 - pad, 0.01050 - cape, 0.00983 - falcon, 0.00956 - photo, 0.00834 - lift, 0.00814 - air, 0.00779 - mission, 0.00710 - station, 0.00688 - july, 0.00647 - satellite | decoder: layer 0, head 0 |
3 | 0.30731 - dog, 0.13647 - sweater, 0.11870 - hat, 0.06812 - scarf, 0.04131 - brick, 0.03114 - wall, 0.01796 - shirt, 0.01471 - cute, 0.01156 - cap, 0.00982 - neck, 0.00929 - top, 0.00797 - head, 0.00777 - beanie, 0.00658 - man, 0.00588 - sits, 0.00582 - coat, 0.00524 - jacket, 0.00476 - collar, 0.00460 - face, 0.00119 - bone | decoder: layer 0, head 25 |
4 | 0.14861 - coffee, 0.10409 - shop, 0.08065 - counter, 0.04603 - bar, 0.04055 - restaurant, 0.03691 - inside, 0.03468 - area, 0.02638 - store, 0.02219 - table, 0.01930 - interior, 0.01347 - lot, 0.01156 - food, 0.01058 - customer, 0.01001 - room, 0.00923 - starbucks, 0.00853 - bakery, 0.00738 - view, 0.00738 - floor, 0.00733 - cafe, 0.00633 - shelf | decoder: layer 0, head 8 |
3 | 0.47652 - monster, 0.09664 - cartoon, 0.03812 - character, 0.03724 - group, 0.03312 - creature, 0.02111 - cute, 0.01929 - vector, 0.01481 - animal, 0.00955 - art, 0.00924 - alien, 0.00837 - pose, 0.00604 - bubble, 0.00553 - eye, 0.00533 - color, 0.00528 - hand, 0.00477 - design, 0.00474 - wallpaper, 0.00462 - child, 0.00445 - people, 0.00445 - family | decoder: layer 2, head 7 |
3 | 0.54375 - cloud, 0.09932 - word, 0.07571 - sky, 0.03153 - letter, 0.01862 - sora, 0.01380 - logo, 0.00995 - text, 0.00715 - top, 0.00715 - blue, 0.00677 - title, 0.00608 - photo, 0.00427 - picture, 0.00288 - sonora, 0.00269 - middle, 0.00257 - storm, 0.00202 - cloudscape, 0.00190 - sun, 0.00189 - art, 0.00156 - soar, 0.00041 - icy | decoder: layer 1, head 13 |
3 | 0.15317 - building, 0.13619 - wave, 0.04782 - room, 0.03498 - middle, 0.03188 - hall, 0.02367 - people, 0.02135 - ocean, 0.02087 - floor, 0.01867 - world, 0.01773 - inside, 0.01548 - man, 0.01380 - water, 0.01205 - view, 0.01200 - surfer, 0.01109 - photo, 0.00798 - hotel, 0.00734 - city, 0.00662 - pool, 0.00566 - art, 0.00319 - mural | decoder: layer 1, head 16 |
3 | 0.25673 - bird, 0.21676 - feather, 0.18550 - peacock, 0.04251 - head, 0.03240 - blue, 0.02507 - pigeon, 0.02183 - tail, 0.01339 - hair, 0.01187 - top, 0.00677 - face, 0.00631 - camera, 0.00463 - beak, 0.00451 - eye, 0.00419 - fence, 0.00370 - sits, 0.00333 - perch, 0.00330 - photo, 0.00318 - wall, 0.00269 - animal, 0.00106 - jay | decoder: layer 1, head 25 |
5 | 0.07247 - tablet, 0.06770 - coffee, 0.06562 - window, 0.05829 - controller, 0.05668 - game, 0.04802 - switch, 0.04043 - wii, 0.03798 - console, 0.03563 - cup, 0.02570 - top, 0.02067 - mug, 0.01808 - screen, 0.01344 - video, 0.01105 - star, 0.01092 - nintendo, 0.01055 - computer, 0.00819 - mario, 0.00815 - remote, 0.00736 - control, 0.00393 - sill | decoder: layer 0, head 12 |
6 | 0.36523 - airplane, 0.09151 - cargo, 0.07531 - plane, 0.05538 - ship, 0.04223 - container, 0.03105 - water, 0.03040 - view, 0.02277 - dock, 0.01685 - port, 0.01434 - sky, 0.01328 - shipping, 0.00788 - middle, 0.00751 - body, 0.00717 - photo, 0.00715 - jet, 0.00714 - city, 0.00621 - ocean, 0.00615 - freight, 0.00609 - boat, 0.00320 - transportation | decoder: layer 2, head 14 |
6 | 0.15236 - candy, 0.12271 - sweater, 0.11457 - glass, 0.10593 - dog, 0.08311 - chair, 0.07111 - cane, 0.04701 - sunglass, 0.04589 - christmas, 0.02361 - costume, 0.02085 - wearing, 0.01870 - hat, 0.00734 - head, 0.00636 - top, 0.00577 - outfit, 0.00520 - chocolate, 0.00437 - holi, 0.00362 - suit, 0.00344 - shirt, 0.00322 - strawberry, 0.00211 - wig | decoder: layer 1, head 16 |
6 | 0.19960 - living, 0.16291 - room, 0.11353 - sofa, 0.06036 - couch, 0.04741 - rug, 0.04704 - coffee, 0.03795 - dog, 0.03659 - wall, 0.02980 - table, 0.01611 - floor, 0.01594 - grey, 0.01472 - wood, 0.01353 - furniture, 0.01314 - plant, 0.01274 - fireplace, 0.01161 - pillow, 0.00941 - chair, 0.00512 - home, 0.00434 - blanket, 0.00351 - art | decoder: layer 1, head 16 |
The following table shows the reproduced results of recall (R column in Table 1 of the paper) on the validation splits with top-10 predictions.
# params | training group | checkpoint | md5 | CC3M | COCO | OpenImages |
---|---|---|---|---|---|---|
1.78B | G3M | Hugging Face | b2a69b | 0.740 | 0.703 | 0.616 |
1.78B | G70M | Hugging Face | e177c7 | 0.721 | 0.765 | 0.662 |
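To give a feel for what a recall over top-k predicted labels can measure, here is a minimal sketch. It assumes a BERTScore-style soft recall: each reference label is matched to its most similar predicted label by cosine similarity of text embeddings, and the best matches are averaged. That formulation, the helper names `cosine` and `soft_recall`, and the toy 2-D vectors are assumptions for illustration; the exact metric and embedding model are defined in the paper and the repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def soft_recall(ref_embs, pred_embs):
    """Average, over reference labels, of the best cosine match among predictions."""
    best = [max(cosine(r, p) for p in pred_embs) for r in ref_embs]
    return sum(best) / len(best)


# Toy embeddings (made up): two reference labels, three predicted labels.
refs = [[1.0, 0.0], [0.0, 1.0]]
preds = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
print(round(soft_recall(refs, preds), 4))
```

Under this reading, recall can only stay the same or rise as more predictions are added, since extra predictions never lower the best match for a reference label.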
The checkpoints can be downloaded from the links in the table above. For downloading from Hugging Face, one option is to use git-lfs:
# install git lfs
git lfs install
# download the checkpoint in terminal
git clone https://huggingface.co/kaiyuyue/nxtp
Also, the checkpoint can be downloaded from the model page in the web browser.
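After downloading, it may be worth checking the file against the md5 column of the table above. The sketch below assumes that column lists the first six hexadecimal characters of the checkpoint file's MD5 digest; that reading is an assumption, and `md5_prefix` is a hypothetical helper, not part of this repository.

```python
import hashlib


def md5_prefix(path, length=6, chunk_size=1 << 20):
    """Return the first `length` hex characters of the file's MD5 digest."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large checkpoints are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


# Example usage (the path is a placeholder):
# print(md5_prefix("path/to/model/checkpoint"))  # compare with the md5 column
```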
There is an image assets/starbux.jpg for a quick test. First, please follow the instructions in Dependencies to prepare the environment.
To run inference on an image:
python src/infer.py \
--ckpt-path path/to/model/checkpoint \
--img-path assets/starbux.jpg \
--num-labels 20
The output from the model trained on G3M is
top-20 predictions:
| prob: 0.05742 - coffee
| prob: 0.05525 - restaurant
| prob: 0.04402 - shop
| prob: 0.02528 - room
| prob: 0.02468 - store
| prob: 0.02381 - interior
| prob: 0.01732 - area
| prob: 0.01640 - building
| prob: 0.01616 - food
| prob: 0.01408 - bar
| prob: 0.01247 - customer
| prob: 0.01134 - view
| prob: 0.01059 - floor
| prob: 0.01045 - table
| prob: 0.00933 - kitchen
| prob: 0.00926 - home
| prob: 0.00872 - look
| prob: 0.00841 - people
| prob: 0.00693 - cup
| prob: 0.00665 - counter
The output from the model trained on G70M is
top-20 predictions:
| prob: 0.15203 - coffee
| prob: 0.09728 - shop
| prob: 0.09182 - counter
| prob: 0.03848 - interior
| prob: 0.03389 - bar
| prob: 0.03215 - restaurant
| prob: 0.02440 - table
| prob: 0.02245 - store
| prob: 0.01950 - area
| prob: 0.01905 - inside
| prob: 0.01590 - starbucks
| prob: 0.01313 - cafe
| prob: 0.01220 - chair
| prob: 0.01172 - floor
| prob: 0.01020 - cup
| prob: 0.00879 - drink
| prob: 0.00794 - room
| prob: 0.00746 - customer
| prob: 0.00635 - wood
| prob: 0.00345 - bakery
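If the predictions are needed downstream, the printed lines can be parsed into pairs. This is a small sketch that assumes the script prints exactly the `| prob: <value> - <label>` format shown above; `parse_predictions` is a hypothetical helper, not part of this repository.

```python
import re

# Matches lines such as "| prob: 0.05742 - coffee".
LINE_RE = re.compile(r"^\|\s*prob:\s*([0-9.]+)\s*-\s*(.+?)\s*$")


def parse_predictions(text):
    """Return (label, probability) pairs from the printed prediction lines."""
    pairs = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # header lines such as "top-20 predictions:" are skipped
            pairs.append((match.group(2), float(match.group(1))))
    return pairs


sample = """top-20 predictions:
| prob: 0.15203 - coffee
| prob: 0.09728 - shop
| prob: 0.09182 - counter
"""
print(parse_predictions(sample))
# prints [('coffee', 0.15203), ('shop', 0.09728), ('counter', 0.09182)]
```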
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
Footnotes
- Image credit: ゼルダの伝説 The Legend of Zelda: Tears of the Kingdom.
- Image credit: Photo taken by the author at a Starbucks store.
- Image credit: Super Mario Bros Wonder.
- Image credit: Demo in Segment Anything | Meta AI.