
Commit

zahid-isu committed Jun 9, 2024
2 parents 397265a + 39cc2e6 commit 8b8a5c4
Showing 281 changed files with 577,607 additions and 1 deletion.
8 changes: 7 additions & 1 deletion README.md
@@ -1,7 +1,7 @@


# Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
## [Project page](https://baskargroup.github.io/Arboretum/)


### Academic Project Page Template
@@ -35,6 +35,12 @@ To edit the website's contents, edit the `index.html` file. It contains different
- PDF Poster
- Bibtex citation

## Model validation

To validate the zero-shot accuracy of our trained models and compare them against other benchmarks, we use the [VLHub](https://github.com/penfever/vlhub) repository with some slight modifications. See the README in the linked repository for basic instructions on using VLHub for evaluation.

To use the open_clip model, first run the `model_validation/load_openclip.py` script to download the model weights from Hugging Face and save them locally. The Birds 525 dataset can be downloaded [here](https://www.kaggle.com/datasets/gpiosenka/100-bird-species) and the IP102 Insects dataset can be found [here](https://drive.google.com/drive/folders/1svFSy2Da3cVMvekBwe13mzyx38XZ9xWo). Once you've downloaded a dataset, you can reproduce our model evaluations by running `model_validation/src/training/main.py` and specifying the dataset with the `--ds-filter` argument.
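
A minimal sketch of the two-step workflow; the batch size, worker count, and filter name below are illustrative placeholders rather than values confirmed by this repository:

```bash
# Step 1: download the model weights from Hugging Face and save them locally
python model_validation/load_openclip.py

# Step 2: reproduce an evaluation, selecting the dataset via --ds-filter
# (the filter name is a placeholder -- substitute the one matching your dataset)
python model_validation/src/training/main.py --batch-size=32 --workers=8 \
  --zeroshot-frequency=1 --ds-filter="<your_dataset_filter>"
```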

## Tips:
- The `index.html` file contains comments instructing you what to replace; you should follow these comments.
- The `meta` tags in the `index.html` file are used to provide metadata about your paper
29 changes: 29 additions & 0 deletions model_validation/LICENSE
@@ -0,0 +1,29 @@
Portions of OpenCLIP were incorporated into this codebase. OpenCLIP is copyright (c) 2012-2021 Gabriel Ilharco, Mitchell Wortsman,
Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar,
John Miller, Hongseok Namkoong, Hannaneh Hajishirzi, Ali Farhadi,
Ludwig Schmidt under the following license;

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Portions of GradCache were incorporated into this codebase. GradCache was released under an Apache 2.0 license and is available at [this site](https://github.com/luyug/GradCache).

A [SimCLR implementation](https://github.com/facebookresearch/SLIP/blob/main/LICENSE) was incorporated into this codebase which was originally (c) Meta Platforms, Inc, and is released under an MIT license.

The remainder of this software is copyright (c) 2022 Benjamin Feuer, and is released under the same license as OpenCLIP.
183 changes: 183 additions & 0 deletions model_validation/README.md
@@ -0,0 +1,183 @@
# VL Hub

**VL Hub** integrates [CLIP pretraining, LiT-Tuning](https://github.com/mlfoundations/open_clip), [alternate CLIP-like architectures](https://github.com/lucidrains/x-clip), [CoCa](https://github.com/lucidrains/coca-pytorch), conventional [timm](https://github.com/rwightman/pytorch-image-models) vision models and [SimCLR](https://github.com/facebookresearch/SLIP/blob/main/LICENSE) contrastive models into a single test-train-eval framework, making it easier to compare models across architectures and objectives.

## Attribution

This software is heavily indebted to all of the above-listed packages, and particularly to [OpenCLIP](https://github.com/mlfoundations/open_clip).

## How to use this repository

If you're new to CLIP, you might want to familiarize yourself with the basics of the architecture.

If you haven't worked with the OpenCLIP implementation of CLIP before, the best place to start is the OpenCLIP readme in the docs folder.

If you're in need of a vision-language dataset for your training experiments, check out [CaptionNet](https://github.com/penfever/CaptionNet), a dataset designed precisely for that purpose.

If you're planning to use one of the alternate architectures for training, we recommend using the links [above](#vl-hub) to familiarize yourself with the details of that architecture before proceeding.

In this readme, we focus on features that are new in our implementation.

### First Run

After cloning the VLHub repository and installing the requirements, run the following commands -- these commands must be run every time before executing a VLHub command.

```bash
cd vlhub
export PYTHONPATH="$PYTHONPATH:$PWD/src";
```

### Sample Command

Here is an example of an evaluation command in VLHub:

```bash
python src/training/main.py --batch-size=32 --workers=8 --imagenet-val "/imagenet/val/" --imagenet-v2 "/scratch/projects/hegdelab/bf996/datasets" --imagenet-s "/imagenet-sketch" --imagenet-a "/imagenet-a" --imagenet-r "/imagenet-r" --objectnet "/objectnet-1.0/images" --model="resnet50" --zeroshot-frequency=1 --linear-probe=True --image-size=224 --resume "/scratch/bf996/pytorch-image-models/output/train/20240219-081711-resnet50-192/model_best.pth.tar" --report-to wandb
```

# Evaluation

## Extended Evaluation Metrics

VL Hub integrates support for a wide range of zero-shot evaluation benchmarks, including ImageNet and its distribution shifts, Food-101, FGVC-Aircraft, Stanford Cars, and iNaturalist. Simply add the dataset you wish to evaluate on, and include the following flag(s) at evaluation time, substituting the appropriate PATH --

--food "/", --air "/", --stanfordcars "/", --imagenet-val "/imagenet/val/", --imagenet-a "/imagenet-a", --imagenet-r "/imagenet-r", --imagenet-v2 "/imagenet-v2", --imagenet-s "/imagenet-sketch", --inat2021 "/inat2021"
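
For example, the sample command above can be adapted to report several of these benchmarks in a single run; all paths, including the checkpoint, are placeholders:

```bash
# Same structure as the sample command, but reporting Food-101,
# FGVC-Aircraft, and iNaturalist 2021 instead of the ImageNet shifts
# (all paths, including the checkpoint, are placeholders)
python src/training/main.py --batch-size=32 --workers=8 \
  --model="resnet50" --zeroshot-frequency=1 --linear-probe=True --image-size=224 \
  --food "/datasets/food101" --air "/datasets/fgvc-aircraft" --inat2021 "/datasets/inat2021" \
  --resume "/path/to/checkpoint.pth.tar"
```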

## Subset Evaluation

ImageNet and its distribution shifts also support evaluation on a 100-class subset of ImageNet-1k; this is particularly useful when training models on smaller datasets such as [CaptionNet](https://github.com/penfever/CaptionNet).

To evaluate on a subset of ImageNet, include the flag --caption-subset=True

## Extended Metrics

VL Hub supports extended evaluation metrics such as confusion matrices and per-class accuracy results. To utilize these features, pass the flag --extended-metrics=True.
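
A sketch combining both of these evaluation flags with an ImageNet run; as before, the checkpoint and dataset paths are placeholders:

```bash
# 100-class subset evaluation with per-class accuracy and a confusion matrix
python src/training/main.py --batch-size=32 --workers=8 \
  --model="resnet50" --zeroshot-frequency=1 --linear-probe=True --image-size=224 \
  --imagenet-val "/imagenet/val/" --resume "/path/to/checkpoint.pth.tar" \
  --caption-subset=True --extended-metrics=True
```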

# Training

## Supported Training Architectures

VL Hub currently supports training the following architectures:

* CLIP-loss models

As this is the default training mode, you need only pass the type of model you wish to train, e.g., --model=RN50

* LiT-tuned models

Please refer to docs/openclip_readme for details

--lock-image-unlocked-groups will unlock only the last $n$ layers of the image tower during training

--lock-text will lock the text tower during training (not recommended)

* Conventional cross-entropy loss models

To train conventional cross-entropy loss models without using a text tower, if your dataset is in CSV format you must specify a caption-key column that contains either integers or a list of integers (a sketch of this configuration appears after this list).

--csv-caption-key idx

If you are training on webdataset, this step is not necessary, but you should specify a dataset filter --

--ds-filter=imagenet_classnames

Integer labels will be generated using a subset matching strategy. For more details, please see our paper.

In either case, you must also pass

--integer-labels

You should then choose a model architecture with an appropriate head. For instance, if training an ImageNet model, you might choose

--model=RN50-in1k

* CoCa

In order to train a CoCa model, simply pass

--model=coca

Please note that at this time, support for CoCa is limited to a single vision backbone, and the loss weighting has to be adjusted manually.

* DeCLIP, VSSL, FILIP

To train using one of these alternate objectives, pass the model architecture you wish to use as your base, and flag the objective you wish to train with. For instance:

--model=vit_base_patch32_224 --mlm=True

* Changing image and text weighting

By passing float values between 0.1 and 0.9 to the flags --text-weight and --image-weight, you can change how heavily CLIP weights the text and image losses.
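
As referenced in the cross-entropy item above, here is a minimal training sketch for that configuration. It assumes the standard OpenCLIP data-loading flags (`--train-data`, `--csv-img-key`) carry over to VLHub; the paths and hyperparameters are placeholders:

```bash
# Cross-entropy training on a CSV dataset with integer labels
# (assumes OpenCLIP-style data flags; paths and hyperparameters are placeholders)
python src/training/main.py \
  --train-data "/path/to/train.csv" \
  --csv-img-key filepath --csv-caption-key idx \
  --integer-labels \
  --model=RN50-in1k \
  --batch-size=256 --epochs=32 --workers=8
```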

## Training Schema

In order to make training on blended or combined datasets more convenient when using webdataset, we implement training schema. Pass the flag

--schema=PATH

and do NOT pass a path to any training data in order to use schema.

Sample schema are provided in the repository.

Please note: schema training is only supported when using the webdataset format; if you wish to combine CSV datasets, simply merge the CSVs you wish to combine.
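
A minimal invocation sketch; the schema path is a placeholder to be pointed at one of the sample schema in the repository, and note that no training-data path is passed:

```bash
# Train from a schema describing a blend of webdataset sources
# (schema path is a placeholder -- adapt a sample schema from the repo;
#  no training-data path is passed when using a schema)
python src/training/main.py \
  --schema="/path/to/sample_schema" \
  --model=RN50 --batch-size=256 --epochs=32 --workers=8
```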

## SimCLR Augmentation

When

--sim-clr-trans=True

is passed, the model will use SimCLR augmentations instead of standard CLIP augmentations. This has been shown to improve model zero-shot performance by as much as 10 percent.

## Gradient Caching

VL Hub offers support for gradient caching, as described in [GradCache](https://github.com/luyug/GradCache).

Models like CLIP are typically trained with very large batch sizes -- 32,768 is standard. This has been shown to improve rates of loss convergence.

However, most researchers and even most businesses do not have access to the distributed training setups required for such batch sizes. Gradient caching saves computed gradients in RAM instead of in VRAM, allowing for much larger batch sizes on a single node.

Unlike gradient accumulation, gradient caching is mathematically identical to training with a large batch size.

To try gradient caching, use these three new command line arguments --

--gc=True: If set to True, gradient caching will be enabled. If gc=True, you should also set gpumaxbatch to an appropriate size for your GPU.

--gpumaxbatch=INT: This is an integer value representing the maximum batch size that your GPU can handle

--no_eval=True: Skips the evaluation phase, accelerating training
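
Putting the three arguments together, a sketch that targets a large effective batch size on a single GPU; the batch sizes and data path are illustrative placeholders, not recommendations:

```bash
# Effective batch size of 1024, computed in GPU-sized chunks via gradient caching
# (path and sizes are placeholders -- tune --gpumaxbatch to your hardware)
python src/training/main.py \
  --train-data "/path/to/train.csv" \
  --model=RN50 --batch-size=1024 \
  --gc=True --gpumaxbatch=128 --no_eval=True
```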

# Subset Matching

The default subset matching strategy is single-class, as this has been shown to perform best. However, other strategies are available. --multiclass enforces multiclass subset matching, whereas --strict utilizes a strict subset matching strategy. For more details on subset matching strategies, please refer to our paper.
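
For instance, a sketch that switches to multiclass matching during integer-label training; how these flags compose is inferred from the sections above, so treat it as illustrative:

```bash
# Multiclass subset matching when generating integer labels via a dataset filter
# (flag composition inferred from the sections above; paths are placeholders)
python src/training/main.py \
  --train-data "/path/to/webdataset/shards-{00000..00099}.tar" \
  --ds-filter=imagenet_classnames --integer-labels --multiclass \
  --model=RN50-in1k --batch-size=256
```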

# Captioning

VL Hub supports many modifications to captions, both during inference and during training.

"--zs-upper" forces classes to UPPER-CASE during inference, "zs-lower" forces lower-case. "--csv-cleaned" cleans the captions prior to training (this flag also works for webdatasets).

--token-strip strips all tokens not used for evaluation.

--token-scrambled scrambles token order during training.

"--simplecaptions" changes caption text to 'An image of CLASSNAME, CLASSNAME'

--shift-cipher=3 applies a 3-step shift-cipher to the caption space.
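
For example, a few of these flags combined in training and evaluation; the exact boolean-flag syntax (bare flag vs. =True) and all paths are assumptions here:

```bash
# Training with cleaned, simplified captions
python src/training/main.py --train-data "/path/to/train.csv" \
  --model=RN50 --batch-size=256 --csv-cleaned --simplecaptions

# Zero-shot evaluation with lower-cased class names
python src/training/main.py --imagenet-val "/imagenet/val/" \
  --model=RN50 --resume "/path/to/checkpoint.pt" --zs-lower
```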

# Citations

If you find this work useful, please consider citing our paper --

```
@article{
feuer2023distributionally,
title={Distributionally Robust Classification on a Data Budget},
author={Benjamin Feuer and Ameya Joshi and Minh Pham and Chinmay Hegde},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=D5Z2E8CNsD},
note={}
}
```