Fix CLIPA tokenizer documentation (#804)
* Update docs describing how to change the tokenization masking strategy for CLIPA

* Fix markdown
humzaiqbal authored Feb 5, 2024
1 parent 3ff1faf commit 73fa7f0
Showing 1 changed file with 3 additions and 3 deletions.
docs/clipa.md: 3 additions & 3 deletions
@@ -37,15 +37,15 @@ Eight token length reduction strategies are investigated in this work, detailed

## Text token length reduction

- * `syntax mask`: Assign different masking priorities to parts of speech. Specify `"text_mask": syntax` in `"text_cfg"` of model config `json` file to use.
+ * `syntax mask`: Assign different masking priorities to parts of speech. Specify `"text_mask": syntax` in `"tokenizer_kwargs"` in `"text_cfg"` of model config `json` file to use.
Specifically, we prioritize retaining nouns, followed by adjectives, and then other words.
We find this strategy generally works the best as it retains critical information for contrastive learning.

* `truncate`: Truncation selects the first N text tokens and discards the rest. This is the default setting of `open_clip`.

- * `random mask`: Randomly drops a portion of the text tokens. Specify `"text_mask": random` in `"text_cfg"` of model config `json` file to use.
+ * `random mask`: Randomly drops a portion of the text tokens. Specify `"text_mask": random` in `"tokenizer_kwargs"` in `"text_cfg"` of model config `json` file to use.

- * `block mask`: Randomly preserves consecutive text sequences. Specify `"text_mask": block` in `"text_cfg"` of model config `json` file to use.
+ * `block mask`: Randomly preserves consecutive text sequences. Specify `"text_mask": block` in `"tokenizer_kwargs"` in `"text_cfg"` of model config `json` file to use.
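
For illustration, a model config that enables the `syntax` strategy might look like the minimal sketch below. The fields surrounding `text_cfg` and all example values are assumptions rather than a copy of an actual CLIPA config, and the strategy value is quoted as JSON requires; the relevant part is the `tokenizer_kwargs` entry.

```json
{
  "embed_dim": 512,
  "vision_cfg": {
    "image_size": 224,
    "layers": 12,
    "width": 768,
    "patch_size": 16
  },
  "text_cfg": {
    "context_length": 32,
    "vocab_size": 49408,
    "width": 512,
    "heads": 8,
    "layers": 12,
    "tokenizer_kwargs": {
      "text_mask": "syntax"
    }
  }
}
```

Swapping `"syntax"` for `"random"` or `"block"` selects the corresponding strategy; omitting `tokenizer_kwargs` keeps the default `truncate` behavior.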


## Installation