Update README.md and training hello world colabs (#23)
* train notebooks and readme update

* Update README.md

* Update CONTRIBUTING.md

* Delete notebooks/train_hello_world_tf.ipynb

* Add train_tpu.ipynb notebook

* rerun retvec train notebook

* delete training files because they are unused, update notebook

* add tfjs

* rerun notebook
MarinaZhang committed Oct 11, 2023
1 parent e951457 commit b381c8c
Showing 26 changed files with 1,248 additions and 1,403 deletions.
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -1,5 +1,5 @@
# How to Contribute
Thanks for considering contributing to TF similarity!
Thanks for considering contributing to RETVec!

Here is what you need to know to make a successful contribution. There are
just a few small guidelines you need to follow.
@@ -31,7 +31,7 @@ pull request:
- Ideally, one PR corresponds to one feature or improvement, which makes it easier to
review. So **try** to split your contribution into meaningful logical units.
- Your code **must** pass the unit-tests. We use `pytest` so simply run it at the root of the project.
- Your code **must** pass static analysis. We use `mypy` so simply run `mypy tensorflow_similarity/` from the root of the project.
- Your code **must** pass static analysis. We use `mypy` so simply run `mypy retvec/` from the root of the project.
- Your code **must** come with unit tests to ensure long-term quality.
- Your functions **must** be documented except obvious ones using the Google style.
- Your functions **must** be typed.
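
As an illustration of the last two guidelines, a typed function with a Google-style docstring might look like the sketch below. The function `split_words` is hypothetical and not part of RETVec; it only shows the expected shape of the type annotations and docstring sections.

```python
from typing import List


def split_words(text: str, lowercase: bool = True) -> List[str]:
    """Splits a string into whitespace-delimited words.

    Hypothetical helper used only to illustrate the documentation style.

    Args:
        text: The raw input string.
        lowercase: Whether to lowercase the text before splitting.

    Returns:
        The words found in `text`, in order of appearance.
    """
    if lowercase:
        text = text.lower()
    return text.split()
```

A function written this way should pass `mypy` without extra configuration, and the `Args` and `Returns` sections give reviewers the contract at a glance.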
61 changes: 49 additions & 12 deletions README.md
@@ -2,16 +2,20 @@


## Overview
RETVec is a next-gen text vectorizer designed to offer built-in adversarial resilience using robust word embeddings. Read the paper here: https://arxiv.org/abs/2302.09207.
RETVec is a next-gen text vectorizer designed to be efficient and multilingual, and to provide built-in adversarial resilience using robust word embeddings trained with [similarity learning](https://github.com/tensorflow/similarity/). You can read the paper [here](https://arxiv.org/abs/2302.09207).

RETVec is trained to be resilient against character manipulations including insertion, deletion, typos, homoglyphs, LEET substitution, and more. The RETVec model is trained on top of a novel character embedding which can encode all UTF-8 characters and words. Thus, RETVec works out-of-the-box on over 100 languages without the need for a lookup table or fixed vocabulary size. Furthermore, RETVec is a layer, which means that it can be inserted into any TF model without the need for a separate pre-processing step.
RETVec is trained to be resilient against character-level manipulations including insertion, deletion, typos, homoglyphs, LEET substitution, and more. The RETVec model is trained on top of a novel character encoder which can encode all UTF-8 characters and words efficiently. Thus, RETVec works out-of-the-box on over 100 languages without the need for a lookup table or fixed vocabulary size. Furthermore, RETVec is a layer, which means that it can be inserted into any TF model without the need for a separate pre-processing step.
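
To make the manipulations listed above concrete, here is a minimal standard-library sketch that produces perturbed variants of a word. It is an illustration only, not RETVec's training augmentation code: the substitution tables `LEET` and `HOMOGLYPHS` and the helper names are assumptions chosen for this example, and they are deliberately tiny.

```python
import random

# Hypothetical, deliberately small substitution tables (illustration only).
LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Cyrillic look-alikes


def insert_char(word: str, rng: random.Random) -> str:
    """Inserts a random lowercase letter at a random position."""
    pos = rng.randint(0, len(word))
    return word[:pos] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[pos:]


def delete_char(word: str, rng: random.Random) -> str:
    """Deletes one character at a random position (no-op on short words)."""
    if len(word) < 2:
        return word
    pos = rng.randrange(len(word))
    return word[:pos] + word[pos + 1:]


def substitute(word: str, table: dict) -> str:
    """Replaces every character that has an entry in `table`."""
    return "".join(table.get(ch, ch) for ch in word)


rng = random.Random(0)  # fixed seed so runs are repeatable
word = "retvec"
print(insert_char(word, rng))        # one extra character
print(delete_char(word, rng))        # one character removed
print(substitute(word, LEET))        # LEET substitution
print(substitute(word, HOMOGLYPHS))  # looks similar, different code points
```

A vectorizer that relies on a fixed vocabulary would typically map each of these variants to an unknown token, which is the failure mode that character-level resilience is meant to avoid.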

RETVec's speed and size (~200k parameters instead of millions) also make it a great choice for on-device and web use cases. It is natively supported in TensorFlow Lite via custom ops implemented in TensorFlow Text, and we provide a JavaScript implementation of RETVec which allows you to deploy web models via TensorFlow.js.

### Getting started
Please see our example colabs on how to get started with training your own models with RETVec.

#### Installation
## Getting started

You can use pip to install the TensorFlow version of RETVec:

### Installation

You can use pip to install the latest TensorFlow version of RETVec:

```python
pip install retvec
@@ -21,19 +25,52 @@ RETVec has been tested on TensorFlow 2.6+ and python 3.7+.

### Basic Usage

`training/train_tf_retvec_models.py` is the RETVec model training script. Example usage:
You can use RETVec as the vectorization layer in any TensorFlow model with just a single line of code. RETVec operates on raw strings with pre-processing options built-in (e.g. lowercasing text). For example:

```python
train_tf_retvec_models.py --train_config <train_config_path> --model_config <model_config_path> --output_dir <output_path>
```
import tensorflow as tf
from tensorflow.keras import layers
# Import the RETVec tokenizer layer (available after `pip install retvec`)
from retvec.tf import RETVecTokenizer
# Placeholder value for illustration: set this to the number of classes in your dataset
NUM_CLASSES = 2
# Define the input layer, which accepts raw strings
inputs = layers.Input(shape=(1, ), name="input", dtype=tf.string)
# Add the RETVec Tokenizer layer using the RETVec embedding model -- that's it!
x = RETVecTokenizer(sequence_length=128)(inputs)
# Create your model like normal
# e.g. a simple LSTM model for classification with NUM_CLASSES classes
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(NUM_CLASSES, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)
```

Then you can compile, train, and save your model as usual! As demonstrated in our paper, models trained using RETVec are more resilient against adversarial attacks and typos, and are also computationally efficient. RETVec is also supported in TFJS and TF Lite, making it well suited to on-device mobile and web use cases.

### Colabs

Configurations for our base models are under the `configs/` folder.
Detailed example colabs for RETVec can be found under [notebooks](notebooks/). These are a good way to get started with using RETVec. You can run the notebooks in Google Colab by clicking the Google Colab button. If none of the examples are similar to your use case, please let us know!

### Colab
We have the following example colabs:

Colab for training and releasing a new RETVec model: `notebooks/train_and_relase_a_rewnet.ipynb`
- Training RETVec-based models using TensorFlow: [train_hello_world_tf.ipynb](notebooks/train_hello_world_tf.ipynb) for GPU/CPU training, and [train_tpu.ipynb](notebooks/train_tpu.ipynb) for a TPU-compatible training example.
- (Coming soon!) Converting RETVec models into TF Lite models to run on-device.
- (Coming soon!) Using RETVec JS to deploy RETVec models on the web using TensorFlow.js.

## Citing
Please cite this reference if you use RETVec in your research:

```bibtex
@article{retvec2023,
title={RETVec: Resilient and Efficient Text Vectorizer},
author={Elie Bursztein and Marina Zhang and Owen Vallis and Xinyu Jia and Alexey Kurakin},
year={2023},
eprint={2302.09207}
}
```

Hello world colab: `notebooks/hello_world.ipynb`
## Contributing
To contribute to the project, please check out the [contribution guidelines](CONTRIBUTING.md). Thank you!

## Disclaimer
This is not an official Google product.
1 change: 1 addition & 0 deletions notebooks/demo_models/emotion_model/fingerprint.pb
@@ -0,0 +1 @@
(binary content not shown)