Updated for v0.2 release

morganmcg1 · Jul 28, 2020 · aad2bdf · aad2bdf
1 parent 73d7b85
commit aad2bdf
Showing 1 changed file with 44 additions and 0 deletions.
diff --git a/RELEASES.md b/RELEASES.md
@@ -1,5 +1,49 @@
 # Release History
 
+## v0.2
+Better data (HuggingFace nlp paracrawl), cleaned data (via UMAP), larger vocab (+2k for en, + 10k for ga) therefore larger model (74M -> 86M), added SacreBLEU measurement on Tatoeba, 10e training (via 2 x `fit_one_cycle` runs)
+
+#### Data
+- Paracrawl en-ga from HuggingFace `nlp` lib, 334k rows
+
+#### Data Processing
+- Removed noisy data via UMAP/bokeh visualisation
+- lowercase
+- Removed samples longer than 60 tokens (90th percentile was 58 tokens long)
+
+#### Tokenizer
+- Spacy tokenizer
+- [fastai rules](http://dev.fast.ai/text.core#Preprocessing-rules): [fix_html, replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces, replace_all_caps, replace_maj, lowercase]
+- Vocab size: en :22.9k, ga: 30k
+
+#### Model
+- Positional encoding (sin/cos)
+- PyTorch nn.Transformer
+    - Param count: 86M
+    - enc_layers: 6
+    - dec_layers: 6
+    - n_heads: 8
+    - d_model: 512
+    - d_inner: 2048
+    - vocab size: 22.9k en, 30k ga
+
+#### Training
+- Fastai: 
+    - fit_one_cycle(5, 5e-4, div=5)
+    - fit_one_cycle(5, 1e-5, div=5)
+    - 15m per epoch
+
+#### Performance
+- Tatoeba: 25.14 SacreBELU
+- CorpusBLEU: 0.503 (20% random validation, random seed = 42)
+- Val loss: 0.528, Val Loss: 0.813
+- Val Accuracy: 0.613
+- Val Perplexity: 2.256
+
+#### Serving
+- Added decoding of special tokens for output
+- Logging: added inference time logging
+
 ## v0.1
 Baseline release to be improved upon