Updated for v0.2 release
morganmcg1 authored Jul 28, 2020
1 parent 73d7b85 commit aad2bdf
Showing 1 changed file with 44 additions and 0 deletions.
44 changes: 44 additions & 0 deletions RELEASES.md
@@ -1,5 +1,49 @@
# Release History

## v0.2
Better data (Paracrawl via the HuggingFace `nlp` library), cleaned data (via UMAP), larger vocab (+2k for en, +10k for ga) and therefore a larger model (74M -> 86M params), added SacreBLEU measurement on Tatoeba, and 10 epochs of training (via 2 x `fit_one_cycle` runs)

#### Data
- Paracrawl en-ga from HuggingFace `nlp` lib, 334k rows

#### Data Processing
- Removed noisy data via UMAP/bokeh visualisation
- Lowercased all text
- Removed samples longer than 60 tokens (90th percentile was 58 tokens long)
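The length filter above can be sketched roughly as follows. This is only an illustration: it assumes whitespace tokenization (the release used the Spacy tokenizer), assumes the limit applies to both sides of a pair, and `n_tokens`, `percentile` and `filter_by_length` are hypothetical helper names, not functions from this repo.

```python
import math

MAX_TOKENS = 60  # cutoff stated in the release notes


def n_tokens(text):
    # Assumption: a whitespace split stands in for the real tokenizer
    return len(text.split())


def percentile(values, pct):
    # Nearest-rank percentile, used to sanity-check a cutoff choice
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def filter_by_length(pairs, max_tokens=MAX_TOKENS):
    # Keep an (en, ga) pair only if both sides fit within the limit
    return [
        (en, ga)
        for en, ga in pairs
        if n_tokens(en) <= max_tokens and n_tokens(ga) <= max_tokens
    ]
```

Calling `percentile(lengths, 90)` on the per-sample token counts is one way to arrive at a cutoff near the 90th percentile before applying `filter_by_length`.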

#### Tokenizer
- Spacy tokenizer
- [fastai rules](http://dev.fast.ai/text.core#Preprocessing-rules): [fix_html, replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces, replace_all_caps, replace_maj, lowercase]
- Vocab size: en: 22.9k, ga: 30k

#### Model
- Positional encoding (sin/cos)
- PyTorch nn.Transformer
- Param count: 86M
- enc_layers: 6
- dec_layers: 6
- n_heads: 8
- d_model: 512
- d_inner: 2048
- vocab size: 22.9k en, 30k ga
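For reference, a sin/cos positional encoding of the kind listed above can be sketched in plain Python. This follows the standard sinusoidal formulation and is an assumption about the general approach, not the repo's actual implementation; `sinusoidal_encoding` is a hypothetical name.

```python
import math


def sinusoidal_encoding(max_len, d_model=512):
    # Returns a max_len x d_model table: even columns hold sin, odd columns hold cos
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):
            # Wavelengths grow geometrically with the column index
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)
            if i + 1 < d_model:
                row[i + 1] = math.cos(angle)
        table.append(row)
    return table
```

In a model, each row would be added to the token embedding at the matching position before the first encoder or decoder layer.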

#### Training
- Fastai:
  - `fit_one_cycle(5, 5e-4, div=5)`
  - `fit_one_cycle(5, 1e-5, div=5)`
- 15 min per epoch
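To show what `div=5` does to the learning rate, here is a rough sketch of a one-cycle schedule. It assumes cosine warm-up and annealing with `pct_start=0.25` and `div_final=1e5`, which are fastai's defaults as far as I know and are not stated in these notes; `cos_anneal` and `one_cycle_lr` are hypothetical names.

```python
import math


def cos_anneal(start, end, pos):
    # Cosine interpolation from start (pos=0) to end (pos=1)
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pos))


def one_cycle_lr(pos, lr_max, div=5.0, div_final=1e5, pct_start=0.25):
    # pos is the fraction of training completed, in [0, 1]
    if pos < pct_start:
        # Warm up from lr_max / div to lr_max
        return cos_anneal(lr_max / div, lr_max, pos / pct_start)
    # Anneal from lr_max down to lr_max / div_final
    return cos_anneal(lr_max, lr_max / div_final, (pos - pct_start) / (1 - pct_start))
```

Under these assumptions, the first run starts at 5e-4 / 5 = 1e-4 and peaks at 5e-4, while the second starts at 1e-5 / 5 = 2e-6 and peaks at 1e-5.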

#### Performance
- Tatoeba: 25.14 SacreBLEU
- CorpusBLEU: 0.503 (20% random validation, random seed = 42)
- Train loss: 0.528, Val loss: 0.813
- Val Accuracy: 0.613
- Val Perplexity: 2.256
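The perplexity figure appears consistent with the 0.813 loss value if perplexity is computed as the exponential of the mean cross-entropy loss (natural log). That relationship is an assumption about how the metric was produced; a quick check:

```python
import math

val_loss = 0.813  # loss value reported above
perplexity = math.exp(val_loss)  # assumes perplexity = exp(mean cross-entropy)
print(f"{perplexity:.3f}")
```

The result lands within rounding distance of the reported 2.256; the small gap would be expected if the loss was rounded to three decimals before being written down.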

#### Serving
- Added decoding of special tokens for output
- Logging: added inference time logging

## v0.1
Baseline release to be improved upon
