Skip to content

v0.1

Pre-release
Pre-release
Compare
Choose a tag to compare
@morganmcg1 morganmcg1 released this 03 Jul 10:46
· 25 commits to master since this release

v0.1

Baseline release to be improved upon

Data

  • Paracrawl en-ga

Data Processing

  • lowercase
  • Removed samples longer than 60 tokens (90th percentile was 58 tokens long)

Tokenizer

  • Spacy tokenizer
  • fastai rules: [fix_html, replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces, replace_all_caps, replace_maj, lowercase]

Model

  • PyTorch nn.Transformer
    • Param count: 74M
    • enc_layers: 6
    • dec_layers: 6
    • n_heads: 8
    • d_model: 512
    • d_inner: 2048
    • vocab size: 20k en, 20k ga

Training

  • CorpusBLEU: 0.468 (20% random validation)
  • Fastai: fit_one_cycle(20, 5e-4, div=5), 23min per epoch
  • Train loss: 0.389287, Val Loss: 0.942813

Serving

  • Streamlit app
  • logging: input, output, datetime, feedback (null), model version