History ( ~ 2020. 2. 25)

Evaluation

results

  • QRNN
    • Glove
      • setting : experiments 14, test 8
      • per-token(partial) f1 : 0.8892680845877263
      • per-chunk(exact) f1 : 0.8809544851966417 (conlleval)
      • average processing time per bucket
        • 1 GPU (TITAN X (Pascal), 12196MiB)
          • restore version : 0.013028464151645457 sec
        • 32 processor CPU(multi-threading)
          • python : 0.004297458387741437 sec
          • C++ : 0.004124 sec
        • 1 CPU(single-thread)
          • python : 0.004832443533451109 sec
          • C++ : 0.004734 sec
  • Transformer
    • Glove
      • setting : experiments 7, test 9
      • per-token(partial) f1 : 0.9083215796897038
      • per-chunk(exact) f1 : 0.904078014184397 (chunk_eval)
      • average processing time per bucket
        • 1 GPU(TITAN X (Pascal), 12196MiB)
          • restore version : 0.013825567226844812 sec
          • frozen version : 0.015376264122228799 sec
          • tensorRT(FP16) version : no meaningful difference
        • 32 processor CPU(multi-threading)
          • python : 0.017238136546748987 sec
          • C++ : 0.013 sec
        • 1 CPU(single-thread)
          • python : 0.03358284470571628 sec
          • C++ : 0.021510 sec
  • BiLSTM
    • Glove
      • setting : experiments 9, test 1
      • per-token(partial) f1 : 0.9152852267186738
      • per-chunk(exact) f1 : 0.9094911075893644 (chunk_eval)
      • average processing time per bucket
        • 1 GPU(TITAN X (Pascal), 12196MiB)
          • restore version : 0.010454932072004718 sec
          • frozen version : 0.011339560587942018 sec
          • tensorRT(FP16) version : no meaningful difference
        • 32 processor CPU(multi-threading)
          • rnn_num_layers 2 : 0.006132203450549827 sec
          • rnn_num_layers 1
            • python
              • 0.0041805055967241884 sec
              • 0.003053264560968687 sec (experiments 12, test 5)
            • C++
              • 0.002735 sec
              • 0.002175 sec (experiments 9, test 2), 0.8800
              • 0.002783 sec (experiments 9, test 3), 0.8858
              • 0.004407 sec (experiments 9, test 4), 0.8887
              • 0.003687 sec (experiments 9, test 5), 0.8835
              • 0.002976 sec (experiments 9, test 6), 0.8782
              • 0.002855 sec (experiments 9, test 7), 0.8906
                • 0.002697 sec with optimizations for FMA, AVX and SSE. no meaningful difference.
              • 0.002040 sec (experiments 12, test 5), 0.9047
        • 1 CPU(single-thread)
          • rnn_num_layers 2 : 0.008001159379070668 sec
          • rnn_num_layers 1
            • python
              • 0.0051817628640952506 sec
              • 0.0042755354628630235 sec (experiments 12, test 5)
            • C++
              • 0.003998 sec
              • 0.002853 sec (experiments 9, test 2)
              • 0.003474 sec (experiments 9, test 3)
              • 0.005118 sec (experiments 9, test 4)
              • 0.004139 sec (experiments 9, test 5)
              • 0.004133 sec (experiments 9, test 6)
              • 0.003334 sec (experiments 9, test 7)
                • 0.003078 sec with optimizations for FMA, AVX and SSE. no meaningful difference.
              • 0.002683 sec (experiments 12, test 5)
    • ELMo
      • setting : experiments 8, test 2
      • per-token(partial) f1 : 0.9322728663199756
      • per-chunk(exact) f1 : 0.9253625751680227 (chunk_eval)
      $ etc/conlleval < pred.txt
      processed 46666 tokens with 5648 phrases; found: 5662 phrases; correct: 5234.
      accuracy:  98.44%; precision:  92.44%; recall:  92.67%; FB1:  92.56
                    LOC: precision:  94.29%; recall:  92.99%; FB1:  93.63  1645
                   MISC: precision:  84.38%; recall:  84.62%; FB1:  84.50  704
                    ORG: precision:  89.43%; recall:  91.69%; FB1:  90.55  1703
                    PER: precision:  97.27%; recall:  96.85%; FB1:  97.06  1610
      
      • average processing time per bucket
        • 1 GPU(TITAN X (Pascal), 12196MiB) : 0.06133532517637155 sec -> need to recompute
        • 1 GPU(Tesla V100) : 0.029950057644797457 sec
        • 32 processor CPU(multi-threading) : 0.40098162731570347 sec
        • 1 CPU(single-thread) : 0.7398052649182165 sec
    • ELMo + Glove
      • setting : experiments 10, test 16
      • per-token(partial) f1 : 0.9322386962382061
      • per-chunk(exact) f1 : 0.928729526339088 (chunk_eval)
      processed 46666 tokens with 5648 phrases; found: 5657 phrases; correct: 5247.
      accuracy:  98.44%; precision:  92.75%; recall:  92.90%; FB1:  92.83
                    LOC: precision:  93.89%; recall:  94.00%; FB1:  93.95  1670
                   MISC: precision:  85.03%; recall:  83.33%; FB1:  84.17  688
                    ORG: precision:  90.17%; recall:  91.63%; FB1:  90.89  1688
                    PER: precision:  97.58%; recall:  97.22%; FB1:  97.40  1611
      
      • average processing time per bucket
        • 1 GPU(TITAN X (Pascal), 12196MiB) : 0.036233977567360014 sec
        • 1 GPU(Tesla V100, 32510MiB) : 0.031166194639816864 sec
    • BERT (new result: aligned wordpiece + word embeddings)
      • BERT(large) + Glove + ELMo
        • setting : experiments 15, test 7
        • per-token(partial) f1 : 0.9306700873495816
        • per-chunk(exact) f1 : 0.9264420532721821(chunk_eval), 92.64(conlleval)
        • average processing time per bucket
          • 1 GPU(Tesla V100) : pass
      • BERT(large) + Glove
        • setting : experiments 15, test 6
        • per-token(partial) f1 : 0.9217156200073737
        • per-chunk(exact) f1 : 0.9158398299078666(chunk_eval), 91.58(conlleval)
        • average processing time per bucket
          • 1 GPU(Tesla V100) : pass
      • BERT(large)
        • BERT + LSTM + CRF only
        • setting : experiments 15, test 2
        • per-token(partial) f1 : 0.9120832058733557
        • per-chunk(exact) f1 : 0.9015151515151516(chunk_eval), 90.14(conlleval)
        • average processing time per bucket
          • 1 GPU(Tesla V100) : pass
    • BERT (old result: extending word embeddings for wordpieces)
      • BERT(base)
        • setting : experiments 11, test 1
        • per-token(partial) f1 : 0.9234725113260683
        • per-chunk(exact) f1 : 0.9131509267431598 (chunk_eval)
        • average processing time per bucket
          • 1 GPU(Tesla V100) : 0.026964144585057526 sec
      • BERT(base) + Glove
        • setting : experiments 11, test 2
        • per-token(partial) f1 : 0.921535076998289
        • per-chunk(exact) f1 : 0.9123210182075304 (chunk_eval)
        • average processing time per bucket
          • 1 GPU(Tesla V100) : 0.029030597688838533 sec
      • BERT(large)
        • BERT + CRF only
        • setting : experiments 11, test 15
        • per-token(partial) f1 : 0.929012534393152
        • per-chunk(exact) f1 : 0.9215426705498191 (chunk_eval), 92.00(conlleval)
        • average processing time per bucket
          • 1 GPU(Tesla V100) : pass
      • BERT(large)
        • BERT + LSTM + CRF only
        • setting : experiments 11, test 19
        • per-token(partial) f1 : 0.9310957309977338
        • per-chunk(exact) f1 : 0.9240976645435245 (chunk_eval), 92.23(conlleval)
        • average processing time per bucket
          • 1 GPU(Tesla V100) : pass
      • BERT(large) + Glove
        • setting : experiments 11, test 3
        • per-token(partial) f1 : 0.9278869778869779
        • per-chunk(exact) f1 : 0.918813634351483 (chunk_eval)
        • average processing time per bucket
          • 1 GPU(Tesla V100) : 0.040225753178425645 sec
      • BERT(large) + Glove + Transformer
        • setting : experiments 11, test 7
        • per-token(partial) f1 : 0.9244949032533724
        • per-chunk(exact) f1 : 0.9170714474962465 (chunk_eval)
        • average processing time per bucket
          • 1 GPU(Tesla V100) : 0.05737522856032033 sec
  • BiLSTM + Transformer
    • Glove
      • setting : experiments 7, test 10
      • per-token(partial) f1 : 0.910979409787988
      • per-chunk(exact) f1 : 0.9047451049567825 (chunk_eval)
  • BiLSTM + multi-head attention
    • Glove
      • setting : experiments 6, test 7
      • per-token(partial) f1 : 0.9157317073170732
      • per-chunk(exact) f1 : 0.9102156238953694 (chunk_eval)

comparison to previous research

Development note

accuracy and loss

abnormal case when using multi-head

  • why?
I guess the softmax (applied in the multi-head attention functions) was corrupted by padding.
  -> so I replaced the multi-head attention code with `https://github.com/Kyubyong/transformer/blob/master/modules.py`,
     which applies key and query masking for padding.
  -> however, a similar corruption happened.
  -> it was caused by tf.contrib.layers.layer_norm(), which normalizes over the [begin_norm_axis ~ R-1] dimensions.
  -> what about removing layer_norm()? performance goes down!
  -> try the other layer normalization code from `https://github.com/Kyubyong/transformer/blob/master/modules.py`,
     which normalizes over the last dimension only.
     this code matches my intention exactly (see the sketch below).
  • after replacing layer_norm() with normalize() and applying dropout to the word embeddings
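
a minimal TF 1.x sketch of such a normalize(), following the idea in `https://github.com/Kyubyong/transformer/blob/master/modules.py` (normalization over the last dimension only); the scope name and epsilon default are illustrative:

```python
import tensorflow as tf

def normalize(inputs, epsilon=1e-8, scope='ln', reuse=None):
    '''layer normalization over the last dimension only, unlike
       tf.contrib.layers.layer_norm(), which normalizes over the
       [begin_norm_axis ~ R-1] dimensions.'''
    with tf.variable_scope(scope, reuse=reuse):
        params_shape = inputs.get_shape()[-1:]
        # mean/variance over the last dimension only
        mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
        beta = tf.get_variable('beta', params_shape, initializer=tf.zeros_initializer())
        gamma = tf.get_variable('gamma', params_shape, initializer=tf.ones_initializer())
        normalized = (inputs - mean) / ((variance + epsilon) ** 0.5)
        outputs = gamma * normalized + beta
    return outputs
```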

train, dev accuracy after applying LSTMBlockFusedCell

tips for training speed up

  • filter out words that do not appear in the train/dev/test data from the glove840B word embeddings. do not do this for a deployed service, though.
  • use LSTMBlockFusedCell for the bidirectional LSTM. it is faster than LSTMCell (a sketch follows this list).
    • about 3.13 times faster during training time.
      • 297.6699993610382 sec -> 94.96637988090515 sec for 1 epoch
    • about 1.26 times faster during inference time.
      • 0.010652577061606541 sec -> 0.008411417501886556 sec for 1 sentence
    • where is the LSTMBlockFusedCell() defined?
    https://github.com/tensorflow/tensorflow/blob/r1.11/tensorflow/contrib/rnn/python/ops/lstm_ops.py
    vi ../lib/python3.6/site-packages/tensorflow/contrib/rnn/ops/gen_lstm_ops.py
    https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/ops/lstm_ops.cc
    https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/kernels/lstm_ops.cc
    
  • use early stopping
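
a minimal TF 1.x sketch of the fused bidirectional pattern mentioned above; the function and scope names are illustrative:

```python
import tensorflow as tf
from tensorflow.contrib.rnn import LSTMBlockFusedCell, TimeReversedFusedRNN

def bi_lstm_fused(inputs, sequence_lengths, rnn_size):
    '''bidirectional LSTM built from two fused cells.
       inputs: (batch_size, sentence_length, dim), batch-major.'''
    # fused cells expect time-major input: (sentence_length, batch_size, dim)
    t = tf.transpose(inputs, perm=[1, 0, 2])
    with tf.variable_scope('fw'):
        lstm_fw = LSTMBlockFusedCell(rnn_size)
        out_fw, _ = lstm_fw(t, dtype=tf.float32, sequence_length=sequence_lengths)
    with tf.variable_scope('bw'):
        # TimeReversedFusedRNN reverses the input, runs the wrapped cell,
        # and reverses the output back
        lstm_bw = TimeReversedFusedRNN(LSTMBlockFusedCell(rnn_size))
        out_bw, _ = lstm_bw(t, dtype=tf.float32, sequence_length=sequence_lengths)
    outputs = tf.concat([out_fw, out_bw], axis=-1)
    return tf.transpose(outputs, perm=[1, 0, 2])  # back to batch-major
```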

tips for Transformer

  • start with a small learning rate.
  • be careful when using a residual connection after the multi-head attention or the feed-forward net.
    • x = tf.nn.dropout(x + y) -> x = tf.nn.dropout(x_norm + y)
  • the per-token f1 on train/dev is relatively lower than that of the BiLSTM, but after applying the CRF layer, the per-token f1 increases very sharply.
    • does this mean the Transformer is weak at collecting the context needed to decide the label at the current position? if so, how can this be overcome?
    • try revising the position-wise feed-forward net (see the sketch after this list)
      • pad before and after the sentence
        • (batch_size, sentence_length, model_dim) -> (batch_size, 1+sentence_length+1, model_dim)
      • conv1d with kernel size 1 -> 3
      • this is the key to sequence tagging problems.
    • after applying kernel_size 3
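
a TF 1.x sketch of that revised position-wise feed-forward net; `ffn_dim` and the scope name are illustrative:

```python
import tensorflow as tf

def feedforward(inputs, model_dim, ffn_dim, scope='ffn'):
    '''position-wise feed-forward net using kernel_size 3 instead of 1.
       inputs: (batch_size, sentence_length, model_dim)'''
    with tf.variable_scope(scope):
        # pad one step before and after the sentence:
        # (batch_size, sentence_length, model_dim) -> (batch_size, 1+sentence_length+1, model_dim)
        padded = tf.pad(inputs, [[0, 0], [1, 1], [0, 0]])
        # kernel_size 3 over the padded sequence keeps the output length unchanged
        hidden = tf.layers.conv1d(padded, filters=ffn_dim, kernel_size=3,
                                  padding='valid', activation=tf.nn.relu)
        outputs = tf.layers.conv1d(hidden, filters=model_dim, kernel_size=1)
    return outputs
```

with kernel_size 3, each position mixes in its left and right neighbors, which is what gives the net local context for tagging.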

tips in general

  • save the best model by token-based f1. selecting by token-based f1 works slightly better than by chunk-based f1.
  • be careful about word lowercasing when you are using the glove6B embeddings. they are all lowercased.
  • feed the max sentence length to the session. this yields a huge improvement in inference speed.
  • when using import_meta_graph(), you should run global_variables_initializer() before restore() (see the sketch below).
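
a minimal restore sketch for that last tip; the checkpoint path is illustrative:

```python
import tensorflow as tf

with tf.Session() as sess:
    # rebuild the graph from the meta file and get a Saver for it
    saver = tf.train.import_meta_graph('model.ckpt.meta')
    # initialize first; restore() then overwrites the initialized
    # values with the checkpointed ones
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, 'model.ckpt')
```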

tips for BERT fine-tuning

  • it seems that warmup and exponential decay of the learning rate are worth using (a sketch follows).
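
a TF 1.x sketch of linear warmup followed by exponential decay; all hyperparameter values here are illustrative:

```python
import tensorflow as tf

def warmup_exponential_decay(base_lr, global_step, warmup_steps=1000,
                             decay_steps=12000, decay_rate=0.9):
    '''linear warmup to base_lr, then staircase exponential decay.'''
    step = tf.to_float(global_step)
    warmup_lr = base_lr * step / warmup_steps
    decayed_lr = tf.train.exponential_decay(base_lr, global_step,
                                            decay_steps, decay_rate,
                                            staircase=True)
    return tf.cond(step < warmup_steps,
                   lambda: warmup_lr,
                   lambda: decayed_lr)
```

pass the result to the optimizer as its learning rate, e.g. tf.train.AdamOptimizer(learning_rate=warmup_exponential_decay(...)).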

References

general

character convolution

Transformer

CRF

pretrained LM

tensorflow

$ python -c "import tensorflow as tf; print(tf.sysconfig.get_lib())"
$ python -c "import tensorflow as tf; print(tf.sysconfig.get_include())"
$ python -c "import tensorflow as tf; print(int(tf.test.is_built_with_cuda()))"
  • tensorflow backend
    • implementations of the BLAS specification
      • OpenBLAS, Intel MKL, Eigen (more functionality; a high-level C++ library)
    • Nvidia GPU
      • the CUDA language specification and libraries
      • cuDNN (more functionality; a high-level library)
    • tensorflow
      • GPU
        • mainly uses cuDNN
        • some cuBLAS and GOOGLE CUDA code (customized by Google)
      • CPU
        • basically uses Eigen
        • supports MKL and MKL-DNN
        • or Eigen with an MKL-DNN backend

etc