- Complexity of the output layer of the encoder-decoder-attention NMT model: $O(V)$
- What we desire:
- Memory efficiency
    - Time efficiency (on CPU, in an actual production environment)
    - Compatibility with parallel computation (on GPU, during training)
- In this paper, we propose a method that satisfies all of the conditions above.
- The method works not by computing a softmax over the entire output vocabulary, but by encoding each vocabulary word as a vector of binary variables and then independently predicting each bit of that binary representation (see the sketch below).
$O(V) \rightarrow O(\log V)$
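Below is a minimal sketch of such an output layer, assuming PyTorch; the class name `BinaryCodeOutput` and its parameters are illustrative, not from the paper, and the sketch omits the paper's later improvements.

```python
import math
import torch
import torch.nn as nn

class BinaryCodeOutput(nn.Module):
    """Predicts B = ceil(log2 V) independent bits instead of a V-way softmax."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.num_bits = math.ceil(math.log2(vocab_size))   # B >= log2 V
        self.proj = nn.Linear(hidden_size, self.num_bits)  # O(H log V) parameters

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Each output unit is the probability that one bit of b(w) equals 1.
        return torch.sigmoid(self.proj(hidden))            # shape: (..., B)

    def decode(self, bit_probs: torch.Tensor) -> torch.Tensor:
        # Round each bit independently, then read the bit vector as an integer word id.
        bits = (bit_probs > 0.5).long()
        weights = 2 ** torch.arange(self.num_bits, device=bits.device)
        return (bits * weights).sum(dim=-1)

# Toy usage: a 50k-word vocabulary needs only 16 output units instead of 50,000.
layer = BinaryCodeOutput(hidden_size=512, vocab_size=50_000)
word_ids = layer.decode(layer(torch.randn(2, 512)))        # -> tensor of 2 word ids
```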
-
To accelerate the softmax, methods based on approximate computation have been proposed. The main approaches are:
a) Sampling-based approximations: importance sampling; noise-contrastive estimation; simple negative sampling; mini-batched negative sampling
b) Structure-based approximations: class-based softmax; hierarchical softmax; binary code prediction
-
Hierarchical softmax (Morin and Bengio, 2005)
- Time complexity $O(H \log V)$, but it still requires space complexity $O(HV)$
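The gap between the two complexities can be seen in the rough sketch below (a generic balanced-tree formulation, not necessarily Morin and Bengio's exact construction; names such as `word_log_prob` and the random path are purely illustrative): only the $O(\log V)$ nodes on a word's root-to-leaf path are touched at prediction time, but every internal node still stores an $H$-dimensional vector.

```python
import math
import torch

V, H = 50_000, 256
node_vectors = torch.randn(V - 1, H)   # one vector per internal tree node: O(HV) space

def word_log_prob(hidden, path_nodes, path_bits):
    """log p(word) as a sum of binary decisions along its root-to-leaf path."""
    logits = node_vectors[path_nodes] @ hidden        # only O(log V) dot products
    probs = torch.sigmoid(logits)
    return torch.log(torch.where(path_bits.bool(), probs, 1.0 - probs)).sum()

depth = math.ceil(math.log2(V))                       # ~16 decisions for 50k words
hidden = torch.randn(H)
path_nodes = torch.randint(0, V - 1, (depth,))        # hypothetical path for illustration
path_bits = torch.randint(0, 2, (depth,))
print(word_log_prob(hidden, path_nodes, path_bits))
```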
-
$\mathbf{b}(w) := [b_1(w), b_2(w), \ldots, b_B(w)] \in \{0,1\}^B$ (here $\{0,1\}^B$ is notation analogous to $\mathbb{R}^n$) - The encoding must satisfy: 1) every word has a code; 2) different words have different codes; 3) redundancy is allowed, i.e., some bit arrays need not correspond to any word. Therefore $B \geq \log_2 V$.
- By convention, UNK is assigned the binary code of 0, BOS of 1, and EOS of 2; the remaining words are sorted by frequency to obtain rank(w) and are assigned the binary code of 2 + rank(w).
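A small sketch of this id assignment and of converting an id to its bit vector $\mathbf{b}(w)$; the helper names `build_word_ids` and `word_to_bits` are made up for illustration.

```python
import math

def build_word_ids(words_by_frequency):
    """UNK -> 0, BOS -> 1, EOS -> 2, remaining words -> 2 + rank(w) by frequency."""
    ids = {"<unk>": 0, "<bos>": 1, "<eos>": 2}
    for rank, word in enumerate(words_by_frequency, start=1):   # rank(w) starts at 1
        ids[word] = 2 + rank
    return ids

def word_to_bits(word_id, num_bits):
    """b(w): the integer id written as a fixed-length vector of B binary variables."""
    return [(word_id >> i) & 1 for i in range(num_bits)]

vocab = ["the", "of", "and"]               # toy vocabulary, already sorted by frequency
ids = build_word_ids(vocab)
B = math.ceil(math.log2(len(ids)))         # B >= log2 V
print(ids["and"], word_to_bits(ids["and"], B))   # -> 5 [1, 0, 1]
```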
- Loss Function
    - Either squared error or cross-entropy can be used.
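A sketch of both options, computed bit-wise against the target word's code $\mathbf{b}(w)$; the batch size, number of bits, and use of PyTorch here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

bit_probs = torch.rand(4, 16)                        # predicted P(bit = 1): batch of 4, B = 16
target_bits = torch.randint(0, 2, (4, 16)).float()   # b(w) for the reference words

squared_error = ((bit_probs - target_bits) ** 2).sum(dim=-1).mean()
cross_entropy = F.binary_cross_entropy(bit_probs, target_bits)

print(squared_error.item(), cross_entropy.item())
```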
- While this idea is simple and intuitive, we found that it alone was not enough to achieve accuracy competitive with real NMT models. Thus we make two improvements: