1. Paper title

Multiscale Collaborative Deep Models for Neural Machine Translation

2. Link

https://www.aclweb.org/anthology/2020.acl-main.40.pdf

3. Abstract

Recent evidence reveals that Neural Machine Translation (NMT) models with deeper neural networks can be more effective but are difficult to train. In this paper, we present a MultiScale Collaborative (MSC) framework to ease the training of NMT models that are substantially deeper than those used previously. We explicitly boost the gradient backpropagation from top to bottom levels by introducing a block-scale collaboration mechanism into deep NMT models. Then, instead of forcing the whole encoder stack to directly learn a desired representation, we let each encoder block learn a fine-grained representation and enhance it by encoding spatial dependencies using a context-scale collaboration. We provide empirical evidence showing that the MSC nets are easy to optimize and can obtain improvements in translation quality from considerably increased depth. On IWSLT translation tasks with three translation directions, our extremely deep models (with 72-layer encoders) surpass strong baselines by +2.2∼+3.1 BLEU points. In addition, our deep MSC achieves a BLEU score of 30.56 on the WMT14 English→German task, significantly outperforming state-of-the-art deep NMT models.

4. What problem does it solve?

NMT models are typically built on an encoder-decoder architecture with attention.

Deep neural networks are difficult to train. In computer vision (CV), this is commonly addressed by the following techniques (a residual-connection sketch follows the list):

  • residual connection
  • densely connected networks
  • deep layer aggregation
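
As an illustration of the first item, here is a minimal sketch of a residual connection in PyTorch; the module and dimension names are illustrative, not taken from a specific paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wraps a sublayer with an identity shortcut: y = x + f(x).
    The shortcut gives gradients a direct path to lower layers,
    which is what makes very deep stacks trainable."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)  # identity shortcut around the transformed path
```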

Deep neural networks have also proven useful in the following NLP tasks:

  • language modeling
  • question answering (QA)
  • text classification
  • natural language inference

However, NMT performance deteriorates as models grow deeper. The following methods help optimize deep NMT (a pre-norm sketch follows this list):

  • transparent attention mechanism
  • pre-norm method
    However, these techniques only allow NMT to reach depths of 20-30 layers; beyond 30 layers, performance still deteriorates.
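
For reference, here is a minimal sketch of the pre-norm method (LayerNorm applied before each sublayer instead of after); the wrapper name is hypothetical.

```python
import torch
import torch.nn as nn

class PreNormSublayer(nn.Module):
    """Pre-norm residual wrapper: y = x + sublayer(LayerNorm(x)).
    Compared with post-norm, LayerNorm(x + sublayer(x)), the residual
    path is never normalized, so gradients flow unimpeded through the
    stack and training stays stable at greater depths."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))
```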

The authors therefore pose two questions:

  1. How to break the limitation of depth in NMT models?
  2. How to fully utilize the deeper structure to further improve the translation quality?

5. Main contributions

The MultiScale Collaborative framework (MSC).
It introduces block-scale and context-scale collaboration mechanisms, which work as follows (a code sketch follows the list):

  1. Split the whole encoder stack into multiple encoder blocks.
  2. Give the encoder and decoder the same number of blocks.
  3. Let each encoder block learn its own fine-grained representation.
  4. Encode spatial dependencies across blocks with a bottom-up network.
  5. Use this context-scale collaboration over the encoded spatial dependencies to enhance the fine-grained representations.
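
The sketch below shows how the two collaboration mechanisms could fit together in PyTorch. It is a simplified reading of the paper, not the authors' code: all names (MSCEncoderSketch, context_cell, etc.) are hypothetical, and the position-wise GRU cell is one plausible instantiation of the bottom-up context encoder.

```python
import torch
import torch.nn as nn

class MSCEncoderSketch(nn.Module):
    """Simplified MSC encoder (hypothetical names, not the authors' code).

    Block-scale collaboration: the stack is split into blocks, and the
    output of encoder block i is returned so that decoder block i can
    cross-attend to it, which shortens the gradient backpropagation path.

    Context-scale collaboration: a position-wise GRU cell runs bottom-up
    over the block index, fusing each block's fine-grained representation
    with the context accumulated from the blocks below it.
    """
    def __init__(self, dim: int = 512, num_blocks: int = 6, layers_per_block: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
                num_layers=layers_per_block,
            )
            for _ in range(num_blocks)
        )
        self.context_cell = nn.GRUCell(dim, dim)  # bottom-up, position-wise fusion

    def forward(self, x: torch.Tensor):
        b, s, d = x.shape
        ctx = x.new_zeros(b * s, d)   # contextual state, one vector per token position
        block_outputs = []
        for block in self.blocks:
            x = block(x)              # fine-grained representation of this block
            ctx = self.context_cell(x.reshape(b * s, d), ctx)
            block_outputs.append(ctx.view(b, s, d))
        return block_outputs          # decoder block i cross-attends to block_outputs[i]

enc = MSCEncoderSketch()
outs = enc(torch.randn(2, 10, 512))  # list of 6 tensors, one per encoder block
```

Total encoder depth is num_blocks × layers_per_block, so, for instance, 6 blocks of 12 layers each would reach the 72-layer depth reported in the experiments (the paper's actual block partition may differ).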

Advantages of MSC:

  1. Shorter gradient backpropagation paths.
  2. Prevents information from the bottom layers from being lost.

6. Results

  1. As depth increases, conventional models degrade, whereas MSC remains easy to optimize.
  2. The 72-layer MSC delivers large gains in translation quality and outperforms the benchmarks.
    On WMT14 English→German it reaches a BLEU score of 30.56, +2.5 over the baseline, beating the previous state of the art with fewer parameters.

7. Keywords

Encoder, deep
