1. Paper title
Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation
2. link
https://www.aclweb.org/anthology/2020.acl-main.148.pdf
3. Abstract
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations. In this paper, we explore ways to improve them. We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics, and overcome this bottleneck via language-specific components and deepening NMT architectures. We identify the off-target translation issue (i.e. translating into a wrong target language) as the major source of the inferior zero-shot performance, and propose random online backtranslation to enforce the translation of unseen training language pairs. Experiments on OPUS-100 (a novel multilingual dataset with 100 languages) show that our approach substantially narrows the performance gap with bilingual models in both one-to-many and many-to-many settings, and improves zero-shot performance by ∼10 BLEU, approaching conventional pivot-based methods.
backtranslation: translating target-side text back into another language to create synthetic (pseudo) parallel data
4. motivation
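Multilingual translation means using a single NMT model to translate between multiple languages.
Advantages of multilingual NMT:
- easier model deployment
- encourages knowledge transfer between related languages
- improves low-resource translation
- makes zero-shot translation possible
Problems with multilingual NMT:
Problem 1: Multilingual NMT performs worse than bilingual NMT.
Problem 2: On zero-shot directions, multilingual NMT (compared with pivot-based models) suffers from the "off-target translation" problem, i.e. it translates into the wrong target language.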
[?] pivot-based methods
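On the open question above: pivot-based translation generally means translating in two hops through an intermediate language (often English), so that a pair with no direct training data can still be served. The sketch below illustrates the idea only. The `translate` callable, the lookup table, and the default pivot `"en"` are assumptions made for illustration, not anything taken from the paper.

```python
from typing import Callable

# Hypothetical signature: translate(text, src_lang, tgt_lang) -> translated text.
Translator = Callable[[str, str, str], str]


def pivot_translate(translate: Translator, text: str, src: str, tgt: str,
                    pivot: str = "en") -> str:
    """Translate src -> tgt in two hops through a pivot language."""
    # If the pivot is already one end of the pair, a single hop is enough.
    if src == pivot or tgt == pivot:
        return translate(text, src, tgt)
    intermediate = translate(text, src, pivot)
    return translate(intermediate, pivot, tgt)


# Toy lookup-table "model", used only to make the sketch runnable.
TOY_TABLE = {
    ("de", "en", "Hallo"): "Hello",
    ("en", "fr", "Hello"): "Bonjour",
}


def toy_translate(text: str, src: str, tgt: str) -> str:
    return TOY_TABLE[(src, tgt, text)]


result = pivot_translate(toy_translate, "Hallo", "de", "fr")
```

The cost of this scheme is two decoding passes per sentence, and errors from the first hop carry over into the second.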
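5. Existing approaches to these problems
5.1. For Problem 1:
- A dedicated encoder/decoder for each language
For example:
one-to-many translation with a shared encoder
many-to-many translation with an attention mechanism shared across languages
Drawback: limited scalability.
- Map different languages into the same representation space
For example:
with a target language symbol guiding the translation direction
Drawback:
ignores the linguistic diversity of different languages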
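- Building on approach 2 (the shared representation space), additionally account for linguistic diversity
For example:
reorganizing parameter sharing
designing language-specific parameter generator
decoupling multilingual word encoding
This paper takes approach 2 as the baseline and explores methods from approach 3.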
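The "target language symbol" used by approach 2 can be pictured as a special token prepended to the source sentence, so that one shared model knows which language to produce. This is a minimal sketch; the `<2xx>` token format and the function name are assumptions for illustration, and the notes do not record the exact symbol the paper uses.

```python
def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend a target-language symbol (assumed format: <2xx>) to the source."""
    return f"<2{target_lang}> {source_sentence}"


# The same source sentence is routed to different target languages
# purely by changing the prepended symbol.
examples = [("How are you?", "fr"), ("How are you?", "de")]
tagged = [add_target_token(sentence, lang) for sentence, lang in examples]
```

Because the symbol is the only signal for the output language, a model that never saw a given source/target combination in training can ignore it, which is one way to read the off-target problem discussed next.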
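5.2. For Problem 2:
On zero-shot directions, multilingual NMT (compared with pivot-based models) suffers from the "off-target translation" problem, i.e. it translates into the wrong target language.
Causes of the problem:
- missing ingredient problem
- spurious correlation issue
Solutions:
- cross-lingual regularization
- generating artificial parallel data with backtranslation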
This paper pursues the backtranslation approach to address the zero-shot problem.
6. Main contributions
For Problem 1:
The authors attribute Problem 1 to insufficient model capacity.
(1) language-aware layer normalization
(2) a linear transformation
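[?] The purpose of (1) and (2) is to relax the representation constraint. Why do (1) and (2) have this effect?
[?] (2) sits between the encoder and decoder, and its purpose is to facilitate the induction of language-specific translation correspondences. What does this sentence mean?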
(3) deeper NMT architectures
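One way to picture components (1) and (2): the normalization and the encoder-to-decoder projection each keep a separate parameter set per language, so languages no longer have to share exactly the same scaling and mapping. The sketch below is pure Python on a single hidden vector. Selecting parameters by target language, the dictionary layout, and all numeric values are assumptions for illustration, not the paper's implementation.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def language_aware_layer_norm(x: Vector, lang: str,
                              gains: Dict[str, Vector],
                              biases: Dict[str, Vector],
                              eps: float = 1e-6) -> Vector:
    """Normalize x, then scale and shift with parameters chosen by language."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)
    return [g * (v - mean) / std + b
            for v, g, b in zip(x, gains[lang], biases[lang])]


def language_aware_linear(x: Vector, lang: str,
                          weights: Dict[str, Matrix]) -> Vector:
    """Project an encoder output vector with a language-specific matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights[lang]]


# Toy parameters for two languages (values are made up).
gains = {"fr": [1.0, 1.0, 1.0], "de": [2.0, 2.0, 2.0]}
biases = {"fr": [0.0, 0.0, 0.0], "de": [0.5, 0.5, 0.5]}
weights = {
    "fr": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "de": [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
}

hidden = [1.0, 2.0, 3.0]
norm_fr = language_aware_layer_norm(hidden, "fr", gains, biases)
norm_de = language_aware_layer_norm(hidden, "de", gains, biases)
proj_de = language_aware_linear(hidden, "de", weights)
```

The same hidden vector ends up in a different place depending on the language, which is one reading of "relaxing the representation constraint": the shared encoder output does not have to serve every language in identical form.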
For Problem 2:
(1) The authors propose the random online backtranslation (ROBT) algorithm, which
finetunes a pretrained multilingual NMT model for unseen training language pairs with pseudo parallel batches generated by back-translating the target-side training data.
[?] I don't fully understand this.
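A possible reading of the ROBT description above, as a sketch: for every training example, pick a random intermediate language, use the current model to translate the target sentence into it, and treat that output as a new pseudo source paired with the original target. Finetuning on such batches exposes the model to language pairs it never saw in the real data. The `translate` interface, the uniform random choice, and the toy model are assumptions for illustration, not the paper's exact procedure.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical model interface: translate(text, tgt_lang) -> text in tgt_lang.
Translate = Callable[[str, str], str]
# (source sentence, target sentence, target language)
Example = Tuple[str, str, str]


def robt_batch(batch: List[Example], languages: List[str],
               translate: Translate, rng: random.Random) -> List[Example]:
    """Build a pseudo-parallel batch by back-translating each target sentence
    into a randomly chosen intermediate language."""
    pseudo = []
    for _source, target, tgt_lang in batch:
        # Any language other than the example's own target language qualifies.
        candidates = [lang for lang in languages if lang != tgt_lang]
        intermediate_lang = rng.choice(candidates)
        # Back-translation step: the model output becomes the new source side.
        pseudo_source = translate(target, intermediate_lang)
        pseudo.append((pseudo_source, target, tgt_lang))
    return pseudo


def toy_translate(text: str, tgt_lang: str) -> str:
    """Stand-in for the multilingual model; it only tags the text."""
    return f"[{tgt_lang}] {text}"


rng = random.Random(0)
batch = [("Hello", "Bonjour", "fr"), ("Hello", "Hallo", "de")]
pseudo_batch = robt_batch(batch, ["en", "fr", "de", "es"], toy_translate, rng)
```

"Online" here would mean the pseudo sources are regenerated with the current model during finetuning rather than produced once in advance.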
The authors also provide the OPUS-100 dataset:
(1) 55M sentence pairs
(2) covers 100 languages
The Transformer model (Vaswani et al., 2017) is used as the baseline.
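7. Results
- Increasing model capacity improves performance and narrows the gap between multilingual and bilingual NMT
- Language-specific modeling and deep NMT improve zero-shot performance, but do not resolve the off-target translation problem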
- The ROBT algorithm reduces how often off-target translations occur and improves zero-shot performance by about 10 BLEU, approaching pivot-based methods.
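The off-target claim in the results can be made concrete as a rate: the fraction of outputs whose language differs from the requested target language. The sketch below assumes the output languages have already been detected by some language-identification tool, which is not shown. The function name and the sample data are made up for illustration.

```python
from typing import List


def off_target_rate(detected_langs: List[str], expected_langs: List[str]) -> float:
    """Fraction of translations whose detected language is not the requested one."""
    if len(detected_langs) != len(expected_langs):
        raise ValueError("detected and expected lists must have the same length")
    if not expected_langs:
        return 0.0
    wrong = sum(d != e for d, e in zip(detected_langs, expected_langs))
    return wrong / len(expected_langs)


# Made-up example: French was requested four times, English came out twice.
detected = ["fr", "en", "fr", "en"]
expected = ["fr", "fr", "fr", "fr"]
rate = off_target_rate(detected, expected)
```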
8. Keywords
zero-shot, multilingual translation