1. References
[Bahdanau et al. 2015] D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

[Buck et al. 2014] Christian Buck, Kenneth Heafield, and Bas van Ooyen. 2014. N-gram counts and language models from the common crawl. In LREC.

[Cho et al. 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.

[Fraser and Marcu 2007] Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

[Gregor et al. 2015] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. 2015. DRAW: A recurrent neural network for image generation. In ICML.

[Jean et al. 2015] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL.

[Kalchbrenner and Blunsom 2013] N. Kalchbrenner and P. Blunsom. 2013. Recurrent continuous translation models. In EMNLP.

[Koehn et al. 2003] Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL.

[Liang et al. 2006] P. Liang, B. Taskar, and D. Klein. 2006. Alignment by agreement. In NAACL.

[Luong et al. 2015] M.-T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. 2015. Addressing the rare word problem in neural machine translation. In ACL.

[Mnih et al. 2014] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. In NIPS.

[Papineni et al. 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL.

[Sutskever et al. 2014] I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

[Xu et al. 2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.

[Zaremba et al. 2015] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2015. Recurrent neural network regularization. In ICLR.
2. Alignment Visualization
We visualize the alignment weights produced by our different attention models in Figure 7. The visualization of the local attention model is much sharper than that of the global one. This contrast matches our expectation, since local attention is designed to focus on only a subset of words at each time step. Also, since we translate from English to German and reverse the source English sentence, the white strides at the words "reality" and "." in the global attention model reveal an interesting access pattern: it tends to refer back to the beginning of the source sequence.

Compared to the alignment visualizations in (Bahdanau et al., 2015), our alignment patterns are not as sharp. This difference could be due to the fact that translating from English to German is harder than translating into French as done in (Bahdanau et al., 2015), which is an interesting point to examine in future work.

[Figure 7: alignment visualizations. Axis labels: source sentence "They do not understand why Europe exists in theory but not in reality." and target sentence "Sie verstehen nicht, warum Europa theoretisch zwar existiert, aber nicht in Wirklichkeit."]
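The claim that local attention yields sharper alignments than global attention can be illustrated with a small numerical sketch. Everything below is an assumption made for illustration and is not taken from this section: the toy scores, the window centre and half-width, the Gaussian damping with renormalization used as a stand-in for local attention, and the use of entropy as a sharpness measure (lower entropy means a sharper row of alignment weights).

```python
import math


def softmax(scores):
    """Turn raw alignment scores into weights that sum to 1 (global attention row)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_weights(scores, center, half_width):
    """Hypothetical local-attention row: softmax weights damped by a Gaussian
    around `center`, zeroed outside [center - half_width, center + half_width],
    then renormalized so the row is a distribution (a simplification)."""
    sigma = half_width / 2.0
    damped = []
    for pos, w in enumerate(softmax(scores)):
        if abs(pos - center) <= half_width:
            damped.append(w * math.exp(-((pos - center) ** 2) / (2 * sigma ** 2)))
        else:
            damped.append(0.0)
    total = sum(damped)
    return [d / total for d in damped]


def entropy(weights):
    """Shannon entropy in nats; lower values indicate a sharper alignment row."""
    return -sum(w * math.log(w) for w in weights if w > 0)


# Toy scores for one target word over ten source positions (made-up values).
scores = [0.2, 0.1, 0.4, 1.5, 0.3, 0.2, 0.1, 0.6, 0.0, 0.9]
global_row = softmax(scores)
local_row = local_weights(scores, center=3, half_width=2)

print("global entropy:", round(entropy(global_row), 3))
print("local entropy: ", round(entropy(local_row), 3))
```

Because the local row has mass on at most five positions, its entropy is bounded by log 5, while the global row spreads mass over all ten positions. A heat map of such rows, one per target word, would show the same contrast visually.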