1. Analysis
We conduct extensive analysis to better understand our models in terms of learning, the ability to handle long sentences, choices of attentional architectures, and alignment quality. All results reported here are on English-German newstest2014.
1.1. Learning curves
We compare models built on top of one another as listed in Table 1. Figure 5 shows a clear separation between non-attentional and attentional models. The input-feeding approach and the local attention model also drive the test cost lower. The non-attentional model with dropout (the blue + curve) learns more slowly than the non-dropout models, but as training progresses it becomes more robust in terms of minimizing test errors.

System                                   Ppl.   BLEU
WMT'15 systems
  SOTA - phrase-based (Edinburgh)               29.2
  NMT + 5-gram rerank (MILA)                    27.6
Our NMT systems
  Base (reverse)                         14.3   16.9
  + global (location)                    12.7   19.1 (+2.2)
  + global (location) + feed             10.9   20.1 (+1.0)
  + global (dot) + drop + feed            9.7   22.8 (+2.7)
  + global (dot) + drop + feed + unk            24.9 (+2.1)

Table 3: WMT'15 German-English results. Performances of various systems (similar to Table 1). The base system already includes source reversing, on which we add global attention, dropout, input feeding, and unk replacement.

[Figure 5: Learning curves. Test cost (ln perplexity) on newstest2014 for English-German NMTs as training progresses. x-axis: mini-batches (x 10^5); y-axis: test cost. Curves: basic; basic+reverse; basic+reverse+dropout; basic+reverse+dropout+globalAttn; basic+reverse+dropout+globalAttn+feedInput; basic+reverse+dropout+pLocalAttn+feedInput.]
1.2. Effects of Translating Long Sentences
Following Bahdanau et al. (2015), we group sentences of similar lengths together and compute a BLEU score per group. Figure 6 shows that our attentional models are more effective than the non-attentional one in handling long sentences: the quality does not degrade as sentences become longer. Our best model (the blue + curve) outperforms all other systems in all length buckets.
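The grouping step can be sketched as follows. This is a minimal illustration, not the evaluation script used for Figure 6: the bucket edges, the choice of source length in whitespace tokens, and the names `bucket_by_length`, `score_per_bucket`, and `exact_match` are all assumptions. `exact_match` is a toy stand-in for a corpus-level BLEU function, which would be passed in as `metric` in practice.

```python
from collections import defaultdict


def bucket_by_length(sources, hypotheses, references, edges=(10, 20, 30, 40, 50)):
    """Group (hypothesis, reference) pairs by source length in tokens.

    `edges` are hypothetical upper bounds; sentences longer than the
    last edge fall into one open-ended bucket.
    """
    buckets = defaultdict(list)
    for src, hyp, ref in zip(sources, hypotheses, references):
        n = len(src.split())
        # First edge that the sentence length does not exceed, if any.
        upper = next((e for e in edges if n <= e), None)
        key = f"<={upper}" if upper is not None else f">{edges[-1]}"
        buckets[key].append((hyp, ref))
    return dict(buckets)


def score_per_bucket(buckets, metric):
    """Apply a corpus-level metric(hypotheses, references) to each bucket."""
    return {
        key: metric([h for h, _ in pairs], [r for _, r in pairs])
        for key, pairs in buckets.items()
    }


def exact_match(hypotheses, references):
    """Toy placeholder metric (NOT BLEU): fraction of exact matches."""
    return sum(h == r for h, r in zip(hypotheses, references)) / len(hypotheses)


if __name__ == "__main__":
    srcs = ["a b c", "a b c d e f g h i j k l"]
    hyps = ["x", "y"]
    refs = ["x", "z"]
    groups = bucket_by_length(srcs, hyps, refs, edges=(5, 10))
    print(score_per_bucket(groups, exact_match))
```

Scoring each bucket as its own corpus, rather than averaging sentence-level scores, keeps the per-group numbers comparable to a corpus-level BLEU.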
1.3. Choices of Attentional Architectures
We examine different attention models (global, local-m, local-p) and different alignment functions (location, dot, general, concat) as described in Section 3. Due to limited resources, we cannot run all the possible combinations. However, results in Table 4 do give us some idea about different choices. The location-based function does not learn good alignments: the global (location) model can only obtain a small gain when performing unknown word replacement compared to using other alignment functions.[14] For content-based functions, our implementation of concat does not yield good performance, and more analysis should be done to understand the reason.[15] It is interesting to observe that dot works well for the global attention and general is better for the local attention. Among the different models, the local attention model with predictive alignments (local-p) is best, both in terms of perplexities and BLEU.
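To make the contrast between the dot and general functions concrete, here is a small pure-Python sketch. It assumes the usual content-based forms, score = h_t^T h̄_s for dot and score = h_t^T W_a h̄_s for general, followed by a softmax over source positions; Section 3 remains the authoritative definition, and the function names here are illustrative only.

```python
import math


def dot_score(h_t, h_s):
    """dot: inner product of target state h_t and source state h_s."""
    return sum(a * b for a, b in zip(h_t, h_s))


def general_score(h_t, h_s, W_a):
    """general: h_t^T W_a h_s, with W_a given as a list of rows."""
    projected = [sum(w * x for w, x in zip(row, h_s)) for row in W_a]
    return dot_score(h_t, projected)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(h_t, source_states, score):
    """Alignment weights over all source states for one target step."""
    return softmax([score(h_t, h_s) for h_s in source_states])


if __name__ == "__main__":
    h_t = [1.0, 0.0]
    source_states = [[1.0, 0.0], [0.0, 1.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(attention_weights(h_t, source_states, dot_score))
    # With W_a set to the identity, general reduces to dot.
    print(attention_weights(
        h_t, source_states, lambda a, b: general_score(a, b, identity)))
```

The only difference between the two is the learned matrix W_a, which lets general compare states that do not share a common representation space.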
1.4. Alignment Quality
Word alignments are a by-product of attentional models.
While Bahdanau et al. (2015) visualized alignments for some sample sentences and observed gains in translation quality as an indication of a working attention model, no work has assessed the alignments learned as a whole. In contrast, we set out to evaluate the alignment quality using the alignment error rate (AER) metric.

Method               AER
global (location)    0.39
local-m (general)    0.34
local-p (general)    0.36
ensemble             0.34
Berkeley Aligner     0.32

Table 6: AER scores. Results of various models on the RWTH English-German alignment data.

[14] There is a subtle difference in how we retrieve alignments for the different alignment functions. At time step t, in which we receive y_{t-1} as input and then compute h_t, a_t, c_t, and h̃_t before predicting y_t, the alignment vector a_t is used as alignment weights for (a) the predicted word y_t in the location-based alignment functions and (b) the input word y_{t-1} in the content-based functions.

[15] With concat, the perplexities achieved by different models are 6.7 (global), 7.1 (local-m), and 7.1 (local-p). Such high perplexities could be due to the fact that we simplify the matrix W_a to set the part that corresponds to h̄_s to identity.
Given the gold alignment data provided by RWTH for 508 English-German Europarl sentences, we “force” decode our attentional models to produce translations that match the references. We extract only one-to-one alignments by selecting the source word with the highest alignment weight per target word. Nevertheless, as shown in Table 6, we were able to achieve AER scores comparable to the one-to-many alignments obtained by the Berkeley aligner (Liang et al., 2006).[16]
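The extraction and scoring steps described above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code behind Table 6: it assumes attention weights are available as a target-by-source matrix, that the gold data distinguishes sure (S) and possible (P) links, and that AER takes its standard form 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The names `extract_alignments` and `aer` are hypothetical.

```python
def extract_alignments(weights):
    """One link per target word: the source position with the highest weight.

    weights[t][s] is the alignment weight on source position s at
    target step t. Returns a set of (source, target) index pairs.
    """
    links = set()
    for t, row in enumerate(weights):
        s = max(range(len(row)), key=row.__getitem__)
        links.add((s, t))
    return links


def aer(hypothesis, sure, possible):
    """Alignment error rate; lower is better.

    `sure` links are treated as a subset of `possible` links.
    """
    possible = possible | sure
    denominator = len(hypothesis) + len(sure)
    if denominator == 0:
        return 0.0
    matched = len(hypothesis & sure) + len(hypothesis & possible)
    return 1.0 - matched / denominator


if __name__ == "__main__":
    # Toy example: 3 target words attending over 3 source words.
    weights = [
        [0.7, 0.2, 0.1],
        [0.1, 0.6, 0.3],
        [0.2, 0.5, 0.3],
    ]
    links = extract_alignments(weights)
    gold_sure = {(0, 0), (1, 1), (2, 2)}
    gold_possible = {(1, 2)}
    print(sorted(links), aer(links, gold_sure, gold_possible))
```

Because each target word contributes exactly one link, this procedure cannot recover one-to-many alignments, which is the handicap the comparison with the Berkeley aligner refers to.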
We also found that the alignments produced by local attention models achieve lower AERs than those of the global one. The AER obtained by the ensemble, while good, is not better than the local-m AER, which is consistent with the well-known observation that AER and translation scores are not well correlated (Fraser and Marcu, 2007). We show some alignment visualizations in Appendix A.
1.5. Sample Translations
We show in Table 5 sample translations in both directions. It is appealing to observe the effect of attentional models in correctly translating names such as “Miranda Kerr” and “Roger Dow”. Non-attentional models, while producing sensible names from a language model perspective, lack the direct connections from the source side to make correct translations. We also observed an interesting case in the second example, which requires translating the doubly-negated phrase, “not incompatible”. The attentional model correctly produces “nicht ... unvereinbar”, whereas the non-attentional model generates “nicht verein-

[16] We concatenate the 508 sentence pairs with 1M sentence pairs from WMT and run the Berkeley aligner.
English-German translations

src   Orlando Bloom and Miranda Kerr still love each other
ref   Orlando Bloom und Miranda Kerr lieben sich noch immer
best  Orlando Bloom und Miranda Kerr lieben einander noch immer .
base  Orlando Bloom und Lucas Miranda lieben einander noch immer .

src   '' We're pleased the FAA recognizes that an enjoyable passenger experience is not incompatible with safety and security , '' said Roger Dow , CEO of the U.S. Travel Association .
ref   “ Wir freuen uns , dass die FAA erkennt , dass ein angenehmes Passagiererlebnis nicht im Widerspruch zur Sicherheit steht ” , sagte Roger Dow , CEO der U.S. Travel Association .
best  '' Wir freuen uns , dass die FAA anerkennt , dass ein angenehmes ist nicht mit Sicherheit und Sicherheit unvereinbar ist '' , sagte Roger Dow , CEO der US - die .
base  '' Wir freuen uns über die