1. Analysis

We conduct extensive analysis to better understand our models in terms of learning, the ability to handle long sentences, choices of attentional architectures, and alignment quality. All results reported here are on English-German newstest2014.

1.1. Learning curves

We compare models built on top of one another as listed in Table 1. It is pleasant to observe in Figure 5 a clear separation between non-attentional and attentional models. The input-feeding approach and the local attention model also demonstrate their abilities in driving the test costs lower. The non-attentional model with dropout (the blue + curve) learns slower than other non-dropout models, but as time goes by, it becomes more robust in terms of minimizing test errors.

[Figure 5: Learning curves – test cost (ln perplexity) on newstest2014 for English-German NMTs as training progresses. X-axis: mini-batches (up to 1.8×10^5); y-axis: test cost. Curves: basic; basic+reverse; basic+reverse+dropout; basic+reverse+dropout+globalAttn; basic+reverse+dropout+globalAttn+feedInput; basic+reverse+dropout+pLocalAttn+feedInput.]

Table 3: WMT'15 German-English results – performances of various systems (similar to Table 1). The base system already includes source reversing, on which we add global attention, dropout, input feeding, and unk replacement.

System                                 Ppl.   BLEU
WMT'15 systems
  SOTA – phrase-based (Edinburgh)             29.2
  NMT + 5-gram rerank (MILA)                  27.6
Our NMT systems
  Base (reverse)                       14.3   16.9
  + global (location)                  12.7   19.1 (+2.2)
  + global (location) + feed           10.9   20.1 (+1.0)
  + global (dot) + drop + feed          9.7   22.8 (+2.7)
  + global (dot) + drop + feed + unk          24.9 (+2.1)
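Since Figure 5's y-axis is the ln of perplexity, the plotted quantity is simply the model's average per-token negative log-likelihood on the test set. A minimal illustrative sketch (the function and variable names are ours, not from the paper):

```python
import math

def test_cost(sum_neg_log_prob: float, num_target_tokens: int) -> float:
    """Average per-token negative log-likelihood on the test set.

    This is the "test cost" plotted in Figure 5; perplexity = exp(cost).
    """
    return sum_neg_log_prob / num_target_tokens

# Example: the base model's final perplexity of 14.3 (Table 3)
# corresponds to a test cost of ln(14.3):
print(math.log(14.3))  # ≈ 2.66
```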

1.2. Effects of Translating Long Sentences

We follow Bahdanau et al. (2015) in grouping sentences of similar lengths together and computing a BLEU score per group. Figure 6 shows that our attentional models are more effective than the non-attentional one in handling long sentences: the quality does not degrade as sentences become longer. Our best model (the blue + curve) outperforms all other systems in all length buckets. A sketch of this evaluation protocol appears below.
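As a concrete sketch of the length-bucketed evaluation (the bucket width is our assumption, since the paper does not specify boundaries, and sacrebleu is just one convenient corpus-BLEU implementation):

```python
from collections import defaultdict
import sacrebleu  # assumed available; any corpus-level BLEU implementation works

def bleu_by_source_length(sources, hypotheses, references, bucket_width=10):
    """Bucket test sentences by source length, then score each bucket.

    BLEU is a corpus-level metric, so we aggregate sentences within a
    bucket rather than averaging per-sentence scores.
    """
    buckets = defaultdict(lambda: ([], []))
    for src, hyp, ref in zip(sources, hypotheses, references):
        key = (len(src.split()) // bucket_width) * bucket_width
        buckets[key][0].append(hyp)
        buckets[key][1].append(ref)
    return {
        f"{lo}-{lo + bucket_width}": sacrebleu.corpus_bleu(hyps, [refs]).score
        for lo, (hyps, refs) in sorted(buckets.items())
    }
```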

1.3. Choices of Attentional Architectures

We examine different attention models (global, local-m, local-p) and different alignment functions (location, dot, general, concat) as described in Section 3. Due to limited resources, we cannot run all the possible combinations. However, the results in Table 4 do give us some idea about the different choices. The location-based function does not learn good alignments: the global (location) model can only obtain a small gain when performing unknown word replacement compared to using other alignment functions.[14] For content-based functions, our implementation of concat does not yield good performances, and more analysis should be done to understand the reason.[15] It is interesting to observe that dot works well for the global attention and general is better for the local attention. Among the different models, the local attention model with predictive alignments (local-p) is best, both in terms of perplexities and BLEU.

[14] There is a subtle difference in how we retrieve alignments for the different alignment functions. At time step t, in which we receive y_{t-1} as input and then compute h_t, a_t, c_t, and h̃_t before predicting y_t, the alignment vector a_t is used as alignment weights for (a) the predicted word y_t in the location-based alignment functions and (b) the input word y_{t-1} in the content-based functions.

[15] With concat, the perplexities achieved by the different models are 6.7 (global), 7.1 (local-m), and 7.1 (local-p). Such high perplexities could be due to the fact that we simplify the matrix W_a so that the part corresponding to h̄_s is the identity.
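For reference, the score functions named above are defined in Section 3 of the paper as follows, where h_t is the current target hidden state, h̄_s a source hidden state, and W_a, v_a are learned parameters:

```latex
\[
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
  h_t^\top \bar{h}_s & \textit{dot} \\
  h_t^\top W_a \bar{h}_s & \textit{general} \\
  v_a^\top \tanh\!\left(W_a\,[h_t ; \bar{h}_s]\right) & \textit{concat}
\end{cases}
\qquad\text{and}\qquad
a_t = \mathrm{softmax}(W_a h_t) \quad \textit{(location)}
\]
```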

1.4. Alignment Quality

A by-product of attentional models is word alignments.

While Bahdanau et al. (2015) visualized alignments for some sample sentences and observed gains in translation quality as an indication of a working attention model, no work has assessed the alignments learned as a whole. In contrast, we set out to evaluate the alignment quality using the alignment error rate (AER) metric.

Table 6: AER scores – results of various models on the RWTH English-German alignment data.

Method              AER
global (location)   0.39
local-m (general)   0.34
local-p (general)   0.36
ensemble            0.34
Berkeley Aligner    0.32

Given the gold alignment data provided by RWTH for 508 English-German Europarl sentences, we "force" decode our attentional models to produce translations that match the references. We extract only one-to-one alignments by selecting the source word with the highest alignment weight per target word. Nevertheless, as shown in Table 6, we were able to achieve AER scores comparable to the one-to-many alignments obtained by the Berkeley aligner (Liang et al., 2006).[16]

[16] We concatenate the 508 sentence pairs with 1M sentence pairs from WMT and run the Berkeley aligner.
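A minimal sketch of this procedure, assuming the force-decoder exposes the per-step alignment vectors a_t as a (target_len × source_len) matrix; the function names and the sure/possible formulation of AER follow standard practice (Och and Ney), not code released with the paper:

```python
import numpy as np

def extract_one_to_one(attn_matrix):
    """One alignment link per target word: the argmax source position.

    attn_matrix: (target_len, source_len) array of alignment weights
    collected while force-decoding the reference translation.
    Returns a set of (source_index, target_index) pairs.
    """
    return {(int(np.argmax(row)), t) for t, row in enumerate(attn_matrix)}

def aer(hyp_links, sure, possible):
    """Alignment Error Rate (lower is better).

    sure ⊆ possible are the gold link sets, e.g. from the RWTH data.
    AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
    """
    return 1.0 - (len(hyp_links & sure) + len(hyp_links & possible)) / (
        len(hyp_links) + len(sure)
    )
```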

We also found that the alignments produced by local attention models achieve lower AERs than those of the global one. The AER obtained by the ensemble, while good, is not better than the local-m AER, which is consistent with the well-known observation that AER and translation scores are not well correlated (Fraser and Marcu, 2007). We show some alignment visualizations in Appendix A.

1.5. Sample Translations

We show in Table 5 sample translations in both directions. It is appealing to observe the effect of attentional models in correctly translating names such as "Miranda Kerr" and "Roger Dow". Non-attentional models, while producing sensible names from a language model perspective, lack the direct connections from the source side to make correct translations. We also observed an interesting case in the second example, which requires translating the doubly-negated phrase "not incompatible". The attentional model correctly produces "nicht ... unvereinbar", whereas the non-attentional model generates "nicht vereinbar", meaning "not compatible". The attentional model also demonstrates its superiority in translating long sentences, as in the last example.

Table 5: Sample translations – for each example, we show the source (src), the human translation (ref), the translation from our best model (best), and the translation of a non-attentional model (base). We italicize some correct translation segments and highlight a few wrong ones in bold.

English-German translations

src   Orlando Bloom and Miranda Kerr still love each other
ref   Orlando Bloom und Miranda Kerr lieben sich noch immer
best  Orlando Bloom und Miranda Kerr lieben einander noch immer .
base  Orlando Bloom und Lucas Miranda lieben einander noch immer .

src   " We 're pleased the FAA recognizes that an enjoyable passenger experience is not incompatible with safety and security , " said Roger Dow , CEO of the U.S. Travel Association .
ref   " Wir freuen uns , dass die FAA erkennt , dass ein angenehmes Passagiererlebnis nicht im Widerspruch zur Sicherheit steht " , sagte Roger Dow , CEO der U.S. Travel Association .
best  " Wir freuen uns , dass die FAA anerkennt , dass ein angenehmes ist nicht mit Sicherheit und Sicherheit unvereinbar ist " , sagte Roger Dow , CEO der US - die .
base  " Wir freuen uns über die , dass ein mit Sicherheit nicht vereinbar ist mit Sicherheit und Sicherheit " , sagte Roger Cameron , CEO der US - .

German-English translations

src   In einem Interview sagte Bloom jedoch , dass er und Kerr sich noch immer lieben .
ref   However , in an interview , Bloom has said that he and Kerr still love each other .
best  In an interview , however , Bloom said that he and Kerr still love .
base  However , in an interview , Bloom said that he and Tina were still .

src   Wegen der von Berlin und der Europäischen Zentralbank verhängten strengen Sparpolitik in Verbindung mit der Zwangsjacke , in die die jeweilige nationale Wirtschaft durch das Festhalten an der gemeinsamen Währung genötigt wird , sind viele Menschen der Ansicht , das Projekt Europa sei zu weit gegangen
ref   The austerity imposed by Berlin and the European Central Bank , coupled with the straitjacket imposed on national economies through adherence to the common currency , has led many people to think Project Europe has gone too far .
best  Because of the strict austerity measures imposed by Berlin and the European Central Bank in connection with the straitjacket in which the respective national economy is forced to adhere to the common currency , many people believe that the European project has gone too far .
base  Because of the pressure imposed by the European Central Bank and the Federal Central Bank with the strict austerity imposed on the national economy in the face of the single currency , many people believe that the European project has gone too far .
