References Abadi, Mart´ ın, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghe- mawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irv- ing, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dan, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, andZheng, Xiaoqiang. TensorFlow: Large-scale machine learning on heteroge- neous systems, 2015. URL http://tensorflow. org/. Software available from tensorflow.org. Arisoy, Ebru, Sainath, Tara N, Kingsbury, Brian, and Ram- abhadran, Bhuvana. Deepneuralnetworklanguagemod- els. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pp. 20–28. As- sociation for Computational Linguistics, 2012. Ballesteros, Miguel, Dyer, Chris, and Smith, Noah A. Improved transition-based parsing by modeling char- acters instead of words with lstms. arXiv preprint arXiv:1508.00657, 2015. Bengio, Yoshua and Senécal, Jean-Sébastien. Adaptive im- portancesamplingtoacceleratetrainingofaneuralprob- abilistic language model. Neural Networks, IEEE Trans- actions on, 19(4):713–722, 2008. Bengio, Yoshua, Senécal, Jean-Sébastien, et al. Quick training of probabilistic neural nets by importance sam- pling. In AISTATS, 2003. Bengio, Yoshua, Schwenk, Holger, Senécal, Jean- Sébastien, Morin, Fréderic, and Gauvain, Jean-Luc. Neural probabilistic language models. In Innovations in Machine Learning, pp. 137–186. Springer, 2006. Chelba, Ciprian, Mikolov, Tomas, Schuster, Mike, Ge, Qi, Brants, Thorsten, Koehn, Phillipp, and Robinson, Tony. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013. Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase represen- tations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recog- nition, 2009. CVPR 2009. IEEE Conference on, pp. 248–
- IEEE, 2009.
Filippova, Katja, Alfonseca, Enrique, Colmenares, Car-
losA,Kaiser, Lukasz, andVinyals, Oriol. Sentencecom-
pression by deletion with lstms. In Proceedings of the
2015 Conference on Empirical Methods in Natural Lan-
guage Processing, pp. 360–368, 2015.
Gers, Felix A, Schmidhuber, Jürgen, and Cummins, Fred.
Learning to forget: Continual prediction with lstm. Neu-
ral computation, 12(10):2451–2471, 2000.
Gillick, Dan, Brunk, Cliff, Vinyals, Oriol, and Subra-
manya, Amarnag. Multilingual language processing
from bytes. arXiv preprint arXiv:1512.00103, 2015.
Graves, Alex. Generating sequences with recurrent neural
networks. arXiv preprint arXiv:1308.0850, 2013.
Graves, Alex and Schmidhuber, Jürgen. Framewise
phoneme classification with bidirectional lstm and other
neural network architectures. Neural Networks, 18(5):
602–610, 2005.
Gutmann, Michael and Hyvärinen, Aapo. Noise-
contrastive estimation: A new estimation principle for
unnormalized statistical models. In International Con-
ference on Artificial Intelligence and Statistics, pp. 297–
304, 2010.
Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-
term memory. Neural computation, 9(8):1735–1780,
1997.
Ji, Shihao, Vishwanathan, S. V. N., Satish, Nadathur, An-
derson, Michael J., and Dubey, Pradeep. Blackout:
Speeding up recurrent neural network language models
with very large vocabularies. CoRR, abs/1511.06909,
2015a. URL http://arxiv.org/abs/1511.
06909.
Ji, Yangfeng, Cohn, Trevor, Kong, Lingpeng, Dyer, Chris,
and Eisenstein, Jacob. Document context language mod-
els. arXiv preprint arXiv:1511.03962, 2015b.
Jozefowicz, Rafal, Zaremba, Wojciech, and Sutskever,
Ilya. An empirical exploration of recurrent network ar-
chitectures. In Proceedings of the 32nd International
Conference on Machine Learning (ICML-15), pp. 2342–
2350, 2015.
Kalchbrenner, Nal, Grefenstette, Edward, and Blunsom,
Phil. A convolutional neural network for modelling sen-
tences. arXiv preprint arXiv:1404.2188, 2014.
Kim, Yoon, Jernite, Yacine, Sontag, David, and Rush,
Alexander M. Character-aware neural language models.
arXiv preprint arXiv:1508.06615, 2015.
Kneser, Reinhard and Ney, Hermann. Improved backing-
offform-gramlanguagemodeling. InAcoustics, Speech,
and Signal Processing, 1995. ICASSP-95., 1995 Inter-
national Conference on, volume 1, pp. 181–184. IEEE,
1995.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E.
Imagenet classification with deep convolutional neural
networks. In Advances in neural information processing
systems, pp. 1097–1105, 2012.
Le Cun, B Boser, Denker, John S, Henderson, D, Howard,
Richard E, Hubbard, W, and Jackel, Lawrence D. Hand-
written digit recognition with a back-propagation net-
work. In Advances in neural information processing sys-
tems. Citeseer, 1990.
Ling, Wang, Lu´ ıs, Tiago, Marujo, Lu´ ıs, Astudillo,
Ramón Fernandez, Amir, Silvio, Dyer, Chris, Black,
Alan W, and Trancoso, Isabel. Finding function in form:
Compositional character models for open vocabulary
word representation. arXiv preprint arXiv:1508.02096,
2015.
Luong, Minh-Thang, Sutskever, Ilya, Le, Quoc V, Vinyals,
Oriol, and Zaremba, Wojciech. Addressing the rare word
problem in neural machine translation. arXiv preprint
arXiv:1410.8206, 2014.
Marcus, Mitchell P, Marcinkiewicz, Mary Ann, and San-
torini, Beatrice. Building a large annotated corpus of
english: The penn treebank. Computational linguistics,
19(2):313–330, 1993.
Mikolov, Tomᡠs. Statistical language models based on neu-
ral networks. Presentation at Google, Mountain View,
2nd April, 2012.
Mikolov, Tomas and Zweig, Geoffrey. Context dependent
recurrent neural network language model. In SLT, pp.
234–239, 2012.
Mikolov, Tomas, Karafiát, Martin, Burget, Lukas, Cer-
nock
y, Jan, and Khudanpur, Sanjeev. Recurrent neural network based language model. In INTERSPEECH, vol- ume 2, pp. 3, 2010. Mikolov, Tomas, Deoras, Anoop, Kombrink, Stefan, Bur- get, Lukas, and Cernock
y, Jan. Empirical evaluation and combinationofadvancedlanguagemodelingtechniques. In INTERSPEECH, number s 1, pp. 605–608, 2011. Mnih, Andriy and Hinton, Geoffrey E. A scalable hierar- chicaldistributedlanguagemodel. InAdvancesinneural information processing systems, pp. 1081–1088, 2009. Mnih, Andriy and Kavukcuoglu, Koray. Learning word embeddings efficiently with noise-contrastive estima- tion. In Advances in Neural Information Processing Sys- tems, pp. 2265–2273, 2013. Morin, Frederic and Bengio, Yoshua. Hierarchical proba- bilistic neural network language model. In Aistats, vol- ume 5, pp. 246–252. Citeseer, 2005. Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012. Rush, Alexander M, Chopra, Sumit, and Weston, Jason. A neural attention model for abstractive sentence summa- rization. arXiv preprint arXiv:1509.00685, 2015. Sak, Hasim, Senior, Andrew W, and Beaufays, Franc ¸oise. Longshort-termmemoryrecurrentneuralnetworkarchi- tectures for large scale acoustic modeling. In INTER- SPEECH, pp. 338–342, 2014. Schuster, Mike and Paliwal, Kuldip K. Bidirectional recur- rent neural networks. Signal Processing, IEEE Transac- tions on, 45(11):2673–2681, 1997. Schwenk, Holger, Rousseau, Anthony, and Attik, Mo- hammed. Large, pruned or continuous space language models on a gpu for statistical machine translation. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Fu- ture of Language Modeling for HLT, pp. 11–19. Associ- ation for Computational Linguistics, 2012. Serban, Iulian Vlad, Sordoni, Alessandro, Bengio, Yoshua, Courville, Aaron C., and Pineau, Joelle. Hierarchical neural network generative models for movie dialogues. CoRR, abs/1507.04808, 2015. URL http://arxiv. org/abs/1507.04808. Exploring the Limits of Language Modeling Shazeer, Noam, Pelemans, Joris, and Chelba, Ciprian. Sparse non-negative matrix language modeling for skip- grams. Proceedings of Interspeech, pp. 1428–1432, 2015. Srivastava, Nitish. Improving neural networks with dropout. PhD thesis, University of Toronto, 2013. Srivastava, Nitish, Mansimov, Elman, and Salakhutdinov, Ruslan. Unsupervised learning of video representations using lstms. arXiv preprint arXiv:1502.04681, 2015a. Srivastava, Rupesh K, Greff, Klaus, and Schmidhuber, Jürgen. Training very deep networks. In Advances in Neural Information Processing Systems, pp. 2368–2376, 2015b. Sutskever, Ilya, Martens, James, and Hinton, Geoffrey E. Generating text with recurrent neural networks. In Pro- ceedings of the 28th International Conference on Ma- chine Learning (ICML-11), pp. 1017–1024, 2011. Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Se- quence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014. Vaswani, Ashish, Zhao, Yinggong, Fossum, Victoria, and Chiang, David. Decoding with large-scale neural lan- guage models improves translation. Citeseer. Vincent, Pascal, de Brébisson, Alexandre, and Bouthillier, Xavier. Efficient exact gradient update for training deep networks with very large sparse targets. In Advances in Neural Information Processing Systems, pp. 1108–1116, 2015. Vinyals, Oriol and Le, Quoc. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015. Wang, Tian and Cho, Kyunghyun. Larger-context language modelling. arXiv preprint arXiv:1511.03729, 2015. Williams, Ronald J and Peng, Jing. An efficient gradient- based algorithm for on-line training of recurrent network trajectories. Neural computation, 2(4):490–501, 1990. Williams, Will, Prasad, Niranjani, Mrva, David, Ash, Tom, and Robinson, Tony. Scaling recurrent neural network language models. In Acoustics, Speech and Signal Pro- cessing (ICASSP), 2015 IEEE International Conference on, pp. 5391–5395. IEEE, 2015. Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.