Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].

Recurrent models typically factor computation along the symbol positions of the input and output sequences.

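Some Chinese translations render "factor computation" here as the noun phrase "因子计算" (roughly "factor calculation"), which is wrong. "Factor" is the verb in this sentence; if "factor computation" were read as a single noun phrase, the sentence would have no verb.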

Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t.


This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
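To make the sequential dependency concrete, here is a minimal sketch of a recurrent update loop. It is a toy, not the model from any of the cited papers: the names `rnn_step` and `run_rnn` are made up for illustration, and scalar weights `w_h` and `w_x` stand in for the weight matrices a real RNN would use.

```python
import math

def rnn_step(h_prev, x_t, w_h, w_x):
    """Toy recurrent update: h_t = tanh(w_h * h_{t-1} + w_x * x_t), elementwise."""
    return [math.tanh(w_h * h + w_x * x) for h, x in zip(h_prev, x_t)]

def run_rnn(inputs, hidden_size, w_h=0.5, w_x=1.0):
    """Run the toy RNN over a sequence and return every hidden state."""
    h = [0.0] * hidden_size          # initial hidden state h_0
    states = []
    for x_t in inputs:               # step t cannot start until step t-1 has finished
        h = rnn_step(h, x_t, w_h, w_x)
        states.append(h)
    return states

states = run_rnn([[1.0, 0.0], [0.0, 1.0]], hidden_size=2)
print(len(states))
```

Because each iteration reads the `h` produced by the previous one, the loop over positions cannot be spread across parallel workers; only the work inside a single step (and across different examples in a batch) can.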





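[?] What I still don't understand is why the batch size should depend on the sequence length. My own view is that once the next time step is being computed, the information from the current time step is no longer needed. So the memory taken by one example should depend only on the memory of a single time step, not on the number of time steps.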
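One likely answer to the question above: the intuition holds at inference time, but during training with backpropagation through time the hidden state of every time step has to be kept until the backward pass, so per-example memory grows with sequence length. The arithmetic below is a rough sketch under that assumption. The function names, the 1 GiB budget, and the hidden size of 1024 are illustrative choices, and the estimate ignores weights, gate activations, and optimizer state.

```python
def bptt_activation_bytes(batch_size, seq_len, hidden_size, bytes_per_value=4):
    """Rough memory for the hidden states kept for the backward pass (one layer)."""
    return batch_size * seq_len * hidden_size * bytes_per_value

def max_batch_size(budget_bytes, seq_len, hidden_size, bytes_per_value=4):
    """Largest batch whose stored hidden states fit in the given budget."""
    return budget_bytes // (seq_len * hidden_size * bytes_per_value)

budget = 1 * 1024**3  # assumed 1 GiB reserved for stored hidden states
for seq_len in (50, 500):
    print(seq_len, max_batch_size(budget, seq_len, hidden_size=1024))
```

Under these assumptions a tenfold longer sequence leaves room for roughly a tenth as many examples per batch, which is the sense in which memory constraints limit batching at longer sequence lengths.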

Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.

factorization tricks
conditional computation

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
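As an illustration of "without regard to their distance", the sketch below implements scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, with plain Python lists. This particular form is the one the paper defines in a later section, not in this introduction, and the list-based code and function names are only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:  # each query is independent of the others
        # One score per key position, however far away that position is.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
print(attention([[0.0, 0.0]], keys, values))
```

Every query scores every key directly, so the path between any two positions has length one, and since no query depends on another query's result, all positions can be computed in parallel.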

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

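Before the Transformer, sequence modeling and sequence transduction problems were generally handled with gated, recurrent network architectures. "Recurrent" here means that data flows from the hidden state of the current time step to the hidden state of the next time step. This approach suffers from poor parallelism and from difficulty in learning long-range dependencies.
The Transformer replaces the recurrent structure of traditional sequence transduction models with attention. This alleviates the poor-parallelism problem and addresses the difficulty of learning long-range dependencies. Abandoning recurrence does not mean that the time steps in the Transformer are unrelated. In fact, data still flows from the current time step to the next: the next time step uses the output of the current one.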
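The remaining step-to-step dependency shows up when the decoder generates output one token at a time. The loop below is a minimal sketch of greedy autoregressive decoding. `next_token` is a hypothetical stand-in for a trained decoder, and `toy_next_token` is a dummy that simply counts upward. During training the target sequence is already known, so the paper processes all positions in parallel with masking instead of looping like this.

```python
def greedy_decode(next_token, start_token, end_token, max_len=10):
    """Generate tokens one at a time; each step sees every token produced so far."""
    output = [start_token]
    while len(output) < max_len:
        token = next_token(output)   # hypothetical model call over the whole prefix
        output.append(token)         # this output becomes input to the next step
        if token == end_token:
            break
    return output

def toy_next_token(prefix):
    # Dummy stand-in for a trained decoder: return the last token plus one.
    return prefix[-1] + 1

print(greedy_decode(toy_next_token, start_token=0, end_token=3))  # prints [0, 1, 2, 3]
```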

