
This tutorial demonstrates how to create and train a sequence-to-sequence Transformer model to translate Portuguese into English. The Transformer was originally proposed in "Attention is all you need" by Vaswani et al. (2017).

Transformers are deep neural networks that replace CNNs and RNNs with self-attention. Self-attention allows Transformers to easily transmit information across the input sequences.

Neural networks for machine translation typically contain an encoder that reads the input sentence and generates a representation of it. A decoder then generates the output sentence word by word while consulting the representation produced by the encoder. The Transformer starts by generating initial representations, or embeddings, for each word. Then, using self-attention, it aggregates information from all of the other words, generating a new representation per word that is informed by the entire context (the filled balls in Figure 1). This step is then repeated multiple times in parallel for all words, successively generating new representations.

Figure 1: Applying the Transformer to machine translation.
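To make the aggregation step above concrete, here is a minimal sketch of scaled dot-product self-attention. It is not the tutorial's own code: the shapes are made up, and it skips the learned query/key/value projections and multiple heads that the full model uses, attending with the raw embeddings instead.

```python
import tensorflow as tf

def scaled_dot_product_self_attention(x):
    """Let every position in x gather information from every other position.

    x: [batch, seq_len, depth] word embeddings (illustrative shapes).
    Returns: [batch, seq_len, depth] context-aware representations.
    """
    depth = tf.cast(tf.shape(x)[-1], tf.float32)
    # Score every position against every other position.
    scores = tf.matmul(x, x, transpose_b=True) / tf.sqrt(depth)  # [batch, seq, seq]
    weights = tf.nn.softmax(scores, axis=-1)   # attention weights, each row sums to 1
    # Each output is a weighted sum over the whole sequence.
    return tf.matmul(weights, x)               # [batch, seq, depth]

# Example: 2 sentences, 5 tokens each, 8-dimensional embeddings.
x = tf.random.normal([2, 5, 8])
print(scaled_dot_product_self_attention(x).shape)  # (2, 5, 8)
```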

That's a lot to digest; the goal of this tutorial is to break it down into easy-to-understand parts. To get the most out of this tutorial, it helps if you know about the basics of text generation and attention mechanisms.

A Transformer is a sequence-to-sequence encoder-decoder model similar to the model in the NMT with attention tutorial.

The RNN+Attention model

A single-layer Transformer takes a little more code to write, but is almost identical to that encoder-decoder RNN model. The only difference is that the RNN layers are replaced with self-attention layers. This tutorial builds a 4-layer Transformer, which is larger and more powerful, but not fundamentally more complex.

After training the model in this notebook, you will be able to input a Portuguese sentence and return the English translation.

Figure 2: Visualized attention weights that you can generate at the end of this tutorial.

Transformers excel at modeling sequential data, such as natural language. Unlike recurrent neural networks (RNNs), Transformers are parallelizable, which makes them efficient on hardware like GPUs and TPUs. The main reason is that Transformers replace recurrence with attention, so computations can happen simultaneously: layer outputs are computed in parallel rather than in series, as in an RNN.
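The "in parallel rather than in series" point is easiest to see in code. The sketch below is illustrative only: it steps a hand-written vanilla RNN through a sequence one position at a time (each hidden state needs the previous one), then computes the same kind of scaled dot-product self-attention for all positions in a few matrix multiplications.

```python
import tensorflow as tf

batch, seq_len, depth = 2, 5, 8                  # illustrative sizes
x = tf.random.normal([batch, seq_len, depth])

# RNN: h_t depends on h_{t-1}, so positions are processed one after another.
W_x = tf.random.normal([depth, depth])
W_h = tf.random.normal([depth, depth])
h = tf.zeros([batch, depth])
rnn_outputs = []
for t in range(seq_len):                         # inherently sequential loop
    h = tf.tanh(x[:, t, :] @ W_x + h @ W_h)
    rnn_outputs.append(h)
rnn_outputs = tf.stack(rnn_outputs, axis=1)      # [batch, seq_len, depth]

# Self-attention: every position is computed at once by the same few
# matrix multiplications, which map well onto GPUs and TPUs.
scores = tf.matmul(x, x, transpose_b=True) / depth ** 0.5
attn_outputs = tf.matmul(tf.nn.softmax(scores, axis=-1), x)  # [batch, seq_len, depth]

print(rnn_outputs.shape, attn_outputs.shape)     # (2, 5, 8) (2, 5, 8)
```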

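Finally, to connect this back to the earlier point that a single-layer Transformer is almost the encoder-decoder RNN model with its recurrent layers swapped for self-attention, the sketch below builds two tiny encoders side by side. The vocabulary size, dimensions, and layer choices are assumptions for illustration, and the attention version omits the positional encoding, residual connections, layer normalization, and feed-forward sublayer that the tutorial's real encoder layers add.

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, d_model, seq_len = 8000, 128, 40      # illustrative sizes

# An RNN-style encoder, in the spirit of the NMT-with-attention model.
rnn_encoder = tf.keras.Sequential([
    layers.Embedding(vocab_size, d_model),
    layers.GRU(d_model, return_sequences=True),   # recurrence over the sequence
])

# The same idea with the recurrent layer swapped for self-attention.
token_ids = layers.Input(shape=(seq_len,), dtype="int32")
embedded = layers.Embedding(vocab_size, d_model)(token_ids)
contextual = layers.MultiHeadAttention(num_heads=4, key_dim=d_model // 4)(
    query=embedded, value=embedded, key=embedded)
attention_encoder = tf.keras.Model(token_ids, contextual)

tokens = tf.random.uniform([2, seq_len], maxval=vocab_size, dtype=tf.int32)
print(rnn_encoder(tokens).shape)        # (2, 40, 128)
print(attention_encoder(tokens).shape)  # (2, 40, 128)
```

Both encoders map token IDs to one vector per position with the same output shape, which is the sense in which the two models are interchangeable at the layer level; the tutorial then stacks four such attention-based layers to build its 4-layer Transformer.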