prior sequence models (RNNs, convolutional models like ByteNet/ConvS2S) have trouble learning dependencies between distant positions (characters/words): # ops to relate two positions grows O(n) (recurrent) or O(log n) (dilated convs).
transformer reduces this to O(1): self-attention connects any pair of positions directly, in a constant number of sequential operations.
encoder-decoder architecture with residual conns around each sub-layer. Encoder and decoder are each a stack of N identical layers (N=6 in the paper).
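A minimal numpy sketch of the residual-plus-stack pattern. The "sub-layers" here are stand-in linear maps, not real attention/FFN blocks; `layer_norm`, `d_model`, and the toy shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize over the feature (last) dimension
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def sublayer_with_residual(x, sublayer):
    # LayerNorm(x + Sublayer(x)): the residual pattern around every sub-layer
    return layer_norm(x + sublayer(x))

def stack(x, layers):
    # encoder/decoder = N identical layers applied in sequence
    for layer in layers:
        x = sublayer_with_residual(x, layer)
    return x

rng = np.random.default_rng(0)
d_model = 8
# toy stand-ins for the N=6 layers: small random linear maps
layers = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d_model, d_model)) * 0.1)
          for _ in range(6)]

x = rng.normal(size=(4, d_model))  # 4 positions, d_model features
out = stack(x, layers)
print(out.shape)  # (4, 8) -- shape is preserved through the whole stack
```

The residual path is why the stack stays trainable at depth: each layer only learns a perturbation of the identity.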
We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, **ensures that the predictions for position i can depend only on the known outputs at positions less than i**.
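The masking above can be sketched in numpy: set the score for every future position (j > i) to -inf before the softmax, so those positions get exactly zero attention weight. The function name and toy 5x5 scores are illustrative, not from the paper.

```python
import numpy as np

def causal_softmax(scores):
    # scores: (n, n) matrix of raw attention logits (QK^T / sqrt(d_k))
    n = scores.shape[-1]
    # mask entries above the diagonal: position i must not see j > i
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # softmax row-wise; exp(-inf) = 0, so future positions get zero weight
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

scores = np.random.default_rng(1).normal(size=(5, 5))
w = causal_softmax(scores)
# resulting attention matrix is lower-triangular:
print(np.allclose(np.triu(w, k=1), 0))  # True
```

Combined with shifting the targets right by one position, row i of `w` only mixes information from outputs 0..i, which is what makes autoregressive decoding valid.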