Intuitive Explanation of Transformer (without math)

Summary: Transformer (a la "Attention is All You Need") is a complex model that is built upon several important ideas. In this article, we explain these ideas in terms of traditional programming concepts, without using math.

Prerequisites: basic understanding of NN and RNN, and Python.

What's wrong with RNN?

Here's the general seq2seq translation framework:

"King Midas has donkey ears" --[encode]-->  [17 29 54] --[decode]--> "王様 の 耳 は ロバ の 耳"
                                          |<--fixed-->|

In a typical RNN, the input sequence is processed one element at a time. It is "compressed (encoded)" into an intermediate representation (a fixed-length vector), then decoded into the target language.

This model has one obvious bottleneck: the intermediate vector has a fixed length, so it can only hold so much information. Naturally, this model is known to suffer with long input sequences.
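
To make this bottleneck concrete, here is a hypothetical sketch (rnn_encode() and rnn_step() are made up for illustration, not real RNN equations): whatever the input length, everything has to be squeezed into one fixed-length state vector.

def rnn_step(state, x):
    # Stand-in for a learned update; a real RNN uses matrices and nonlinearities.
    return [ s*0.5 + x for s in state ]

def rnn_encode(src):
    state = [0.0, 0.0, 0.0]   # the fixed-length intermediate vector
    for x in src:             # processed strictly one element at a time
        state = rnn_step(state, x)
    return state              # always length 3, however long src is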

How is Transformer different?

Transformer fundamentally resolves the above problem by allowing an intermediate representation that has the same length as the input sequence:

"King Midas has donkey ears"
   |    |    |    |     |  (encode)
   v    v    v    v     v
 [17   39   27    53   76] (memory)
|<- same length as input ->|
   |    |    |    |     |  (decode)
   v    v    v    v     v
  "王様 の 耳 は ロバ の 耳"

Note that the meaning of the intermediate vector is very different in RNN and Transformer. In RNN, the information is accumulated into the vector as it goes through each element, so the longer the input sequence gets, the more compression occurs. This means that, in RNN, a vector from a short sequence can't be directly compared with a vector from a longer sequence. In Transformer, on the other hand, each element carries an equal amount of information, and a representation of a short sequence naturally extends to a longer one. This is good for training.

Furthermore, unlike RNN, which needs to process one element at a time, Transformer can process the elements in parallel during training/inference. So intuitively it has a good chance (at least in theory) of beating traditional RNN models.

How does Transformer work?

Now let's take a look at how the algorithm works overall:

def transformer(input):
    memory = encoder(input)   # the whole input is encoded at once
    output = [BOS]
    while True:
        # Generate one element at a time, feeding the output so far back in.
        output.append( decoder(memory, output) )
        if output[-1] == EOS: break
    return output

Seems pretty straightforward, right? First, the input sequence is converted to "memory" (whatever it is). Then, the output sequence is generated one element at a time.

Now, what is this "memory" thing?

Intuitively speaking, this is an associative array, hash table, or Python dictionary. The encoder constructs the dictionary, and the decoder looks it up.

Here's the intuitive Python version:

def encoder(src):
    h1 = { key1(x):value1(x) for x in src }
    h2 = { key2(x):value2(x) for x in src }
    memory = (h1,h2)
    return memory

def decoder(memory, target):
    (h1,h2) = memory
    v1 = [ h1.get(query1(x)) for x in target ]
    v2 = [ h2.get(query2(x)) for x in target ]
    return ff(v1,v2)

Here, key1(), value1(), key2(), value2(), query1(), query2() and ff() are all learnable functions. Note that the real Transformer does not use a dictionary, because dictionaries are not differentiable (which is a vital requirement for any neural network model). In the real Transformer, vectors and matrices are used to do a comparable operation.
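
As a rough idea of how a dictionary lookup can be made differentiable, here is a hypothetical soft_lookup() (not from the paper; the real Transformer compares query and key vectors with dot products, while here we compare scalars with a negative squared distance just to keep the toy one-dimensional). Instead of an exact key match, the query is compared with every key, the scores are turned into weights with softmax, and the values are averaged with those weights:

import math

def soft_lookup(keys, values, query):
    # Compare the query with every key (higher score = more similar).
    scores = [ -(k-query)**2 for k in keys ]
    # Softmax: turn the scores into weights that sum to 1.
    exps = [ math.exp(s) for s in scores ]
    weights = [ e/sum(exps) for e in exps ]
    # Return a weighted average of the values instead of a single hit.
    return sum( w*v for (w,v) in zip(weights, values) )

print(soft_lookup([1,2,3], [10,20,30], 2))  # 20.0 (dominated by the value at key 2)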

Also note that the above encoder/decoder uses only two dictionaries. In the original paper, they used eight dictionary-like things (called "heads" in the paper) but for simplicity we just use two of them here.

Here is a simple model that uses the above framework. It simply returns a sequence identical to the input:

BOS = 0
EOS = 999

# Let's say these are learned functions.
def key1(x): return x
def value1(x): return x
def key2(x): return x
def value2(x): return 1
def query1(x): return x
def query2(x): return x+1
def ff(v1,v2):
    x1 = v1[-1]
    x2 = v2[-1]
    if x2 is None:
        return EOS
    else:
        return x1+1

print(transformer([BOS,1,2,3,4,5,EOS])) # [BOS,1,2,3,4,5,EOS]

This model is still rather limited because it can only perform a simple element-wise operation. Specifically, it lacks the ability to take the following information into account (which is crucial for most sequence problems):

  • a. Relationship of multiple elements.
  • b. Order of each element.

Considering the relationship of multiple elements (Self-Attention)

Now let's tackle the first problem: how can Transformer gather information from multiple elements? It uses a mechanism called "Self-Attention". This is the key idea of the paper (hence the title) and one of the biggest breakthroughs in recent NN research.

However, I find the term "Self-Attention" rather confusing. To me, it could be more aptly called "inner relationship", because what it does is figure out the relationships between the elements within an input sequence. Something like this:

             +---object--+
             |           |
  +-subject--+           |
  |          |           |
  +-adj-+    |     +-adj-+
  |     |    |     |     |
"King Midas has donkey ears"

Each relationship consists of a source, a target and a type. It's a kind of graph. What's great about Transformer is that it can discover these relationships automatically! Note that the actual relationships ("attentions") in Transformer aren't this clear-cut, and we often don't know what kind of relationship is actually expressed, but it's nonetheless "some" relationship. This is analogous to the visual features learned in CNN layers. It's hard to know what kind of features are actually captured at each Conv layer, but they're nonetheless features!

Here's a Python version of this relationship-building process:

def self_attn(seq):
    h1 = { sa_key1(x):sa_value1(x) for x in seq }
    h2 = { sa_key2(x):sa_value2(x) for x in seq }
    a1 = [ h1.get(sa_query1(x)) for x in seq ]
    a2 = [ h2.get(sa_query2(x)) for x in seq ]
    return [ aa(y1,y2) for (y1,y2) in zip(a1,a2) ]

Here, "seq" is the input sequence in question. sa_key1(), sa_value1(), sa_key2(), sa_value2(), sa_query1(), sa_query2() and aa() are all learnable functions. We first create two dictionaries, h1 and h2, which correspond to two types of "relationships" (again, they are called "heads"). Each key/value in the dictionary is computed from each element. Then, we create two sequences by looking up the dictionary with its own input. This effectively compares all the input elements with each other. Finally, we combine these two sequences to produce a final output sequence. This sequence takes care of two types of such relationships. In the example below, each element of the output sequence is 1 if there's a half or double of the original value within the input sequence.

def sa_key1(w): return w
def sa_value1(w): return w
def sa_key2(w): return w
def sa_value2(w): return w
def sa_query1(w): return w*2
def sa_query2(w): return w/2
def aa(v1,v2):
    if v1 is None and v2 is None: return 0
    return 1

print(self_attn([BOS,1,2,3,4,5,8,EOS])) # [1,1,1,0,1,0,1,0]

In Transformer, eight types of information are carried separately in the output vector, so it holds multiple relationships (this is called "Multi-Head Attention"). Again, the real implementation does not use a dictionary but does a comparable thing by computing dot products and softmax. Furthermore, the paper stacks this layer six times. This means that shallow relationships can be used to discover deeper, more complex relationships between elements. Self-Attention is really the key to handling the complexity of input sequences.
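
As a rough sketch of what "stacking" means here (stacked_self_attn() is made up for illustration; each layer in the real model has its own learned functions, plus feed-forward sublayers and residual connections in between):

def stacked_self_attn(seq, n_layers=6):
    for _ in range(n_layers):
        seq = self_attn(seq)   # layer i+1 works on the relationships found by layer i
    return seq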

Considering the element order (Positional Encoding)

The second problem we need to address is the element order. There are a number of ways to do this. The most straightforward is to add an extra field to each element that contains its position. However, this increases the amount of information per input element and puts an extra burden on the network. A better way is to superimpose some kind of "watermark" on each element such that you can tell whether an element comes before or after another, and that's what the Transformer paper does. This method is called "Positional Encoding".

Here's a very simplistic example of positional encoding:

def add_positional(seq):
    return [ i*1000+x for (i, x) in enumerate(seq) ]

print(add_positional([BOS,2,5,7,9,20,EOS])) # [0, 1002, 2005, 3007, 4009, 5020, 6999]

This simple implementation has a shortcoming: the system can't distinguish some elements once they are added together (e.g. 1002 + 2005 = 3007). The real Positional Encoding is a little more sophisticated so that this kind of thing doesn't happen, but the basic idea is the same.
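
For reference, here is a sketch of the sinusoidal encoding used in the paper. It assumes each element is a vector of length d (in our scalar toy examples it isn't, so the example below keeps using the simple add_positional() above): a sine/cosine pattern with a different frequency is added to each dimension, which stamps every position with a distinctive "watermark".

import math

def positional_encoding(pos, d):
    # One sine/cosine pair per pair of dimensions, with geometrically decreasing frequencies.
    pe = []
    for i in range(0, d, 2):
        freq = 1.0 / (10000 ** (i/d))
        pe.append(math.sin(pos*freq))
        pe.append(math.cos(pos*freq))
    return pe[:d]

def add_positional_vec(vectors):
    d = len(vectors[0])
    return [ [ x+p for (x,p) in zip(vec, positional_encoding(i, d)) ]
             for (i, vec) in enumerate(vectors) ]

print([ round(x, 2) for x in positional_encoding(3, 4) ])  # [0.14, -0.99, 0.03, 1.0]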

Combining this with Self-Attention, we can now detect if the same element appears twice consecutively:

def sa_key1(w): return w // 1000
def sa_value1(w): return w % 1000
def sa_key2(w): return w // 1000
def sa_value2(w): return w % 1000
def sa_query1(w): return w // 1000
def sa_query2(w): return (w // 1000)-1
def aa(v1,v2):
    if v1 != v2: return 0
    return 1

print(self_attn(add_positional([BOS,1,1,5,5,2,EOS]))) # [0, 0, 1, 0, 1, 0, 0]

And that's it! Now you've understood the basic mechanism of Transformer without using math.
