# Transformers Introduction

## Transformers

Transformers are models that use attention to speed up training. Other architectures also use attention, but transformers drop the recurrent and convolutional layers those architectures rely on.

## Self-attention

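Self-attention turns every word into a linear combination of all the words’ *value vectors* (rows of $V$). The weights in the linear combination come from inner products between one word’s *query vector* (a row of $Q$) and each word’s *key vector* (a row of $K$). The matrices $Q, K, V$ are computed from the input by the projection matrices $W^Q, W^K, W^V$, which are the parameters learned during training.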

- Compute three matrices from $X$, whose rows are word vectors (embeddings such as word2vec):
  - Query: $X W^Q = Q$
  - Key: $X W^K = K$
  - Value: $X W^V = V$
  - These three matrices are all $N$-by-$d_k$, where $N$ is the number of words.

- Calculate a score for each input word against every word in the sequence, itself included.
  - Take the dot product of the word’s *query vector* with the other word’s *key vector*:
    - $Q K^T$
  - Normalize by the square root of the dimension of the *key vector*:
    - $\frac{Q K^T}{\sqrt{d_k}}$
  - Apply **softmax** to the rows so each word’s scores are positive and sum to 1:
    - $softmax(\frac{Q K^T}{\sqrt{d_k}})$

- For each word, take a linear combination of the words’ values (rows of the value matrix $V$). Words with higher softmax scores receive more weight:
  - $Z = softmax(\frac{Q K^T}{\sqrt{d_k}}) V$
  - Each row of the new matrix is a weighted sum of the rows of $V$, because left-multiplying a matrix forms linear combinations of its rows.
  - The left matrix makes each output row focus on certain words’ values and not others.

The resulting matrix $Z$ holds, for each word, a weighted sum of the words’ values from $V$, where the weights are computed from inner products of the rows of $Q$ and $K$.
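The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names (`softmax`, `self_attention`), the sizes (`N`, `d_model`, `d_k`), and the random weights standing in for learned parameters are all assumptions made for the example.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Project the word vectors (rows of X) into queries, keys, and values.
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    d_k = K.shape[-1]
    # Scaled dot-product scores, then a softmax over each row.
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    # Each row of Z is a weighted sum of the rows of V.
    Z = weights @ V
    return Z, weights

# Hypothetical sizes: 4 words, 8-dimensional embeddings, d_k = 3.
rng = np.random.default_rng(0)
N, d_model, d_k = 4, 8, 3
X = rng.normal(size=(N, d_model))
# Random matrices stand in for the learned W^Q, W^K, W^V.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Z, weights = self_attention(X, W_q, W_k, W_v)
print(Z.shape)        # (4, 3)
print(weights.shape)  # (4, 4)
```

Each row of `weights` sums to 1, so row $i$ of `Z` is a convex combination of the rows of `V`.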

## Multi-headed attention

Now let’s use multiple sets of Query/Key/Value weight matrices, one for each *head*.

- Before: $(W^Q, W^K, W^V)$
- After: $\{(W_0^Q, W_0^K, W_0^V), \dots, (W_7^Q, W_7^K, W_7^V)\}$

That gives us eight different $Z$ matrices:

- $Z_0,…,Z_7$

Now concatenate all the matrices and multiply by an additional weight matrix $W^0$ of shape $8d_k \times d_W$.

- $Z_c=[Z_0,…,Z_7]$, of shape $N \times 8d_k$
- $Z = Z_cW^0$

We can summarize the calculation as follows.

$$ Z = Z_c W^0 = [Z_0,…,Z_7] W^0 $$

where

$$Z_i = softmax(\frac{Q_i K_i^T}{\sqrt{d_k}}) V_i, \qquad Q_i = X W_i^Q,\ K_i = X W_i^K,\ V_i = X W_i^V$$

The weight matrices $(W_i^Q, W_i^K, W_i^V)$ are initialized randomly and learned during training. They project input embeddings into different representation subspaces.
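As a rough sketch of the multi-head calculation, the NumPy code below runs eight heads, concatenates their outputs, and applies the output projection. The names (`attention_head`, `multi_head_attention`, `W_o` for $W^0$, `d_out` for $d_W$) and the sizes are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    # One head: Z_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(X, heads, W_o):
    # heads is a list of (W_q, W_k, W_v) tuples, one tuple per head.
    Z_heads = [attention_head(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    # Concatenate along the feature axis: shape (N, n_heads * d_k).
    Z_c = np.concatenate(Z_heads, axis=-1)
    # Mix the heads with the output projection: shape (N, d_out).
    return Z_c @ W_o

# Hypothetical sizes: 4 words, 16-dim embeddings, 8 heads with d_k = 3.
rng = np.random.default_rng(0)
N, d_model, d_k, n_heads, d_out = 4, 16, 3, 8, 5
X = rng.normal(size=(N, d_model))
# Randomly initialized per-head weights, as they would be before training.
heads = [
    tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
    for _ in range(n_heads)
]
W_o = rng.normal(size=(n_heads * d_k, d_out))

Z = multi_head_attention(X, heads, W_o)
print(Z.shape)  # (4, 5)
```

Looping over heads keeps the code close to the formulas; practical implementations typically batch all heads into a single tensor operation instead.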