Transformers Introduction


Transformers are models which use attention to speed up training. While other models use attention, transformers discard the recurrent and convolution used in other architectures.


Turn every word into a linear combination of each words’ value vector ($V$). The weights in the linear combination come inner products of word pairs’ query vector ($Q$) and key vector ($K$). These three matrices $Q, K, V$ are parameters learned during training.

  1. Compute three vectors from X, whose rows are word vectors, like word2vec:
    • Query: $X W^Q = Q$
    • Key: $X W^K = K$
    • Value: $X W^V = V$
    • These three vectors are all N-by-$d_k$, where $N$=number of words.
  2. Calculate a score for each input word with all other words.
    • Dot product of the word’s query vector with the other word’s key vector.
    • $Q K^T$
  3. Normalize by the dimension of the key vector:
    • $\frac{Q K^T}{\sqrt{d_K}}$
  4. Apply softmax to rows so each words’ scores are positive and sum to 1.
    • $softmax(\frac{Q K^T}{\sqrt{d_K}})$
  5. For each word, take a linear combination of the words (rows in the value matrix V). Words that have higher weights from the softmax will receive more weight:
    • $Z = softmax(\frac{Q K^T}{\sqrt{d_K}}) V$
    • Each row of the new matrix is a weighted sum of the rows of V. This is because left-multiplicaiton of matrices is a linear combination or rows.
    • The left matrix forces the V matrix to focus on certain words and not others.

The resulting matrix $Z$ has for each word a weight sum of the words’ values from $V$, where the weights are computed using inner products from $Q$ and $K$.

Multi-headed attention

Now let’s use mutiple sets of Query/Key/Value matrices, one for each head.

  • Before: $(W^Q, W^K, W^V)$
  • After: ${(W_0^Q, W_0^K, W_0^V),..,(W_7^Q, W_7^K, W_7^V)}$

That gives us eight different $Z$ matrics:

  • $Z_0,…,Z_7$

Now, concatenate all the matrices, and multiply by an additional weights matrix $W^0 (8d_k \times d_W )$.

  • $Z_c=[Z_0,…,Z_8] \ (N \times 8d_k))$
  • $Z = Z_cW^0$

This output We can summarize the calculation as follows.

$$ Z = Z_c W^0 = [Z_1,…,Z_7] $$


$$Z_i = softmax(\frac{Q_i K_i^T}{\sqrt{d_k}}) V_i$$

The weight matrices $(W_i^Q, W_i^K, W_i^V)$ are initialized randomly and learned during training. They project input embeddings into different representation subspaces.


Robert Hatem
Data Scientist

Data scientist interested in machine learning and natural language processing.