
Decoding the magic behind LLMs — Attention!


Figure 1. Source: Image generated using Gemini

The idea behind this blog is to break down the attention mechanism that gave LLMs like ChatGPT their superpowers into granular, easy-to-understand, practical pieces. Stick with me till the end and trust me, everything boils down to simple matrix multiplications. So, let's start with my all-time favourite quote:

Any sufficiently advanced technology is indistinguishable from magic.
- Arthur C. Clarke

Contents:

1. Introduction
2. Background
3. Why Attention
  3.1 The Proposal
  3.2 Benefits
4. Embeddings
  4.1 Intuition behind embeddings
  4.2 Representations and Similarity
5. Decoding Attention
  5.1 Query, Key, and Value
  5.2 Self-Attention Code
  5.3 Attention Head
6. Conclusion

1. Introduction

When you first interacted with Generative AI Language Models such as ChatGPT or Gemini, you must have felt a sense of awe and surprise at how good they were, and how natural the conversation felt. It was like Magic! But how can a machine understand something that took us humans years to devise?

Well, nothing is an overnight achievement. The journey started in 1950, when Alan Turing introduced the Imitation Game, also known as the Turing Test. In short: if you ask an entity a question, can you tell from the answer you receive whether that entity is a machine or a human? I'm pretty sure you've guessed where this is going :)

When ChatGPT was made public, it broke the internet and everyone was talking about AI. If your grandma or granddad asks you what ChatGPT is, then you know it marks a pivotal point in human history. This was the first time the general public was handed a functional and super useful AI model, and they loved it!

Figure 2. Popularity of AI over time, visualised using Google Trends.

But AI has been around for a long time. The first Neural Network was created back in 1957! At the risk of stating the obvious, a Neural Network is a method in AI that teaches computers to process data in a way that is inspired by the human brain.

So, why the sudden bump? What happened in the field of AI that gave birth to such powerful models that could understand language so well? The answer is: we taught the machines to pay attention, and they did!

2. Background

Traditionally, sequence-to-sequence models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) that leveraged an encoder-decoder architecture had limitations such as:

1. Vanishing Gradient Problem:

  • RNNs and LSTMs process information sequentially, one element at a time. This can lead to the vanishing gradient problem. Early information in the sequence can have a diminishing impact on the final output as the network processes later elements.
  • The gradients used to update the model’s weights during training become very small or vanish entirely for the earlier parts of the sequence. This makes it difficult for the model to learn long-range dependencies between elements, especially in very long sequences.

2. Limited Memory:

  • RNNs and LSTMs have a limited internal memory to store information about the sequence.
  • As the sequence length increases, the model struggles to retain the context of earlier elements needed to understand the later parts. This can lead to inaccurate or nonsensical outputs.

3. Computational Complexity:

  • Processing long sequences with RNNs can be computationally expensive. The number of calculations required grows with the sequence length, making training and inference slow.

3. Why Attention?

Figure 3. Source: Photo by Rod Long on Unsplash

Imagine being in a bustling cafe with a friend of yours. Conversations buzz, music plays, and the coffee grinder roars. Yet, you focus on your friend's words. That's attention: filtering information to perform a task, like focusing on what your friend is saying in a noisy environment. Just as you wouldn't drive while ignoring the road, attention is crucial for picking the relevant information out of the noise.

Now, if there were a way to teach these machines how and where to pay attention, we could solve the problem of having to process long sequences. This is exactly what the paper “Attention is All You Need” by Vaswani et al. set out to solve!

3.1. The Proposal: Transformers and Attention

The paper proposes a new architecture called the Transformer, which relies solely on an attention mechanism. Attention allows the model to focus on specific parts of the input sequence that are relevant to the current processing step. There are two types of attention introduced:

  • Self-attention: Focuses on how different parts of a single sequence relate to each other.
  • Encoder-decoder attention: Allows the decoder to attend to relevant parts of the encoded input sequence.

By using only attention mechanisms, Transformers can capture long-range dependencies within sequences more effectively than RNNs.

3.2. Benefits of Transformers

  • Parallelisation: The attention mechanism allows for parallel processing during training, which significantly improves training speed.
  • Accuracy: The paper demonstrates that Transformers achieve state-of-the-art performance on machine translation tasks compared to previous models.

Now, before understanding how attention works, we first need to understand how a machine encodes language into something it can work with.

4. Embeddings

Machines do not understand language; what they understand are numbers. So, how do we go about converting language into numbers?

Let’s see if you can solve the below riddle:

What is round, red, and tastes sweet when you eat it?
Answer: Cherry

Embeddings are like riddles: the machine tries to associate the meaning of a word across various attributes. In the above example, let's say the machine has 3 dimensions, where the x-axis would be the shape, the y-axis would be the colour, and the z-axis would be the taste. The machine can learn to associate these attributes with the word “cherry”.
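As a toy illustration, a word could then be stored as a small vector of such attribute scores. The values below are made up purely for demonstration; a real model learns them from data.

import numpy as np

# Hypothetical 3-dimensional embedding space: [shape, colour, taste]
# (the values are illustrative, not from any trained model)
cherry = np.array([0.9,   # roundness
                   0.8,   # redness
                   0.7])  # sweetness
print(cherry.shape)  # (3,) -> a 3-dimensional vector representation of "cherry"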

4.1. Intuition behind embeddings

Figure 4. Source: Photo by Andres Perez on Unsplash

Here's a better parallel: think of colours. A large part of the colours perceived by humans can be made from Red, Green, and Blue. So, you can create almost any colour using different proportions of Red, Green, and Blue. Imagine these colours as the 3 dimensions in your embedding space. Each dimension represents a specific aspect that contributes to the overall colour we perceive. You can play around with this tool and see how varying the values for RGB can represent each colour.

The machine doesn't see colour the way we humans do; it just represents it differently. Just as these colours can be encoded using 3 dimensions (RGB), we can also encode words using n dimensions, where n would be a very large number, with each dimension associated with some quality or attribute of that word.

In the above example, we saw “cherry” being associated with just 3 attributes, but if we had to encode the entire dictionary available to us, we would need far more than 3 dimensions to capture the subtler variations in meaning.

4.2. Representations and Similarity

King is to Queen as Dad is to ___?

You must have encountered this kind of analogy or semantic-similarity question at some point in your life. The simple way to solve it is to find the relationship between the first two entities and apply that relationship to the third one to get its pair. It is obvious that there must be a gender-related attribute that, when added to Dad, would give the corresponding pair, which would be Mom.

Figure 5. Vector Representations for some gender specific entities. (Source: Image created by author)

Mathematically, if we subtract the vector representation of King([0.9, 0.5, 0.1]) from Queen([0.2, 0.5, 0.1]) and add the result to Dad([0.9, 0.7, 0.9]), then we get Mom([0.2, 0.7, 0.9]).

E(Queen) - E(King) = E(Mom) - E(Dad)
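Here is a minimal sketch that checks this analogy with the toy vectors from Figure 5:

import numpy as np

# Toy vector representations from Figure 5
E = {
    "King":  np.array([0.9, 0.5, 0.1]),
    "Queen": np.array([0.2, 0.5, 0.1]),
    "Dad":   np.array([0.9, 0.7, 0.9]),
    "Mom":   np.array([0.2, 0.7, 0.9]),
}

# Apply the Queen-King relationship to Dad
result = E["Dad"] + E["Queen"] - E["King"]
print(result)                          # [0.2 0.7 0.9]
print(np.allclose(result, E["Mom"]))   # True -> the analogy holds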

Here’s another important aspect: Similarity.

How do you determine whether one word has a similar meaning to another, and to what extent? To judge similarity, you would look at certain characteristics or attributes of the entities and assess how close these attributes are. Let's say you like Coldplay, John Mayer, and Ed Sheeran; then you have a very high similarity with me in terms of music taste, my friend!

Let’s call these collections of attributes that represent a word, a Vector Representation. One of the best ways to get similarity between two vectors is Cosine Similarity. In the above example, the similarity between Mom and Dad would be 0.88, whereas for Mom and King, it would be 0.52.

Figure 6. Cosine Similarity. Source: Wikipedia
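Here is a quick sketch of cosine similarity applied to the same toy vectors, which reproduces these numbers:

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

mom  = np.array([0.2, 0.7, 0.9])
dad  = np.array([0.9, 0.7, 0.9])
king = np.array([0.9, 0.5, 0.1])

print(round(cosine_similarity(mom, dad), 2))   # 0.88
print(round(cosine_similarity(mom, king), 2))  # 0.52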

Now that we have understood how to represent a word, modify a word based on semantics, and find similarity between any two words, let’s understand how we can teach a machine to pay attention to words in a document.

5. Decoding Attention

Let’s say we have the below sentence:
Terry loves to eat Mango Greek Yogurt

Now if I ask you to describe the Yogurt in the above context, you will say that it is Mango-flavoured Greek Yogurt. The words Mango and Greek are the most relevant for understanding what kind of Yogurt is being described here. In short, you would pay attention to the adjectives of the word Yogurt.

The attention mechanism is a part of the transformer architecture that was proposed in the “Attention is All You Need” research paper. The scope of this blog is limited to the attention mechanism within the transformer.

This mechanism focuses on specific parts of the input data (words in a sentence) and assigns them varying degrees of importance. It doesn’t define a complete set of instructions, but rather a specific way of processing information.

In a nutshell, we are creating a new representation for each element that takes into account the context provided by other elements in the sequence. And that's self-attention!

As we saw in the above section, we create word embeddings to encode the meaning of a word for the machine. These words in isolation do not mean much, except that similar words appear closer together in the n-dimensional vector space, i.e. have a small variance in their embeddings. For example, all the fruits would sit much closer to each other in the vector space.

5.1. Query, Key, and Value

In the attention mechanism, Query (Q), Key (K), and Value (V) vectors play crucial roles in determining how much information each element in a sequence contributes to the final output. Here’s a breakdown:

  • Query (Q): Represents the “current focus” or question at hand. It's like a search term used to find relevant information. In the above example, a query could be something like: what are the adjectives of the given word?
  • Key (K): Acts like an index or identifier for each element in the sequence. It helps determine how well each element “matches” the query. Continuing with the above example, the keys could be like identifiers of the adjectives preceding the noun.
  • Value (V): Holds the actual information associated with each element in the sequence. It's like the content you want to retrieve based on the query-key match. Think of this as the information you would add to the original embedding to make it match the description. In the above example, it would be adding the “Mango” and “Greek” characteristics to the word “Yogurt”.

Mathematically speaking, the Query and Key vectors have a much smaller dimension than the embedding vectors.

  • We have trainable matrices Wq and Wk for the Query and Key vectors respectively. The intent of these trainable weight matrices is to encode the input embedding into lower-dimensional query and key vectors.
Figure 7. Query and Key Matrices. (Source: Image created by author)
  • Since every word is encoded into a word embedding, let's call the embedding vector of the word “Yogurt” E(Yogurt). This E(Yogurt) would be an n-dimensional vector (where n would be a very large number), where each dimension represents some attribute associated with this word.
Figure 8. Input Embeddings. (Source: Image created by author)
  • Once we multiply the matrix Wq with E(Yogurt), we get a query vector Q(Yogurt). Now remember, this Q(Yogurt) query vector has a smaller dimension (let's say m-dimensional, where m << n) compared to E(Yogurt), because it represents a query associated with the word Yogurt.
Figure 9. Query Vector Creation. (Source: Image created by author)
  • For example, you can think of this query matrix mapping the noun Yogurt to a smaller dimension in the query space, which would somehow encode the notion of looking for adjectives in the preceding positions. (For representation purposes, only one query vector is shown in Figure 10; in reality, a query vector would be created for every word.)
Figure 10. Query Vector for word Yogurt. (Source: Image created by author)
  • Similarly, we have the Wk matrix, which reduces every word vector in the document to a smaller dimension, m (the same as the query dimension).
Figure 11. Key Vector Creation. (Source: Image created by author)
  • Continuing with the above example, you can think of the key matrix mapping every word in the sentence to a smaller dimension in the key space, which would somehow encode the notion of those words being adjectives of the words in succeeding positions. (With respect to the above query vector, every key vector is shown in Figure 12.)
Figure 12. Key Vector for all words. (Source: Image created by author)
  • Once we have the query and key vectors for all the words in the sentence, we can compute a similarity score between every query and key vector to see how similar they are, by calculating the scaled dot product between each query and key vector.
Figure 13. Scaled Dot Product between Query and Key Vector Matrices. (Source: Image created by author)
  • In the above example, the key vectors of the words “Mango” and “Greek” would have a very high similarity (scaled dot product) with the query vector for the word “Yogurt”, if the query vector encodes “Yogurt” looking for adjectives in the preceding positions. This means that the embeddings of “Mango” and “Greek” attend to the embedding of “Yogurt”.
Figure 14. High Similarity between Query and Key Vector. (Source: Image created by author)
  • Since this dot product can lie anywhere between -inf and inf, we would ideally want these values to be between 0 and 1, representing some kind of probability distribution. To achieve this, we apply the softmax function to each column of the dot product of the query and key vectors. This scales the similarity scores between each query and all the keys to values between 0 and 1.
Figure 15. Applying Softmax to Scaled Dot Product Matrix. (Source: Image created by author)
  • Now that we have the attention weight matrix, depicting how much attention must be paid to each word with respect to every other word, it's time to modify the original word embeddings based on these attention weights.
Figure 16. Attention Weighted Matrix. (Source: Image created by author)
  • We can achieve this by using another matrix, Wv, which represents the Value associated with each word vector in the n-dimensional vector space. In the above example, we would want the embeddings of “Mango” and “Greek” to cause a change in the embedding of “Yogurt”. When we multiply this Value matrix, Wv, with the embeddings of all the words, we get a value vector for each word.
Figure 17. Value Vector Creation. (Source: Image created by author)
  • Continuing with the above example, we saw that the words “Mango” and “Greek” have the maximum attention, or weight, towards the word “Yogurt”. So, the value vectors of “Mango” and “Greek” would have the maximum influence on the word “Yogurt” when added to its embedding. This means that the original embedding of the word “Yogurt” would be modified as per its preceding adjectives.
Figure 18. Attention Weighted Value Matrix. (Source: Image created by author)
  • It is important to note that the value vector is also a smaller (m-dimensional) vector. This means that we can't directly add these attention-weighted values to the original embeddings. To do that, we perform a linear transformation to scale the m-dimensional value vector matrix back up to an n-dimensional one.
Figure 19. Transformed Attention Weighted Value Vectors. (Source: Image created by author)
  • This results in a more refined vector, which encodes the contextually rich meaning. In the above example we calculated the refined embedding based on attention for just one word, “Yogurt”, for demonstration purposes. When applying attention, we perform these steps to get refined embeddings for every single word in the sentence/document.
Figure 20. Refined Input Embeddings using Attention. (Source: Image created by author)

The mechanism where we calculate the attention to be paid to one word/token from every other word/token in the same document is known as self-attention.
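All of the above steps can be summarised by the scaled dot-product attention formula from the paper, where Q, K, and V are the query, key, and value matrices and d_k is the dimension of the key vectors:

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V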

5.2. Self-Attention Code

Below is some sample code that will help you understand the above attention mechanism even better. Play around with the input and output dimensions, and try to see how the shape and values change at every step!

import numpy as np
from scipy.special import softmax

def self_attention(input_data):
    """
    This function implements a simplified version of self-attention.

    Args:
        input_data: A numpy array of shape (sequence_length, embedding_dim).

    Returns:
        outputs: A numpy array of the same shape as the input, containing the
            attention-weighted representations.
    """
    input_dimension = input_data.shape[1]
    output_dimension = 16

    # Project the input into query, key, and value vectors using linear transformations.
    # These matrices are trainable in a real model, but are kept static here for illustration.
    WQ = np.random.rand(input_dimension, output_dimension)  # Query weight matrix
    WK = np.random.rand(input_dimension, output_dimension)  # Key weight matrix
    WV = np.random.rand(input_dimension, output_dimension)  # Value weight matrix
    queries = np.matmul(input_data, WQ)
    keys = np.matmul(input_data, WK)
    values = np.matmul(input_data, WV)
    print(f"Shape of WQ: {WQ.shape}")
    print(f"Shape of WK: {WK.shape}")
    print(f"Shape of WV: {WV.shape}")

    # Calculate attention scores using the scaled dot product
    d_k = keys.shape[-1]  # dimension of the key vectors
    scores = np.matmul(queries, keys.transpose()) / np.sqrt(d_k)

    # Apply softmax to normalise the attention scores (each row sums to 1)
    attention_weights = softmax(scores, axis=1)
    print(f"Shape of Attention Scores: {attention_weights.shape}")

    # Weight the values based on the attention scores
    attention_values = np.matmul(attention_weights, values)
    print(f"Shape of Attention Values: {attention_values.shape}")

    # Perform a linear transformation to scale the value vectors back up to the embedding dimension
    WV_linear = np.random.rand(output_dimension, input_dimension)  # Linear transform matrix
    attention_transform_values = np.matmul(attention_values, WV_linear)
    print(f"Shape of Transformed Attention Values: {attention_transform_values.shape}")

    # Add the input to the weighted attention_transform_values (residual connection)
    outputs = attention_transform_values + input_data
    return outputs

# Example usage
input_sentence = "Terry loves to eat Mango Greek Yogurt"
input_data = np.random.rand(len(input_sentence.split(" ")), 128)  # sequence length = 7 words, embedding dim = 128
outputs = self_attention(input_data)

print(f"Shape of inputs: {input_data.shape}")
print(f"Shape of outputs: {outputs.shape}")

5.3. Attention Head

This is known as a single head of attention. In reality, you would have multiple blocks of single-head attention, known as multi-headed attention. Having multiple heads holds the benefit of training the model to learn many different linguistic phenomena, where each head learns to attend to a different aspect of the relationships between elements (a minimal code sketch follows the comparison below). Imagine you're analysing a movie script.

  • Single-headed attention: This is like focusing on a single aspect, such as the dialogue between characters. You might miss other important details like scene descriptions or character actions.
  • Multi-headed attention: Now, this is like having multiple analysts, each focusing on a different aspect: dialogue, scene descriptions, character actions, etc. This provides a more comprehensive understanding of the script.
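Here is a minimal, simplified sketch of multi-headed attention in code, following the structure of the self_attention function above. The weights are random placeholders for what a real model would learn, and details such as masking and layer normalisation are deliberately left out.

import numpy as np
from scipy.special import softmax

def single_head(input_data, head_dim):
    # One head of scaled dot-product self-attention (random weights stand in for trained ones)
    d_model = input_data.shape[1]
    WQ = np.random.rand(d_model, head_dim)
    WK = np.random.rand(d_model, head_dim)
    WV = np.random.rand(d_model, head_dim)
    Q, K, V = input_data @ WQ, input_data @ WK, input_data @ WV
    scores = Q @ K.T / np.sqrt(head_dim)
    weights = softmax(scores, axis=1)
    return weights @ V  # shape: (sequence_length, head_dim)

def multi_head_attention(input_data, num_heads=4, head_dim=16):
    d_model = input_data.shape[1]
    # Each "analyst" (head) attends to the sequence independently
    head_outputs = [single_head(input_data, head_dim) for _ in range(num_heads)]
    # Concatenate all the heads and project back to the embedding dimension
    concatenated = np.concatenate(head_outputs, axis=1)  # (sequence_length, num_heads * head_dim)
    WO = np.random.rand(num_heads * head_dim, d_model)   # output projection matrix
    return concatenated @ WO                             # (sequence_length, d_model)

# Example usage: 7 words, 128-dimensional embeddings
input_data = np.random.rand(7, 128)
print(multi_head_attention(input_data).shape)  # (7, 128)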

6. Conclusion

This sums up the attention mechanism, the major innovation that led to the development of LLMs! Although attention is just one part of the transformer architecture, it is important to know how it works.

Well, I hope this tutorial was useful for you. If you want to stay updated with more such articles, do show some love by leaving as many claps as you can, following my page, and subscribing to my feed :)

Find me here: https://bento.me/harjotpahwa

References:

  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010.
  • Attention in transformers, visually explained | Chapter 6, Deep Learning | 3Blue1Brown
  • Attention for Neural Networks, Clearly Explained!!! | StatQuest by Josh Starmer
