Everything About the Attention Family
These days in deep learning, it is common to hear about the outstanding performance of transformers on challenges where other algorithms fall short of our expectations, and most transformers are built on attention. This article gives you a detailed illustration of the code and mathematics behind the four most widely used types of attention in the deep learning era.
The main feature of attention is that it is not limited to locality the way CNNs are; however, we will see that in some cases we still need the model to take locality into account during training.
1. Full Attention
Full attention was introduced in 2017 and implemented in a simple encoder-decoder transformer in the popular paper “Attention Is All You Need”.
The structure is not complicated, so it is not difficult to understand. I have also added Python code to help you build a better understanding.
Figure 1 (left) shows the mechanism of Scaled Dot-Product Attention schematically. When we run several of these attention heads in parallel, we call it Multi-Head Attention (so obvious 😅). Let’s dive into the mathematics behind it; the main formula of attention is Eq. 1.
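Since the original rendering of the equation is not reproduced in this text, here it is as given in “Attention Is All You Need”:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V   (Eq. 1)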
Here Q (queries), K (keys), and V (values) are the inputs, and the division by √dₖ (where dₖ is the key dimension) keeps the dot products from growing so large that the softmax saturates into regions with tiny gradients. This formula is the key building block of this era of deep learning, so let’s look at its code to figure it out.
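The gist that originally accompanied this section is not reproduced here, so below is a minimal NumPy sketch of Eq. 1; the function name and the toy shapes are my own choices for illustration, not a library API:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Eq. 1: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)      # similarity of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # numerically stable row-wise softmax
    return weights @ V, weights

# toy shapes: batch of 2 sequences, 5 tokens, width 16
Q = np.random.randn(2, 5, 16)
K = np.random.randn(2, 5, 16)
V = np.random.randn(2, 5, 16)
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)                            # (2, 5, 16) (2, 5, 5)
```

Multi-Head Attention simply runs several copies of this function on learned linear projections of Q, K, and V and concatenates the results.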
I have also done a project in which I proposed a Transformer encoder based on attention, built with the TensorFlow library, for forecasting time series. You can find it on my GitHub (github.com/yazdanfarreza/PEM_FuelCell-Performance).
2. ProbSparse Attention
With the help of “Transformer Dissection: A Unified Understanding of Transformer’s Attention via the Lens of Kernel”, we can rewrite Eq. 1 as Eq. 2, in which the i-th query’s attention is described as a kernel smoother in probability form:
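The original rendering of Eq. 2 is not included in this text; based on the Informer paper, it has roughly this form:

A(qᵢ, K, V) = Σⱼ p(kⱼ | qᵢ) vⱼ,  where p(kⱼ | qᵢ) = k(qᵢ, kⱼ) / Σₗ k(qᵢ, kₗ) and k(qᵢ, kⱼ) = exp(qᵢkⱼᵀ / √d)   (Eq. 2)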
From Eq 2, we can define the i-th query’s sparsity measurement as below:
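Again following the Informer paper, the measurement is the gap between the log-sum-exp of a query’s scores and their arithmetic mean, so a query whose distribution over keys is far from uniform scores high:

M(qᵢ, K) = ln Σⱼ exp(qᵢkⱼᵀ / √d) − (1/Lₖ) Σⱼ qᵢkⱼᵀ / √d   (Eq. 3)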
And finally, the formula of the whole attention block is Eq. 4.
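In the paper this reads roughly as follows, where Q̄ is a sparse matrix that keeps only the top-u queries ranked by M, and u grows only logarithmically with the sequence length:

A(Q, K, V) = softmax(Q̄Kᵀ / √d) V   (Eq. 4)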
Also, the Python code:
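The original gist is not shown in this text; what follows is my own simplified, single-head NumPy sketch of the idea rather than the Informer implementation (in particular, the real code estimates M from a random sample of keys, while this sketch computes it exactly):

```python
import numpy as np

def probsparse_attention(Q, K, V, factor=5):
    # Keep only the top-u queries under the measurement M (Eq. 3) and give every
    # other ("lazy") query the mean of V as its output.
    L_Q, d = Q.shape
    u = min(L_Q, int(factor * np.ceil(np.log(L_Q))))       # number of active queries: O(log L)

    scores = Q @ K.T / np.sqrt(d)                           # (L_Q, L_K)
    row_max = scores.max(axis=1)
    # M(q_i, K) = logsumexp_j(scores_ij) - mean_j(scores_ij)
    M = row_max + np.log(np.exp(scores - row_max[:, None]).sum(axis=1)) - scores.mean(axis=1)
    top = np.argsort(M)[-u:]                                # indices of the top-u queries

    out = np.repeat(V.mean(axis=0, keepdims=True), L_Q, axis=0)   # lazy queries -> mean(V)
    active = scores[top]
    w = np.exp(active - active.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                       # softmax over keys
    out[top] = w @ V                                        # full attention only for active queries
    return out

Q = np.random.randn(64, 32); K = np.random.randn(64, 32); V = np.random.randn(64, 32)
print(probsparse_attention(Q, K, V).shape)                  # (64, 32)
```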
Also, if you are interested in this type of attention, why don’t you read my article about its main algorithm, Informer?
3. LogSparse Attention
The full attention we discussed earlier has two drawbacks: 1. it is locality-agnostic (interesting, isn’t it?) and 2. it has a memory bottleneck. To cope with these two issues, researchers used convolutional operators and the LogSparse Transformer.
Instead of a convolution with kernel size 1 and stride 1 (plain matrix multiplication), they employed causal convolution (to make sure the model does not have access to future points) with kernel size k and stride 1 to convert the inputs into queries and keys.
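To make the idea concrete, here is a small NumPy sketch of a causal 1-D convolution producing queries and keys; the function name, shapes, and random weights are purely illustrative, not the paper’s implementation:

```python
import numpy as np

def causal_conv1d(x, W):
    # Causal 1-D convolution: the output at time t only sees x[t-k+1 .. t].
    # x: (L, d_in) input sequence, W: (k, d_in, d_out) kernel.
    k, d_in, d_out = W.shape
    L = x.shape[0]
    x_pad = np.concatenate([np.zeros((k - 1, d_in)), x], axis=0)   # pad only on the left
    out = np.zeros((L, d_out))
    for tap in range(k):        # tap = k-1 touches the current step, smaller taps look back
        out += x_pad[tap:tap + L] @ W[tap]
    return out

L, d_model, k = 48, 16, 3
x = np.random.randn(L, d_model)
W_q = np.random.randn(k, d_model, d_model)
W_k = np.random.randn(k, d_model, d_model)
queries = causal_conv1d(x, W_q)   # with k > 1, each query/key carries local context
keys    = causal_conv1d(x, W_k)   # with k = 1, this reduces to the usual pointwise projection
print(queries.shape, keys.shape)  # (48, 16) (48, 16)
```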
You can find its source code here.
4. LSH Attention
Researchers introduced the Reformer as an efficient transformer by implementing two techniques in its main paper: 1. replacing the dot-product attention (Section 1) with a locality-sensitive hashing attention block, and 2. using reversible residual layers rather than the standard ones.
We begin with two tensors, Q = K and V (instead of the three separate tensors of full attention in Section 1), each of shape [batch_size, length, dₘₒ𝒹ₑₗ]. The attention calculation follows the original formula (Eq. 1), but the softmax is dominated by its largest elements and the rest contribute very little. So, for each query qᵢ, we only need to focus on the keys in K that are closest to qᵢ.
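A quick toy calculation (the score values are made up) shows how strongly the softmax concentrates on the largest dot products:

```python
import numpy as np

scores = np.array([5.0, 4.5, 1.0, 0.2, -1.3])   # hypothetical q_i . k_j values for five keys
weights = np.exp(scores - scores.max())
weights /= weights.sum()
print(weights.round(3))                          # [0.612 0.371 0.011 0.005 0.001]
```

The two nearest keys already account for almost all of the attention weight, which is why restricting each query to its nearest keys loses very little.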
Locality-sensitive hashing
We use locality-sensitive hashing (LSH) to quickly find nearest neighbours in high-dimensional spaces, where this search is otherwise difficult. For our purpose, we need a scheme in which nearby vectors get the same hash with high probability and in which hash buckets end up of similar size with high probability.
Here, random projections are deployed, as shown in Figure 4.
To get b hashes, we fix a random matrix R of size [dₖ, b/2]. Then we define h(x) = argmax([xR; −xR]), where [u; v] denotes the concatenation of two vectors. This is a known LSH scheme.
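Here is a tiny NumPy sketch of this hashing step (the function name and the toy sizes are mine, not the Reformer code):

```python
import numpy as np

def lsh_hash(x, R):
    # h(x) = argmax([xR ; -xR]): an angular LSH, so only the direction of x matters
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

d_k, b = 16, 8
R = np.random.randn(d_k, b // 2)        # fixed random projection
x = np.random.randn(4, d_k)
print(lsh_hash(x, R))                   # one bucket id in [0, b) per vector
print(lsh_hash(2.0 * x, R))             # rescaling a vector leaves its bucket unchanged
```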
Now we have to formalize LSH attention. We rewrite Eq. 1 for a single query position i at a time:
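The original rendering of this equation is not included in this text; based on the Reformer paper, the per-query form is roughly:

oᵢ = Σ_{j ∈ Pᵢ} exp(qᵢ·kⱼ − z(i, Pᵢ)) vⱼ,  where Pᵢ = { j : h(qᵢ) = h(kⱼ) }

Here Pᵢ is the set of positions the i-th query attends to (the keys that fall into its hash bucket), and z(i, Pᵢ) is the softmax normalizing term.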
We can illustrate the LSH attention schematically compared with full attention in Figure 5.
a) The attention matrix of full attention is usually sparse, but the computation does not take advantage of that sparsity.
b) Keys and queries are sorted according to their hash bucket.
c) In the sorted attention matrix, pairs from the same bucket cluster near the diagonal.
d) Following a batching approach, chunks of m consecutive queries attend to themselves and to one chunk back (see the sketch after this list).
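As a rough illustration of steps b) to d), here is a toy NumPy sketch of the sort-and-chunk idea; it deliberately omits the bucket masking, causal masking, and multi-round hashing of the real Reformer, so treat it as a picture of the data flow, not the algorithm itself:

```python
import numpy as np

def lsh_hash(x, R):
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

def chunked_lsh_attention(qk, v, R, chunk_size=8):
    # Sort positions by bucket, cut the sorted sequence into chunks, and let each
    # chunk attend to itself plus one chunk back (Figure 5d).
    L, d = qk.shape
    order = np.argsort(lsh_hash(qk, R), kind="stable")   # positions sorted by bucket id
    qk_s, v_s = qk[order], v[order]
    out_s = np.zeros_like(v_s)
    for lo in range(0, L, chunk_size):
        hi = min(lo + chunk_size, L)
        ctx_lo = max(lo - chunk_size, 0)                 # this chunk plus the previous one
        scores = qk_s[lo:hi] @ qk_s[ctx_lo:hi].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        out_s[lo:hi] = w @ v_s[ctx_lo:hi]
    out = np.empty_like(out_s)
    out[order] = out_s                                   # undo the sort
    return out

d_k, b, L = 16, 8, 40
R = np.random.randn(d_k, b // 2)
qk, v = np.random.randn(L, d_k), np.random.randn(L, d_k)
print(chunked_lsh_attention(qk, v, R).shape)             # (40, 16)
```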
The full implementation of this type of attention is available at github.com/google/trax/blob/master/trax/models/research/configurable_transformer.py; I did not include it here to keep this article from getting too long.
Main References:
1. Kitaev, N., Kaiser, Ł., and Levskaya, A., “Reformer: The Efficient Transformer,” arXiv preprint arXiv:2001.04451, 2020.
2. Li, S., et al., “Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting,” Advances in Neural Information Processing Systems, 2019, 32.
3. Zhou, H., et al., “Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting,” Proceedings of AAAI, 2021.
4. Vaswani, A., et al., “Attention Is All You Need,” Advances in Neural Information Processing Systems, 2017, 30.