Everything About the Attention Family

These days in deep learning, it is common to hear about transformers delivering outstanding performance on challenges where other algorithms fall short, and most of that performance rests on attention. This article gives you a detailed illustration of the code and mathematics behind four of the most widely used types of attention in deep learning.

Reza Yazdanfar
Mar 28, 2022

The main feature of attention is that, unlike CNNs, it is not limited to local interactions. However, as we will see, in some cases we still need the model to take locality into account during training.


1. Full Attention

Full attention was introduced in 2017 in the popular paper “Attention Is All You Need”, where it was implemented in a simple encoder-decoder Transformer.

The structure is not complicated, so it is not hard to understand. I have also added Python code to help you build a better understanding.

Figure 1. (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel

Figure 1 (left) shows the mechanism of Scaled Dot-Product Attention schematically. When we run several of these attention heads in parallel, we call it Multi-Head Attention (so obvious😅). Let’s dive into the mathematics behind it; the main formula of attention is Eq 1.

Eq 1. Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Here Q (queries), K (keys), and V (values) are the inputs, and dₖ is the dimension of the keys; dividing by √dₖ scales the dot products so that the softmax does not saturate when dₖ is large. This formula is the key building block of the deep learning era, so let’s have a look at its code to figure it out.
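Since the original code snippet is not embedded here, below is a minimal NumPy sketch of Eq 1 together with a naive multi-head wrapper around it. The shapes, the `num_heads` split, and the helper names are my own illustrative choices, not the exact code from the paper or from my repository.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Eq 1: softmax(Q K^T / sqrt(d_k)) V.

    Q: [..., L_q, d_k], K: [..., L_k, d_k], V: [..., L_k, d_v]
    """
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)  # [..., L_q, L_k]
    weights = softmax(scores, axis=-1)                  # attention distribution
    return weights @ V                                  # [..., L_q, d_v]

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Several scaled dot-product heads run in parallel, then concatenated."""
    L, d_model = X.shape
    d_head = d_model // num_heads
    # project and split into heads: [num_heads, L, d_head]
    split = lambda Z: Z.reshape(L, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    heads = scaled_dot_product_attention(Q, K, V)       # [num_heads, L, d_head]
    concat = heads.transpose(1, 0, 2).reshape(L, d_model)
    return concat @ W_o

# toy usage
L, d_model, h = 6, 16, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (6, 16)
```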

I have also done a project in which I proposed a TransformerEncoder based on attention in the TensorFlow library for forecasting time series. You can find it on my GitHub (github.com/yazdanfarreza/PEM_FuelCell-Performance).

2. ProbSparse Attention

With the help of “Transformer Dissection: A Unified Understanding of Transformer’s Attention via the Lens of Kernel”, we can rewrite Eq 1 as Eq 2, where the i-th query’s attention is described as a kernel smoother in probability form:

Eq 2. A(qᵢ, K, V) = Σⱼ [ k(qᵢ, kⱼ) / Σₗ k(qᵢ, kₗ) ] vⱼ = E_{p(kⱼ|qᵢ)}[vⱼ], where k(qᵢ, kⱼ) = exp(qᵢkⱼᵀ / √d)

From Eq 2, we can define the i-th query’s sparsity measurement as below:

Eq 3. M(qᵢ, K) = ln Σⱼ exp(qᵢkⱼᵀ / √d) − (1/Lₖ) Σⱼ qᵢkⱼᵀ / √d, where Lₖ is the number of keys

Finally, the formula of the ProbSparse attention block is Eq 4, where Q̄ is a sparse matrix containing only the top-u queries ranked by the measurement in Eq 3; Eq 5 is the cheaper max-mean approximation of that measurement used in practice.

Eq 4. A(Q, K, V) = Softmax(Q̄Kᵀ / √d) V
Eq 5. M̄(qᵢ, K) = maxⱼ{ qᵢkⱼᵀ / √d } − (1/Lₖ) Σⱼ qᵢkⱼᵀ / √d

Also, the Python code:
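The original snippet is not embedded here, so what follows is only a minimal NumPy sketch of the ideas in Eq 3–5 (the max-mean measurement M̄ and the selection of the top-u queries), assuming a single head and no key sampling. It is not the official Informer implementation, and the variable names are my own simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def probsparse_attention(Q, K, V, u):
    """Sketch of ProbSparse attention (Eq 4-5), single head, no key sampling.

    Q: [L_q, d], K: [L_k, d], V: [L_k, d_v]; u: number of "active" queries kept.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # [L_q, L_k]
    # Eq 5: max-mean sparsity measurement for each query
    M = scores.max(axis=-1) - scores.mean(axis=-1)   # [L_q]
    top_u = np.argsort(M)[-u:]                       # indices of the top-u queries
    # "lazy" queries get the mean of V (the uniform-attention result)
    out = np.tile(V.mean(axis=0), (Q.shape[0], 1))   # [L_q, d_v]
    # "active" queries get full scaled dot-product attention (Eq 4)
    out[top_u] = softmax(scores[top_u], axis=-1) @ V
    return out

# toy usage
rng = np.random.default_rng(1)
L, d = 8, 16
Q, K, V = rng.normal(size=(L, d)), rng.normal(size=(L, d)), rng.normal(size=(L, d))
print(probsparse_attention(Q, K, V, u=3).shape)      # (8, 16)
```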

Also, if you are interested in this type of attention, why not read my article about its main algorithm, Informer?

3. LogSparse Attention

The full attention we discussed earlier has two drawbacks: 1. it is locality-agnostic (interesting, isn’t it?) and 2. it has a memory bottleneck. To cope with these two issues, researchers used convolutional operators and the LogSparse Transformer.

Figure 2. Illustration of different attention mechanisms between adjacent layers in the Transformer. [source]
Figure 3. Convolutional self-attention (right) uses convolutional layers of kernel size k with stride 1 to transform inputs (with proper padding) into queries/keys. Such locality awareness can correctly match the most relevant features based on shape matching (left). [source]

Instead of a convolution of kernel size 1 with stride 1 (which is just a matrix multiplication), they employed causal convolution (to make sure the model does not have access to future points) of kernel size k with stride 1 to convert the inputs into queries and keys.

Eq 6.

You can find its source code here.
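The official code is linked above; as a rough illustration of the causal-convolution step, here is a minimal NumPy sketch that produces queries and keys with a causal convolution of kernel size k (left-padding so that position t only sees inputs up to t). The padding scheme, shapes, and names are my own assumptions, not the LogSparse authors’ code.

```python
import numpy as np

def causal_conv1d(X, W):
    """Causal 1D convolution: output at time t depends only on X[t-k+1 .. t].

    X: [L, d_in] input sequence, W: [k, d_in, d_out] convolution kernel.
    """
    k = W.shape[0]
    # pad k-1 zero steps on the left so no future information leaks in
    X_pad = np.concatenate([np.zeros((k - 1, X.shape[1])), X], axis=0)
    L, d_out = X.shape[0], W.shape[2]
    out = np.zeros((L, d_out))
    for t in range(L):
        window = X_pad[t:t + k]                       # [k, d_in], ends at time t
        out[t] = np.einsum('ki,kio->o', window, W)
    return out

# queries/keys from causal convolutions (kernel size k), values from a 1x1 "conv"
rng = np.random.default_rng(2)
L, d_model, k = 10, 16, 3
X = rng.normal(size=(L, d_model))
W_q = rng.normal(size=(k, d_model, d_model))
W_k = rng.normal(size=(k, d_model, d_model))
Q, K = causal_conv1d(X, W_q), causal_conv1d(X, W_k)   # locality-aware queries/keys
V = X @ rng.normal(size=(d_model, d_model))           # kernel size 1, stride 1
print(Q.shape, K.shape, V.shape)                      # (10, 16) (10, 16) (10, 16)
```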

4. LSH Attention

Researchers introduced Reformer as an efficient Transformer by implementing two techniques: 1. replacing the dot-product attention (section 1) with a locality-sensitive hashing attention block and 2. using reversible residual layers rather than the standard ones.

We begin with two tensors, Q = K and V (instead of the three separate tensors of the full attention in section 1), each with shape [batch_size, length, dₘₒ𝒹ₑₗ]. The attention calculation is the same as the original formula (Eq 1), but since the softmax is dominated by the largest elements of QKᵀ, for each query qᵢ we only need to focus on the keys in K that are closest to qᵢ.

Locality sensitive hashing

We use locality-sensitive hashing (LSH) to find nearest neighbors quickly in high-dimensional spaces, which is otherwise a hard problem. In this case, we need nearby vectors to get the same hash with high probability, and we need hash buckets to be of similar size with high probability.

Here, random projections are deployed, as shown in Figure 4.

Figure 4. An angular locality sensitive hash uses random rotations of spherically projected points to establish buckets by an argmax over signed axes projections. In this highly simplified 2D depiction, two points x and y are unlikely to share the same hash buckets (above) for the three different angular hashes unless their spherical projections are close to one another (below) [source]

To get b hashes, we fix a random matrix R of size [dₖ, b/2]. Then we define h(x) = argmax([xR; −xR]), where [u; v] denotes the concatenation of two vectors. This is a known LSH scheme.
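As a quick illustration, here is a small NumPy sketch of this angular LSH hash, h(x) = argmax([xR; −xR]); the number of buckets and the vector dimensions are arbitrary choices for the example.

```python
import numpy as np

def lsh_hash(x, R):
    """Angular LSH: h(x) = argmax([xR ; -xR]).

    x: [n, d_k] vectors to hash, R: [d_k, b/2] random projection matrix.
    Returns a bucket id in {0, ..., b-1} for each vector.
    """
    xR = x @ R                                        # [n, b/2]
    projections = np.concatenate([xR, -xR], axis=-1)  # [n, b]
    return projections.argmax(axis=-1)                # bucket ids

# toy usage: nearby vectors tend to land in the same bucket
rng = np.random.default_rng(3)
d_k, n_buckets = 16, 8
R = rng.normal(size=(d_k, n_buckets // 2))
x = rng.normal(size=(1, d_k))
x_near = x + 0.01 * rng.normal(size=(1, d_k))         # a small perturbation of x
print(lsh_hash(x, R), lsh_hash(x_near, R))            # usually the same bucket
```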

Now we have to formalize LSH attention. We rewrite Eq 1 for a single query position i at a time:

Eq 7. oᵢ = Σ_{j∈Pᵢ} exp(qᵢ·kⱼ − z(i, Pᵢ)) vⱼ, where Pᵢ is the set of positions that query i attends to and z is the softmax normalizing term
Eq 8. oᵢ = Σ_{j∈P̃ᵢ} exp(qᵢ·kⱼ − m(j, Pᵢ) − z(i, Pᵢ)) vⱼ, where m(j, Pᵢ) = ∞ if j ∉ Pᵢ and 0 otherwise, and P̃ᵢ = {0, 1, …, l} ⊇ Pᵢ, so we can batch over a fixed set while masking out invalid positions
Eq 9. Pᵢ = {j : h(qᵢ) = h(kⱼ)}, i.e. each query attends only to the keys that fall into its own hash bucket

We can illustrate the LSH attention schematically compared with full attention in Figure 5.
a) The attention matrix for full attention is usually sparse, but the computation does not take advantage of this sparsity.
b) Keys and queries are sorted according to their hash bucket.
c) In the sorted attention matrix, pairs from the same bucket cluster near the diagonal.
d) Following a batching approach, chunks of m consecutive queries attend within their own chunk and one chunk back.

Figure 5: Simplified depiction of LSH Attention showing the hash-bucketing, sorting, and chunking steps and the resulting causal attention. (a–d) Attention matrices for these varieties of attention. [source]
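To give a feel for steps (b)–(d), here is a tiny NumPy sketch of just the bucketing / sorting / chunking bookkeeping, i.e. which query-key pairs end up being compared; the chunk size and bucket ids are made up for the example, and this is not the trax implementation linked below.

```python
import numpy as np

def lsh_chunked_pairs(bucket_ids, chunk_size):
    """Return a boolean mask of which (query, key) pairs are compared.

    Positions are sorted by bucket, split into chunks of `chunk_size`,
    and each chunk attends to itself and to the previous chunk;
    a pair is finally kept only if both positions share a bucket.
    """
    order = np.argsort(bucket_ids, kind='stable')     # sort positions by bucket (b)
    n = len(bucket_ids)
    chunk_of = np.empty(n, dtype=int)
    chunk_of[order] = np.arange(n) // chunk_size      # chunk index after sorting (d)
    diff = chunk_of[:, None] - chunk_of[None, :]
    candidate = (diff == 0) | (diff == 1)             # same chunk or one chunk back
    same_bucket = bucket_ids[:, None] == bucket_ids[None, :]
    return candidate & same_bucket                    # pairs that actually attend

# toy usage with 8 positions, 3 buckets, chunks of 2
bucket_ids = np.array([2, 0, 1, 0, 2, 1, 0, 2])
mask = lsh_chunked_pairs(bucket_ids, chunk_size=2)
print(mask.astype(int))
```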

The full implementation of this type of attention can be found at: github.com/google/trax/blob/master/trax/models/research/configurable_transformer.py

I did not include the full code here to avoid making this article longer.

Main References:

1. Kitaev, N., Kaiser, Ł., and Levskaya, A., Reformer: The Efficient Transformer. arXiv preprint arXiv:2001.04451, 2020.

2. Li, S., et al., Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting. Advances in Neural Information Processing Systems, 2019. 32.

3. Zhou, H., et al., Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 2021.

4. Vaswani, A., et al., Attention Is All You Need. Advances in Neural Information Processing Systems, 2017. 30.

Please note that this post is also a reference for my future research, so I can look back and review the material on this topic. If you find any errors, please let me know. Meanwhile, you can contact me directly on Twitter here or LinkedIn here for any reason.
