How Do Attention-based Transformers Perform on Local Sensitivity?

We cannot ignore the demand for time series forecasting, regardless of the industry: energy, healthcare, etc. Recently, Transformers have proven to be great architectures for making complex predictions in deep learning. These Transformers are mainly based on attention; full self-attention has a mathematical operation known as Scaled Dot-Product Attention at its core. These attention mechanisms suffer from two problems: 1. locality-agnosticism, and 2. a memory bottleneck. This article is about solving these two problems.

Reza Yazdanfar
4 min read · Jun 20, 2022


How can we augment the locality?

Answer:

By adding convolutional operators to the architecture.

Illustration:

It doesn’t matter whether a data point is a true outlier or only looks like one because of a rare event in the time series; we all know that the surrounding data points matter.
In self-attention layers, we compute the similarity between queries and keys based on their point-wise values, without considering any local context. (You can see Figure 1.)

Figure 1 | source

This point-wise matching of key and query can misidentify what a point really is (an outlier, a change point, or a true pattern) and, subsequently, cause problems in optimization.
So, this is a problem with consequences for any model we build on attention. What the researchers did is add convolutional operators, as you can see in Figure 2.

Figure 2 | source

In this convolutional implementation, the researchers proposed a stride of 1 and a kernel of size k (causal convolution) to convert inputs into queries and keys.

This proposal (causal convolution) ensures that the point at the current moment has no access to future points (only points in the past). It also replaces point-wise values with local shapes, which leads to better predictions.
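As a rough sketch of the idea (not the paper’s exact implementation; the kernel values here are made up), a causal convolution can be written with left-only padding, so the query/key at time t is a local shape computed from the points at t and before:

```python
import numpy as np

def causal_conv1d(x, kernel):
    # Left-pad with k-1 zeros so position t only sees x[t-k+1 .. t],
    # never the future. Stride is 1, as in the paper.
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

# A toy series and a hypothetical kernel of size k = 3.
series = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.2, 0.3, 0.5])
queries = causal_conv1d(series, kernel)

# Changing a *future* point leaves earlier outputs untouched (causality).
series2 = series.copy()
series2[3] = 100.0
assert np.allclose(causal_conv1d(series2, kernel)[:3], queries[:3])
```

With k = 1 this collapses back to the point-wise projections of canonical self-attention; k > 1 is what brings the local shape into the query/key comparison.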


How can we eliminate the memory bottleneck weakness?

Answer:

By using the power of a mathematical operator called the LOGARITHM.

Illustration:

Before illustrating the solution, let’s run an experiment on a real-world time-series dataset (traffic-f).
If we train a 10-layer model based on full attention, then visualize the attention scores in layers 2, 6, and 10 and compare them with the occupancy rate in the dataset, we get Figure 3:

Figure 3 | source

We can see that layer 2 follows the general pattern (matching the occupancy rate); however, layers 6 and 10 tell another story: their attention patterns have become sparse. In other words, it may be a good idea to introduce a kind of sparsity that only slightly affects accuracy.

Because the dataset is a time series, we divide it into fixed-length sequences (L). Full attention needs O(L²) memory to store the scores between every pair of cells; reducing this cost is what lets the model reliably capture long sequences.
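To make the O(L²) cost concrete, here is a minimal sketch (with made-up sizes) of how full attention materializes an L × L score matrix:

```python
import numpy as np

L, d = 6, 4  # hypothetical sequence length and model dimension
rng = np.random.default_rng(0)
Q = rng.normal(size=(L, d))  # queries
K = rng.normal(size=(L, d))  # keys

# Scaled dot-product scores: one entry per (query, key) pair,
# so the matrix has L * L entries and memory grows as O(L^2).
scores = Q @ K.T / np.sqrt(d)
```

Doubling L quadruples the number of stored scores, which is exactly the memory bottleneck for long sequences.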

Therefore, LogSparse attention plays the key role in this initiative.

  • It requires computing only O(log L) dot products for each cell in each layer

In Figure 4, a comparison between various attention patterns across neighboring layers is provided:

Figure 4 | source

This figure says, very briefly (without getting too technical):
In full self-attention (a), each cell attends to all of its previous cells (this is bad because of the space complexity).
So, this is a problem. What can we do?? We can choose only some of the indices, not all of them, and the best way to do that is with a mathematical operation called the logarithm. This leads to a much slower rise in memory usage. (See Figure 4(b).)
Should we pick previous cells at random, or by some other method? Would that be effective, or just a waste of time and energy at a huge computational cost?
The impressive way is to rely on the mathematics. AND YESS! The researchers suggested using the LOG operator, selecting cells at exponentially growing distances. I think it’s a good idea because, as we move further back from the current cell, the chosen cells become more and more sparse.
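As a minimal sketch of the idea (this index pattern is my illustration, not the paper’s exact mask), each cell attends to itself plus the cells at exponentially growing distances 1, 2, 4, … into the past, so only O(log L) dot products are needed per cell:

```python
def logsparse_indices(t):
    # Cells that position t may attend to: itself plus positions at
    # distances 1, 2, 4, 8, ... into the past (exponential spacing).
    idx = {t}
    step = 1
    while t - step >= 0:
        idx.add(t - step)
        step *= 2
    return sorted(idx)

allowed = logsparse_indices(8)  # only 5 cells instead of all 9
```

Full attention at position 1024 would score 1025 cells; this pattern keeps only 12, which is where the O(log L) per-cell cost comes from. Note also how the kept indices are dense near the current cell and sparse far away, matching the intuition above.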

Please note that the main paper illustrates the mathematics in detail; for the theorems and proofs, you can see the paper itself.

For the implementation, you can see this repo. Also, to use this architecture, you can try an incredible Python library called flow-forecast.😉 You only need to run pip install flood-forecast.

You can find its documentation here and get started, or if you want me to write short tutorials on implementing time series algorithms, just drop a comment here or DM me directly via Twitter or LinkedIn. If there is anything else, feel free to drop me a message.
