Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

Reza Yazdanfar
6 min read · Feb 10, 2022

Informer is a transformer-based model developed to cope with long-range dependencies.



The main topic of this article is sequence prediction. Sequence prediction appears anywhere we have data that changes constantly, such as the stock market. In many real-world applications of AI, predicting long sequence time series is crucial, but it is not easy; it requires a robust model with high prediction capacity that can capture long-range dependencies. In my personal experience, this turned out to be an important factor in two of my projects on time-series datasets. I remember being confused because a simple neural network was not enough for our time series; so I read several papers, took some courses, and found Informer fascinating. Although I did not end up using this model in my research, I will certainly be inspired by its architecture.

Figure 1. (a) LSTF can cover an extended period than the short sequence predictions, making a vital distinction in policy planning and investment-protecting. (b) The prediction capacity of existing methods limits LSTF’s performance. E.g., starting from length=48, MSE rises unacceptably high, and the inference speed drops rapidly. [source]

Transformers are considered revolutionary in the deep learning era for making predictions more reliable and accurate. Nevertheless, several problems prevent them from being applied directly to Long Sequence Time-Series Forecasting (LSTF), such as quadratic time complexity, high memory usage, and the inherent limitation of the encoder-decoder architecture. This motivated the development of an efficient transformer-based model called Informer. In this article, I would like to walk through the ideas behind Informer in detail.

Transformer

First, let me give a summary of Transformers in case you are not familiar with them. (If you know Transformers well, you can skip this section. 😉)

Transformers are a relatively new class of deep learning models that are being adopted at a rising rate. They rely on the self-attention mechanism and have shown significant performance gains on challenging tasks in NLP and computer vision. The Transformer architecture can be viewed as two parts, the encoder and the decoder, as illustrated in Figure 2 below:

Figure 2. The Transformer Architecture. [source]

A key property of Transformers is their independence from locality: in contrast to other popular models like CNNs, they are not limited to local receptive fields. Transformers also use no convolutional layers at all; instead, attention-based structures model the dependencies, which allows them to achieve better results.

The attention architecture is summarized in Figure 3:

Figure 3. (Left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel. [source]

The function of Scaled Dot-Product Attention is given by Eq. 1:

Eq. 1: Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V [source]

Q (Query), K (Key), and V (Value) are the inputs of the attention, and d_k is the dimension of the keys.
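To make Eq. 1 concrete, here is a minimal NumPy sketch of scaled dot-product attention (the shapes and variable names are my own, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (L_q, d_k), K: (L_k, d_k), V: (L_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (L_q, L_k) similarity of every query with every key
    weights = softmax(scores, axis=-1)   # each query's probability distribution over the keys
    return weights @ V                   # (L_q, d_v) weighted sum of the values

# toy usage
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(8, 16)), rng.normal(size=(32, 16)), rng.normal(size=(32, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (8, 16)
```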

For a complete, fundamental understanding of Transformers, look at "Attention Is All You Need". It gives you a great understanding of attention and Transformers; in fact, it was this paper that first gave me a complete understanding of this important model.

I think this is enough of a summary of Transformers, so let's dive into the Informer architecture.

Informer Architecture

ProbSparse Self-Attention

In the Informer, instead of the full attention of Eq. 1, the ProbSparse self-attention of Eq. 2 is used, which lets each key attend only to the u dominant queries. The dominant queries are selected with the sparsity measurement of Eq. 3:

Eq. 2: A(Q, K, V) = Softmax(Q̄ Kᵀ / √d) V, where Q̄ is a sparse matrix containing only the top-u queries [source]
Eq. 3: M̄(q_i, K) = max_j { q_i k_jᵀ / √d } − (1/L_K) Σ_j q_i k_jᵀ / √d, the max-mean measurement used to rank the queries [source]
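As a rough NumPy sketch of this idea (my own simplification, not the paper's optimized implementation, which also subsamples the keys when computing the measurement): score each query with the max-minus-mean measurement, give the top-u queries full attention, and let the remaining "lazy" queries fall back to the mean of the values:

```python
import numpy as np

def probsparse_attention(Q, K, V, u):
    """Simplified ProbSparse self-attention sketch (shapes/names are my own)."""
    L_q, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # (L_q, L_k)

    # Sparsity measurement: max over keys minus mean over keys (Eq. 3 style)
    M = scores.max(axis=1) - scores.mean(axis=1)     # (L_q,)
    top_u = np.argsort(M)[-u:]                       # indices of the "active" queries

    # Lazy queries get the mean of V (their attention is close to uniform)
    out = np.tile(V.mean(axis=0), (L_q, 1))

    # Active queries get full scaled dot-product attention
    w = np.exp(scores[top_u] - scores[top_u].max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    out[top_u] = w @ V
    return out

rng = np.random.default_rng(1)
Q = K = V = rng.normal(size=(96, 64))
print(probsparse_attention(Q, K, V, u=12).shape)     # (96, 64)
```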


Encoder: Allowing for Processing Longer Sequential Inputs under the Memory Usage Limitation

The encoder is designed to extract robust long-range dependencies from long sequential inputs. Figure 4 shows the schematic architecture of the encoder:

Figure 4. The single stack in Informer’s encoder. (1) The horizontal stack stands for an individual one of the encoder replicas in Figure 5. (2) The presented one is the main stack receiving the whole input sequence; the second stack takes half slices of the input, and the subsequent stacks repeat. (3) The red layers are dot-product matrices, and they get a cascade decrease by applying self-attention distilling on each layer. (4) All stacks’ feature maps are concatenated as the encoder’s output. [source]

As a consequence of the ProbSparse self-attention mechanism, the encoder’s feature map contains redundant combinations of the value V. The distilling operation is used to privilege the superior combinations with dominating features and build a focused self-attention feature map in the next layer.

As Figure 4 shows, the structure consists of several attention blocks, Conv1d layers, and max-pooling layers that encode the input data. Replicas of the main stack, each receiving half of the previous input, increase the robustness of the distilling operation, and the number of self-attention distilling layers decreases by one from stack to stack. At the end of the encoder, the feature maps of all stacks are concatenated and passed to the decoder.
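As a rough PyTorch sketch of this distilling step between attention blocks (the layer sizes and class name are my assumptions, not the official implementation), a Conv1d followed by an activation and max-pooling halves the temporal dimension:

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    """Sketch of self-attention distilling: Conv1d + ELU + MaxPool,
    halving the sequence length between attention blocks."""
    def __init__(self, d_model: int):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(d_model)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)              # Conv1d expects (batch, d_model, seq_len)
        x = self.pool(self.act(self.norm(self.conv(x))))
        return x.transpose(1, 2)           # (batch, seq_len // 2, d_model)

x = torch.randn(4, 96, 512)
print(DistillingLayer(512)(x).shape)       # torch.Size([4, 48, 512])
```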

Decoder: Generating Long Sequential Outputs Through One Forward Procedure

The decoder structure is not complex; it is the standard decoder structure illustrated in "Attention Is All You Need", composed of a stack of two identical multi-head attention layers. However, generative inference is proposed to mitigate the speed plunge in long prediction, as can be seen in Figure 5.

Figure 5. Informer model overview. Left: The encoder receives massive long sequence inputs (green series). We replace canonical self-attention with the proposed ProbSparse attention. The blue trapezoid is the self-attention distilling operation to extract dominating attention, reducing the network size sharply. The layer stacking replicas increase robustness. Right: The decoder receives long sequence inputs, pads the target elements into zero, measures the weighted attention composition of the feature map, and instantly predicts output elements (orange series) in a generative style. [source]

The decoder is fed with the following input (Eq. 4):

Eq. 4: X_de = Concat(X_token, X_0) ∈ R^((L_token + L_y) × d_model), where X_token is the start token and X_0 is a zero placeholder for the target sequence [source]

Instead of using a specific flag as the token, an L_token-long sequence is sampled from the input sequence, such as the slice immediately before the output sequence.
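A minimal sketch of how such a decoder input could be assembled (the function and tensor names are mine, and the shapes are illustrative): take the last L_token steps of the input as the start token and append zeros as placeholders for the values to be predicted, as in Eq. 4:

```python
import torch

def build_decoder_input(x_enc, L_token, L_y):
    """x_enc: (batch, L_x, d) input sequence.
    Returns (batch, L_token + L_y, d): a start-token slice followed by
    zero placeholders for the target positions."""
    x_token = x_enc[:, -L_token:, :]                             # earlier slice before the output
    x_zero = x_enc.new_zeros(x_enc.size(0), L_y, x_enc.size(2))  # padded target elements
    return torch.cat([x_token, x_zero], dim=1)

x_enc = torch.randn(4, 96, 7)
x_dec = build_decoder_input(x_enc, L_token=48, L_y=24)
print(x_dec.shape)  # torch.Size([4, 72, 7])
```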

The Hyperparameter Tuning Range

In order to tune the model, the researchers varied three important length parameters: the prolonged input length (48, 96, 168, 240, 336, 480, 624, 720), the encoder input length (78, 96, 168, 240, 480, 624, 720), and the decoder input length (96, 168, 240, 336, 480, 720). This can be seen simply in Figure 6:

Figure 6. The parameter sensitivity of three components in Informer [source]
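For illustration only, a grid search over these length ranges could be organized as in the sketch below; the dictionary keys are my own labels rather than the paper's exact parameter names:

```python
from itertools import product

# Hypothetical search grid over the three length parameters listed above;
# the key names are my own labels, not the paper's exact terminology.
search_space = {
    "prolonged_input_len": [48, 96, 168, 240, 336, 480, 624, 720],
    "encoder_input_len": [78, 96, 168, 240, 480, 624, 720],
    "decoder_input_len": [96, 168, 240, 336, 480, 720],
}

for lengths in product(*search_space.values()):
    config = dict(zip(search_space.keys(), lengths))
    # train and evaluate the model under this length configuration,
    # then compare MSE/MAE as in the parameter sensitivity study
```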

Model Evaluation

The researchers compared the model against other state-of-the-art models for both univariate and multivariate time series. Four real-world datasets were used in the evaluation, including ETT (Electricity Transformer Temperature), which is separated into sub-datasets with different resolutions, ECL (Electricity Consuming Load), and Weather (local climatological data).

Two metrics, MSE and MAE, are used for the evaluation. The results of this comparison are summarized in two tables (Tables 1 and 2).

Table 1: Univariate long sequence time-series forecasting results on four datasets (five cases) [source]
Table 2: Multivariate long sequence time-series forecasting results on four datasets (five cases) [source]
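For reference, the two metrics reported in the tables are straightforward to compute; a minimal NumPy sketch:

```python
import numpy as np

def mse(y_true, y_pred):
    # mean squared error
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # mean absolute error
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([0.5, 1.0, 1.5])
y_pred = np.array([0.4, 1.2, 1.4])
print(mse(y_true, y_pred), mae(y_true, y_pred))  # 0.02 0.1333...
```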

Both Table 1 and Table 2 show that the proposed Informer performs better than the other models in most settings, for both univariate and multivariate forecasting.

Conclusion

Informer is a strong attempt at a transformer-based model that addresses some drawbacks of Transformers in long sequence time-series forecasting. The researchers designed the ProbSparse self-attention mechanism and the self-attention distilling operation to cope with the quadratic time complexity and quadratic memory usage of the vanilla Transformer.

You can run the model in Colab by yourself. You only need to click here.😉

Reference

Zhou, H., et al. (2021). Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence.

Please note that this post is also meant as a reference for my own future research, so I can look back and review the materials on this topic.

