Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting
Informer is a transformer-based model developed to cope with long-range dependencies in long sequence time-series forecasting.
The main topic of this article is sequence prediction. Sequence prediction shows up anywhere we have data that changes over time, such as the stock market. Although predicting long sequence time series is crucial for most real-world applications of AI, it is not easy; in fact, it requires a robust model with high predictive capacity that can capture long-range dependencies. In my personal experience, this appeared as an important factor in two of my projects on time-series data sets. I remember being confused because a simple neural network was not enough for our time series; so I read several papers, watched some courses, and found this Informer fascinating. Although I did not end up using this model in my research, I will certainly be inspired by its architecture.
Transformers are considered revolutionary in the deep learning era, making predictions more reliable and accurate. Nevertheless, several problems prevent them from being applied directly to Long Sequence Time-Series Forecasting (LSTF), such as quadratic time complexity, high memory usage, and the inherent limitation of the encoder-decoder architecture. These issues motivated the development of an efficient transformer-based model called Informer. In this article, I would like to walk through the ideas behind Informer in detail.
Transformer
First, let me give a summary of transformers in case you are not familiar with them. (For those who know transformers well, feel free to skip this section. 😉)
Transformers are relatively new deep learning models that are being adopted at a rising rate. They use the mechanism of self-attention and have shown significant performance gains on challenging tasks in NLP and computer vision. The Transformer architecture can be divided into two parts, the encoder and the decoder, as illustrated in Figure 2 below:
The main strength of transformers is their independence from locality: in contrast to other popular models such as CNNs, transformers are not limited to local patterns. Moreover, there is no convolutional architecture inside a transformer; attention-based structures are used instead, which allows them to achieve better results.
The attention architecture can be summarized in Figure 3:
The scaled dot-product attention function is given in Eq. 1.
Q (query), K (key), and V (value) are the inputs of the attention.
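To make Eq. 1 concrete, here is a minimal PyTorch sketch of scaled dot-product attention (the function name and toy shapes are mine, not from the paper):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k); implements Eq. 1
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (batch, L_Q, L_K)
    weights = F.softmax(scores, dim=-1)             # attention weights per query
    return weights @ V                              # weighted sum of the values

# Toy usage: a batch of 2 sequences of length 10 with d_k = 64
Q = K = V = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(Q, K, V)         # -> shape (2, 10, 64)
```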
For a complete, fundamental understanding of transformers, have a look at “Attention Is All You Need”. It gives you a great understanding of attention and transformers; in fact, it was the paper that made me fully understand this important model for the first time.
I think this summary is enough for transformers; so, let’s dive into the Informer architecture.
Informer Architecture
ProbSparse Self-Attention
In the Informer, instead of using Eq. 1, the ProbSparse self-attention in Eq. 2 is used, letting each key attend only to the u dominant queries:
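To give a feel for how this works, here is a simplified PyTorch sketch of the idea. It follows the max-mean sparsity measurement from the paper, but the real implementation also samples keys when scoring the queries and handles the masked decoder case differently; the function and argument names are my own:

```python
import math
import torch
import torch.nn.functional as F

def probsparse_attention(Q, K, V, factor=5):
    """Simplified sketch of ProbSparse self-attention (Eq. 2).

    Only the top-u "dominant" queries (ranked by the max-mean sparsity
    measurement) receive full attention; the remaining queries fall back
    to the mean of V, as in the paper's lazy-query approximation.
    """
    B, L_Q, d = Q.shape
    L_K = K.shape[1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)          # (B, L_Q, L_K)

    # Sparsity measurement: M(q_i, K) = max_j(s_ij) - mean_j(s_ij)
    M = scores.max(dim=-1).values - scores.mean(dim=-1)      # (B, L_Q)
    u = min(L_Q, factor * math.ceil(math.log(L_Q)))          # u = c * ln(L_Q)
    top_idx = M.topk(u, dim=-1).indices                      # (B, u)

    # Lazy queries: approximate their output by the mean of V
    out = V.mean(dim=1, keepdim=True).expand(B, L_Q, d).clone()

    # Dominant queries: standard softmax attention on the top-u rows only
    top_scores = torch.gather(scores, 1,
                              top_idx.unsqueeze(-1).expand(-1, -1, L_K))
    out.scatter_(1, top_idx.unsqueeze(-1).expand(-1, -1, d),
                 F.softmax(top_scores, dim=-1) @ V)
    return out

Q = K = V = torch.randn(2, 96, 64)
print(probsparse_attention(Q, K, V).shape)                   # torch.Size([2, 96, 64])
```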
Encoder: Allowing for Processing Longer Sequential Inputs under the Memory Usage Limitation
The encoder is designed to extract robust long-range dependencies from long sequential inputs. Figure 4 shows the schematic architecture of the encoder:
As a consequence of the ProbSparse self-attention mechanism, the encoder’s feature map contains redundant combinations of the value V. The distilling operation is used to privilege the superior combinations with dominating features and to build a concentrated self-attention feature map in the next layer.
As Figure 4 shows, the structure consists of several attention blocks, Conv1d, and MaxPooling layers that encode the input data. To make the distilling operation more robust, halved replicas of the main stack are built, each receiving half of the input, and the number of self-attention distilling layers is reduced by one from stack to stack. At the end of the encoder, the feature maps of all stacks are concatenated and the result is passed to the decoder.
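For intuition, here is a rough PyTorch sketch of one distilling step (Conv1d + activation + max-pooling that halves the sequence length). The kernel sizes, normalization, and padding are my assumptions, not necessarily those of the official implementation:

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    """Sketch of the self-attention distilling step between encoder
    attention blocks: Conv1d + ELU + max-pooling, which halves the
    sequence length so the next layer works on a concentrated feature map."""
    def __init__(self, d_model=512):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(d_model)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):              # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)          # Conv1d expects (batch, channels, seq_len)
        x = self.pool(self.act(self.norm(self.conv(x))))
        return x.transpose(1, 2)       # (batch, seq_len // 2, d_model)

x = torch.randn(8, 96, 512)
print(DistillingLayer()(x).shape)      # torch.Size([8, 48, 512])
```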
Decoder: Generating Long Sequential Outputs Through One Forward Procedure
The decoder structure is not complicated; it is the standard decoder structure illustrated in “Attention Is All You Need”, composed of a stack of two identical multi-head attention layers. However, generative inference is proposed to mitigate the speed plunge in long prediction, as can be seen in Figure 5.
The decoder is fed with the input given by the equation below (Eq. 4), which concatenates a start token with a placeholder for the target sequence:
Instead of using a specific flag as the token, an L_token-long sequence is sampled from the input sequence, namely the slice immediately before the output sequence.
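Here is a small sketch of how such a decoder input could be assembled; the function and argument names (label_len, pred_len) are mine, chosen to mirror Eq. 4, not taken from the paper’s code:

```python
import torch

def build_decoder_input(x_enc, label_len=48, pred_len=24):
    """Sketch of the generative-style decoder input (Eq. 4): concatenate an
    L_token-long start slice from the end of the known sequence with zero
    placeholders for the values to be predicted."""
    # x_enc: (batch, seq_len, n_features) -- the known input sequence
    x_token = x_enc[:, -label_len:, :]                        # start token slice
    x_zeros = torch.zeros(x_enc.size(0), pred_len,
                          x_enc.size(-1), dtype=x_enc.dtype)  # target placeholders
    return torch.cat([x_token, x_zeros], dim=1)               # (batch, L_token + L_y, n_features)

x_enc = torch.randn(32, 96, 7)          # e.g. 96 known steps of 7 features
x_dec = build_decoder_input(x_enc)      # -> (32, 72, 7): 48 token steps + 24 to predict
```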
The Hyperparameter Tuning Range
In order to tune the model, the researchers varied three important length parameters: Prolong Input Length (48, 96, 168, 240, 336, 480, 624, 720), Encoder Input Length (78, 96, 168, 240, 480, 624, 720), and Decoder Token Length (96, 168, 240, 336, 480, 720). This is shown in Figure 6:
Model Evaluation
The researchers evaluated the model against other state-of-the-art models on both univariate and multivariate time-series data. Four real-world datasets were used in the evaluation, including ETT (Electricity Transformer Temperature, separated into two datasets with different resolutions), ECL (Electricity Consuming Load), and Weather (local climatological data).
Two metrics, MSE and MAE, are used for the evaluation. The results of this comparison are summarized in two tables (Tables 1 and 2).
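For reference, these two metrics are simply the mean squared error and the mean absolute error of the forecasts; a tiny PyTorch snippet makes them explicit (the toy shapes are mine):

```python
import torch

def mse(y_pred, y_true):
    return torch.mean((y_pred - y_true) ** 2)   # mean squared error

def mae(y_pred, y_true):
    return torch.mean((y_pred - y_true).abs())  # mean absolute error

y_true = torch.randn(32, 24, 7)                 # e.g. 24 predicted steps of 7 features
y_pred = y_true + 0.1 * torch.randn_like(y_true)
print(mse(y_pred, y_true).item(), mae(y_pred, y_true).item())
```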
From Tables 1 and 2, we can see that the proposed Informer mostly performs better than the other models on both univariate and multivariate datasets.
Conclusion
It was a great attempt to propose a transformer-based model that addresses some drawbacks of transformers for long sequence time-series forecasting. In the design, the researchers introduced the ProbSparse self-attention mechanism and the distilling operation to cope with the quadratic time complexity and quadratic memory usage of the vanilla Transformer.
You can run the model in Colab yourself. You only need to click here.😉
Reference
Zhou, H., et al. (2021). Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence.
Please note that I also wrote this post for my future self, to look back and review the material on this topic.
If you find any errors, please let me know. Meanwhile, you can contact me on Twitter here or LinkedIn here. Finally, if you found this article interesting and useful, you can follow me on Medium for more articles.