Recurrent Neural Networks for Multivariate Time Series with Missing Values
Time series are everywhere, in every industry from energy to geoscience, so it is crucial to know how to work with them. In most cases (especially in real-world projects), time-series datasets contain numerous missing data points, and those missing values are often highly related to the prediction target. This article reviews the existing methods and then gives a thorough illustration of GRU-D (a model based on the Gated Recurrent Unit) for dealing with missing values.
There are several possible reasons for missing data points: unexpected events, cost saving, anomalies, inconvenience, incomplete data entry, equipment malfunctions, lost files, and so on. Usually we cannot prevent them, so we have to find ways to deal with the issue.
Researchers have long been developing solutions. The simplest is to delete the affected records, which greatly reduces the data size and, consequently, loses important information. Another approach is to fill the missing points with substitute values (imputation); smoothing, interpolation, and splines are further options. These methods have been effective to some extent, but as time-series datasets have grown more complicated, they have lost their practicality in most cases.
There are also more advanced imputation methods, such as spectral analysis, kernel methods, EM algorithms, matrix completion, and matrix factorization. However, combining an imputation method with a separate prediction model splits the work into two disconnected steps, so the missing patterns are never explored properly; consequently, the analysis results end up suboptimal.
These missing values often carry information in themselves (informative missingness). Figure 1 plots the Pearson correlation coefficients between the variable missing rates (how often each variable is missing in the sequential dataset) and the labels; we can see a clear correlation between the missing rates and the prediction targets.
This short illustration is for those who forgot, or never learned, about the Pearson correlation coefficient.
The Pearson correlation coefficient (also known as Pearson’s r) measures the strength of the linear relationship between two variables. We use the formula below to calculate this coefficient:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

where:
- r = correlation coefficient
- xᵢ = values of the x-variable in the sample
- x̄ = mean of the values of the x-variable
- yᵢ = values of the y-variable in the sample
- ȳ = mean of the values of the y-variable
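The formula above can be computed directly; a minimal sketch in NumPy (the function name is mine):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# a perfectly linear relationship gives r = 1.0
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # → 1.0
```

Values near +1 or −1 indicate a strong linear relationship; values near 0 indicate none.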
In this paper, the researchers used RNNs, which perform well on time-series problems with variable-length sequences and long-term temporal dependencies. They designed the model structure to incorporate the patterns of missingness directly into time-series classification. Concretely, they developed a deep model based on the GRU, called GRU-D, that exploits two representations of informative missingness: masking and time intervals. GRU-D outperforms the baselines on the datasets studied in this research and offers the following contributions:
1. A general deep model to handle missing data in time-series datasets.
2. An effective way to characterize missing patterns that are not missing-completely-at-random, using masking vectors and time intervals.
3. A practical way of studying the impact of variable missingness on the prediction labels through decay analysis.
Notations.
We can see an example of the notation below:
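The two missingness indicators can be built directly from the raw data. Below is a minimal sketch following the paper’s notation, with NaN marking missing entries (the function name is mine): the masking matrix m flags which variables are observed, and δ accumulates how long each variable has gone unobserved.

```python
import numpy as np

def masking_and_intervals(x, timestamps):
    """Build the masking matrix m and the time-interval matrix delta
    for a (T, D) series x in which NaN marks a missing entry."""
    x = np.asarray(x, dtype=float)
    s = np.asarray(timestamps, dtype=float)
    T, D = x.shape
    m = (~np.isnan(x)).astype(float)  # 1 where observed, 0 where missing
    delta = np.zeros((T, D))
    for t in range(1, T):
        gap = s[t] - s[t - 1]
        # if the variable was observed at t-1, the interval resets to the gap;
        # otherwise the interval keeps accumulating
        delta[t] = np.where(m[t - 1] == 1, gap, gap + delta[t - 1])
    return m, delta

x = [[1.0, np.nan],
     [np.nan, 2.0],
     [3.0, np.nan]]
m, delta = masking_and_intervals(x, timestamps=[0, 1, 2])
print(m)      # → [[1. 0.] [0. 1.] [1. 0.]]
print(delta)  # → [[0. 0.] [1. 1.] [2. 1.]]
```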
This section gives a brief summary of GRUs, so if you already know their structure and inner workings, feel free to skip it
😉
Gated Recurrent Units (GRUs) are a gating mechanism introduced in 2014 by Cho et al. Unlike LSTMs, which have 3 gates, GRUs use only 2 gates to process sequential data. The main structure can be seen in Figure 3, and for further understanding, Understanding GRU Networks is highly recommended. Also, if you want to understand LSTMs and GRUs in one place, this article is recommended: Illustrated Guide to LSTM’s and GRU’s: A step by step explanation.
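To make the two gates concrete, here is a minimal NumPy sketch of a single GRU step (parameter names are mine; I use the update convention from the GRU-D paper, hₜ = (1 − zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h_prev, params):
    """One GRU step with its two gates: update (z) and reset (r)."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev + br)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)  # candidate state
    return (1 - z) * h_prev + z * h_tilde               # new hidden state
```

The update gate z decides how much of the new candidate state to keep, and the reset gate r decides how much of the past to forget when forming that candidate.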
GRU-D: Model with Trainable Decays
There are two notable properties of time-series data with missing points:
- Missing points tend to take values close to those of their neighbors in the time-series trend
- The influence of an input variable fades away over time if the variable has been missing for a while
So, the researchers proposed GRU-D; its schematic architecture is shown in Figure 3.
We can see that GRU-D is based on the GRU with a number of additions: a decay mechanism is applied to both the input variables and the hidden states to capture the two properties above. The decay mechanism is controlled by decay rates. The vector of decay rates is formalized in Equation 4:

γₜ = exp(−max(0, Wᵧ δₜ + bᵧ))

where Wᵧ and bᵧ are model parameters and δₜ is the vector of time intervals. This exponentiated negative rectifier keeps each decay rate in (0, 1] and monotonically decreasing in the time interval; other functions with the same property, such as a sigmoid, could be used instead.
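As a small sketch of Equation 4 (function and parameter names are mine): the longer a variable stays missing, the closer its decay rate gets to zero.

```python
import numpy as np

def decay_rate(delta, W_gamma, b_gamma):
    """gamma_t = exp(-max(0, W_gamma @ delta_t + b_gamma)).
    The negative rectifier keeps gamma in (0, 1] and makes it
    non-increasing as the time interval delta grows."""
    return np.exp(-np.maximum(0.0, W_gamma @ delta + b_gamma))

W, b = np.eye(2), np.zeros(2)
print(decay_rate(np.zeros(2), W, b))             # no gap → rates of 1.0
print(decay_rate(np.array([1.0, 2.0]), W, b))    # longer gaps → smaller rates
```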
Two decay mechanisms are used in this research: one applies the missingness directly to the input values, and the other injects it implicitly into the RNN states.
1. Instead of carrying the last observation forward unchanged, an input decay rate γₜ decays it over time toward the empirical mean of the variable (which serves as the default value). The trainable decay scheme is applied to the measurement vector as in Equation 5:

x̂ₜᵈ = mₜᵈ xₜᵈ + (1 − mₜᵈ)(γₜᵈ xₜ'ᵈ + (1 − γₜᵈ) x̃ᵈ)

where xₜ'ᵈ is the last observed value of variable d and x̃ᵈ is its empirical mean.
2. To better capture missingness in the hidden dynamics, GRU-D also uses a hidden-state decay γₜ. As shown in Equation 6, it decays the previous hidden state before the new hidden state is computed:

ĥₜ₋₁ = γₜ ⊙ hₜ₋₁
The remaining update functions are given in Equations 7–10; they are the same as in a standard GRU, with the slight difference that the decayed inputs x̂ₜ and decayed hidden states ĥₜ₋₁ are used, and the masking vector mₜ is fed in as an additional input.
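Putting Equations 4–10 together, a full GRU-D step can be sketched in NumPy as follows. This is a minimal illustration, not the authors’ implementation: the weight names, the dict layout, and the elementwise vector `w_gx` (standing in for the diagonal input-decay weight) are all mine.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_d_cell(x, m, delta, x_last, x_mean, h_prev, p):
    """One GRU-D step. x, m, delta, x_last, x_mean are (D,) vectors;
    h_prev is (H,); p is a dict of weights. The V_* terms feed the
    masking vector m into the gates as an extra input."""
    # input decay (Eq. 4-5): decay the last observation toward the empirical mean
    gamma_x = np.exp(-np.maximum(0.0, p["w_gx"] * delta + p["b_gx"]))
    x_hat = m * x + (1 - m) * (gamma_x * x_last + (1 - gamma_x) * x_mean)
    # hidden-state decay (Eq. 6)
    gamma_h = np.exp(-np.maximum(0.0, p["W_gh"] @ delta + p["b_gh"]))
    h_hat = gamma_h * h_prev
    # standard GRU updates with the mask as an extra input (Eq. 7-10)
    z = sigmoid(p["W_z"] @ x_hat + p["U_z"] @ h_hat + p["V_z"] @ m + p["b_z"])
    r = sigmoid(p["W_r"] @ x_hat + p["U_r"] @ h_hat + p["V_r"] @ m + p["b_r"])
    h_tilde = np.tanh(p["W"] @ x_hat + p["U"] @ (r * h_hat) + p["V"] @ m + p["b"])
    return (1 - z) * h_hat + z * h_tilde
```

Running this cell over a sequence, with `x_last` and `delta` updated at each step from the masking vectors, gives the GRU-D forward pass.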
Baseline
The researchers compared against most of the available methods, whether RNN-based (LSTM-Mean, GRU-Simple, GRU-Mean, etc.) or non-RNN (logistic regression (LR), support vector machines (SVM), etc.), and whether interpolation- or imputation-based (including mean imputation, forward imputation, simple imputation, SoftImpute, KNN, MICE (Multiple Imputation by Chained Equations), etc.). Discussing and illustrating all of these methods would demand a separate article for each.
All inputs were normalized, and results were reported as the area under the ROC curve (AUC score) from 5-fold cross-validation. We will see the comparison of GRU-D with the other models across various model sizes.
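This evaluation protocol (normalize, 5-fold cross-validation, AUC) can be sketched with scikit-learn. The snippet below is only an illustration of the protocol, with synthetic data and a logistic-regression classifier standing in for the paper’s models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# toy stand-in data; the paper evaluates its RNN models the same way
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# normalization lives inside the pipeline, so it is fit per fold
clf = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(scores.mean())  # mean AUC over the 5 folds
```

Putting the scaler inside the pipeline avoids leaking test-fold statistics into the normalization.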
Results
Two real-world healthcare datasets and a synthetic dataset were used in the evaluation stage. The two healthcare datasets are:
1. PhysioNet Challenge 2012 dataset (PhysioNet)
2. MIMIC-III dataset (MIMIC-III)
From Figure 5, we can see that GRU-Simple fails when the correlation is low, while GRU-D achieves better AUC scores and more stable performance in all the settings.
Table 1 reports the performance of all models on the mortality prediction task.
We can see the comparison of all models in Table 2; the results are fairly similar to those on the mortality prediction task.
Figure 6 shows the online prediction results for the MIMIC-III mortality task. Here, the RNN models are compared with three non-RNN models that are widespread in machine learning.
Conclusion
Dealing with missing values in time-series datasets is quite challenging, and so is using recurrent neural networks to handle them. The paper reviewed here proposed a new architecture based on Gated Recurrent Units (GRUs) with some modifications, which outperforms other methods on time-series datasets with missing values. GRU-D, the proposed model, was developed for classification tasks, but it can be extended to regression tasks. I have also added code for the GRU-D model, which helped me understand it better and may do the same for you; you can even use it in your projects (source code).
Main Reference
- Che, Z., et al. (2018). Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 8(1), 1–12.
Please note that this post is for my future research, to look back at and review the materials on this topic. If you find any errors, please let me know. Meanwhile, you can contact me directly on Twitter here or LinkedIn here for any reason.
Finally, if you found this article interesting and useful, you can follow me on Medium to read more articles from me.