Recurrent Neural Networks for Multivariate Time Series with Missing Values

Time series are everywhere, in every industry from energy to geoscience, so it is crucial to work with them well. In most cases (especially in real-world projects), time-series datasets contain numerous missing data points, and those missing points are often highly correlated with the prediction target. This article reviews the existing methods and then gives a thorough illustration of GRU-D (a model based on the Gated Recurrent Unit) for dealing with missing points.

Reza Yazdanfar
7 min read · Mar 15, 2022

There are several possible reasons for missing data points: unexpected events, cost saving, anomalies, inconvenience, incomplete data entry, equipment malfunctions, lost files, and so on. In most cases we cannot prevent this, so we have to find approaches to deal with it.

Researchers have been making efforts to develop solutions. The simplest is to delete the affected points, which greatly reduces the data size and, consequently, loses important information. Another approach is to fill in the missing points with some alternative values (imputation). Smoothing, interpolation, and splines are further options. These methods have been efficient to some extent, but as time-series datasets have become more complicated, they have lost their efficiency and practicality in most cases.
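As a quick, minimal sketch of the simplest strategies above in pandas (the series values are made up for illustration):

```python
import numpy as np
import pandas as pd

# A toy series where NaN marks a missing observation.
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

dropped = s.dropna()              # deletion: shrinks the dataset
mean_filled = s.fillna(s.mean())  # imputation with the series mean
interpolated = s.interpolate()    # linear interpolation between neighbors
```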

There are also other imputation methods to cope with missing data points, such as spectral analysis, kernel methods, EM algorithms, matrix completion, and matrix factorization. However, combining an imputation method with a prediction model results in a two-step process in which the missing patterns are not explored properly; consequently, the analysis results are suboptimal.

Figure 1. Demonstration of informative missingness on the MIMIC-III dataset. The bottom figure shows the missing rate of each input variable. The middle figure shows the absolute values of Pearson correlation coefficients between the missing rate of each variable and mortality. The top figure shows the absolute values of Pearson correlation coefficients between the missing rate of each variable and each ICD-9 diagnosis category. [source]

These missing points usually exhibit informative missingness. Figure 1 plots the Pearson correlation coefficients between the variables' missing rates (how often each variable is missing in the sequential dataset) and the labels; we can see that the missing rates are clearly correlated with the labels.

This short illustration is for those who forget or do not know about the Pearson correlation coefficient.

The Pearson correlation coefficient (also known as Pearson's r) is a type of correlation coefficient used to measure the strength of a linear relationship between two variables, denoted by r. It is calculated with the formula below:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} \quad \text{(Eq. 1)}$$

r = correlation coefficient

xᵢ = values of the x-variable in a sample

x̄ = mean of the values of the x-variable

yᵢ = values of the y-variable in a sample

ȳ = mean of the values of the y-variable
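As a quick check of Equation 1, here is a minimal NumPy sketch, validated against NumPy's built-in np.corrcoef:

```python
import numpy as np

def pearson_r(x, y):
    # r = sum((x - x_mean)(y - y_mean)) / sqrt(sum((x - x_mean)^2) * sum((y - y_mean)^2))
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Sanity check against NumPy's built-in implementation.
x, y = np.random.rand(100), np.random.rand(100)
assert np.isclose(pearson_r(x, y), np.corrcoef(x, y)[0, 1])
```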

In the reviewed paper, the researchers used RNNs, which perform well on time series with variable-length sequences and long-term temporal dependencies. They designed the model structure to incorporate the patterns of missingness into time-series classification problems. Specifically, they developed a deep learning model based on GRU, called GRU-D, that exploits two representations of informative missingness patterns: masking and time interval. GRU-D outperforms the baselines on the datasets studied in this research and offers three contributions:

1. A general deep learning model that handles missing values in time-series datasets.

2. An effective way to characterize missing-not-at-random patterns in the dataset with masking and time intervals.

3. A way to study the effect of variable missingness on the prediction labels through decay analysis.

Notations.

For a multivariate time series with D variables, let xₜ denote the measurement vector recorded at time stamp sₜ. The masking vector mₜ records which variables are observed, and the time interval δₜ records, for each variable, how much time has passed since its last observation:

$$m_t^d = \begin{cases} 1, & \text{if } x_t^d \text{ is observed} \\ 0, & \text{otherwise} \end{cases} \quad \text{(Eq. 2)}$$

$$\delta_t^d = \begin{cases} s_t - s_{t-1} + \delta_{t-1}^d, & t > 1,\; m_{t-1}^d = 0 \\ s_t - s_{t-1}, & t > 1,\; m_{t-1}^d = 1 \\ 0, & t = 1 \end{cases} \quad \text{(Eq. 3)}$$

We can see an example of these notations below:

Figure 2. An example of measurement vectors xₜ, time stamps sₜ, masking mₜ, and time interval δₜ
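To make the notation concrete, here is a minimal sketch of how mₜ and δₜ could be derived from raw data, assuming missing entries of X are stored as NaN (the function name and storage convention are my own):

```python
import numpy as np

# X has shape (T, D) with NaN for missing values; s has shape (T,) and holds
# the time stamp of each measurement vector.
def masking_and_intervals(X, s):
    T, D = X.shape
    m = (~np.isnan(X)).astype(float)  # Eq. 2: 1 if observed, 0 if missing
    delta = np.zeros((T, D))          # Eq. 3: delta at t = 1 is 0
    for t in range(1, T):
        gap = s[t] - s[t - 1]
        # Restart the interval after an observation, otherwise accumulate it.
        delta[t] = np.where(m[t - 1] == 1, gap, gap + delta[t - 1])
    return m, delta
```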

This section gives you a brief overview of GRUs, so if you already know their structure and how they operate, skip this short section 😉

Gated Recurrent Units (GRUs) are gating mechanisms introduced in 2014 by Cho et al. Unlike LSTMs, which have three gates, GRUs use only two gates to process sequential data. Their main structure can be seen in Figure 3, and for further understanding, Understanding GRU Networks is highly recommended. Also, if you want to understand LSTMs and GRUs in one place, this article is recommended: Illustrated Guide to LSTM’s and GRU’s: A step by step explanation.

Figure 3. Gated Recurrent Unit [source]
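For reference, the standard GRU update can be written in a few lines of PyTorch (a sketch with my own parameter names; in practice torch.nn.GRU handles this for you):

```python
import torch

# One step of a standard GRU cell; p is a dict of weight tensors.
def gru_cell(x, h_prev, p):
    z = torch.sigmoid(x @ p["W_z"] + h_prev @ p["U_z"] + p["b_z"])      # update gate
    r = torch.sigmoid(x @ p["W_r"] + h_prev @ p["U_r"] + p["b_r"])      # reset gate
    h_tilde = torch.tanh(x @ p["W"] + (r * h_prev) @ p["U"] + p["b"])   # candidate state
    return (1 - z) * h_prev + z * h_tilde                               # new hidden state
```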

GRU-D: Model with Trainable Decays

Time-series datasets with missing points share two properties:

  1. Missing points tend to have values close to those of their neighbors in the time-series trend.
  2. The influence of an input variable fades away over time if it has been missing for a while.

So, the researchers proposed GRU-D; its schematic architecture can be seen in Figure 4.

Figure 4. Graphical illustrations of the original GRU (top-left), the proposed GRU-D (bottom), and the whole network architecture (right) [source]

We can see that GRU-D is based on GRU with a number of additions: a decay mechanism is applied to the input variables and the hidden states to capture the two properties above. The decay mechanism is controlled by decay rates, and the vector of decay rates is formalized in Equation 4.

$$\gamma_t = \exp\{-\max(0,\; W_\gamma \delta_t + b_\gamma)\} \quad \text{(Eq. 4)}$$

where W_γ and b_γ are model parameters. The exponentiated negative rectifier keeps each decay rate monotonically decreasing in the time interval, within the range (0, 1]; other formulations, such as a sigmoid function, could also be used.
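Equation 4 translates almost directly into code; here is a minimal PyTorch sketch (parameter names are my own):

```python
import torch

# Eq. 4: gamma_t = exp(-max(0, W_gamma @ delta_t + b_gamma)).
# torch.relu implements max(0, .), so the result always lies in (0, 1].
def decay_rate(delta_t, W_gamma, b_gamma):
    return torch.exp(-torch.relu(delta_t @ W_gamma + b_gamma))
```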

Two decay mechanisms are used in this research, applying the missingness directly to the input values and implicitly to the RNN states:

  1. Instead of using the last data point directly, an input decay γₓ pulls a missing variable back toward its empirical mean over time (the empirical mean acts as the default value). The trainable decay scheme is applied to the measurement vector by Equation 5:

$$\hat{x}_t^d = m_t^d\, x_t^d + (1 - m_t^d)\left(\gamma_{x_t}^d\, x_{t'}^d + (1 - \gamma_{x_t}^d)\, \tilde{x}^d\right) \quad \text{(Eq. 5)}$$

where x_{t′}^d is the last observed value of the d-th variable and x̃^d is its empirical mean.

  2. To capture richer information from missingness, GRU-D also uses a hidden-state decay γ_h, decaying the previous hidden state before calculating the new one (Equation 6):

$$\hat{h}_{t-1} = \gamma_{h_t} \odot h_{t-1} \quad \text{(Eq. 6)}$$

The remaining update functions are the same as in the standard GRU, with a slight difference: the masking vector mₜ is fed in as an additional input (Eq. 7–10). A condensed code sketch of the full GRU-D step follows the equations.

$$z_t = \sigma\left(W_z \hat{x}_t + U_z \hat{h}_{t-1} + V_z m_t + b_z\right) \quad \text{(Eq. 7)}$$

$$r_t = \sigma\left(W_r \hat{x}_t + U_r \hat{h}_{t-1} + V_r m_t + b_r\right) \quad \text{(Eq. 8)}$$

$$\tilde{h}_t = \tanh\left(W \hat{x}_t + U (r_t \odot \hat{h}_{t-1}) + V m_t + b\right) \quad \text{(Eq. 9)}$$

$$h_t = (1 - z_t) \odot \hat{h}_{t-1} + z_t \odot \tilde{h}_t \quad \text{(Eq. 10)}$$
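Putting Equations 4–10 together, here is a condensed, hypothetical sketch of one GRU-D step in PyTorch. Parameter names are my own; following the paper, the input-decay weights are kept per-variable (diagonal), which the sketch implements with an elementwise product:

```python
import torch

def gru_d_cell(x, m, delta, h_prev, x_last, x_mean, p):
    # Eq. 4, applied twice: per-variable input decay and full hidden-state decay.
    gamma_x = torch.exp(-torch.relu(p["w_gx"] * delta + p["b_gx"]))
    gamma_h = torch.exp(-torch.relu(delta @ p["W_gh"] + p["b_gh"]))
    # Eq. 5: decay a missing input from its last observation toward its empirical mean.
    x_hat = m * x + (1 - m) * (gamma_x * x_last + (1 - gamma_x) * x_mean)
    # Eq. 6: decay the previous hidden state.
    h_hat = gamma_h * h_prev
    # Eqs. 7-10: standard GRU updates, with the mask m fed in as an extra input.
    z = torch.sigmoid(x_hat @ p["W_z"] + h_hat @ p["U_z"] + m @ p["V_z"] + p["b_z"])
    r = torch.sigmoid(x_hat @ p["W_r"] + h_hat @ p["U_r"] + m @ p["V_r"] + p["b_r"])
    h_tilde = torch.tanh(x_hat @ p["W"] + (r * h_hat) @ p["U"] + m @ p["V"] + p["b"])
    return (1 - z) * h_hat + z * h_tilde
```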

Baseline

The researchers compared against most of the available methods, whether RNN-based (LSTM-Mean, GRU-Simple, GRU-Mean, etc.) or non-RNN (logistic regression (LR), support vector machines (SVM), etc.), and whether interpolation or imputation methods (including mean imputation, forward imputation, simple imputation, SoftImpute, KNN, MICE (Multiple Imputation by Chained Equations), etc.). Discussing and illustrating all of these methods would demand a separate article for each.

All inputs were normalized, and results were reported as the area under the ROC curve (AUC score) from 5-fold cross-validation. We will see GRU-D compared with the others at various model sizes.
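As a sketch of that evaluation protocol using scikit-learn (with a stand-in logistic regression and random placeholder data, not the paper's actual pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for normalized time-series features and labels.
X = np.random.randn(200, 10)
y = np.random.randint(0, 2, 200)

# 5-fold cross-validation scored by the area under the ROC curve.
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} ± {scores.std():.3f}")
```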

Results

Two real-world healthcare datasets and a synthetic dataset were used in the evaluation stage:

  1. Gesture phase segmentation dataset (Gesture), used to generate the synthetic datasets
  2. PhysioNet Challenge 2012 dataset (PhysioNet)
  3. MIMIC-III dataset (MIMIC-III)

Figure 5. Classification performance on Gesture synthetic datasets with different correlation values. [source]

From Figure 5, we can see that GRU-Simple fails when the correlation is low, while GRU-D achieves better AUC scores and more stable performance across all settings.

Table 1. Model performance measured by AUC score (mean±std) for mortality prediction [source]

Table 1 provides the calculated performance of all models on the mortality task.

Table 2. Model performances measured by average AUC score (mean±std) for multitask predictions on real datasets [source]

We can see the comparison of all models on the multitask predictions in Table 2; the results are fairly similar to those of the mortality prediction task.

Figure 6. Early prediction capacity and model scalability comparisons of GRU-D and other RNN baselines on the MIMIC-III dataset [source]

Figure 6 provides the online prediction results for the MIMIC-III mortality task; here, the RNN models are compared with three non-RNN models that are widespread in machine learning.

Conclusion

Dealing with missing values in time-series datasets is quite challenging, even with Recurrent Neural Networks. The reviewed paper proposed a new architecture based on Gated Recurrent Units (GRUs) with some modifications, which outperformed the baselines on time-series datasets with missing points. GRU-D (the proposed model) was developed for classification tasks, but it can be extended to regression tasks. I have also linked the code of the GRU-D model, which helped me understand it better and may do the same for you, or you can even use it in your projects (source code).

Main Reference

  1. Che, Z., et al. Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 2018, 8(1): 1–12.

Please note that this post is for my own future reference, to look back and review the materials on this topic. If you find any errors, please let me know. Meanwhile, you can contact me directly on Twitter here or LinkedIn here for any reason.

Finally, if you found this article interesting and useful, you can follow me on Medium to see more articles from me.
