This project was carried out as part of the TechLabs Digital Shaper Program in Aachen (Summer Term 2021).
Introduction
The COVID-19 pandemic is a coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome. After a long period of consistent research, the COVID-19 vaccine is ready for distribution in Germany. Therefore the need of the hour is an efficient and streamlined distribution of the Covid-19 vaccine. Statistics suggest an availability of roughly 1000 shots for every million people. Uniformity in distribution is significant, and this in turn calls for a detailed Time-Series analysis in order to track- and narrow down on the most impacted areas. For this project, we have focused on developing an optimized forecasting model based on the number of covid cases, population and hospital infrastructure among other relevant factors. The data inventory for the study will mostly include real-world data from different sources as well as charter a distribution based on the vaccine shot availability.
Method
Dataset
1. Kaggle Dataset
For the forecasting of the number of cases in different federal states, we finally decided to zero down on the Kaggle dataset by the Kaggle Publishers. This document contains details on the different models and the libraries that were considered in the course of the search for the perfect fit statistically. The data contains a categorization of the number of cases noted for the months beginning March of 2020 until the beginning of April of the subsequent year 2021.
1. Intensive Care Beds Dataset
We found several other datasets that we were considering using in our project, but we decided on the intensive care beds dataset from DIVI-Intensivregister. It consists of daily reports with information about the occupied and free intensive care beds in hospitals all around Germany. It also states how many of the beds are being occupied by covid patients, and how many of them are being respirated. It should be mentioned, that neither for the free intensive care beds nor for the total number of intensive care beds a stable and smooth plot could be generated. It was not possible to discern a trend from this data. Therefore, we decided to plot the respirated covid patients and use it as another metric to determine a covid vaccine distribution. We chose to use this dataset to give us insight into the severe covid cases.
Data Cleaning
The data that was being collected from the different sources mentioned above had to undergo some preprocessing before being deemed compatible with the forecasting methods. This included data cleaning wherein, irrelevant data was removed. We also performed some data modification, data replacement and data merging.
Methods Studied and Tested
- Moving Average
- Exponential Smooting
- ARIMA
- Seasonal Autoregressive Integrated Moving Average (SARIMA)
- Auto ARIMA
- Linear Optimization
Libraries Considered for Time Series Analysis
Three libraries were considered for working on the time series analysis: Prophet, sktime and statsmodels. Most of the work was done with the sktime library, since it seemed more suitable for beginners. It provides dedicated algorithms and transformation tools to efficiently work on time series regression, forecasting, and classification tasks. Sktime also contains the forecasting method AutoArima, which was employed in the final implementation. A small part of the forecasting was done with statsmodels library which included different statistical tests that were used to determine the parameters for the ARIMA models. Although the process was not automatic it showed us what steps one should go through in order to choose the best parameters.
Models
Moving Average
The Moving Average model is probably the most naive approach to time series modeling. This model simply states that the next observation is the mean of all past observations. Although simple, this model might be surprisingly good and it represents a good starting point. Otherwise, the moving average can be used to identify interesting trends in the data. We can define a window to apply the moving average model to smooth the time series and highlight different trends.
Exponential Smoothing
Exponential Smoothing uses similar logic to moving average, but this time, different decreasing weight is assigned to each observation. In other words, less importance is given to observations as we move further from the present. Exponential Smoothing forecasting methods are similar in that a prediction is a weighted sum of past observations, but the model explicitly uses an exponentially decreasing weight for past observations.
There are three main types of Exponential Smoothing time series forecasting methods. A simple method that assumes no systematic structure, an extension that explicitly handles trends, and the most advanced approach that adds support for seasonality.
ARIMA
ARIMA models provide another approach to time series forecasting. Exponential Smoothing and ARIMA models are the two most widely used approaches to time series forecasting and provide complementary approaches to the problem. While Exponential Smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data. A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times.
Seasonal Autoregressive Integrated Moving Average (SARIMA)
ARIMA is a method can handle data with a trend but it does not support time series with a seasonal component. An extension to ARIMA that supports the direct modeling of the seasonal component of the series is called SARIMA. SARIMA is the combination of simpler models to make a complex model that can model time series exhibiting non-stationary properties and seasonality.
AutoARIMA
The advantage of using the extension AutoARIMA over the ARIMA model is that after data preprocessing steps, we can skip the next steps & directly fit our model. The auto ARIMA function in R uses a variation of the Hyndman-Khandakar algorithm (Hyndman & Khandakar, 2008), which combines unit root tests, minimization of the AICc and MLE to generate optimal parameters (p,d,q) so that the ARIMA model fits best to given data.
Linear Optimization
Linear programming is a set of techniques used in mathematical programming, sometimes called mathematical optimization, to solve such equations while maximizing or minimizing some linear function. Linear programming is precise, relatively fast, and suitable for a range of practical applications. The basic method for solving linear programming problems is called the simplex method. We used the following Python packages for the technique: SciPy, which is a general-purpose package for scientific computing with Python; and PuLP which is a Python linear programming API for defining problems and invoking external solvers.
For our distribution optimization, we used the results from our 14-day forecasts in combination with population. To connect the inputs we defined a constraint for every federal state in Germany plus one constraint to limit the number of total vaccines available in Germany. The number of vaccines required by each federal state is defined by the number of risk patients, active covid cases and the number of occupied intensive care beds by covid patients in an inequality constraint. As a result, we have a [17x16] matrix for all constraints including the vaccine limitation constraint. The individual metrics are scaled by 3 separate scales in the constraints. As of right now, arbitrary scales have been chosen between 0 and 1. Risk patients scale as the biggest factor was given 0.5, active covid patients was given 0.1 and the intensive care beds was given 1 as this number is by far the smallest one. By increasing the weight the influence of this metric was improved. For a better metric scaling further research has to be conducted. As a result, the Linear Optimization shows the relative share of available vaccines the respective federal state will get.
Results
Kaggle Dataset Results
Since we had no prior experience with forecasting models for time series, we started the implementation process by testing out different forecasting models in order to find the best fitting model. After some research, we narrowed it down to two possible models: ARIMA and Exponential Smoothing. After a short exploration time frame, we decided on using ARIMA since it seemed easier to handle and also produced better outcomes. As the framework for comparing the models, we made use of the ‘mean absolute percentage error’ (MAPE). The AutoARIMA extension allows several possible restrictions to find the optimal parameters for the given time series. We made no restrictions regarding the order of the model and allowed the optimization to use a seasonal model approach. While using the optimization we observed a much higher calculation time for the different seasonalities, but no real benefit resultwise. For some federal, states the results were very close to each other, in some cases resulting parameters were non-seasonal, even if seasonality was allowed. This is why we chose to use non seasonal AutoARIMA in the end. After working on the forecasts for some time, we made two other observations: First, the shorter the forecasting period, the more accurate became our results. Secondly, we noticed an increasing accuracy when focussing on the cases for a whole federal state, rather than single counties. See below an exemplary plot that depicts the development of Covid cases in the state of Berlin. As you can see, our 14-day-forecast performs well when compared with the test data.
See Results [1] |
Intensive Care Bed Dataset Results
Intensive care bed datasets were forecasted using Exponential Smoothing and ARIMA models. The forecasting results were evaluated with 2 different error metrics: MAE (mean average error) and SMAPE (symmetric mean average percentage error). MAE was utilized in order to compare the results of the Exponential Smoothing model and the ARIMA model because of the easy readability. SMAPE is used to rate the forecasting quality because percentage errors scale with sample size and therefore are better compatible with various sample sizes. It is worth mentioning that the data sets have a similar progression for every federal state of Germany. During the summer the respirated corona cases drop to nearly zero for every federal state and experience an increase during the winter period.
See Results [2] |
For Exponential Smoothing SMAPE ranges from 0.415 to 0.069 and MAE from 42.201 to 2.290. As mentioned beforehand, the forecasts from smaller federal states such as Hamburg, Saarland and Schleswig Holstein performed worse with an average SMAPE of 0.322 than bigger federal states such as North-Rhine Westphalia, Bavaria and Hessen with an average SMAPE of 0.13. To summarize, Exponential Smoothing forecasts on smaller federal states are more unreliable than for more populated federal states but have to be evaluated on an individual basis. In the following graph, the Exponential Smoothing forecast for Berlin can be seen including the test data set in yellow and train data set in green. The forecast shows an overall good result even with a high MAE. This is due to the high number of overall intensive care beds relative to the error.
See Results [3] |
The ARIMA model performed on average better with SMAPE range from 0.47 to 0.04 and MAE from 23.52 to 2.84. Also here the population of the federal states caused a big impact on results with smaller federal states delivering unreliable results and big federal states delivering good results. The quality of the results also depended on that if the forecast was made on a day just before a big change, which would lead to not so accurate results. The following is the forecast for Berlin in the same timeframe. The green shadow shows the confidence interval.
Vaccine Distribution Result
In the following table, the vaccine distribution for each federal state in Germany is depicted for the 31st of March 2021. As the covid cases and intensive care beds numbers are changing on a daily basis, we decided to automate the metrics, such that the data sets are updated daily. It is explained in detail in the next section. The vaccine distribution is given as the percentage of the total number of available covid vaccines in Germany. Big federal states such as Baden Wuerttemberg, Bavaria and North-Rhine Westphalia are the most populated federal states and also have the highest number of risk patients and covid cases. Therefore, it is logical that these federal states get a bigger share of the vaccines than smaller federal states e.g. Bremen.
See Results [4] |
Automation
After we were done with all the stages of our project we decided to automate the whole process, so we could get the latest forecast with one execution of the script, and not having to execute each script separately. Every day there were updates to our datasets and it was getting quite annoying to download them manually, so we have implemented a web scraper that would download the latest version, or in the case of intensive care beds dataset it would download the data for each missing day separately, as it wasn’t concatenated in a single file. After that, the new data would be concatenated with the old dataset. In the next stage, both datasets go through data cleaning and subsetting the data into federal state subsets. Then a forecast is made with each dataset for each federal state and combined with demographic information, which is then used by the linear optimization method to calculate the distribution of vaccines across the federal states.
Problems Faced
ARIMA Model for Intensive Care Beds Dataset
Choosing the right parameters for the ARIMA model and finding the correct seasonality. We didn’t use AutoARIMA here, so we had to go through the process manually. At first, we used ACF and PACF plots in order to choose the model parameters, but it was time consuming because one had to visually determine the parameters from the plots, and this had to be done for each federal state separately. On top of that, it wasn’t that easy to determine it in our case. After that, we chose to use the Akaike information criterion (AIC) which is considered a good statistic for forecasting models. As we continued to do the parameter estimation manually it took a lot of time to train each model, so we made use of parallel computing to speed up the process. Some of us had basic knowledge of HPC (High-Performance Computing) from previous courses but used C++ back then. In Python, it was a bit different and we needed some time to make it work. In the end, we could not fully achieve what we wanted but still were able to determine the parameters 5 times faster.
Linear Simplex Method
It was difficult to assign the scale parameters for the method. The data suggested that there were no constraints that related the federal states with each other. For instance, the number of Vaccines that were required by NRW did not have any effect on the number of vaccines required by Bayern. However, only the total number of vaccines were affected. Another problem that was faced was to define how metrics could be weighted and calculated. To find a scale that somehow connects the other metrics to the current covid cases in order to embed them in scale2. When the Simplex optimization was run with a fixed number of covid vaccines, say ‘10000’, it distributed everything normally and put all the Residuals/leftover vaccines in the assigned first federal state. This was primarily caused due to an equality constraint that stated: 10000 = sum of all covid vaccines required by the federal states. On changing the equality constraint to an inequality constraint that would say ‘10000’ > sum of all X, the downsides were that the percentages of the X could not be read easily. On the other hand when the number of available vaccines was assigned to be too low then it gave an error that there were no feasible solutions.
Webscraper
Our first approach was to use a list of the links of each daily report and then using the urllib library to automatically download the csv file from each link. One would take the first link and then change only the date part. That failed because apart from the default part of the link, there was also a number code that turned out to be inconsistent. Then we tried the selenium library. It uses a web browser just like a real human would. You need to find each button on the website by referring to the HTML/CSS code behind it. That caused some problems as neither of us had any prior knowledge of HTML. After many trial and errors, it finally looked like it was going to work and in the end, we have a decent working web scraper.
Automation
Dealing with paths caused us some problems in the automation of the process. To make it work from any computer (path formats are different in Windows, Mac and Linux) we used the pathlib library which turned out to be much better than os.path. We used relative paths to navigate, export and import data from the data directory in our project. Each script (cleaning, subsetting, forecasting etc.) defined internally its current path with Path.cwd() but when you import these scripts as modules in the automation script, the path that is returned is the one from which the automation script is. That caused a lot of problems as nothing seemed to work. In the end, the solution was to define the current path of the separate scripts as Path(file).parent .
Conclusion
The project aimed at developing a technique that would not only help in an efficient distribution of vaccines in Germany but also point out the most impacted cities based on different forecasting models. Using an empirical testing method, we were able to select the appropriate forecasting model for the two relevant datasets that have been used for this project. At the end of the development and testing phase, we were able to successfully trace the upward or downward shift of number of cases in different federal states. However, considering the extremely dynamic state of the number of COVID cases in Germany, we proceeded with developing a technique that would automatically update the dataset with the number of cases in different federal states and carry out the forecasting and distribution of vaccines. Nevertheless, during the course of development, we realized that the following points if remedied, can provide a scope for an improvement of the functionality of the method.
- With more data specific to the COVID cases being available in the future, forecasting can be improved.
- By having an incremental model, there will no need to perform training from the beginning, while at the same time a dynamic model would result in having new parameters for the forecastings for each incremental improvement.
- Deeper research on how to better determine the metric’s scales for the simplex method.
- After building the automated machine learning pipeline, a new software engineering perspective would make it more compliant to the software development principles and thus avoid unnecessary errors.
References
- Source : Kaggle, Section: How to Use Kaggle — Datasets, Link : https://www.kaggle.com/docs/datasets , Publisher : Kaggle Team, DOP : Sep-2020
- Source : DIVI Intensivregister Tagesreport-Archiv : https://www.divi.de/divi-intensivregister-tagesreport-archiv-csv?layout=table&start=0
- Source : Time Series : Time Series Forecasting With Prophet in Python, Section: Prophet Forecasting Library, Link : https://machinelearningmastery.com/time-series-forecasting-with-prophet-in-python/, Publisher : Jason Brownlee, DOP : Aug-26–2020
- Source : Time Series : Time Series Forecasting With Prophet in Python, Section: Prophet Forecasting Library, Link : https://machinelearningmastery.com/time-series-forecasting-with-prophet-in-python/, Publisher : Jason Brownlee, DOP : Aug-26–2020
- Source : Towards Data Science : The Complete Guide to Time Series Analysis and Forecasting, Section: Exponential Smoothing, Link : https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775 , Publisher : Marco Peixeiro, DOP : Aug-07–2019
- Source : Time Series : A Gentle Introduction to Exponential Smoothing for Time Series Forecasting in Python, Section: Exponential Smoothing, Link: https://machinelearningmastery.com/exponential-smoothing-for-time-series-forecasting-in-python/, Publisher : Jason Brownlee, DOP : Aug-20–2018
- Source: Towards Data Science, Section: The Complete Guide to Time Series Analysis and Forecasting, Link: https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775, Publisher: Marco Peixeiro, DOP: Aug 7, 2019
- Source : FORECASTING: PRINCIPLES AND PRACTICES 2nd Edition, Section: ARIMA modelling in R, https://otexts.com/fpp2/arima-r.html, Publisher: Rob J Hyndman, George Athanasopoulos