Historically, infectious diseases have greatly impacted human health, necessitating a robust understanding of their trends, processes, and transmission. This study focuses on the COVID-19 pandemic, employing mathematical, statistical, and machine-learning methods to examine its time-series data. We quantify data irregularity using approximate entropy, revealing higher volatility in the U.S., Italy, and India compared to China. We employ the Dynamic Time Warping algorithm to assess regional similarity, finding a strong similarity between the U.S. and Italy. Seasonal-Trend decomposition using Loess (STL) reveals strong trends in all observed regions, and the results indicate that China's prevention measures were markedly effective. These tools, whilst already valuable, still present opportunities for development in both theory and practice.
Throughout history, infectious diseases have persistently posed significant threats to human life. Despite societal advancement during the 20th century, numerous infectious diseases continue to jeopardize global health. In particular, the first documented case of COVID-19, a disease caused by a novel coronavirus, emerged in Wuhan, Hubei Province, China, in December 2019. Characterized by their high transmissibility, severe health implications, and unpredictable epidemic timing, coronaviruses represent a constant menace to human well-being.
Amidst the COVID-19 pandemic, the role of data science has become increasingly prominent, serving as an invaluable tool in mitigating the crisis. This study aims to harness advanced methodologies to deeply explore the temporal characteristics of infectious diseases and discern their propagation dynamics. Gaining these insights is critical for early intervention and the establishment of effective preventive measures against similar infectious diseases in the future.
Analytical tools such as approximate entropy, the Dynamic Time Warping (DTW) algorithm, and Seasonal-Trend decomposition using Loess (STL) offer promising avenues for time-series analysis. However, their application to infectious disease time-series data remains limited. In this study, we adopt the COVID-19 pandemic as a case study, employing approximate entropy to measure the complexity of infectious disease time series. Additionally, we utilize the DTW algorithm to evaluate similarity characteristics and the STL algorithm for time series decomposition and trend feature extraction.
The analysis considers the number of confirmed COVID-19 cases and associated fatalities in China, the United States, Italy, and India, as well as nine severely affected cities within China. Data were sourced from the World Health Organization, the Johns Hopkins University real-time surveillance system, and the Chinese Health Commission. This comprehensive and nuanced approach allows us to delve into the heart of infectious disease dynamics, offering a valuable foundation for future prevention and control measures.
A variety of quantitative methodologies from statistics, mathematics, infectious disease modeling, and other fields have been employed by researchers to analyze time series data on the spread of infectious diseases. Two core components of such analyses are curve fitting to capture the trajectory of an outbreak and time series forecasting to predict future case numbers. For example, Aryatama et al. [1] proposed using the SPCIRD model to anticipate the spread of COVID-19 in Indonesia while accounting for public compliance with containment measures, employing the Levenberg Marquardt curve fitting approach to develop an accurate model. De Silva and Abeysundara [2] utilized functional data analysis techniques to model and examine the spread dynamics of the first wave of COVID-19 globally and in Asia. Based on daily reported data from Iran, Gholami et al. [3] compared three continuous distributions—normal, lognormal, and Weibull—to model the distributions of COVID-19 cases and fatalities.
Epidemiological compartmental models, autoregressive models, Kalman filtering, and artificial intelligence have also been applied for real-time prediction of COVID-19 [4]. Kogan et al. [5] proposed an early warning system using multiple digital traces of COVID-19 activity for close to real-time monitoring. Some have successfully predicted the spread of COVID-19 using ARMA models or a combination of ARMA and other models [6], [7].
Many researchers have employed machine learning and deep learning techniques rather than traditional time series analysis approaches for prediction. According to Tamang et al. [8], artificial neural networks are well suited to handling large data sets and uncovering patterns and trends for prediction. They forecast new COVID-19 cases and deaths in India, the US, France, and the UK using artificial neural network-based curve fitting, accounting for patterns seen in China and South Korea. Comparative studies have also been conducted [9]. Chandra et al. [10] found that recurrent neural networks, a type of deep learning model, are well suited to modeling spatiotemporal sequences. They used popular recurrent architectures such as LSTMs to conduct multi-step prediction of COVID-19 spread in Indian states [11], [12], [13].
For statistical analysis of epidemiological characteristics, some have calculated peak and trough amplitudes, disease selectivity, oscillation intensity, and other metrics before and after epidemics, while others have performed straightforward correlation analyses. Yang et al. [14] noted that China’s approach to infectious disease prevention and control has changed significantly since the 2003 SARS outbreak, though few studies had examined trends and epidemiological characteristics. Ho et al. [15] observed that exponential growth culminating in a peak is typical of infectious disease epidemics, including coronavirus outbreaks. They used item response theory to identify changes in each country or region, then described inflection point characteristics on a time bar graph to locate the COVID-19 inflection point. Zhao et al. [16] calculated peak-valley amplitudes, disease selectivity, preferred outbreak time, and oscillation intensity for 23 notifiable diseases in China from 2017 to 2021, finding changes before and after epidemics [16], [17].
Approximate entropy, DTW, and STL have rarely been used in infectious disease research. Schneckenreither et al. [18] proposed an effective aggregation dispersion index using approximate entropy for disease research. Rodriguez et al. [19] used deep neural networks and approximate entropy as a tool for early diagnosis of Chagas disease and cardiac damage [20]. Regarding DTW, Dallas et al. [21] hypothesized that geographically proximate locations may have similar infectious disease dynamics. They used DTW to analyze the importance of distance, population size, and age structure in determining similarities between U.S. counties’ COVID-19 epidemics. STL has mainly been applied in meteorology, environment, and finance. He et al. [22] combined STL and machine learning to predict rainfall time series. Jainonthee et al. [23] used STL to determine the seasonality of two diseases. Zhao et al. [24] used STL to analyze influenza seasonality in China, then compared SARIMA, SARIMA-LSTM, and SSA-SARIMA-LSTM models for prediction.
In summary, researchers have studied the time series attributes of infectious diseases using qualitative and quantitative techniques, providing crucial guidance for effective pandemic prevention and control. The use of approaches like approximate entropy, DTW, and STL for infectious diseases remains conceptually and empirically underexplored. Further research on integrating these methods with infectious disease modeling is warranted.
In this study, the complexity of time series data associated with infectious diseases, with a particular focus on the COVID-19 pandemic, was explored using approximate entropy. This measure, introduced by Steven M. Pincus in 1991, quantifies the complexity of signal sequences. A key strength of the approach is its robustness on small data sets. This study benefits from that property: most measured time series are long enough to satisfy its requirements, so the resulting estimates are stable and credible.
To ascertain the similarity between time series data, we employ the DTW algorithm. This algorithm uses dynamic programming to nonlinearly align time series data, thereby facilitating the accurate computation of similarity between distinct time sequences. The DTW methodology used in this research was provided by the DTAI Research Group at the University of Leuven and is available open-source.
Beyond measuring complexity and similarity in time series data, this study also investigates trends, periodicity, and seasonality. To achieve this, Seasonal-Trend decomposition based on Loess (STL) was applied to decompose time series data related to COVID-19 from various countries and regions. This analysis facilitates a thorough exploration of trend characteristics inherent in the outbreak patterns of COVID-19.
Collectively, these three methodologies offer valuable insights into the temporal transmission characteristics and patterns of infectious diseases, such as COVID-19. The data used in this study primarily consist of time series of daily new confirmed COVID-19 cases in different countries and regions, from the onset of the outbreak to when each respective area had the pandemic under control.
The aforementioned methodologies were implemented in Python. Combining them gives the study a multifaceted view of the temporal dynamics of infectious diseases, in particular the ongoing COVID-19 pandemic.
Approximate entropy serves as a critical metric in assessing the complexity and irregularity inherent in time series data. It operates on the principle of conditional probability to ascertain the propensity for the emergence of new information within time series data. This effectively encapsulates the likelihood of developing novel subseries within the overall time series data set [25]. The procedural steps to calculate approximate entropy are as follows:
Step 1: Suppose we have a time series of length $N$: $u(1), u(2), u(3), \ldots, u(N)$. Set a threshold value $r$, the tolerance used for similarity comparison. Then choose an embedding dimension $m$, the length of the subsequences into which the series is divided.
Step 2: By reconstructing the original sequence, obtain the subsequences $X(1), X(2), X(3), \ldots, X(N-m+1)$, where each subsequence is $X(i)=[u(i), u(i+1), \ldots, u(i+m-1)]$.
Step 3: Calculate the distance $d_m[X(i), X(j)]$ between any two reconstructed vectors $X(i)$ and $X(j)$, $1 \leq i, j \leq N-m+1$. The distance is the maximum absolute difference of the corresponding elements in the two vectors, and the case $i=j$ is included:
$d_m[X(i), X(j)]=\max _{0 \leq k \leq m-1}|u(i+k)-u(j+k)|$
Step 4: For each $X(i)$, count the number of vectors $X(j)$ whose distance from $X(i)$ does not exceed $r$, and take the ratio of this count to the total number of vectors:
$C_i^m(r)=\frac{\#\left\{j: d_m[X(i), X(j)] \leq r\right\}}{N-m+1}$
This process is called the template matching process of $X(i)$; $C_i^m(r)$ represents the matching probability between any $X(j)$ and the template $X(i)$.
Step 5: Define the average similarity rate for embedding dimension $m$:
$\Phi_m(r)=\frac{1}{N-m+1} \sum_{i=1}^{N-m+1} \ln C_i^m(r)$
Step 6: Repeat Steps 1 through 5 with embedding dimension $m+1$ to calculate the average similarity rate $\Phi_{m+1}(r)$.
Step 7: Calculate the approximate entropy:
$\operatorname{ApEn}(m, r, N)=\Phi_m(r)-\Phi_{m+1}(r)$
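The ApEn_alpha.R program uses the equivalent expanded form of Step 7, obtained by substituting the definition of the average similarity rate:
$\operatorname{ApEn}(m, r, N)=\frac{1}{N-m+1} \sum_{i=1}^{N-m+1} \ln C_i^m(r)-\frac{1}{N-m} \sum_{i=1}^{N-m} \ln C_i^{m+1}(r)$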
It's important to note that for the selection of m and r, m is typically chosen as 2 or 3, while r is chosen based on the actual application scenario. In this study, r is selected to be 0.2, meaning it is 0.2 times the standard deviation of the original time series.
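The steps above can be sketched in a few lines of pure Python. This is a minimal illustration that assumes the standard definition of approximate entropy (Chebyshev distance, self-matches counted, $\operatorname{ApEn}=\Phi_m(r)-\Phi_{m+1}(r)$) and the population standard deviation for $r$. The function name `approximate_entropy` and the parameter `r_factor` are hypothetical and are not taken from the study's own code.

```python
import math
from statistics import pstdev


def approximate_entropy(u, m=2, r_factor=0.2):
    """Approximate entropy ApEn(m, r, N) with r = r_factor * std(u)."""
    n = len(u)
    # Tolerance r; using the population standard deviation is an assumption.
    r = r_factor * pstdev(u)

    def phi(dim):
        # Step 2: embed the series into vectors X(i) of length `dim`.
        vectors = [u[i:i + dim] for i in range(n - dim + 1)]
        total = 0.0
        for x_i in vectors:
            # Steps 3-4: Chebyshev distance and fraction of matching vectors.
            # The self-match is included, so the count is never zero.
            matches = sum(
                1 for x_j in vectors
                if max(abs(a - b) for a, b in zip(x_i, x_j)) <= r
            )
            total += math.log(matches / len(vectors))
        # Step 5: average log matching probability.
        return total / len(vectors)

    # Step 7: ApEn = Phi_m(r) - Phi_{m+1}(r).
    return phi(m) - phi(m + 1)


# A strictly alternating series is highly regular, so its ApEn is close to zero.
regular_series = [1.0, 2.0] * 20
print(approximate_entropy(regular_series))
```

The double loop makes this sketch quadratic in the series length, which is acceptable for daily case counts but would need vectorization for much longer series.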
Each point in a time series must be mapped to one or more points in another time series for the series to be aligned along the timeline. This is the underlying mechanism of the dynamic time-warping (DTW) algorithm. Dynamic programming is a technique that can be used to determine the optimal mapping [26].
As shown in Figure 1, the mapping between time series points is no longer one-to-one; rather, it allows one-to-many and many-to-one relationships. The right panel of Figure 1 depicts two time series represented by the green and blue lines; similar points between the two sequences are connected by the red lines. The DTW algorithm measures the similarity between time series using the total distance between these matched points, known as the warping path distance.
It should be noted that relevant parameters, such as time series length, window size, etc., should be chosen according to the actual data when utilizing the DTW method for time series similarity analysis. Due to the high computational complexity of the DTW algorithm, it may be necessary to optimize or choose different algorithms for large-scale data sets.
Suppose there are two time series $A$ and $B$, with lengths $n$ and $m$, respectively. Comparing the two sequences with an $n \times m$ matrix, the warping path $P=\left\{p_1, \ldots, p_s, \ldots, p_S\right\}$ passes through this matrix. The $s$-th element of the warping path is $p_s=\left(i_s, j_s\right)$, where $i_s$ and $j_s$ index the matched points of $A$ and $B$, respectively.
The goal of the DTW algorithm is to find an optimal warping path $P$. This is essentially an optimization problem and can be expressed mathematically as $\operatorname{DTW}(A, B)=\min _P \sum_{s=1}^S d\left(p_s\right)$, where $d\left(p_s\right)$ is the distance between the two points matched by $p_s$. The warping path must satisfy three conditions:
Condition 1: Boundary conditions. $p_1=(1,1)$ and $p_S=(n, m)$ mean that the beginning and end of the two sequences must match. The path starts from the bottom left and ends at the top right, ensuring that the entire sequence is taken into account.
Condition 2: Continuity. If $p_s=(a, b)$ and $p_{s-1}=\left(a^{\prime}, b^{\prime}\right)$, then $a-a^{\prime} \leq 1$ and $b-b^{\prime} \leq 1$ must be satisfied. This constraint means that, in many-to-one and one-to-many matches, a point can only be matched with points at adjacent time steps; the path cannot skip over a point. This keeps the warping path free of jumps and ensures that every coordinate in $A$ and $B$ appears in the warping path.
Condition 3: Monotonicity. If $p_s=(a, b)$ and $p_{s-1}=\left(a^{\prime}, b^{\prime}\right)$, then $a-a^{\prime} \geq 0$ and $b-b^{\prime} \geq 0$ must be satisfied, indicating that the warping path does not take a backward path. This ensures that the warping path never returns to an earlier point and advances monotonically over time.
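The monotonicity and continuity of the warping path $P$ mean that it can only proceed in three ways: one space to the right, one space up, or one space diagonally up to the right. Together with the boundary conditions, finding the optimal path becomes a dynamic programming problem. Let $\gamma(i, j)$ denote the minimum cumulative distance of a warping path from $(1,1)$ to $(i, j)$; then
$\gamma(i, j)=d\left(a_i, b_j\right)+\min \{\gamma(i-1, j-1), \gamma(i-1, j), \gamma(i, j-1)\}$,
where $d\left(a_i, b_j\right)$ is the distance between the $i$-th point of $A$ and the $j$-th point of $B$, and $\operatorname{DTW}(A, B)=\gamma(n, m)$.
Therefore, the similarity between the two time series $A$ and $B$ is obtained by the DTW algorithm (Figure 2).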
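The dynamic programming formulation can be illustrated with a short pure-Python sketch. It assumes the standard cumulative-cost recurrence and an absolute-difference local cost. The function name `dtw_distance` and the optional `window` argument (a Sakoe-Chiba band, one common way to limit the window size) are hypothetical; this is not the DTAI implementation used in the study, which may use a different local cost and normalization.

```python
import math


def dtw_distance(a, b, window=None):
    """Cumulative DTW cost between sequences a and b (absolute-difference cost)."""
    n, m = len(a), len(b)
    # Optional band around the diagonal, widened so that (n, m) stays reachable.
    w = max(n, m) if window is None else max(window, abs(n - m))
    gamma = [[math.inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0  # boundary condition: the path starts at (1, 1)
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            gamma[i][j] = cost + min(
                gamma[i - 1][j - 1],  # diagonal step
                gamma[i - 1][j],      # advance in A only
                gamma[i][j - 1],      # advance in B only
            )
    return gamma[n][m]  # boundary condition: the path ends at (n, m)


# A repeated point in the second series is absorbed by the warping.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

The full table costs on the order of $n \times m$ operations; restricting `window` trades some alignment flexibility for speed on long series.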
Seasonal Trend decomposition procedure based on Loess (STL) is a popular algorithm used in time series decomposition. It breaks down data $Y_v$ into a trend component, a seasonal component, and a residual component [27]. In other words, $Y_v=T_v+S_v+R_v, v=1, \cdots, N$. STL comprises an inner and outer cycle, with the inner cycle primarily used for trend fitting and seasonal component calculation.
Let $T_v^{(k)}$ and $S_v^{(k)}$ denote the trend and seasonal components at the end of the $k$th pass of the inner cycle; the trend is initialized to zero, $T_v^{(0)}=0$. Several parameters are defined as follows:
$n_{(i)}$ is the number of inner cycles;
$n_{(o)}$ is the number of outer cycles;
$n_{(p)}$ is the number of samples in a period;
$n_{(s)}$ is the smoothing parameter described in Step 2;
$n_{(l)}$ is the smoothing parameter described in Step 3;
$n_{(t)}$ is the smoothing parameter described in Step 6.
Sample points at the same position in each cycle form a subsequence, referred to as a cycle-subseries. There are a total of $n_{(p)}$ such subsequences. The inner cycle is mainly divided into the following 6 steps:
Step 1: Detrending. Subtract the trend component of the previous iteration, $Y_v-T_v^{(k)}$;
Step 2: Cycle-subseries smoothing. Smooth each cycle-subseries by $\operatorname{LOESS}\left(q=n_{(s)}, d=1\right)$ regression, extending it one period forward and one period backward. The smoothed values together form a temporary seasonal series, denoted $C_v^{(k+1)}, v=-n_{(p)}+1, \cdots, N+n_{(p)}$;
Step 3: Low-pass filtering of the smoothed cycle-subseries. Apply moving averages of lengths $n_{(p)}$, $n_{(p)}$, and 3 in turn to $C_v^{(k+1)}$, followed by a $\operatorname{LOESS}\left(q=n_{(l)}, d=1\right)$ regression. The resulting series $L_v^{(k+1)}, v=1, \cdots, N$ captures the low-frequency component of the smoothed cycle-subseries;
Step 4: Detrending of the smoothed cycle-subseries. The seasonal component is $S_v^{(k+1)}=C_v^{(k+1)}-L_v^{(k+1)}$ for $v=1, \cdots, N$;
Step 5: Deseasonalizing. Subtract the seasonal component, $Y_v-S_v^{(k+1)}$;
Step 6: Trend smoothing. Apply a $\operatorname{LOESS}\left(q=n_{(t)}, d=1\right)$ regression to the deseasonalized series to obtain the trend component $T_v^{(k+1)}$.
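Two of the bookkeeping operations in the inner cycle can be sketched without a LOESS implementation: the extraction of cycle-subseries and the chain of moving averages in the low-pass filter. The sketch assumes the moving-average lengths $n_{(p)}$, $n_{(p)}$, and 3 from the original STL procedure. The helper names are hypothetical, the LOESS regressions are deliberately omitted, and the toy input stands in for real data.

```python
def cycle_subseries(y, n_p):
    """Split y into n_p cycle-subseries (same position in every period)."""
    return [y[k::n_p] for k in range(n_p)]


def moving_average(x, length):
    """Simple moving average; the output is length - 1 points shorter."""
    return [sum(x[i:i + length]) / length for i in range(len(x) - length + 1)]


def low_pass_moving_averages(c, n_p):
    """Moving averages of lengths n_p, n_p and 3 applied in turn (before LOESS)."""
    return moving_average(moving_average(moving_average(c, n_p), n_p), 3)


# Toy weekly example: N = 28 daily values, period n_p = 7.
y = [float(v % 7) for v in range(28)]
subseries = cycle_subseries(y, 7)          # 7 subseries of 4 points each
c = [0.0] * (len(y) + 2 * 7)               # placeholder for the extended series C_v
smoothed = low_pass_moving_averages(c, 7)  # back to N points
print(len(subseries), len(smoothed))       # 7 28
```

The length bookkeeping shows why the cycle-subseries are extended by one period at each end: the three moving averages remove exactly $2 n_{(p)}$ points, so the filtered series again has $N$ values.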
The outer cycle is primarily used to adjust the robustness weights. If there are outliers in the data series, the corresponding residuals will be large. Define $h=6 \operatorname{median}\left(\left|R_v\right|\right)$; the robustness weight of the data point at position $v$ is $\rho_v=B\left(\left|R_v\right| / h\right)$, where $B$ is the bisquare function:
$B(u)=\left\{\begin{array}{cl}\left(1-u^2\right)^2 & \text { for } \quad 0 \leq u<1 \\ 0 & \text { for } \quad u \geq 1\end{array}\right.$
Then, in each subsequent iteration of the inner cycle, the neighborhood weights used in the regressions of Step 2 and Step 6 are multiplied by $\rho_v$ to reduce the influence of outliers on the regression.
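The robustness weights can be computed directly from the residuals, as the following minimal sketch shows. It follows the definitions $h=6 \operatorname{median}(|R_v|)$ and $\rho_v=B(|R_v|/h)$; the function names are hypothetical, and the handling of the degenerate case $h=0$ is an assumption made only to keep the example well defined.

```python
from statistics import median


def bisquare(u):
    """Bisquare function B(u) = (1 - u^2)^2 for 0 <= u < 1, else 0."""
    return (1.0 - u * u) ** 2 if 0.0 <= u < 1.0 else 0.0


def robustness_weights(residuals):
    """Weights rho_v = B(|R_v| / h) with h = 6 * median(|R_v|)."""
    abs_res = [abs(r) for r in residuals]
    h = 6.0 * median(abs_res)
    if h == 0.0:
        # Degenerate case (assumption): keep exact fits, drop everything else.
        return [1.0 if r == 0.0 else 0.0 for r in abs_res]
    return [bisquare(r / h) for r in abs_res]


# The last residual is an outlier and receives weight 0.
print(robustness_weights([0.1, -0.1, 0.1, -0.1, 5.0]))
```

Points with small residuals keep weights close to 1, while a point whose residual exceeds six times the median absolute residual is ignored entirely in the next pass.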
Table 1 reveals that among the four countries, the approximate entropy calculation for the time series of daily new COVID-19 cases in the United States (0.5104) is significantly higher than that of China (0.0640). Likewise, the entropy values for Italy (0.2201) and India (0.3144) are also significantly higher than China's. This suggests that the course of COVID-19 in these three countries has been relatively more unpredictable and volatile.
Among the nine cities in China, Chengdu (0.2523) and Harbin (0.2588) have the highest approximate entropy values, though these are still lower than those of the United States and India. Overall, the time series of daily new COVID-19 cases in China is considerably less complex and irregular than those of the United States, Italy, and India. This is likely due to China's timely implementation of effective, scientific, and reasonable preventive measures, which have yielded significant results.
Country or city | $\Phi_m(r)$ | $\Phi_{m+1}(r)$ | $ApEn$ |
China | -0.7531 | -0.8171 | 0.0640 |
America | -2.2338 | -2.7442 | 0.5104 |
Italy | -1.7777 | -1.9978 | 0.2201 |
India | -2.0425 | -2.3569 | 0.3144 |
Chengdu | -1.0668 | -1.3191 | 0.2523 |
Wuhan | -0.2869 | -0.3154 | 0.0285 |
Hangzhou | -0.7877 | -0.9403 | 0.1526 |
Guangzhou | -0.3794 | -0.3981 | 0.0187 |
Urumqi | -0.7994 | -0.9064 | 0.1070 |
Harbin | -1.2781 | -1.5369 | 0.2588 |
Xi'an | -0.7038 | -0.8648 | 0.1610 |
Beijing | -0.4857 | -0.5227 | 0.0369 |
Shanghai | -0.3608 | -0.3866 | 0.0259 |
This study employs the DTW algorithm to perform pairwise calculations on epidemic time series data from various nations and regions. The findings suggest that Wuhan's epidemic closely resembles the national epidemic in China, as depicted in Figure 3 (left), which reflects the city's role as the initial epicenter of the outbreak. Among the four countries, only the United States and Italy show a strong similarity to each other (Figure 3 (right)).
No epidemic arises without cause or warning. A case in point is Italy, which was severely affected by COVID-19. Some may attribute this to government inefficiency or societal factors, but these are common across Europe and the Western world. According to real-time statistics from Johns Hopkins University, as of 11 a.m. Beijing time on April 2, 2020, the U.S. had 215,417 confirmed cases and 5,116 deaths. The U.S. was the first country to surpass 200,000 confirmed cases, which made it a major center of the global outbreak at that time.
The STL algorithm is used to decompose the time series of infectious diseases, effectively revealing trend, periodic, and seasonal characteristics. This study selected four countries and nine cities in China severely affected by COVID-19 for STL decomposition of time series data for daily new cases and deaths.
Time series data decompose into trend ($T$), seasonal ($S$), and residual ($R$) components: $y_t=T_t+S_t+R_t$. For strongly trending data, the seasonally adjusted series $T_t+R_t$ should vary much more than the residual component, so $\operatorname{Var}\left(R_t\right) / \operatorname{Var}\left(T_t+R_t\right)$ is small. For time series with weak or no trend, the two variances should be comparable. Therefore, the trend strength is defined as:
$F_T=\max \left(0,1-\frac{\operatorname{Var}\left(R_t\right)}{\operatorname{Var}\left(T_t+R_t\right)}\right)$
This yields a measure of trend strength between 0 and 1. Similarly, the intensity of seasonality is defined using data after trend adjustment:
$F_S=\max \left(0,1-\frac{\operatorname{Var}\left(R_t\right)}{\operatorname{Var}\left(S_t+R_t\right)}\right)$
When $F_S$ is close to 0, there is almost no seasonality in the sequence; when $F_S$ is close to 1, $\operatorname{Var}\left(R_t\right)$ is much smaller than $\operatorname{Var}\left(S_t+R_t\right)$, indicating strong seasonality.
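Given the three components of a decomposition, both strength measures reduce to a ratio of variances. The sketch below assumes the usual form $F=\max(0,1-\operatorname{Var}(R_t)/\operatorname{Var}(\cdot+R_t))$ and takes the components as plain lists; in practice they could come from any STL implementation (for example the `STL` class in statsmodels, which is an assumption about tooling, not a statement about the study's code). The function names are hypothetical.

```python
from statistics import pvariance


def _strength(component, remainder):
    """max(0, 1 - Var(R) / Var(component + R)); 0 if the denominator vanishes."""
    combined = [c + r for c, r in zip(component, remainder)]
    denom = pvariance(combined)
    if denom == 0:
        return 0.0  # assumption: a constant series has no trend or seasonality
    return max(0.0, 1.0 - pvariance(remainder) / denom)


def trend_strength(trend, remainder):
    """F_T, computed on the seasonally adjusted series T + R."""
    return _strength(trend, remainder)


def seasonal_strength(seasonal, remainder):
    """F_S, computed on the detrended series S + R."""
    return _strength(seasonal, remainder)


# Toy components: a linear trend, a weak alternating season, a small remainder.
trend = [float(t) for t in range(10)]
seasonal = [0.5, -0.5] * 5
remainder = [0.1, -0.1, 0.0, 0.1, -0.1, 0.0, 0.1, -0.1, 0.0, 0.1]
print(trend_strength(trend, remainder), seasonal_strength(seasonal, remainder))
```

Because both measures share the same remainder, a noisy series lowers $F_T$ and $F_S$ together, which is why the two columns of a strength table should be read jointly.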
Table 2's calculations indicate that for the time series of new COVID-19 cases, China, the United States, Italy, India, and eight Chinese cities show strong trends, while Wuhan shows a weak trend. This can be attributed to China's consistent and effective preventive measures. When considering the trend strength of new death numbers, it is clear that China's control measures have been more effective than those in the United States, Italy, and India. In terms of seasonal intensity, the United States and Italy show strong seasonal characteristics for new cases and deaths.
Time series with stronger trend and seasonality are more predictable. Hence, it is important to decompose the data with STL and calculate its trend and seasonal intensity before forecasting the progression of infectious diseases. This will aid in better model selection and improve prediction accuracy.
Country or city | New confirmed cases $F_T$ | New confirmed cases $F_S$ | New deaths $F_T$ | New deaths $F_S$ |
China | 0.8653 | 0.3114 | 0.3621 | 0.3492 |
America | 0.9486 | 0.7690 | 1.0000 | 0.8438 |
Italy | 0.9863 | 0.8463 | 1.0000 | 0.5270 |
India | 0.9899 | 0.5133 | 1.0000 | 0.2505 |
Chengdu | 0.7004 | 0.1556 | 0.2155 | 0.3685 |
Wuhan | 0.5874 | 0.3115 | 0.3066 | 0.3483 |
Hangzhou | 0.7822 | 0.3214 | 0.2200 | 0.3337 |
Guangzhou | 0.9247 | 0.3372 | 0.3741 | 0.3945 |
Urumqi | 0.8725 | 0.2898 | 0.2178 | 0.2887 |
Harbin | 0.8569 | 0.3998 | 0.5291 | 0.5654 |
Xi'an | 0.9547 | 0.2368 | 0.3865 | 0.6783 |
Beijing | 0.9767 | 0.3951 | 0.2155 | 0.3685 |
Shanghai | 0.9259 | 0.4397 | 0.3066 | 0.3483 |
In light of the profound impact and the multifaceted challenges the COVID-19 pandemic has exerted on global health systems, this study leverages a blend of approximate entropy, Dynamic Time Warping (DTW) algorithm, and Seasonal Trend decomposition procedure based on Loess (STL) to draw pivotal insights into infectious disease research.
Firstly, approximate entropy serves as an effective tool in assessing the complexity of time series data. The computational findings of this study provide an objective and quantifiable gauge of the volatility of the infectious disease time series across various countries and regions. This methodology lends a new perspective to understanding and interpreting the course of epidemic progression.
Secondly, the DTW algorithm was employed to ascertain the congruity of COVID-19 spread between four countries and nine cities in China. This calculation holds significant potential in facilitating comparative analysis of infectious disease transmission across different geographical entities, thus providing valuable insights into disease transmission correlations.
Finally, the STL algorithm was utilized to decompose the time series of infectious diseases and determine their trend characteristics. This information offers a practical basis to analyze and evaluate the efficacy of epidemic prevention and control measures implemented across different countries or regions.
Despite the notable findings and conclusions, several thought-provoking questions emerge from this study that warrant further exploration. One is the potential implication of high and low entropy for the course of the epidemic. Further inquiry could shed light on the causes of the observed disparities between countries and cities.
Moreover, using the DTW algorithm's calculation results, deeper discussions could be initiated to unravel the relationship between places exhibiting strong disease transmission similarities. Regarding STL decomposition, future discourse could aim to relate the findings of this study with disease control efforts more directly.
In summation, this study invites more nuanced and comprehensive investigations into these intriguing issues, contributing to the ongoing dialogue surrounding global public health crises.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.