Comparative Analysis of Mortality Predictions from Lassa Fever in Nigeria: A Study Using Count Regression and Machine Learning Methods
Abstract:
In Sub-Saharan Africa, particularly in Nigeria, Lassa fever poses a significant infectious disease threat. This investigation employed count regression and machine learning techniques to model mortality associated with confirmed Lassa fever cases. Weekly data from January 7, 2018, to April 2, 2023, provided by the Nigeria Centre for Disease Control (NCDC), were used to compare these methods. Overdispersion was indicated (p<0.01), prompting the exclusive use of negative binomial and generalized negative binomial regression models. Machine learning algorithms, specifically medium Gaussian support vector machine (MGSVM), ensemble boosted trees, ensemble bagged trees, and exponential Gaussian Process Regression (GPR), were applied, with 80% of the data allocated for training and the remaining 20% for testing. The efficacy of these methods was evaluated using the coefficient of determination (R²) and the root mean square error (RMSE). Descriptive statistics revealed a total of 30,461 suspected cases, 4,745 confirmed cases, and 772 confirmed fatalities attributable to Lassa fever during the study period. The negative binomial regression model demonstrated superior forecasting performance (R²=0.1864, RMSE=4.33) relative to the generalized negative binomial model (R²=0.1915, RMSE=18.2425). However, machine learning algorithms surpassed the count regression models in predictive capability, with ensemble boosted trees emerging as the most effective (R²=0.85, RMSE=1.5994). Analysis also identified the number of confirmed cases as having a significant positive correlation with mortality (r=0.868, p<0.01). The findings underscore the importance of promoting community hygiene practices, such as preventing rodent intrusion and securing food storage, to mitigate the transmission and consequent fatalities of Lassa fever.
1. Introduction
Lassa fever, a viral haemorrhagic fever induced by the Lassa virus, represents a significant public health challenge within West Africa, with Nigeria experiencing a substantial burden of the disease. Transmission of this virus is typically initiated through exposure to the urine or faeces of infected rodents, with secondary human-to-human transmission occurring, particularly in healthcare settings where infection control measures are suboptimal [1]. World Health Organization (WHO) estimates suggest an annual incidence of 100,000 to 300,000 Lassa fever cases across West Africa, culminating in approximately 5,000 deaths [2].
The onset of Lassa fever within Nigeria was first documented in 1969, with subsequent outbreaks resulting in tens of thousands of cases and hundreds of fatalities annually. Despite concerted efforts to contain the disease, the NCDC reported a staggering 2,070 confirmed cases and 397 deaths in 2020 alone. Manifestations of the illness range from fever, sore throat, and muscle aches to severe complications such as haemorrhagic fever. The significant morbidity associated with Lassa fever, primarily affecting young adults, exerts profound socioeconomic impacts [1], [2]. As of April 2, 2023, it has been reported that the confirmed fatalities due to Lassa fever amount to 772 [3].
In the context of Nigeria, where Lassa fever is endemic, efforts to forecast and mitigate the disease's impact are of paramount importance. Accordingly, the application of count regression models and machine learning algorithms has been explored to predict Lassa fever mortality. This methodological approach, though novel, offers a promising avenue to understand disease dynamics and identify correlates of disease incidence and mortality. Such correlates include suspected and confirmed cases, temporal factors (week and year), and other potential predictors. The present study was conducted with the aim of modeling mortality associated with Lassa fever using these analytical methods, offering a comparative assessment of their predictive accuracies.
The persistent rise in Lassa fever cases and deaths, particularly in resource-constrained rural settings, highlights the inadequacies in current control measures [4]. Previous studies have predominantly relied on questionnaire-based assessments or mathematical modeling, with limited exploration of count regression models in the context of Lassa fever. The application of machine learning in this domain is emergent, thus underscoring the originality and relevance of the present work.
This investigation endeavors to provide a robust model for Lassa fever mortality within the Nigerian context by employing both count regression and machine learning methodologies. Given the endemic nature of Lassa fever, an empirical analysis of this kind is crucial for the formulation of health policies aimed at reducing mortality rates. Thus, this study conducts a comparative analysis of the efficacy of count regression models and machine learning algorithms in modeling Lassa fever mortality in Nigeria.
2. Related Works
Endemicity of Lassa fever within Nigeria persists as a critical public health issue. Research conducted by Buba et al. [5] identified certain occupations, notably farming and hunting, as significant risk factors contributing to the heightened susceptibility to Lassa fever. Olayemi et al. [6] reported a marked seasonality in the prevalence of Lassa fever in West Africa, with a peak during the dry season spanning November to April, coinciding with increased rodent activity and food supply. These findings underscore the imperative for enhanced surveillance and the implementation of more effective prevention and control measures to mitigate the incidence and mortality of the disease.
In a novel approach, Alile [7] utilised a supervised machine learning technique for the diagnosis of Lassa fever, employing clinical signs such as sore throat and headache as diagnostic features. The study revealed that the developed machine learning algorithm exhibited high accuracy, presenting a potential adjunct to support clinical decision-making in the diagnosis of Lassa fever.
The work of Oluwole and Nkonyana [8] involved the application of k-nearest neighbors (kNN) and decision tree algorithms to model weekly cases of Lassa fever, juxtaposing their efficacy with that of the Seasonal Autoregressive Integrated Moving Average (SARIMA) model. The study demonstrated that machine learning models yielded robust performance despite the complexities inherent in confirmed case data, with the kNN algorithm showing superior performance compared to the other models examined. Nnebe et al. [9] proposed the use of fuzzy logic as a supplementary diagnostic method for Lassa fever, offering an alternative to traditional laboratory techniques. The model developed was concluded to provide reliable diagnostics, which could aid medical practitioners in making informed decisions.
Further research by Shoaib et al. [10] explored the use of artificial neural networks (ANNs) to understand the dynamics of Lassa fever in Nigeria. Concurrently, Tahmo et al. [11] investigated the application of SARIMA and Poisson regression models to forecast the trajectory of Lassa fever outbreaks. Steur and Mueller [12] successfully employed neural networks for the classification of viral haemorrhagic fevers, with a focus on Lassa fever, illustrating the models' efficiency as a prospective tool against future outbreaks. The versatility of decision support systems was exhibited by Olabiyisi et al. [13], who employed fuzzy logic in the diagnosis of a spectrum of tropical diseases, including Lassa fever, highlighting the broader applicability of such models in disease diagnostics.
In an analysis focusing on environmental correlates, Clark et al. [14] applied negative binomial and Poisson regression models to identify domestic factors associated with increased rodent abundance in regions of rural Upper Guinea endemic for Lassa fever. Complementing this environmental perspective, Redding et al. [15] leveraged Poisson regression to elucidate geographical and climatic determinants linked to the spatial distribution of confirmed Lassa fever cases in Nigeria.
3. Methods
Secondary data comprising weekly confirmed new cases of Lassa fever in Nigeria for the period from January 7, 2018, to April 2, 2023, served as the basis for this study. These data were sourced from the NCDC and provided a platform for subsequent analyses. The dependent variable of interest was the confirmed fatalities attributed to Lassa fever, while independent variables included temporal factors (year, month, week) and epidemiological measures (suspected and confirmed cases of Lassa fever). Selection of count regression models, specifically Poisson regression, negative binomial regression, and generalized negative binomial regression, was predicated on their ability to handle count data. In parallel, machine learning algorithms were deployed, capitalizing on their capacity to learn from data and yield reliable predictions.
Prior to analysis, data underwent preprocessing, which included cleansing and transformation to ensure compatibility with the requirements of the chosen statistical and machine learning methodologies. Feature scaling was conducted, and the dataset was partitioned into a training set (80%) and a testing set (20%).
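The preprocessing step can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the text does not state whether the 80/20 partition was random or chronological, nor which scaling was applied, so the seeded random shuffle and the z-score standardization fitted on the training rows only are assumptions, and the toy feature rows are hypothetical.

```python
import random
import statistics

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle row indices with a fixed seed, then split (assumed random 80/20 split)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

def fit_standardizer(train_rows):
    """Per-column mean and standard deviation, estimated from training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(rows, means, stds):
    """Apply the training-set means and standard deviations to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical weekly rows: (week, suspected cases, confirmed cases)
rows = [[w, 50 + 3 * w, 5 + w] for w in range(1, 21)]
train, test = train_test_split(rows)
means, stds = fit_standardizer(train)
train_scaled = standardize(train, means, stds)
test_scaled = standardize(test, means, stds)
print(len(train), len(test))
```

Fitting the scaler on the training rows alone keeps information from the test weeks out of the model inputs, which is the usual reason for splitting before scaling.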
The mathematical representation of the Poisson regression is encapsulated by the following equation:
$$\ln \left(Y_i\right)=\beta_0+\beta_1 X_1+\beta_2 X_2+\cdots+\beta_5 X_5 \tag{1}$$
where $\ln \left(Y_i\right)$ denotes the natural logarithm of the expected count of the dependent variable (confirmed fatalities due to Lassa fever), with $\beta_0, \beta_1, \beta_2, \ldots, \beta_5$ representing coefficients estimated for independent variables $X_1, X_2, \ldots, X_5$.
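To make the log link concrete, the sketch below inverts it for a single observation. The coefficient values and the helper name `expected_count` are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math

def expected_count(intercept, coefficients, features):
    """Invert the log link: expected count = exp(b0 + sum of b_j * x_j)."""
    linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, features))
    return math.exp(linear_predictor)

# Hypothetical model with one predictor (confirmed cases in the week)
beta0 = 0.5
betas = [0.02]

at_10_cases = expected_count(beta0, betas, [10])
at_11_cases = expected_count(beta0, betas, [11])
print(at_10_cases, at_11_cases)
print(at_11_cases / at_10_cases, math.exp(betas[0]))  # the two values agree
```

Under the log link, a one-unit increase in a predictor multiplies the expected count by $e^{\beta}$, so coefficients are naturally read as rate ratios rather than additive effects.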
Incorporating a parameter to account for overdispersion, the negative binomial regression extends the Poisson model as follows:
$$\ln \left(Y_i\right)=\beta_0+\beta_1 X_1+\beta_2 X_2+\cdots+\beta_5 X_5, \qquad \operatorname{Var}\left(y_i\right)=Y_i+\alpha Y_i^2 \tag{2}$$
where $\ln \left(Y_i\right)$ symbolizes the natural logarithm of the expected count of the dependent variable and $y_i$ is the observed count. Coefficients $\beta_0, \beta_1, \beta_2, \ldots, \beta_5$ are estimated for the independent variables $X_1, X_2, \ldots, X_5$, while $\alpha$ represents the dispersion parameter, a quantifier of data overdispersion. Like Poisson regression, negative binomial regression employs a logarithmic link function. Unlike Poisson regression, which constrains the variance to equal the mean, the negative binomial model incorporates a dispersion parameter. This addition accommodates the overdispersion inherent in count data and thus provides a more robust modelling framework.
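The role of the dispersion parameter can be illustrated with the summary statistics reported for weekly confirmed deaths (mean 3.04, SD 4.11). The sketch assumes the common NB2 form, in which the variance equals the mean plus $\alpha$ times the squared mean; the function names are hypothetical, and because the calculation uses marginal moments and ignores the covariates, the resulting $\alpha$ is only a rough indication, not the regression estimate.

```python
def nb2_variance(mean, alpha):
    """Variance implied by the NB2 form: mean + alpha * mean**2 (alpha = 0 is Poisson)."""
    return mean + alpha * mean ** 2

def moment_alpha(mean, variance):
    """Method-of-moments dispersion: solve variance = mean + alpha * mean**2 for alpha."""
    return (variance - mean) / mean ** 2

# Weekly confirmed deaths as summarized in the descriptive statistics
mean_deaths = 3.04
var_deaths = 4.11 ** 2

alpha_hat = moment_alpha(mean_deaths, var_deaths)
print(round(var_deaths / mean_deaths, 2))  # variance-to-mean ratio
print(round(alpha_hat, 2))                 # rough marginal dispersion
```

The variance is roughly 5.6 times the mean, far from the ratio of 1 that a Poisson model assumes, and the implied marginal dispersion is about 1.5.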
A more adaptable variant, the generalized negative binomial regression, allows the independent variables to influence both the mean and the dispersion parameter. The corresponding equations are:
$$\begin{aligned} \ln \left(Y_i\right) &=\beta_0+\beta_1 X_1+\beta_2 X_2+\cdots+\beta_5 X_5 \\ \ln \left(\alpha_g\right) &=\gamma_0+\gamma_1 X_1+\gamma_2 X_2+\cdots+\gamma_5 X_5+e_g \end{aligned} \tag{3}$$
where $\ln \left(\alpha_g\right)$ symbolizes the natural logarithm of the dispersion parameter. Coefficients $\gamma_0, \gamma_1, \gamma_2, \ldots, \gamma_5$ are estimated for the dispersion model, and $e_g$ denotes the error term associated with the dispersion model in Eq. (3).
In the investigation of predictive models for Lassa fever mortality, a selection of supervised machine learning algorithms was examined: MGSVM, ensemble boosted trees, ensemble bagged trees, and exponential GPR.
The MGSVM, also known as the medium radial basis function (RBF) SVM, uses a Gaussian kernel to measure the similarity between pairs of data points. The kernel replaces the inner product with a function that corresponds to an inner product in a higher-dimensional feature space reached through a nonlinear mapping, so that a linear solution (an optimal hyperplane) in that space yields a nonlinear fit in the original predictors. SVMs of this kind have demonstrated utility in a diverse array of applications, including image classification and speech recognition.
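The similarity measure behind the Gaussian kernel can be sketched in a few lines. The parameterization below, with a single `kernel_scale`, is one common convention and an assumption here; the kernel scale actually used for the "medium" setting is not reported, and the feature vectors are hypothetical standardized values.

```python
import math

def gaussian_kernel(x, z, kernel_scale=1.0):
    """Gaussian (RBF) similarity: exp(-||x - z||^2 / (2 * kernel_scale^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * kernel_scale ** 2))

# Hypothetical standardized (suspected cases, confirmed cases) for three weeks
week_a = [0.2, 1.1]
week_b = [0.3, 1.0]   # close to week_a
week_c = [2.5, -0.7]  # far from week_a

print(gaussian_kernel(week_a, week_b))  # near 1: very similar weeks
print(gaussian_kernel(week_a, week_c))  # near 0: dissimilar weeks
```

A larger kernel scale makes the similarity decay more slowly with distance, producing a smoother fitted function; a smaller scale lets the model follow local fluctuations more closely.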
Ensemble boosted trees, a robust algorithm in machine learning, are constructed iteratively to ameliorate the inaccuracies of preceding trees in the sequence. Comprising a collection of decision trees, each individual tree within the ensemble is trained to strengthen the performance of the collective model. The methodology allows for the conversion of weaker individual trees into a robust predictive mechanism, with new trees compensating for prior errors.
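The sequential error-correcting idea can be sketched with least-squares boosting on one-split trees (stumps). This is a simplified illustration under stated assumptions, not the configuration used in the study: the number of rounds, the learning rate, the single predictor, and the toy data are all hypothetical.

```python
def fit_stump(xs, targets):
    """Best single split on one feature by squared error; needs two distinct x values."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps

def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical weekly data: confirmed cases and confirmed deaths
confirmed = [1, 3, 4, 8, 12, 20, 35, 50]
deaths = [0, 0, 1, 1, 2, 4, 6, 9]

model = fit_boosted_stumps(confirmed, deaths)
fitted = [predict_boosted(model, x) for x in confirmed]
baseline = [sum(deaths) / len(deaths)] * len(deaths)
print(mse(deaths, baseline), mse(deaths, fitted))
```

Each new stump is fitted to what the current ensemble still gets wrong, so the training error shrinks round by round; the learning rate damps each correction to reduce the risk of overfitting.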
Closely related to random forest, ensemble bagged trees are employed to enhance prediction accuracy and mitigate the risk of overfitting. The technique trains multiple copies of a base learning algorithm, here a decision tree, on bootstrap samples drawn at random from the training data. The predictions of the individual trees are then combined (averaged, in regression) into a composite prediction that is more stable than that of any single tree.
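Bootstrap aggregation can be sketched as follows, using one-split trees as the base learner to keep the example short. The number of trees, the seed, the single predictor, and the toy data are hypothetical; a practical implementation would grow deeper trees on all predictors.

```python
import random

def fit_stump(xs, ys):
    """One-split regression tree on one feature; falls back to the mean if no split exists."""
    overall = sum(ys) / len(ys)
    best = (float("inf"), None, overall, overall)
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    if threshold is None:  # degenerate bootstrap sample with a single x value
        return left_mean
    return left_mean if x <= threshold else right_mean

def fit_bagged(xs, ys, n_trees=25, seed=0):
    """Fit each tree on a bootstrap sample (drawn with replacement) of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict_bagged(trees, x):
    """Average the predictions of all trees in the ensemble."""
    return sum(predict_stump(tree, x) for tree in trees) / len(trees)

# Hypothetical weekly data: confirmed cases and confirmed deaths
confirmed = [1, 3, 4, 8, 12, 20, 35, 50]
deaths = [0, 0, 1, 1, 2, 4, 6, 9]

trees = fit_bagged(confirmed, deaths)
print(predict_bagged(trees, 2), predict_bagged(trees, 40))
```

Because every tree sees a slightly different resample, their individual errors partly cancel when averaged, which is the source of the variance reduction attributed to bagging.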
The exponential GPR represents a non-parametric Bayesian approach to regression. Optimal for smaller datasets, this variant of GPR differs from its squared exponential counterpart by the application of the Euclidean distance, which, in this instance, is not squared. While exponential GPR demonstrates a proficiency in smoothing functions with minimal error, it is less adept at managing discontinuities. Employing a probabilistic framework, the GPR synthesizes likelihood with prior distributions, inferring predictions that furnish both a mean and a standard deviation.
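The contrast between the two kernels can be written out explicitly. The forms below are the standard definitions of the exponential and squared exponential covariance functions; the symbols $\sigma_f$ (signal standard deviation) and $\ell$ (length scale) do not appear elsewhere in this paper and are introduced here only for illustration.

```latex
k_{\mathrm{exp}}(x, x') = \sigma_f^2 \exp\!\left(-\frac{r}{\ell}\right),
\qquad
k_{\mathrm{SE}}(x, x') = \sigma_f^2 \exp\!\left(-\frac{r^2}{2\ell^2}\right),
\qquad
r = \lVert x - x' \rVert
```

Because the distance $r$ enters unsquared, the exponential kernel decays linearly in $r$ near zero, which yields rougher sample paths than the very smooth functions implied by the squared exponential kernel.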
For the evaluation of model performance, R² and RMSE were employed as comparative metrics. The model exhibiting the highest R² was deemed superior in terms of fitness, indicating a greater proportion of variance accounted for by the model. Conversely, the model manifesting the lowest RMSE was identified as optimal for forecasting purposes, denoting minimal discrepancy between observed and predicted mortality due to Lassa fever.
$$R^2=1-\frac{\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^n\left(y_i-\bar{y}\right)^2} \tag{4}$$
$$\mathrm{RMSE}=\sqrt{\frac{1}{n} \sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2} \tag{5}$$
where $n$ is the number of observations, $y_i$ represents the actual instances of Lassa fever mortality, $\hat{y}_i$ denotes the predicted values, and $\bar{y}$ is the mean of the observed values.
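The two metrics can be computed directly from observed and predicted counts, as in the minimal sketch below. The toy values are hypothetical, and the adjusted R² helper uses the usual degrees-of-freedom correction with `n_predictors` predictors, which is assumed to be the adjustment behind the adjusted values reported in the results tables.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_observations, n_predictors):
    """Penalize R^2 for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)

# Hypothetical weekly deaths and model predictions
actual = [2, 0, 5, 3]
predicted = [1, 1, 4, 4]

print(rmse(actual, predicted))       # every error is +/-1, so RMSE is 1.0
print(r_squared(actual, predicted))  # 1 - 4/13
```

RMSE is expressed in the same units as the target (deaths per week), which makes it directly interpretable, whereas R² is unit-free and describes the share of variance explained.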
4. Results and Discussion
The analysis of the descriptive statistics revealed that, during the study period, a total of 30,461 suspected cases of Lassa fever were recorded, alongside 4,745 confirmed cases. Furthermore, there were 772 confirmed fatalities attributed to Lassa fever, a mean of 3.04 deaths per week and roughly one death for every six confirmed cases. Positive skewness was observed for suspected cases, confirmed cases, and confirmed deaths (Table 1).
Variables | Minimum | Maximum | Sum | Mean | SD | Skewness |
Suspected cases | 11.00 | 560.00 | 30461.00 | 119.93 | 104.44 | 1.99 |
Confirmed cases | 0.00 | 137.00 | 4745.00 | 18.68 | 25.41 | 2.34 |
Confirmed deaths | 0.00 | 21.00 | 772.00 | 3.04 | 4.11 | 2.09 |
Subsequent to the preliminary statistical analysis, an examination for overdispersion was conducted to ascertain the suitability of the Poisson regression model.
Test Statistics | Statistic | P-Value | Remark |
Deviance goodness-of-fit | 429.3775 | 0.0000** | Significant |
Pearson goodness-of-fit | 419.2726 | 0.0000** | Significant |
The results outlined in Table 2 confirm overdispersion in the data, as evidenced by the deviance goodness-of-fit statistic of 429.3775 (p<0.01) and the Pearson goodness-of-fit statistic of 419.2726 (p<0.01). Significant overdispersion renders the Poisson regression model, which assumes equidispersion (variance equal to the mean), inadequate. Therefore, negative binomial and generalized negative binomial regressions, both of which accommodate overdispersed data, were fitted instead. In addition, four machine learning algorithms were selected for their ability to learn the relationship between the predictors and the target variable directly from the data.
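The logic of the Pearson goodness-of-fit check can be sketched as follows. The example fits an intercept-only Poisson model, whose fitted mean is simply the sample mean, to hypothetical weekly death counts; it illustrates the statistic only and does not reproduce the values in Table 2, which come from the full model.

```python
def pearson_dispersion(counts, fitted_means, n_parameters):
    """Pearson chi-square divided by residual degrees of freedom.

    Values near 1 are consistent with a Poisson model; values well above 1
    point to overdispersion.
    """
    chi_square = sum((y - mu) ** 2 / mu for y, mu in zip(counts, fitted_means))
    degrees_of_freedom = len(counts) - n_parameters
    return chi_square / degrees_of_freedom

# Hypothetical weekly death counts with an outbreak-like burst
weekly_deaths = [0, 0, 1, 0, 2, 9, 14, 3, 0, 1, 0, 21]

# Intercept-only Poisson fit: every fitted mean equals the sample mean
sample_mean = sum(weekly_deaths) / len(weekly_deaths)
fitted = [sample_mean] * len(weekly_deaths)

ratio = pearson_dispersion(weekly_deaths, fitted, n_parameters=1)
print(round(ratio, 2))
```

For these toy counts the ratio is far above 1, the same qualitative signal that motivates replacing the Poisson model with a negative binomial specification.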
Table 3 presents the criteria for optimal model selection based on fitness and forecasting accuracy.
Count regression models | $R^2_{adj.}$ (fitness) | BIC (fitness) | MSE (forecasting accuracy) | RMSE (forecasting accuracy) |
Negative binomial | 0.1864 | 968.9921 | 18.76 | 4.33 |
Generalized negative binomial | 0.1915 | 977.855 | 332.79 | 18.2425 |
The results displayed in Table 3, with adjusted R² values of 0.1864 (18.64%) for the negative binomial model and 0.1915 (19.15%) for the generalized negative binomial model, indicate a marginally better fit for the generalized negative binomial model. The Akaike Information Criterion (AIC) values (not shown in the table) point the same way: the lower AIC for the generalized negative binomial model (935.4407) compared to the negative binomial model (944.3207) implies a better trade-off between fit and complexity for the former. However, the Bayesian Information Criterion (BIC), which penalizes additional parameters more heavily, is lower for the negative binomial model (968.9921) than for the generalized negative binomial model (977.855) and therefore favours the simpler model. With respect to error metrics, the negative binomial regression yielded an RMSE of 4.33 and a mean square error (MSE) of 18.76, while the generalized negative binomial regression presented an RMSE of 18.2425 with an MSE of 332.79. Thus, the negative binomial regression emerges as the more suitable model for modeling Lassa fever fatalities in Nigeria, given its lower BIC, RMSE, and MSE.
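AIC and BIC can rank the same pair of models differently, as the following sketch shows. The log-likelihoods, parameter counts, and sample size are hypothetical values chosen for illustration; they are not quantities reported in this study.

```python
import math

def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_parameters - 2 * log_likelihood

def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_parameters * math.log(n_observations) - 2 * log_likelihood

# Hypothetical fits: a simpler model and a more flexible one
n_obs = 254
simple_loglik, simple_k = -465.0, 7
flexible_loglik, flexible_k = -456.0, 12

print(aic(simple_loglik, simple_k), aic(flexible_loglik, flexible_k))
print(bic(simple_loglik, simple_k, n_obs), bic(flexible_loglik, flexible_k, n_obs))
```

Each extra parameter costs 2 units under AIC but ln(n) units under BIC, so once the sample exceeds about 8 observations BIC is the stricter criterion and can prefer the simpler model even when AIC prefers the more flexible one.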
Table 4 evaluates the performance of four machine learning algorithms.
Model Type | $R^2_{adj.}$ | MSE | RMSE |
MGSVM | 0.84 | 2.6188 | 1.6183 |
Ensemble boosted trees | 0.85 | 2.4006 | 1.5994 |
Ensemble bagged trees | 0.82 | 2.8155 | 1.6779 |
Exponential GPR | 0.82 | 2.9528 | 1.7184 |
The analyses detailed in Table 4 yield an adjusted R² value of 0.84 for the MGSVM, 0.85 for the ensemble boosted trees, and 0.82 for both the ensemble bagged trees and the exponential GPR. RMSE values were recorded as 1.6183 for the MGSVM, 1.5994 for the ensemble boosted trees, 1.6779 for the ensemble bagged trees, and 1.7184 for the exponential GPR. Correspondingly, MSE values were found to be 2.6188, 2.4006, 2.8155, and 2.9528 for the aforementioned models, respectively.
It is observed that the ensemble boosted trees algorithm not only exhibits the highest adjusted R² value but also the lowest RMSE and MSE, indicating superior predictive performance in comparison to the other evaluated machine learning algorithms. These findings suggest that, within the context of Lassa fever mortality prediction in Nigeria, machine learning models, and specifically the ensemble boosted trees, offer more precise forecasting capabilities than their counterparts.
Figure 1 illustrates the actual versus predicted Lassa fever deaths in Nigeria, as forecasted by the ensemble boosted trees model.
A comparative analysis between optimal count regression models and machine learning algorithms was conducted. Table 5 shows the results.
Models | Type | RMSE |
Count regression | Negative binomial | 4.3300 |
Machine learning | Ensemble boosted trees | 1.5994 |
Table 6 presents a correlation matrix, elucidating the relationships between temporal variables (week, year), clinical incidences (suspected and confirmed cases), and Lassa fever mortality.
Variables | 1 | 2 | 3 | 4 | 5 |
1. Week | 1 | | | | |
2. Year | $-0.134^{*}$ | 1 | | | |
3. Suspected Cases | $-0.495^{**}$ | $0.427^{**}$ | 1 | | |
4. Confirmed Cases | $-0.566^{**}$ | $0.279^{**}$ | $0.885^{**}$ | 1 | |
5. Lassa Fever Mortality | $-0.537^{**}$ | $0.186^{**}$ | $0.756^{**}$ | $0.868^{**}$ | 1 |
The results, encapsulated in Table 5, demonstrate a lower RMSE for the ensemble boosted trees algorithm (1.5994) relative to the negative binomial count regression model (4.3300), highlighting the former's superior predictive capability for modeling Lassa fever mortality in Nigeria.
It was observed in Table 6 that Lassa fever mortality correlates significantly and positively with year (r=0.186), the number of suspected cases (r=0.756), and the number of confirmed cases (r=0.868). These associations suggest an ongoing endemic presence of Lassa fever in Nigeria.
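A correlation coefficient of this kind can be computed as in the sketch below, which assumes Pearson product-moment coefficients (the type of coefficient is not stated explicitly). The weekly values are hypothetical and serve only to show the calculation.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical weekly confirmed cases and confirmed deaths
confirmed = [2, 5, 9, 14, 30, 41]
deaths = [0, 1, 1, 3, 5, 8]

print(pearson_r(confirmed, deaths))
```

A coefficient close to 1 indicates that weeks with more confirmed cases also tend to record more deaths; it describes linear association only and does not by itself establish a causal effect.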
The investigation's findings affirm that Lassa fever persists as a significant health threat in Nigeria. Within the count regression model category, the negative binomial model exhibited superior forecasting performance, as indicated by its RMSE value of 4.33. However, even the weakest machine learning algorithm (exponential GPR), with an RMSE of 1.7184, surpassed the best count regression model. The ensemble boosted trees algorithm, with the lowest RMSE of 1.5994, was deemed the most effective across both predictive model categories. This reinforces the advantage of machine learning algorithms in pattern recognition within datasets, as opposed to count regression models that rely on estimation of predefined models.
The comparative efficacy of machine learning algorithms is supported by existing literature. Oluwole and Nkonyana [8] observed superior performance of these algorithms against the SARIMA model, a time-series statistical model. Similarly, Busari and Samson [16] documented the preeminence of machine learning in predicting infectious diseases within Nigeria. A potential explanation for the enhanced performance of machine learning models is their capacity to discern patterns in data, enabling more accurate predictions than the more rigid count regression approach. Additionally, the relative robustness of machine learning algorithms to statistical assumptions may contribute to their observed superiority.
5. Conclusions and Recommendations
In the culmination of the presented analysis, the efficacies of count regression models, specifically negative binomial regression and its generalized counterpart, were juxtaposed with those of various machine learning algorithms, including MGSVM, ensemble boosted trees, ensemble bagged trees, and exponential GPR. The corpus of data scrutinized herein comprised weekly reports of Lassa fever mortality, spanning from January 7, 2018, to April 2, 2023, as procured from the NCDC.
It was discerned that machine learning algorithms surpassed count regression models in modelling the mortality rates associated with Lassa fever in Nigeria. Notably, the ensemble boosted trees algorithm emerged as the most proficient among the machine learning contenders. However, the superiority of machine learning algorithms delineated in this study warrants further investigation, incorporating a broader array of algorithms and more contemporaneous data. Such endeavors would potentially refine the understanding of Lassa fever mortality trends and inform the formulation of robust health promotion policies.
In light of these findings, it is recommended that:
(i) The ensemble boosted trees algorithm be employed for future modelling of Lassa fever mortality in Nigeria, owing to its demonstrated predictive accuracy.
(ii) The Federal Government of Nigeria ensure rigorous adherence to public health directives and preventive strategies to mitigate the spread of the virus.
(iii) Residents of and visitors to Nigeria and other West African nations actively participate in stemming the tide of Lassa fever by reporting symptoms immediately to medical facilities, bolstering community hygiene practices, deterring rodent infestations, and safeguarding food sources from contamination.
The data used to support the research findings are available from the corresponding author upon request.
The authors declare no conflict of interest.