Comparing Artificial Neural Networks with Multiple Linear Regression for Forecasting Heavy Metal Content
Abstract:
This paper applies two modeling tools, multiple linear regression (MLR) and artificial neural networks (ANNs), to predict the concentrations of heavy metals (zinc, boron, and manganese) in the surface waters of the Oued Inaouen watershed flowing towards Inaouen, using a set of physical-chemical parameters. XLSTAT was used to perform the multiple linear and nonlinear regressions, and STATISTICA 10 was used to build the neural networks for modeling and prediction. The effectiveness of the ANN- and MLR-based stochastic models was assessed by the coefficient of determination (R²), the sum of squared errors (SSE), and a review of fit graphs. The results demonstrate the value of ANNs for prediction modeling. Drawing on supervised learning and backpropagation, the ANN-based prediction models adopt an architecture of [16-15-1] for zinc, [16-11-1] for manganese, and [16-8-1] for boron, and perform effectively with a single hidden layer. The MLR-based prediction models are substantially less accurate than those based on ANNs. In addition, the physical-chemical parameters investigated are nonlinearly correlated with the levels of heavy metals in these surface waters.
1. Introduction
Water can be polluted by many sources, including wastewater, industrial products, and domestic products. The discharge of pollutants, particularly those containing harmful nanoparticles or heavy metals, makes the water toxic and poses risks to human health, biological life, and environmental security. Metals have a special status in water pollution because they are used in many applications. Heavy metals are carried up the food chain by water. They are often present only in tiny amounts, but when they bioaccumulate in living organisms, they pose an increasingly dangerous problem [1].
Public agencies have set limits to manage emissions, assess threats, and protect the ecosystem against the toxicity of heavy metals. The scientific community and Moroccan government officials are working hard on environmental issues [2]. Morocco's environmental plans provide for the management and preservation of natural resources and the protection of the ecosystem [3].
This paper aims to build mathematical models that predict the concentrations of heavy metals (manganese, boron, and zinc) from various environmental factors. The study area is the surface waters of the Oued Inaouen watershed. Two prediction models, an artificial neural network (ANN) and multiple linear regression (MLR), were compared using STATISTICA 10 and XLSTAT 2017.
2. Methodology
Our database consists of 100 surface water samples (observations) from the province of Taza. The collection, transportation, and storage of water samples in 2014 and 2015 were accomplished in accordance with the policies and practices of the National Drinking Water Office. The lab run by the University Sidi Mohamed Ben Abdellah (USMBA) of Fez, known as the Regional University Interface Center (CURI), executed part analysis, part fabrication, and part manufacturing [4], [5], [6].
A total of 16 physical-chemical features were measured in these samples and taken as independent (explanatory) variables, namely, total dissolved solids (TDS), conductivity (Cond), temperature (T/°C), dissolved oxygen (DO), pH, potassium (K), magnesium (Mg), calcium (Ca), nitrate (NO3), chloride (Cl), phosphorus (P), ammoniacal nitrogen (NH4), total alkalinity (CaCO3), sodium (Na), and sulfates (SO4) [5]. The manganese (Mn), zinc (Zn), and boron (B) levels are the three dependent variables to be predicted.
From the overall database, 70% of the data were selected randomly to train the prediction models for the dependent variables [7]. The remaining 30% were used to guard against overfitting and to assess the predictive performance of the trained models.
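The random 70/30 partition described above can be sketched as follows. This is a minimal illustration, not the procedure used by STATISTICA: the function name `train_test_split`, the fixed seed, and the placeholder sample identifiers are assumptions introduced here.

```python
import random


def train_test_split(samples, train_fraction=0.7, seed=0):
    """Randomly split samples into a training and a test subset."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # fixed seed gives a reproducible split
    n_train = round(train_fraction * len(samples))
    train = [samples[i] for i in indices[:n_train]]
    test = [samples[i] for i in indices[n_train:]]
    return train, test


# 100 placeholder identifiers standing in for the 100 water samples.
sample_ids = list(range(100))
train_ids, test_ids = train_test_split(sample_ids)
print(len(train_ids), len(test_ids))
```

Keeping the test samples out of training is what allows them to reveal overfitting: a model that only memorizes the training set will score much worse on the held-out 30%.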
Data normalization simplifies model fitting. The input data consist of the 16 independent variables above, which differ in magnitude. They are transformed into standardized variables to put them on a common scale. The values of each independent variable $v_i$ are centered on their mean and scaled by their standard deviation, using the relationship:
$$X_s\left(v_i\right)=\frac{X\left(v_i\right)-\bar{X}\left(v_i\right)}{\sigma\left(v_i\right)}$$
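where, $\mathrm{X}_{\mathrm{s}}\left(v_i\right)$, $\mathrm{X}\left(v_i\right)$ and $\bar{X}\left(v_i\right)$ are the standardized, measured, and mean values, respectively. The mean value can be calculated by:
$$\bar{X}\left(v_i\right)=\frac{1}{n} \sum_{k=1}^{n} X_k\left(v_i\right)$$
The standard deviation $\sigma\left(v_i\right)$ can be expressed as:
$$\sigma\left(v_i\right)=\sqrt{\frac{1}{n} \sum_{k=1}^{n}\left(X_k\left(v_i\right)-\bar{X}\left(v_i\right)\right)^2}$$
where, $n$ is the number of samples and $X_k\left(v_i\right)$ is the value of variable $v_i$ measured in sample $k$.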
The values of all variables are standardized to avoid computing excessively large or small exponents, and thus to reduce the variability of the mean [8]. The data of the dependent variables were also normalized to the range [0, 1] to meet the constraints of the transfer function used by the neural networks. The normalization follows the relationship [9]:
$$Y_n=\frac{Y-Y_{\min }}{Y_{\max }-Y_{\min }}$$
where, $Y_n, Y, Y_{\min }$ and $Y_{\max }$ are the normalized, measured, minimum, and maximum values, respectively.
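The two scaling steps can be illustrated with a short sketch. The helper names and the three conductivity-like values are hypothetical, and the population form of the standard deviation (division by n) is an assumption; a sample form (division by n - 1) would change the scaled values slightly.

```python
import statistics


def standardize(values):
    """Center on the mean and scale by the standard deviation."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population form: an assumption here
    return [(x - mean) / sigma for x in values]


def min_max_scale(values):
    """Map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(y - lo) / (hi - lo) for y in values]


# Hypothetical measurements, for illustration only.
measured = [500.0, 700.0, 900.0]
standardized = standardize(measured)
scaled = min_max_scale(measured)
print(standardized)
print(scaled)
```

After standardization every variable has zero mean and unit standard deviation, so no single input dominates the weighted sums computed inside the network merely because of its units.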
Two methods are adopted to realize the modelling process, namely, MLR and MLP-type ANN, aiming to optimize the link between manganese (Mn), zinc (Zn), and boron (B), and physicochemical elements [10].
Regression analysis is a statistical technique that enables the examination of the potential relationships between a dependent variable and one or more independent variables [11].
Linear regression uses a straight line to summarize, interpret, and forecast the fluctuations of a dependent variable (Y) according to another independent variable (X). Regardless of the objective, the success of regression analysis is substantially influenced by the relationships between the explanatory variables [12].
MLR is a statistical technique that describes the variation of an endogenous variable as a function of several exogenous variables [13]. It addresses the same problem as simple linear regression (SLR). The difference is that MLR seeks to explain the values of Y not by a single variable X, but by several variables {$X_j$} known as explanatory variables. After slightly modifying the previous notations, it is assumed that Y and {$X_j$} have a linear relationship [13]:
$$Y=a_0+a_1 X_1+a_2 X_2+\ldots+a_p X_p+\varepsilon$$
where, Y is the dependent variable; $X_1, X_2, \ldots, X_p$ are the independent variables; p is the number of explanatory variables; $a_0, a_1, a_2, \ldots, a_p$ are the parameters to be estimated; $a_0$ is the estimated intercept; $a_1, a_2, \ldots, a_p$ are the slopes (partial regression coefficients); ε is the error of the model that expresses or summarizes the missing information in the linear explanation of Y values from $X_1, X_2, \ldots, X_p$ [14].
The MLR coefficients can be interpreted in the same way as the SLR coefficients: the intercept represents the expected value of Y when all the $X_j$ are fixed at 0. Each slope $a_j$ indicates the expected change in Y for a change of one unit in $X_j$, all other things being equal. This last condition is important, for it requires all the other $X_j$ to be held at fixed values [15].
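As an illustration of how the parameters $a_0, a_1, \ldots, a_p$ can be estimated, the sketch below fits the linear model by ordinary least squares, solving the normal equations with Gauss-Jordan elimination. It is not the XLSTAT procedure; the function names and the noise-free toy data are assumptions made for this example.

```python
def fit_mlr(X, y):
    """Estimate [a0, a1, ..., ap] by ordinary least squares."""
    rows = [[1.0] + list(x) for x in X]  # leading 1 carries the intercept a0
    k = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y)
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(k):
            if r != c:
                factor = M[r][c]
                M[r] = [a - factor * b for a, b in zip(M[r], M[c])]
    return [M[i][k] for i in range(k)]


def predict(coeffs, x):
    """Y = a0 + a1*X1 + ... + ap*Xp for one observation x."""
    return coeffs[0] + sum(a * v for a, v in zip(coeffs[1:], x))


# Hypothetical toy data generated from Y = 1 + 2*X1 - 3*X2 (no noise).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, -2, 0, 2]
coeffs = fit_mlr(X, y)
print([round(a, 6) for a in coeffs])
```

Because the toy data contain no noise, the fitted slopes recover the generating coefficients: raising $X_1$ by one unit while holding $X_2$ fixed changes the prediction by $a_1$.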
The ANN is a computational model mimicking biological behaviors of the human nervous system [16]. The human brain is composed of a vast number of neurons, which correspond to the nodes, the fundamental elements of an ANN. The ANN is an excellent tool for data analysis [17].
Each ANN contains many nodes that communicate with each other by sending signals through links called synaptic connections. In general, there are three kinds of nodes in the neural system [18]: input nodes, which receive data; output nodes, which send data as the system output; and hidden nodes, whose input and output signals remain within the system.
(a) Formal node
In statistics, a formal node is a parameterized nonlinear algebraic function with bounded values [19]:
$$y=f\left(X_1, X_2, \ldots, X_n ; w_1, w_2, \ldots, w_p\right)$$
where, $X_i$ is the model input; $\mathrm{w}_{\mathrm{i}}$ is the model parameter; f is the activation function of the node [20].
Figure 1 compares the structure of an artificial node with that of a human neuron. The artificial node (an elementary processor) receives some inputs from upstream nodes. Each input is associated with a weight W, which represents the strength of the connection. The node applies an output function, and its output is passed on to other similar nodes [21].
The formal node is the essential element of a neural network. It is a straightforward mathematical operator whose numerical value can easily be calculated. A typical node has N inputs, each with a synaptic weight (Figure 2), an output function, and an output that in turn serves as an input to other similar nodes [22]. The output is a nonlinear function f of a linear combination of the inputs ($X_i$). The most frequently used potential v is the sum of the inputs $X_i$ weighted by the coefficients ($w_i$), also called connection weights [23].
A formal node can then be characterized as follows [24]: the input signals are received through a set of synaptic connections, defined by synaptic weights that determine the effect of the signal from node i on node j. In addition, a combination function, or adder, computes the weighted sum of the activations arriving at node j. This weighted sum can be expressed as:
$$v_j=\sum_{i=1}^{N} W_{i j} X_i$$
where, $W_{i j}$ is the synaptic weight; $X_i$ is the input value. An activation (or transfer) function φ then determines the output of the node from this weighted sum [25]. It computes the state of the node, called its activation $X_j$, which is transmitted to the downstream nodes:
$$X_j=\varphi\left(\sum_{i=1}^{N} W_{i j} X_i+\theta_j\right)$$
where, $\theta_j$ is the bias of node j, a kind of local weight that acts as an inhibitory input. The bias makes the network more flexible, because training adjusts it like any other weight and thereby shifts the trigger threshold of the node. It is used with many types of activation functions [26].
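The computation performed by a single formal node can be sketched in a few lines. The function name `node_output`, the additive sign convention for the bias, the choice of tanh as activation, and the numerical values are all assumptions made for illustration.

```python
import math


def node_output(inputs, weights, bias, activation=math.tanh):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    potential = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(potential)


# Hypothetical inputs and weights for a node with three incoming connections.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, 0.1]
out = node_output(inputs, weights, bias=0.1)
print(out)
```

A hidden layer is simply several such nodes evaluated on the same inputs, each with its own weights and bias; the outputs then become the inputs of the next layer.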
(b) Activation function
The activation function introduces nonlinearity into the operation of the node. Its behavior generally falls into three ranges [27]: the node is inactive when the value of the function is below the threshold (the output is often 0 or -1); the node is in a transitional state when the value is around the threshold; and the node is active when the value is above the threshold (the output is often 1).
There are several activation functions, including the hyperbolic tangent, Gaussian, sigmoid, affine, and arc tangent functions. The most popular one is the sigmoid function (Figure 3) [28]:
$$f(x)=\frac{1}{1+e^{-x}}$$
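The three operating ranges of an activation function can be checked numerically on the sigmoid. The sketch below uses a numerically stable formulation; the test points -10, 0, and 10 are arbitrary choices for illustration.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # for x < 0, exp(x) cannot overflow
    return e / (1.0 + e)


# Inactive, transitional, and active ranges of the node.
for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x))
```

Far below the threshold the output is close to 0, at the threshold it equals 0.5, and far above it approaches 1, while `math.tanh` follows the same pattern between -1 and 1.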
3. Results and Discussion
The results of the STATISTICA neural network method are displayed in Table 1. The networks were trained with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [29]. For each topology, the table gives the training algorithm, the error function, the activation functions of the hidden and output layers, and the number of iterations.
We varied the design of the network by changing the number of hidden nodes and the number of training cycles (iterations). The number of hidden nodes was increased gradually from 1 to 15. The results show that the best prediction model for heavy metal concentrations is obtained with eleven hidden nodes for Mn, eight for B, and fifteen for Zn. These configurations give the lowest mean square error and the highest coefficient of determination and, for Mn and Zn, also the largest number of iterations.
The training of the network for boron is shown in Figure 4. Training ends after 102 iterations. The training and test errors converge properly with eight nodes in the hidden layer.
| Metal | Architecture | Training algorithm | Error function | Hidden layer activation function | Output layer activation function | Number of iterations |
|---|---|---|---|---|---|---|
| Zn | 16-15-1 | BFGS | SOS | Tanh | Logistic | 217 |
| B | 16-8-1 | BFGS | SOS | Tanh | Exponential | 102 |
| Mn | 16-11-1 | BFGS | SOS | Logistic | Exponential | 228 |
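| Metal | R² (NHLN=8) | IN (NHLN=8) | RMSE (NHLN=8) | R² (NHLN=11) | IN (NHLN=11) | RMSE (NHLN=11) | R² (NHLN=15) | IN (NHLN=15) | RMSE (NHLN=15) |
|---|---|---|---|---|---|---|---|---|---|
| B | 0.99 | 102 | 0.0001 | 0.98 | 199 | 2.07 | 0.95 | 81 | 5.96 |
| Mn | 0.93 | 75 | 4.25 | 0.99 | 228 | 0.003 | 0.96 | 123 | 1.13 |
| Zn | 0.97 | 58 | 0.98 | 0.94 | 12 | 2.98 | 0.99 | 217 | 0.001 |

NHLN: number of hidden-layer nodes; IN: number of iterations.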
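The selection of the hidden-layer size can be reproduced from the values in the table above. The sketch below picks, for each metal, the number of hidden nodes with the lowest RMSE; the selection rule and the variable names are assumptions for illustration, not the STATISTICA procedure.

```python
# (R², iterations, RMSE) for each hidden-layer size, copied from the table above.
results = {
    "B": {8: (0.99, 102, 0.0001), 11: (0.98, 199, 2.07), 15: (0.95, 81, 5.96)},
    "Mn": {8: (0.93, 75, 4.25), 11: (0.99, 228, 0.003), 15: (0.96, 123, 1.13)},
    "Zn": {8: (0.97, 58, 0.98), 11: (0.94, 12, 2.98), 15: (0.99, 217, 0.001)},
}

# Keep the hidden-layer size whose RMSE (third tuple entry) is smallest.
best = {
    metal: min(by_size, key=lambda n: by_size[n][2])
    for metal, by_size in results.items()
}
print(best)  # {'B': 8, 'Mn': 11, 'Zn': 15}
```

Selecting on the highest R² instead gives the same three sizes, which matches the architectures retained for each metal.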
| Model | Quantity | Boron | Manganese | Zinc |
|---|---|---|---|---|
| MLR | R² | 0.17 | 0.22 | 0.4 |
| ANN | Architecture of the selected model | [16-8-1] | [16-11-1] | [16-15-1] |
| ANN (learning) | R² | 0.997 | 0.998 | 0.9998 |
| ANN (learning) | SSE | 0.0001 | 0.003 | 0.001 |
| ANN (test) | R² | 0.996 | 0.997 | 0.997 |
| ANN (test) | SSE | 0.008 | 4.291 | 3.902 |
Training the network for too long causes overlearning (overfitting); in the boron case, this point was reached after 102 iterations.
Compared with the MLR models, the neural network models increase the explained variance by up to 82 percentage points: 77, 82, and 59 points for manganese, boron, and zinc, respectively. The coefficients of determination rise from 0.224 to 0.99, from 0.172 to 0.99, and from 0.40 to 0.99, respectively. The results in Table 2 show that the ANN models are superior to the MLR models: their coefficients of determination are substantially higher (above 0.9), whereas those of the MLR models are low (between 0.17 and 0.40). Subgraphs (a) and (b) of Figure 5, together with Table 2 and Table 3, make this clear.
The coefficients of determination obtained when testing the ANN models are close to those obtained during learning. For the MLR models, in contrast, the coefficients of determination obtained during training differ from those obtained when assessing the validity of the models. This indicates that the relationships between the physico-chemical characteristics of the environment and the concentrations of heavy metals in the waters of the Oued Inaouen catchment area are nonlinear, as is typical of aquatic environments [30], [31].
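The two fit criteria and the reported gains can be made concrete with a short sketch. The definition R² = 1 - SSE/SST is an assumption (the software may instead report the squared correlation between observed and predicted values), and the four observed/predicted pairs are hypothetical.

```python
def sse(observed, predicted):
    """Sum of squared errors between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def r_squared(observed, predicted):
    """Coefficient of determination, taken here as 1 - SSE / SST."""
    mean = sum(observed) / len(observed)
    sst = sum((o - mean) ** 2 for o in observed)
    return 1.0 - sse(observed, predicted) / sst


# Hypothetical observed and predicted concentrations.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(sse(observed, predicted), r_squared(observed, predicted))

# Gain in explained variance (percentage points) from the R² values in the text.
r2_mlr = {"Mn": 0.224, "B": 0.172, "Zn": 0.40}
r2_ann = 0.99
gains = {metal: round(100 * (r2_ann - r2)) for metal, r2 in r2_mlr.items()}
print(gains)  # {'Mn': 77, 'B': 82, 'Zn': 59}
```

The gains are differences between two R² values, so they are expressed in percentage points of explained variance rather than as relative improvements.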
The ANN architecture used for boron prediction is shown in Figure 6. It is a network of topology [16-8-1].
The relationships between the measured and estimated heavy metal concentrations for the ANN and MLR models are shown in subgraphs (a) and (b) of Figure 5. The effectiveness of the ANN technique is confirmed for manganese and for the other two metals (boron and zinc), all of which show a high coefficient of determination of 0.99. Contrasting these models with those produced by MLR shows that the ANN models are more effective, as is readily seen in subgraphs (a) and (b) of Figure 5.
4. Residual Studies
The relative error is the error made by the models created using each method on a specific member of the sample used to build the model [32].
Examining the residuals ($Y_i - Y_i'$) is essential [33]. First, it enables us to spot outliers or observations that weigh heavily in determining the regression. Second, reviewing the residuals is often the only way to test the assumptions of the model empirically.
The following figures show the relationships between the heavy metal values estimated by the neural network and MLR models and their residuals. These graphs demonstrate that, for the three metals under investigation, the residuals of the neural network models are less dispersed [34] (closer to zero) than those of the MLR models. This demonstrates once more that the method based on ANNs is more effective than the MLR-based method, which is typically used to develop linear models for predicting the heavy metal content of surface water from physico-chemical parameters.
Figure 7 shows the relationships between the heavy metal levels predicted by the neural network (NN) and MLR models and their residuals.
The analysis of the residuals, which represent the differences between the estimated and observed values, shows that the models based on neural networks are more efficient than those generated using MLR for the three metals studied. This indicates a nonlinear relationship between the surface water heavy metal levels and the physico-chemical parameters examined.
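A residual check of this kind can be sketched as follows. The two-standard-deviation cut-off, the function names, and the six observed/predicted pairs are assumptions chosen for illustration; they are not taken from the study.

```python
import statistics


def residuals(observed, predicted):
    """Residual Y_i - Y_i' for each observation."""
    return [o - p for o, p in zip(observed, predicted)]


def flag_outliers(res, k=2.0):
    """Indices whose residual lies more than k standard deviations from the mean."""
    mean = statistics.mean(res)
    sigma = statistics.pstdev(res)
    return [i for i, r in enumerate(res) if abs(r - mean) > k * sigma]


# Hypothetical data: the last prediction is far from its observed value.
observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
predicted = [1.1, 1.9, 3.0, 4.1, 4.9, 9.0]
res = residuals(observed, predicted)
print(flag_outliers(res))  # [5]
```

The dispersion of the residuals around zero, measured here by `statistics.pstdev`, is the quantity that a plot of residuals against estimated values shows visually.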
5. Conclusions
This study used MLR and ANNs to estimate the concentrations of heavy metals (manganese, boron, and zinc) in the surface waters of the Inaouen watershed from their physico-chemical parameters.
Both statistical methods have benefits and drawbacks. Neural networks are more effective, but they do not explain the relationship between the input and output data explicitly. MLR does yield explicit equations linking the explanatory input variables to the output variables, but it performs poorly here because the relationship between the independent and dependent variables is not linear.
The mean square errors calculated in this study show that a three-layer architecture, with an input layer, a hidden layer, and an output layer, is suitable for building ANN prediction models for the metals studied in the surface waters of the Inaouen catchment area. They also show that, with a backpropagation algorithm and supervised learning, a hidden layer of 15 nodes for zinc, 11 for manganese, and 8 for boron produces more accurate models than the other hidden-layer sizes tested.
ANNs showed strong learning and predictive capabilities for the three heavy metals studied in the surface waters of the Inaouen watershed.
Looking ahead, the following ideas can help develop this topic further:
• Work with different network types, such as recurrent networks;
• Use other activation functions;
• Try extrapolating the effective ANN models developed for the surface waters of the Inaouen watershed to predict the contents of these metals in the surface waters of other Moroccan watersheds;
• The outcomes encourage further work that builds on what has been done so far. It would be very interesting, for instance, to test other ANN architectures to see which yields the best results, or to investigate the possibility of predicting the three parameters simultaneously.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.