Pollution and climate change are among the biggest problems facing humanity today. Pollution is not a new phenomenon, and it remains a leading cause of disease and death. Mining, industrialization, exploration and urbanization have caused global pollution, whose burdens are shared by developed and undeveloped countries alike. Awareness and stricter laws in the developed countries have contributed to environmental protection. Although all countries have paid attention to pollution, the impact and severity of its long-term consequences are still being felt. There is a cause-and-effect link between the pollution of air, water and soil and the state of the environment. This research aimed to show that the main function of the philosophy of science is to provide a functional understanding of knowledge, which views knowledge as a tool for prediction. Prediction is the function or mission of science, the goal that must be achieved if the scientific project is to be successful. In other words, prediction is the final harvest of description and interpretation. In addition, science is primarily concerned with the prediction of events that occur in the universe. A mature prediction is what science provides to validate scientific models. This paper introduced the use of machine learning techniques to enhance the results of the prediction process. A pollution data set and data on the negative effects of polluted air were used. We built, trained and tested various models in order to find the optimal model, which could enhance the results of the prediction process.
Pollution, also called environmental pollution, is the presence in the environment of harmful substances that have a poisonous effect. These harmful substances are called pollutants, and they may cause health problems for humans and other living creatures.
More formally, pollution can be defined as the emergence of substances that surround living organisms and eventually cause harm to them [1].
Animal waste is one of the most useful fertilizers for soil. However, if animal waste is disposed of in drains, it can lead to diseases and epidemics [2].
Humankind is known to be the primary cause of pollution. The sharp rise in pollution is blamed on a list of causes, including the industrial revolution, technological advancements, misuse of natural resources, and population growth. Pollution is responsible for many deaths. It has been stated that pollution causes 50,000 deaths annually, accounting for about 2% of the total [3], [4].
Tobacco or cigarette smoke kills about three million people annually. If smoking continues, the death toll is expected to rise to 10 million annually [5].
Air pollution is the presence of harmful substances in the air, and its sources can be classified into two groups [6], [7]. The first group consists of natural sources of pollution, such as dust. The second group includes industrial sources, car emissions, and fuel-driven electric power generators, which emit large amounts of harmful fine particles and gases into the air [8].
Large industrial cities are among the areas most affected by air pollution. In addition, the developed countries do not have the capability to eliminate pollution. Some of the elements that cause air pollution are shown in Figure 1 [9] and discussed in the list below.

1) Fine particles, generated by petroleum-driven vehicles, factories that release unfiltered smoke of chemicals, and desert dust.
2) Carbon dioxide, with factories as its primary source.
3) Nitrogen dioxide, generated from burning fuels.
4) Ozone, a gas composed of three atoms of oxygen, exists both in the upper atmosphere of the earth and at ground level. Ground-level ozone is harmful to people and the environment, and is the main ingredient in “smog” [10].
5) Carbon monoxide, generated from automobile emissions in highly populated cities, bush fires and volcanoes.
6) Cigarette smoke, which produces fine Particulate Matter (PM), is the most dangerous element of air pollution for health (Medical News Today).
7) Lead, the most toxic metal, is considered a major pollutant. Lead content exceeds 64,000 parts per million in the dust of houses and exceeds 3,000 parts per million in the air of streets.
Polluted air is air that contains large quantities of pollutants, causing harm to humans, animals, plants or materials. Today, air pollution is a major killer. In 2015, pollution caused 6.4 million deaths globally, 2.8 million of which were caused by household air pollution [11]. The figures below, for the year 2015, show the percentage of global deaths from each cause that was attributed to air pollution [11]:
• 19% of all cardiovascular deaths
• 24% of ischemic heart disease deaths
• 21% of stroke deaths
• 23% of lung cancer deaths
In addition, polluted air causes neurodevelopmental disorders in children [12], [13] and neurodegenerative diseases in adults [14].
Given the large economic and human losses caused by air pollution, environmental prediction and forecasting technology is needed.
Air pollution must be brought under control; otherwise, by 2030 the air will become so polluted that it will be necessary to use an oxygen kit to breathe easily. Increasingly serious air pollution will also lead to premature negative effects, and human exposure to air toxins will increase to a large extent if air pollution is not controlled. Analysis of historical data sets showed that Pollution Parameter (PM) levels were associated with negative effects. To forecast future target levels, this paper used a Regression Analysis Model (RAM) and Artificial Neural Networks (ANNs) with deep learning as efficient prediction tools and techniques [15], [16].
The RAM was used to formulate the association between the PMs and the negative pollution effects based on Eq. (1):
$$t = C_0 + C_1 a_1 + C_2 a_2 + \dots + C_n a_n + \epsilon \qquad (1)$$
where $t$ is the response (dependent variable, observation), $a_1, \dots, a_n$ are the predictors (independent or explanatory variables), $C_d$ ($d = 0, \dots, n$) are the regression coefficients, and $\epsilon$ is the random error (noise).
The regression coefficients were calculated using MATLAB software, and the resulting equations were then applied to calculate the target values.
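The fitting step can be sketched in a few lines. The example below is a minimal illustration in Python, not the MATLAB code used in this paper. It assumes that the regression model is linear with an intercept term and is fitted by ordinary least squares through the normal equations. The function names `fit_linear_regression` and `predict` and the toy data are hypothetical.

```python
def fit_linear_regression(rows, targets):
    """Least-squares fit of t = c0 + c1*a1 + ... + cn*an.

    rows: list of samples, each a list of n predictor values.
    targets: list of observed responses, one per sample.
    Returns the coefficients [c0, c1, ..., cn].
    """
    # Design matrix with a leading 1 for the intercept (an assumption).
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations (X^T X) c = X^T t, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coeffs, row):
    """Evaluate the fitted equation for one sample."""
    return coeffs[0] + sum(c * a for c, a in zip(coeffs[1:], row))


# Toy data generated from t = 1 + 2*a1 - 3*a2 without noise (illustration only).
samples = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
observed = [1 + 2 * a1 - 3 * a2 for a1, a2 in samples]

coeffs = fit_linear_regression(samples, observed)
print(coeffs)
print(predict(coeffs, [3, 2]))
```

With several targets, the same fit would simply be repeated once per target, giving one equation for each.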
Deep Machine Learning (DML) is an Artificial Intelligence (AI) technique (Figure 2) that can be used for various applications, including prediction. With DML, prediction results improve with more data, larger models and more computation [17], [18].
ANN models can improve prediction [19] by increasing the number of neurons in the input layer [20] or by adding extra hidden layers [21], both of which lead to more computation. The Mean Square Error (MSE) between the targets and the calculated outputs is shown in Figure 3.


This paper selected a pollution input data set composed of a 2D matrix with 8 rows and 508 columns. Each column held one observation of the PM values, such as temperature, relative humidity, carbon monoxide, sulfur dioxide, nitrogen dioxide, hydrocarbons and ozone. The target data was a 2D matrix with 3 rows and 508 columns, and each row represented the values of one negative effect of pollution, such as total mortality, respiratory mortality or cardiovascular mortality. Figure 4 shows samples of the input data set.
MATLAB code was used to calculate the regression coefficients. The regression equation for each of the three targets was then used to calculate the predicted outputs.
This paper created and tested various Feed Forward ANNs (FFANNs) and calculated the MSE for each FFANN in order to select the optimal FFANN (the one with the minimum MSE). MATLAB code was used to create, train and test the ANNs with various architectures (Figure 5).
This paper also created and tested various Cascade Feed Forward ANNs (CFFANNs) and calculated the MSE for each CFFANN in order to select the optimal CFFANN (the one with the minimum MSE). MATLAB code was used to create, train and test the ANNs with various architectures (Figure 6).



MATLAB code was written and run on the input data set to find the regression coefficients, which are shown in Figure 7. The coefficients were then used to form the equation for each target. Figure 8 shows the error plot between the targets and the predicted outputs. The maximum and minimum errors (Table 1) were large in magnitude, and thus the MSE between the targets and the predicted outputs, calculated with Eq. (2), was also high:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(t_i - \hat{t}_i\right)^2 \qquad (2)$$
where $(t_i-\hat{t}_i)^2$ is the square of the difference between the actual and predicted values of observation $i$, and $n$ is the number of observations.


Target | t1 | t2 | t3 |
Max. error | 51.9156 | 20.3733 | 31.6502 |
Min. error | -32.9672 | -5.1315 | -16.5542 |
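The error extremes of the kind reported in Table 1 and the MSE can be computed as in the following Python sketch. It assumes the usual definition of the MSE as the mean of the squared differences between targets and predictions. The numbers are toy values, not data from this study, and the function name `mean_squared_error` is hypothetical.

```python
def mean_squared_error(targets, predictions):
    """Mean of the squared differences between targets and predictions."""
    if not targets or len(targets) != len(predictions):
        raise ValueError("need two non-empty sequences of equal length")
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Toy values for illustration only (not taken from the pollution data set).
targets = [10.0, 12.0, 9.0, 15.0]
predictions = [11.0, 10.0, 9.0, 18.0]

errors = [t - p for t, p in zip(targets, predictions)]  # [-1.0, 2.0, 0.0, -3.0]
max_error = max(errors)                                  # 2.0
min_error = min(errors)                                  # -3.0
mse = mean_squared_error(targets, predictions)           # (1 + 4 + 0 + 9) / 4 = 3.5

print(max_error, min_error, mse)
```

Because the errors are squared before averaging, a large positive maximum and a large negative minimum both inflate the MSE.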
To decrease the MSE value, this paper created, tested and implemented various more complex FFANN and CFFANN models. Figure 9 shows the errors obtained using the CFFANN with 5 layers (8-16-8-9-3 neurons), with a linear activation function for the output layer and the tansig activation function for all other layers. Figure 10 shows the errors obtained using the FFANN with the same architecture.
According to Figure 9 and Figure 10, when the CFFANN was used for prediction purposes, the MSE value rapidly decreased, thus enhancing the prediction results.
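The structural difference between the two network types can be illustrated with a short Python sketch. It assumes that a cascade-forward network feeds every layer with the original inputs plus the outputs of all earlier layers (the usual definition, as in MATLAB's `cascadeforwardnet`), that the 8 input parameters feed the first 8-neuron layer, and that tansig is computed as the hyperbolic tangent. The weights are random and untrained, so the sketch shows only how signals flow and how many parameters each network has, not the results reported in this paper. All function names are hypothetical.

```python
import math
import random


def build(n_features, layer_sizes, cascade, seed=0):
    """Create random (untrained) weights and biases for each layer."""
    rng = random.Random(seed)
    layers, width = [], n_features
    for n in layer_sizes:
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(width)] for _ in range(n)]
        biases = [rng.uniform(-0.5, 0.5) for _ in range(n)]
        layers.append((weights, biases))
        # Cascade: the next layer also sees everything seen so far.
        width = width + n if cascade else n
    return layers


def forward(layers, x, cascade):
    """Propagate one sample; tanh on all layers except the linear output layer."""
    signal = list(x)
    for index, (weights, biases) in enumerate(layers):
        is_output = index == len(layers) - 1
        sums = [sum(w * v for w, v in zip(row, signal)) + b
                for row, b in zip(weights, biases)]
        out = sums if is_output else [math.tanh(s) for s in sums]
        # Cascade: append this layer's output to the running signal.
        signal = signal + out if cascade else out
    return out


def count_parameters(layers):
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


sizes = [8, 16, 8, 9, 3]   # the 8-16-8-9-3 architecture
sample = [0.1] * 8         # one hypothetical observation of the 8 parameters

ffann = build(8, sizes, cascade=False)
cffann = build(8, sizes, cascade=True)

ff_out = forward(ffann, sample, cascade=False)   # 3 output values
cf_out = forward(cffann, sample, cascade=True)   # 3 output values
ff_params = count_parameters(ffann)              # 463
cf_params = count_parameters(cffann)             # 1127

print(ff_params, cf_params)
```

Under these assumptions, the cascade variant has more than twice as many adjustable parameters for the same layer sizes, which may help explain both its lower MSE and its higher computational cost.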
Table 2 supports the following observations [22], [23]:
- Use of the ANN enhanced the prediction results, compared with regression analysis.
- The CFFANN gave better results than the FFANN for every ANN architecture.
- Although the DML process required a lot of time (minutes and in some cases hours), it was important and vital in improving the results of the prediction process, taking into account that this process was only carried out once.
- Increasing the number of layers decreased the MSE.
- It was better to use a 2-layer ANN with a large number of neurons in the input layer. Thus, this paper increased the number of neurons in the input layer, which decreased the MSE for both types of ANN, the FFANN and the CFFANN. This can be seen in Figure 11 [19], [20].


Several experiments were carried out to check the effects of applying the DML. Various compacted ANN models were created and tested. Table 2 shows the experimental results obtained [23], [24]:
Layer 1 neurons | Layer 2 neurons | Layer 3 neurons | Layer 4 neurons | Output layer neurons | MSE (FFANN) | MSE (CFFANN) |
8 | - | - | - | 3 | 0.000806971 | 0.000744657 |
16 | - | - | - | 3 | 0.000528262 | 0.000434399 |
24 | - | - | - | 3 | 0.000278335 | 0.000240503 |
32 | - | - | - | 3 | 0.000190689 | 0.000185831 |
64 | - | - | - | 3 | 4.45135e-005 | 4.37043e-005 |
128 | - | - | - | 3 | 1.66451e-007 | 9.27851e-008 |
8 | 9 | - | - | 3 | 0.000452378 | 0.000279326 |
8 | 16 | - | - | 3 | 0.000279576 | 0.00012127 |
8 | 16 | 9 | - | 3 | 0.000132483 | 3.82815e-005 |
8 | 16 | 8 | 9 | 3 | 0.000101754 | 9.09967e-006 |
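The comparisons drawn from Table 2 can be checked mechanically. The Python sketch below copies the MSE values from the table, keyed by the sizes of the layers before the 3-neuron output layer, and then finds the architecture with the lowest MSE and the ratio between the two network types. The variable names are hypothetical.

```python
# Layer sizes before the 3-neuron output layer: (FFANN MSE, CFFANN MSE), from Table 2.
table2 = {
    (8,): (0.000806971, 0.000744657),
    (16,): (0.000528262, 0.000434399),
    (24,): (0.000278335, 0.000240503),
    (32,): (0.000190689, 0.000185831),
    (64,): (4.45135e-05, 4.37043e-05),
    (128,): (1.66451e-07, 9.27851e-08),
    (8, 9): (0.000452378, 0.000279326),
    (8, 16): (0.000279576, 0.00012127),
    (8, 16, 9): (0.000132483, 3.82815e-05),
    (8, 16, 8, 9): (0.000101754, 9.09967e-06),
}

# True if the CFFANN MSE is below the FFANN MSE for every architecture.
cffann_always_better = all(cf < ff for ff, cf in table2.values())

# Architecture and network type with the smallest MSE overall.
best_arch, best_type, best_mse = min(
    ((arch, name, mse)
     for arch, pair in table2.items()
     for name, mse in zip(("FFANN", "CFFANN"), pair)),
    key=lambda item: item[2],
)

# How many times larger the FFANN MSE is than the CFFANN MSE.
ratio = {arch: ff / cf for arch, (ff, cf) in table2.items()}

print(cffann_always_better)
print(best_arch, best_type, best_mse)
print(ratio[(8, 16, 8, 9)])
```

On these values the lowest MSE belongs to the CFFANN with 128 neurons in the first layer, and the gap between the two network types is widest for the deepest architecture, where the FFANN MSE is roughly eleven times the CFFANN MSE.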

When the regression models were used, the negative effects of the pollution parameters were easily predicted. These models required little time, but the resulting MSE was always high. Therefore, this paper used the DML models to enhance the prediction results. Using ANNs with more complex architectures increased the computation and rapidly decreased the MSE between the targets and the predicted outputs. The results showed that the best choice was the CFFANN with two layers, because increasing the number of neurons rapidly decreased the MSE, making the prediction tool more accurate.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.
