The advent of powered flight, pioneered by the Wright brothers, led to a significant surge in the use of aircraft for human transportation. In its nascent stages, this mode of transport was associated with a high frequency of accidents and consequent fatalities, placing it in the high-risk category. To counter these risks, the International Civil Aviation Organization (ICAO) was established in 1947 as a collaborative effort among numerous countries, with the primary goal of enhancing aviation safety regulations. This study analyzed archival data from the Bureau of Aircraft Accidents Archives (B3A) covering the period from 1918, the year of the first commercial airplane crash, to 2020. The objective was to understand the ICAO's impact on accident rates, fatalities, and underlying causes. The analytical methods comprised descriptive statistics (examining data distribution, central tendencies, and category frequencies) and exploratory data analysis (EDA) to identify relationships between variables and outliers. The results indicated that the ICAO's interventions have led to a notable decline in accident rates, with the average annual rate of accidents attributed to technical factors falling by 70.9%. However, an unexpected trend was the increase in fatalities despite the drop in accident numbers, attributable to the introduction of larger aircraft designs carrying more passengers per flight. The findings underscore the ICAO's successful efforts in reducing aircraft accidents, but also suggest a need for further exploration of the factors contributing to the rise in fatalities.
The aviation industry, particularly commercial operations, has seen a consistent rise in air traffic volume and frequency over recent decades, underscoring the transformation of airplanes into a predominant means of transportation. According to data released by the International Civil Aviation Organization (ICAO), approximately 4.397 billion passengers travelled by air in 2019. Moreover, projections suggest a yearly growth rate of about 1.8% in air travel demand over the coming two decades. This sustained interest in air travel can be attributed to its superior safety record compared with other forms of transportation. Evidence supporting this comes from the Bureau of Transportation Statistics, which documented the lowest accident percentage for air transportation across the 2004–2019 timeframe. This safety record is largely credited to advancements in aircraft design, technology, and safety standards, along with the oversight of safety organizations such as the ICAO.
Despite aviation's exemplary safety record, it remains crucial to analyze past aircraft accidents to prevent recurrence and enhance overall aviation safety. Because commercial flights cross national borders, preventing recurrence calls for an international organization that coordinates safety rules. This research therefore focuses on assessing the impact of one such organization, the ICAO, on preventing recurring aviation accidents and improving aviation safety.
Further improvements in aviation safety and accident prevention may be achieved through an examination of historical accident data. This allows for a comprehensive understanding of statistical data characteristics and the extraction of concealed knowledge. Most existing commercial aircraft accident analyses are geographically limited, making them less effective given the transnational and intercontinental nature of commercial flight activities. Previous methods also relied heavily on expert experience, which can sometimes lead to user errors during data analysis. Moreover, the absence of formal accident investigation processes and the disregard for loss of life have resulted in substandard accident analyses. Newer methods, such as machine learning, are constrained by the hidden knowledge and insights in the data. To address these issues, it is proposed to adopt analytical methodologies such as descriptive statistics and exploratory data analysis (EDA).
Considering that the aviation accident data to be analyzed is in tabular form, employing EDA seems most fitting as it scrutinizes each data observation per variable (column). The goal of this study is to unravel variable relationships and analyze shifts in fatalities and commercial aircraft accidents pre- and post-ICAO establishment, thereby gauging its impact on aviation safety. This will help ascertain if the ICAO, as an international civil aviation safety organization, is achieving its main objectives.
This study has been structured as follows: The first section provides the introduction and outlines the study objectives; the second section discusses the methodology, detailing the strategies adopted to fulfill the study objectives, in addition to elaborating on data collection, processing, and EDA implementation procedures; the third section presents the results derived from the methodology's application and the subsequent analysis process; the final section furnishes the conclusions drawn from the study.
The Bureau of Aircraft Accidents Archive (B3A) database served as the data source for this investigation. A meticulous analysis was performed on the B3A database to ascertain the ICAO's influence on enhancing transportation safety. A flowchart representing the stages of study is depicted in Figure 1.
The initial phase involved defining the purpose of the analysis. By defining the objective of the data analysis, appropriate variables can be selected to facilitate drawing conclusions and testing hypotheses or research questions. The intention of this study was to scrutinize the ICAO's impact on aviation safety, accomplished by examining fluctuations in accident count, fatalities, and material losses over time.
The subsequent stage encompassed data collection. This phase involves gathering and measuring variable-related information to analyze data procured from diverse sources. Next came data preprocessing, a phase intended to rectify inconsistencies and anomalies present in raw data, such as outliers and null values. Given its significance in problem rectification, data preprocessing is considered an essential precursor to the data analysis process. In this investigation, three data preprocessing techniques were employed: data cleaning, data transformation, and data reduction.
The data cleaning process was implemented via the Python programming language's Pandas library to identify missing values. Given that accident data represents factual information and cannot be estimated, any observation row data exhibiting a null value was eliminated. The data transformation phase involved converting raw data from character (object) to numeric format, significantly simplifying the EDA analysis process. Data reduction was the final preprocessing step, where insignificant raw data variables (from 15 variables) were discarded to streamline the analysis process. This process was conducted using a statistical approach, specifically examining the correlation value between variables.
After data preprocessing, the analysis was carried out using two distinct methods: descriptive statistics and exploratory data analysis (EDA). Descriptive statistics involves simplifying and summarizing data with statistical calculations, presenting it in an attractive, helpful, and easily comprehensible manner. Data description techniques vary depending on the data type. Variables with quantitative data use measures of frequency, central tendency, dispersion, and kurtosis and skewness. Variables with qualitative data, on the other hand, are described by calculating the frequency of occurrence and the proportion or percentage of each value category.
EDA aims to comprehend data structure and patterns, serving multiple analytical purposes, including identifying outliers, detecting data deviations, determining crucial variables, suggesting hypotheses, and uncovering hidden patterns. Statistical analyses or graphic representations can be deployed to address hypotheses built with EDA. Specific techniques, such as measures of frequency, central tendency, and dispersion, were applied to the quantitative data in the dataset, while the calculation of frequency and proportion was applied to the qualitative data. Graphical representations were also generated. All these stages were conducted using the Python programming language.
The analysis aimed to address the research question by characterizing commercial aircraft accident data through statistical calculations and EDA, in order to assess changes in commercial aircraft accidents and fatalities before and after the establishment of the International Civil Aviation Organization (ICAO). The initial hypothesis posited a decline in the number of accidents and fatalities involving commercial aircraft after the ICAO's establishment, owing to the ICAO's standardization of international civil aviation rules and conditions.
Data, observed from 1918 to 2020, were collected in December 2020 from the website https://www.baaa-acro.com/ and preserved in CSV format.
The aircraft accident data underwent several preprocessing steps to prepare it for subsequent analysis. The data was loaded into Jupyter notebook software and stored in a data frame consisting of 592 rows and 15 columns. The variable columns were renamed to facilitate code readability, with each variable character changed to lowercase and the space between words replaced with an underscore. Missing values were checked and found to be absent in the data frame, obviating the need for deletion. Non-commercial flight accident data was removed by eliminating rows with flight types other than scheduled revenue flights, air/taxi, cargo, and postal. This step reduced the number of rows to 110. Rows with unknown causes of accidents were also removed, resulting in 91 rows of data. Finally, variables focusing on answering research questions were selected, namely the year, number of deaths, and possible causes. This process reduced the number of columns to three.
As a result of the preprocessing stage, the data frame was reduced to 91 commercial aircraft crash data lines, which we refer to as the dataset. The next step is to implement descriptive statistical and EDA methods.
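The preprocessing steps described above can be sketched with Pandas as follows. This is a minimal illustration rather than the authors' code: the column names, the flight-type and cause labels, and the five sample rows are hypothetical stand-ins for the B3A export.

```python
import pandas as pd

# Hypothetical raw records; column names and values are illustrative only.
raw = pd.DataFrame({
    "Year": [1925, 1943, 1985, 2009, 1950],
    "Flight Type": ["Scheduled Revenue Flight", "Military", "Cargo", "Postal", "Private"],
    "Total Fatalities": ["4", "12", "329", "2", "1"],
    "Probable Cause": ["Technical failure", "Unknown", "Human factor", "Unknown", "Weather"],
})

# Assumed labels for the commercial flight types named in the text.
COMMERCIAL_TYPES = {"Scheduled Revenue Flight", "Air/Taxi", "Cargo", "Postal"}


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Rename columns: lowercase, spaces replaced with underscores.
    out.columns = out.columns.str.lower().str.replace(" ", "_", regex=False)
    # Data cleaning: drop any row that contains a missing value.
    out = out.dropna()
    # Keep only commercial flight types.
    out = out[out["flight_type"].isin(COMMERCIAL_TYPES)]
    # Remove accidents whose cause is unknown.
    out = out[out["probable_cause"] != "Unknown"]
    # Data transformation: convert fatalities from text (object) to numeric.
    out = out.assign(total_fatalities=pd.to_numeric(out["total_fatalities"]))
    # Data reduction: keep only the three variables used in the analysis.
    return out[["year", "total_fatalities", "probable_cause"]].reset_index(drop=True)


dataset = preprocess(raw)
print(dataset)
```

In this toy input, only the 1925 and 1985 rows survive the two filters, mirroring on a small scale how the real data frame shrank from 592 rows to 91.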
In this segment, two quantitative variables within the dataset, ‘year’ and ‘fatalities’, were explored using a descriptive statistical analysis approach, encompassing frequency measurement, measures of central tendency, dispersion, and assessments of skewness and kurtosis.
The ‘year’ variable, indicating the year of the accident, underwent frequency measurement to ascertain the prevalence of particular years in the dataset. The analysis revealed fifty distinct accident years within the dataset, spanning from 1918 to 2020. The year marked by the highest frequency of incidents was 1943, with five recorded accidents. This was followed by years 1944, 1936, 1938, 1925, and 1948, each accounting for four accidents. Further detail regarding the frequency measurement of the ‘year’ variable is depicted in Figure 2.
Moreover, the ‘total fatalities’ variable was assessed for the frequency of fatalities per accident, per year, as illustrated in Figure 3. The data revealed 34 unique categories of fatalities.
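A frequency measurement of this kind could be obtained with Pandas as sketched below. The seven sample years are hypothetical and serve only to show the calls; they are not the study's data.

```python
import pandas as pd

# Hypothetical accident years (illustrative values, not the B3A records).
years = pd.Series([1943, 1943, 1936, 1943, 1925, 1936, 1985], name="year")

# Number of accidents recorded in each year, most frequent first.
year_counts = years.value_counts()

# Number of distinct accident years and the year with the most accidents.
n_distinct_years = years.nunique()
most_frequent_year = year_counts.idxmax()

print(year_counts)
print(n_distinct_years, most_frequent_year)
```

The same two calls applied to the ‘fatalities’ column would give the number of unique fatality values and how often each occurs.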
Upon completion of frequency measurements for both variables, the central location of the data distribution was determined. Parameters such as the mean, median, and mode were considered for this task. The choice of appropriate parameters required an understanding of the data distribution type for each variable, with three possibilities: symmetrical, positively skewed, and negatively skewed.
Symmetrical distribution can be identified when mean, median, and mode values are equal. Positively skewed distribution exhibits a median value less than the mean value, an asymmetric distribution, and a tail curving towards the right. Negatively skewed distribution, conversely, is characterized by a median value exceeding the mean value, an asymmetric distribution, and the tail of the curve leaning leftwards.
This analysis was performed using the Python programming language, more specifically, the Pandas library with the “df.describe()” command. The “df” stands for the data frame, which represents the dataset, and the describe() method is employed to summarize central tendency, dispersion, and the shape of the data distribution.
Variables with a symmetrical data distribution could use the mean, median, or mode. However, variables exhibiting a skewed data distribution are best represented by the median, given its resistance to the influence of outliers. The findings from these calculations are presented in Table 1.
Figure 4 displays the data distribution for the variables ‘year’ and ‘fatalities’ with their respective mean, median, and mode values. In subgraph (A) of Figure 4, the ‘year’ variable exhibits a positively skewed data distribution, so the median, which is 1945, serves as the central location. The graph for the ‘fatalities’ variable (subgraph (B) of Figure 4) also demonstrates a positively skewed data distribution. The mean value is positioned to the right of the median value, while the mode value lies to the left of the median. Hence, the median, with a value of 4, signifies the central location of the fatalities distribution.
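The selection rule described above (compare the mean with the median, then prefer the median when the distribution is skewed) can be sketched as follows. The ten fatality values are hypothetical and were chosen only to produce a long right tail; the function name is likewise an illustrative assumption.

```python
import pandas as pd

# Hypothetical fatality counts per accident (illustrative values only).
fatalities = pd.Series([0, 1, 2, 2, 4, 4, 4, 10, 42, 329], name="fatalities")

central = {
    "mean": fatalities.mean(),
    "median": fatalities.median(),
    "mode": fatalities.mode().iloc[0],  # first mode if several values tie
}


def distribution_shape(mean: float, median: float) -> str:
    """Classify the distribution by comparing the mean with the median."""
    if mean > median:
        return "positively skewed"
    if mean < median:
        return "negatively skewed"
    return "symmetrical"


shape = distribution_shape(central["mean"], central["median"])
# For skewed data the median is the more robust central location.
preferred = "mean" if shape == "symmetrical" else "median"

print(central, shape, preferred)
print(fatalities.describe())  # count, mean, std, min, quartiles, max
```

Here the single value of 329 pulls the mean far above the median, which is why the median is the safer summary for a variable such as ‘fatalities’.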
Upon establishing the central location, the dispersion of data around this central point was assessed. Low dispersion implies a concentration of data points around the central location, denoting consistency, while high dispersion indicates inconsistency and a broad spread of data points from the central location. Several parameters, such as the range, interquartile range (IQR), sample variance, and standard deviation, were considered for dispersion measurement. As with the previous measurements, the choice of parameter hinged upon the type of data distribution for each variable. For variables exhibiting a symmetrical distribution, a combination of mean, variance, and standard deviation sufficed. However, for variables with a skewed distribution, the IQR was chosen due to its resistance to outliers. The results of the dispersion measurement for the ‘year’ and ‘fatalities’ variables are presented in Table 2. As previously determined, the ‘year’ variable shows a positively skewed distribution, indicating the presence of outliers. Therefore, the IQR is the appropriate measurement for data dispersion, as it is less affected by outliers. Subgraph (A) of Figure 5 visualizes the data dispersion for the ‘year’ variable using the IQR. The IQR, representing the range between the first (Q1) and third (Q3) quartiles, yielded a value of 22.5, with Q1 and Q3 values of 1937 and 1959.5 respectively. Certain years, such as 1999, 2009, and 2014, were identified as outliers because they fall outside the minimum and maximum distribution limits. In contrast, the ‘fatalities’ variable exhibits an IQR of 16, within a range of 329, as demonstrated in subgraph (B) of Figure 5. However, a substantial number of outliers, ranging from 42 to 329, are observed, indicating a considerable spread from the central location.
Interquartile Range (IQR)
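As a worked check of the ‘year’ figures, the sketch below recomputes the IQR from the reported quartiles and derives outlier limits. It assumes the conventional 1.5 × IQR box-plot rule, which the text does not state explicitly; the function name and the multiplier are therefore assumptions.

```python
def iqr_limits(q1: float, q3: float, k: float = 1.5):
    """Return (IQR, lower limit, upper limit) using the k * IQR rule (assumed k = 1.5)."""
    iqr = q3 - q1
    return iqr, q1 - k * iqr, q3 + k * iqr


# Quartiles reported for the 'year' variable.
iqr, lower, upper = iqr_limits(1937, 1959.5)
print(iqr, lower, upper)  # 22.5 1903.25 1993.25

# Years named as outliers in the text.
flagged = [y for y in (1999, 2009, 2014) if y < lower or y > upper]
print(flagged)  # [1999, 2009, 2014]
```

Under this assumption the upper limit is 1993.25, so all three years named in the text lie beyond it. With a Pandas column, the quartiles themselves could be obtained from `Series.quantile(0.25)` and `Series.quantile(0.75)`.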
The final step in the quantitative analysis involved the evaluation of skewness and kurtosis. Though skewness could be inferred from Figure 4, numerical values for skewness and kurtosis were obtained to verify the findings from the central tendency calculations. The skewness value indicates how the data distribution deviates from a normal distribution. If skewness equals zero, the data distribution is symmetrical. If skewness is either less than -1 or greater than 1, the distribution is highly skewed. If skewness lies between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed. Lastly, if skewness falls between -0.5 and 0.5, the data distribution is approximately symmetrical. A positive skewness value indicates a rightward tilt of the data distribution curve, whereas a negative value suggests a leftward tilt.
The subsequent parameter, kurtosis, indicates the peakedness of the data distribution curve and the thickness of its tails. Based on kurtosis values, data distributions are classified as mesokurtic (kurtosis of 3, as in a normal distribution), platykurtic (kurtosis less than 3, i.e. negative excess kurtosis; a light-tailed curve with fewer outliers), or leptokurtic (kurtosis greater than 3; a heavy-tailed curve with more outliers).
The computation of skewness and kurtosis for the dataset yielded a positive skewness of 0.9979 for the ‘year’ variable, which is moderately skewed by the thresholds above and lies just below the boundary for a highly skewed distribution. This implies a rightward tilt of the distribution curve. The corresponding kurtosis was 0.42503, which is below 3 and therefore suggests a platykurtic distribution with fewer outliers. On the other hand, the ‘fatalities’ variable revealed a skewness of 3.78 and a kurtosis of 15.37, indicating a highly positively skewed and leptokurtic distribution. This suggests a considerable number of outliers in the fatalities data. These findings corroborate the graphical representations in Figure 4 and Figure 5. The respective skewness and kurtosis values are summarized in Table 3.
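The classification thresholds above can be written down as two small helper functions, sketched here with hypothetical names. One caveat is worth making explicit: Pandas' `Series.kurt()` reports excess kurtosis (Fisher's definition, where a normal distribution scores 0), so 3 is added before applying the kurtosis-of-3 convention used in the text. Which convention produced the reported values is not stated, so this is an assumption.

```python
import pandas as pd


def skew_label(skewness: float) -> str:
    """Label skewness using the thresholds quoted in the text."""
    magnitude = abs(skewness)
    if magnitude > 1:
        return "highly skewed"
    if magnitude >= 0.5:
        return "moderately skewed"
    return "approximately symmetrical"


def kurtosis_label(kurtosis: float, normal_value: float = 3.0) -> str:
    """Label kurtosis relative to the value of a normal distribution."""
    if kurtosis > normal_value:
        return "leptokurtic"
    if kurtosis < normal_value:
        return "platykurtic"
    return "mesokurtic"


# Values reported in the text for 'year' and 'fatalities'.
print(skew_label(0.9979), "/", kurtosis_label(0.42503))
print(skew_label(3.78), "/", kurtosis_label(15.37))

# Hypothetical data, only to show the Pandas calls.
sample = pd.Series([0, 1, 2, 2, 4, 4, 4, 10, 42, 329])
sample_skew = sample.skew()
sample_kurt = sample.kurt() + 3  # convert excess kurtosis to the "normal = 3" scale
print(skew_label(sample_skew), "/", kurtosis_label(sample_kurt))
```

If the reported 0.42503 were instead an excess kurtosis, it would be compared against 0 (`normal_value=0.0`) and the label would change, so the convention matters when reproducing Table 3.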
The third variable of consideration, denoted as ‘probable causes’, falls under the qualitative data type. This variable encompasses various categories of causative factors for commercial aircraft accidents, spanning from 1918 to 2020. A computation of the frequency and proportion of each category, based on their association with the accidents, is carried out.
The dataset groups probable causes into six categories: conflict factors, collision with other objects, disappearance without trace, human error, adverse weather, and technical factors. The frequency of occurrence of these probable causes is visualized in Figure 6.
Analysis of Figure 6 reveals that accidents attributed to technical factors exhibited the highest frequency, appearing 40 times. ‘Disappearance without trace’ emerged as the second most frequent cause, recorded 19 times. Conversely, collisions with other objects were the least frequent, occurring only twice. The proportional representation of these causes, illustrated in Figure 7, demonstrates that out of 91 commercial aircraft accidents recorded between 1918 and 2020, 44% were caused by technical factors and 20.9% were due to aircraft disappearing without trace. Adverse weather was a factor in 13.2% of the accidents, while geopolitical conflicts after the Second World War accounted for 11%. Human error and collision with other objects comprised 8.8% and 2.2% of accidents, respectively.
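The proportions quoted above can be reproduced from the category counts, as in the sketch below. The counts for technical factors, disappearance, and collision are stated directly in the text; those for adverse weather, conflict factors, and human error are taken as the sum of the pre- and post-ICAO counts reported later in the paper (5 + 7, 9 + 1, and 5 + 3).

```python
import pandas as pd

# Accident counts per probable-cause category (total of 91 accidents).
cause_counts = pd.Series({
    "technical factors": 40,
    "disappearance without trace": 19,
    "adverse weather": 12,
    "conflict factors": 10,
    "human error": 8,
    "collision with other objects": 2,
})

# Share of each category as a percentage, rounded to one decimal place.
cause_share = (cause_counts / cause_counts.sum() * 100).round(1)
print(cause_share)
```

Starting from the raw column instead of pre-tallied counts, `value_counts(normalize=True)` on the cause column would give the same shares as fractions.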
Following the descriptive statistical analysis, the inherent characteristics of each variable, such as the frequency of occurrence of each data category, the data distribution, and the central location of the data, were identified. Thereafter, the Exploratory Data Analysis (EDA) method was employed to uncover relationships between accidents, fatalities, and probable causes within the timeframe of 1918–2020. Further attention was given to the data in the post-ICAO era (1948 onward), illustrated in diagrammatic visualizations.
Initially, EDA was applied to the ‘year’ and ‘fatalities’ variables to ascertain the number of accidents. This exploration is depicted in Figure 8, where a timeline marker denotes the formation year of the ICAO (1947). The visualization demonstrates a significant reduction in the number of aircraft accidents per annum after the ICAO's establishment. Additionally, it shows a decline in the density of accidents in specific periods. In the pre-ICAO era, 48 accidents led to 226 fatalities, while in the post-ICAO era, despite a decrease in the number of accidents to 43, fatalities increased dramatically to 1,883.
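A pre/post comparison of this kind can be produced by labelling each accident with its era and aggregating, as sketched below. The six rows are hypothetical; only the 1947 cut-off comes from the text, and the column and label names are illustrative.

```python
import pandas as pd

ICAO_YEAR = 1947  # formation year of the ICAO, as stated in the text

# Hypothetical accident records (illustrative values only).
accidents = pd.DataFrame({
    "year": [1925, 1943, 1947, 1948, 1985, 2009],
    "fatalities": [2, 5, 3, 10, 329, 1],
})

# Accidents up to and including 1947 count as pre-ICAO, later ones as post-ICAO.
accidents["era"] = accidents["year"].map(
    lambda y: "pre-ICAO" if y <= ICAO_YEAR else "post-ICAO"
)

# Number of accidents and total fatalities in each era.
era_summary = accidents.groupby("era")["fatalities"].agg(
    accidents="count", fatalities="sum"
)
print(era_summary)
```

Even in this toy data a single large-aircraft accident dominates the post-ICAO fatality total, which is the pattern the real dataset shows.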
The data visualization in Figure 8 supports a reduction in the annual rate of commercial aircraft accidents after the establishment of the ICAO. When the causative factors are examined (Figure 7), technical factors before the ICAO's formation led to 19 accidents over 19 years (1918-1947), which translates to an average of 1 accident per year. In contrast, in the post-ICAO period spanning 72 years (1948-2020), 21 cases were recorded, reducing the average accident rate to 0.291 per year. This corresponds to a decline of almost 70.9% in the average annual number of accidents attributed to technical factors.
Similar patterns are observed for other factors. Conflict factors were responsible for 9 accidents, averaging 0.47 per year, in the pre-ICAO period, but post-ICAO there was only one incident, in 1987, bringing the average down to 0.013 per year. Furthermore, human error led to 5 accidents before the formation of the ICAO, an average of 0.263 per year, which fell to an average of 0.041 per year based on the 3 incidents recorded post-ICAO.
Weather-related incidents followed a similar trend. Prior to the ICAO establishment, 5 cases were reported, averaging 0.263 per year. However, in the post-ICAO period, the number of cases rose to 7, but given the longer duration of 72 years, the average rate fell to 0.097 per year.
In contrast, the category ‘disappearance without trace' saw an increase in the number of incidents from 9 in the pre-ICAO period, with an average rate of 0.47 per year, to 10 in the post-ICAO period. Yet, due to the longer duration of 72 years, the average yearly rate still decreased to 0.138. Lastly, the ‘collision with other objects' category recorded one case each in the pre- and post-ICAO periods, with average accident rates of 0.05 and 0.013 per year, respectively. The reduction in these averages is based on the findings from the EDA using Python programming and is visualized in Figure 9.
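The per-year averages and their relative change can be recomputed from the counts given above, as in this sketch. It uses the same denominators as the text (19 years pre-ICAO and 72 years post-ICAO); the dictionary layout and function name are illustrative assumptions.

```python
PRE_YEARS = 19   # pre-ICAO denominator used in the text
POST_YEARS = 72  # post-ICAO denominator used in the text

# (pre-ICAO accidents, post-ICAO accidents) per probable cause, from the text.
cause_counts = {
    "technical factors": (19, 21),
    "conflict factors": (9, 1),
    "human error": (5, 3),
    "adverse weather": (5, 7),
    "disappearance without trace": (9, 10),
    "collision with other objects": (1, 1),
}


def rate_change(pre: int, post: int):
    """Return (pre rate, post rate, percentage reduction) in accidents per year."""
    pre_rate = pre / PRE_YEARS
    post_rate = post / POST_YEARS
    reduction = (1 - post_rate / pre_rate) * 100
    return pre_rate, post_rate, reduction


rates = {cause: rate_change(pre, post) for cause, (pre, post) in cause_counts.items()}
for cause, (pre_rate, post_rate, reduction) in rates.items():
    print(f"{cause}: {pre_rate:.3f} -> {post_rate:.3f} per year ({reduction:.1f}% lower)")
```

For technical factors this gives a rate of about 0.292 per year and a reduction of about 70.8% without intermediate truncation, in line with the reported figure of almost 70.9%. Every category shows a lower annual rate post-ICAO, including those where the raw count rose.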
It must be noted that the global aviation industry has evolved significantly to meet increasing demand, especially concerning aircraft size. The decline in accident frequency after the establishment of the ICAO can be attributed to advancements in safety systems, technology, and aircraft design, in compliance with ICAO-standardized safety rules and regulations. These measures have led to safer civil aviation operations globally. The late 20th century saw a surge in international travel and a subsequent increase in passenger volume, prompting aircraft manufacturers to scale up aircraft size to cater to this demand. A landmark development was Boeing's introduction of the jumbo jet, the Boeing 747, in 1969. While the larger aircraft could carry more passengers, it also meant that any accident would lead to a larger number of casualties. Evidence of this is seen in 1985, when, despite only one recorded accident, the number of fatalities was significantly higher than in other years.
It is important to acknowledge the limitations of this study. The data, sourced from the official B3A website, comprised 592 observations of accident data spanning 1918–2020. However, owing to the specificity of the research objectives, the dataset was reduced to 91 observations, representing 15.37% of the collected historical accident records.
Through the analytical framework employed in this study, an evaluation was conducted regarding the impact of the International Civil Aviation Organization's (ICAO) establishment. The analysis incorporated an extensive review of aviation accident data, as collected by the Bureau of Aircraft Accidents Archives (B3A), utilizing descriptive statistical methods and Exploratory Data Analysis (EDA).
Complications arose due to the dual presence of qualitative and quantitative data within the dataset, each exhibiting unique characteristics. Descriptive statistical techniques allowed for the extraction of the data's inherent character. On the other hand, to glean insights and discern patterns, the EDA methodology was implemented.
Descriptive statistical analysis yielded a robust dataset, particularly in terms of the ‘year' and ‘probable causes' variables. However, the 'fatalities' variable posed a challenge due to inconsistencies and a notable presence of outliers. Despite the disparate quality of the data across variables, EDA successfully elucidated the interrelation between the three.
Through the EDA, the pivotal role of the ICAO in reducing aviation accidents was highlighted. Notably, a significant decline in accident occurrences was observed following the establishment of the ICAO. More specifically, the average annual rate of accidents caused by technical issues fell by 70.9%.
Interestingly, while the number of accidents decreased, the data revealed a considerable increase in the number of fatalities. This paradoxical trend can be attributed to the evolution of commercial aircraft design, which has been driven by market demand for larger passenger capacities. Therefore, while the number of accidents has been curtailed, those that do occur have the potential to result in a higher number of fatalities due to the larger aircraft sizes.
In summary, the analysis supports the hypothesis that the establishment of the ICAO had a profound impact on commercial aviation safety. It fostered an enhanced sense of trust in air travel and facilitated the growth of the aviation industry. Future research endeavors might benefit from incorporating analysis related to the rules issued by ICAO (Annexes), aligning with the year of issuance. Such an extension would provide a more holistic understanding of the ICAO's impact over time.
Rossi Passarella: Writing- Original draft preparation;
Harumi Veny: Data Curation, Reviewing and Editing, Formal analysis;
Muhammad Fachrurrozi: Conceptualization, Methodology, Supervision, Reviewing Data curation;
Samsuryadi Samsuryadi: Formal analysis;
Marsella Vindriani: Investigation and Editing.
The data used to support the research findings are available from the corresponding author upon request.
The authors would like to express their appreciation to the transport group researchers from the Department of Computer Systems, as well as the ISYRG team at Universitas Sriwijaya and UiTM Shah Alam.
The authors declare no conflict of interest.