Acadlore takes over the publication of IJTDI from 2025 Vol. 9, No. 4. The preceding volumes were published under a CC BY 4.0 license by the previous owner, and displayed here as agreed between Acadlore and the previous owner. ✯ : This issue/volume is not published by Acadlore.
Detection and Classification of Road Damage Based on Water-Filled Pothole Using Convolutional Neural Network Model for Edge Device
Abstract:
Road damage surveys in Indonesia are still conducted manually through visual inspections based on the Surface Distress Index (SDI) method. Consequently, the process often requires extended completion times and yields results that lack objectivity due to heavy reliance on the surveyor's experience. As a result, road repairs frequently do not correspond accurately to the actual damage conditions. Road deterioration intensifies during the rainy season, when water accumulates in potholes, accelerating their erosion and expansion. To facilitate more objective damage assessment, particularly for potholes, a tool employing an image sensor capable of distinguishing between water-filled and dry potholes is necessary. This study utilized an image processing model based on a convolutional neural network employing MobileNet SSD V2. In detecting water-filled potholes, the system achieved a precision of 0.95, a recall of 0.514, and an F1 score of 0.667. Furthermore, performance testing across various vehicle speeds indicated that the optimal speed for the edge device system was an average of 15 km/h, at which the system maintained a precision of 0.95, a recall of 0.514, and an F1 score of 0.667.
1. Introduction
Road infrastructure is a critical component of the well-being of the Indonesian people and an essential element of their daily lives. Effective road networks contribute to regional economic growth [1], which in turn affects social and cultural dimensions and promotes balanced progress throughout Indonesia. Despite these goals, many roads are in poor condition, with defects such as potholes and cracks. These defects not only cause physical harm to road users but also pose a major threat to their lives, especially for riders of two-wheeled vehicles [2]. In Indonesia, potholes are one of the main causes of motorcycle accidents. Riders often lose control when trying to avoid a pothole suddenly, especially at high speeds or in poor lighting conditions [3]. Some national and provincial roads in Indonesia have been categorized as fairly damaged or in very poor condition [4-6]. Potholes, a typical type of road damage, can have serious ramifications, especially for safety, depending on their depth and width [7]. Potholes frequently fill with water, especially during the rainy season, and the water can remain for hours after the rain has stopped [8]. Water causes road deterioration by eroding the soil and asphalt aggregate in the foundation layer, which reduces its strength, expands existing potholes, and may create new ones if not corrected [9].
Road damage assessment for road repairs in Indonesia is currently conducted manually through visual inspection, using the Surface Distress Index (SDI) and Pavement Condition Index (PCI) methods [10]. However, it takes a long time between the survey, planning, and real road maintenance. Indonesia employs four Hawkeye road-survey vehicles [11]. This vehicle system is useful for a variety of surveys and inspections, but the price is not affordable. As a result, it cannot adequately analyze Indonesia's total road conditions. In addition to utilizing the Hawkeye tool, road damage assessment can be conducted through manual visual inspection using the Surface Distress Index (SDI) method. This method involves measuring the width of cracks, the area of cracks, the number of potholes, and the extent of rutting [12]. Based on previous research, it has been observed that during the rainy season, road damage accelerates significantly due to the presence of subsurface water. This subsurface water is often visually indicated by puddles at damaged locations, particularly in the case of potholes. To ensure an objective assessment of road damage, it is necessary to employ an image sensor tool capable of identifying the types of road damage, especially distinguishing between dry potholes and those filled with water.
A variety of studies have explored the identification and categorization of road damage using image processing, laser technology, and accelerometers. Kiran Kumar, for example, studied water-filled potholes using a laser and a camera. The principle is that the laser light is refracted when directed toward the road, and a camera then records the resulting refraction [13]. Rani et al. conducted another investigation with the goal of designing an Advanced Driver Assistance System (ADAS) to improve road safety by detecting anomalies such as potholes and speed bumps. They used the Jetson Nano with MobileNet SSD V2 and achieved a detection accuracy of 60-70% at 20 frames per image [14]. Garcillanosa et al. [15] investigated the detection and reporting of potholes using image processing, a Raspberry Pi, and additional hardware installed on an automobile, such as a camera module and a GPS system. The detection system scans photos to exclude other objects such as sidewalks and pedestrians, and recognizes potholes by the size and color of the object using Canny edge detection. With a total processing time of 0.9967 seconds, the average detection accuracy attained was 93.72%. In 2021, Lee et al. reported research on detecting road surface damage using a smartphone's camera to capture images and the smartphone's accelerometer [16]. The research combines two methods for detecting road damage: image processing and a vibration-based approach. They used a Fully Connected Network with six layers to reduce the computing effort on smartphones. However, the detection results agreed relatively poorly with real conditions.
A CNN is a type of deep neural network commonly used to analyze images, but its weakness is the need for large computing power [17]. A portable, real-time survey requires a device that is small enough to be carried anywhere yet has the required computing power. Therefore, a combination of Raspberry Pi and Coral USB Accelerator is used. The Raspberry Pi is a System on a Module (SoM), a computing module very often used in robotics and artificial intelligence. Because of its small size and low power requirements, the Raspberry Pi is often used for machine learning in embedded systems [18, 19]. The Edge TPU Coral USB Accelerator can significantly accelerate the inference of machine learning models, but not every type of image detection model is supported or can fully exploit the device. One model that has been proven to work and is recommended by the device developers is MobileNet SSD V2, a combination of MobileNet V2 and SSD [20]. MobileNet V2 uses a depthwise convolution layer, placed after an expansion layer, for feature extraction [21]. SSD is a real-time object detection framework that uses a single feed-forward convolutional neural network (CNN) to predict bounding boxes and their corresponding object class probabilities. Unlike YOLO, which makes predictions using one feature map, SSD uses several feature maps of different sizes [22]. Therefore, this study builds a system that detects the location of potholes, specifically water-filled potholes, using a CNN based on Edge TPU on a Jetson Nano with the MobileNet SSD V2 model, and displays their position on a website. The work also yields a portable, easy-to-use device for mobile use.
The design of a pothole detection device has practical utility and is suitable for community implementation. The results of this study are expected to accelerate the dissemination of information to relevant authorities, thereby facilitating timely repair actions.
2. Convolutional Neural Network for Water-Filled Potholes Detection
The dataset of road damage photographs consists of images of roadways with water-filled potholes and dry potholes. It is drawn from Google sources, the FixMyStreet website, and the Roboflow website, and from photos captured manually on Indonesian roads. The manually captured photos were taken in East Java during the day, between 10 am and 1 pm. Dry-pothole photos were taken in the dry season, while water-filled-pothole photos were taken in the rainy season after the rain had stopped. The dataset includes photos from various angles and lighting conditions, as well as potholes of varying sizes. Prior to processing, all photos are resized to 640×640 pixels and saved in JPG format. In total, 4,756 photos are used, split into 70% for training (3,329 photos), 15% for validation (713 photos), and 15% for testing (714 photos). Figure 1 shows an example of the water-filled pothole dataset. The distribution of the dataset labels is shown in Table 1.

Table 1. Distribution of the dataset labels
Label | Dataset Object | Amount | Percentage |
L00 | Dry pothole | 2378 | 50% |
L01 | Water-filled pothole | 2378 | 50% |
Total | | 4756 | 100% |
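The 70/15/15 split described above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name `split_counts` and the choice to round the training and validation sizes down and give the remainder to the test set are assumptions, chosen because they reproduce the reported 3,329 / 713 / 714 counts.

```python
def split_counts(total, train_frac=0.70, val_frac=0.15):
    """Return (train, val, test) sizes for a dataset of `total` images.

    Assumption: train and validation sizes are rounded down and the
    test set receives whatever is left, so the three always sum to total.
    """
    train = int(total * train_frac)
    val = int(total * val_frac)
    test = total - train - val
    return train, val, test


train, val, test = split_counts(4756)
print(train, val, test)  # 3329 713 714
```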
A CNN is a deep neural network that is often used for visual analysis. Neural networks are made up of multiple layers, each of which contains neurons with trainable weights and biases [23]. The TensorFlow Object Detection application programming interface (API) is used to facilitate preprocessing, training, and deployment of object detection models [24]. In this study, a pre-trained model is used to reduce training time and stabilize performance, because the model has already been trained to detect many common real-world objects [25]. In this paper, a CNN- and TensorFlow-based model named MobileNet SSD V2 is used for in-vehicle pothole detection on the Jetson Nano edge device.
MobileNet SSD V2 is an object detection model that contains 267 layers and 15 million parameters [26] and provides real-time inference using only the computing resources of edge devices such as smartphones and the Jetson Nano mini-computer. MobileNet SSD V2 consists of two parts: the base MobileNet V2 network and the SSD layers added on top of it. MobileNet serves as the backbone, originally designed for image classification, whose layers convert image pixels into features that describe the image [27]. The SSD layers then detect the desired objects by producing a bounding box for each object in the image [28]. The MobileNet SSD V2 structure is illustrated in Figure 2 [29].
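The introduction notes that MobileNet V2 relies on depthwise convolutions, which is a large part of why the backbone is light enough for edge devices. As a rough illustration only, the sketch below compares the weight count of a standard convolution with that of a generic depthwise separable convolution (a depthwise filter followed by a 1×1 pointwise filter). The layer sizes are made-up values, biases are ignored, and this is not the exact MobileNet V2 block, which also includes an expansion layer.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1x1 conv that mixes channels
    return depthwise + pointwise


# Illustrative layer sizes (assumed, not taken from the paper).
kernel, c_in, c_out = 3, 32, 64
std = standard_conv_params(kernel, c_in, c_out)
sep = depthwise_separable_params(kernel, c_in, c_out)
print(std, sep, round(std / sep, 2))  # 18432 2336 7.89
```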

Processing on the edge device, as opposed to a centralised server, is intended to enable real-time system responsiveness when integrated with the government monitoring framework. This approach allows road defects that occur on a given day to be detected on that same day. Centralised processing on servers at government offices is more likely to incur processing queues, causing delays in analysing raw data from roads [30].
For this application, the Nvidia Jetson Nano with 4GB RAM was chosen as the leading device due to its compact form factor, efficient power consumption compatible with vehicle electrical systems using a cigarette lighter adapter, and its ability to effectively perform AI calculations [31]. In the system setup, the Jetson Nano is installed inside the vehicle, while the camera is mounted on the hood of the car, facing the road as shown in Figure 3.

To achieve optimal performance in detecting water-filled potholes on the road, it is essential to perform fine-tuning of the model configuration during training. The optimization process involves adjusting several hyperparameters to identify the best combination that maximizes detection accuracy while minimizing loss. In this study, the batch size and optimizer were varied across different configurations to optimize the training model. Batch size refers to the number of training samples—in this case, images—processed together in a single iteration before updating the model's weight parameters [32]. Batch sizes of 2, 4, 8, and 10 were tested. Training processes with batch sizes greater than 10 were terminated due to computational limitations. The optimizer is used during neural network training to adjust the model's weights and biases in order to minimize the loss function. This study employed two optimizers: Momentum optimizer and Adam optimizer [33]. A comprehensive overview of the testing scenarios for the optimized training model is presented in Table 2.
Table 2. Testing scenarios for the optimized training model
Variables | Testing Scenario |
Model | MobileNet SSD V2 |
Num_classes | 2 |
Num_steps | 25,000 |
Learning_rate (lr) | 0.001 |
Optimizer | Momentum Optimizer, Adam Optimizer |
Batch_size | 2, 4, 8, 10 |
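The testing scenarios amount to a small grid: two optimizers crossed with four batch sizes, with every other setting held fixed. A minimal sketch of how such a grid could be enumerated is shown below. The dictionary keys and the `build_scenarios` helper are hypothetical names for illustration and are not the configuration format of any particular training tool.

```python
from itertools import product

# Settings held constant across all runs (values from Table 2).
FIXED = {
    "model": "MobileNet SSD V2",
    "num_classes": 2,
    "num_steps": 25_000,
    "learning_rate": 0.001,
}
OPTIMIZERS = ["momentum", "adam"]
BATCH_SIZES = [2, 4, 8, 10]


def build_scenarios():
    """Return one config dict per (optimizer, batch size) combination."""
    return [
        {**FIXED, "optimizer": opt, "batch_size": bs}
        for opt, bs in product(OPTIMIZERS, BATCH_SIZES)
    ]


scenarios = build_scenarios()
for cfg in scenarios:
    print(cfg["optimizer"], cfg["batch_size"])
print(len(scenarios))  # 8
```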
3. Experimental Result and Discussion
Before the model was deployed on the Jetson Nano edge device, it had to be trained and optimized using the prepared dataset. The model was trained under the scenarios listed in Table 2. Whether the training results are sound can be judged from several parameters. The first is the total loss, which is reported separately for training and validation. A smaller training total loss is better, since it means the model has learned to recognise the training objects more closely, but it must be accompanied by a validation total loss that is also small and not far from the training total loss [34]. The validation total loss indicates how well the model recognises images it has not seen during training; when the model generalises well, the validation total loss follows the decrease in the training total loss. If instead the validation loss stops following the training loss, the model has memorised the training dataset and cannot recognise new data, which is called overfitting [35].
For the Momentum optimizer in Figure 4, the divergence between the decreasing training loss and the increasing validation loss after 15,000 training steps clearly indicates overfitting. By comparison, Figure 5 shows that with the Adam optimizer the validation total loss continues to follow the training total loss up to 25,000 steps.
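The overfitting check described here, where training loss keeps falling while validation loss starts to rise, can be automated on logged loss values. The sketch below is one simple way to do it under stated assumptions: the function name `first_divergence`, the `patience` parameter, and all loss values are hypothetical and are not read from Figures 4 or 5.

```python
def first_divergence(steps, train_loss, val_loss, patience=2):
    """Return the step after which the curves diverge, or None.

    Divergence means: for `patience` consecutive checkpoints the training
    loss decreased while the validation loss increased. The returned step
    is the last checkpoint before that run began.
    """
    run = 0
    for i in range(1, len(steps)):
        train_down = train_loss[i] < train_loss[i - 1]
        val_up = val_loss[i] > val_loss[i - 1]
        run = run + 1 if (train_down and val_up) else 0
        if run >= patience:
            return steps[i - patience]
    return None


# Made-up curves for illustration only.
steps = [5000, 10000, 15000, 20000, 25000]
train = [0.90, 0.60, 0.45, 0.38, 0.34]
val_overfit = [0.95, 0.75, 0.70, 0.74, 0.79]   # rises after 15,000 steps
val_healthy = [0.95, 0.80, 0.72, 0.68, 0.65]   # keeps following training loss

print(first_divergence(steps, train, val_overfit))  # 15000
print(first_divergence(steps, train, val_healthy))  # None
```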


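The second parameter used to judge the training result is the set of performance metrics. Based on the confusion matrix, three metrics are used, as given in Eqs. (1)-(3):

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives.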
Precision is the proportion of detections that are actually correct. Recall describes the model's success in retrieving information, that is, the proportion of actual objects that are detected. The F1 score combines precision and recall through their harmonic mean [36]. It is particularly useful for imbalanced datasets, where the class distribution is uneven, because it balances false positives against false negatives. Accuracy is not used for model evaluation because it requires true negatives, which are not defined in object detection: the system identifies and localizes objects within images, and there is no practical way to count the possible locations where no object is present.
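As a quick check of how these three metrics behave, the sketch below implements their standard definitions from TP, FP, and FN counts and applies them to the totals the paper reports for water-filled potholes (TP = 19, FP = 1, FN = 18). The function names are illustrative, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision(tp, fp):
    """Share of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual objects that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Totals reported for water-filled potholes over 300 m.
tp, fp, fn = 19, 1, 18
p = precision(tp, fp)
r = recall(tp, fn)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))  # 0.95 0.514 0.667
```

Note that no true-negative count appears anywhere in these formulas, which is why they remain usable for object detection.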
As a result, precision, recall, and F1 score are preferred, as they do not depend on true negatives and provide a more meaningful assessment of model performance in detecting and localizing objects. After training, MobileNet SSD V2 automatically reports the performance metrics as mean Average Precision (mAP), Average Recall (AR), and F1 score at an Intersection over Union (IoU) threshold of 0.5. IoU measures the overlap between the ground-truth bounding box and the predicted bounding box. Table 3 shows the results of the model optimized under the selected scenarios.
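Since the IoU threshold of 0.5 decides whether a predicted box counts as a match, a short worked example may help. The sketch below computes IoU for axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`. That box format and the example coordinates are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
close_prediction = (2, 0, 12, 10)   # shifted by 2 units
loose_prediction = (5, 0, 15, 10)   # shifted by 5 units

print(round(iou(ground_truth, close_prediction), 3))  # 0.667, passes the 0.5 threshold
print(round(iou(ground_truth, loose_prediction), 3))  # 0.333, fails the 0.5 threshold
```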
Table 3. Training results for each optimizer and batch size
Optimizer Config | Batch Size | Training Total Loss | Validation Total Loss | mAP | AR | F1 Score |
Momentum Optimizer lr=0.001 | 2 | 0.542 | 0.697 | 0.642 | 0.477 | 0.547 |
Momentum Optimizer lr=0.001 | 4 | 0.362 | 0.767 | 0.673 | 0.467 | 0.551 |
Momentum Optimizer lr=0.001 | 8 | 0.359 | 0.772 | 0.661 | 0.472 | 0.55 |
Momentum Optimizer lr=0.001 | 10 | 0.341 | 0.786 | 0.654 | 0.471 | 0.548 |
Adam Optimizer lr=0.001 | 2 | 0.4 | 0.647 | 0.617 | 0.488 | 0.348 |
Adam Optimizer lr=0.001 | 4 | 0.679 | 0.628 | 0.68 | 0.508 | 0.582 |
Adam Optimizer lr=0.001 | 8 | 0.415 | 0.7 | 0.667 | 0.484 | 0.561 |
Adam Optimizer lr=0.001 | 10 | 0.345 | 0.732 | 0.675 | 0.471 | 0.554 |
Table 3 shows the results for each selected scenario, that is, each combination of optimizer and batch size. Of the 8 training scenarios, the highest F1 score occurred with the Adam optimizer at a learning rate of 0.001 and a batch size of 4, giving an mAP of 0.68, an Average Recall (AR) of 0.508, and an F1 score of 0.582. By comparison, previous road damage detection research by Hernanda et al. [37] using the same model reported a precision (mAP) of 0.0869 and a recall (AR) of 0.241, with no F1 score reported. Although the dataset sizes differ and the road damage labels in that research also include cracks, this research obtained higher precision and recall.
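The selection of the best scenario can be expressed as a simple lookup over the values in Table 3. The sketch below copies the table rows as reported and picks the run with the highest F1 score and the run with the lowest validation loss. The `Run` record and its field names are illustrative choices, not part of the original work.

```python
from collections import namedtuple

Run = namedtuple("Run", "optimizer batch_size train_loss val_loss mean_ap ar f1")

# Rows copied from Table 3 as reported.
RUNS = [
    Run("momentum", 2, 0.542, 0.697, 0.642, 0.477, 0.547),
    Run("momentum", 4, 0.362, 0.767, 0.673, 0.467, 0.551),
    Run("momentum", 8, 0.359, 0.772, 0.661, 0.472, 0.550),
    Run("momentum", 10, 0.341, 0.786, 0.654, 0.471, 0.548),
    Run("adam", 2, 0.400, 0.647, 0.617, 0.488, 0.348),
    Run("adam", 4, 0.679, 0.628, 0.680, 0.508, 0.582),
    Run("adam", 8, 0.415, 0.700, 0.667, 0.484, 0.561),
    Run("adam", 10, 0.345, 0.732, 0.675, 0.471, 0.554),
]

best_f1 = max(RUNS, key=lambda run: run.f1)
best_val = min(RUNS, key=lambda run: run.val_loss)

print(best_f1.optimizer, best_f1.batch_size, best_f1.f1)          # adam 4 0.582
print(best_val.optimizer, best_val.batch_size, best_val.val_loss)  # adam 4 0.628
```

Both criteria point to the same configuration, which matches the choice reported in the text.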
For road damage surveys in Indonesia, one of the required variables is the number of potholes per 100 meters of road. When such surveys are automated with a neural network, the precision, recall, and especially the F1 score describe how accurately the technology counts potholes. A precision of 0.68 implies that more than two-thirds of the detections are correct, and a recall of 0.508 implies that the model identifies more than half of the actual potholes. The F1 score of 0.582 summarizes the two: applied in the real world to potholes and conditions matching those of the training dataset, the model's overall detection accuracy would be about 58.2%.
The system's detection performance on water-filled potholes is compared to its performance under dry conditions, using the same potholes at the same location when not filled with water. The road site used for the research, Kertajaya Indah Street in Surabaya, Indonesia, had 37 potholes along 300 metres of road. Although the limited sample size may limit the generalizability of the findings, the research was conducted under realistic field conditions using the edge device system in a car, while still ensuring a diversity of pothole conditions, including variation in dimension, depth, and the presence of water. Considering the weather conditions and site characteristics, the water-filled tests were conducted by deliberately filling the potholes with water. Data was then collected, and performance was evaluated using the confusion matrix parameters: True Positive (TP), False Positive (FP), and False Negative (FN). A true positive in this test is the case where there is a pothole and the predicted bounding box is correct. A false positive occurs when the predicted bounding box is erroneous or does not correspond to the actual object label. A false negative occurs when there is a pothole but no predicted bounding box. These definitions are illustrated in Figure 6.
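The TP, FP, and FN definitions above can be turned into a counting procedure. The paper does not state how detections were matched to actual potholes in the field test, so the sketch below is an assumption: a greedy matcher that pairs each detection, in order of confidence, with the best unmatched ground-truth box of the same label at an IoU of at least 0.5. All boxes, labels, and scores are made-up values.

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(ground_truth, detections, iou_threshold=0.5):
    """Return (TP, FP, FN).

    ground_truth: list of (label, box); detections: list of (label, score, box).
    Each ground-truth box can be matched by at most one detection.
    """
    matched = set()
    tp = fp = 0
    for label, _score, box in sorted(detections, key=lambda d: d[1], reverse=True):
        best_idx, best_iou = None, 0.0
        for idx, (gt_label, gt_box) in enumerate(ground_truth):
            if idx in matched or gt_label != label:
                continue  # already used, or wrong class (dry vs water-filled)
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_idx, best_iou = idx, overlap
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1
    fn = len(ground_truth) - len(matched)  # potholes with no matching box
    return tp, fp, fn


ground_truth = [
    ("water", (0, 0, 10, 10)),
    ("water", (20, 0, 30, 10)),
    ("water", (40, 0, 50, 10)),
]
detections = [
    ("water", 0.9, (1, 0, 11, 10)),         # overlaps the first pothole: TP
    ("dry", 0.8, (20, 0, 30, 10)),          # right place, wrong label: FP
    ("water", 0.6, (100, 100, 110, 110)),   # nothing there: FP
]
print(count_matches(ground_truth, detections))  # (1, 2, 2)
```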

The precision, recall, and F1 score are then calculated from the confusion matrix. Higher precision means fewer objects incorrectly classified as potholes. Recall describes the system's capacity to detect the actual potholes; higher recall implies that the system overlooks fewer potholes. The F1 score is the main metric for assessing the system's accuracy, combining precision and recall into a single score that reflects the balance between them. The results are computed for each Stationing (STA), a unit used by road authorities to measure road length, where 1 STA corresponds to 100 meters. This test covers a distance of 3 STA, or 300 meters, at a vehicle speed of 15 km/h. The first test detects dry potholes, with results shown in Table 4; the second detects water-filled potholes for comparison, with results shown in Table 5.
Table 4. Detection results for dry potholes
STA | Num of Potholes | TP | FP | FN | Precision | Recall | F1 Score |
1 | 4 | 2 | 1 | 2 | 0.667 | 0.50 | 0.571 |
2 | 21 | 13 | 3 | 8 | 0.813 | 0.619 | 0.703 |
3 | 12 | 5 | 2 | 7 | 0.714 | 0.417 | 0.526 |
Total | 37 | 20 | 6 | 17 | 0.769 | 0.541 | 0.635 |
Table 5. Detection results for water-filled potholes
STA | Num of Potholes | TP | FP | FN | Precision | Recall | F1 Score |
1 | 4 | 3 | 0 | 1 | 1 | 0.75 | 0.857 |
2 | 21 | 11 | 0 | 10 | 1 | 0.524 | 0.688 |
3 | 12 | 5 | 1 | 7 | 0.833 | 0.417 | 0.556 |
Total | 37 | 19 | 1 | 18 | 0.950 | 0.514 | 0.667 |
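The "Total" rows in both tables are consistent with summing the TP, FP, and FN counts over the three STA segments first and computing the metrics from those sums, rather than averaging the per-STA scores. The sketch below reproduces both total rows from the per-STA counts in the tables; the helper names are illustrative.

```python
# Per-STA counts copied from the tables: STA -> (TP, FP, FN).
DRY = {1: (2, 1, 2), 2: (13, 3, 8), 3: (5, 2, 7)}
WATER_FILLED = {1: (3, 0, 1), 2: (11, 0, 10), 3: (5, 1, 7)}


def pooled_counts(per_sta):
    """Sum TP, FP and FN over all STA segments."""
    tp = sum(counts[0] for counts in per_sta.values())
    fp = sum(counts[1] for counts in per_sta.values())
    fn = sum(counts[2] for counts in per_sta.values())
    return tp, fp, fn


def metrics(tp, fp, fn):
    """Return (precision, recall, F1) rounded to three decimals."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return round(p, 3), round(r, 3), round(f1, 3)


for name, table in (("dry", DRY), ("water-filled", WATER_FILLED)):
    counts = pooled_counts(table)
    print(name, counts, metrics(*counts))
# dry (20, 6, 17) (0.769, 0.541, 0.635)
# water-filled (19, 1, 18) (0.95, 0.514, 0.667)
```

Pooling the counts weights each pothole equally, so STA 2, with 21 of the 37 potholes, dominates the totals.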
Tables 4 and 5 show that, over the full 300 metres (3 STA), the system tends to achieve precision higher than recall, which indicates that it rarely draws bounding boxes around objects that are not actually potholes. The F1 score for STA 1 under dry conditions (Table 4) is lower than for the water-filled potholes (Table 5). This may be due to the position of the pothole relative to the camera, which could not capture the pothole well, as shown in Figure 7.

Over all STA, the system performs best when detecting water-filled potholes, with a precision of 0.95, a recall of 0.514, and an F1 score of 0.667. Based on the F1 score, the system can be interpreted as detecting and classifying 0.667, or 66.7%, of the total potholes on that road.
The water-filled pothole classification system was then tested with vehicle speed as the changing variable. Constant speeds of 15 km/h, 25 km/h, and 30 km/h were used, on the same road as the previous test.
Table 6. Detection results for water-filled potholes at different vehicle speeds
Speed (km/h) | Num of Potholes | TP | FP | FN | Precision | Recall | F1 Score |
15 | 37 | 19 | 1 | 18 | 0.95 | 0.514 | 0.667 |
25 | 37 | 17 | 2 | 20 | 0.895 | 0.459 | 0.607 |
30 | 37 | 16 | 2 | 21 | 0.889 | 0.432 | 0.582 |
From Table 6, within a 300-meter road or 3 STA, the model's results vary with different vehicle speeds, impacting system performance. As the speed increases, the true positive value decreases, indicating a reduction in accurately detected potholes, while the false negative value rises, signifying numerous undetected potholes. At speeds of 25 km/h and 30 km/h, the false positive value increased compared to 15 km/h. After calculating precision and recall, it further strengthens the observation that the system is missing potholes, as the recall value is significantly lower than precision. Regarding the F1 score, the lowest speed, 15 km/h, obtains the highest F1 score of 0.667. As the speed increases, the F1 score gradually decreases, reaching 0.582 at 30 km/h.
System performance decreased with increasing vehicle speed, for three reasons. First, vehicle vibration increases at higher speeds. This can blur the potholes, making them challenging for the system to detect. Second, the Jetson Nano running the SSD MobileNet V2 model processes frames during inference at an average of 3.5 frames per second (FPS), so only about 3 frames are processed each second. As speed increases, the vehicle covers more road between processed frames, so more of the road is skipped. Third, higher speed raises the probability that the camera captures over-exposed or under-exposed frames, especially when lighting varies drastically, for example under shadows from trees or buildings, because the camera's auto exposure does not adapt fast enough.
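The second effect can be quantified with simple arithmetic: at a fixed inference rate, the distance the vehicle travels between two processed frames grows linearly with speed. The sketch below uses the reported 3.5 FPS and the three tested speeds; the function name is illustrative, and the calculation ignores camera field of view, which determines how much road each frame actually covers.

```python
def metres_per_frame(speed_kmh, fps):
    """Distance travelled between two consecutive processed frames."""
    speed_ms = speed_kmh * 1000 / 3600  # km/h -> m/s
    return speed_ms / fps


FPS = 3.5  # average inference rate reported for the Jetson Nano
for speed in (15, 25, 30):
    print(speed, "km/h:", round(metres_per_frame(speed, FPS), 2), "m per frame")
# 15 km/h: 1.19 m per frame
# 25 km/h: 1.98 m per frame
# 30 km/h: 2.38 m per frame
```

Doubling the speed from 15 km/h to 30 km/h therefore doubles the gap between processed frames, which is consistent with the drop in recall observed at higher speeds.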
Based on the system performance test at varying speeds, the system performs best at a real-world speed of 15 km/h. Achieving good results at higher speeds requires improving overall system performance to increase the FPS produced by the device. Several measures can mitigate the issues at high speeds, including a stabilizer such as a gimbal on the camera to reduce motion blur, a high dynamic range (HDR) camera to retain detail when the lighting is too bright or too dark, and a camera with adaptively adjustable exposure.
4. Conclusions
After conducting tests, analyzing the results, and implementing the system, several conclusions have been drawn. The optimized model configuration is achieved with 25,000 steps, batch size 4, and learning rate 0.001, resulting in the smallest validation loss of 0.628 and the largest F1 score of 0.582. In performance testing, the system detects water-filled potholes more effectively than dry potholes. For dry potholes, the system achieves a precision of 0.769, a recall of 0.541, and an F1 score of 0.635. For water-filled potholes, it achieves a precision of 0.95, a recall of 0.514, and an F1 score of 0.667. Furthermore, performance testing at varying vehicle speeds indicates that the optimal speed for the edge device system is an average of 15 km/h, at which the system achieves a precision of 0.95, a recall of 0.514, and an F1 score of 0.667. Higher speeds, however, require a higher FPS than the average of 3.5 FPS obtained by this system. Recommendations for further work include test locations with more potholes and longer roads, improved edge device hardware with a more powerful GPU such as the Nvidia Jetson Xavier, a higher-resolution camera, and more than one camera to capture potholes more widely and clearly. In further research, the existing system could also be combined with laser-based systems such as that of Kumar [13] to improve the accuracy and validation of whether potholes are dry or water-filled. Another option is to integrate a Lidar sensor into the existing system so that, when the camera detects a pothole, the Lidar sensor collects a precise profile of it.
AR | Average Recall |
FN | False Negative |
FP | False Positive |
FPS | Frames per Second |
lr | Learning Rate |
mAP | Mean Average Precision |
STA | Stationing |
SSD | Single Shot Multibox Detector |
TP | True Positive |
