DRF-Net: Frequency-Guided Feature Reconstruction for Small Object Detection in Aerial Imagery
Abstract:
Small object detection in aerial imagery remains challenging due to limited spatial resolution, background clutter, and severe scale variation. Existing deep learning–based detectors often suffer from weakened shallow representations and insufficient cross-scale feature interaction, leading to missed detections and unstable localization in dense scenes. This work presents Dynamic Reconstruction and Fusion Network (DRF-Net), a frequency-guided feature reconstruction framework for small object detection. Built upon a one-stage detection paradigm, the proposed method introduces three key components: a frequency-guided channel–spatial attention (FCSA) module to enhance fine-grained representations, a multi-frequency reconstruction block (MFRB) to restore cross-scale structural information, and a dynamic reconstruction and fusion neck (DRF-Neck) to adaptively regulate multi-scale feature aggregation. By jointly modeling high- and low-frequency components and integrating saliency-aware fusion mechanisms, the framework improves the preservation of small-object contours while suppressing redundant background responses. Extensive experiments conducted on the VisDrone2019 benchmark demonstrate that DRF-Net consistently outperforms the baseline detector in terms of detection accuracy, particularly for small and densely distributed objects, while maintaining real-time inference efficiency. Ablation studies further verify the complementary contributions of the proposed modules to feature representation and fusion stability. The results indicate that frequency-guided reconstruction and dynamic fusion provide an effective learning strategy for enhancing small-object detection performance in complex visual scenes.
1. Introduction
With the rapid development of unmanned aerial vehicle (UAV) systems, UAV-based visual perception has been widely applied in intelligent transportation, urban management, and disaster emergency response [1]. As a core task in UAV visual perception, object detection plays a critical role in environmental perception, target recognition, and intelligent decision-making [2].
Existing object detection methods can be broadly categorized into two-stage and one-stage paradigms, represented by region-based approaches such as Faster Region-based Convolutional Neural Network (Faster R-CNN) [3] and Mask Region-based Convolutional Neural Network (Mask R-CNN) [4], and single-shot detectors such as Single Shot MultiBox Detector (SSD) [5] and the You Only Look Once (YOLO) series [6]. While these methods achieve satisfactory performance under clean backgrounds and large object scales, their effectiveness degrades significantly in UAV aerial imagery due to small object sizes, complex background clutter, and frequent occlusions, as well as stringent real-time constraints imposed by limited onboard computational resources. Consequently, extensive studies have explored improvement strategies from multiple perspectives, including network architecture design, feature enhancement, and multi-scale feature fusion.
Numerous studies have focused on improving YOLO-series models to balance detection accuracy and real-time performance. Representative methods such as Enhanced Multi-Feature Extraction YOLO (EMFE-YOLO) and Lightweight UAV-YOLO (LightUAV-YOLO) enhance feature extraction and multi-scale fusion through architectural optimization, while Efficient Lightweight Network (EL-Net), Lightweight Edge-Aware Feature YOLO (LEAF-YOLO), and Lightweight UAV Detection YOLO (LUD-YOLO) achieve efficient detection under edge-computing constraints by designing lightweight network structures [7], [8], [9], [10], [11], [12].
Another line of research focuses on module-level optimization by improving feature fusion strategies and introducing attention mechanisms to enhance small-object representation [13]. Representative methods, including Pixel-level Multi-Scale Enhancement (PMSE), Scale-Context Feature Pyramid Network (SCFPN), and attention mechanisms based on long-range dependencies, effectively alleviate the mismatch between shallow and deep features under complex backgrounds [14], [15]. In addition, YOLOv11-based methods such as Reconstruction-Enhanced Attention YOLO (REA-YOLO) and Dual-Attention UAV-YOLO (DAU-YOLO) optimize feature fusion and detection head structures to handle dense object distributions and severe occlusions [16], [17]. Knowledge distillation is also explored to improve detection performance while maintaining compact model sizes [18], [19].
Despite notable progress in UAV-based object detection, challenges such as insufficient feature representation and the trade-off between model complexity and detection performance in complex scenes remain. To address these issues, this paper proposes Dynamic Reconstruction and Fusion Network (DRF-Net), an improved object detection framework based on YOLOv11. The main contributions are summarized as follows:
(1) A frequency-guided channel–spatial attention (FCSA) module is designed to enhance the interaction between shallow fine-grained features and deep semantic representations. By leveraging high- and low-frequency feature decomposition, adaptive frequency-domain weight modeling, and cross-dimensional attention fusion, the proposed module effectively strengthens small-object feature representation while suppressing redundant background interference.
(2) A Multi-Frequency Reconstruction Block (MFRB) is proposed to reconstruct multi-scale features prior to feature fusion. By integrating multi-scale convolution, high-frequency structural compensation, and low-frequency context modeling, the proposed block enhances the structural integrity of cross-scale features and improves the network's capability to capture fine-grained details of small objects as well as their contextual relationships.
(3) A dynamic reconstruction and fusion neck (DRF-Neck) is constructed to enhance multi-scale feature aggregation. By incorporating deformable convolution, cross-layer dynamic feedback, and gated fusion mechanisms, the proposed neck adaptively regulates the feature fusion process, resulting in more sufficient and stable feature integration and further improving overall detection accuracy and feature propagation efficiency.
2. Proposed Method
In practical UAV vision systems, DRF-Net serves as a core perception module for real-time object detection and can be integrated into applications such as traffic monitoring, infrastructure/industrial inspection, and emergency response. It processes onboard video frames and outputs object labels, confidence scores, and bounding boxes for downstream tasks (e.g., tracking, risk assessment, warning generation, and decision support). With enhanced small-object representation while maintaining real-time inference efficiency, DRF-Net is suitable for integration into UAV edge or onboard visual perception pipelines.
To address weak small-object feature representation and insufficient cross-scale information interaction in UAV aerial imagery, this paper proposes an improved object detection model termed DRF-Net based on the YOLOv11 framework. The model enhances detection performance through frequency-guided feature enhancement, multi-frequency structural reconstruction, and saliency-driven dynamic feature fusion. The overall architecture of DRF-Net is illustrated in Figure 1, consisting of the input layer, backbone network, neck network, and detection head.

UAV aerial images pose distinctive challenges compared with ground-level images. Owing to high flight altitudes and wide fields of view, targets often occupy only a few pixels and are easily obscured by complex background textures (e.g., roads, buildings, and vegetation). As a result, spatial-domain convolutional features are frequently dominated by background responses, weakening small-object discrimination. From a frequency-domain viewpoint, small objects are primarily reflected in high-frequency components, whereas large-scale backgrounds are mainly distributed in low-frequency bands. This motivates frequency-guided feature enhancement to emphasize fine-grained target cues while suppressing redundant background information. Furthermore, large scale variations and cross-level structural inconsistencies call for multi-frequency reconstruction and dynamic fusion to restore cross-scale coherence and enable robust multi-scale representation.
The input UAV aerial images are resized to 640 × 640 × 3 and fed into the backbone network for hierarchical feature extraction. Let $F_i$ denote the feature map output by the i-th backbone layer. The backbone consists of basic convolutional layers and the proposed FCSA module, which facilitates effective interaction between shallow fine-grained features and deep semantic representations.
Before feature fusion, an MFRB is introduced to reconstruct and compensate multi-scale feature representations by integrating structural and contextual information across different feature levels. The reconstructed features are then fed into the DRF-Neck, which enables flexible and stable multi-scale feature aggregation through saliency-guided deformable convolution (SAR-DCN), adaptive sampling via simplified content-aware reassembly of features (S-CARAFE), gated fusion, and cross-layer dynamic feedback.
Furthermore, an adaptive hybrid Intersection over Union (IoU) loss, termed Autonomous Hybrid IoU loss (AHIoU), is incorporated to improve localization accuracy and training stability. AHIoU combines the geometric constraint of Efficient IoU (EIoU) [20] with the dynamic weighting mechanism of Focal IoU loss (FocalIoU) [21], enhancing the model's sensitivity to small objects and low-IoU samples [22].
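As a rough illustration of how EIoU geometry can be combined with an IoU-dependent focal weight, the following PyTorch sketch implements the two ingredients. The exponent `gamma` and the multiplicative combination follow the FocalIoU convention and are assumptions; the paper's exact AHIoU formulation is not reproduced here.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU = 1 - IoU + center-distance term + width/height distance terms.
    Boxes are in (x1, y1, x2, y2) format."""
    # Intersection and union
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Smallest enclosing box
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Normalized center distance
    rho2 = ((pred[..., 0] + pred[..., 2]) - (target[..., 0] + target[..., 2])) ** 2 / 4 \
         + ((pred[..., 1] + pred[..., 3]) - (target[..., 1] + target[..., 3])) ** 2 / 4
    # Width/height discrepancy terms
    dw = ((pred[..., 2] - pred[..., 0]) - (target[..., 2] - target[..., 0])) ** 2
    dh = ((pred[..., 3] - pred[..., 1]) - (target[..., 3] - target[..., 1])) ** 2
    loss = 1 - iou + rho2 / c2 + dw / (cw ** 2 + eps) + dh / (ch ** 2 + eps)
    return loss, iou

def ahiou_loss(pred, target, gamma=0.5):
    """Hypothetical hybrid: IoU-dependent focal weight applied to the EIoU loss."""
    loss, iou = eiou_loss(pred, target)
    return (iou.detach() ** gamma) * loss
```

A perfectly matched box pair yields a near-zero loss, while a displaced prediction is penalized by both the IoU and geometric terms.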
The FCSA module is designed to enhance the expressive capability of multi-scale features by introducing frequency-domain guidance and channel–spatial coupling mechanisms, thereby strengthening small-object features while suppressing background interference. The overall architecture of the FCSA module is illustrated in Figure 2.

First, given the input feature $F_{in}$, FCSA constructs two parallel branches to capture complementary information, where the high-frequency branch focuses on local fine-grained details and the low-frequency branch models broader background context:
In the formulation, High-Pass Filter Convolution (HPFConv) denotes the high-pass convolution operation, Dilated Convolution (DilatedConv) denotes the dilated convolution with a specified dilation rate, and $F_{hf}$ and $F_{lf}$ represent the feature representations extracted from the high-frequency and low-frequency branches, respectively.
Subsequently, the two frequency-domain feature representations are concatenated along the channel dimension and fused through a 1×1 fusion convolution (FConv) for feature compression. A frequency-domain weight map $W_f$ is then generated using the Sigmoid activation function to emphasize salient high-frequency regions while suppressing redundant background responses:
Based on the generated frequency-domain weight map $W_f$, the input features are recalibrated to obtain the frequency-domain enhanced representation $F_f$.
Here, Concat denotes the channel-wise concatenation operation, $\sigma(\cdot)$ represents the activation function, Conv $1 \times 1$ represents a $1 \times 1$ convolution, $W_f$ denotes the frequency-domain weight map, and $\odot$ indicates element-wise multiplication. Recalibrating features using frequency-domain weights effectively emphasizes salient high-frequency regions while suppressing background responses with low salience, thereby making the spatial distribution and response intensity of shallow features more suitable for small-object detection tasks.
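The frequency-guided recalibration described above can be sketched in PyTorch as follows. The Laplacian initialization of HPFConv, the kernel sizes, and the dilation rate are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class FrequencyGuide(nn.Module):
    """Sketch of FCSA's frequency-guided branch: a high-pass branch (HPFConv,
    approximated by a depthwise conv initialized with a Laplacian kernel),
    a low-pass branch (DilatedConv), a 1x1 fusion conv (FConv), and a
    Sigmoid weight map W_f used to recalibrate the input."""
    def __init__(self, channels, dilation=2):
        super().__init__()
        self.hpf = nn.Conv2d(channels, channels, 3, padding=1,
                             groups=channels, bias=False)
        lap = torch.tensor([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]])
        self.hpf.weight.data.copy_(lap.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.lpf = nn.Conv2d(channels, channels, 3, padding=dilation,
                             dilation=dilation)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # FConv

    def forward(self, x):
        f_hf = self.hpf(x)                      # high-frequency branch F_hf
        f_lf = self.lpf(x)                      # low-frequency branch F_lf
        w_f = torch.sigmoid(self.fuse(torch.cat([f_hf, f_lf], dim=1)))
        return x * w_f                          # frequency-enhanced F_f
```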
Building upon the frequency-domain enhancement, FCSA introduces a channel–spatial coupling mechanism. Global pooling operations followed by a lightweight Multi-Layer Perceptron (MLP) are used to model spatial saliency and generate attention weights for adaptive feature recalibration:
where $S_c$ is defined as:
In the formulation, Global Average Pooling (GAP) and Global Max Pooling (GMP) denote global average pooling and global max pooling operations, respectively; MLP denotes a multilayer perceptron; Deformable Convolution (DEConv) represents the deformable convolution mapping operation; $W_s$ denotes the generated spatial saliency map; and $S_c$ denotes the channel-wise context descriptor obtained by aggregating global average pooling and global max pooling features. By integrating complementary statistical information from different pooling operations, this process effectively models spatial saliency and highlights regions with strong discriminative significance.
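A minimal sketch of this channel–spatial coupling is given below: GAP and GMP statistics are aggregated into the channel descriptor $S_c$, passed through a small MLP, and expanded into a saliency weight. The hidden-layer ratio and the additive merging of the two pooled vectors are assumptions.

```python
import torch
import torch.nn as nn

class ChannelSpatialSaliency(nn.Module):
    """Sketch of FCSA's channel-spatial coupling: pooled statistics form S_c,
    which an MLP maps to per-channel saliency weights for recalibration."""
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // ratio),
            nn.SiLU(),
            nn.Linear(channels // ratio, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        gap = x.mean(dim=(2, 3))              # global average pooling
        gmp = x.amax(dim=(2, 3))              # global max pooling
        s_c = gap + gmp                       # channel context descriptor S_c
        w_s = torch.sigmoid(self.mlp(s_c)).view(b, c, 1, 1)
        return x * w_s                        # saliency-modulated output
```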
To achieve an adaptive balance between feature enhancement and information preservation, FCSA further incorporates a cross-scale dynamic gating unit. This unit first performs contextual modeling on features from adjacent scales to capture cross-layer semantic dependencies. The resulting contextual features are then fused with the output of the previous unit and the original input features through residual connections, producing the final output feature representation:
where $F_s$ is defined as:
In the formulation, $F_{in}$ denotes the input feature map of the FCSA module, $F_s$ denotes the saliency-enhanced feature after attention modulation, $F_n$ represents the contextual feature from adjacent scales, $F_f$ represents the frequency-domain enhanced feature, $W_s$ is the spatial saliency weight map, Sigmoid Linear Unit (SiLU) denotes the activation function, Batch Normalization (BN) denotes the batch normalization operation, and Residual connection (RES) denotes the residual connection structure, which helps mitigate potential information shifts introduced during feature enhancement and contributes to more stable network training.
Overall, FCSA enhances small-object representation by integrating frequency-guided reweighting with channel–spatial saliency modeling, thereby providing more informative features for subsequent reconstruction and multi-scale fusion.
The MFRB is designed to reconstruct feature representations by integrating multi-frequency information through parallel convolutional branches (Figure 3).

In the first stage of feature reconstruction, MFRB constructs five parallel multi-frequency convolutional branches for the input feature $F_{fcsa}$ to capture complementary information across different frequency domains and receptive fields. The computational formulations of these branches are given as follows:
To achieve unified modeling across different frequency bands, the feature maps generated by the five parallel branches are concatenated along the channel dimension, forming a comprehensive feature representation with enhanced multi-frequency information $F_{fre}$:
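This first stage can be sketched as five parallel branches with different receptive fields whose outputs are concatenated into $F_{fre}$. The specific branch designs below (1×1, 3×3, and 5×5 kernels, a dilated 3×3, and a pooled-context branch) are illustrative assumptions about how multi-frequency coverage could be realized, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiFreqBranches(nn.Module):
    """Sketch of MFRB's multi-frequency stage: five parallel branches
    spanning small to large receptive fields, concatenated channel-wise."""
    def __init__(self, c_in, c_branch):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, c_branch, 1)                     # point-wise
        self.b2 = nn.Conv2d(c_in, c_branch, 3, padding=1)          # local detail
        self.b3 = nn.Conv2d(c_in, c_branch, 5, padding=2)          # wider context
        self.b4 = nn.Conv2d(c_in, c_branch, 3, padding=2, dilation=2)  # dilated
        self.b5 = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),
                                nn.Conv2d(c_in, c_branch, 1))      # low-frequency context

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x),
                          self.b4(x), self.b5(x)], dim=1)          # F_fre
```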
Subsequently, the aligned feature $F_{align}$ produced by FCSA is concatenated with the input feature $F_{fre}$ for cross-layer feature compensation. A self-attention mechanism is then applied to model the saliency of the fused features:
where $F_{refine}$ denotes the refined feature map after self-attention modulation, which enhances salient structural information and suppresses redundant background responses.
$F_{res}$ is obtained by combining the reconstructed features with the input features through residual connections:
To further enhance feature representation, MFRB introduces a cross-layer dynamic feedback mechanism. Global average pooling followed by an MLP is used to generate scale-adaptive gating coefficients, which are applied to reweight the multi-frequency features. The feature corresponding to the current scale is retained as the final output $F_{mfrb}$:
where $W_r$ is defined as:
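The feedback step described above can be sketched as follows: GAP and an MLP produce the gating coefficients $W_r$, which reweight the reconstructed features. The Sigmoid gate, the hidden width, and the residual addition are assumptions about unspecified details.

```python
import torch
import torch.nn as nn

class DynamicFeedbackGate(nn.Module):
    """Sketch of MFRB's cross-layer dynamic feedback: GAP -> MLP -> W_r,
    applied as a per-channel gate over the reconstructed features F_res."""
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // ratio),
            nn.SiLU(),
            nn.Linear(channels // ratio, channels),
            nn.Sigmoid(),
        )

    def forward(self, f_res):
        w_r = self.mlp(f_res.mean(dim=(2, 3)))          # scale-adaptive gate W_r
        return f_res * w_r[:, :, None, None] + f_res    # gated reweight + residual
```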
Overall, MFRB improves cross-scale feature consistency by integrating multi-frequency extraction, attention-based refinement, and dynamic feedback, thereby providing more robust representations for subsequent multi-scale fusion.
DRF-Neck serves as the core component of the feature fusion stage, enabling adaptive reconstruction and dynamic fusion of multi-scale features through saliency-aware modeling. The overall architecture of DRF-Neck is illustrated in Figure 4.

First, given the input feature $F_{mfrb}$, DRF-Neck constructs a saliency map generation module to characterize the importance distribution across spatial regions. Channel compression via $1 \times 1$ convolution followed by global average pooling is applied to obtain the saliency vector $S$:
The saliency vector $S$ serves as the basis for dynamic convolution selection. When $S > \tau$, saliency-guided deformable convolution (SAR-DCN) and adaptive sampling via simplified content-aware reassembly of features (S-CARAFE) are employed; otherwise, standard convolution and conventional sampling are adopted. $F_x$ denotes the input feature map, and $F_{sar}$ denotes the resulting output:
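The saliency-gated selection can be sketched as below: a 1×1 compression plus global average pooling yields a per-sample saliency score compared against $\tau$. The SAR-DCN branch is stubbed with an ordinary 3×3 convolution here, since the deformable operator itself is not reproduced; the Sigmoid normalization of the saliency score is also an assumption.

```python
import torch
import torch.nn as nn

class SaliencySelector(nn.Module):
    """Sketch of DRF-Neck's dynamic convolution selection: the saliency
    score S chooses between a heavier branch (stand-in for SAR-DCN) and a
    lightweight standard-convolution branch, per sample."""
    def __init__(self, channels, tau=0.5):
        super().__init__()
        self.compress = nn.Conv2d(channels, 1, 1)                 # 1x1 compression
        self.heavy = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for SAR-DCN
        self.light = nn.Conv2d(channels, channels, 1)             # standard path
        self.tau = tau

    def forward(self, x):
        s = torch.sigmoid(self.compress(x)).mean(dim=(1, 2, 3))   # saliency vector S
        out = torch.where((s > self.tau).view(-1, 1, 1, 1),
                          self.heavy(x), self.light(x))
        return out, s
```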
SAR-DCN adaptively adjusts convolutional sampling locations according to saliency-aware offset information, thereby facilitating more accurate modeling of small-object regions and complex textures. The detailed formulation is given in Eqs. (22)–(25):
S-CARAFE is employed for saliency-aware adaptive sampling, where the saliency gain coefficient $\beta$ modulates the response intensity of salient regions:
where $Y$ denotes the input feature of the S-CARAFE module, $\beta$ denotes the saliency gain coefficient, and $F_s$ denotes the output result.
After saliency-driven convolution and adaptive sampling, a saliency feedback modulation unit is introduced to dynamically update the threshold $\tau$ and saliency gain coefficient $\beta$ according to feature statistics, enhancing adaptive feature fusion:
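One way this feedback modulation could operate is an exponential-moving-average update of $\tau$ and $\beta$ toward running statistics of the saliency map; the EMA form, the momentum value, and the choice of mean/dispersion statistics below are assumptions, as the text only states that both quantities are updated dynamically.

```python
import torch

def update_modulation(tau, beta, saliency, momentum=0.9):
    """Hypothetical sketch of the saliency feedback modulation unit:
    the selection threshold tau tracks the mean saliency response and
    the gain beta tracks its dispersion."""
    s_mean = saliency.mean().item()
    s_std = saliency.std().item()
    tau = momentum * tau + (1.0 - momentum) * s_mean           # threshold update
    beta = momentum * beta + (1.0 - momentum) * (1.0 + s_std)  # gain update
    return tau, beta
```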
Overall, DRF-Neck achieves adaptive and stable multi-scale fusion via saliency-aware reconstruction and feedback modulation, producing the final fused feature representation $F_{out}$.
3. Experiments and Results
The effectiveness of the proposed method is evaluated on one publicly available UAV aerial imagery dataset, namely VisDrone2019 [23], which is a widely adopted benchmark for UAV-based object detection under complex real-world scenarios.
VisDrone2019 is a large-scale benchmark designed for UAV-based object detection. It contains images captured from various UAV platforms under diverse urban scenes, covering complex backgrounds, dense object distributions, and significant scale variations. The dataset includes a large number of small and densely distributed objects, making it particularly suitable for evaluating detection performance in challenging UAV scenarios. In this work, the official training and validation splits are adopted for experimental evaluation.
The hardware and software configurations used in the experiments are summarized in Table 1:
| Item | Specification |
|---|---|
| Operating system | Windows 11 |
| Programming language | Python 3.10 |
| Deep learning framework | PyTorch 2.2.2 |
| Compute Unified Device Architecture (CUDA) | CUDA 12.1 |
| Central Processing Unit (CPU) | Intel Core i7-14700KF |
| Graphics Processing Unit (GPU) | NVIDIA RTX 4070 Ti Super (16 GB) |
| Memory | 32 GB |
The same set of hyperparameters is applied consistently throughout the training process. In addition, a cosine annealing learning rate schedule and the Stochastic Gradient Descent (SGD) optimizer are employed in the experiments. The detailed training hyperparameters are listed in Table 2.
| Parameter | Value |
|---|---|
| Learning Rate | 0.01 |
| Image Size | 640 × 640 |
| Optimizer | Stochastic Gradient Descent (SGD) |
| Batch Size | 32 |
| Epochs | 300 |
| Weight Decay | 0.0005 |
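The optimizer and schedule from Table 2 can be configured in PyTorch as shown below; the momentum value (0.937, a common YOLO default) is an assumption, as Table 2 does not list it, and the linear layer is only a placeholder for the detector.

```python
import torch

# Placeholder module standing in for DRF-Net
model = torch.nn.Conv2d(3, 16, 3)

# SGD with the learning rate and weight decay of Table 2;
# momentum=0.937 is an assumed value not stated in the table
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)

# Cosine annealing over the 300 training epochs of Table 2
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
```

Stepping the scheduler once per epoch anneals the learning rate from 0.01 toward zero over the 300 epochs.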
To comprehensively evaluate the performance of the proposed method, we adopt commonly used metrics in object detection, including the number of parameters (Params), computational complexity measured by Giga Floating-Point Operations (GFLOPs), inference speed in frames per second (FPS), and detection accuracy measured by mean Average Precision (mAP).
Specifically, mAP is computed by averaging the Average Precision (AP) over all categories. We report mAP@0.5 (mAP@50) and mAP@0.5:0.95 (mAP@50–95) following the standard evaluation protocol. Precision and Recall are implicitly reflected in the AP computation.
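For reference, AP at a single IoU threshold can be computed from a confidence-sorted detection list as in this minimal NumPy sketch; it uses all-point interpolation of the precision–recall curve, which differs slightly from COCO's 101-point variant.

```python
import numpy as np

def average_precision(scores, matched, n_gt):
    """AP at one IoU threshold: sort detections by confidence, accumulate
    precision/recall, and integrate the interpolated PR envelope.
    `matched[i]` is 1 if detection i matches a ground-truth box, else 0;
    `n_gt` is the number of ground-truth objects."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(matched, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(n_gt, 1)
    precision = tp_cum / (tp_cum + fp_cum)
    # Monotone precision envelope, then integrate over recall steps
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```

mAP then averages this quantity over all categories (and, for mAP@50–95, over IoU thresholds from 0.5 to 0.95).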
These metrics jointly reflect the trade-off between detection accuracy and computational efficiency, which is particularly important for UAV-based real-time detection scenarios.
To verify the effectiveness of the three proposed modules, YOLOv11 is adopted as the baseline network for the ablation study. Ablation experiments are conducted on the VisDrone2019 dataset. The corresponding experimental results are summarized in Table 3, where bold values indicate the best performance under the current comparison, and “√” denotes that the corresponding module is included in the model.
Table 3 presents the ablation results of different components on the VisDrone2019 dataset. Starting from the YOLOv11 baseline, introducing FCSA leads to a noticeable improvement in detection accuracy, indicating its effectiveness in enhancing feature representation for small objects under complex backgrounds.
Further incorporating MFRB brings additional performance gains, demonstrating that multi-frequency feature reconstruction contributes to more robust multi-scale feature learning. When DRF-Neck is added, the model achieves further improvements, validating the role of dynamic feature reconstruction and feedback-driven fusion in improving cross-scale information aggregation.
With all proposed modules integrated, the complete DRF-Net achieves the best overall performance in terms of mAP@50 and mAP@50–95, while maintaining a reasonable increase in computational cost. These results indicate that FCSA, MFRB, and DRF-Neck are complementary and jointly contribute to the performance improvement on UAV-based small object detection tasks.
| Model | FCSA | MFRB | DRF-Neck | mAP@50 (%) | mAP@50–95 (%) | Params (M) | GFLOPs | FPS (f/s) |
|---|---|---|---|---|---|---|---|---|
| YOLOv11 | | | | 33.0 | 19.0 | 2.58 | 6.32 | 214.0 |
| YOLOv11-F | √ | | | 33.8 | 19.3 | 2.76 | 6.52 | 215.3 |
| YOLOv11-M | | √ | | 33.2 | 19.1 | 2.76 | 6.51 | 215.1 |
| YOLOv11-D | | | √ | 32.9 | 19.0 | 2.76 | 6.50 | 211.4 |
| YOLOv11-F-M | √ | √ | | 34.3 | 19.5 | 2.76 | 6.50 | 206.3 |
| YOLOv11-M-D | | √ | √ | 33.6 | 19.4 | 2.76 | 6.50 | 213.9 |
| DRF-Net | √ | √ | √ | **34.5** | **19.8** | 2.76 | 6.50 | 215.1 |
To further evaluate the detection performance of DRF-Net, comparative experiments are conducted on the VisDrone2019 dataset against the baseline YOLOv11.

Figure 5 shows the training performance comparison of different models on the VisDrone2019 dataset in terms of mAP@50 and loss. As illustrated by the mAP@50 curves, DRF-Net consistently achieves higher detection accuracy throughout the training process and converges to a superior performance compared with YOLOv11, indicating more effective feature learning for small objects. In addition, the loss curves demonstrate that DRF-Net exhibits faster and more stable convergence behavior, reflecting improved optimization stability during training. These results further confirm that the proposed architectural enhancements not only improve final detection accuracy but also facilitate more efficient and stable model training.
To provide an intuitive understanding of the detection behavior of DRF-Net, representative UAV aerial scenes from the VisDrone2019 dataset are visualized (Figure 6). As shown in the detection results, DRF-Net produces more complete and accurate bounding box predictions in complex scenes, particularly for small and densely distributed objects. In contrast, YOLOv11 tends to miss small targets under challenging conditions such as severe occlusion, low illumination, and background clutter.


By incorporating the proposed FCSA, MFRB, and the dynamic fusion mechanism of DRF-Neck, DRF-Net demonstrates improved spatial localization accuracy and more stable confidence responses, resulting in fewer missed detections and reduced false positives. As illustrated in Figure 7, the visualization results indicate that DRF-Net maintains robust detection performance across diverse scenarios, including dense daytime traffic scenes and low-visibility environments.
To further analyze the feature representation capability, Gradient-weighted Class Activation Mapping (Grad-CAM) [24], [25] is employed to visualize the feature response maps of YOLOv11 and DRF-Net. As shown in the heatmaps, YOLOv11 exhibits relatively scattered activation regions with limited focus on target areas and increased background interference, especially in dense or low-contrast scenes.
In contrast, DRF-Net generates more concentrated activation responses around target regions with enhanced feature intensity and suppressed background activation. This observation suggests that the proposed architecture effectively strengthens fine-grained feature representation and multi-scale feature fusion, thereby improving discriminative capability and robustness under complex backgrounds.
4. Conclusion
This study presented DRF-Net, an object detection framework designed for small targets in drone aerial imagery. The work focused on three technical aspects: enhancing shallow fine-grained features, reconstructing cross-scale structural information, and improving multi-scale fusion stability. The proposed FCSA module strengthened the interaction between high- and low-frequency components, the MFRB module compensated structural details across feature levels, and the DRF-Neck enabled adaptive aggregation of salient regions.
Experiments on the VisDrone2019 dataset showed that DRF-Net achieved higher mAP@50 and mAP@50–95 than the baseline YOLOv11 while preserving real-time performance. The model required only a slight increase in parameters and computational cost, yet provided more reliable localization of densely distributed and partially occluded objects. These results suggest that the combination of frequency-guided enhancement and dynamic fusion is beneficial for UAV scenarios where targets occupy limited pixels and backgrounds are complex.
From a practical perspective, the proposed model can be used as a perception component in UAV visual systems for tasks such as traffic observation, facility inspection, and emergency monitoring. However, the current evaluation relied on a single benchmark dataset. Variations in camera sensors, flight altitude, and weather conditions may affect the generalization of the method. Future work will examine cross-dataset transfer, lightweight deployment on embedded platforms, and the integration of temporal information from video streams.
Conceptualization, M.Y.T. and J.L.M.; methodology, M.Y.T.; investigation, M.Y.T.; data curation, M.Y.T. and A.K.; writing—original draft preparation, M.Y.T.; writing—review and editing, J.L.M. and A.K.; visualization, M.Y.T.; supervision, J.L.M. All authors have read and agreed to the published version of the manuscript.
The data used to support the research findings are available from the corresponding author upon request.
The authors declare no conflict of interest.
