Small-Drone Camera Detection with YOLOv8

Computer Vision, YOLOv8, Object Detection

Project Overview

This project evaluates YOLOv8 for drone detection across two datasets and several data augmentation strategies. The focus is on detecting drones under different conditions, from distant, cloudy shots to closer, higher-resolution frames. Four models were trained and evaluated, varying the augmentation (horizontal and vertical flipping) and the training data (Anti-UAV alone or combined with VisioDECT), to measure how these factors affect detection performance. The study shows the benefits and limitations of each approach and data domain; the best model reaches about 99% precision and recall.

Version: v1.0

Time: ~16.0 hours

Cost: $0

Status: complete

Materials

Anti-UAV Dataset (318 videos) × 1
VisioDECT Dataset (20,924 images) × 1
GPU for training (CUDA-capable) × 1

Tools

Python 3.8+
PyTorch
YOLOv8 (Ultralytics)
OpenCV

Build Steps

1. Project Overview and Objectives

⏱️ ~2.0h

Introduction

This project investigates the effectiveness of YOLO-based systems for drone detection across multiple datasets using various data augmentation strategies. The challenge of detecting small drones in diverse environmental conditions—from distant, cloudy shots to closer, higher-resolution frames—requires robust computer vision models that can generalize across different scenarios.

Research Motivation

Drone detection has become increasingly important for security, privacy, and airspace management. However, detecting small unmanned aerial vehicles (UAVs) presents unique challenges: they appear at various distances, under different lighting conditions, against complex backgrounds, and from multiple viewing angles. Traditional detection methods often struggle with these variations, making deep learning approaches like YOLO particularly valuable.

Project Goals

The main objectives of this research are to:

Train and evaluate four distinct YOLOv8 models with different augmentation strategies
Analyze the impact of horizontal and vertical flipping augmentations on detection performance
Compare performance across two distinct datasets with different characteristics
Identify the optimal approach for drone detection in various scenarios
Understand the trade-offs between specialization and generalization

By comparing key metrics (precision, recall, mAP) and evaluating how well each model distinguishes drones from background, we gain insights into the benefits and limitations of each approach and data domain.

2. Dataset Preparation and Analysis

⏱️ ~4.0h

Anti-UAV Dataset

The primary dataset for this project is the Anti-UAV dataset, which contains 318 videos split into:

200 videos (about 63%) for training
51 videos (about 16%) for testing
67 videos (about 21%) for validation

Each video runs approximately one minute at 16 frames per second, for roughly 1,000 frames per video at 1920×1080 resolution. To prepare the footage for YOLO training, each video was split into individual images, and 500 frames (about half) were selected from every video so that all videos contribute equally to the dataset. The selected frames were then organized into the YOLO folder structure, with each image paired with its annotation file.
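
The write-up does not say how the 500 frames per video were chosen. One simple option is even spacing, sketched below. The function names, the output file naming, and the even-spacing rule itself are assumptions made for illustration, not the project's actual script.

```python
from pathlib import Path
from typing import List


def select_frame_indices(total_frames: int, keep: int) -> List[int]:
    """Return `keep` evenly spaced frame indices in the range [0, total_frames)."""
    if keep <= 0 or total_frames <= 0:
        return []
    keep = min(keep, total_frames)
    # Integer arithmetic avoids float rounding and yields unique indices.
    return [(i * total_frames) // keep for i in range(keep)]


def extract_frames(video_path: str, out_dir: str, keep: int = 500) -> int:
    """Save `keep` evenly spaced frames of one video as JPEG files."""
    import cv2  # imported lazily so the index helper works without OpenCV

    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    wanted = set(select_frame_indices(total, keep))
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    saved = 0
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index in wanted:
            # Hypothetical naming scheme: <video name>_<frame index>.jpg
            name = f"{Path(video_path).stem}_{index:05d}.jpg"
            cv2.imwrite(str(Path(out_dir) / name), frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```

For a 1,000-frame video and keep=500, the helper picks every second frame, starting at frame 0.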

Data Augmentation Strategy

To increase model robustness, each image was flipped horizontally and vertically, with its bounding-box annotations transformed to match. This produced three versions of the dataset:

Original image set
Horizontally flipped images
Vertically flipped images

Combining these three versions increases both variation and dataset size, helping the model learn more robust features. The augmentation strategy was designed to simulate different viewing angles and orientations that might occur in real-world scenarios.
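
Flipping an image only helps if its labels are flipped too. The sketch below shows how that can be done for YOLO-format labels (class, x_center, y_center, width, height, all normalized to 0..1). A horizontal flip mirrors the x center, and a vertical flip mirrors the y center. The function names are illustrative and are not taken from the project.

```python
from typing import List, Tuple

# One YOLO label row: class_id, x_center, y_center, width, height (normalized).
Box = Tuple[int, float, float, float, float]


def flip_boxes(boxes: List[Box], direction: str) -> List[Box]:
    """Mirror normalized YOLO boxes to match a flipped image."""
    if direction not in ("horizontal", "vertical"):
        raise ValueError("direction must be 'horizontal' or 'vertical'")
    flipped = []
    for cls, xc, yc, w, h in boxes:
        if direction == "horizontal":
            xc = 1.0 - xc  # left-right mirror moves the x center
        else:
            yc = 1.0 - yc  # top-bottom mirror moves the y center
        flipped.append((cls, xc, yc, w, h))  # width and height are unchanged
    return flipped


def flip_image(image, direction: str):
    """Flip an image array with OpenCV (flip code 1 = horizontal, 0 = vertical)."""
    import cv2  # imported lazily so flip_boxes works without OpenCV

    return cv2.flip(image, 1 if direction == "horizontal" else 0)
```

Applying the same flip twice returns the original boxes, which makes a quick sanity check for an augmentation script.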

VisioDECT Dataset

The second dataset is VisioDECT, containing 20,924 images of 6 different drone types collected over 1 year and 8 months at 12 different locations. Each drone was filmed in three scenarios (cloudy, sunny, and nighttime) at altitudes from 30m to 100m. Video sequences were converted to frames (.JPG) with dimensions 852×480, and frames without visible drones were discarded.

Trained experts annotated each image in three formats (.txt, .csv, .xml) by drawing bounding boxes around drones. The dataset is organized into 6 main folders (one per drone type), each containing an images folder (with subfolders for each scenario) and a labels folder with corresponding annotation files.
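
With images nested in scenario subfolders and labels stored separately, each frame has to be matched to its .txt annotation before training. A minimal sketch of that matching follows. It assumes the two folders are literally named "images" and "labels" and that an image and its label share a file stem; the write-up confirms neither detail.

```python
from pathlib import Path
from typing import List, Tuple

IMAGE_SUFFIXES = {".jpg", ".jpeg"}


def pair_images_with_labels(drone_dir: Path) -> List[Tuple[Path, Path]]:
    """Match every image under one drone-type folder to its YOLO .txt label."""
    images_root = drone_dir / "images"  # assumed folder name
    labels_root = drone_dir / "labels"  # assumed folder name

    # Index labels by file stem, searching subfolders as well.
    labels = {p.stem: p for p in labels_root.rglob("*.txt")}

    pairs = []
    for image in sorted(images_root.rglob("*")):
        if image.suffix.lower() in IMAGE_SUFFIXES and image.stem in labels:
            pairs.append((image, labels[image.stem]))
    return pairs
```

Images without a matching label are skipped here, so comparing the number of pairs with the number of images gives a quick check for missing annotations.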

Key Differences Between Datasets

The Anti-UAV dataset primarily contains distant, cloudy shots with limited resolution, representing challenging detection scenarios. In contrast, VisioDECT provides higher-resolution images under the three recording scenarios described above (cloudy, sunny, and nighttime), including closer views of the drones. This difference in data domains proved crucial for understanding model performance and specialization.

Data Preparation Summary

The final training data consisted of:

Main dataset from Anti-UAV including original and augmented (flipped) images
Complete VisioDECT dataset added in later training phases for increased diversity

By combining two sources of drone imagery and applying augmentation techniques, the dataset aims to cover a wide spectrum of drone appearances and different environments, increasing the performance and general applicability of the trained YOLO model.
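
Ultralytics reads the dataset layout from a small YAML file with path, train, val, test, and names entries, and the train entry may list several image folders. The sketch below builds such a file as text. All folder names are placeholders, and the single "drone" class is an assumption based on the drone-versus-background evaluation described later.

```python
from typing import List


def build_data_yaml(root: str, train_dirs: List[str], val_dir: str, test_dir: str) -> str:
    """Return the text of an Ultralytics dataset YAML with one 'drone' class."""
    lines = [f"path: {root}", "train:"]
    lines += [f"  - {d}" for d in train_dirs]  # several sources can be combined
    lines += [
        f"val: {val_dir}",
        f"test: {test_dir}",
        "names:",
        "  0: drone",  # assumed single class
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder folder names, not the project's real layout.
    text = build_data_yaml(
        "datasets/drones",
        [
            "anti_uav/images/train",
            "anti_uav_flipped/images/train",
            "visiodect/images/train",
        ],
        "anti_uav/images/val",
        "anti_uav/images/test",
    )
    print(text)
```

Adding or removing entries in the train list is enough to switch between the original, augmented, and combined training sets.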

3. Training Methodology and Hyperparameters

⏱️ ~3.0h

Training Approach

Training was conducted in multiple phases, with each model progressively enhanced with additional data and augmentations. Instead of starting from scratch for each new modification, a continual training approach was used where models retain previously acquired knowledge while adapting to new datasets.

Multi-Phase Training Strategy

Phase 1 - Initial Training: The model was trained from YOLO's base configuration using the original Anti-UAV dataset as the foundation. This established baseline performance for distant, cloudy drone detection scenarios.

Phase 2 - Augmentation Integration: Continued training with the same data but including horizontally and vertically flipped versions. This phase tested whether geometric augmentations could improve the model's ability to handle different viewing angles.

Phase 3 - Dataset Expansion: Final training with the additional VisioDECT dataset, introducing more scenarios including images with different resolution and weather conditions. This phase aimed to improve generalization across diverse environments.

This approach enabled gradual improvement in drone detection without losing previously learned features, a key advantage over training separate models from scratch.
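
The chaining of phases can be sketched as follows: each phase starts from the best weights of the previous one. The checkpoint name yolov8n.pt, the run names, and the YAML file names are assumptions, since the write-up does not name the model size or its file layout. Starting a new train() call from saved weights amounts to fine-tuning, which is how "continual training" is interpreted here.

```python
from pathlib import Path

BASE_WEIGHTS = "yolov8n.pt"  # assumption: the model size is not stated in the write-up

# (run name, dataset YAML) for each phase; all names are placeholders.
PHASES = [
    ("phase1_original", "anti_uav_original.yaml"),
    ("phase2_flipped", "anti_uav_flipped.yaml"),
    ("phase3_visiodect", "anti_uav_plus_visiodect.yaml"),
]


def starting_weights(phase_index: int, project: str = "runs/drone") -> str:
    """Phase 0 starts from the base checkpoint, later phases from the previous best.pt."""
    if phase_index == 0:
        return BASE_WEIGHTS
    previous_run = PHASES[phase_index - 1][0]
    return str(Path(project) / previous_run / "weights" / "best.pt")


def run_all_phases(project: str = "runs/drone", **train_kwargs) -> None:
    """Train the phases in order, carrying the weights forward."""
    from ultralytics import YOLO  # imported lazily; requires the ultralytics package

    for index, (run_name, data_yaml) in enumerate(PHASES):
        model = YOLO(starting_weights(index, project))
        # exist_ok=True keeps the run folder name stable so the next phase can find best.pt.
        model.train(data=data_yaml, project=project, name=run_name,
                    exist_ok=True, **train_kwargs)
```

Keeping each phase in its own run folder also preserves the intermediate models for later comparison.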

Hyperparameter Configuration

Standard hyperparameters were used, adjusted to achieve balance between accuracy and efficiency:

Epochs: 20 per training phase
Batch Size: 16 images per batch
Image Size: 640×640 pixels (YOLO standard)
Optimizer: Automatic (YOLOv8's optimizer=auto setting selects SGD or AdamW)
Learning Rate: Set automatically by YOLOv8 together with the optimizer

These parameters were chosen based on YOLOv8 best practices and computational constraints. The batch size of 16 provided a good balance between training stability and GPU memory usage, while 20 epochs per phase allowed sufficient learning without overfitting.
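
Expressed as Ultralytics training arguments, the settings above might look like the sketch below. The argument names (epochs, batch, imgsz, optimizer) are real Ultralytics options, but the helper functions and the way they are wired together are illustrative.

```python
from typing import Any, Dict

# Values taken from the hyperparameter list above.
TRAIN_SETTINGS: Dict[str, Any] = {
    "epochs": 20,         # per training phase
    "batch": 16,          # images per batch
    "imgsz": 640,         # images are resized to 640x640
    "optimizer": "auto",  # let Ultralytics choose the optimizer and learning rate
}


def merged_settings(**overrides: Any) -> Dict[str, Any]:
    """Return a copy of the defaults with any per-run overrides applied."""
    return {**TRAIN_SETTINGS, **overrides}


def train_phase(weights: str, data_yaml: str, **overrides: Any):
    """Run one training phase with the shared settings."""
    from ultralytics import YOLO  # imported lazily; requires the ultralytics package

    model = YOLO(weights)
    return model.train(data=data_yaml, **merged_settings(**overrides))
```

If the GPU runs out of memory, a call such as train_phase(weights, data_yaml, batch=8) lowers the batch size for that run without changing the shared defaults.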

Model Evaluation Metrics

After completing training, each model was evaluated based on standard object detection metrics:

Precision: Reflects what proportion of detected objects are actually drones. High precision means few false positives (background incorrectly identified as drones).

Recall: Shows what proportion of actual drones were successfully detected. High recall means few false negatives (drones missed by the detector).

mAP@50: Mean Average Precision at an IoU (intersection over union) threshold of 0.50. A detection counts as correct only if its bounding box has an IoU of at least 0.50 with a ground-truth box.

mAP@50-95: Mean Average Precision averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05. Because the higher thresholds demand tighter bounding boxes, this metric is a stricter measure of localization quality.

These metrics provide comprehensive evaluation of model performance, capturing both the ability to find drones (recall) and the accuracy of those detections (precision).
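
The building blocks of these metrics are short enough to write out. The sketch below computes IoU for two boxes in (x1, y1, x2, y2) pixel coordinates, and precision and recall from counts of true positives, false positives, and false negatives. The numbers in the example are made up for illustration and are not results from this project.

```python
from typing import Tuple

BoxXYXY = Tuple[float, float, float, float]  # x1, y1, x2, y2

# The ten IoU thresholds averaged by mAP@50-95: 0.50, 0.55, ..., 0.95.
MAP_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def iou(a: BoxXYXY, b: BoxXYXY) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of detections that are real drones."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of real drones that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


if __name__ == "__main__":
    # Two 10x10 boxes shifted by 5 pixels: overlap 50, union 150, IoU = 1/3.
    overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
    print(f"IoU = {overlap:.3f}, counts at mAP@50: {overlap >= 0.50}")
    print(f"precision = {precision(97, 3):.2f}, recall = {recall(97, 2):.3f}")
```

The shifted-box example has an IoU of one third, so it would count as a miss even at the most lenient threshold of 0.50.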

4. Model Results and Performance Analysis

⏱️ ~5.0h

Model 1: Baseline YOLOv8

Model 1 represents the initial implementation without additional augmentations, trained exclusively on the original Anti-UAV dataset. The training curves show stable training with gradual loss reduction and high accuracy from early epochs. The model achieves mAP@50 of approximately 0.98 and mAP@50-95 of 0.68-0.70, indicating good adaptation to test data.

The confusion matrix shows high accuracy for drone detection with around 18,400 correct predictions. However, there is a small percentage of false positives where the model marks background as drones. This baseline establishes solid performance for cloudy, long-distance scenarios.

Key Metrics: Precision 0.97, Recall 0.975, mAP@50 0.98, mAP@50-95 0.68

Model 2: Vertical Flip Augmentation

In an attempt to improve performance, Model 2 was trained with additional vertical augmentation, which should help with images showing various drone perspectives. Results show slight improvement in precision (0.976) and recall (0.978). mAP@50 reaches 0.99 in some cases, and mAP@50-95 moves toward 0.69.

This means vertical flipping has a positive but limited effect on detection accuracy. The main conclusion is that this augmentation helps for certain viewing angles but doesn't cause dramatic improvements. The benefit is measurable but modest, suggesting that vertical perspective variations were already somewhat represented in the original dataset.

Key Metrics: Precision 0.976, Recall 0.978, mAP@50 0.98-0.99, mAP@50-95 0.69

Model 3: Horizontal Flip Challenges

Model 3 was trained with horizontally flipped images, but results show this is not the best approach. Although losses continue to decrease, detection confidence significantly drops. Some detections have much lower confidence (50-60%) compared to previous models which maintain confidence of 76-78%.

This can be attributed to the fact that the data doesn't contain enough scenarios where drones are viewed from opposite horizontal orientations. Such training might actually confuse the model rather than help it generalize better. This finding highlights that not all augmentations are beneficial—transformations that don't reflect real-world variations can harm performance.

Key Metrics: Precision 0.95, Recall 0.94, mAP@50 0.93-0.95, mAP@50-95 0.62

Model 4: VisioDECT Integration

The most interesting result comes from Model 4, the first model trained with the VisioDECT dataset. With this data included, the model became much better at detecting drones in varied scenarios, especially in close, high-resolution images. Precision and recall reach about 0.99 and mAP@50-95 reaches 0.70, making it the best model in this study.

This shows that including different scenarios in training significantly improves the model's ability to generalize. The VisioDECT dataset's diversity in lighting conditions, distances, and resolutions provided the model with richer training examples, enabling better performance across a wider range of conditions.

Key Metrics: Precision 0.99, Recall 0.99, mAP@50 0.99-1.00, mAP@50-95 0.70+

Comparative Analysis

Model 4 is the best choice for real-world application due to its superior generalization, but Models 1 and 2 are also good for specific conditions. Model 3 showed limited effectiveness and is likely the least useful in this situation. The results demonstrate that dataset diversity matters more than simple geometric augmentations for improving drone detection performance.
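
The ranking can be reproduced from the reported key metrics. The sketch below uses only the single-valued figures from the Key Metrics lines (the mAP@50 ranges are left out, and Model 4's "0.70+" is entered as 0.70). The F1 score is a derived number added here for comparison and does not appear in the original results.

```python
from typing import Dict, List

# Reported values copied from the "Key Metrics" lines above.
REPORTED: Dict[str, Dict[str, float]] = {
    "Model 1": {"precision": 0.97, "recall": 0.975, "map50_95": 0.68},
    "Model 2": {"precision": 0.976, "recall": 0.978, "map50_95": 0.69},
    "Model 3": {"precision": 0.95, "recall": 0.94, "map50_95": 0.62},
    "Model 4": {"precision": 0.99, "recall": 0.99, "map50_95": 0.70},  # reported as 0.70+
}


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def rank_models(metric: str = "map50_95") -> List[str]:
    """Return model names sorted from best to worst on the chosen metric."""
    def score(name: str) -> float:
        row = REPORTED[name]
        if metric == "f1":
            return f1(row["precision"], row["recall"])
        return row[metric]

    return sorted(REPORTED, key=score, reverse=True)


if __name__ == "__main__":
    for name in rank_models("f1"):
        row = REPORTED[name]
        print(f"{name}: F1 = {f1(row['precision'], row['recall']):.3f}, "
              f"mAP@50-95 = {row['map50_95']:.2f}")
```

Ranking by F1 and by mAP@50-95 gives the same order: Model 4, Model 2, Model 1, Model 3.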

5. Discussion and Future Recommendations

⏱️ ~4.0h

Impact of Augmentations

The applied augmentation techniques had varying impacts on the models. Vertical flipping (Model 2) brought slight improvement in certain test scenarios, mainly by making the model less sensitive to the angle from which the drone is viewed. On the other hand, horizontal flipping (Model 3) led to decreased prediction confidence, with some detections having much lower confidence values.

This indicates that not every augmentation contributes to improvement—sometimes introducing transformations that aren't common in the real domain can cause the model to become unstable. This finding emphasizes that augmentations must be carefully chosen, with assessment of whether they truly add useful variation relative to how objects appear in the real world.

Different Domains and Their Influence

The differences between the first (Anti-UAV) and second (VisioDECT) datasets played a crucial role in model performance. The first dataset mainly contains distant and blurry shots in cloudy conditions, while the second has diversity in lighting and closer drone perspectives. As a result, the model trained with the second dataset (Model 4) detects drones significantly better in these new conditions but is less effective in scenes typical of the first dataset.

This highlights the importance of alignment between the dataset and the real context in which the model will be used. If the model is trained with data too different from what it will see in production, it may face difficulties in generalization.

Specialization vs Generalization

The analysis shows that a single model can hardly achieve optimal performance in both domains. While models trained on the first dataset have higher stability for distant and blurry scenes, Model 4 dominates in close and clear images. This raises the question of training strategy: is it better to create one model that will try to generalize or two models specialized for different scenarios?

If there's a clear way to identify which domain an image belongs to (for example through camera metadata), an approach where different models are used in different situations could be tested.
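
One way such routing could look is sketched below: a lookup from a camera identifier to a domain, and from the domain to a set of weights. The camera names, the metadata key, and the weight file names are all hypothetical. Unknown cameras fall back to Model 4, which the comparison above identifies as the best general choice.

```python
from typing import Dict

# Hypothetical mapping from camera identifier to data domain.
CAMERA_DOMAIN: Dict[str, str] = {
    "rooftop_long_range": "distant",
    "gate_short_range": "close",
}

# Placeholder weight files for the specialized models.
WEIGHTS_BY_DOMAIN: Dict[str, str] = {
    "distant": "model2_best.pt",  # trained on Anti-UAV data
    "close": "model4_best.pt",    # trained with VisioDECT included
}

FALLBACK_WEIGHTS = "model4_best.pt"


def weights_for_frame(metadata: Dict[str, str]) -> str:
    """Pick model weights from frame metadata, falling back to the general model."""
    domain = CAMERA_DOMAIN.get(metadata.get("camera_id", ""))
    return WEIGHTS_BY_DOMAIN.get(domain, FALLBACK_WEIGHTS)
```

Whether routing beats a single combined model would still need to be measured on held-out data from both domains.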

Recommendations for Future Improvements

Better Dataset Combination: Achieving a good balance between distant and close shots requires training on a carefully balanced mix of data from both domains. Methods such as domain adaptation or transfer learning could also be explored to improve knowledge transfer between the datasets.

Augmentation Realism: Vertical flipping may be useful in specific cases, but its impact should be examined to determine if it's sufficiently significant. Horizontal flipping might not be the most appropriate technique for this problem, as the nature of the data suggests drones are rarely viewed under such conditions.

Separate Models Consideration: Instead of one model trying to generalize, two models could be used, each specialized for a different domain. This could be useful if the application allows automatic recognition of image context.

Creative Augmentations: Instead of just flipping, other techniques like random cropping, scaling, rotation, or photometric augmentations like color jittering could be explored. This may help achieve better generalization across different lighting and viewing angles.

More Diverse Data Collection: Where possible, more images should be collected that cover the full range of conditions, from distant and cloudy to close and clear, in different weather. This would make a single model more balanced and reduce the need for domain-specific models.
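
For the augmentation recommendation above, Ultralytics already exposes rotation, scaling, translation, and color jitter as training arguments, so they can be tried without preparing extra image copies. The argument names below (degrees, scale, translate, hsv_h, hsv_s, hsv_v, flipud, fliplr) are real Ultralytics options; the values are illustrative starting points rather than tuned settings, and the helper functions are hypothetical.

```python
from typing import Any, Dict

AUGMENTATION_SETTINGS: Dict[str, float] = {
    "degrees": 10.0,   # random rotation of up to +/- 10 degrees
    "scale": 0.5,      # random scaling gain
    "translate": 0.1,  # random shift as a fraction of image size
    "hsv_h": 0.015,    # hue jitter
    "hsv_s": 0.7,      # saturation jitter
    "hsv_v": 0.4,      # brightness jitter
    "flipud": 0.0,     # probability of an up-down flip
    "fliplr": 0.0,     # probability of a left-right flip; the library default is 0.5,
                       # disabled here following the Model 3 finding
}


def check_settings(settings: Dict[str, float]) -> bool:
    """Basic sanity check: no negative values, flip probabilities within 0..1."""
    if any(value < 0 for value in settings.values()):
        return False
    return all(0.0 <= settings[key] <= 1.0 for key in ("flipud", "fliplr"))


def train_with_augmentations(weights: str, data_yaml: str, **extra: Any):
    """Train with the on-the-fly augmentations listed above."""
    from ultralytics import YOLO  # imported lazily; requires the ultralytics package

    if not check_settings(AUGMENTATION_SETTINGS):
        raise ValueError("invalid augmentation settings")
    return YOLO(weights).train(data=data_yaml, **AUGMENTATION_SETTINGS, **extra)
```

Because these augmentations are applied at load time, the effect of each one can be tested by changing a single value and retraining.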

Discussion & Notes

Why did horizontal flipping reduce performance?

Horizontal flipping introduced unrealistic drone orientations not present in real-world scenarios. The augmentation confused the model rather than helping it generalize, demonstrating that augmentations must reflect actual data variations.

Which model should be used in production?

Model 4 is best for diverse conditions with close and clear images. Models 1-2 are better for distant, cloudy scenarios. Consider using separate models for different deployment contexts or combining datasets for a more balanced model.

How were the datasets different?

Anti-UAV contains distant, cloudy, lower-resolution shots, while VisioDECT has closer, higher-resolution images recorded in cloudy, sunny, and nighttime conditions. This domain difference significantly affected model specialization.

Results

Successfully trained and evaluated four YOLOv8 models for drone detection. Model 4 (with VisioDECT integration) achieved best overall performance with 99% precision and recall. Analysis revealed that dataset diversity impacts performance more than simple geometric augmentations.

Best Precision: 99%
Best Recall: 99%
Best mAP@50: 99-100%
Best mAP@50-95: 70%

Planned Enhancements

Real-Time Video Detection

Priority: high

Implement real-time drone detection and tracking in video streams with optimized inference speed for live monitoring applications.

Domain Adaptation Techniques

Priority: high

Apply advanced domain adaptation methods to improve model transferability between different environmental conditions and camera setups.

Multi-Drone Tracking

Priority: medium

Extend detection to simultaneous tracking of multiple drones with unique ID assignment and trajectory prediction.

Mobile Deployment

Priority: medium

Optimize model for mobile and edge devices using quantization and pruning techniques for on-device inference.

Safety Notes

Ensure proper GPU memory management during training. Use appropriate batch sizes to prevent out-of-memory errors. Validate model predictions before deployment in security-critical applications. Consider ethical implications of drone detection systems.