Tri-Modal Dataset and a Baseline System for Tracking Unmanned Aerial Vehicles

News
- 11/23/2025: The baseline method MMA-SORT for multi-modal anti-UAV tracking is released at the GitHub Page.
- 11/23/2025: The Evaluation Toolkit is released at the GitHub Page.
Highlights
- More modalities: RGB, Thermal Infrared, and Event
- Large-scale dataset: 1,321 tri-modal sequences, 2.8M annotated frames
- Comprehensive evaluation: anti-UAV tracking, multi-object tracking
- Generic scenes: more than 30 scenes


Abstract
With the proliferation of low-altitude unmanned aerial vehicles (UAVs), visual multi-object tracking is becoming a critical security technology, demanding robustness in complex environmental conditions. However, tracking UAVs with a single visual modality often fails in challenging scenarios, such as low illumination, cluttered backgrounds, and rapid motion. Although multi-modal multi-object UAV tracking is more resilient, the development of effective solutions has been hindered by the absence of dedicated public datasets. To bridge this gap, we release MM-UAV, the first large-scale benchmark for Multi-Modal UAV Tracking, integrating three key sensing modalities: RGB, infrared (IR), and event signals. The dataset spans over 30 challenging scenarios, with 1,321 synchronised multi-modal sequences and more than 2.8 million annotated frames. Accompanying the dataset, we provide a novel multi-modal multi-UAV tracking framework, designed specifically for anti-UAV applications and serving as a baseline for future research. Our framework incorporates two key technical innovations: an offset-guided adaptive alignment module that resolves spatial mismatches across sensors, and an adaptive dynamic fusion module that balances the complementary information conveyed by different modalities. Furthermore, to overcome the limitations of conventional appearance modelling in multi-object tracking, we introduce an event-enhanced association mechanism that leverages motion cues from the event modality for more reliable identity maintenance. Comprehensive experiments demonstrate that the proposed framework consistently outperforms state-of-the-art methods, particularly in challenging visual conditions, establishing a strong baseline for UAV tracking in complex low-altitude environments.
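The adaptive dynamic fusion module described above can be pictured as input-dependent gating over the three modality streams. The sketch below is a minimal PyTorch illustration of that idea; the class name, feature shapes, and gating design are our assumptions for exposition, not the released MMA-SORT implementation.

```python
import torch
import torch.nn as nn

class AdaptiveDynamicFusion(nn.Module):
    """Sketch: fuse spatially aligned RGB, IR, and event feature maps
    with learned, input-dependent modality weights (gated fusion)."""

    def __init__(self, channels: int):
        super().__init__()
        # One scalar gate per modality, predicted from global context
        # of the concatenated features.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                  # (B, 3C, 1, 1)
            nn.Conv2d(3 * channels, 3, kernel_size=1),
            nn.Softmax(dim=1),                        # weights sum to 1
        )

    def forward(self, f_rgb, f_ir, f_event):
        # Inputs: (B, C, H, W), assumed already aligned across sensors
        # (e.g., by an offset-guided alignment step).
        stacked = torch.cat([f_rgb, f_ir, f_event], dim=1)
        w = self.gate(stacked)                        # (B, 3, 1, 1)
        fused = (w[:, 0:1] * f_rgb
                 + w[:, 1:2] * f_ir
                 + w[:, 2:3] * f_event)
        return fused
```

Under this formulation, the gate can down-weight RGB at night and lean on IR or event features instead, which is exactly the complementary behaviour the abstract motivates.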
Download
The dataset will be released soon.
Evaluation & Results
For evaluation, MOTA, HOTA, IDF1, and IDs (identity switches) are adopted. The Evaluation Toolkit is released at the GitHub Page.
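As a quick reference, MOTA follows the standard CLEAR-MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT. The snippet below is a minimal sketch of that formula from per-sequence error counts; it mirrors the common definition, not necessarily the toolkit's internals.

```python
def mota(num_false_negatives: int,
         num_false_positives: int,
         num_id_switches: int,
         num_gt_detections: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, accumulated over all frames."""
    errors = num_false_negatives + num_false_positives + num_id_switches
    return 1.0 - errors / num_gt_detections

# Illustrative numbers only: 120 misses, 80 false positives, and
# 10 identity switches over 10,000 ground-truth boxes -> MOTA = 0.979
print(mota(120, 80, 10, 10_000))
```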

Citation
Contact
If you have any questions, please contact us at tianyang.xu@jiangnan.edu.cn.