A Dataset and Deep Learning Benchmark for Event Recognition in Aerial Videos


With the now widespread availability of unmanned aerial vehicles (UAVs), large volumes of aerial videos have been produced. It is unrealistic for humans to screen such large amounts of data and understand their contents. Hence, methodological research on the automatic understanding of UAV videos is of paramount importance. In this work, we introduce to the remote sensing community a novel task of event recognition in unconstrained aerial videos and present a large-scale, human-annotated dataset, named ERA (Event Recognition in Aerial videos), consisting of 2,864 videos, each with a label from 25 different classes corresponding to an event unfolding over 5 seconds. The ERA dataset is designed to have significant intra-class variation and inter-class similarity, and it captures dynamic events in different environments and at different scales. Moreover, in order to provide a benchmark for this task, we extensively evaluate existing deep networks. We expect that the ERA dataset will facilitate further progress in automatic aerial video comprehension.


The goal of this work is to collect a large, diverse dataset for training models for event understanding in UAV videos. Since we gather aerial videos from YouTube, the largest video-sharing platform in the world, we are able to include a breadth of diversity that makes the dataset more challenging than one built from self-collected data. In total, we have gathered and annotated 2,864 videos for 25 classes. Each video sequence is recorded at 24 fps (frames per second), lasts 5 seconds, and has a spatial size of 640×640 pixels.
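As a rough illustration of what these figures imply for a data loader, the sketch below derives the number of frames per clip and an uncompressed clip size from the numbers stated above. The constant names, the three-channel RGB layout, and the 8-bit pixel depth are assumptions made for this example; they are not specified by the dataset description.

```python
# Clip parameters as stated in the dataset description.
FPS = 24            # frames per second
DURATION_S = 5      # clip length in seconds
HEIGHT = 640        # pixels
WIDTH = 640         # pixels
NUM_VIDEOS = 2864

# Assumptions for illustration only: RGB frames stored as 8-bit integers.
CHANNELS = 3
BYTES_PER_VALUE = 1

# Frames in one clip and in the whole dataset.
frames_per_clip = FPS * DURATION_S
total_frames = NUM_VIDEOS * frames_per_clip

# Shape of one decoded clip in (frames, height, width, channels) order.
clip_shape = (frames_per_clip, HEIGHT, WIDTH, CHANNELS)

# Memory needed to hold one fully decoded clip under the assumptions above.
bytes_per_clip = frames_per_clip * HEIGHT * WIDTH * CHANNELS * BYTES_PER_VALUE

print("frames per clip:", frames_per_clip)
print("total frames:", total_frames)
print("clip shape:", clip_shape)
print("decoded clip size (MB):", bytes_per_clip / 1e6)
```

Under these assumptions a single decoded clip occupies well over 100 MB, which is one reason video models typically sample a subset of frames or downscale them before training.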



1) Confusion matrix of TRN for the ERA dataset.
2) Confusion matrix of DenseNet-201 for the ERA dataset.
3) Examples of event recognition results on the ERA dataset. We show the best two single-frame classification network architectures (i.e., Inception-v3 and DenseNet-201) and the best two video classification network architectures (i.e., I3D and TRN). The ground truth label and top 3 predictions of each model are reported. For each example, we show the first (left) and last (right) frames. Best viewed zoomed in color.
4) Examples of misclassifications. We show several failure examples where the ground truth label is not among the top 3 predictions.
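The quantities behind these figures can be sketched in a few lines. The example below is a minimal illustration, not the evaluation code used for the benchmark: the helper names `top_k`, `is_top_k_failure`, and `confusion_matrix` are hypothetical, and the scores and labels are toy values rather than results on ERA.

```python
def top_k(scores, k=3):
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def is_top_k_failure(scores, true_label, k=3):
    """True when the ground truth label is not among the top-k predictions."""
    return true_label not in top_k(scores, k)


def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count matrix with rows indexed by true class, columns by predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


# Toy class scores for a single video over 5 hypothetical classes.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
top3 = top_k(scores, k=3)
failure_for_class_0 = is_top_k_failure(scores, true_label=0)
failure_for_class_4 = is_top_k_failure(scores, true_label=4)

# Toy confusion matrix over 3 hypothetical classes and 4 videos.
cm = confusion_matrix([0, 1, 1, 2], [0, 1, 2, 2], num_classes=3)

print(top3, failure_for_class_0, failure_for_class_4)
print(cm)
```

For ERA, `num_classes` would be 25, and each row of the matrix would show how the videos of one event class are distributed over the predicted classes.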


Coming Soon...