TITLE: Detect to Track and Track to Detect
AUTHOR: Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman
ASSOCIATION: Graz University of Technology, University of Oxford
FROM: arXiv:1710.03958
For frame-level detection, this work adopts R-FCN as the base detector, applied to each frame independently. Inter-frame correlation features are computed from the feature maps of the two frames. The network is trained with a multi-task loss that combines classification, localization (bounding-box regression) and displacement (track regression) terms. The workflow of this work is shown in the following figure.
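The key innovation of this work is an operation denoted as ROI tracking. The input of this operation is the bounding-box regression features of the two frames, together with the correlation features between the two feature maps $x^{t}$ and $x^{t+\tau}$:
$$x_{corr}^{t,t+\tau}(i, j, p, q) = \left\langle x^{t}(i, j),\; x^{t+\tau}(i+p,\, j+q) \right\rangle$$
where $-d \le p \le d$ and $-d \le q \le d$ are offsets to compare features in a square neighbourhood around the location $(i, j)$ in the feature map, defined by the maximum displacement $d$.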
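The local correlation described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function name `local_correlation`, the nested-list layout `[row][col][channel]`, and the zero-padding at the borders are assumptions made for illustration.

```python
def local_correlation(feat_t, feat_t_tau, d):
    """Correlate two feature maps over a (2d+1) x (2d+1) neighbourhood.

    feat_t, feat_t_tau: nested lists indexed [row][col][channel], same shape.
    Returns corr[i][j][p + d][q + d], the dot product between the feature
    at (i, j) in the first map and the feature at (i + p, j + q) in the
    second. Neighbours outside the map contribute 0 (assumed zero padding).
    """
    height, width = len(feat_t), len(feat_t[0])
    size = 2 * d + 1
    corr = [[[[0.0] * size for _ in range(size)] for _ in range(width)]
            for _ in range(height)]
    for i in range(height):
        for j in range(width):
            for p in range(-d, d + 1):
                for q in range(-d, d + 1):
                    ii, jj = i + p, j + q
                    if 0 <= ii < height and 0 <= jj < width:
                        corr[i][j][p + d][q + d] = sum(
                            a * b
                            for a, b in zip(feat_t[i][j], feat_t_tau[ii][jj]))
    return corr


# Toy example: one non-zero feature moves one column to the right.
H, W = 3, 4
frame_t = [[[0.0, 0.0] for _ in range(W)] for _ in range(H)]
frame_t_tau = [[[0.0, 0.0] for _ in range(W)] for _ in range(H)]
frame_t[1][1] = [1.0, 2.0]
frame_t_tau[1][2] = [1.0, 2.0]
corr = local_correlation(frame_t, frame_t_tau, d=1)
# The response at location (1, 1) peaks at offset (p, q) = (0, +1).
```

Each location ends up with (2d+1)² correlation values, so the offset with the strongest response hints at where the feature moved between the two frames.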
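The loss function sums a classification term over all $N$ RoIs, a bounding-box regression term over the $N_{fg}$ foreground RoIs, and a track regression term over the $N_{tra}$ RoIs that have a ground-truth track correspondence. It is written as
$$L(\{p_i\}, \{b_i\}, \{\Delta_i\}) = \frac{1}{N}\sum_{i=1}^{N} L_{cls}(p_{i,c^*}) + \lambda \frac{1}{N_{fg}} \sum_{i=1}^{N} [c_i^* > 0]\, L_{reg}(b_i, b_i^*) + \lambda \frac{1}{N_{tra}} \sum_{i=1}^{N_{tra}} L_{tra}(\Delta_i^{t+\tau}, \Delta_i^{*,t+\tau})$$
where $c_i^*$ is the ground-truth class of RoI $i$ (0 for background), $p_{i,c^*}$ is the predicted softmax probability for that class, $b_i$ and $b_i^*$ are the predicted and target box regressions, $\Delta_i^{t+\tau}$ and $\Delta_i^{*,t+\tau}$ are the predicted and target displacements between frames $t$ and $t+\tau$, and $\lambda$ balances the terms.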
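To make the three terms of the multi-task loss concrete, here is a minimal sketch in plain Python. The function names, the dictionary keys (`probs`, `label`, `box`, `box_gt`, `track`, `track_gt`) and the balance weight `lam` are hypothetical. It also assumes cross-entropy for classification and smooth L1 for both regression terms, which is a common choice but one the text above does not state.

```python
import math


def smooth_l1(pred, target):
    """Smooth L1 distance summed over coordinates (assumed regression loss)."""
    total = 0.0
    for x, y in zip(pred, target):
        diff = abs(x - y)
        total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return total


def multitask_loss(rois, lam=1.0):
    """Classification + localization + displacement loss over a list of RoIs.

    Each RoI is a dict with (hypothetical) keys:
      'probs'    softmax probabilities over classes
      'label'    ground-truth class index, 0 meaning background
      'box', 'box_gt'      predicted / target box regression (foreground only)
      'track', 'track_gt'  predicted / target displacement; 'track_gt' is
                           absent or None when the RoI has no track target
    """
    n = len(rois)
    # Classification: cross-entropy averaged over all RoIs.
    cls = sum(-math.log(r['probs'][r['label']]) for r in rois) / n
    # Localization: only foreground RoIs contribute.
    fg = [r for r in rois if r['label'] > 0]
    reg = sum(smooth_l1(r['box'], r['box_gt']) for r in fg) / len(fg) if fg else 0.0
    # Displacement: only RoIs with a ground-truth track contribute.
    tra = [r for r in rois if r.get('track_gt') is not None]
    trk = (sum(smooth_l1(r['track'], r['track_gt']) for r in tra) / len(tra)
           if tra else 0.0)
    return cls + lam * reg + lam * trk
```

Normalising each term by its own count (all RoIs, foreground RoIs, tracked RoIs) keeps the regression terms from vanishing when most RoIs are background.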
A class-wise linking score is defined to combine detections and tracks across time:
$$s_c(D_{i,c}^{t}, D_{j,c}^{t+\tau}, T^{t,t+\tau}) = p_{i,c}^{t} + p_{j,c}^{t+\tau} + \phi(D_i^{t}, D_j^{t+\tau}, T^{t,t+\tau})$$
where the pairwise term $\phi$ evaluates to 1 if the IoU overlap of a track correspondence $T^{t,t+\tau}$ with the detection boxes $D_i^{t}$, $D_j^{t+\tau}$ is larger than 0.5, and to 0 otherwise. $p_{i,c}^{t}$ and $p_{j,c}^{t+\tau}$ are the softmax probabilities for class $c$. The optimal path across a video is found by maximizing the linking scores over the duration T of the video. Once the optimal tube is found, the detections corresponding to that tube are removed from the candidate set, and the procedure is applied again to the remaining detections. The detection scores within each tube are reweighted by adding the mean of the 50% highest scores in that tube.
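The linking procedure can be sketched as follows, in plain Python. This is an illustrative reading of the description, not the authors' code. The function names, the `(box, prob)` detection layout, boxes as `(x1, y1, x2, y2)`, and the rounding used for "the 50% highest scores" are all assumptions. The sketch also assumes at least two frames and at least one detection per frame.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def link_score(det_a, det_b, tracks, thresh=0.5):
    """Sum of both class probabilities plus the pairwise term.

    det_a, det_b: (box, prob) in frames t and t+tau.
    tracks: list of (box_in_t, box_in_t_plus_tau) track correspondences.
    """
    phi = any(iou(det_a[0], ta) > thresh and iou(det_b[0], tb) > thresh
              for ta, tb in tracks)
    return det_a[1] + det_b[1] + (1.0 if phi else 0.0)


def best_tube(frames, tracks):
    """Viterbi search for the highest-scoring path through the frames.

    frames[t]: list of detections in frame t.
    tracks[t]: track correspondences between frames t and t+1.
    Returns one detection index per frame.
    """
    best = [0.0] * len(frames[0])
    back = []
    for t in range(1, len(frames)):
        new_best, ptr = [], []
        for det_b in frames[t]:
            cands = [best[i] + link_score(det_a, det_b, tracks[t - 1])
                     for i, det_a in enumerate(frames[t - 1])]
            k = max(range(len(cands)), key=cands.__getitem__)
            new_best.append(cands[k])
            ptr.append(k)
        best = new_best
        back.append(ptr)
    # Backtrack from the best final detection.
    j = max(range(len(best)), key=best.__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return path[::-1]


def rescore(scores):
    """Add the mean of the top half of the tube's scores to every score."""
    top = sorted(scores, reverse=True)[:max(1, len(scores) // 2)]
    bonus = sum(top) / len(top)
    return [s + bonus for s in scores]
```

To extract several tubes, one would call `best_tube`, remove the selected detections from `frames`, and repeat on what remains. `rescore` is then applied to the scores of each tube, which boosts weak detections that sit on an otherwise confident track.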