Appearance Learning from Sparsely Annotated Video

Overview

Our method, FlowBoost, starts from a sparse labeling of the video and alternates between training an appearance-based detector and applying a convex, multi-target, time-based regularization. The latter relabels the full training video in a manner that is both consistent with the responses of the current detector and in accordance with physical constraints on target motion. Given a training video, we begin by annotating a small subset of its frames and leave the rest unlabeled. This limited initial training data is used to train an appearance-based classifier, which is then evaluated on the entire video sequence. Admissible trajectories output by the time-based regularization are retained as positive samples, the remaining data is retained as negative samples, and the process is iterated.

[Figure: overview of the FlowBoost framework]
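
To make the loop concrete, below is a minimal, self-contained toy version of it on synthetic one-dimensional data. This is a sketch under strong simplifying assumptions, not the authors' implementation: the appearance model is scikit-learn's AdaBoostClassifier, and the convex multi-target regularization is reduced to a Viterbi-style search for a single trajectory with bounded frame-to-frame motion.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    rng = np.random.default_rng(0)
    T, W, MAX_MOTION = 200, 50, 2   # frames, candidate positions per frame, motion bound

    # Synthetic "video": one target drifting over W candidate positions per frame.
    truth = np.clip(np.cumsum(rng.integers(-1, 2, size=T)) + W // 2, 0, W - 1)
    feats = rng.normal(size=(T, W, 8))
    feats[np.arange(T), truth] += 2.0   # target patches get a shifted feature mean

    annotated = {t: int(truth[t]) for t in range(0, T, 32)}  # 1 frame in 32 labeled
    labels = dict(annotated)

    for it in range(3):
        # 1. Train the appearance-based model on the current labeling.
        frames = sorted(labels)
        X = np.concatenate([feats[t] for t in frames])
        y = np.concatenate([(np.arange(W) == labels[t]).astype(int) for t in frames])
        clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

        # 2. Evaluate it on the entire sequence.
        scores = clf.decision_function(feats.reshape(T * W, -1)).reshape(T, W)

        # 3. Time-based regularization, reduced here to a Viterbi search for the
        #    highest-scoring trajectory moving at most MAX_MOTION per frame.
        dp, back = scores.copy(), np.zeros((T, W), dtype=int)
        for t in range(1, T):
            for p in range(W):
                lo = max(0, p - MAX_MOTION)
                prev = lo + int(np.argmax(dp[t - 1, lo:p + MAX_MOTION + 1]))
                dp[t, p] += dp[t - 1, prev]
                back[t, p] = prev
        track = [int(np.argmax(dp[-1]))]
        for t in range(T - 1, 0, -1):
            track.append(int(back[t, track[-1]]))
        track.reverse()

        # 4. Trajectory points become positives for the next round; the manual
        #    annotations are always kept.
        labels = {**{t: track[t] for t in range(T)}, **annotated}
        print(f"iteration {it}: track matches truth on "
              f"{np.mean(np.array(track) == truth):.0%} of frames")

On this toy data the recovered trajectory quickly snaps onto the true one, even though only one frame in 32 was annotated, which is the behavior the alternation is designed to produce.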

Qualitative Results

Training

Migrating Neurons
Typical results obtained with our framework on time-lapse microscopy data containing migrating neurons. The sequence shows the result of FlowBoost's third iteration. Grey boxes indicate the hand-labelled ground truth, shown throughout the sequence for reference. Green boxes indicate the manual ground truth actually used by FlowBoost; in this case, one frame in 32 is annotated. Blue boxes show the labels recovered by FlowBoost in the remaining frames.


Pedestrians on Campus
Typical results obtained with our framework on pedestrian data. The sequence shows the result of FlowBoost's third iteration. Grey boxes indicate the hand-labelled ground truth, shown throughout the sequence for reference. Green boxes indicate the manual ground truth actually used by FlowBoost; in this case, one frame in 64 is annotated. Blue boxes show the labels recovered by FlowBoost in the remaining frames.



Quantitative Results

We used the labels generated by FlowBoost to train a detector and tested it on separate test sequences. We compare against two baselines: an AdaBoost procedure with access to the same sparse ground truth as FlowBoost, and an AdaBoost procedure with access to the ground truth for the entire training video.

[Figures: ROC curves on the hand and car test sequences]
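
The protocol behind such curves is standard: train a detector on each labeling, score a separate fully annotated test set, and sweep the decision threshold. A minimal sketch follows; the toy_patches helper is a hypothetical stand-in for extracting labeled patches from real sequences.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.metrics import roc_curve

    rng = np.random.default_rng(1)

    def toy_patches(n, shift=2.0):
        """Stand-in for real image patches: positives get a shifted feature mean."""
        X = rng.normal(size=(n, 8))
        y = rng.integers(0, 2, size=n)
        X[y == 1] += shift
        return X, y

    # In the real experiment, the training patches and labels come from the
    # FlowBoost (or baseline) labeling of the training video, and the test set
    # is a separate, fully annotated sequence.
    X_train, y_train = toy_patches(2000)
    X_test, y_test = toy_patches(1000)

    detector = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    fpr, tpr, _ = roc_curve(y_test, detector.decision_function(X_test))
    print(f"TPR at ~10% FPR: {tpr[np.searchsorted(fpr, 0.1)]:.2f}")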

Why it works

Though the strategy described above seems sensible, it carries several risks and failure modes. Clearly, one fixed point of the iterative process is the labeling that corresponds to the correct target locations. Such a labeling would produce a robust appearance-based model, which would in turn ensure that the trajectories output by the time-based regularization correspond to actual target motions. Nothing, however, prevents the system from converging to a labeling in which a mixture of target patches and background patches is marked as positive. In that case, the appearance-based model can simply learn a multi-modal distribution, and the regularizer would readily accept background patches that follow the camera motion, producing an alternative fixed point of the iterative process.

FlowBoost avoids this situation by ensuring that the appearance-based model and the time-based regularizer minimize a common loss function. See the paper for details.
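
In symbols, one way to write such a common objective (a sketch in illustrative notation, not necessarily the paper's exact formulation): let f be the appearance-based score of a patch, let y assign a label y_{t,x} in {-1, +1} to every candidate location x in every frame t, and let the feasible set consist of labelings that form physically admissible trajectories.

    % Illustrative shared objective; see the paper for the exact formulation.
    \[
      L(f, y) \;=\; \sum_{t=1}^{T} \sum_{x} \exp\!\bigl(-\, y_{t,x}\, f(I_{t,x})\bigr),
      \qquad y \in \mathcal{Y},
    \]
    % where $I_{t,x}$ is the patch at location $x$ in frame $t$, and
    % $\mathcal{Y}$ is the set of physically admissible trajectory labelings.
    % The boosting step decreases $L$ over $f$ for fixed $y$; the time-based
    % regularization minimizes $L$ over $y \in \mathcal{Y}$ for fixed $f$.

Because both steps decrease the same quantity, a mixed target/background labeling cannot displace the correct one unless it genuinely achieves a lower value of the shared loss.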

Links

FlowBoost: Appearance Learning from Sparsely Annotated Video
K. Ali, D. Hasler and F. Fleuret
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.