Fast Video Shot Transition Localization with Deep Structured Models

100%

Fast Video Shot Transition Localization with Deep Structured Models

Abstract. Detection of video shot transition is a crucial pre-processing step in video analysis. Previous studies are restricted on detecting sudden content changes between frames through similarity measurement and multi-scale operations are widely utilized to deal with transitions of various lengths. However, localization of gradual transitions are still underexplored due to the high visual similarity between adjacent frames. Cut shot transitions are abrupt semantic breaks while gradual shot transitions contain low-level spatial-temporal patterns caused by video effects in addition to the gradual semantic breaks, e.g. dissolve. In order to address the problem, we propose a structured network which is able to detect these two shot transitions using targeted models separately. Considering speed performance trade-offs, we design the following framework. In the first stage, a light filtering module is utilized for collecting candidate transitions on multiple scales. Then, cut transitions and gradual transitions are selected from those candidates by separate detectors. To be more specific, the cut transition detector focus on measuring image similarity and the gradual transition detector is able to capture temporal pattern of consecutive frames, even locating the positions of gradual transitions. The light filtering module can rapidly exclude most of the video frames from further processing and maintain an almost perfect recall of both cut and gradual transitions. The targeted models in the second stage further process the candidates obtained in the first stage to achieve a high precision. With one TITAN GPU, the proposed method can achieve a thirty times real-time speed. Experiments on public TRECVID zero seven and RAI databases show that our method outperforms the state-of-the-art methods. In order to train a high-performance shot transition detector, we contribute a new database ClipShots, which contains one hundred twenty-eight thousand six hundred thirty-six cut transitions and thirty-eight thousand one hundred twenty gradual transitions from four thousand thirty-nine online videos. ClipShots intentionally collect short videos for more hard cases caused by hand-held camera vibrations, large object motions, and occlusion. The database is available at github dot com slash Tangshitao slash ClipShots.

One Introduction

Shot transition detector is a necessary component in many video recognition tasks. The goal of shot transition detection is to find semantic breaks in videos. Cut transitions are defined as abrupt transitions from one sequence to another while gradual transitions are almost the same but in a gradual manner. They share one common attribute, the start of a transition and the end of a transition are semantically different. Previous methods focus on finding both cut transitions and gradual transitions with one similarity function. Such methods have shown a great success in cut transition detection in the aspects of both speed and accuracy. However, when applied to gradual transition detection, it is not effective in the detection of gradual transitions. As Figure one shows, it is widely recognized that many large motions or occlusion, e.g. camera movement, are detected as positive when only measuring similarity. In order to overcome this shortcoming, recent research begins to explore the temporal pattern of gradual transitions. Therefore, in the C three D ConvNet is adopted to classify segments into three classes (cut, gradual and background), which achieves state-of-the-art performance. Yet C three D ConvNet not only consumes too much computing resources, but is also not an effective architecture for handling both cut and gradual transitions, i.e. the lengths of gradual transitions are varying but C three D ConvNet is not designed for multi-scale detection. Inspired by this method and previous similarity measurement method, we present a cascade framework, consisting of a targeted cut transition detector and a targeted gradual transition detector. The cut transition detector, for measuring the image similarity, is fast and accurate while the gradual transition detector is capable of capturing the temporal pattern of gradual transitions in multi-scale level. In addition, compared to deepSBD, our framework can locate both cut transitions and gradual transitions accurately.

In this work, we present a new cascade framework, a fast and accurate approach for shot boundary detection. The first stage applies a ridiculously fast method to initially filter the whole video and selects the candidate segments. This stage is for accelerating the framework (up to two times faster than not) and