End-to-End Bloody Video Recognition by Audio-Visual Feature Fusion
With the rapid development of Internet technology, the spread of bloody video has become an increasingly serious problem, causing great harm to society. In this paper, a bloody video recognition method based on audio-visual feature fusion is proposed to overcome the limitations of single-modality visual methods. In the absence of publicly available bloody video data, this paper first constructed a bloody video database through web crawling and data augmentation; it then used CNN and LSTM models to extract spatiotemporal features from the visual channel. Meanwhile, audio-channel features were extracted directly from the raw waveforms using a 1D convolutional network. Finally, a neural network with an audio-visual feature fusion layer was constructed to achieve early fusion of the multimodal cues. The proposed method achieves 95% accuracy on the bloody video test data. Experimental results on the self-built bloody video database demonstrate that the extracted audio-visual feature representations are effective and that the proposed multimodal fusion model achieves better and more discriminative recognition performance than single-channel models.
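The following is a minimal PyTorch sketch of the architecture described in the abstract, not the authors' exact model: the per-frame CNN backbone, layer widths, frame count, audio sample length, and fusion/classifier sizes are illustrative assumptions, since the abstract does not specify them. It shows a CNN+LSTM visual branch, a 1D-convolutional audio branch operating on the raw waveform, and early fusion by feature concatenation.

import torch
import torch.nn as nn


class VisualBranch(nn.Module):
    """Per-frame CNN features followed by an LSTM over the frame sequence."""

    def __init__(self, feat_dim=128, hidden_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(                      # small per-frame CNN (assumed sizes)
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frames):                         # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1))             # (B*T, feat_dim)
        x = x.view(b, t, -1)
        _, (h, _) = self.lstm(x)                       # last hidden state as the clip feature
        return h[-1]                                   # (B, hidden_dim)


class AudioBranch(nn.Module):
    """1D convolutions applied directly to the raw waveform."""

    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=32, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, out_dim), nn.ReLU(),
        )

    def forward(self, wave):                           # wave: (B, 1, num_samples)
        return self.net(wave)                          # (B, out_dim)


class AudioVisualFusionNet(nn.Module):
    """Early fusion: concatenate audio and visual features, then classify."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.visual = VisualBranch()
        self.audio = AudioBranch()
        self.classifier = nn.Sequential(
            nn.Linear(128 + 128, 64), nn.ReLU(),
            nn.Linear(64, num_classes),                # bloody vs. non-bloody
        )

    def forward(self, frames, wave):
        fused = torch.cat([self.visual(frames), self.audio(wave)], dim=1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = AudioVisualFusionNet()
    frames = torch.randn(2, 16, 3, 112, 112)           # 16 frames per clip (assumed)
    wave = torch.randn(2, 1, 16000)                    # 1 s of 16 kHz audio (assumed)
    print(model(frames, wave).shape)                   # torch.Size([2, 2])

Concatenation before the classifier corresponds to the early-fusion layer mentioned in the abstract; either single-channel branch can be trained alone for the baseline comparison.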
Bloody video recognition; Feature extraction; Multimodal fusion
Congcong Hou, Xiaoyu Wu, Ge Wang
Communication University of China, Beijing, China; Columbia School of Engineering and Applied Science, Computer Science, Columbia University, New York, USA
International conference
Guangzhou
English
501-510
2018-11-23 (date first posted on the Wanfang platform; not the paper's publication date)