Multi-level Three-Stream Convolutional Networks for Video-Based Action Recognition
Deep convolutional neural networks (ConvNets) have shown remarkable capability for visual feature learning and representation. In the field of video-based action recognition, much progress has been made with the development of ConvNets. However, mainstream ConvNets used for video-based action recognition, such as two-stream ConvNets and 3D ConvNets, still lack the ability to represent fine-grained features. In this paper, we propose a novel architecture named the multi-level three-stream convolutional network (MLTSN), which contains three streams, i.e., the spatial stream, the temporal stream, and the multi-level correlation stream (MLCS). The MLCS contains several correlation modules, which fuse appearance and motion features at the same levels and obtain spatial-temporal correlation maps. The correlation maps are further fed into several convolution layers to obtain refined features. The whole network is trained in a multi-step manner. Extensive experimental results show that the performance of the proposed network is competitive with state-of-the-art methods on HMDB51 and UCF101.
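The abstract describes correlation modules that fuse same-level appearance and motion features into spatial-temporal correlation maps. The exact fusion operation is not specified here, so the following is only a minimal NumPy sketch under the assumption of a channel-wise product fusion; the function name `correlation_map` and the normalization are illustrative choices, not the paper's method.

```python
import numpy as np

def correlation_map(appearance, motion):
    """Fuse same-level appearance and motion feature maps (C, H, W)
    into a single spatial-temporal correlation map (H, W).

    Assumed element-wise form: the paper's actual correlation
    operation may differ.
    """
    assert appearance.shape == motion.shape  # both (C, H, W)
    # Channel-wise product, summed over channels -> (H, W) map
    corr = (appearance * motion).sum(axis=0)
    # Scale by sqrt(C) to keep activations in a stable range
    return corr / np.sqrt(appearance.shape[0])

# Toy example: fuse random 64-channel, 7x7 feature maps
rng = np.random.default_rng(0)
app = rng.standard_normal((64, 7, 7))
mot = rng.standard_normal((64, 7, 7))
corr = correlation_map(app, mot)
print(corr.shape)  # (7, 7)
```

In the full architecture, such a map would then be passed through additional convolution layers to produce the refined features mentioned in the abstract.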
Action recognition · Convolutional networks · Multi-level correlation mechanism
Yijing Lv Huicheng Zheng Wei Zhang
School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, Guangzhou, China; Guangdong Key Laboratory of Information Security Technology, 135 West Xingang Road, Guangzhou 510275, China
International conference
Guangzhou
English
237-249
2018-11-23 (date first posted on the Wanfang platform; does not represent the publication date of the paper)