Timeception for Complex Action Recognition
Noureldien Hussein, Efstratios Gavves, Arnold W. M. Smeulders. Timeception for Complex Action Recognition. CVPR, 2019
While studying, don't forget to always ask yourself why.
Preface:
I have only skimmed this paper once; I still need about two more days for a careful read (studying the details through the code and the experiments section). For now I have a rough grasp of the paper's goal, the purpose and reasoning behind the network architecture, and the structure of the writing.
The key point this paper offers me is how to pose a problem and then try to find a way to solve it.
The problem posed should normally be common and reasonable. As this paper notes, in everyday life an "action" is usually a long-duration action composed of multiple simple one-actions, yet no good method for handling such long actions had been proposed. (the entry point)
1. Problems and Solutions
After explaining the conceptual difference between the two kinds of actions, the paper frames the problems from three angles: the model, current datasets, and the difficulties of the task, and proposes a solution for each.
ordinary life: complex actions vs. one-actions
complex actions:
- composed of several one-actions
- large variations in temporal duration and temporal order
- takes much longer to unfold
one-actions:
- exhibit one visual pattern, possibly repetitive
- usually short in time, homogeneous in motion and coherent in form
1.1 problems
Model:
- Related works use spatiotemporal 3D convolutions with a fixed kernel size, which are too rigid to capture the varied temporal extents of complex actions and too short for long-range temporal modeling.
Data:
- the main focus has been the recognition of short-range actions, as in HMDB, UCF and Kinetics. Little attention has been paid to the recognition of long-range and complex actions.
Task:
- minute-long temporal modeling while maintaining attention to seconds-long details
- tolerating variations in temporal extent and temporal order of one-actions
1.2 Solutions
Model:
- use multi-scale temporal convolution, Timeception convolution layers
Data:
- use Charades, Breakfast Actions, MultiTHUMOS
Task:
present Timeception:
- learns long-range temporal dependencies with attention to short-range details (dedicated only to temporal modeling)
- it tolerates the differences in temporal extent of the one-actions comprising the complex action
2. Innovations
The paper summarizes three main innovations:
- introduce a convolutional temporal layer that effectively and efficiently learns minute-long action ranges of 1024 timesteps, a factor of 8 longer than the best related work
- introduce multi-scale temporal kernels to account for large variations in duration of action components
- use temporal-only convolutions, which are better suited for complex actions than their spatiotemporal counterparts
3. Background
For background, the paper discusses four related areas, ordered from older to newer in time and from less to more closely related. Below is my summary of the four parts:
Temporal Modeling
downside: neglecting temporal patterns
Video vs. image
temporal dimension
statistical pooling
neural methods
Short-range Action Recognition
with shallow motion features
too computationally expensive
deep appearance features
frame-level,2D
complement
evolve from 2D to 3D
Long-range Action Recognition
temporal pattern
learn video-wide representation
learns relations between several video segments
learns temporal structure
different temporal resolutions
self-attention
Convolution Decomposition
channel shuffling
to solve computational complexity
separable 2D convolution
separable 2+1D convolutions
1x1 2D convolution
3x3 2D spatial convolution
models cross-channel correlation
grouped convolutions
multi-scale 2D spatial kernels
4. Structures
4.1 inspiration:
spatiotemporal kernel is decomposable: $w \propto w_s \times w_t$
-> namely $\widetilde{w} = w_{\alpha} \times w_{\beta} \times w_{\gamma} \times \dots$
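As a back-of-the-envelope check of why this decomposition helps, the snippet below counts weights for a full k x k x k spatiotemporal kernel versus a spatial-only plus temporal-only pair. The channel and kernel sizes are hypothetical, chosen only to make the saving concrete:

```python
# Weight count for one full spatiotemporal kernel w versus the
# decomposed pair w_s (spatial only) and w_t (temporal only).
C_in = C_out = 64          # hypothetical channel counts
k = 3                      # hypothetical kernel side

full_3d  = C_in * C_out * k * k * k   # w: k x k x k over (t, h, w)
spatial  = C_in * C_out * k * k       # w_s: k x k over (h, w) only
temporal = C_in * C_out * k           # w_t: k over t only

print(full_3d, spatial + temporal)    # 110592 vs 49152, about 2.25x fewer weights
```

The saving grows with the kernel side k, which is why decomposed kernels leave room for longer temporal extents at the same budget.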
4.2 three intuitive design principles:
Subspace Modularity
Subspace Balance
Subspace Efficiency
4.3 layer structure
- dependency
- long-range
- temporal extent (a one-action's length can vary a lot)
The motivation behind this layer design should be to use decomposition to free up modeling capacity along the temporal dimension.
- The left part of the figure is the layer as a whole: a group convolution splits the input channel-wise into N groups while leaving the T dimension unchanged, which reduces the overall complexity.
- The Temporal Conv Module inside each group is a depth-wise convolution that acts only along the time axis (each channel is convolved only in its temporal direction).
- After the results are concatenated, a channel-wise shuffle lets the layer learn cross-group correlations (the groups being channel groups) while further reducing complexity.
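A minimal numpy sketch of that channel shuffle (the reshape, transpose, reshape trick popularized by ShuffleNet; the function name and shapes here are my own):

```python
import numpy as np

def channel_shuffle(x, groups):
    """x: (batch, channels, timesteps); mixes channels across groups with zero parameters."""
    b, c, t = x.shape
    x = x.reshape(b, groups, c // groups, t)   # split channels into groups
    x = x.transpose(0, 2, 1, 3)                # swap the group and per-group axes
    return x.reshape(b, c, t)                  # flatten back: groups are now interleaved

x = np.arange(8).reshape(1, 8, 1)              # 8 channels in 2 groups of 4
print(channel_shuffle(x, 2).ravel().tolist())  # → [0, 4, 1, 5, 2, 6, 3, 7]
```

Channels 0-3 and 4-7 end up interleaved, so the next grouped convolution sees channels from both groups.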
On the right is the structure of the Temporal Conv Module itself:
The overall layout resembles an Inception network; it is essentially an attempt at an Inception layer along the temporal dimension.
- It splits into two main stages. The first is the multi-scale kernel stage, whose main role should be to tolerate varying temporal extents through multi-scale convolutions along t.
- The second stage presumably reduces the channel dimension further, shrinking the final channel dimension to 5/M of the original.
The goals are roughly:
- to learn from long videos: use decomposition to gain modeling capacity along t
- to tolerate varying temporal extents: use multi-scale kernels
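A rough numpy sketch of that two-stage idea, with random weights and sizes of my own choosing (not the paper's exact configuration): parallel depthwise temporal convolutions with different kernel sizes, concatenation, then a 1x1 convolution over channels to bring the dimension back down:

```python
import numpy as np

rng = np.random.default_rng(0)

def depthwise_temporal_conv(x, k):
    """x: (C, T). Each channel is convolved only along time, with 'same' padding."""
    w = rng.standard_normal((x.shape[0], k))
    return np.stack([np.convolve(x[c], w[c], mode="same") for c in range(x.shape[0])])

C, T = 32, 128
x = rng.standard_normal((C, T))

# Stage 1: multi-scale temporal kernels run in parallel to tolerate
# different temporal extents; their outputs are concatenated channel-wise.
branches = [depthwise_temporal_conv(x, k) for k in (3, 5, 7)]
y = np.concatenate(branches, axis=0)          # (3*C, T)

# Stage 2: a 1x1 convolution over channels reduces the dimension back down.
w_reduce = rng.standard_normal((C, 3 * C))
out = w_reduce @ y                            # (C, T)
print(out.shape)                              # (32, 128)
```

Note the convolutions never touch a spatial axis: every weight is spent on the temporal dimension, which is the point of the temporal-only design.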
The final model consists of four Timeception layers stacked on top of the last convolution layer of a CNN.
5. Code
github
This part should be studied together with the experiments section.
6. What’s behind
- Whether I could use the neural network/pipeline?
- Whether I could learn the way to propose an idea?
- Whether I could learn the way to solve a problem?
- Whether I could learn the writing skill?
The idea of thinking up a task --> finding a solution, and the writing skill
6.1 Aim
intuitively feasible
I don't think the architecture succeeded in one shot; there must have been a process of repeated trial and error guided by inference.
Those inferences were likely grounded in earlier papers; the most important new attempt on top of them was extending similar existing structures to the T dimension.
Goal: use decomposition to gain modeling capacity along the temporal dimension.
- to learn from long videos: use decomposition to gain modeling capacity along t
- to tolerate varying temporal extents: use multi-scale kernels
6.2 Writing style:
6.3 Idea
- There are not many mathematical formulas to support the idea.
Try to use plain explanation to better express what you are doing.