Video Mamba: State Space Model for Efficient Video Understanding

Problems:

1. Local redundancy

the large spatiotemporal redundancy within short video clips

2. Global dependencies

the complex spatiotemporal dependencies among long contexts.

(CNNs suffer from problem 2, while ViTs suffer from problem 1.)

Contributions:

1. Sensitivity for recognizing short-term actions, even with fine-grained motion differences

Sensitive to changes, even when the changes are small.

More importantly, it is also suitable for masked modeling, which further enhances its temporal sensitivity.

2. Superiority in long-term video understanding

Handles long-range dependencies (an inherent strength of Mamba).

3. Scalability in the visual domain without extensive dataset pretraining, thanks to a novel self-distillation technique

To counteract overfitting, a self-distillation strategy is used: a smaller, well-trained model serves as the "teacher" to guide the training of the larger "student" model.
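A minimal PyTorch sketch of how I understand this self-distillation loss (the model interfaces, the MSE choice, and the `alpha` weight are my assumptions, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_feats, teacher_feats):
    """Align the student's final feature map with the frozen teacher's.

    Both tensors are assumed to be (batch, num_tokens, dim); if the smaller
    teacher uses a narrower dim, a linear projection on the student side
    would be needed first (not shown).
    """
    return F.mse_loss(student_feats, teacher_feats.detach())

def training_step(student, teacher, frames, labels, task_criterion, alpha=1.0):
    # Hypothetical model interface: forward returns (logits, final_features).
    logits, s_feats = student(frames)
    with torch.no_grad():
        _, t_feats = teacher(frames)   # smaller, already-trained, frozen teacher
    return task_criterion(logits, labels) + alpha * self_distillation_loss(s_feats, t_feats)
```

Since the teacher is frozen, this only adds one extra forward pass per training step.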

4. Compatibility with other modalities

(Modalities are things like audio, text, and video; multimodal tasks include, e.g., video-to-text, speech-to-text, and text-to-speech.)

To augment VideoMamba's temporal sensitivity and verify its adaptability with text modalities, we adopt a masked alignment approach inspired by UMT.

First, VideoMamba is trained from scratch on video data alone, aligning unmasked tokens with those from CLIP-ViT. Subsequently, it is integrated with a text encoder and a cross-modal decoder for pretraining on both image-text and video-text datasets.
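A rough sketch of the stage-one masked alignment, under simplifying assumptions (the `keep_idx` masking interface, the projection head, and the cosine loss are placeholders of mine; UMT's exact objective and the stage-two text encoder / cross-modal decoder are not shown):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedAlignment(nn.Module):
    """Stage 1: regress VideoMamba's unmasked-token features onto CLIP-ViT's."""

    def __init__(self, videomamba, clip_vit, dim_student, dim_teacher):
        super().__init__()
        self.student = videomamba            # trained from scratch on video
        self.teacher = clip_vit              # frozen CLIP-ViT
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(dim_student, dim_teacher)

    def forward(self, frames, keep_idx):
        # keep_idx: (B, N_keep) indices of the unmasked tokens.
        s_tok = self.student(frames, keep_idx)        # (B, N_keep, Ds), hypothetical signature
        with torch.no_grad():
            t_tok = self.teacher(frames)              # (B, N, Dt), all tokens
            t_tok = torch.gather(
                t_tok, 1,
                keep_idx.unsqueeze(-1).expand(-1, -1, t_tok.size(-1)))
        s_tok = F.normalize(self.proj(s_tok), dim=-1)
        t_tok = F.normalize(t_tok, dim=-1)
        return (1 - (s_tok * t_tok).sum(-1)).mean()   # cosine alignment loss
```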

Overall framework of this paper:

(Adds a spatial position embedding p_s and a temporal position embedding p_t.)

[Figure: VideoMamba overall framework]
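To make the diagram concrete, here is a sketch of the input tokenization as I read it: 3D patch embedding, a prepended [CLS] token, then spatial p_s plus temporal p_t (Vision Mamba, by contrast, uses a single E_pos). The exact [CLS] handling and broadcasting below are my guesses, not the official code.

```python
import torch
import torch.nn as nn

class VideoMambaEmbed(nn.Module):
    """Sketch: 3D patch embedding + [CLS] token + spatial p_s + temporal p_t."""

    def __init__(self, dim=192, patch=16, img=224, frames=16):
        super().__init__()
        self.proj = nn.Conv3d(3, dim, kernel_size=(1, patch, patch),
                              stride=(1, patch, patch))
        n_sp = (img // patch) ** 2                                  # tokens per frame
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_s = nn.Parameter(torch.zeros(1, n_sp + 1, dim))    # p_s (incl. cls slot)
        self.pos_t = nn.Parameter(torch.zeros(1, frames, dim))      # p_t, one per frame

    def forward(self, x):                     # x: (B, 3, T, H, W)
        x = self.proj(x)                      # (B, D, T, H/patch, W/patch)
        B, D, T, H, W = x.shape
        x = x.permute(0, 2, 3, 4, 1).reshape(B, T, H * W, D)
        x = x + self.pos_s[:, 1:].unsqueeze(1)    # spatial pos, shared across frames
        x = x + self.pos_t[:, :T].unsqueeze(2)    # temporal pos, shared across space
        x = x.reshape(B, T * H * W, D)
        cls = self.cls.expand(B, -1, -1) + self.pos_s[:, :1]
        return torch.cat([cls, x], dim=1)     # (B, 1 + T*H*W, D)
```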

Overall framework of Vision Mamba (Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model):

(Adds a position embedding E_pos.)

[Figure: Vision Mamba overall framework]

My questions:

1. Unlike VMamba, which incorporates additional depthwise convolution, VideoMamba strictly follows the ViT design without downsampling layers.

So, as a possible improvement, could depthwise convolution be added on top of this paper to downsample the tokens and reduce computation?
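Purely as a thought experiment for this question (nothing below is from the paper), such a stage-transition block might be a strided depthwise convolution followed by a pointwise one, halving the spatial token grid before the next group of Mamba layers:

```python
import torch
import torch.nn as nn

class DWConvDownsample(nn.Module):
    """Hypothetical stage transition: a stride-2 depthwise conv halves the spatial
    token grid, and a pointwise conv changes the channel width, so later Mamba
    stages see 4x fewer spatial tokens. Not part of VideoMamba."""

    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.dw = nn.Conv2d(dim_in, dim_in, kernel_size=3, stride=2,
                            padding=1, groups=dim_in)      # depthwise, stride 2
        self.pw = nn.Conv2d(dim_in, dim_out, kernel_size=1)  # pointwise channel mix
        self.norm = nn.LayerNorm(dim_out)

    def forward(self, x, T, H, W):            # x: (B, T*H*W, C), no [CLS] token here
        B, L, C = x.shape
        x = x.reshape(B * T, H, W, C).permute(0, 3, 1, 2)   # (B*T, C, H, W)
        x = self.pw(self.dw(x))                              # (B*T, C', H/2, W/2)
        _, C2, H2, W2 = x.shape
        x = x.permute(0, 2, 3, 1).reshape(B, T * H2 * W2, C2)
        return self.norm(x), H2, W2
```

Note that this would turn the deliberately plain, isotropic ViT-style design into a hierarchical one, and the [CLS] token and the p_s / p_t embeddings would have to be re-handled at every stage.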
