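CVPR 2024 Video Processing Roundup (Video Surveillance, Video Understanding, Video Recognition, Video Prediction, and More)
1. Video Processing Roundup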
- Learning from One Continuous Video Stream
- Deep Video Inverse Tone Mapping Based on Temporal Clues
- VTimeLLM: Empower LLM to Grasp Video Moments
- Combining Frame and GOP Embeddings for Neural Video Representation
- Learning to Predict Activity Progress by Self-Supervised Video Alignment
- CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
- vid-TLDR: Training Free Token Merging for Light-weight Video Transformer
⭐code
- Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video
⭐code
- Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement
- Understanding Video Transformers via Universal Concept Discovery
- Video Recognition in Portrait Mode
project
- VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams
project
- Just Add π! Pose Induced Video Transformers for Understanding Activities of Daily Living
⭐code
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
- Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens
project
- Towards HDR and HFR Video from Rolling-Mixed-Bit Spikings
- Physics-guided Shape-from-Template: Monocular Video Perception through Neural Surrogate Models
- Sleep Monitoring
- SleepVST: Sleep Staging from Near-Infrared Video Signals using Pre-Trained Transformers
- Video Understanding
- Compositional Video Understanding with Spatiotemporal Structure-based Transformers
- Action Scene Graphs for Long-Form Understanding of Egocentric Videos
- HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
- A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives
project
- Koala: Key Frame-Conditioned Long Video-LLM
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
⭐code
- Abductive Ego-View Accident Video Understanding for Safe Driving Perception
project
- OmniVid: A Generative Framework for Universal Video Understanding
⭐code
- A Unified Framework for Human-centric Point Cloud Video Understanding
- Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection
- MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
project
- TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
⭐code
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
⭐code
- Video Summarization
- Previously on ... From Recaps to Story Summarization
project
- Scaling Up Video Summarization Pretraining with Large Language Models
- CSTA: CNN-based Spatiotemporal Attention for Video Summarization
⭐code
- Video Reconstruction
- HDRFlow: Real-Time HDR Video Reconstruction with Large Motions
⭐code
- Video Representation
- DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes
project
- Video Interpretation
- Visual Objectification in Films: Towards a New AI Task for Video Interpretation
- Movie Description
- MICap: A Unified Model for Identity-Aware Movie Descriptions
project
- Video Surveillance
- Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges
dataset
- Video Prediction
- Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes
- ExtDM: Distribution Extrapolation Diffusion Model for Video Prediction
⭐code
project
- Video Stabilization
- Harnessing Meta-Learning for Improving Full-Frame Video Stabilization
- 3D Multi-frame Fusion for Video Stabilization
- Video Recognition
- OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition
⭐code
project
- Video Conversation
- BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
⭐code
- Video Relighting
- Real-time 3D-aware Portrait Video Relighting
- Video Harmonization
- Video Harmonization with Triplet Spatio-Temporal Variation Patterns
VILP
- Video Frame Interpolation
- Video Frame Interpolation via Direct Synthesis with the Event-based Reference
- IQ-VFI: Implicit Quadratic Motion Estimation for Video Frame Interpolation
- EVS-assisted Joint Deblurring, Rolling-Shutter Correction and Video Frame Interpolation through Sensor Inverse Modeling
- TTA-EVF: Test-Time Adaptation for Event-based Video Frame Interpolation via Reliable Pixel and Sample Estimation
- Sparse Global Matching for Video Frame Interpolation with Large Motion
⭐code
- Perception-Oriented Video Frame Interpolation via Asymmetric Blending
⭐code
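A breakthrough in visual quality for video frame interpolation: Shanghai Jiao Tong University proposes PerVFI, a new paradigm for video frame interpolation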
- SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation
project
- Video Subject Swapping
- VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
project
- Video Anomaly Detection
- Open-Vocabulary Video Anomaly Detection
- Multi-Scale Video Anomaly Detection by Multi-Grained Spatio-Temporal Representation Learning
- Harnessing Large Language Models for Training-free Video Anomaly Detection
⭐code
- Collaborative Learning of Anomalies with Privacy (CLAP) for Unsupervised Video Anomaly Detection: A New Baseline
⭐code
- Prompt-Enhanced Multiple Instance Learning for Weakly Supervised Video Anomaly Detection
- MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection
- PREGO: Online Mistake Detection in PRocedural EGOcentric Videos
- Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors
⭐code
- Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection
- GlitchBench: Can Large Multimodal Models Detect Video Game Glitches?
project
- Video Scene Detection
- Neighbor Relations Matter in Video Scene Detection
- Video Mirror Detection
- Effective Video Mirror Detection with Inconsistent Motion Cues
- Automated Movie Trailer Generation
- Towards Automated Movie Trailer Generation
- Conversational Music Recommendation for Videos
- MuseChat: A Conversational Music Recommendation System for Videos
- Video Paragraph Grounding
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding
- Video Grounding
- SnAG: Scalable and Accurate Video Grounding
⭐code
- Context-Guided Spatio-Temporal Video Grounding
⭐code
- Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding
- What, When, and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions