Top Authors

Yu-Gang Jiang, Zuxuan Wu: Fudan University

Lorenzo Torresani, Heng Wang, Matt Feiszli: FAIR

Luc Van Gool: Head of the Toyota TRACE Lab

Yi Yang: University of Technology Sydney

Kiyoharu Aizawa: University of Tokyo

Ali Diba: KU Leuven

Abdenour Hadid: Center for Machine Vision and Signal Analysis (CMVS), University of Oulu

Applying Transformers to video classification:

  • Video Swin Transformer
  • ViViT: A Video Vision Transformer
  • Is Space-Time Attention All You Need for Video Understanding? (TimeSformer; a sketch of its divided space-time attention follows this list)
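
As a concrete reference for the divided space-time attention used by TimeSformer-style models, below is a minimal PyTorch sketch (my own simplification, not the paper's code): temporal attention across frames, then spatial attention across patches. The class token is omitted and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Illustrative TimeSformer-style block: temporal attention over frames,
    then spatial attention over patches, each with a residual connection.
    (Simplified: no class token, assumed layer sizes.)"""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x):                      # x: (B, T, N, D) patch tokens
        B, T, N, D = x.shape
        # Temporal attention: each spatial location attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt_n = self.norm_t(xt)
        xt = xt + self.attn_t(xt_n, xt_n, xt_n, need_weights=False)[0]
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial attention: each frame attends across its patches.
        xs = x.reshape(B * T, N, D)
        xs_n = self.norm_s(xs)
        xs = xs + self.attn_s(xs_n, xs_n, xs_n, need_weights=False)[0]
        x = xs.reshape(B, T, N, D)
        # Standard Transformer MLP.
        return x + self.mlp(self.norm_m(x))

# Example: 2 clips, 4 frames, 7x7=49 patch tokens, 768-dim embeddings.
tokens = torch.randn(2, 4, 49, 768)
out = DividedSpaceTimeBlock()(tokens)          # (2, 4, 49, 768)
```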

Selecting features or frames to speed up model inference:

  • Adaptive Focus for Efficient Video Recognition
  • No Frame Left Behind: Full Video Action Recognition (iteratively clusters the 2D frame features by similarity to compress the number of frames; see the sketch after this list)
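
To illustrate the general idea of compressing a video by clustering similar 2D frame features (this is a hedged sketch of the idea, not the paper's exact procedure), here is a small example using scikit-learn's agglomerative clustering; the function name and the chosen cluster count are illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def compress_frame_features(frame_feats: np.ndarray, n_keep: int) -> np.ndarray:
    """Group similar per-frame features and average each group, reducing
    T frame features to n_keep aggregated features (illustrative only)."""
    T = frame_feats.shape[0]
    if T <= n_keep:
        return frame_feats
    # Cluster frames by feature similarity (L2-normalized features).
    normed = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    labels = AgglomerativeClustering(n_clusters=n_keep).fit_predict(normed)
    # Average the features inside each cluster; keep rough temporal order
    # by sorting clusters by the index of their earliest frame.
    clusters = sorted(range(n_keep), key=lambda c: np.where(labels == c)[0][0])
    return np.stack([frame_feats[labels == c].mean(axis=0) for c in clusters])

# Example: 64 frames of 2048-d CNN features compressed to 16 segments.
feats = np.random.randn(64, 2048).astype(np.float32)
compressed = compress_frame_features(feats, n_keep=16)   # shape (16, 2048)
```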

Temporal modeling in action recognition:

  • SlowFast Networks for Video Recognition (two-branch design: one branch learns spatial structure, the other learns temporal dynamics)

  • TSM: Temporal Shift Module for Efficient Video Understanding (shifts a fraction of the feature channels along the temporal dimension so that 2D convolutions can capture temporal information; see the sketch after this list)

  • Dynamic Image Networks for Action Recognition (2015; video classification with an LSTM plus optical flow)

  • Rank Pooling for Action Recognition (2016; trains a separate model to learn temporal information)

  • Video Action Transformer Network (applies self-attention to extracting temporal information)

  • Directional Temporal Modeling for Action Recognition (learns clip-level temporal information inside 3D convolutions)

  • Efficient Temporal-Spatial Feature Grouping for Video Action Recognition (redesigns spatio-temporally decoupled convolutions for feature learning)

  • TEA: Temporal Excitation and Aggregation for Action Recognition (designs a TEA block that models short-range and long-range temporal information separately)

  • Temporal Pyramid Network for Action Recognition (a feature pyramid along the temporal dimension)

Gate-Shift Networks for Video Action Recognition: C3D, S3D, GST, CSN, TSM, and GSM all differ in how the temporal dimension of 3D convolutions is handled.
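
As a concrete illustration of the temporal shift described for TSM, below is a minimal PyTorch sketch. The shifted fraction (1/8 of the channels in each direction) follows the commonly reported setting and is an assumption here, not taken from this document.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """Shift a fraction of channels along the temporal axis (TSM-style).
    x: (N, T, C, H, W) clip features; returns a tensor of the same shape."""
    n, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # first fold: take features from the next frame
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # second fold: take features from the previous frame
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels are left unshifted
    return out

# Example: batch of 2 clips, 8 frames, 64 channels, 56x56 feature maps.
clip = torch.randn(2, 8, 64, 56, 56)
shifted = temporal_shift(clip)                             # same shape, temporally mixed channels
```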

Applying multimodal fusion to video classification:

  • Towards Good Practices for Multi-modal Fusion in Large-scale Video Classification (introduces multimodal bilinear pooling to fuse video and audio information)

  • Multi-label video classification assisted by danmaku comments (in Chinese; explores fusing danmaku information for classifying videos and builds a dataset collected from bilibili, to be released later)

  • Short-video classification based on deep multimodal feature fusion (in Chinese; introduces a similarity loss and a difference loss to exploit the similarity across modalities and the differences within each modality to assist classification)

  • Multimodal Keyless Attention Fusion for Video Classification (proposes a keyless attention method for feature fusion; see the sketch after this list)

  • Deep Multimodal Learning: An Effective Method for Video Classification (compares the performance of several commonly used recurrent networks for feature fusion)

  • A Deep Learning Based Video Classification System Using Multimodality Correlation Approach (Pearson correlation integration)

  • Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification (multi-stream multi-class fusion; improves prediction by learning class relationships)

  • Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification (a feature fusion network that models feature relationships to produce a fused representation, outperforming a large set of alternative fusion strategies)

  • Residual Attention-based Fusion for Video Classification (stacks BiLSTM and attention to extract spatio-temporal features)

  • Multimodal Video Classification with Stacked Contractive Autoencoders (uses stacked contractive autoencoders to extract complementary information across modalities)
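
To make the "keyless" attention idea concrete, here is a minimal PyTorch sketch of query-free attention pooling per modality followed by late fusion. The module and class names are my own and the feature dimensions are illustrative, so treat this as a sketch of the idea rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class KeylessAttentionPool(nn.Module):
    """Query-free attention pooling: the scores come from the features alone."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, x):                           # x: (B, T, D) feature sequence
        weights = torch.softmax(self.score(x), dim=1)   # (B, T, 1)
        return (weights * x).sum(dim=1)                 # (B, D)

class LateFusionClassifier(nn.Module):
    """Pool each modality with keyless attention, then concatenate and classify."""
    def __init__(self, dims, num_classes):
        super().__init__()
        self.pools = nn.ModuleList([KeylessAttentionPool(d) for d in dims])
        self.fc = nn.Linear(sum(dims), num_classes)

    def forward(self, modalities):                  # list of (B, T_m, D_m) tensors
        pooled = [pool(x) for pool, x in zip(self.pools, modalities)]
        return self.fc(torch.cat(pooled, dim=1))

# Example: RGB (2048-d) and audio (128-d) sequences fused for 101 classes.
rgb, audio = torch.randn(4, 32, 2048), torch.randn(4, 50, 128)
logits = LateFusionClassifier([2048, 128], 101)([rgb, audio])   # (4, 101)
```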

MMAction2 is an open-source, PyTorch-based toolbox for video understanding and is part of the OpenMMLab project.
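
A minimal inference sketch using MMAction2's high-level API (`init_recognizer` / `inference_recognizer`). The config and checkpoint paths are placeholders, and the exact return format of `inference_recognizer` differs between MMAction2 releases, so check the documentation of the version you install.

```python
# Minimal MMAction2 inference sketch. The API names below come from the
# MMAction2 docs, but return types differ between the 0.x and 1.x releases,
# so treat this as a starting point rather than a drop-in script.
from mmaction.apis import init_recognizer, inference_recognizer

# Placeholder paths: substitute a real config/checkpoint from the model zoo.
config_file = 'path/to/recognizer_config.py'
checkpoint_file = 'path/to/recognizer_checkpoint.pth'

model = init_recognizer(config_file, checkpoint_file, device='cpu')  # or 'cuda:0'
result = inference_recognizer(model, 'path/to/demo_video.mp4')       # one video's prediction
print(result)
```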

Datasets:

| Year | Dataset | Paper | Classes | Videos |
|------|---------|-------|---------|--------|
| 2004 | KTH | Recognizing human actions: a local SVM approach | 6 | 600 |
| 2005 | Weizmann | Actions as space-time shapes | 9 | 81 |
| 2008 | | Action MACH: a spatio-temporal maximum average correlation height filter for action recognition | | |
| 2011 | HMDB | HMDB: a large video database for human motion recognition | 51 | 6,766 |
| 2012 | UCF101 | UCF101: a dataset of 101 human actions classes from videos in the wild | 101 | 13,320 |
| 2013 | | Towards understanding action recognition | | |
| 2014 | JIGSAWS | JHU-ISI gesture and skill assessment working set (JIGSAWS): a surgical activity dataset for human motion modeling | | |
| 2014 | | The language of actions: recovering the syntax and semantics of goal-directed human activities | | |
| 2015 | ActivityNet | ActivityNet: a large-scale video benchmark for human activity understanding | 200 | 28K |
| 2015 | THUMOS | THUMOS challenge: action recognition with a large number of classes | | |
| 2016 | | Hollywood in homes: crowdsourcing data collection for activity understanding | | |
| 2016 | | Human action localization with sparse spatial supervision | | |
| 2016 | | Spot on: action localization from pointly-supervised proposals | | |
| 2016 | | Recognizing fine-grained and composite activities using hand-centric features and script data | | |
| 2017 | Kinetics | Quo vadis, action recognition? A new model and the Kinetics dataset | | |
| 2017 | Something-Something | The "something something" video database for learning and evaluating visual common sense | 174 | 108.5K/220.8K |
| 2018 | | Every moment counts: dense detailed labeling of actions in complex videos | | |
| 2018 | | What do I annotate next? An empirical study of active learning for action localization | | |
| 2018 | AVA | AVA: a video dataset of spatio-temporally localized atomic visual actions | | |
| 2018 | EPIC-KITCHENS | Scaling egocentric vision: the EPIC-KITCHENS dataset | | |
| 2019 | Moments in Time | Moments in Time dataset: one million videos for event understanding | | |
| 2019 | HACS | HACS: human action clips and segments dataset for recognition and temporal localization | | |
| 2019 | Diving48 | RESOUND: towards action recognition without representation bias | | |
| 2019 | Jester | The Jester dataset: a large-scale video dataset of human gestures | | |
| 2020 | FineGym | FineGym: a hierarchical video dataset for fine-grained action understanding | | |
| 2020 | OmniSource | Omni-sourced webly-supervised learning for video recognition | | |
| 2020 | HVU | Large Scale Holistic Video Understanding | 739 | 572K |
  • UCF101 (homepage) (CRCV-IR-12-01) (Soomro, Roshan Zamir, and Shah 2012) is a trimmed video dataset consisting of realistic web videos with diverse forms of camera motion and illumination. It contains 13,320 video clips with an average length of 180 frames per clip. These are labeled with 101 action classes, ranging from daily life activities to unusual sports. Each video clip is assigned a single class label. Following the original evaluation scheme, we report the average accuracy over three training/testing splits.
  • ActivityNet (homepage) (CVPR'2015) (Heilbron et al. 2015) is an untrimmed video dataset. We use the ActivityNet v1.3 release, which consists of more than 648 hours of untrimmed video from around 20K videos with 1.5 annotations per video, selected from 200 classes. Videos can contain more than one activity, and large time segments of a video are typically not related to any activity of interest. In the official split, the distribution among training, validation, and test data is about 50%, 25%, and 25% of the total videos, respectively. Because the annotations for the testing split have not been published, we report experimental results on the validation split.
  • Kinetics-400/600/700 (homepage) (CVPR'2017) (Carreira and Zisserman 2017) is a trimmed video dataset. It contains 246,535 training videos, 19,907 validation videos, and 38,685 test videos covering 400 human action classes. Each clip lasts around 10 s and is labeled with a single class. The annotations for the test split have not been released, so we report experimental results on the validation split.
  • YouTube-8M (Abu-El-Haija et al. 2016) is a massively large untrimmed video dataset. It contains over 1.9 billion video frames and 8 million videos. Each video can be annotated with multiple tags. Visual and audio features have been pre-extracted and are provided with the dataset for each second of video. The visual features were obtained via a Google Inception CNN pre-trained on ImageNet (Deng et al. 2009), followed by PCA-based compression into a 1024-dimensional vector. The audio features were extracted via a pre-trained VGG-inspired (Simonyan and Zisserman 2014a) network. In the official split, the distribution among training, validation, and test data is about 70%, 20%, and 10%, respectively. As the annotations of the test split have not been released to the public and the validation set is overly large, we keep 60K videos from the official validation set to validate the parameters; the other validation videos are included in the training set. We report experimental results on this held-out validation set.
  • Moments in Time (homepage) (TPAMI'2019) consists of 800,000 three-second YouTube clips that capture the gist of a dynamic scene involving animals, objects, people, or natural phenomena.
  • Something-Something v2 (SSv2) [26] contains 220,000 videos with durations ranging from 2 to 6 seconds. In contrast to the other datasets, the objects and backgrounds in the videos are consistent across different action classes, so this dataset places more emphasis on a model's ability to recognise fine-grained motion cues.
  • EPIC-KITCHENS-100 consists of egocentric videos capturing daily kitchen activities, spanning 100 hours and 90,000 clips. We report results following the standard "action recognition" protocol: each video is labelled with a "verb" and a "noun", and both categories are predicted by a single network with two "heads" (see the sketch after this list). The top-scoring verb and noun pair predicted by the network forms an "action", and action accuracy is the primary metric.
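
To make the two-head protocol concrete, here is a minimal PyTorch sketch of a shared clip feature feeding separate verb and noun classifiers. The feature dimension and the class counts (97 verbs, 300 nouns, as commonly reported for EPIC-KITCHENS-100) are assumptions, not taken from this document.

```python
import torch
import torch.nn as nn

class VerbNounHead(nn.Module):
    """Shared clip feature -> separate verb and noun classifiers.
    Class counts (97 verbs, 300 nouns) are assumed here; replace them to
    match the dataset release you actually use."""
    def __init__(self, feat_dim=2048, num_verbs=97, num_nouns=300):
        super().__init__()
        self.verb_head = nn.Linear(feat_dim, num_verbs)
        self.noun_head = nn.Linear(feat_dim, num_nouns)

    def forward(self, feat):                          # feat: (B, feat_dim)
        return self.verb_head(feat), self.noun_head(feat)

# The "action" prediction pairs the top-scoring verb with the top-scoring noun.
feat = torch.randn(4, 2048)
verb_logits, noun_logits = VerbNounHead()(feat)
action = (verb_logits.argmax(1), noun_logits.argmax(1))   # (verb id, noun id) per clip
```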

Related Competitions

2021 Tencent Advertising Algorithm Competition

Task 1: Second-level semantic parsing of video ads

Task 2: Multimodal tagging of video ads

Both tasks take three modalities as input: video, audio, and text; participants build models to understand the advertisements.

In Task 1, given a test video, the algorithm temporally segments the video into "scenes" and predicts, for each segment, labels along three dimensions: presentation form, scene, and style. Scoring uses Mean Average Precision (mAP).

In Task 2, given a test video, the algorithm predicts labels for the whole video along the same three dimensions (presentation form, scene, and style). Scoring uses Global Average Precision (GAP); a sketch of the metric follows.
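
For reference, here is a sketch of Global Average Precision following the YouTube-8M-style definition, which the competition metric is assumed to match; the per-video top-k cutoff applied before pooling predictions is competition-specific and not shown.

```python
import numpy as np

def global_average_precision(predictions, num_positives):
    """predictions: list of (confidence, is_correct) pairs over all (video, label)
    predictions kept for scoring (e.g. each video's top-k labels);
    num_positives: total number of ground-truth labels across all videos."""
    preds = sorted(predictions, key=lambda p: p[0], reverse=True)
    gap, correct = 0.0, 0
    for rank, (_, is_correct) in enumerate(preds, start=1):
        if is_correct:
            correct += 1
            gap += correct / rank          # precision at this recall point
    return gap / num_positives

# Toy example: 3 pooled predictions, 2 ground-truth labels in total.
preds = [(0.9, True), (0.8, False), (0.5, True)]
print(global_average_precision(preds, num_positives=2))   # ~0.833
```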

No open-source implementation could be found for Task 1.

Open-source implementations for Task 2:

10th place

Introduces NeXtVLAD to aggregate features and uses several pretrained ViT models to extract features.

6th place

Fuses the different modalities with a Bi-Modal Transformer, aggregates the fused features once more with NeXtVLAD, and uses the resulting features for prediction.

27th place

Baseline

The Darwin team's solution (no code released, but the approach is simple):

The extracted features of each modality are fed into their own NeXtVLAD modules; the only difference between modalities is the dropout rate, which is used to adjust each modality's contribution: dropout is 0.95 for the video and audio features and 0.85 for the text features. A minimal sketch of this per-modality dropout setup follows.
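
A minimal PyTorch sketch of the per-modality dropout setup described above; the aggregation module stands in for NeXtVLAD (not reproduced here), and the feature dimensions and tag count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    """Dropout with a modality-specific rate, then an aggregation module.
    `aggregator` is a placeholder for a NeXtVLAD block, which is not
    reproduced in this sketch."""
    def __init__(self, feat_dim, out_dim, dropout_p):
        super().__init__()
        self.dropout = nn.Dropout(dropout_p)
        self.aggregator = nn.Sequential(              # placeholder for NeXtVLAD
            nn.Linear(feat_dim, out_dim), nn.ReLU())

    def forward(self, x):                             # x: (B, T, feat_dim)
        return self.aggregator(self.dropout(x)).mean(dim=1)   # (B, out_dim)

# Dropout rates as reported in the write-up: 0.95 for video/audio, 0.85 for text.
video_branch = ModalityBranch(feat_dim=1024, out_dim=256, dropout_p=0.95)
audio_branch = ModalityBranch(feat_dim=128,  out_dim=256, dropout_p=0.95)
text_branch  = ModalityBranch(feat_dim=768,  out_dim=256, dropout_p=0.85)

video = torch.randn(4, 32, 1024)   # illustrative frame-level video features
audio = torch.randn(4, 32, 128)    # illustrative audio features
text  = torch.randn(4, 32, 768)    # illustrative text token features
fused = torch.cat([video_branch(video), audio_branch(audio), text_branch(text)], dim=1)

num_tags = 100                                        # illustrative tag vocabulary size
logits = nn.Linear(fused.shape[1], num_tags)(fused)   # (4, num_tags)
```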