This is my submission for the Computer Vision module of the University of Portsmouth MEng Computer Science course. This research explores the state-of-the-art (SOTA) AV-HuBERT model, which performs audio-visual speech recognition (lip reading) and also works as a useful feature extractor for audio-visual tasks on input video frames.
The extracted features are used here for phoneme prediction and mel-spectrogram synthesis. The phoneme classification confusion is assessed to determine where the classifiers fall short.
Link to: Paper
- `av_hubert/`: Meta AV HuBERT submodule
- `stable-ts/`: OpenAI Whisper with word-level timestamp generation
- `lib/`: Collection of utility files used for dataset preprocessing and dataset source downloading
- `split.py`: Splits a source MP4 video into 10-second clips, since the AV HuBERT model works best with clips of up to 10 seconds
- `main.ipynb`: Contains all of the initial experimental code for this project:
  - AV HuBERT feature extraction (Base, Self-Trained Large): generates features for the 10-second clips
  - SKLearn and PyTorch classifier training code
  - Dataset handling code (loads phonemes, audio features, raw dlib facial landmarks, and OpenAI Whisper Large word-level timestamps)
  - Auxiliary mel-spectrogram prediction experiments for more robust training
- `base_vox_433h.pt`: AV HuBERT BASE model
- `self_large_vox_433h.pt`: Self-Trained AV HuBERT LARGE model (best performing)
- `phoneme_dict.txt`: ARPABET phoneme dictionary
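The clip-splitting step can be sketched as below. This is a minimal illustration using ffmpeg via `subprocess`, with hypothetical helper names; `split.py` in this repo is the actual implementation.

```python
import subprocess


def clip_bounds(duration_s: float, clip_len: float = 10.0):
    """Return (start, end) boundaries for consecutive clips of at most clip_len seconds."""
    bounds = []
    start = 0.0
    while start < duration_s:
        bounds.append((start, min(start + clip_len, duration_s)))
        start += clip_len
    return bounds


def split_video(src: str, duration_s: float) -> None:
    # Cut each segment with ffmpeg; AV HuBERT performs best on clips of <= 10 s.
    for i, (start, end) in enumerate(clip_bounds(duration_s)):
        subprocess.run(
            ["ffmpeg", "-y", "-i", src, "-ss", str(start), "-to", str(end),
             f"clip_{i:03d}.mp4"],
            check=True,
        )
```

For example, a 25-second source video yields boundaries `(0, 10)`, `(10, 20)`, `(20, 25)`.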
Different models are explored over the dlib and AV HuBERT features:
- PyTorch deep neural network: one hidden layer (256 or 512 units) with ReLU activation followed by a softmax output layer, plus an additional projection from the hidden layer to predict mel-spectrogram features
- Support Vector Machine (linear, radial basis function, polynomial, and sigmoid kernels)
- Random Forest
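The PyTorch classifier described above can be sketched as follows. The input, hidden, and output dimensions here (768-dim BASE features, 256 hidden units, 39 ARPABET phonemes, 80 mel bins) are illustrative assumptions, not necessarily the exact configuration used:

```python
import torch
import torch.nn as nn


class PhonemeDNN(nn.Module):
    """One hidden layer with ReLU, a phoneme head (softmax applied via the
    loss), and an auxiliary projection predicting mel-spectrogram features."""

    def __init__(self, in_dim=768, hidden=256, n_phonemes=39, n_mels=80):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # CrossEntropyLoss applies log-softmax internally, so the head emits logits.
        self.phoneme_head = nn.Linear(hidden, n_phonemes)
        # Auxiliary mel-spectrogram projection from the same hidden layer.
        self.mel_head = nn.Linear(hidden, n_mels)

    def forward(self, x):
        h = self.hidden(x)
        return self.phoneme_head(h), self.mel_head(h)
```

Training would then combine a cross-entropy loss on the phoneme logits with a regression loss (e.g. MSE) on the predicted mel features.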
This work explores two main types of visual features:

- AV HuBERT embeddings, generated from the VoxCeleb2 fine-tuned models `base_vox_433h` and `self_large_vox_433h`:
  - BASE (768 dim)
  - Self-Trained LARGE (1024 dim)
- Base dlib facial landmarks
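As an illustration of the landmark features, the 68 (x, y) points produced by dlib's shape predictor can be flattened into a fixed-length vector per frame. The normalisation scheme below (centering and unit scaling) is an assumption for the sketch; the repo's `lib/` utilities define the actual preprocessing.

```python
import numpy as np


def landmarks_to_features(points: np.ndarray) -> np.ndarray:
    """Flatten 68 (x, y) dlib landmarks into a 136-dim feature vector,
    normalised to be translation- and scale-invariant (hypothetical
    preprocessing, not necessarily the repo's exact pipeline)."""
    pts = points.astype(np.float64)
    pts -= pts.mean(axis=0)        # remove translation
    scale = np.linalg.norm(pts)
    if scale > 0:
        pts /= scale               # remove scale
    return pts.ravel()             # shape: (136,)
```

A vector like this per frame can then be fed directly to the SKLearn or PyTorch classifiers above.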
Two datasets are used for this work:

- Jordan Peterson lecture (30 fps): duration of ~11 min 24 s (684 s), with a sequence length of ~20,000 frames.
- Jordan Peterson, "The False Appeal of Communism" (24 fps): a short clip of Jordan Peterson discussing communism, chosen for the variety of phonemes present within it. Duration of ~51 seconds, with a sequence length of 1,233 frames.