This is my submission for the Computer Vision module of the University of Portsmouth MEng Computer Science course. This research explores the state-of-the-art (SOTA) AV-HuBERT model, which performs audio-visual speech recognition (lip reading) and also works as a useful feature extractor for audio-visual tasks on input video frames.
The extracted features are used here for phoneme prediction and mel-spectrogram synthesis. The phoneme classification confusion is assessed to determine where the classifiers fall short.
Link to: Paper
- `av_hubert/`: Meta AV HuBERT submodule
- `stable-ts/`: OpenAI Whisper with word-level timestamp generation
- `lib/`: Collection of utility files used for dataset preprocessing and dataset source downloading
- `split.py`: Splits a source MP4 video into 10-second clips, since the AV HuBERT model works best with clips of up to 10 seconds
- `main.ipynb`: Contains all of the initial experimental code for this project:
  - AV HuBERT feature extraction (Base, Self-Trained Large): generates features for the 10-second clips
  - SKLearn and PyTorch classifier training code
  - Dataset handling code (loads phonemes, audio features, raw dlib facial landmarks, and OpenAI Whisper Large word-level timestamps)
  - Auxiliary mel-spectrogram prediction experiments for more robust training
- `base_vox_433h.pt`: AV HuBERT BASE model
- `self_large_vox_433h.pt`: Self-Trained AV HuBERT LARGE model (best performing)
- `phoneme_dict.txt`: ARPABET phoneme dictionary
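The clip-splitting step can be sketched as below. This is a minimal illustration using ffmpeg via `subprocess`, with hypothetical helper names; `split.py` in this repo is the actual implementation.

```python
import subprocess


def clip_bounds(duration_s: float, clip_len: float = 10.0):
    """Return (start, end) boundaries for consecutive clips of at most clip_len seconds."""
    bounds = []
    start = 0.0
    while start < duration_s:
        bounds.append((start, min(start + clip_len, duration_s)))
        start += clip_len
    return bounds


def split_video(src: str, duration_s: float) -> None:
    # Cut each segment with ffmpeg; AV HuBERT performs best on clips of <= 10 s.
    for i, (start, end) in enumerate(clip_bounds(duration_s)):
        subprocess.run(
            ["ffmpeg", "-y", "-i", src, "-ss", str(start), "-to", str(end),
             f"clip_{i:03d}.mp4"],
            check=True,
        )
```

For example, a 25-second source video yields boundaries `(0, 10)`, `(10, 20)`, `(20, 25)`.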
Different models are explored over the dlib and AV HuBERT features:
- PyTorch deep neural network: one hidden layer (256 or 512 units) with ReLU activation followed by a softmax output layer, plus an additional projection from the hidden layer to predict mel-spectrogram features
- Support Vector Machine (linear, radial basis function, polynomial, and sigmoid kernels)
- Random Forest
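The PyTorch classifier described above can be sketched as follows. The input, hidden, and output dimensions here (768-dim BASE features, 256 hidden units, 39 ARPABET phonemes, 80 mel bins) are illustrative assumptions, not necessarily the exact configuration used:

```python
import torch
import torch.nn as nn


class PhonemeDNN(nn.Module):
    """One hidden layer with ReLU, a phoneme head (softmax applied via the
    loss), and an auxiliary projection predicting mel-spectrogram features."""

    def __init__(self, in_dim=768, hidden=256, n_phonemes=39, n_mels=80):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # CrossEntropyLoss applies log-softmax internally, so the head emits logits.
        self.phoneme_head = nn.Linear(hidden, n_phonemes)
        # Auxiliary mel-spectrogram projection from the same hidden layer.
        self.mel_head = nn.Linear(hidden, n_mels)

    def forward(self, x):
        h = self.hidden(x)
        return self.phoneme_head(h), self.mel_head(h)
```

Training would then combine a cross-entropy loss on the phoneme logits with a regression loss (e.g. MSE) on the predicted mel features.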
This work explores two main types of visual features:

- AV HuBERT embeddings, generated from the VoxCeleb2 fine-tuned models `base_vox_433h` and `self_large_vox_433h`:
  - BASE (768 dim)
  - Self-Trained LARGE (1024 dim)
- Base dlib facial landmarks
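As an illustration of the landmark features, the 68 (x, y) points produced by dlib's shape predictor can be flattened into a fixed-length vector per frame. The normalisation scheme below (centering and unit scaling) is an assumption for the sketch; the repo's `lib/` utilities define the actual preprocessing.

```python
import numpy as np


def landmarks_to_features(points: np.ndarray) -> np.ndarray:
    """Flatten 68 (x, y) dlib landmarks into a 136-dim feature vector,
    normalised to be translation- and scale-invariant (hypothetical
    preprocessing, not necessarily the repo's exact pipeline)."""
    pts = points.astype(np.float64)
    pts -= pts.mean(axis=0)        # remove translation
    scale = np.linalg.norm(pts)
    if scale > 0:
        pts /= scale               # remove scale
    return pts.ravel()             # shape: (136,)
```

A vector like this per frame can then be fed directly to the SKLearn or PyTorch classifiers above.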
Two datasets are used for this work:

- Jordan Peterson lecture (30 fps): duration of ~11 min 24 s (684 s), with a sequence length of ~20,000 frames.
- Jordan Peterson, "The False Appeal of Communism" (24 fps): a short clip of Jordan Peterson discussing communism, chosen for the variety of phonemes present within it. Duration of ~51 seconds, with a sequence length of 1,233 frames.