SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model, Accepted to IEEE SLT 2022

audio-visual-ssl

This repo applies two types of vector quantization, Gumbel-Softmax and k-means (both adapted from fairseq.modules).

  • If you want to change the type of vector quantization, modify the config file config/speechclip_c/train_flickr.yaml (a sketch of how the two quantizers could be selected from such a config follows this list).
  • If you want to run validation or testing only, add the --eval or --test flag in egs/run_speechclip_c.sh (example invocations follow the run command below).
  • If you want to resume training from a specific checkpoint, add the --ckpt your_checkpoint_path flag in egs/run_speechclip_c.sh.
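The snippet below is a minimal sketch, not the repository's actual code, of how the two quantizer types from fairseq.modules might be chosen from a config entry. The config keys (vq_type, codebook_size, groups, vq_dim) are hypothetical stand-ins for whatever config/speechclip_c/train_flickr.yaml actually defines; check the yaml for the real names and values.

# Minimal sketch (assumed config keys) of selecting a quantizer from config.
from fairseq.modules import GumbelVectorQuantizer, KmeansVectorQuantizer

def build_quantizer(cfg, input_dim):
    """Build a Gumbel-Softmax or k-means vector quantizer from a config dict."""
    if cfg["vq_type"] == "gumbel":
        return GumbelVectorQuantizer(
            dim=input_dim,
            num_vars=cfg["codebook_size"],  # entries per codebook group
            temp=(2.0, 0.5, 0.999995),      # (start, end, decay) of the Gumbel temperature anneal
            groups=cfg["groups"],
            combine_groups=False,
            vq_dim=cfg["vq_dim"],
            time_first=True,                # inputs shaped (batch, time, dim)
        )
    elif cfg["vq_type"] == "kmeans":
        return KmeansVectorQuantizer(
            dim=input_dim,
            num_vars=cfg["codebook_size"],
            groups=cfg["groups"],
            combine_groups=False,
            vq_dim=cfg["vq_dim"],
            time_first=True,
            gamma=0.25,                     # weight of the commitment term
        )
    raise ValueError(f"unknown vq_type: {cfg['vq_type']}")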

To run cascaded SpeechCLIP:

bash egs/run_speechclip_c.sh
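
For example, with the flags described above (the checkpoint path is a placeholder):

bash egs/run_speechclip_c.sh --eval                         # validation only
bash egs/run_speechclip_c.sh --test                         # testing only
bash egs/run_speechclip_c.sh --ckpt your_checkpoint_path    # resume training from a checkpoint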

Contribute

Please run the autoformatter (see audio-visual-ssl/dev-support/) before opening a PR.
