This directory contains tools for audio analysis and processing built on wav2letter.
To build the tools, pass `-DW2L_BUILD_TOOLS=ON` as a CMake flag when building wav2letter.
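For instance, a minimal build sketch (the build directory name and parallelism level are illustrative):

```
# From the wav2letter repository root; the build directory name is arbitrary.
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DW2L_BUILD_TOOLS=ON
make -j$(nproc)
```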
## VoiceActivityDetection-CTC.cpp
`VoiceActivityDetection-CTC` contains a simple pipeline that supports a CTC-trained acoustic model trained with wav2letter and an n-gram language model in the wav2letter binary format (see the decoder documentation for more).
Build the tool with `make VoiceActivityDetection-CTC`.
First, create an input list file containing the audio data. The list file should exactly follow the standard wav2letter list input format for training, but with the transcription column left empty. For instance:
```
// Example input file
[~/speech/data] head analyze.lst
train001 /tmp/000000000.flac 100.03
train002 /tmp/000000001.flac 360.57
train003 /tmp/000000002.flac 123.53
train004 /tmp/000000003.flac 999.99
...
```
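If the audio sits in a single directory, a list in this shape can be generated with a short shell loop. The sketch below is a hypothetical helper, not part of wav2letter: it assumes `soxi` (from SoX) is installed and that durations are recorded in seconds, which should be checked against your training lists.

```
# Hypothetical helper: build analyze.lst from a directory of FLAC files.
# soxi -D prints each file's duration in seconds; convert if your lists
# use a different unit. The transcription column is left empty on purpose.
i=0
for f in /tmp/audio/*.flac; do
  i=$((i+1))
  printf "train%03d %s %s\n" "$i" "$f" "$(soxi -D "$f")"
done > analyze.lst
```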
Run the binary:
```
[path to binary]/VoiceActivityDetection-CTC \
    -am [path to model] \
    -lm [path to language model] \
    -test [path to list file] \
    --lexicon [path to lexicon file] \
    --maxload -1 \
    --datadir= \
    --tokensdir [path to directory containing tokens file] \
    --tokens [tokens file name] \
    --outpath [output directory]
```
The script outputs four files per input sample, named by the sample ID, in the directory specified by `--outpath`:
- A `.vad` file containing chunk-level probabilities of non-speech based on the probability of silence. These are assigned for each chunk of output; for a model trained with a stride of 1, each chunk is a single 10 ms frame, while for a model with a stride of 8, each chunk spans 80 ms (a post-processing sketch follows this list).
- An `.sts` file containing the perplexity of the predicted sequence based on the specified input language model, in addition to the percentage of the audio containing speech based on the passed `--vadthreshold`.
- A `.tsc` file containing the most likely token-level transcription of the given audio based on the acoustic model output only.
- An `.fwt` file containing frame- or chunk-level token emissions based on the most likely token emitted for each sample.
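As a quick sanity check on a `.vad` file, something like the following flags chunks whose non-speech probability crosses a threshold. This is a sketch under assumptions: it treats the file as whitespace-separated probabilities, one value per chunk, which should be verified against the actual output layout.

```
# Hypothetical check: print the index and value of every chunk whose
# non-speech probability exceeds 0.99. The whitespace-separated layout
# of the .vad file is an assumption.
tr -s ' ' '\n' < [output directory]/train001.vad \
  | awk -v t=0.99 '$1 > t { print NR, $1 }'
```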
Below are models compatible with this audio analysis pipeline.
| File | Dataset | Dev Set | Criterion | Architecture | Lexicon | Tokens |
| --- | --- | --- | --- | --- | --- | --- |
| baseline_dev-other | LibriSpeech | dev-other | CTC | Archfile | Lexicon | Tokens |
## StreamingTDSModelConverter.cpp
Once a model is trained in wav2letter++ for streaming TDS models using the provided recipe (possibly customized to suit one's use case), it needs to be serialized into a format that the wav2letter@anywhere inference platform can load. `StreamingTDSModelConverter` can be used to do this. Note that the tool only supports models trained using streaming TDS + CTC style architectures, as described in the paper here.
Build the tool with `make streaming_tds_model_converter`.
To run the binary:
```
[path to binary]/streaming_tds_model_converter \
    -am [path to model] \
    --outdir [output directory]
```
The output directory will contain:
- `tokens.txt` - tokens file (with the blank symbol included)
- `acoustic_model.bin` - serialized acoustic model
- `feature_extractor.bin` - serialized feature extraction model, which performs log-mel feature extraction and local normalization
These files, along with a few other files required for decoding (such as a language model and lexicon), can be used to run inference on audio files. See the tutorial for more details.
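As a rough sketch of that downstream step, the example binaries from the wav2letter@anywhere tutorial can consume the converted files. The binary name and flags below are assumptions drawn from that tutorial and may differ between versions, so verify them against the tutorial itself:

```
# Hypothetical invocation; the binary name and flag names may vary by version.
simple_streaming_asr_example \
  --input_files_base_path [directory with the converted files, LM, and lexicon] \
  --input_audio_file [path to a 16 kHz wav file]
```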