Writing architecture files

wav2letter++ provides a simple way to create an fl::Sequential module for the acoustic model from a text file. The architecture file is specified using the gflags -arch and -archdir.

Example architecture file:

# Comments like this are ignored
# the output tensor will have the shape (Time, 1, NFEAT, Batch)
V -1 1 NFEAT 0
C2 NFEAT 300 48 1 2 1 -1 -1
R
C2 300 300 32 1 1 1
R
RO 2 0 3 1
# the output should be with the shape (NLABEL, Time, Batch, 1)
L 300 NLABEL

While parsing, lines starting with # are ignored as comments. The following tokens are also replaced: NFEAT with the input feature size (e.g. the number of frequency bins) and NLABEL with the output size (e.g. the number of grapheme tokens).

The first token in each line represents a specific flashlight/wav2letter module followed by the specification of its parameters.
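
As a rough illustration of the preprocessing described above, the following Python sketch drops comment lines, substitutes NFEAT and NLABEL, and splits each remaining line into a module token and its parameters. It is not wav2letter++'s actual parser: the function name preprocess_arch and the values nfeat=40 and nlabel=30 are hypothetical, and substituting per token (rather than per line) is an assumption made for simplicity.

```python
def preprocess_arch(text, nfeat, nlabel):
    """Return one token list per non-comment, non-blank line (illustrative only)."""
    subs = {"NFEAT": str(nfeat), "NLABEL": str(nlabel)}
    parsed = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no module
        parsed.append([subs.get(tok, tok) for tok in line.split()])
    return parsed


example = """\
# Comments like this are ignored
V -1 1 NFEAT 0
C2 NFEAT 300 48 1 2 1 -1 -1
R
L 300 NLABEL
"""

# The first token names the module; the rest are its parameters.
for module, *params in preprocess_arch(example, nfeat=40, nlabel=30):
    print(module, params)
```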

Here, we describe how to specify different flashlight/wav2letter modules in the architecture files.

fl::Conv2D

C2 [inputChannels] [outputChannels] [xFilterSz] [yFilterSz] [xStride] [yStride] [xPadding <OPTIONAL>] [yPadding <OPTIONAL>] [xDilation <OPTIONAL>] [yDilation <OPTIONAL>]

Input is expected to be [Time, Width=1, inputChannels, Batch], and the output [Time, Width=1, outputChannels, Batch].

Use padding = -1 for fl::PaddingMode::SAME.
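
To estimate how a C2 line changes the Time dimension, the sketch below applies standard convolution arithmetic for explicit padding. For padding = -1 it assumes that fl::PaddingMode::SAME yields an output length of ceil(input / stride); that behaviour, the helper name conv_out_len, and the input length of 1000 frames are assumptions for illustration rather than statements from this page.

```python
import math


def conv_out_len(in_len, filter_sz, stride, padding=0, dilation=1):
    """Output length along one axis of a convolution (illustrative helper)."""
    if padding == -1:
        # Assumed SAME behaviour: output length depends only on the stride.
        return math.ceil(in_len / stride)
    effective = dilation * (filter_sz - 1) + 1  # span covered by the dilated filter
    return (in_len + 2 * padding - effective) // stride + 1


# Time axis of the example file: "C2 NFEAT 300 48 1 2 1 -1 -1", then "C2 300 300 32 1 1 1".
t1 = conv_out_len(1000, filter_sz=48, stride=2, padding=-1)
t2 = conv_out_len(t1, filter_sz=32, stride=1)
print(t1, t2)
```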

fl::Linear

L [inputChannels] [outputChannels]

Input is expected to be [inputChannels, *, * , *], and the output [outputChannels, *, * , *].

fl::BatchNorm

BN [totalFeatSize] [firstDim] [secondDim <OPTIONAL>] [thirdDim <OPTIONAL>]

Dimensions that are not listed will be reduced for statistics computation.

fl::LayerNorm

LN [firstDim] [secondDim <OPTIONAL>] [thirdDim <OPTIONAL>]

The listed dimensions are those along which normalization is computed (these axes will be reduced for statistics computation).
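
BN and LN treat their dimension lists in opposite ways: BN lists the dimensions to keep, while LN lists the dimensions to reduce. The sketch below makes the contrast concrete by counting how many separate sets of statistics result for a given shape. The helper names, the lines "BN 300 2" and "LN 0 1 2", and the shape (500, 1, 300, 8) are hypothetical examples, not taken from a real recipe.

```python
from math import prod


def bn_reduced_axes(listed_dims, ndim=4):
    """BN: every axis NOT listed is reduced when computing statistics."""
    return [d for d in range(ndim) if d not in listed_dims]


def ln_reduced_axes(listed_dims):
    """LN: the listed axes themselves are reduced."""
    return sorted(listed_dims)


def num_statistics(shape, reduced):
    """Number of independent mean/variance pairs left after the reduction."""
    return prod(s for d, s in enumerate(shape) if d not in reduced)


shape = (500, 1, 300, 8)  # (Time, Width, Channels, Batch), hypothetical sizes

bn_axes = bn_reduced_axes([2])        # "BN 300 2": keep the channel axis
ln_axes = ln_reduced_axes([0, 1, 2])  # "LN 0 1 2": reduce everything but Batch
print(bn_axes, num_statistics(shape, bn_axes))
print(ln_axes, num_statistics(shape, ln_axes))
```

In the BN case the number of kept statistics equals totalFeatSize (300 here); in the LN case there is one set of statistics per batch element.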

fl::WeightNorm

WN [normDim] [Layer]

fl::Dropout

DO [dropProb]

fl::Pool2D

Average

A [xFilterSz] [yFilterSz] [xStride] [yStride] [xPadding] [yPadding]

Max

M [xFilterSz] [yFilterSz] [xStride] [yStride] [xPadding] [yPadding]

Use padding = -1 for fl::PaddingMode::SAME.

fl::View

V [firstDim] [secondDim] [thirdDim] [fourthDim]

Use -1 to infer a dimension; only one parameter can be -1. Use 0 to keep the corresponding input dimension.
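
The -1 and 0 rules can be sketched as a small shape resolver. This is only an illustration of the rules stated above, not the flashlight implementation; resolve_view and the input shape (1000, 40, 1, 8) are hypothetical.

```python
from math import prod


def resolve_view(in_shape, spec):
    """Resolve a V line's four parameters against a concrete input shape."""
    if list(spec).count(-1) > 1:
        raise ValueError("only one dimension can be -1")
    # 0 copies the input dimension at the same position.
    out = [in_shape[i] if d == 0 else d for i, d in enumerate(spec)]
    total = prod(in_shape)
    if -1 in out:
        known = prod(d for d in out if d != -1)
        if known == 0 or total % known != 0:
            raise ValueError("cannot infer the -1 dimension")
        out[out.index(-1)] = total // known  # infer the remaining dimension
    elif prod(out) != total:
        raise ValueError("view changes the number of elements")
    return tuple(out)


# "V -1 1 NFEAT 0" with NFEAT = 40 on a hypothetical (1000, 40, 1, 8) input.
print(resolve_view((1000, 40, 1, 8), (-1, 1, 40, 0)))
```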

fl::Reorder

RO [firstDim] [secondDim] [thirdDim] [fourthDim]
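
In the example architecture file, "RO 2 0 3 1" turns a (Time, 1, 300, Batch) tensor into one whose leading dimension is the channel dimension, ready for the final L layer. The sketch below shows that reading of the parameters, where output dimension i takes input dimension spec[i]. The helper name reorder_shape and the concrete sizes are hypothetical.

```python
def reorder_shape(in_shape, spec):
    """Shape after an RO line: output dim i is input dim spec[i]."""
    if sorted(spec) != list(range(len(in_shape))):
        raise ValueError("spec must be a permutation of the input dimensions")
    return tuple(in_shape[d] for d in spec)


# (Time=469, 1, Channels=300, Batch=8) -> (Channels, Time, Batch, 1)
print(reorder_shape((469, 1, 300, 8), (2, 0, 3, 1)))
```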

fl::ELU

ELU

fl::ReLU

R

fl::PReLU

PR [numElements <OPTIONAL>] [initValue <OPTIONAL>]

fl::Log

LG

fl::HardTanh

HT

fl::Tanh

TH

fl::GatedLinearUnit

GLU [sliceDim]

fl::LogSoftmax

LSM [normDim]

fl::RNN

RNN

RNN [inputSize] [outputSize] [numLayers] [isBidirectional] [dropProb]

GRU

GRU [inputSize] [outputSize] [numLayers] [isBidirectional] [dropProb]

LSTM

LSTM [inputSize] [outputSize] [numLayers] [isBidirectional] [dropProb]

fl::Embedding

E [embeddingSize] [nTokens]

fl::AsymmetricConv1D

AC [inputChannels] [outputChannels] [xFilterSz] [xStride] [xPadding <OPTIONAL>] [xFuturePart <OPTIONAL>] [xDilation <OPTIONAL>]

Input is expected to be [Time, Width=1, inputChannels, Batch], and the output [Time, Width=1, outputChannels, Batch].

w2l::Residual

RES [numLayers (N)] [numResSkipConnections (K)] [numBlocks <OPTIONAL>]
[Layer1]
[Layer2]
[ResSkipConnection1]
[Layer3]
[ResSkipConnection2]
[Layer4]
...
[LayerN]
...
[ResSkipConnectionK]

Residual skip connections between layers can only be added if these layers have already been added. There are two ways to define a residual skip connection:

standard

SKIP [fromLayerInd] [toLayerInd] [scale <OPTIONAL, DEFAULT=1>]

with a sequence of projection layers, used when the number of channels in the output of fromLayer differs from the number of channels expected in the input of toLayer (or when some other transformation needs to be applied):

SKIPL [fromLayerInd] [toLayerInd] [nLayersInProjection (M)] [scale <OPTIONAL, DEFAULT=1>]
[Layer1]
[Layer2]
...
[LayerM]

where scale is the value by which the final output is multiplied ((x + f(x)) * scale). scale must be the same for all residual skip connections that share the same toLayer. (Use fromLayerInd = 0 for a skip connection from input, toLayerInd = N+1 for a residual skip connection to output, and fromLayerInd/toLayerInd = K for a residual skip connection from/to LayerK.)
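
To make the block layout and index convention concrete, the sketch below splits a RES block into its N layers and K skip connections, attaching the M projection lines to each SKIPL. It is a hypothetical reader written from the syntax above, not the wav2letter++ parser: parse_res_block and the example lines are invented for illustration, numBlocks is ignored, and the check that fromLayerInd is smaller than toLayerInd is an assumption.

```python
def parse_res_block(lines):
    """Split a RES block into layer lines and skip-connection records."""
    header = lines[0].split()
    if header[0] != "RES":
        raise ValueError("block must start with RES")
    n_layers, n_skips = int(header[1]), int(header[2])
    layers, skips = [], []
    i = 1
    while len(layers) < n_layers or len(skips) < n_skips:
        if i >= len(lines):
            raise ValueError("block ended before N layers and K skips were read")
        tok = lines[i].split()
        i += 1
        if tok[0] == "SKIP":
            scale = float(tok[3]) if len(tok) > 3 else 1.0
            skips.append({"from": int(tok[1]), "to": int(tok[2]),
                          "scale": scale, "projection": []})
        elif tok[0] == "SKIPL":
            m = int(tok[3])  # number of projection layer lines that follow
            scale = float(tok[4]) if len(tok) > 4 else 1.0
            skips.append({"from": int(tok[1]), "to": int(tok[2]),
                          "scale": scale, "projection": lines[i:i + m]})
            i += m
        else:
            layers.append(lines[i - 1])
    if len(layers) != n_layers or len(skips) != n_skips:
        raise ValueError("layer or skip count does not match the RES header")
    for s in skips:
        # 0 is the block input, N + 1 is the block output.
        if not 0 <= s["from"] < s["to"] <= n_layers + 1:
            raise ValueError("skip index out of range")
    return layers, skips


block = [
    "RES 3 2",
    "C2 300 300 5 1 1 1 -1 -1",
    "R",
    "SKIP 0 2",            # block input -> Layer2
    "C2 300 600 5 1 1 1 -1 -1",
    "SKIPL 2 4 1 0.5",     # Layer2 -> block output, through one projection layer
    "C2 300 600 1 1 1 1",
]
layers, skips = parse_res_block(block)
print(layers)
print(skips)
```

In this made-up block the SKIPL projection maps 300 channels to 600 so that the skipped signal matches the output of Layer3.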

w2l::TDSBlock

TDS [inputChannels] [kernelWidth] [inputWidth] [dropoutProb <OPTIONAL, DEFAULT=0>] [innerLinearDim <OPTIONAL, DEFAULT=0>] [rightPadding <OPTIONAL, DEFAULT=-1>] [lNormIncludeTime <OPTIONAL, DEFAULT=True>]

Description of these params can be found here

fl::Padding

PD [value] [kernelWidth] [dim0PadBefore] [dim0PadAfter] [dim1PadBefore <OPTIONAL, DEFAULT=0>] [dim1PadAfter <OPTIONAL, DEFAULT=0>] [dim2PadBefore <OPTIONAL, DEFAULT=0>] [dim2PadAfter <OPTIONAL, DEFAULT=0>] [dim3PadBefore <OPTIONAL, DEFAULT=0>] [dim3PadAfter <OPTIONAL, DEFAULT=0>]

w2l::Transformer

TR [embeddingDim] [mlpDim] [nHeads] [maxPositions] [dropout] [layerDropout <OPTIONAL, DEFAULT=0>] [usePreNormLayer <OPTIONAL, DEFAULT=False>]

maxPositions is typically set to the maximum size of the time dimension; audio longer than this cannot be processed.
