PyTorch starter code for the Visual Dialog Challenge.
- Setup and Dependencies
- Download Preprocessed Data
- Training
- Evaluation
- Generate Submission
- Acknowledgements
This starter code is implemented using PyTorch v1.0 and provides out-of-the-box support for CUDA 9 and cuDNN 7. There are two recommended ways to set up this codebase:
- Install the Anaconda or Miniconda distribution (Python 3+) from their downloads site.
- Clone this repository and create an environment:
git clone https://www.github.com/batra-mlp-lab/visdial-challenge-starter-pytorch
conda create -n visdialch python=3.6
# activate the environment and install all dependencies
conda activate visdialch
pip install -r requirements.txt
# install this codebase as a package in development mode
python setup.py develop
- Install nvidia-docker, which enables usage of GPUs from inside a container.
- We provide a Dockerfile which creates a light-weight image with all the dependencies installed. Build the image as:
docker build -t visdialch .
- Run this image in a container by setting the current user, attaching the current directory (this codebase) as a volume, and setting the shared memory size according to your requirements (this depends on the memory usage of your code).
nvidia-docker run -u $(id -u):$(id -g) -v $PWD:/workspace \
--shm-size 16G visdialch /bin/bash
Since the codebase is attached as a volume, any changes made to the source code from outside the container are reflected immediately inside it, so this setup fits easily into almost any development workflow.
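Whichever setup you choose, a quick way to confirm the environment works is to check the installed PyTorch version and CUDA visibility from within it. A minimal sanity check, not part of this codebase:

```python
# Quick environment sanity check: prints the installed PyTorch version
# and whether CUDA devices are visible to it.
import torch

print("PyTorch version:", torch.__version__)        # this README targets v1.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU count:", torch.cuda.device_count())
    print("Device 0:", torch.cuda.get_device_name(0))
```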
Note: We recommend keeping all the source code for data loading, models and other utilities inside the visdialch directory. Since it is a setuptools-style package, it makes handling of absolute/relative imports and module resolution less painful. Scripts using visdialch can be created anywhere in the filesystem, as long as the current conda environment is active.
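For example, after python setup.py develop has been run in the active environment, the package resolves like any installed library from a script anywhere on the filesystem. A minimal check, assuming the install step above succeeded:

```python
# Confirms that `visdialch` is importable from outside the repository;
# the path printed should point back at the checked-out source tree.
import visdialch
print(visdialch.__file__)
```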
We provide preprocessed files for VisDial v1.0 (tokenized captions, questions, answers, image indices, vocabulary mappings and image features extracted by a pretrained CNN). If you wish to preprocess data or extract your own features, skip this step.
Extracted features for v1.0 train, val and test are available for download here.
- visdial_data_train.h5: Tokenized captions, questions, answers, image indices, for training on train
- visdial_params_train.json: Vocabulary mappings and COCO image ids for training on train
- data_img_vgg16_relu7_train.h5: VGG16 relu7 image features for training on train
- data_img_vgg16_pool5_train.h5: VGG16 pool5 image features for training on train
- visdial_data_trainval.h5: Tokenized captions, questions, answers, image indices, for training on train + val
- visdial_params_trainval.json: Vocabulary mappings and COCO image ids for training on train + val
- data_img_vgg16_relu7_trainval.h5: VGG16 relu7 image features for training on train + val
- data_img_vgg16_pool5_trainval.h5: VGG16 pool5 image features for training on train + val
Download these files to the data directory.
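If you want to sanity-check the downloaded files before training, they can be opened with h5py and the standard json module. The snippet below only lists the stored keys and shapes, so it makes no assumptions about the exact dataset names inside the files:

```python
# Inspect the preprocessed files without assuming their internal layout.
import json
import h5py

with h5py.File("data/visdial_data_train.h5", "r") as f:
    for name, item in f.items():
        if isinstance(item, h5py.Dataset):
            print(name, item.shape, item.dtype)
        else:
            print(name, "(group)")

with open("data/visdial_params_train.json") as f:
    params = json.load(f)
print(list(params.keys()))
```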
This codebase supports discriminative decoding only; read more here. For reference, we have Late Fusion Encoder from the Visual Dialog paper.
We provide a training script which accepts arguments via config files. The config file should contain arguments specific to a particular experiment, such as those defining the model architecture or optimization hyperparameters. Other arguments, such as GPU ids or the number of CPU workers, should be declared in the script and passed in as argparse-style arguments.
Train the baseline model provided in this repository as:
python train.py --config-yml configs/lf_disc_vgg16_fc7_bs32.yml --gpu-ids 0 1 # provide more ids for multi-GPU execution other args...
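The split between config-file arguments and command-line arguments can be sketched roughly as below; the YAML keys shown ("model", "solver") are illustrative assumptions, not the actual schema used by this codebase:

```python
# Rough sketch of the config/argparse split described above.
# Experiment-specific settings come from the YAML file; machine-specific
# settings (GPU ids, CPU workers) come from the command line.
import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--config-yml", default="configs/lf_disc_vgg16_fc7_bs32.yml")
parser.add_argument("--gpu-ids", nargs="+", type=int, default=[0])
parser.add_argument("--cpu-workers", type=int, default=4)
args = parser.parse_args()

with open(args.config_yml) as f:
    config = yaml.safe_load(f)

print(config.get("model"), config.get("solver"))  # hypothetical keys
print(args.gpu_ids, args.cpu_workers)
```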
To extend this starter code, add your own encoder/decoder modules into their respective directories and include their names as choices in your config file. We provide an --overfit flag, which can be useful for rapid debugging: it takes a batch of 5 examples and overfits the model on them.
Saving model checkpoints: This script saves model checkpoints at every epoch under the path specified by --save-dirpath. We recommend reading the module docstring in visdialch/utils/checkpointing.py for more details on how checkpointing is managed.
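As a rough sketch of how a saved checkpoint might be restored later (the exact dictionary keys are defined by visdialch/utils/checkpointing.py, so treat the "model"/"optimizer" keys below as assumptions to verify against that module):

```python
# Assumed checkpoint layout: a dict holding model and optimizer state_dicts.
# Check visdialch/utils/checkpointing.py for the actual keys before relying on this.
import torch

# Stand-in model/optimizer; replace with the encoder-decoder built from your config.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters())

checkpoint = torch.load("/path/to/checkpoint.pth", map_location="cpu")
model.load_state_dict(checkpoint["model"])          # hypothetical key
optimizer.load_state_dict(checkpoint["optimizer"])  # hypothetical key
```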
Evaluation of a trained model checkpoint can be done as follows:
python evaluate.py --config-yml /path/to/config.yml --load-path /path/to/checkpoint.pth --split val --use-gt --gpu-ids 0
To evaluate on metrics from the Visual Dialog paper (Mean reciprocal rank, R@{1, 5, 10}, Mean rank), use the --use-gt flag. Since the test split has no ground truth, --split test won't work here.
Note: The metrics reported here would be the same as those reported through EvalAI by making a submission in the val phase.
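For reference, these retrieval metrics are straightforward to compute from the rank assigned to the ground-truth answer in each round; a minimal NumPy sketch, independent of how this codebase actually implements them:

```python
# Mean Reciprocal Rank, Recall@k and Mean Rank from ground-truth answer ranks.
# `gt_ranks` holds, for each dialog round, the 1-based rank of the correct answer
# among the 100 candidate options (toy values here).
import numpy as np

gt_ranks = np.array([1, 4, 12, 2, 57])

mrr = np.mean(1.0 / gt_ranks)
recall = {k: np.mean(gt_ranks <= k) for k in (1, 5, 10)}
mean_rank = np.mean(gt_ranks)

print(f"MRR: {mrr:.4f}  R@1/5/10: {recall[1]:.2f}/{recall[5]:.2f}/{recall[10]:.2f}  "
      f"Mean rank: {mean_rank:.2f}")
```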
To save predictions in a format submittable to the evaluation server on EvalAI, run the evaluation script without the --use-gt flag.
To generate a submission file for the test-std or test-challenge phase:
python evaluate.py --config-yml /path/to/config.yml --load-path /path/to/checkpoint.pth --split test --save-ranks-path /path/to/submission.json --gpu-ids 0
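The saved ranks file is plain JSON, so it can be inspected before uploading to EvalAI; the snippet below only prints its top-level structure and deliberately makes no assumptions about specific field names:

```python
# Peek at the generated submission file before uploading it.
import json

with open("/path/to/submission.json") as f:
    ranks = json.load(f)

print(type(ranks))
if isinstance(ranks, list) and ranks:
    print(len(ranks), "entries; first entry:")
    print(ranks[0])
```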
- This starter code began as a fork of batra-mlp-lab/visdial-rl. We thank the developers for doing most of the heavy-lifting.
- The Lua-torch codebase of Visual Dialog, at batra-mlp-lab/visdial, served as an important reference while developing this codebase.
- Some documentation and design strategies of the Reader and Vocabulary classes are inspired by AllenNLP. It is not a dependency because the use case in this codebase would be too small in its current state.