This is a TensorFlow implementation accompanying our paper
Lajanugen Logeswaran, Honglak Lee.
An efficient framework for learning sentence representations. In ICLR, 2018.

This codebase is based on Chris Shallue's [Tensorflow implementation](https://github.com/tensorflow/models/tree/master/research/skip_thoughts) of the SkipThought model.
The data preparation, vocabulary expansion, and evaluation scripts have been adapted with minor changes.
Other code files have been modified and restructured with changes specific to our model.

### Contents

  * [Install Required Packages](#install-required-packages)
  * [Download Pretrained Models (Optional)](#download-pretrained-models-optional)
* [Training a Model](#training-a-model)
  * [Prepare the Training Data](#prepare-the-training-data)
  * [Run the Training Script](#run-the-training-script)
* [Expanding the Vocabulary](#expanding-the-vocabulary)
  * [Overview](#overview)
  * [Preparation](#preparation)

## Pre-trained Models
Models trained on the above datasets can be downloaded from [https://bit.ly/2DQTHXr](https://bit.ly/2DQTHXr).
These models are the multi-channel variations (MC-QT) discussed in the paper.
If you are interested in evaluating these models or using them in your tasks, jump to [Evaluation on downstream tasks](#evaluation-on-downstream-tasks).

## Model configuration files

We use JSON configuration files to describe models. These files provide a concise description of a model and make it easy to concatenate representations from different models or model types at evaluation time.

The description of a sentence encoder has the following format.
```
{
  "encoder": "gru",                      # Type of encoder
  "encoder_dim": 1200,                   # Dimensionality of encoder
  "bidir": true,                         # Uni/bi directional
  "checkpoint_path": "",                 # Path to checkpoint
  "vocab_configs": [                     # Configuration of vocabulary/word embeddings
    {
      "mode": "trained",                 # Vocabulary mode: fixed/trained/expand
      "name": "word_embedding",
      "dim": 620,                        # Word embedding size
      "size": 50001,                     # Size of vocabulary
      "vocab_file": "BC_dictionary.txt", # Dictionary file
      "embs_file": ""                    # Provide external embeddings file
    }
  ]
}
```

Vocabulary mode can be one of *fixed*, *trained* or *expand*. These modes represent the following cases.
* *fixed* - Use fixed, pre-trained embeddings.
* *trained* - Train word embeddings from scratch.
* *expand* - Use an expanded vocabulary. This mode is only used during evaluation on downstream tasks.

`checkpoint_path` and `vocab_file` have to be specified only for evaluation.

To concatenate representations from multiple sentence encoders at evaluation time, the JSON file can contain a list of encoder specifications. See `model_configs/BC/eval.json` for an example.
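
A minimal sketch of such a list is shown below. The field values are placeholders following the single-encoder format above, not the contents of the actual file:
```
[
  {
    "encoder": "gru",
    "encoder_dim": 1200,
    "bidir": true,
    "checkpoint_path": "path/to/first/checkpoint",
    "vocab_configs": [ ... ]
  },
  {
    "encoder": "gru",
    "encoder_dim": 1200,
    "bidir": false,
    "checkpoint_path": "path/to/second/checkpoint",
    "vocab_configs": [ ... ]
  }
]
```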

## Training a Model

### Prepare the Training Data

The training script requires data to be in (sharded) TFRecord format.
`scripts/data_prep.sh` can be used to generate these files.
The script requires a dictionary file and comma-separated paths to files containing tokenized sentences.
* The dictionary file should have a single word on each line. We assume that the first token (`<unk>`) represents OOV words. A short illustration of the expected format is shown after this list.
* The data files are expected to have a tokenized sentence on each line, in the same order as the source document.
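
Purely as an illustration of the dictionary format (the entries below are made up; the actual `BC_dictionary.txt` differs, but `<unk>` must be the first line):
```
<unk>
the
,
.
of
and
```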

The following datasets were used for training our models.

### Run the Training Script

Use the `run.sh` script to train a model.
The following variables have to be specified.

```
* DATA_DIR     # Path to TFRecord files
* RESULTS_HOME # Directory to store results
* CFG          # Name of model configuration
* MDL_CFGS     # Path to model configuration files
* GLOVE_PATH   # Path to GloVe dictionary and embeddings
```

Example configuration files are provided in the `model_configs` folder. During training, model files will be stored under a directory named `$RESULTS_HOME/$CFG`.
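
Purely as an illustration, the variables might take values along the following lines. Every path and the configuration name below are hypothetical placeholders; consult `run.sh` for how the variables are actually consumed (for example, whether they are exported or edited in the script itself):
```
DATA_DIR=/path/to/tfrecords      # directory containing the sharded TFRecord files
RESULTS_HOME=/path/to/results    # training output goes to $RESULTS_HOME/$CFG
CFG=BC                           # a configuration name under $MDL_CFGS (placeholder)
MDL_CFGS=model_configs           # model configuration files in this repository
GLOVE_PATH=/path/to/glove        # GloVe dictionary and embeddings
```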

### Training using pre-trained word embeddings

The implementation supports using fixed pre-trained GloVe word embeddings.
The code expects a numpy array file named `glove.840B.300d.npy`, containing the GloVe word embeddings, in the `$GLOVE_PATH` folder.
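
If you start from the standard `glove.840B.300d.txt` distribution, a conversion along the following lines can produce such an array. This is only a sketch: it assumes the code wants one embedding per row, with the row order matching an accompanying dictionary file, and the dictionary file name used here is hypothetical. Check the repository's GloVe-loading code for the exact layout it expects.
```
import numpy as np

words, vectors = [], []
with open("glove.840B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        # A few GloVe tokens contain spaces, so treat the last 300 fields as the vector.
        words.append(" ".join(parts[:-300]))
        vectors.append(np.asarray(parts[-300:], dtype=np.float32))

embeddings = np.stack(vectors)              # shape: (vocab_size, 300)
np.save("glove.840B.300d.npy", embeddings)  # place this under $GLOVE_PATH

# Keep the word order next to the array so that row i can be mapped back to a word.
# The file name and format here are assumptions, not the repository's convention.
with open("glove_dictionary.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(words))
```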

## Evaluating a Model

### Expanding the Vocabulary

Once the model is trained, the vocabulary used for training can optionally be expanded to a larger vocabulary using the technique proposed in the SkipThought paper.
The `voc_exp.sh` script can be used to perform expansion.
Since Word2Vec embeddings are used for expansion, you will have to download the Word2Vec model.
The script also makes use of the gensim library.
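
The core ingredients, sketched below, are loading the pre-trained GoogleNews Word2Vec vectors with gensim and fitting a linear map from the Word2Vec space to the model's trained word-embedding space, as in the SkipThought paper. This is not the repository's own code: the Word2Vec file name, the stand-in vocabulary and embeddings, and the use of scikit-learn are assumptions for illustration; `voc_exp.sh` automates these steps.
```
import numpy as np
from gensim.models import KeyedVectors
from sklearn.linear_model import LinearRegression

# Load the pre-trained GoogleNews Word2Vec vectors (binary format, ~3.4 GB).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Hypothetical stand-ins for the trained model's vocabulary and word embeddings.
trained_vocab = {"<unk>": 0, "the": 1, "sentence": 2}    # word -> row index
trained_embs = np.random.randn(len(trained_vocab), 620)  # rows are 620-d embeddings

# Fit a linear map from Word2Vec space to the trained embedding space on the
# words shared by the two vocabularies, then apply it to out-of-vocabulary words.
shared = [w for w in trained_vocab if w in w2v]
mapping = LinearRegression().fit(
    np.stack([w2v[w] for w in shared]),
    trained_embs[[trained_vocab[w] for w in shared]])
expanded_vector = mapping.predict(w2v[["learning"]])[0]  # embedding for an unseen word
```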

### Evaluation on downstream tasks

Use the `eval.sh` script for evaluation. The following variables need to be set.

```
* CFG          # Name of model configuration
* TASK         # Name of the task
* MDLS_PATH    # Path to model files
* MDL_CFGS     # Path to model configuration files
* GLOVE_PATH   # Path to GloVe dictionary and embeddings
* SKIPTHOUGHTS # Path to SkipThoughts implementation
* DATA         # Data directory for downstream tasks
```

We use the evaluation scripts for the downstream tasks from the authors of the SkipThought model. These scripts train a linear layer on top of the sentence embeddings for each task.
You will need to clone or download the [skip-thoughts GitHub repository](https://github.com/ryankiros/skip-thoughts) by [ryankiros](https://github.com/ryankiros) and point the `SKIPTHOUGHTS` variable to it.
Set the `DATA` variable to the directory containing data for the downstream tasks.
See the above repository for further details regarding downloading and setting up the data.

