# TGen for CS Restaurant

To train and evaluate TGen on the CS Restaurant dataset, you need to:

1. Convert the CS Restaurant data into the format used by TGen. This is done using the `input/convert.py` script. Several slots (see below) are delexicalized. The output files are:

   * `*-abst.txt` -- lexicalization instructions (what was delexicalized at which position in the references; can be used to lexicalize the outputs)
   * `*-das.txt` -- delexicalized DAs
   * `*-das_l.txt` -- original, lexicalized DAs (converted to TGen's representation, semantically equivalent)
   * `*-text.conll` -- delexicalized reference texts in CoNLL-U format (morphology level only)
   * `*-text_l.conll` -- original, lexicalized reference texts in CoNLL-U format (morphology level only)
   * `*-text.txt` -- delexicalized reference texts as plain text
   * `*-text_l.txt` -- original, lexicalized reference texts as plain text
   * `*-tls.txt` -- delexicalized reference texts with interleaved forms/lemmas/tags
   * `*-tls_l.txt` -- original, lexicalized reference texts with interleaved forms/lemmas/tags

   You need MorphoDiTa installed and a Czech tagger model (`czech-morfflex-pdt-160310.tagger`) saved in the current directory.

```bash
./convert.py -a name,area,address,phone,good_for_meal,near,food,price_range,count,price,postcode \
    czech-morfflex-pdt-160310.tagger surface_forms.json train.json train
./convert.py -a name,area,address,phone,good_for_meal,near,food,price_range,count,price,postcode \
    czech-morfflex-pdt-160310.tagger surface_forms.json devel.json devel
./convert.py -a name,area,address,phone,good_for_meal,near,food,price_range,count,price,postcode \
    czech-morfflex-pdt-160310.tagger surface_forms.json test.json test
```
2. Train TGen on the training set. This uses the default configuration file, the converted data, and the default random seed. The model will be saved to `model.pickle.gz` (along with several other files whose names start with `model`). If you want to use the development set for validation, add `-v input/devel-das.txt,input/devel-text.conll` as a parameter.
```bash
../run_tgen.py seq2seq_train config/config.yaml \
    input/train-das.txt input/train-text.conll \
    model.pickle.gz
```
3. Generate outputs on the development set. This also lexicalizes the outputs. Replace `devel` with `test` to generate outputs on the test set.
```bash
../run_tgen.py seq2seq_gen -w outputs.txt -a input/devel-abst.txt \
    model.pickle.gz input/devel-das.txt
```

## Remarks

Please refer to `../USAGE.md` for TGen installation instructions.

The full configuration used Treex for data storage, tree-based generation, and output postprocessing. Installing Treex can be tricky; please contact me if you want to use it.

The Makefile in this directory contains a simple experiment management system, but it assumes an SGE computing cluster and contains hardcoded site-specific settings. Please contact me if you want to use it.