🧘 BLISS – a Benchmark for Language Induction from Small Sets


BLISS is a dataset for testing the generalization capabilities of artificial models for language induction. The benchmark score reflects how well a model generalizes relative to how little data it was trained on: the less training data a model needs in order to generalize correctly, the higher its score.

This repository contains the datasets and data generation scripts for training and testing a model on BLISS.

For the full method and specs, see the paper Benchmarking Neural Network Generalization for Grammar Induction.

Languages

  • aⁿbⁿ
  • aⁿbⁿcⁿ
  • aⁿbⁿcⁿdⁿ
  • aⁿbᵐcⁿ⁺ᵐ
  • Dyck-1
  • Dyck-2
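For reference, membership in two of these languages can be sketched with simple checks. The helpers below are illustrative assumptions, not the repository's own generation code (which lives in `generate_dataset.py`):

```python
import re

def is_an_bn(s):
    # a^n b^n: a run of a's followed by an equally long run of b's
    # (assuming n >= 1; whether the languages include the empty
    # string is not specified here).
    m = re.fullmatch(r"(a+)(b+)", s)
    return bool(m) and len(m.group(1)) == len(m.group(2))

def is_dyck1(s):
    # Dyck-1: balanced strings over a single parenthesis pair.
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:  # closed a parenthesis that was never opened
            return False
    return depth == 0
```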

Citing this work

Please use the following citation if you use the datasets in your work:

@inproceedings{Lan_Chemla_Katzir_2023,
  title={Benchmarking Neural Network Generalization for Grammar Induction},
  author={Lan, Nur and Chemla, Emmanuel and Katzir, Roni},
  booktitle={Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)},
  pages={131--140},
  year={2023}
}

String structure

Following Gers & Schmidhuber (2001), all sequences start and end with the symbol #. This makes it possible to test for strict acceptance/rejection.

All files contain strings delimited by # on both sides; inputs and targets need to be trimmed accordingly.

Example:

| aⁿbⁿ          |           |
| ------------- | --------- |
| Input string  | `#aaabbb` |
| Target string | `aaabbb#` |
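Assuming each dataset line stores a string delimited by # on both sides (e.g. `#aaabbb#`), the trimming can be sketched as:

```python
def to_input_target(line):
    """Trim a '#'-delimited dataset string into an input/target pair.

    A sketch only: the input drops the trailing '#', the target drops
    the leading '#', giving the one-symbol shift used for next-symbol
    prediction.
    """
    assert line.startswith("#") and line.endswith("#")
    return line[:-1], line[1:]
```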

Deterministic and valid symbol masks

All datasets are provided with boolean mask tensors for testing model outputs:

  • Deterministic step masks - some languages have deterministic phases where a model's accuracy can be tested. For example, aⁿbⁿ sequences become deterministic after seeing the first b. A good model will not assign any probability to a after seeing the first b.

  • Valid symbol masks - languages like Dyck don't have any deterministic parts (a new parenthesis can always be opened). But the set of valid symbols at each time step is limited. For example, for a Dyck-1 sequence, after seeing #((, a good model must not assign any probability to the end-of-sequence symbol.

Examples:

| aⁿbⁿ                               |                                 |
| ---------------------------------- | ------------------------------- |
| String example                     | `aaabbb`                        |
| Input sequence                     | `[#,a,a,a,b,b,b]`               |
| Target sequence                    | `[a,a,a,b,b,b,#]`               |
| Vocabulary                         | `{"#": 0, "a": 1, "b": 2}`      |
| Deterministic steps mask (boolean) | `[0,0,0,0,1,1,1]`               |
| Deterministic step mask shape      | `(batch_size, sequence_length)` |

| Dyck-1                       |                                                                   |
| ---------------------------- | ----------------------------------------------------------------- |
| String example               | `(())()`                                                          |
| Input sequence               | `[#,(,(,),),(,)]`                                                 |
| Target sequence              | `[(,(,),),(,),#]`                                                 |
| Vocabulary                   | `{"#": 0, "(": 1, ")": 2}`                                        |
| Valid symbols mask (boolean) | `[[1,1,0], [0,1,1], [0,1,1], [0,1,1], [1,1,0], [0,1,1], [1,1,0]]` |
| Valid symbol mask shape      | `(batch_size, sequence_length, vocabulary_size)`                  |
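As an illustration of how such masks can be derived for a single sequence (a sketch under the vocabularies above, not the repository's own generation code):

```python
import numpy as np

DYCK1_VOCAB = {"#": 0, "(": 1, ")": 2}

def anbn_deterministic_mask(input_seq):
    # True from the first 'b' onward: once a 'b' has been seen, every
    # following symbol (the remaining b's, then '#') is determined.
    mask, seen_b = [], False
    for sym in input_seq:
        seen_b = seen_b or sym == "b"
        mask.append(seen_b)
    return np.array(mask, dtype=bool)

def dyck1_valid_next_symbols(input_seq):
    # mask[t, v] is True iff symbol v may follow the prefix
    # input_seq[: t + 1].
    mask = np.zeros((len(input_seq), len(DYCK1_VOCAB)), dtype=bool)
    depth = 0
    for t, sym in enumerate(input_seq):
        if sym == "(":
            depth += 1
        elif sym == ")":
            depth -= 1
        mask[t, DYCK1_VOCAB["("]] = True        # can always open
        mask[t, DYCK1_VOCAB[")"]] = depth > 0   # close only if one is open
        mask[t, DYCK1_VOCAB["#"]] = depth == 0  # end only when balanced
    return mask
```

Both functions reproduce the example masks above for the inputs `[#,a,a,a,b,b,b]` and `[#,(,(,),),(,)]`.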

Folder structure

Each folder in datasets has the following structure:

  • `<language_name>`
    • `train_<batch_size>_p_<prior>_seed_<seed>.txt.zip` – train set of size `batch_size`, sampled using probability `prior` and random seed `seed`.
    • `test.txt.zip` – first 15,000 strings of the language, sorted by length. aⁿbᵐcⁿ⁺ᵐ is sorted by n+m; the Dyck languages are sorted by length and then lexicographically.
    • `preview.txt` – first 10 strings of the language.
    • `test_deterministic_mask.npz` – boolean mask of deterministic time steps, for relevant languages (all but the Dyck languages). Shape: `(batch_size, sequence_length)`.
    • `test_valid_next_symbols.npz` – boolean mask of valid next symbols, for the Dyck languages. Shape: `(batch_size, sequence_length, vocabulary_size)`.

Load npz mask files using:

np.load(filename)["data"]

🚨 The password to all zip files is 1234. Why? See Test contamination protection below.

Generating new data

To generate new training data using a different seed, prior, or batch size, run:

python generate_dataset.py --lang [language-name] --seed [seed] --prior [prior]

Example:

python generate_dataset.py --lang an_bn --seed 100 --prior 0.3

Test contamination protection

To prevent test-set contamination by large language models that train on crawled data and are later evaluated on it, all dataset files except previews are zipped and password-protected.

The password to all zip files is 1234.

See Jacovi et al., 2022 – Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks.

Each dataset folder contains preview.txt for easy inspection of the data.

Requirements

  • Python ≥ 3.5

Quick setup:

pip install -r requirements.txt
