
Evaluation Data for NeMo Text Processing

This dataset is licensed under the Creative Commons Attribution 4.0 International License. Among other uses, it can serve to evaluate the context-aware hybrid text normalization under nemo_text_processing/hybrid.

It contains 3 datasets:

EngConf.txt - manually created dataset focusing on ambiguous semiotic tokens whose normalization depends on the context.
GoogleTN.json - derived from Google Text Normalization test data.
LibriTTS.json - derived from LibriTTS, keeping examples where the normalized text differs from the written text.
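
The record layout of the two JSON files is not described in this README, so the following is only a minimal sketch for taking a first look at them. It assumes each file is either a single JSON document or JSON Lines (one object per line); the helper name `load_records` is hypothetical and not part of NeMo.

```python
import json
from pathlib import Path


def load_records(path):
    """Load a dataset file as one JSON document, or as JSON Lines if that fails.

    Assumption: the file uses one of these two layouts. Nothing is assumed
    about the field names inside each record.
    """
    text = Path(path).read_text(encoding="utf-8")
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fallback: treat every non-empty line as its own JSON value.
        return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    for name in ("GoogleTN.json", "LibriTTS.json"):
        if Path(name).exists():
            records = load_records(name)
            # Report only the top-level shape; inspect the records before relying on any keys.
            print(name, type(records).__name__, len(records))
```

Printing the top-level type and length first avoids guessing at keys; the actual fields should be checked against the files themselves.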

Find more information here.