This dataset is licensed under the Creative Commons Attribution 4.0 International License. Among other uses, it can serve to evaluate the context-aware hybrid text normalization under nemo_text_processing/hybrid.
It contains 3 datasets:
- EngConf.txt - a manually created dataset focusing on ambiguous semiotic tokens whose normalization depends on the context.
- GoogleTN.json - derived from Google Text Normalization test data.
- LibriTTS.json - derived from LibriTTS entries whose normalized text differs from the written text.
Find more information here.
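
A minimal sketch for a first look at the three files, in case it is useful. The record layout of each file is not described here, so the helpers below only parse the files and report the top-level type or line count. The function names `load_json_dataset` and `load_text_lines` are hypothetical, and the files are assumed to be UTF-8 encoded and located in the current directory.

```python
import json
from pathlib import Path


def load_json_dataset(path):
    """Parse a JSON file and return whatever top-level object it holds."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def load_text_lines(path):
    """Return the non-blank lines of a text file, without trailing newlines."""
    with Path(path).open(encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


if __name__ == "__main__":
    # Assumption: the files sit in the current working directory.
    data_dir = Path(".")

    for name in ("GoogleTN.json", "LibriTTS.json"):
        path = data_dir / name
        if path.exists():
            data = load_json_dataset(path)
            # The schema is not documented here, so only report the type.
            print(f"{name}: top-level {type(data).__name__}")

    eng_conf = data_dir / "EngConf.txt"
    if eng_conf.exists():
        lines = load_text_lines(eng_conf)
        print(f"EngConf.txt: {len(lines)} non-blank lines")
```

Inspecting the first few parsed records is a reasonable next step before wiring the data into an evaluation script.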