Mandarin Tone Sandhi Realization: Evidence from Large Speech Corpora

Dataset

Free ST Chinese Mandarin Corpus (Speaker Info)

This corpus was recorded on cellphones in a quiet indoor environment. It includes 855 speakers, each contributing 120 utterances. All utterances were transcribed and verified by human annotators.

Speaker Data Example:
20170001P00189I0036 | Gender: Female | Age: 31 | Region: Sichuan | Utterance: 我有两张君悦皇家浴场

P00189I: Speaker number
0036: Speech number
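
The ID layout described above can be split programmatically. The following is a minimal sketch, assuming every ID follows the same shape as the single example shown (a numeric prefix, a speaker field such as `P00189I`, then a numeric speech field). The pattern, the `UtteranceId` type, and the `parse_utterance_id` helper are illustrative names that are not part of the corpus tooling.

```python
import re
from typing import NamedTuple


class UtteranceId(NamedTuple):
    prefix: str   # leading digits before the speaker field
    speaker: str  # e.g. "P00189I"
    speech: str   # e.g. "0036"


# Pattern inferred from the one example ID above; other IDs may differ.
_ID_PATTERN = re.compile(r"^(?P<prefix>\d+)(?P<speaker>P\d+[A-Z])(?P<speech>\d+)$")


def parse_utterance_id(utt_id: str) -> UtteranceId:
    """Split an utterance ID into prefix, speaker number, and speech number."""
    match = _ID_PATTERN.match(utt_id)
    if match is None:
        raise ValueError(f"unrecognized utterance ID: {utt_id!r}")
    return UtteranceId(match["prefix"], match["speaker"], match["speech"])


parsed = parse_utterance_id("20170001P00189I0036")
print(parsed.speaker, parsed.speech)  # P00189I 0036
```

Raising on a non-matching ID, rather than returning partial fields, makes it easier to notice IDs that do not follow the assumed layout.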

Free ST Chinese Mandarin Corpus

MAGICDATA Mandarin Chinese Read Speech Corpus (Read Speech) (Speaker Info)

Developed by Magic Data Technology Co., Ltd., this corpus contains 755 hours of scripted read speech data from 1080 native Mandarin speakers from mainland China. The sentence transcription accuracy exceeds 98%.

MAGICDATA Mandarin Chinese Read Speech Corpus

aidatatang_200zh

This corpus includes:

200 hours of mostly mobile-recorded acoustic data.
600 speakers from various accent regions in China.
Transcription accuracy for each sentence is over 98%.
Recordings were made in quiet indoor settings.
The database is split into a training set, validation set, and testing set in a 7:1:2 ratio.
Detailed information such as speech data coding and speaker information is included in the metadata file.
Segmented transcripts are provided.
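
For readers who want to see what a 7:1:2 partition looks like in practice, here is a small sketch. The corpus ships with its own split, so this does not reproduce it. Whether the published split is made by speaker or by utterance is not stated here; the sketch uses speakers as the unit purely as an assumption, and the speaker IDs and the `split_7_1_2` helper are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def split_7_1_2(items: Sequence[str], seed: int = 0) -> Dict[str, List[str]]:
    """Shuffle items and partition them into train/dev/test at roughly 7:1:2."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * 7 // 10
    n_dev = len(shuffled) // 10
    return {
        "train": shuffled[:n_train],
        "dev": shuffled[n_train:n_train + n_dev],
        "test": shuffled[n_train + n_dev:],  # remainder goes to test
    }


# Hypothetical speaker IDs; 600 matches the speaker count listed above.
speakers = [f"spk{i:04d}" for i in range(600)]
parts = split_7_1_2(speakers)
print({name: len(ids) for name, ids in parts.items()})
# {'train': 420, 'dev': 60, 'test': 120}
```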

aidatatang_200zh

Process Pipelines

Metadata Statistics

Forced Alignment

We used Charsiu for vowel-level forced alignment. Because the Charsiu authors have released the TextGrid files we need, we used their alignment files directly.

Charsiu Forced Alignment
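
As an illustration of how alignment intervals can be read from a TextGrid, the sketch below extracts `(start, end, label)` triples with a regular expression. It assumes the files use Praat's long (verbose) text format and does not separate tiers; the released files may be laid out differently, in which case a dedicated TextGrid library is the safer choice. The sample content and the `read_intervals` helper are made up for the example.

```python
import re
from typing import List, Tuple

# Matches one interval in Praat's long TextGrid format: xmin, xmax, then text.
_INTERVAL = re.compile(
    r'xmin\s*=\s*([\d.]+)\s*'
    r'xmax\s*=\s*([\d.]+)\s*'
    r'text\s*=\s*"((?:[^"]|"")*)"'
)


def read_intervals(textgrid_text: str) -> List[Tuple[float, float, str]]:
    """Return (start, end, label) for every interval, across all tiers."""
    return [
        (float(start), float(end), label.replace('""', '"'))
        for start, end, label in _INTERVAL.findall(textgrid_text)
    ]


# Hand-written sample, not taken from the released alignment files.
sample = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 0.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "phones"
        xmin = 0
        xmax = 0.5
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.2
            text = "n"
        intervals [2]:
            xmin = 0.2
            xmax = 0.5
            text = "i"
'''

for start, end, label in read_intervals(sample):
    print(f"{label}: {end - start:.3f} s")
```

The file-level and tier-level `xmin`/`xmax` pairs are skipped because they are not followed directly by a `text` field.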

Segmentation

Word segmentation is performed using LTP.

LTP - Language Technology Platform
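
Once an utterance is segmented into words, each character can be tagged with the word it belongs to, which makes it possible to tell whether two adjacent characters sit inside one word or straddle a word boundary. The sketch below shows that bookkeeping only. The segmentation in the example is hand-written for illustration and is not LTP output, and both helper functions are hypothetical.

```python
from typing import List, Sequence, Tuple


def char_positions(words: Sequence[str]) -> List[Tuple[str, int, int]]:
    """Return (character, word index, index within word) for every character."""
    positions = []
    for word_index, word in enumerate(words):
        for char_index, char in enumerate(word):
            positions.append((char, word_index, char_index))
    return positions


def crosses_word_boundary(words: Sequence[str], i: int) -> bool:
    """True if characters i and i + 1 of the utterance belong to different words."""
    positions = char_positions(words)
    return positions[i][1] != positions[i + 1][1]


# Hypothetical segmentation of 我有两张; a real segmenter may split it differently.
words = ["我", "有", "两张"]
print(char_positions(words))
print(crosses_word_boundary(words, 0))  # True: 我 | 有
print(crosses_word_boundary(words, 2))  # False: 两 and 张 share a word
```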

Tone Annotation

Tone annotation is conducted using g2pM.

g2pM - Grapheme to Phoneme for Mandarin
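
With tones annotated, adjacent tone pairs such as 3-3 (the classic third-tone sandhi context) can be located by scanning the syllable sequence. The following sketch assumes the annotation is available as numbered-tone pinyin strings such as `ni3`; that input format is an assumption here, and the example syllables are written by hand rather than produced by g2pM. The helper names are illustrative.

```python
from typing import List, Sequence, Tuple


def tone_of(syllable: str) -> int:
    """Return the trailing tone digit of a numbered-pinyin syllable, e.g. 'ni3' -> 3."""
    if not syllable or not syllable[-1].isdigit():
        raise ValueError(f"no tone digit in syllable: {syllable!r}")
    return int(syllable[-1])


def find_tone_pairs(syllables: Sequence[str], pair: Tuple[int, int]) -> List[int]:
    """Return each index i where syllables i and i + 1 carry the given tone pair."""
    tones = [tone_of(s) for s in syllables]
    return [i for i in range(len(tones) - 1) if (tones[i], tones[i + 1]) == pair]


# Hand-written pinyin for 我有两张 (not g2pM output).
syllables = ["wo3", "you3", "liang3", "zhang1"]
print(find_tone_pairs(syllables, (3, 3)))  # [0, 1]
print(find_tone_pairs(syllables, (3, 2)))  # []
```

Overlapping pairs are reported separately, so a run of three third tones yields two 3-3 positions.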

Features

We generated the following features for each character in the utterances across all three datasets:

References

If you want to use the data, please cite the following papers:

For Tone Sandhi Data:

@inproceedings{Tian2022MandarinTS,
  title={Mandarin Tone Sandhi Realization: Evidence from Large Speech Corpora},
  author={Zuoyu Tian and Xiao Dong and Feier Gao and Haining Wang and Charles Steven Lin},
  booktitle={Interspeech},
  year={2022},
  url={https://www.isca-speech.org/archive/pdfs/interspeech_2022/tian22e_interspeech.pdf}
}

For Forced Alignment Files:

@article{zhu2022charsiu,
  title={Phone-to-audio alignment without text: A Semi-supervised Approach},
  author={Zhu, Jian and Zhang, Cong and Jurgens, David},
  journal={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2022}
}

Repository Files

README.md
datatang_23.pkl
datatang_32.pkl
datatang_33.pkl
magicdata_23.pkl
magicdata_32.pkl
magicdata_33.pkl
stcmds_23.pkl
stcmds_32.pkl
stcmds_33.pkl
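
The internal structure of the `.pkl` files listed above is not documented here, so a cautious first step is to load one and check what kind of object it holds. The sketch below does only that and makes no assumption about the contents. If the files were written from a third-party library such as pandas, that library must be installed for unpickling to succeed. The `describe_pickle` helper is an illustrative name.

```python
import pickle
from pathlib import Path
from typing import Any


def describe_pickle(path: Path) -> str:
    """Load a pickle file and summarize the top-level object without assuming its type."""
    with path.open("rb") as handle:
        obj: Any = pickle.load(handle)  # only load pickles from sources you trust
    size = len(obj) if hasattr(obj, "__len__") else "n/a"
    return f"{path.name}: type={type(obj).__name__}, len={size}"


if __name__ == "__main__":
    # Summarize every .pkl file in the current directory.
    for pkl_path in sorted(Path(".").glob("*.pkl")):
        print(describe_pickle(pkl_path))
```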
