Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble (EMNLP 2020)
Peerat Limkonchotiwat, Raheem Sarwar, Wannaphong Phatthiyaphaibun, Ekapol Chuangsuwanich, Sarana Nutanong
CRF as Stacked Model and DeepCut as Baseline model
- Paper: Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble
- Blog: Domain Adaptation กับตัวตัดคำ มันดีย์จริงๆ (in Thai; roughly "Domain adaptation for word segmenters: it really works")
pip install sefr_cut
- python >= 3.6
- python-crfsuite >= 0.9.7
- pyahocorasick == 1.4.0
You can try the examples in the SEFR Example notebook
- ws1000, tnhc, best
- ws1000: Model trained on Wisesight-1000 and tested on Wisesight-160
- tnhc: Model trained on TNHC (80:20 train/test split with random seed 42)
- best: Model trained on the BEST-2010 corpus (NECTEC)
import sefr_cut
sefr_cut.load_model(engine='ws1000')
# OR sefr_cut.load_model(engine='tnhc')
# OR sefr_cut.load_model(engine='best')
- tl-deepcut-XXXX
- We also provide deepcut models adapted by transfer learning on Wisesight (tl-deepcut-ws1000) and on TNHC (tl-deepcut-tnhc)
import sefr_cut
sefr_cut.load_model(engine='tl-deepcut-ws1000')
# OR sefr_cut.load_model(engine='tl-deepcut-tnhc')
- deepcut
- We also provide the original deepcut
import sefr_cut
sefr_cut.load_model(engine='deepcut')
- Segment with default k
import sefr_cut
sefr_cut.load_model(engine='ws1000')
print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ']))  # list of strings
print(sefr_cut.tokenize(['สวัสดีประเทศไทย']))  # single-item list
print(sefr_cut.tokenize('สวัสดีประเทศไทย'))  # plain string
[['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
[['สวัสดี', 'ประเทศ', 'ไทย']]
[['สวัสดี', 'ประเทศ', 'ไทย']]
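The outputs above suggest that tokenize returns one token list per input text, even when a single string is passed. A minimal sketch of post-processing that shape follows. It uses a result copied from the example instead of calling the library, and the helper name to_delimited is hypothetical, not part of sefr_cut.

```python
from typing import List

def to_delimited(segmented: List[List[str]], sep: str = "|") -> List[str]:
    """Join each token list into one separator-delimited string."""
    return [sep.join(tokens) for tokens in segmented]

# Result copied from the example above (not produced by running the library here).
segmented = [['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]

delimited = to_delimited(segmented)
print(delimited)

# Joining the tokens without a separator should give back the input text.
restored = ["".join(tokens) for tokens in segmented]
print(restored)
```

A pipe-delimited string is a common format for Thai segmentation corpora, so a helper like this can be handy when comparing predictions against gold data.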
- Segment with different k
import sefr_cut
sefr_cut.load_model(engine='ws1000')
print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=5))  # refine only 5% of the characters
print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=100))  # refine 100% of the characters
[['สวัสดี', 'ประเทศไทย'], ['ลุงตู่', 'สู้', 'ๆ']]
[['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
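To give some intuition for k, here is a minimal sketch of how a "refine k% of the characters" budget could be applied. It is not the library's implementation. It assumes the baseline model gives a word-boundary probability per character, that characters are ranked by binary entropy (most uncertain first), and that the budget is rounded up. The function names and the probabilities are made up for illustration.

```python
import math
from typing import List

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a boundary / no-boundary decision with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_for_refinement(boundary_probs: List[float], k: float) -> List[int]:
    """Return the positions of the k% most uncertain characters (assumed criterion)."""
    n = len(boundary_probs)
    budget = math.ceil(n * k / 100)  # rounding up is an assumption
    ranked = sorted(range(n),
                    key=lambda i: binary_entropy(boundary_probs[i]),
                    reverse=True)
    return sorted(ranked[:budget])

# Hypothetical baseline probabilities for a 10-character text.
probs = [0.99, 0.02, 0.55, 0.97, 0.40, 0.01, 0.90, 0.05, 0.50, 0.98]

few = select_for_refinement(probs, k=20)    # only the least certain positions
full = select_for_refinement(probs, k=100)  # every position
print(few)
print(full)
```

Under these assumptions, a small k leaves most of the baseline's decisions untouched and hands only the least certain characters to the stacked model, while k=100 lets the stacked model reconsider every character.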
- Character-level and word-level evaluation is provided by calling the function
evaluation()
- For example
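As an illustration of what character-level and word-level scores measure, here is a self-contained sketch. It is not the library's evaluation() function, whose signature is not shown here. It assumes the character-level score is an F1 over word-start positions and the word-level score is an F1 over exact word spans. All function names are hypothetical.

```python
from typing import List, Set, Tuple

def word_spans(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Map tokens to (start, end) character offsets in the original text."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans

def word_starts(tokens: List[str]) -> Set[int]:
    """Character offsets at which a word begins."""
    return {start for start, _ in word_spans(tokens)}

def f1(gold: set, pred: set) -> float:
    """F1 between two sets of predicted and gold items."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Pipe-delimited gold and predicted segmentations of the same text.
gold = "สวัสดี|ประเทศไทย".split("|")
pred = "สวัสดี|ประเทศ|ไทย".split("|")

char_f1 = f1(word_starts(gold), word_starts(pred))
word_f1 = f1(word_spans(gold), word_spans(pred))
print(f"Char F1: {char_f1:.2f}  Word F1: {word_f1:.2f}")
```

The character-level score is more forgiving here because the extra split still agrees with most gold boundaries, whereas the word-level score only credits words whose start and end both match.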
- You can retrain the models with the notebooks in the Notebooks folder, which contains everything you need.
- You need to XXXXXXXXXXX
- Link:HERE
- You need to XXXXXXXXXXX
- Link:HERE
- You need to XXXXXXXXXXX
- To be added once our paper appears in the ACL Anthology
Thanks to the following projects, from which we borrowed code:
- Deepcut (Baseline Model): We used some code from Deepcut to perform transfer learning
- @bact (CRF training code): We used some code from https://github.com/bact/nlp-thai to train the CRF model