Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble
CRF as the stacked model and DeepCut as the baseline model
Paper: Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble
Blog: Domain Adaptation กับตัวตัดคำ มันดีย์จริงๆ (in Thai; roughly "Domain Adaptation for word segmenters: it's really good")
pip install sefr_cut
- python >= 3.6
- python-crfsuite >= 0.9.7
- pyahocorasick == 1.4.0
You can try the examples in the SEFR Example notebook.
- ws1000, tnhc, best
  - ws1000: Model trained on Wisesight-1000 and tested on Wisesight-160
  - tnhc: Model trained on TNHC (80:20 train/test split with random seed 42)
  - best: Model trained on the BEST-2010 corpus (NECTEC)
    import sefr_cut
    sefr_cut.load_model(engine='ws1000')
    # OR sefr_cut.load_model(engine='tnhc')
    # OR sefr_cut.load_model(engine='best')
- tl-deepcut-XXXX
  - We also provide DeepCut models adapted by transfer learning: on Wisesight as tl-deepcut-ws1000 and on TNHC as tl-deepcut-tnhc
    sefr_cut.load_model(engine='tl-deepcut-ws1000')
    # OR sefr_cut.load_model(engine='tl-deepcut-tnhc')
- deepcut
  - We also provide the original DeepCut model
    sefr_cut.load_model(engine='deepcut')
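The tnhc engine above is described as using an 80:20 train/test split with random seed 42. A minimal sketch of a reproducible split of that shape is given below. It uses only the Python standard library; the helper name `split_80_20` and the sample sentences are hypothetical, and the splitting routine actually used to build the tnhc data is not stated here, so the same seed will not necessarily reproduce the authors' exact partition.

```python
import random


def split_80_20(items, seed=42):
    """Shuffle a copy of `items` with a fixed seed, then cut at 80%.

    Hypothetical helper for illustration only; not part of sefr_cut.
    """
    shuffled = list(items)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # private RNG, so the global state is unaffected
    cut = int(len(shuffled) * 0.8)          # first 80% for training
    return shuffled[:cut], shuffled[cut:]


# Placeholder data standing in for corpus sentences.
sentences = [f"sentence-{i}" for i in range(10)]
train, test = split_80_20(sentences)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable across runs, which is what allows scores on the held-out 20% to be compared between models.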
- Segment with default k
    sefr_cut.load_model(engine='ws1000')
    print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ']))
    # [['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
    print(sefr_cut.tokenize(['สวัสดีประเทศไทย']))
    # [['สวัสดี', 'ประเทศ', 'ไทย']]
    print(sefr_cut.tokenize('สวัสดีประเทศไทย'))
    # [['สวัสดี', 'ประเทศ', 'ไทย']]
- Segment with different k
    sefr_cut.load_model(engine='ws1000')
    print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=5)) # refine only 5% of the characters
    # [['สวัสดี', 'ประเทศไทย'], ['ลุงตู่', 'สู้', 'ๆ']]
    print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=100)) # refine 100% of the characters
    # [['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
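The k argument above is described as the percentage of characters that get refined. The sketch below shows one way such a filter-and-refine step could work, under assumptions that are not taken from the sefr_cut source: the baseline is assumed to give a word-boundary probability per character, uncertainty is assumed to be measured by binary entropy, and the number of refined positions is assumed to be rounded up. The names `binary_entropy` and `filter_and_refine` and all sample values are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no boundary prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def filter_and_refine(base_probs, refined_labels, k):
    """Keep baseline labels, except on the k% most uncertain characters.

    base_probs:     baseline probability that each character starts a word
    refined_labels: labels a second (stacked) model would assign
    k:              percentage of characters to refine (0 to 100)
    """
    n = len(base_probs)
    # Assumption: round up; the library may round differently.
    n_refine = min(n, math.ceil(n * k / 100))
    # Rank positions from most to least uncertain.
    ranked = sorted(range(n), key=lambda i: binary_entropy(base_probs[i]),
                    reverse=True)
    labels = [1 if p >= 0.5 else 0 for p in base_probs]
    for i in ranked[:n_refine]:
        labels[i] = refined_labels[i]   # overwrite only the filtered positions
    return labels


# Made-up values for a 10-character string.
base_probs = [0.99, 0.02, 0.55, 0.01, 0.97, 0.40, 0.03, 0.02, 0.98, 0.01]
refined = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]

few = filter_and_refine(base_probs, refined, k=10)
everything = filter_and_refine(base_probs, refined, k=100)
print(few)         # [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
print(everything)  # [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]
```

With k=10 only the single most uncertain position (probability 0.55) takes the refined label, while k=100 replaces every label, which mirrors the two comments in the example above.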
- You can re-train the model with the notebooks in the Notebooks folder: training is done in a single file, and the example is shown in a single file.
- Our paper is not yet listed in the ACL Anthology; please wait until it appears there.
Thanks to the following projects, whose code we reused:
- Deepcut (baseline model): we used some code from Deepcut to perform transfer learning.
- @bact (CRF training code): we used some code from https://github.com/bact/nlp-thai to train the CRF model.