
Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble (EMNLP2020)


mrpeerat/SEFR_CUT


SEFR CUT

Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble
CRF as the stacked model and DeepCut as the baseline model

Read more:

  • Paper: Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble
  • Blog (in Thai): Domain Adaptation กับตัวตัดคำ มันดีย์จริงๆ ("Domain Adaptation and word segmenters: it really works")

Install

pip install sefr_cut

How to use

Requirements

  • python >= 3.6
  • python-crfsuite >= 0.9.7
  • pyahocorasick == 1.4.0
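
To check an environment against the list above, the installed versions can be read from package metadata. This is a minimal sketch, not part of sefr_cut: `REQUIRED` and `installed_versions` are hypothetical names, and the script only reports versions without enforcing the pins.

```python
import sys
from importlib import metadata  # standard library since Python 3.8

# Package names copied from the requirements list above.
REQUIRED = ("python-crfsuite", "pyahocorasick")


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version string, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    print("python", ".".join(map(str, sys.version_info[:3])))
    for name, version in installed_versions().items():
        print(name, version or "not installed")
```

Compare the printed versions with the requirements by hand; the sketch deliberately does no version comparison.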

Example

You can try the examples in the SEFR Example notebook.

Load Engine & Engine Mode

  • ws1000, tnhc, best
    • ws1000: Model trained on Wisesight-1000 and tested on Wisesight-160
    • tnhc: Model trained on TNHC (80:20 train/test split with random seed 42)
    • best: Model trained on the BEST-2010 corpus (NECTEC)
    import sefr_cut
    sefr_cut.load_model(engine='ws1000')
    # OR
    sefr_cut.load_model(engine='tnhc')
    # OR
    sefr_cut.load_model(engine='best')
    
  • tl-deepcut-XXXX
    • Transfer-learned versions of deepcut are also provided: tl-deepcut-ws1000 (trained on Wisesight) and tl-deepcut-tnhc (trained on TNHC)
    sefr_cut.load_model(engine='tl-deepcut-ws1000')
    # OR
    sefr_cut.load_model(engine='tl-deepcut-tnhc')
    
  • deepcut
    • The original deepcut model is also available
    sefr_cut.load_model(engine='deepcut')
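
Since the engine is selected by a plain string, a typo is easy to make. The sketch below collects the engine names listed above and rejects anything else before a model is loaded. It is not part of the package: `KNOWN_ENGINES` and `check_engine` are hypothetical names, and the tuple only mirrors the engines mentioned in this README.

```python
# Hypothetical helper, not part of sefr_cut: validates an engine name
# against the names listed in this README before a model is loaded.
KNOWN_ENGINES = (
    "ws1000", "tnhc", "best",                # stacked-ensemble models
    "tl-deepcut-ws1000", "tl-deepcut-tnhc",  # transfer-learned deepcut
    "deepcut",                               # original deepcut
)


def check_engine(name):
    """Return the engine name unchanged, or raise ValueError if it is not listed."""
    if name not in KNOWN_ENGINES:
        options = ", ".join(KNOWN_ENGINES)
        raise ValueError(f"unknown engine {name!r}; expected one of: {options}")
    return name


if __name__ == "__main__":
    print(check_engine("tl-deepcut-ws1000"))
```

With the package installed, the checked name can then be passed on, for example `sefr_cut.load_model(engine=check_engine('tnhc'))`.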
    

Segment Example

  • Segment with default k
    sefr_cut.load_model(engine='ws1000')
    print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ']))  # list of two texts
    print(sefr_cut.tokenize(['สวัสดีประเทศไทย']))  # list with one text
    print(sefr_cut.tokenize('สวัสดีประเทศไทย'))  # a single string also returns a list of token lists
    
    [['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
    [['สวัสดี', 'ประเทศ', 'ไทย']]
    [['สวัสดี', 'ประเทศ', 'ไทย']]
    
  • Segment with different k
    sefr_cut.load_model(engine='ws1000')
    print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=5)) # refine only 5% of the characters
    print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=100)) # refine 100% of the characters
    
    [['สวัสดี', 'ประเทศไทย'], ['ลุงตู่', 'สู้', 'ๆ']]
    [['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
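
The `k` argument above is described as the percentage of characters that get refined. This README does not say how those characters are chosen, so the following is only a sketch under one assumption: that the characters whose baseline boundary probability is closest to 0.5 (the most uncertain ones) are picked first. `positions_to_refine` and the probabilities are hypothetical and do not come from the package, and rounding the budget up is also an assumption.

```python
import math


def positions_to_refine(boundary_probs, k):
    """Pick k% of character positions, most uncertain first (assumed criterion)."""
    if not 0 <= k <= 100:
        raise ValueError("k must be between 0 and 100")
    n = len(boundary_probs)
    budget = math.ceil(n * k / 100)  # number of characters to refine
    # Rank positions by distance from 0.5; a smaller distance means more uncertain.
    ranked = sorted(range(n), key=lambda i: abs(boundary_probs[i] - 0.5))
    return sorted(ranked[:budget])


# Made-up baseline probabilities that a word boundary follows each character.
probs = [0.99, 0.02, 0.55, 0.10, 0.48, 0.97, 0.03, 0.60, 0.01, 0.95]
print(positions_to_refine(probs, k=20))   # 2 of the 10 positions
print(positions_to_refine(probs, k=100))  # every position
```

Under this assumption, a small `k` leaves most of the baseline output untouched, while `k=100` lets the stacked model revisit every character.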
    

Performance

How to re-train?

  • To re-train the model, use the Notebooks folder: re-training is done in one file and the example is shown in one file.

Citation

  • To be added once our paper appears in the ACL Anthology

Thanks for much of the code from
