Statistical tokenization algorithms #207

Closed
tejasvaidhyadev opened this issue Apr 23, 2020 · 2 comments

@tejasvaidhyadev (Member) commented Apr 23, 2020

Hi @aviks and @oxinabox,
Statistical tokenizers are used in a lot of Transformer-based models, including the BERT family, because of their ability to tackle the out-of-vocabulary problem.
I have gone through the tokenizers in WordTokenizers.jl, which I think are pretty good and fast, and it would be great if we could build statistical tokenizers such as BPE, unigram language models, etc. on top of them.
I have gone through the following papers:
BPE
unigram language model
Any suggestions on how to proceed?
Where should we keep it: TextAnalysis.jl or WordTokenizers.jl?
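To make the proposal concrete, here is a minimal sketch of the BPE training loop described in the paper: count adjacent symbol pairs weighted by word frequency, merge the most frequent pair, and repeat. It is written in Python purely for illustration (an actual implementation would be Julia code in WordTokenizers.jl), and all function names are hypothetical.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Rewrite every word, replacing `pair` with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # merge the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency table."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, 3)
# → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

At tokenization time, the learned merge list is replayed in order on an unseen word, so any word can always fall back to characters, which is what makes the approach robust to out-of-vocabulary input.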

@aviks (Member) commented Apr 23, 2020

I think this should go into WordTokenizers.jl, unless there are big new dependencies, in which case it should probably go in its own package.

@oxinabox (Member) commented

Closing in favor of
JuliaText/WordTokenizers.jl#44
