Statistical tokenization algorithms #207

Closed
tejasvaidhyadev opened this issue Apr 23, 2020 · 2 comments

@tejasvaidhyadev (Member) commented Apr 23, 2020

Hi @aviks and @oxinabox,
Statistical tokenizers are used in a lot of Transformer-based models, including the BERT family, because of their ability to tackle the out-of-vocabulary problem.
I have gone through the tokenizers in WordTokenizers.jl, which I think are pretty good and fast, and it would be great if we could build statistical tokenizers such as BPE, unigram language models, etc. on top of them.
I have gone through the following papers:
BPE
unigram language model
Any suggestions on how to proceed?
Where should we keep it: TextAnalysis.jl or WordTokenizers.jl?
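To make the proposal concrete, here is a minimal sketch of the BPE training loop described in the paper: count adjacent symbol pairs weighted by word frequency, merge the most frequent pair, and repeat. It is written in Python purely for illustration (an actual implementation would be Julia code in WordTokenizers.jl), and all function names are hypothetical.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Rewrite every word, replacing `pair` with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # merge the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency table."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, 3)
# → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

At tokenization time, the learned merge list is replayed in order on an unseen word, so any word can always fall back to characters, which is what makes the approach robust to out-of-vocabulary input.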

@aviks (Member) commented Apr 23, 2020

I think this should go into WordTokenizers.jl, unless there are big new dependencies, in which case it should probably go in its own package.

@oxinabox (Member) commented

Closing in favor of
JuliaText/WordTokenizers.jl#44
