Rehearsal.py is using ontoNotes raw format not bilou format #17

Open
YanLiang1102 opened this issue Jul 12, 2018 · 8 comments

@YanLiang1102
Collaborator

It would be better for rehearsal.py to read the BILOU format and convert that to Prodigy format. If it reads the raw OntoNotes format, it does not take advantage of the merged NER tags and the merged AnerCorp data that we already worked on.

@ahalterman
Collaborator

It is converting it to Prodigy format before putting it into the DB. See here.

@YanLiang1102
Collaborator Author

@ahalterman yeah, I got that, but the problem is that it reads the annotation format that comes directly from OntoNotes, not the BILOU format. When I passed in BILOU-format data, it reported 0 records transferred, but with OntoNotes-format data everything got transferred.

@ahalterman
Collaborator

I was confused: the current rehearsal.py uses CoNLL format, not BILOU. Change rehearsal.py to handle BILOU formats, too.

@YanLiang1102
Collaborator Author

@ahalterman do you get it now, Andy? We need rehearsal to mix in BILOU with Prodigy, not CoNLL with Prodigy.

ahalterman added a commit that referenced this issue Jul 31, 2018
@ahalterman
Collaborator

I just added some code to do this, along with the code needed to use Arabic. (It was giving me some major git errors when I tried to put this in master.) I realized I'm still confused, though: Prodigy doesn't handle BILOU, only spans. So are you training with spaCy or Prodigy for this step?

@YanLiang1102
Collaborator Author

YanLiang1102 commented Jul 31, 2018

@ahalterman
So the problem is that I am using Prodigy to train, but as you said, Prodigy only looks at the CoNLL format, not BILOU, so our previous efforts, like merging the tag classes and attaching AnerCorp, were all in vain, since it does not look at the cleaned BILOU format. I was wondering whether there is any quick and dirty way to change the BILOU format into CoNLL instead of looking directly at the raw OntoNotes data, so that Prodigy can read it directly. Otherwise we need to redo the preprocessing on the CoNLL format before we can train on Prodigy.
Does it make sense this time?
:)
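
For illustration, a quick tag-level conversion from BILOU to CoNLL-style tags could look roughly like the sketch below. It assumes that "CoNLL" here means IOB2-style tags (`B-`, `I-`, `O`) and that the BILOU data is available as a list of tag strings per sentence. The function name `bilou_to_iob` is hypothetical and is not part of rehearsal.py.

```python
def bilou_to_iob(tags):
    """Map BILOU tags to IOB2 tags (assumed target format).

    U-X (single-token entity) becomes B-X, L-X (last token) becomes I-X;
    B-X, I-X and O are left unchanged.
    """
    converted = []
    for tag in tags:
        if tag.startswith("U-"):
            converted.append("B-" + tag[2:])
        elif tag.startswith("L-"):
            converted.append("I-" + tag[2:])
        else:
            converted.append(tag)
    return converted


# Illustrative tags only, not taken from the project data.
example = ["O", "B-PER", "I-PER", "L-PER", "U-LOC"]
print(bilou_to_iob(example))
```

Because the mapping only rewrites tag prefixes, the merged tag classes already present in the BILOU data would carry over unchanged.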

@ahalterman
Collaborator

🤦‍♂️ So we need it to go from BILOU to Prodigy format...got it. Sorry about my confusion!
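
A minimal sketch of what a BILOU-to-Prodigy conversion might look like is below. It is not the code from the referenced commit. It assumes each sentence is a list of tokens with a parallel list of BILOU tags, that the text can be rebuilt by joining tokens with single spaces, and that the target is a Prodigy-style task dict with `text` and character-offset `spans` (`start`, `end`, `label`). The function name `bilou_to_prodigy` is hypothetical.

```python
def bilou_to_prodigy(tokens, tags):
    """Convert BILOU-tagged tokens into a Prodigy-style task dict.

    Assumption: tokens are joined with single spaces, so the character
    offsets refer to that reconstructed text, not to the original document.
    """
    text = " ".join(tokens)
    spans = []
    offset = 0      # character offset of the current token in `text`
    start = None    # start offset of a multi-token entity in progress
    for token, tag in zip(tokens, tags):
        end = offset + len(token)
        prefix, _, label = tag.partition("-")
        if prefix == "U":        # single-token entity
            spans.append({"start": offset, "end": end, "label": label})
            start = None
        elif prefix == "B":      # entity begins here
            start = offset
        elif prefix == "L" and start is not None:  # entity ends here
            spans.append({"start": start, "end": end, "label": label})
            start = None
        elif prefix == "O":      # outside any entity
            start = None
        # "I" tags just continue the entity in progress
        offset = end + 1         # skip the joining space
    return {"text": text, "spans": spans}


# Illustrative sentence only, not taken from the project data.
task = bilou_to_prodigy(
    ["Barack", "Obama", "visited", "Cairo"],
    ["B-PERSON", "L-PERSON", "O", "U-GPE"],
)
print(task)
```

Offsets are counted in Python string characters, so the same approach should apply to Arabic text as long as the tokens are joined the same way when the text is rebuilt.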

@YanLiang1102
Collaborator Author

@ahalterman no problem, :)
