Cleaning code #19

Open
ahalterman opened this issue Jul 25, 2018 · 8 comments

Comments

@ahalterman
Collaborator

@YanLiang1102, can you post the code that produces combined_cleaned_removed (from exp 5)? Then @khaledJabr can take a look and we can make sure all the data's in the right/same format.

@YanLiang1102
Collaborator

YanLiang1102 commented Jul 25, 2018

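def cleanupToken(data):
    # Clean the text of every token in place and return how many tokens were seen.
    # Only the first paragraph of each document is visited.
    token_count = 0
    for d in data:
        para = d['paragraphs'][0]
        for sen in para['sentences']:
            for tok in sen['tokens']:
                token_count += 1
                # cleantext is defined earlier in the notebook linked below
                tok['orth'] = cleantext(tok['orth'])
    return token_count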

The code is at the end of this notebook: https://github.com/oudalab/Arabic-NER/blob/master/explore_traingdata.ipynb
@khaledJabr @ahalterman
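
For readers who cannot open the notebook, here is a minimal sketch of what a `cleantext` helper could look like. It assumes that "cleaning" means stripping Arabic diacritics, which matches the "removed diacritics" description of the output file later in this thread. The exact character set is an assumption and is not taken from the notebook.

```python
import re

# Assumption: the characters removed are the Arabic short-vowel marks
# (tashkeel, U+064B to U+0652), the superscript alef (U+0670) and the
# tatweel/kashida (U+0640). The notebook's real cleantext may differ.
_ARABIC_DIACRITICS = re.compile('[\u064B-\u0652\u0670\u0640]')


def cleantext(text):
    """Return text with the diacritics listed above removed."""
    return _ARABIC_DIACRITICS.sub('', text)


# "kitab" written with a kasra and a fatha, then cleaned
vocalized = '\u0643\u0650\u062A\u064E\u0627\u0628'
print(cleantext(vocalized))
```

Non-Arabic characters and base letters pass through unchanged, so the function can be applied to every `orth` value without first checking the script.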

@YanLiang1102
Collaborator

Here is the command for converting the OntoNotes format to BILOU format:

python onto_to_spacy_json.py -i "ontonotes-release-5.0/data/arabic/annotations/nw/ann/00" -t "ar_train.json" -e "ar_eval.json" -v 0.1
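
As an illustration of the BILOU scheme the command targets, the sketch below converts a sentence's IOB2 tags to BILOU. It is not the logic of `onto_to_spacy_json.py`; the function name `iob_to_bilou` is hypothetical, and the input is assumed to be well-formed IOB2 (every entity starts with a `B-` tag).

```python
def iob_to_bilou(tags):
    """Convert a list of IOB2 tags for one sentence to BILOU tags.

    B = begin, I = inside, L = last, O = outside, U = unit (single-token entity).
    """
    bilou = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            bilou.append('O')
            continue
        prefix, label = tag.split('-', 1)
        # The entity continues only if the next tag is I- with the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else 'O'
        continues = next_tag == 'I-' + label
        if prefix == 'B':
            bilou.append(('B-' if continues else 'U-') + label)
        else:  # prefix == 'I'
            bilou.append(('I-' if continues else 'L-') + label)
    return bilou


print(iob_to_bilou(['B-PER', 'I-PER', 'O', 'B-LOC']))
# ['B-PER', 'L-PER', 'O', 'U-LOC']
```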

@YanLiang1102
Collaborator

@khaled I will send you the OntoNotes raw data tomorrow; it is on my lab computer.

@YanLiang1102
Collaborator

@ahalterman Hi Andy, do you still have the LDC raw data? I could not find it on my local machine and do not remember where I put it. We can give it to Khaled so he can take a look.

@ahalterman
Collaborator Author

Just sent you and Khaled a message.

@YanLiang1102
Collaborator

YanLiang1102 commented Jul 29, 2018

@khaled @ahalterman
I used onto_spacy_json.py to convert the CoNLL format to BILOU. For ANERcorp, I put all the tagged text into a single document and appended it to the LDC data. LDC has 401 docs and ANERcorp has just one. ANERcorp does not have many tokens, so we can ignore it for now. If you want to check whether the LDC data is right, just use the first 401 docs. Let me know if you need more info on this.

After that, I merged the tags into a common set, using the tag labels that appear in both ANERcorp and LDC.
The data are here: /home/yan/nerdata on Manchester

ar_eval_all.json
ar_train_all.json (these two have no merged tags and no diacritics removed)

ar_eval_all_cleaned.json
combined.json (these two have the merged tags; ignore the last doc in combined.json and just look at the first 401 docs, since the last one is ANERcorp)
cleaned_combined_removed.json (this is the version with merged tags and diacritics removed)
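
A quick way to sanity-check the LDC portion is to slice off the first 401 docs and count the entity labels. The sketch below assumes the files follow the nested `paragraphs` / `sentences` / `tokens` layout used by `cleanupToken` earlier in this thread, and that each token stores its BILOU tag under a `ner` key. The `ner` key and both function names are assumptions, not taken from the files themselves.

```python
from collections import Counter

LDC_DOC_COUNT = 401  # number of LDC docs stated in the comment above


def ldc_only(docs, n_ldc=LDC_DOC_COUNT):
    """Return only the leading LDC docs, dropping the trailing ANERcorp doc."""
    return docs[:n_ldc]


def label_counts(docs):
    """Count tokens per entity label, ignoring the BILOU prefix."""
    counts = Counter()
    for doc in docs:
        for para in doc['paragraphs']:
            for sent in para['sentences']:
                for tok in sent['tokens']:
                    ner = tok.get('ner', 'O')  # assumed key name
                    label = ner.split('-', 1)[1] if '-' in ner else ner
                    counts[label] += 1
    return counts


# Tiny in-memory stand-in for the real data (hypothetical content).
example_doc = {'paragraphs': [{'sentences': [{'tokens': [
    {'orth': 'a', 'ner': 'U-PER'},
    {'orth': 'b', 'ner': 'O'},
]}]}]}
docs = [example_doc, example_doc]
print(label_counts(ldc_only(docs, n_ldc=1)))
```

With the real data, the docs would come from `json.load` on combined.json, and the label set printed for the first 401 docs should contain only the merged tags.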

@khaled

khaled commented Jul 31, 2018

@YanLiang1102, FYI you're mentioning the wrong khaled - I have no connection with this project :-)

@YanLiang1102
Collaborator

@khaledJabr Hey Khaled, I hope you saw the stuff above. I mentioned the wrong Khaled :P
