Cleaning code #19

ahalterman · 2018-07-25T20:29:37Z

@YanLiang1102, can you post the code that produces combined_cleaned_removed (from exp 5)? Then @khaledJabr can take a look and we can make sure all the data's in the right/same format.


YanLiang1102 · 2018-07-25T22:23:09Z

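def cleanupToken(data):
    """Clean every token's text in place and return the number of tokens seen."""
    token_count = 0
    for d in data:
        # Only the first paragraph of each document is processed.
        para = d['paragraphs'][0]
        for sen in para['sentences']:
            for tok in sen['tokens']:
                token_count += 1
                # cleantext is defined earlier in the notebook linked below.
                tok['orth'] = cleantext(tok['orth'])
    return token_count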

The code is at the end of this notebook: https://github.com/oudalab/Arabic-NER/blob/master/explore_traingdata.ipynb
@khaledJabr @ahalterman
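
For anyone reading without the notebook open: cleantext itself is not shown in this thread. Below is a minimal sketch of what such a helper might look like, assuming it only strips Arabic diacritics (the later file description mentions a "removed diacritics" version). The regex range and the function body are assumptions, not the notebook's actual code.

```python
import re

# Assumption: "cleaning" means removing Arabic diacritics (tashkeel).
# U+064B to U+0652 are the common harakat; U+0670 is the superscript alef.
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def cleantext(text):
    """Hypothetical stand-in for the notebook's cleantext helper."""
    return ARABIC_DIACRITICS.sub("", text)


# A word with three fatha marks, written with escapes so the marks are visible.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = cleantext(vocalized)
print(len(vocalized), len(bare))
```

With a helper like this in scope, cleanupToken above would rewrite each token's 'orth' field without touching its tag.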

YanLiang1102 · 2018-07-25T22:31:10Z

Here is the command for converting the OntoNotes format to BILOU format:

python onto_to_spacy_json.py -i "ontonotes-release-5.0/data/arabic/annotations/nw/ann/00" -t "ar_train.json" -e "ar_eval.json" -v 0.1
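
As a quick illustration of the target scheme: BILOU marks the Beginning, Inside, and Last token of a multi-token entity, a single-token entity as Unit, and everything else as Outside. The sketch below converts plain BIO tags to BILOU. It is only an illustration of the tag scheme under that assumption, not what onto_to_spacy_json.py does internally, and the function name is made up for this example.

```python
def bio_to_bilou(tags):
    """Convert a list of BIO tags (e.g. 'B-PER', 'I-PER', 'O') to BILOU."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append("O")
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            bilou.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            bilou.append(("I-" if continues else "L-") + label)
    return bilou


print(bio_to_bilou(["B-PER", "I-PER", "O", "B-LOC"]))
```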

YanLiang1102 · 2018-07-25T22:35:06Z

@khaled I will send you the OntoNotes raw data tomorrow; it is on my lab computer.

YanLiang1102 · 2018-07-26T14:59:11Z

@ahalterman Hi Andy, do you still have the LDC raw data? I could not find it on my local machine and do not remember where I put it. We can give it to Khaled so he can take a look.

ahalterman · 2018-07-27T12:33:03Z

Just sent you and Khaled a message.

YanLiang1102 · 2018-07-29T17:48:22Z

@khaled @ahalterman
I used onto_to_spacy_json.py to convert the CoNLL format to BILOU. For ANERcorp, I put all the tagged text into a single document and appended it to the LDC data. LDC has 401 docs and ANERcorp has just one. ANERcorp does not have many tokens, so we can ignore it for now. To check whether the LDC data is right, just use the first 401 docs. Let me know if you need more info on this.

After that I merged the tags into common ones, i.e. the tag labels that appear in both ANERcorp and LDC.
The data are here: /home/yan/nerdata on Manchester

ar_eval_all.json
ar_train_all.json (these two have no merged tags and no diacritics removed)

ar_eval_all_cleaned.json
combined.json (these two have the merged tags; to leave out ANERcorp in combined.json, just look at the first 401 docs, since the last doc is ANERcorp)
cleaned_combined_removed.json (the version with merged tags and diacritics removed)
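
A rough sketch for checking the LDC part of one of these files: load the JSON, keep the first 401 docs, and count entity labels. It assumes the file is a list of docs shaped like the data in cleanupToken (paragraphs, sentences, tokens) and that each token carries its BILOU tag under a 'ner' key, as in spaCy's JSON training format. The helper names are made up for this example.

```python
import json
from collections import Counter

N_LDC_DOCS = 401  # per the comment above: the first 401 docs are LDC


def load_docs(path):
    """Load a training file, assumed to be a JSON list of docs."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def ldc_only(docs, n_ldc=N_LDC_DOCS):
    """Keep the LDC docs and drop whatever was appended after them."""
    return docs[:n_ldc]


def iter_tokens(docs):
    """Yield every token dict across all docs, paragraphs, and sentences."""
    for doc in docs:
        for para in doc["paragraphs"]:
            for sent in para["sentences"]:
                yield from sent["tokens"]


def tag_counts(docs):
    """Count entity labels, ignoring the BILOU prefix (assumes a 'ner' key)."""
    counts = Counter()
    for tok in iter_tokens(docs):
        _prefix, sep, label = tok.get("ner", "O").partition("-")
        if sep and label:  # skips 'O', '-', and empty tags
            counts[label] += 1
    return counts
```

For example, tag_counts(ldc_only(load_docs("combined.json"))) would give the label distribution for the LDC docs only, which makes it easy to compare the merged-tag files against the unmerged ones.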

khaled · 2018-07-31T18:32:02Z

@YanLiang1102, FYI you're mentioning the wrong khaled - I have no connection with this project :-)

YanLiang1102 · 2018-07-31T19:00:54Z

@khaledJabr Hey Khaled, I hope you saw the stuff above; I mentioned the wrong Khaled :P
