
Average Perceptron POS Tagger (Issue #2) #131

Merged: 30 commits, Jun 15, 2019

Conversation

ComputerMaestro (Contributor)

I added two files, "averagePerceptron.jl" and "tagger.jl", to the "src" folder in "TextAnalysis". The former contains the Average Perceptron model; the latter uses that model to train, predict tags, or reuse a saved model. I have not changed any of the original files, so currently the model cannot be used through "TextAnalysis.jl", because the tagger needs an optional argument "load" that specifies whether the user wants to load pretrained weights. If the user wants to load pretrained weights, the input should be "true" (Boolean), and I am not sure how to accept that argument (since I am a little new to Julia).
I have commented out the code for the "loadModel" function in "tagger.jl", which should run if the user passes that argument as true.
But for now, the model can be trained and tested.
I will provide the pretrained weights shortly.
Please suggest changes to improve the code, and a way to take the "load" argument from the user.
Once I can take that argument, the model will be complete.

aviks (Member) commented Mar 12, 2019

Nice, thanks.

Let's iterate on this step by step.

First, please use BSON.jl to store the weights.

For how to load a model take a look at sentiment.jl. Basically, create a model object, and load the pre-trained weights on calling the default constructor of the model. Does that make sense?
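For reference, a minimal sketch of that pattern in Julia: the default constructor loads pretrained weights from a BSON file when asked. The file name and the keys (:weights, :tagdict, :classes) are placeholders based on later comments in this thread, not the actual artifacts of this PR:

```
using BSON

struct PerceptronTagger
    weights::Dict
    tagdict::Dict
    classes::Set
end

# Hypothetical loader: calling PerceptronTagger(true) reads the saved weights.
function PerceptronTagger(load::Bool=false)
    if load
        data = BSON.load("pretrained_pos_tagger.bson")   # placeholder file name
        return PerceptronTagger(data[:weights], data[:tagdict], Set(data[:classes]))
    end
    return PerceptronTagger(Dict(), Dict(), Set())
end
```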

ComputerMaestro (Contributor, Author)

Thank you, I was actually confused between BSON and JLD.
I looked into it, and I am thinking of using a struct instead of a module. Is that fine, or should I try something else?

function predict(features)
"""Predicting the class using current weights by doing Dot-product of features
and weights and return the scores"""
global self
Member

What's this for? Have you written this based on some Python code?

Contributor Author

Yes, I tried to implement it the way I would in Python. That approach is wrong, so I am now using a struct instead of a module, and I am merging both files.
I have a question about functions in sentiment.jl such as:
function (m::SentimentModel)(text::Array{T, 1}) where T <: AbstractString
    return get_sentiment(text, m.weight, m.words)
end
What is the use of this kind of function?

Member

This kind of function definition allows us to use a model object as a function to do inference, so we can write code like:

m=SentimentModel() #loads saved weights
m(input_data)      #performs inference
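A tiny self-contained illustration of this "callable struct" (functor) pattern, using a made-up type rather than the actual SentimentModel:

```
# Illustrative only: Doubler is a made-up type, not part of TextAnalysis.
struct Doubler
    factor::Int
end

# Defining a method on the type itself makes every instance callable.
(d::Doubler)(x) = d.factor * x

d = Doubler(2)
d(21)    # returns 42; the instance is called like a function
```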

ComputerMaestro (Contributor, Author) commented Mar 13, 2019

Oh, so that means the function is associated with every instance of the type?

src/tagger.jl Outdated
include("averagePerceptron.jl")
import .AveragePerceptron

mutable struct init
Member

init is a bad name for a type

Contributor Author

Okay, I will change that, and I will merge both files into one.

Contributor Author

Sir, I just read somewhere that I can't define member functions inside a struct in Julia, so I would have to use a module for member functions.
Is that true, or is there a way to get member-function behavior without using modules?

Contributor Author

I have changed tagger.jl to use BSON and committed it, but it is currently showing that some checks are pending. Please have a look and tell me whether the changes are correct.

I have implemented BSON in the `tagger.jl` file.
I have deleted the averagePerceptron.jl file, because I have merged the tagger and the model into a new file.
Deleting tagger.jl as well, because it is no longer needed; I will upload a new file that contains it.
This file is an improved version of the one I committed earlier and has none of the issues the earlier files had. It contains both a fully working average perceptron model and the perceptron tagger; it can be trained and tested, and pretrained weights can be used as well. With the earlier "averagePerceptron.jl" and "tagger.jl" I was not able to load the pretrained weights, but that problem has been solved and trained weights now load successfully. Earlier I used JLD for saving; this time I have switched to the BSON format.
I will upload the weights shortly.
Until then, please have a look at it and suggest improvements.
There were some small issues I ran into while testing different things, and I have resolved them.
I have changed the `classes` variable from `OrderedSet` to `Set`, since I found an issue loading `OrderedCollections` types with the `BSON` package.
I have also fixed some silly mistakes.
ComputerMaestro (Contributor, Author)

Sir, I trained the model and saved the weights in BSON format. Along with the weights I have to save two more variables, tagdict and classes. But there was a problem loading classes: I found that BSON has some issue loading Set-type variables. Any suggestions on what I should do?

aviks (Member) commented Mar 15, 2019

Does it need to be a Set? Can you not make it a plain Array? That might be a simple workaround.

aviks closed this Mar 15, 2019
aviks reopened this Mar 15, 2019
ComputerMaestro (Contributor, Author)

Yes, I did exactly that. I will commit it along with the trained weights.

The problem was that we could not load `classes` from the file containing the pretrained weights, tagdict, and classes, because `classes` was saved as a `Set` and BSON has an issue loading `Set` variables properly. Now the classes are converted to an `Array` before saving; on loading, they are read back as an `Array` and converted to a `Set`.
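A minimal sketch of that workaround with BSON.jl (the file name and tag values here are illustrative):

```
using BSON

classes = Set(["NN", "VB", "JJ"])                    # illustrative tag classes

# Save: convert the Set to an Array first.
BSON.bson("classes.bson", classes = collect(classes))

# Load: read the Array back and rebuild the Set.
classes = Set(BSON.load("classes.bson")[:classes])
```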
ComputerMaestro (Contributor, Author)

I could not upload the weights because the file is big.
So here is the link to the weights:
https://drive.google.com/open?id=1AQE5VRmI1qt3UMsxTRmpFcXrpchQRPwp
The weights are not very accurate yet, because they were only trained on 200,000 examples; I will train them further and update them.
Please check averagePerceptronTagger.jl for any suggestions.

ComputerMaestro (Contributor, Author) commented Mar 17, 2019

Sorry sir, I closed it by mistake.
I could not upload the weights because the file is big.
So this is the link to the weights:
https://drive.google.com/open?id=1AQE5VRmI1qt3UMsxTRmpFcXrpchQRPwp
The pretrained weights are not very accurate, because they were only trained on 200,000 examples.
I will update them further.
Please check averagePerceptronTagger.jl and suggest any changes to be made.
The improved averagePerceptronTagger.jl is now in the pending checks.

mutable struct AveragePerceptron
classes :: Set
weights :: Dict
_totals :: DefaultDict
Member

Why the underscore in this variable name?

Contributor Author

These variables act as accumulators. _totals accumulates the weights by summing up the value of each weight at every step. _tstamps is meant to track for how many timesteps a weight has gone unchanged; it does not actually accumulate a count, but stores the value of self.i for a weight whenever that weight changes. These steps are easier to understand in the update logic inside the upd_feat function.
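For readers following along, a simplified reconstruction of that bookkeeping, sketched under the assumption that weights maps feature => Dict(class => weight); this is not the exact code in the diff:

```
using DataStructures

# Sketch type mirroring the fields discussed above.
mutable struct AvgPerceptronSketch
    weights::Dict
    _totals::DefaultDict
    _tstamps::DefaultDict
    i::Int
end

AvgPerceptronSketch() = AvgPerceptronSketch(Dict(), DefaultDict(0.0), DefaultDict(0), 1)

function upd_feat!(ap::AvgPerceptronSketch, class, feature, weight, value)
    param = (feature, class)
    # Credit the old weight for every step it stayed unchanged since its last update.
    ap._totals[param] += (ap.i - ap._tstamps[param]) * weight
    # Record the step at which this weight was last changed.
    ap._tstamps[param] = ap.i
    # Apply the perceptron update itself.
    get!(ap.weights, feature, Dict())[class] = weight + value
end
```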

classes :: Set
weights :: Dict
_totals :: DefaultDict
_tstamps :: DefaultDict
Member

What's the difference between Dict and DefaultDict?

Contributor Author

We can specify a default value for a DefaultDict. If we reference a key whose value has not yet been defined, a Dict raises an error, but a DefaultDict assigns the default value to that key, so we can operate on keys we never explicitly initialized.
For example:
a = DefaultDict(0)
a["some_key"] += 5
This won't raise an error, because 0 is assigned to "some_key" by default.

b = Dict{String, Int64}()
b["some_key"] += 5
This raises an error, because "some_key" was not defined earlier.
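For completeness, the same example as runnable code; DefaultDict comes from the DataStructures.jl package:

```
using DataStructures

a = DefaultDict(0)        # missing keys default to 0
a["some_key"] += 5        # fine: reads the default 0, then stores 5

b = Dict{String, Int64}()
# b["some_key"] += 5      # would throw a KeyError, since "some_key" was never set
```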

_totals :: DefaultDict
_tstamps :: DefaultDict
i :: Int64
START :: Array
Member

Why is START in capitals?

Contributor Author

I have named it in capitals because it is a constant containing ["-START-", "-START2-"], and it is not changed throughout the program. The same goes for END, which contains ["-END-", "-END2-"].

self._totals = DefaultDict(0)
self._tstamps = DefaultDict(0)
self.i = 1
self.START = ["-START-", "-START2-"]
Member

Write this constructor like this:

function AveragePerceptron()
    new(Set(), Dict(), DefaultDict(0), DefaultDict(0), 1, ["-START-", "-START2-"])
end

Member

See below for what to do with the function.

self.i = 1
self.START = ["-START-", "-START2-"]

self.predict = function (features)
Member

So while Julia does have some object-oriented features, code in Julia should generally be written in a more functional style. The biggest difference is that objects do not contain methods inside them, so predict should not be a method inside the object. Instead, write it as a function definition at the top level (not inside the type), and let the function take an instance of AveragePerceptron as the first parameter. Do this for all the functions below.

function predict(ap::AveragePerceptron, features)
    <body of predict function>
end
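For illustration, one way the elided body could look, given the docstring earlier in the thread that describes predict as a dot product of features and weights returning per-class scores. This is a sketch, not the code that was merged; it assumes features maps feature => value and that ap.weights maps feature => Dict(class => weight):

```
# Sketch only; the type annotation is left off so it stands alone from the struct above.
function predict_scores(ap, features)
    scores = Dict{Any, Float64}()
    for (feat, value) in features
        for (class, weight) in get(ap.weights, feat, Dict())
            scores[class] = get(scores, class, 0.0) + value * weight
        end
    end
    return scores
end
```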

ComputerMaestro (Contributor, Author) commented Mar 18, 2019

This happened because I have been working in object-oriented programming languages for a long time. I will make these changes.

aviks (Member) commented Mar 17, 2019

Hi @yash270200, thanks, this is good. I have made some suggestions to make this look like more idiomatic Julia. Please ask for clarification if any of the feedback is unclear.

ComputerMaestro (Contributor, Author)

Thanks sir for your suggestions, I will surely make the changes as soon as possible.

I have made the changes so that it is no longer written in an object-oriented style.
ComputerMaestro (Contributor, Author)

Sir, I have made the suggested changes. Please have a look and give feedback on any further improvements.

ComputerMaestro (Contributor, Author) commented Mar 20, 2019

Sir, apart from this issue, I am also interested in working on issue #28. What interface do you expect for WordNet? Please give me a hint so that I can proceed with it as well.

And one more thing: I am writing documentation for the average perceptron model. Should I explain the full functionality of the functions, or only give examples?

ComputerMaestro (Contributor, Author)

@aviks, I have fixed that bug.
I will complete the documentation as soon as possible.

Ayushk4 (Member) left a comment

I have some small suggestions, mostly documentation-related for now. These are based on the official Julia documentation
(https://docs.julialang.org/en/v1/manual/documentation/index.html).

(Copy-pasting the following from the documentation linked above:)

  • Place the starting and ending """ characters on lines by themselves.
  • Always show the signature of a function at the top of the documentation, with a four-space indent so that it is printed as Julia code.
  • Include a single one-line sentence describing what the function does or what the object represents after the simplified signature block. If needed, provide more details in a second paragraph, after a blank line.

@@ -172,3 +172,51 @@ julia> summarize(s, ns=2)
"Assume this Short Document as an example."
"This has too foo sentences."
```

##Parts of Speech Tagger
Member

Can you try building the documentation locally? I feel this should have a space between # and the heading for the Markdown to take effect.

Contributor Author

Thanks @Ayushk4 for this

The model can be trained from scratch and weights are saved in specified location.
The pretrained model can also be loaded and can be used directly to predict tags.

#To train model:
Member

Can you remove this as a main heading of the document? I think it is better as plain text explaining it in a sentence or two, or as a sub-sub-heading (using ###).

Check out the Julia docs: https://docs.julialang.org/en/v1/stdlib/Markdown/#Headers-1

iteration : 5
```

#To load pretrained model:
Member

Same as above, remove this from Heading 1 type.

@@ -50,6 +50,7 @@ module TextAnalysis
export strip_numbers, strip_non_letters, strip_indefinite_articles, strip_definite_articles, strip_articles
export strip_prepositions, strip_pronouns, strip_stopwords, strip_sparse_terms, strip_frequent_terms, strip_html_tags
export SentimentAnalyzer
export PerceptronTagger, train, tag
Ayushk4 (Member) commented May 16, 2019

If you are exporting train and tag, then I guess it would be better to change the names to something hinting that they are specifically for the perceptron tagger. (Now we also have a Naive Bayes classifier, and more will be added soon 🙂)

Contributor Author

I will rename them to trainPerceptron and predictTag. If you have other names in mind, please let me know.

Ayushk4 (Member) commented May 17, 2019

Not sure; maybe don't export them at all, similar to the Naive Bayes classifier.

Member

@laschuet, could you weigh in on this?

Contributor Author

Not sure; maybe don't export them at all, similar to the Naive Bayes classifier.

Yes, I am also thinking of not exporting them.

ComputerMaestro (Contributor, Author) commented May 17, 2019

Since the Naive Bayes classifier uses fit! instead of train and predict instead of tag, I am thinking of using the same names for the average perceptron tagger as well. Is that fine?

end
end

"""Predicting the class using current weights by doing Dot-product of features
Member

Point 8 of what the official Julia docs say about proper documentation: https://docs.julialang.org/en/v1/manual/documentation/index.html

  • Starting and Ending quotes """ should be on a separate line by themselves.

ComputerMaestro (Contributor, Author)

I have made the changes, @Ayushk4. Please have a look through the docs.

Ayushk4 (Member) commented May 18, 2019

This can be removed with the merging of this PR.

tag_pos!(entity) = error("Not yet implemented")

ComputerMaestro (Contributor, Author)

This can be removed with the merging of this PR.

tag_pos!(entity) = error("Not yet implemented")

Yeah, I removed that.

Ayushk4 (Member) commented May 18, 2019

What should be done about this?

const tag_part_of_speech = UInt32(0x1) << 3

The function tag_pos! is still being used by prepare!() in preprocess.jl and is also exported from TextAnalysis.jl.

ComputerMaestro (Contributor, Author)

I think @aviks can give a proper solution to this

@@ -36,29 +36,17 @@ function mk_regex(regex_string)
end


"""
remove_corrupt_utf8(str)
Ayushk4 (Member) commented Jun 4, 2019

I guess something went wrong while resolving conflicts. All the docstrings (offline documentation) in preprocessing.jl got removed entirely.

Contributor Author

Yes, there was some problem with preprocessing.jl while merging.

Member

Cool. Thanks a lot.

ComputerMaestro (Contributor, Author)

What should be done about this?

const tag_part_of_speech = UInt32(0x1) << 3

The function tag_pos! is still being used by prepare!() in preprocess.jl and is also exported from TextAnalysis.jl.

@aviks, what should be done about this?

aviks (Member) commented Jun 9, 2019

@aviks, what should be done about this?

If we think tag_pos should be removed, then change the implementation of prepare! and add a deprecation message to tag_pos.
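A hedged sketch of what such a deprecation could look like; the exact message and replacement name are illustrative, not necessarily what was merged:

```
# Emit a deprecation warning, pointing users at the new tagger added in this PR.
function tag_pos!(entity)
    Base.depwarn("tag_pos! is deprecated; use PerceptronTagger instead.", :tag_pos!)
end
```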

aviks merged commit 5730ba6 into JuliaText:master on Jun 15, 2019
Ayushk4 (Member) commented Jun 16, 2019

src/pretrainedMod.bson is greater than 50 MB. I get this warning when rebasing onto master: remote: warning: File src/pretrainedMod.bson is 50.83 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB.

Maybe we can store it elsewhere and fetch it with DataDeps when needed.
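One possible shape for that, sketched with DataDeps.jl; the URL below is a placeholder, not a real download location:

```
using DataDeps

register(DataDep(
    "POSPerceptronWeights",                      # name used by the datadep"..." string macro
    "Pretrained averaged-perceptron POS tagger weights (BSON), ~50 MB.",
    "https://example.com/pretrainedMod.bson",    # placeholder URL
))

# Resolved lazily when the tagger actually needs the file:
# weights_path = joinpath(datadep"POSPerceptronWeights", "pretrainedMod.bson")
```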
