
Average Perceptron POS Tagger (Issue #2) #131

Merged: 30 commits, Jun 15, 2019

Conversation

ComputerMaestro (Contributor)

I added two files, "averagePerceptron.jl" and "tagger.jl", to the "src" folder in "TextAnalysis". The former contains the Average Perceptron model; the latter uses that model to train, predict tags, or reuse a saved model. I have not changed any of the original files, so currently the model cannot be used through "TextAnalysis.jl", because the tagger needs an optional argument "load" that specifies whether the user wants to load pretrained weights. If the user wants to load pretrained weights, the input should be "true" (Boolean), and I am not sure how to accept that argument (since I am a little new to Julia).
I have commented out the code for the "loadModel" function in "tagger.jl", which should run if the user passes that argument as true.
But for now, the model can be trained and tested.
I will provide the pretrained weights shortly.
Please suggest changes to improve the code, and a way to take the "load" argument from the user.
Once I can take that argument, the model will be complete.

aviks (Member) commented Mar 12, 2019

Nice, thanks.

Let's iterate on this step by step.

First, please use BSON.jl to store the weights.

For how to load a model take a look at sentiment.jl. Basically, create a model object, and load the pre-trained weights on calling the default constructor of the model. Does that make sense?
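For reference, a minimal sketch of that pattern in Julia: the default constructor loads pretrained weights from a BSON file when asked. The file name and the keys (:weights, :tagdict, :classes) are placeholders based on later comments in this thread, not the actual artifacts of this PR:

```
using BSON

struct PerceptronTagger
    weights::Dict
    tagdict::Dict
    classes::Set
end

# Hypothetical loader: calling PerceptronTagger(true) reads the saved weights.
function PerceptronTagger(load::Bool=false)
    if load
        data = BSON.load("pretrained_pos_tagger.bson")   # placeholder file name
        return PerceptronTagger(data[:weights], data[:tagdict], Set(data[:classes]))
    end
    return PerceptronTagger(Dict(), Dict(), Set())
end
```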

ComputerMaestro (Contributor, Author)

Thank you, I was actually confused between BSON and JLD.
I looked into it, and I am thinking of using a struct instead of a module. Is that fine, or should I try something else?

function predict(features)
"""Predicting the class using current weights by doing Dot-product of features
and weights and return the scores"""
global self
Member

What's this for? Have you written this based on some Python code?

Contributor Author

Yes, I tried to implement it the way I would in Python. That approach is wrong, so I am now using a struct instead of a module, and I am merging both files.
I have a question about functions in sentiment.jl such as:
function (m::SentimentModel)(text::Array{T, 1}) where T <: AbstractString
    return get_sentiment(text, m.weight, m.words)
end
What is the use of this kind of function?

Member

This kind of function definition allows us to use a model object as a function to do inference, so we can write code like:

m=SentimentModel() #loads saved weights
m(input_data)      #performs inference
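A tiny self-contained illustration of this "callable struct" (functor) pattern, using a made-up type rather than the actual SentimentModel:

```
# Illustrative only: Doubler is a made-up type, not part of TextAnalysis.
struct Doubler
    factor::Int
end

# Defining a method on the type itself makes every instance callable.
(d::Doubler)(x) = d.factor * x

d = Doubler(2)
d(21)    # returns 42; the instance is called like a function
```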

ComputerMaestro (Contributor, Author) commented Mar 13, 2019

Oh, so that means the function is associated with every instance of the type?

src/tagger.jl Outdated
include("averagePerceptron.jl")
import .AveragePerceptron

mutable struct init
Member

init is a bad name for a type

Contributor Author

Okay, I will change that, and I will merge both files into one.

Contributor Author

Sir, I just read somewhere that I can't define member functions inside a struct in Julia, so I would have to use a module for member functions.
Is that true, or is there a way to get member-function behavior without using modules?

Contributor Author

I have changed tagger.jl to use BSON and committed it, but it is currently showing that some checks are pending. Please have a look and tell me whether the changes are correct.

I have implemented BSON in the `tagger.jl` file.
I have deleted the averagePerceptron.jl file, because I have merged the tagger and the model into a new file.
Deleting tagger.jl as well, because it is no longer needed; I will upload a new file that contains it.
This file is an improved version of the one I committed earlier and has none of the issues the earlier files had. It contains both a fully working average perceptron model and the perceptron tagger; it can be trained and tested, and pretrained weights can be used as well. With the earlier "averagePerceptron.jl" and "tagger.jl" I was not able to load the pretrained weights, but that problem has been solved and trained weights now load successfully. Earlier I used JLD for saving; this time I have switched to the BSON format.
I will upload the weights shortly.
Until then, please have a look at it and suggest improvements.
There were some small issues I ran into while testing different things, and I have resolved them.
I have changed the `classes` variable from `OrderedSet` to `Set`, since I found an issue loading `OrderedCollections` types with the `BSON` package.
I have also fixed some silly mistakes.
ComputerMaestro (Contributor, Author)

Sir, I trained the model and saved the weights in BSON format. Along with the weights I have to save two more variables, tagdict and classes. But there was a problem loading classes: I found that BSON has some issue loading Set-type variables. Any suggestions on what I should do?

aviks (Member) commented Mar 15, 2019

Does it need to be a Set? Can you not make it a plain Array? That might be a simple workaround.

aviks closed this Mar 15, 2019
aviks reopened this Mar 15, 2019
ComputerMaestro (Contributor, Author)

Yes, I did exactly that. I will commit it along with the trained weights.

The problem was that we could not load `classes` from the file containing the pretrained weights, tagdict, and classes, because `classes` was saved as a `Set` and BSON has an issue loading `Set` variables properly. Now the classes are converted to an `Array` before saving; on loading, they are read back as an `Array` and converted to a `Set`.
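A minimal sketch of that workaround with BSON.jl (the file name and tag values here are illustrative):

```
using BSON

classes = Set(["NN", "VB", "JJ"])                    # illustrative tag classes

# Save: convert the Set to an Array first.
BSON.bson("classes.bson", classes = collect(classes))

# Load: read the Array back and rebuild the Set.
classes = Set(BSON.load("classes.bson")[:classes])
```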
ComputerMaestro (Contributor, Author)

I could not upload the weights because the file is big.
So here is the link to the weights:
https://drive.google.com/open?id=1AQE5VRmI1qt3UMsxTRmpFcXrpchQRPwp
The weights are not very accurate yet, because they were only trained on 200,000 examples; I will train them further and update them.
Please check averagePerceptronTagger.jl for any suggestions.

ComputerMaestro (Contributor, Author) commented Mar 17, 2019

Sorry sir, I closed it by mistake.
I could not upload the weights because the file is big.
So this is the link to the weights:
https://drive.google.com/open?id=1AQE5VRmI1qt3UMsxTRmpFcXrpchQRPwp
The pretrained weights are not very accurate, because they were only trained on 200,000 examples.
I will update them further.
Please check averagePerceptronTagger.jl and suggest any changes to be made.
The improved averagePerceptronTagger.jl is now in the pending checks.

mutable struct AveragePerceptron
classes :: Set
weights :: Dict
_totals :: DefaultDict
Member

Why the underscore in this variable name?

Contributor Author

These variables act as accumulators. _totals accumulates the weights by summing up the value of each weight at every step. _tstamps is meant to track for how many timesteps a weight has gone unchanged; it does not actually accumulate a count, but stores the value of self.i for a weight whenever that weight changes. These steps are easier to understand in the update logic inside the upd_feat function.
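For readers following along, a simplified reconstruction of that bookkeeping, sketched under the assumption that weights maps feature => Dict(class => weight); this is not the exact code in the diff:

```
using DataStructures

# Sketch type mirroring the fields discussed above.
mutable struct AvgPerceptronSketch
    weights::Dict
    _totals::DefaultDict
    _tstamps::DefaultDict
    i::Int
end

AvgPerceptronSketch() = AvgPerceptronSketch(Dict(), DefaultDict(0.0), DefaultDict(0), 1)

function upd_feat!(ap::AvgPerceptronSketch, class, feature, weight, value)
    param = (feature, class)
    # Credit the old weight for every step it stayed unchanged since its last update.
    ap._totals[param] += (ap.i - ap._tstamps[param]) * weight
    # Record the step at which this weight was last changed.
    ap._tstamps[param] = ap.i
    # Apply the perceptron update itself.
    get!(ap.weights, feature, Dict())[class] = weight + value
end
```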

classes :: Set
weights :: Dict
_totals :: DefaultDict
_tstamps :: DefaultDict
Member

What's the difference between Dict and DefaultDict?

Contributor Author

We can specify a default value for a DefaultDict. If we reference a key whose value has not yet been defined, a Dict raises an error, but a DefaultDict assigns the default value to that key, so we can operate on keys we never explicitly initialized.
For example:
a = DefaultDict(0)
a["some_key"] += 5
This won't raise an error, because 0 is assigned to "some_key" by default.

b = Dict{String, Int64}()
b["some_key"] += 5
This raises an error, because "some_key" was not defined earlier.
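For completeness, the same example as runnable code; DefaultDict comes from the DataStructures.jl package:

```
using DataStructures

a = DefaultDict(0)        # missing keys default to 0
a["some_key"] += 5        # fine: reads the default 0, then stores 5

b = Dict{String, Int64}()
# b["some_key"] += 5      # would throw a KeyError, since "some_key" was never set
```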

_totals :: DefaultDict
_tstamps :: DefaultDict
i :: Int64
START :: Array
Member

Why is START in capitals?

Contributor Author

I have named it in capitals because it is a constant containing ["-START-", "-START2-"], and it is not changed throughout the program. The same goes for END, which contains ["-END-", "-END2-"].

self._totals = DefaultDict(0)
self._tstamps = DefaultDict(0)
self.i = 1
self.START = ["-START-", "-START2-"]
Member

Write this constructor like this:

function AveragePerceptron()
    new(Set(), Dict(), DefaultDict(0), DefaultDict(0), 1, ["-START-", "-START2-"])
end

Member

See below for what to do with the function.

self.i = 1
self.START = ["-START-", "-START2-"]

self.predict = function (features)
Member

So while Julia does have some object-oriented features, code in Julia should generally be written in a more functional style. The biggest difference is that objects do not contain methods inside them, so predict should not be a method inside the object. Instead, write it as a function definition at the top level (not inside the type), and let the function take an instance of AveragePerceptron as the first parameter. Do this for all the functions below.

function predict(ap::AveragePerceptron, features)
    <body of predict function>
end
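For illustration, one way the elided body could look, given the docstring earlier in the thread that describes predict as a dot product of features and weights returning per-class scores. This is a sketch, not the code that was merged; it assumes features maps feature => value and that ap.weights maps feature => Dict(class => weight):

```
# Sketch only; the type annotation is left off so it stands alone from the struct above.
function predict_scores(ap, features)
    scores = Dict{Any, Float64}()
    for (feat, value) in features
        for (class, weight) in get(ap.weights, feat, Dict())
            scores[class] = get(scores, class, 0.0) + value * weight
        end
    end
    return scores
end
```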

ComputerMaestro (Contributor, Author) commented Mar 18, 2019

This happened because I have been working in object-oriented programming languages for a long time. I will make these changes.

aviks (Member) commented Mar 17, 2019

Hi @yash270200, thanks, this is good. I have made some suggestions to make this look like more idiomatic Julia. Please ask for clarification if any of the feedback is unclear.

ComputerMaestro (Contributor, Author)

Thanks sir for your suggestions, I will surely make the changes as soon as possible.

I have made the changes so that it is no longer written in an object-oriented style.
ComputerMaestro (Contributor, Author)

Sir, I have made the suggested changes. Please have a look and give feedback on any further improvements.

ComputerMaestro (Contributor, Author) commented Mar 20, 2019

Sir, apart from this issue, I am also interested in working on issue #28. What interface do you expect for WordNet? Please give me a hint so that I can proceed with it as well.

And one more thing: I am writing documentation for the average perceptron model. Should I explain the full functionality of the functions, or only give examples?

ComputerMaestro (Contributor, Author)

@aviks, I have fixed that bug.
I will complete the documentation as soon as possible.

Ayushk4 (Member) left a comment

I have some small suggestions, mostly documentation-related for now. These are based on the official Julia documentation
(https://docs.julialang.org/en/v1/manual/documentation/index.html).

(Copy-pasting the following from the documentation linked above:)

  • Place the starting and ending """ characters on lines by themselves.
  • Always show the signature of a function at the top of the documentation, with a four-space indent so that it is printed as Julia code.
  • Include a single one-line sentence describing what the function does or what the object represents after the simplified signature block. If needed, provide more details in a second paragraph, after a blank line.

@@ -172,3 +172,51 @@ julia> summarize(s, ns=2)
"Assume this Short Document as an example."
"This has too foo sentences."
```

##Parts of Speech Tagger
Member

Can you try building the documentation locally? I feel this should have a space between # and the heading for the Markdown to take effect.

Contributor Author

Thanks @Ayushk4 for this

The model can be trained from scratch and weights are saved in specified location.
The pretrained model can also be loaded and can be used directly to predict tags.

#To train model:
Member

Can you remove this as a main heading of the document? I think it is better as plain text explaining it in a sentence or two, or as a sub-sub-heading (using ###).

Check out the Julia docs: https://docs.julialang.org/en/v1/stdlib/Markdown/#Headers-1

iteration : 5
```

#To load pretrained model:
Member

Same as above, remove this from Heading 1 type.

@@ -50,6 +50,7 @@ module TextAnalysis
export strip_numbers, strip_non_letters, strip_indefinite_articles, strip_definite_articles, strip_articles
export strip_prepositions, strip_pronouns, strip_stopwords, strip_sparse_terms, strip_frequent_terms, strip_html_tags
export SentimentAnalyzer
export PerceptronTagger, train, tag
Ayushk4 (Member) commented May 16, 2019

If you are exporting train and tag, then I guess it would be better to change the names to something hinting that they are specifically for the perceptron tagger. (Now we also have a Naive Bayes classifier, and more will be added soon 🙂)

Contributor Author

I will rename them to trainPerceptron and predictTag. If you have other names in mind, please let me know.

Ayushk4 (Member) commented May 17, 2019

Not sure; maybe don't export them at all, similar to the Naive Bayes classifier.

Member

@laschuet, could you weigh in on this?

Contributor Author

Not sure; maybe don't export them at all, similar to the Naive Bayes classifier.

Yes, I am also thinking of not exporting them.

ComputerMaestro (Contributor, Author) commented May 17, 2019

Since the Naive Bayes classifier uses fit! instead of train and predict instead of tag, I am thinking of using the same names for the average perceptron tagger as well. Is that fine?

end
end

"""Predicting the class using current weights by doing Dot-product of features
Member

Point 8 of what the official Julia docs say about proper documentation: https://docs.julialang.org/en/v1/manual/documentation/index.html

  • Starting and Ending quotes """ should be on a separate line by themselves.

ComputerMaestro (Contributor, Author)

I have made the changes, @Ayushk4. Please have a look through the docs.

Ayushk4 (Member) commented May 18, 2019

This can be removed with the merging of this PR.

tag_pos!(entity) = error("Not yet implemented")

ComputerMaestro (Contributor, Author)

This can be removed with the merging of this PR.

tag_pos!(entity) = error("Not yet implemented")

Yeah, I removed that.

Ayushk4 (Member) commented May 18, 2019

What should be done about this?

const tag_part_of_speech = UInt32(0x1) << 3

The function tag_pos! is still being used by prepare!() in preprocess.jl and is also exported from TextAnalysis.jl.

ComputerMaestro (Contributor, Author)

I think @aviks can give a proper solution to this

@@ -36,29 +36,17 @@ function mk_regex(regex_string)
end


"""
remove_corrupt_utf8(str)
Ayushk4 (Member) commented Jun 4, 2019

I guess something went wrong while resolving conflicts. All the docstrings (offline documentation) in preprocessing.jl got removed entirely.

Contributor Author

Yes, there was some problem with preprocessing.jl while merging.

Member

Cool. Thanks a lot.

ComputerMaestro (Contributor, Author)

What should be done about this?

const tag_part_of_speech = UInt32(0x1) << 3

The function tag_pos! is still being used by prepare!() in preprocess.jl and is also exported from TextAnalysis.jl.

@aviks, what should be done about this?

aviks (Member) commented Jun 9, 2019

@aviks, what should be done about this?

If we think tag_pos should be removed, then change the implementation of prepare! and add a deprecation message to tag_pos.
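A hedged sketch of what such a deprecation could look like; the exact message and replacement name are illustrative, not necessarily what was merged:

```
# Emit a deprecation warning, pointing users at the new tagger added in this PR.
function tag_pos!(entity)
    Base.depwarn("tag_pos! is deprecated; use PerceptronTagger instead.", :tag_pos!)
end
```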

aviks merged commit 5730ba6 into JuliaText:master on Jun 15, 2019
Ayushk4 (Member) commented Jun 16, 2019

src/pretrainedMod.bson is greater than 50 MB. I get this warning when rebasing onto master: remote: warning: File src/pretrainedMod.bson is 50.83 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB.

Maybe we can store it elsewhere and fetch it with DataDeps when needed.
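One possible shape for that, sketched with DataDeps.jl; the URL below is a placeholder, not a real download location:

```
using DataDeps

register(DataDep(
    "POSPerceptronWeights",                      # name used by the datadep"..." string macro
    "Pretrained averaged-perceptron POS tagger weights (BSON), ~50 MB.",
    "https://example.com/pretrainedMod.bson",    # placeholder URL
))

# Resolved lazily when the tagger actually needs the file:
# weights_path = joinpath(datadep"POSPerceptronWeights", "pretrainedMod.bson")
```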
