methods to help manipulate and update DocumentTermMatrix incrementally #244

Merged: 3 commits merged into JuliaText:master on Dec 15, 2020

Conversation

tanmaykm
Contributor

This adds a few methods to help manipulate and update DocumentTermMatrix incrementally:

  • serialize and deserialize: the column index is not serialized, since it can be reconstructed from the terms vector upon deserialization
  • prune!: removes documents, and the terms corresponding to them, from an existing index; this is meant for documents that have been deleted
  • merge!: merges two instances, one being an incremental update to be applied to an existing full index
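
As an illustration of the serialization idea in the first bullet, here is a minimal sketch. It does not use the TextAnalysis.jl types or its actual `serialize`/`deserialize` methods: `SketchDTM`, `build_column_index`, `save_dtm`, and `load_dtm` are hypothetical names made up for this example, and the field layout is an assumption. Only the matrix and the terms vector are written out; the term-to-column lookup is rebuilt when reading back.

```julia
using SparseArrays
using Serialization

# Hypothetical stand-in for a document-term matrix; NOT the TextAnalysis.jl type.
struct SketchDTM
    counts::SparseMatrixCSC{Int,Int}   # rows are documents, columns are terms
    terms::Vector{String}              # column j holds the counts for terms[j]
    column_index::Dict{String,Int}     # term => column, derivable from `terms`
end

# Rebuild the term => column lookup from the ordered terms vector.
build_column_index(terms) =
    Dict{String,Int}(term => j for (j, term) in enumerate(terms))

# Convenience constructor that derives the lookup instead of taking it as input.
SketchDTM(counts, terms) = SketchDTM(counts, terms, build_column_index(terms))

# Write only the matrix and the terms; the lookup is omitted on purpose.
save_dtm(io::IO, dtm::SketchDTM) = serialize(io, (dtm.counts, dtm.terms))

# Read the pair back and reconstruct the lookup from the terms vector.
function load_dtm(io::IO)
    counts, terms = deserialize(io)
    return SketchDTM(counts, terms)
end

# Round trip through an in-memory buffer.
dtm = SketchDTM(sparse([1, 1, 2], [1, 2, 2], [2, 1, 3]), ["apple", "berry"])
buf = IOBuffer()
save_dtm(buf, dtm)
seekstart(buf)
restored = load_dtm(buf)
```

The saving grows with the vocabulary, since the lookup holds one entry per term and carries no information beyond the order of `terms`.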

Also relaxed the StatsBase version requirement to include both 0.32 and 0.33. The narrower requirement was preventing this package from being used together with certain other packages.

fixes #243

Add serialization method for DocumentTermMatrix that optimizes by not serializing the column index unnecessarily. That is re-constructed from the terms vector on deserialization.
aviks merged commit b5c9dce into JuliaText:master on Dec 15, 2020
@tanmaykm
Contributor Author

I see now that copyto! is quite slow for large sparse matrices. It should be possible to implement a different copy method optimized for the specific needs of merging DTMs in TextAnalysis. I shall send another PR with that in a bit.
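
For readers wondering what a merge without `copyto!` might look like, here is one possible sketch. It is not the implementation from this PR or the follow-up: `merge_counts` is a hypothetical helper, and both the sorted term order and the "append the update's documents as new rows" layout are assumptions. It collects the nonzero entries of both matrices as (row, column, value) triplets, remaps the columns onto a combined vocabulary, and lets `sparse` assemble the result in one pass. No timing claim is made here.

```julia
using SparseArrays

# Hypothetical helper, NOT the TextAnalysis.jl `merge!`: stacks the documents
# of `b` below those of `a`, aligning columns by term.
function merge_counts(a::SparseMatrixCSC{Int,Int}, a_terms::Vector{String},
                      b::SparseMatrixCSC{Int,Int}, b_terms::Vector{String})
    # Combined vocabulary (sorted order is an assumption of this sketch).
    terms = sort!(union(a_terms, b_terms))
    col = Dict{String,Int}(t => j for (j, t) in enumerate(terms))

    # Nonzero entries of each input as (row, column, value) triplets.
    ai, aj, av = findnz(a)
    bi, bj, bv = findnz(b)

    # Shift the rows of `b` past those of `a`; remap columns via the terms.
    rows = vcat(ai, bi .+ size(a, 1))
    cols = vcat(Int[col[a_terms[j]] for j in aj],
                Int[col[b_terms[j]] for j in bj])
    vals = vcat(av, bv)

    merged = sparse(rows, cols, vals, size(a, 1) + size(b, 1), length(terms))
    return merged, terms
end

# Two documents over (apple, berry) plus one new document over (berry, cherry).
a = sparse([1, 2], [1, 2], [1, 2])
b = sparse([1, 1], [1, 2], [3, 4])
merged, terms = merge_counts(a, ["apple", "berry"], b, ["berry", "cherry"])
```

In this example `terms` comes out as `["apple", "berry", "cherry"]` and `merged` is a 3×3 matrix whose last row holds the counts of the new document.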

tanmaykm added a commit to tanmaykm/TextAnalysis.jl that referenced this pull request Dec 16, 2020
This updates the `merge!` method for document term matrices introduced in JuliaText#244 with a specific implementation of sparse matrix operations optimized for the merging operation.
tanmaykm added a commit to tanmaykm/TextAnalysis.jl that referenced this pull request Dec 17, 2020
This updates the `merge!` method for document term matrices introduced in JuliaText#244 with a specific implementation of sparse matrix operations optimized for the merging operation.
Development

Successfully merging this pull request may close these issues.

Methods to merge two DocumentTermMatrix instances