This repository has been archived by the owner on Aug 13, 2021. It is now read-only.

[326b] NiH Curation/Aggregation #336

Merged Nov 26, 2020 (123 commits)

Commits
0227a9f
setting up new dir structure
Oct 12, 2020
cd99ebf
updated orm, simplified collect pipeline
Oct 13, 2020
2a06f2a
added a lot of preprocessing
Oct 13, 2020
2a44476
added cleaning and preprocessing steps and tests
Oct 14, 2020
55e5e42
migrated from health_data to nih
Oct 14, 2020
90d282a
migrated from health_data to nih
Oct 14, 2020
9c259e5
added test for auto splitting json fields
Oct 14, 2020
a015fce
test running
Oct 14, 2020
ae66c97
added cleaning for terms fields, added date parsing
Oct 20, 2020
279d0e4
added cleaning for terms fields, added date parsing
Oct 20, 2020
1dd7451
added upsert logic
Oct 22, 2020
e1cb248
add docstrings
jaklinger Oct 22, 2020
9f57e79
add docstrings
jaklinger Oct 22, 2020
662ea1a
updated orm
Oct 22, 2020
014f207
Merge branch '326_nih' of github.com:nestauk/nesta into 326_nih
Oct 22, 2020
41e81d5
speedups for inserts
Oct 23, 2020
8875c02
closed session
Oct 23, 2020
2996698
added vectors to nih
Oct 26, 2020
c9ea838
nih links and clinical trials
Oct 26, 2020
280fcc4
file renaming factored from 326_nih
Oct 26, 2020
42e9baf
revert autobatch
Oct 26, 2020
dd29100
Merge branch '326_rename' into 326_nih
Oct 26, 2020
ac24fd7
removed references to old health_data terminology in favour of nih
Oct 27, 2020
1080137
traded lambda for partial
Oct 27, 2020
6325cea
factored out utils
Oct 27, 2020
fe9e8a1
factored in utils from 326_rename
Oct 27, 2020
99b7c27
Merge branch '326_utils' into 326_nih
Oct 27, 2020
15f593b
added docstrings and comments, and small amount of refactoring
Oct 27, 2020
5fcb660
added pk tests
Oct 27, 2020
f27c836
Merge branch 'dev' into 326_utils
Oct 28, 2020
246159f
test for auto pkey check
Oct 28, 2020
ca7591f
test for generate pk
Oct 28, 2020
6419139
added retrieve pk tests
Oct 28, 2020
8bd9d89
added delete stmt test
Oct 28, 2020
6755381
added delete stmt test
Oct 28, 2020
3ac4d2b
added merge tests
Oct 28, 2020
037daf6
Merge branch '326_utils' into 326_nih
Oct 28, 2020
c66766b
added bucket keys tests
Oct 28, 2020
b4b345e
backwards compatability for older pipelines
Oct 28, 2020
dfa1787
Merge branch '326_utils' into 326_nih
Oct 28, 2020
3fb1f2c
fixes wrt to new insert method
Oct 28, 2020
9e71399
updated range
Oct 28, 2020
bed7772
Merge branch '326_utils' into 326_nih
Oct 28, 2020
9f35145
removed debugging lines
Oct 28, 2020
228bcab
removing todo comments
Oct 28, 2020
e8d22c6
tidying up
Oct 28, 2020
42a5572
added minor fix to gtr routine due to tanzania
Oct 28, 2020
a8eb08a
Merge branch '326_utils' into 326_nih
Oct 28, 2020
4370bbf
merge conflicts
Oct 29, 2020
c6e1cb4
added faiss and sim search
Oct 29, 2020
58f6fa8
rebuilding docker images with faiss removed from reqs
Oct 29, 2020
ec62eaa
added comments to faiss processor
Oct 29, 2020
753f01c
tasks run after committing more regularly
Oct 30, 2020
faca422
Merge branch '326a_nihbert' of github.com:nestauk/nesta into 326a_nih…
Oct 30, 2020
440c22d
added logging
Oct 30, 2020
2ebc42f
Merge branch '326a_nihbert' of github.com:nestauk/nesta into 326a_nih…
Oct 30, 2020
dfc94fa
updated orm
Oct 30, 2020
cf4e0e2
rewritten dedupe tasked
Oct 30, 2020
404c2b7
Merge branch '326a_nihbert' of github.com:nestauk/nesta into 326a_nih…
Oct 30, 2020
625e7a5
faiss speedups
Oct 30, 2020
c0ac36d
dev runs
Oct 30, 2020
78d7927
added filters to query
Nov 3, 2020
607a7ba
Merge branch '326a_nihbert' of github.com:nestauk/nesta into 326a_nih…
Nov 3, 2020
1a57f87
dedupe runs
Nov 4, 2020
72da6e8
rmd testing comments
Nov 4, 2020
dbf9a39
added explanatory comments
Nov 4, 2020
bccf387
added explanatory comments
Nov 4, 2020
e9b95ba
added explanatory comments
Nov 4, 2020
9623b7d
removed testing assert
Nov 4, 2020
7a9bab6
removed commented out lines
Nov 4, 2020
014d047
added docs to read
Nov 4, 2020
d3e5ea6
added helpful comments to similarity
Nov 5, 2020
f428eec
added helpful comments to similarity
Nov 5, 2020
8bba1c6
added index to date fields
Nov 6, 2020
9504c2c
fine-tuned some hyperparams to speed up large searches
Nov 6, 2020
5257d50
fine-tuned some hyperparams to speed up large searches
Nov 6, 2020
7c19f0c
removed experimental feature
Nov 6, 2020
856e8ae
Merge branch '326a_nihbert' of github.com:nestauk/nesta into 326a_nih…
Nov 6, 2020
57577e4
adding curate
Nov 6, 2020
ef766ba
needed to drop edge case abstracts for dedupe FK
Nov 9, 2020
9baef95
Update nih_orm.py
Nov 9, 2020
74f0ded
Update nih_orm.py
Nov 9, 2020
6dc33ff
added faiss and pytorch to reqs
Nov 9, 2020
1fa4fe0
Merge branch '326a_nihbert' of github.com:nestauk/nesta into 326a_nih…
Nov 9, 2020
4254406
stashing changes
Nov 10, 2020
d2fffcb
Merge branch '326a_nihbert' of github.com:nestauk/nesta into 326a_nih…
Nov 10, 2020
bd817f1
Merge branch '326a_nihbert' into 326b_nihagg
Nov 10, 2020
9b11087
updated reqs
Nov 11, 2020
ae6ad93
updated reqs
Nov 11, 2020
a2b83f4
added exception for dud ids
Nov 11, 2020
8e5699a
adding orm and impute functions
Nov 11, 2020
d7a34d0
added more info to the readme
Nov 11, 2020
449705a
Merge branch '326a_nihbert' of github.com:nestauk/nesta into 326a_nih…
Nov 11, 2020
7859d53
Merge branch '326a_nihbert' into 326b_nihagg
Nov 11, 2020
2519075
added tests
Nov 11, 2020
c1c4f55
added tests
Nov 11, 2020
12e569d
updating curate
Nov 12, 2020
74b7a73
merge conflict resovled
Nov 12, 2020
c3d7bf1
similar ids routine
Nov 12, 2020
7dcb5ba
added aggregation
Nov 12, 2020
cc6a1ba
dev runs
Nov 17, 2020
29b47ba
comments into run.py
Nov 17, 2020
4d25186
rmd dangling test file
Nov 17, 2020
abdabc6
refactored getattr
Nov 17, 2020
4e0d2f6
rmd whitespace
Nov 17, 2020
5fc3e51
added zillion tests
Nov 17, 2020
d8f48ad
added tests
Nov 19, 2020
a8fafbb
imputing base id
Nov 19, 2020
e5cc81c
added tests
Nov 19, 2020
67ca589
added missing import
Nov 19, 2020
4e71ac4
Merge branch '326a_impute' into 326b_nihagg
Nov 19, 2020
d83e491
added option
Nov 19, 2020
562a55e
switched to instance checking
Nov 19, 2020
985e906
Merge branch 'dev' into 326a_impute
Nov 19, 2020
6089962
Merge branch '326a_impute' into 326b_nihagg
Nov 19, 2020
a966d52
added country edge cases
Nov 21, 2020
a2d4ce5
bad dict
Nov 21, 2020
7e27b80
try both title and non-title
Nov 21, 2020
5ab60f3
resolve merge conflicts
jaklinger Nov 25, 2020
48abd56
added config
jaklinger Nov 25, 2020
9cea9cd
fixed test after options removed and individual fields introduced
jaklinger Nov 25, 2020
9065693
needed to clear lru cache between tests
Nov 26, 2020
7dad615
Merge branch '326b_nihagg' of github.com:nestauk/nesta into 326b_nihagg
Nov 26, 2020
Update nih_orm.py
Joel Klinger committed Nov 9, 2020
commit 74f0ded7a82099ed01a548095bddba6ab6570858
8 changes: 8 additions & 0 deletions nesta/core/orms/nih_orm.py
@@ -144,6 +144,14 @@ class TextDuplicate(Base):
almost exact duplicates of each other) or contextually
(if weight > 0.5 it is normally in the same general
subject area).

The cut-off for inclusion in this table is a weight of 0.5,
because the core interest in using this method is
to identify texts which are near duplicates.
Texts which are contextually similar can
also be found by other metrics (topic modelling, etc.), and
using BERT for this can have unexpected side effects,
e.g. finding texts with a similar writing style rather than topic.
"""
__tablename__ = 'nih_duplicates'
application_id_1 = Column(INTEGER, ForeignKey(Projects.application_id),
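
For illustration only, the cut-off described in the docstring above could be applied along these lines before rows reach `nih_duplicates`. This is a minimal sketch, not the code from this PR: `SimilarPair`, `filter_near_duplicates`, the `application_id_2` and `weight` field names, and the choice of a strict `>` comparison at the 0.5 boundary are all assumptions made for the example.

```python
from typing import Iterable, List, NamedTuple

CUTOFF = 0.5  # inclusion threshold quoted in the TextDuplicate docstring


class SimilarPair(NamedTuple):
    """Hypothetical container for one similarity result."""
    application_id_1: int
    application_id_2: int  # assumed name, mirroring application_id_1
    weight: float          # similarity weight, as described in the docstring


def filter_near_duplicates(pairs: Iterable[SimilarPair],
                           cutoff: float = CUTOFF) -> List[SimilarPair]:
    """Keep only the pairs whose weight exceeds the cut-off.

    Whether the boundary value itself should be kept is an assumption;
    this sketch excludes it.
    """
    return [pair for pair in pairs if pair.weight > cutoff]


# Made-up candidate pairs, purely to exercise the filter
candidates = [
    SimilarPair(1, 2, 0.97),  # near duplicate
    SimilarPair(1, 3, 0.62),  # same general subject area
    SimilarPair(2, 4, 0.31),  # below the cut-off, dropped
]
kept = filter_near_duplicates(candidates)
```

Keeping the threshold as a named constant makes it easy to raise later if only near-exact duplicates are wanted.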