This repository has been archived by the owner on Aug 13, 2021. It is now read-only.

[326b] NiH Curation/Aggregation #336

Merged
merged 123 commits into from
Nov 26, 2020

@jaklinger jaklinger commented Nov 6, 2020

Closes #326

Curation and aggregation of NiH data, ready for ingestion into Elasticsearch.

To run do:

luigi --module general_curate CurateTask --dataset nih

from nesta/core/routines/projects/general

Data in curated form looks like:

       application_id: 8925918
base_core_project_num: U19mh095687
                   fy: 2015
             org_city: London
          org_country: United Kingdom
             org_name: London Sch/hygiene & Tropical Medicine
            org_state: NULL
          org_zipcode: Wc1e 7ht
        project_title: Administrative Core
                  phr: (See instructions): The administrative arrangement for SHARE seems to establish a collaboration between Indian and Pakistani institutions with complementary expertise and experience, with clearly defined roles and responsibilities and Pis and communication plans, and presents a unique platform for promoting research to reduce the large treatment gap for mental disorders in the region.
              ic_name: National Institute Of Mental Health
        abstract_text: SHARE is a multi-component, multi-country program whose goal is to establish a collaborative network of institutions to carry out and to utilize research that answers policy relevant questions related to reducing the treatment gap for mental disorders in South Asia. An important component of the Hub is to establish a strong administrative core which can coordinate and provide leadership for all other components of the program. Due to the delicate political context in the region which makes implementation of programs led by either India or Pakistan in the other country difficult, SHARE will function through two distinct administrative cores, one in each country with the primary core based in India. Each will have specific mandates, clearly delineated roles and responsibilities and a leadership plan starting from a PI for each core. The primary SHARE South Asian Core (SHARE-SAC) will be based in the Indian Institute of Public Health of the Public Health Foundation of India. The roles of SHARE-SAC will be to: 1) oversee the day-to-day administration of the Hub's network of partners in all the countries of the region, except Pakistan; 2) be responsible for allocating and monitoring finances in these countries, co-ordinating approaches for multiplier funding and organizing financial reports for each component and for the Hub as a whole; 3) coordinate the implementation of the research component with Sangath, the research implementing organization in India; 4) track progress of each activity according to the original milestones; 5) coordinate the research capacity building activities; 6) coordinate the annual meetings, including the two meetings of other regional Hubs to be hosted by SHARE; 7) coordinate all communications between SHARE partners, in particular with the Pakistan core and the governance groups; and 8) oversee all communications with audiences external to SHARE, including NIMH. 
The partner core in Pakistan (SHARE-PAC ) will: 1) oversee the day-to-day administration of the core's network of partners in Pakistan; 2) ensure the highest standards of fiscal, administrative and research governance of all activities in Pakistan; 3) track progress of each activity according to the original milestones in Pakistan; and 4) coordinate the implementation of the capacity building program and the research component in Pakistan. Clearly defined governance mechanisms and communication and monitoring strategies will ensure smooth coordination of all activities between the two cores and communication between network partners, with other Hubs and the NIMH.
    clinicaltrial_ids: ["Nct02104232", "Nct02111915"]
 clinicaltrial_titles: ["Thinking Healthy Program - Peer Delivered (Pakistan)", "Thinking Healthy Program - Peer Delivered, India (THPP-I)"]
             currency: USD
   fairly_similar_ids: null
   near_duplicate_ids: null
           patent_ids: null
        patent_titles: null
                pmids: [23667345, 23737736, 24054170, 24321171, 24366490, 24632847, 24976552, 25113958, 25847276, 26131019, 26360733, 26450582, 26604001, 26925160, 26985235, 28093575, 28596910, 30686385, 30686386, 30819173]
          project_end: 2018-02-28 00:00:00
        project_start: 2011-09-20 00:00:00
        project_terms: ["Address", "Advocacy", "Advocate", "Affect", "Algorithms", "Area", "Arm", "Asia", "Base", "Budgets", "Car Phone", "Career", "Career Development", "Charge", "Child Health Care", "Civil Society", "Clinical", "Clinical Decision-making", "Cluster Randomized Trial", "Cognitive Therapy", "Cohort", "Collaborations", "Commit", "Communication", "Communities", "Community Health", "Community Setting", "Computer Software", "Conflict (psychology)", "Consensus", "Consultations", "Cost", "Cost Effective", "Country", "Data", "Data Aggregation", "Data Collection", "Data Management", "Data Security", "Data Set", "Design", "Detection", "Development", "Diagnosis", "Diagnostic", "Disability", "Disasters", "Disease", "Dissemination Research", "Distance Learning", "Dsm-iv", "Education", "Educational Aspects", "Effectiveness", "Ensure", "Evaluation", "Event", "Evidence Base", "Evidence Based Intervention", "Evidence Based Treatment", "Experience", "Fellowship", "Flexibility", "Follow-up", "Foundations", "Funding", "Future", "Goals", "Government", "Grant", "Guidelines", "Health", "Health Policy", "Health Services Research", "Health System", "Health Technology", "Health Training", "Healthcare", "Heart", "High Prevalence", "High Risk", "High Standard", "Human Resources", "Implementation Research", "Improve Access", "Improved", "Income", "India", "Individual", "Infant", "Informed Consent", "Innovation", "Institutes", "Institution", "Instruction", "Intention", "International", "Interoperability", "Intervention", "Intervention Program", "Interview", "Journals", "Knowledge", "Knowledge Translation", "Leadership", "Lectures", "Life", "Link", "Maternal And Child Health", "Maternal Depression", "Maternal Health", "Meetings", "Member", "Mental Depression", "Mental Disorders", "Mental Health", "Mental Health Services", "Mentors", "Methodology", "Methods", "Mhealth", "Modeling", "Monitor", "Mothers", "National Institute Of Mental Health", "National Institute Of Mental Health 
(u.s.)", "Neurologic", "Outcome", "Pakistan", "Paper", "Participant", "Pathway Interactions", "Patients", "Peer", "Perinatal", "Peripartum Depression", "Persons", "Phase", "Policies", "Policy Maker", "Population", "Preparation", "Primary Care Physician", "Primary Health Care", "Primary Outcome", "Procedures", "Process", "Professional Counselor", "Programs", "Protocols Documentation", "Provider", "Psyche Structure", "Psychiatry", "Psychologic", "Public Health", "Public Health Medicine (field)", "Public Health Priorities", "Public Health Relevance", "Publishing", "Qualitative Evaluations", "Quality Assurance", "Randomized", "Randomized Controlled Trials", "Randomized Trial", "Reading", "Reporting", "Research", "Research Design", "Research Infrastructure", "Research Personnel", "Research Priority", "Research Project Grants", "Resources", "Role", "Rural", "Rural Population", "Sampling", "Scale Up", "Screening", "Screening Procedure", "Secondary Outcome", "Services", "Site", "Skills", "Social Development", "Solutions", "Source", "South Asian", "Specialist", "Specific Qualifier Value", "Staging", "Substance Abuse Problem", "Success", "Supervision", "Sustainable Development", "Symposium", "Technical Expertise", "Technology", "Telephone", "Testing", "Text", "Thinking", "Thinking, Function", "Time", "Tool", "Training", "Translating", "Translational Research", "Treatment Adherence", "Twin Multiple Birth", "United States National Institutes Of Health", "Update", "Urban Population", "Usability", "Vision", "Web Site", "Woman", "Work", "Writing"]
          grouped_ids: [8212778, 8324442, 8324444, 8324447, 8324448, 8334372, 8380273, 8380275, 8380276, 8380277, 8475816, 8475817, 8475818, 8475819, 8475820, 8546246, 8546247, 8546248, 8546249, 8546250, 8735998, 8735999, 8736000, 8736001, 8736002, 8925917, 8925918, 8925919, 8925920, 8925921, 9125470, 9137841]
       grouped_titles: ["Administrative Core", "Research Capacity Building Component", "Research Component", "Shared Research Project", "South Asian Hub for Advocacy, Research & Education on Mental Health (SHARE)", "South Asian Hub for Advocacy, Research &Education on Mental Health (SHARE)"]
           total_cost: 2907063
     very_similar_ids: null
         yearly_funds: [{"year": 2011, "total_cost": 2907063, "project_end": "2018-02-28T00:00:00", "project_start": "2011-09-20T00:00:00"}, {"year": 2012, "total_cost": null, "project_end": "2013-08-31T00:00:00", "project_start": null}, {"year": 2013, "total_cost": null, "project_end": null, "project_start": null}, {"year": 2014, "total_cost": null, "project_end": null, "project_start": null}, {"year": 2015, "total_cost": null, "project_end": "2018-02-28T00:00:00", "project_start": null}]
       continent_iso2: EU
       continent_name: Europe
          coordinates: {"lat": "51.5073219", "lon": "-0.1276474"}
     country_mentions: ["GB", "IN", "PK"]
                is_eu: 1
                 iso2: GB
           state_name: NULL

@jaklinger jaklinger changed the base branch from dev to 326a_impute November 19, 2020 14:26
@jaklinger jaklinger added this to the EURITO milestone Nov 20, 2020
@jaklinger jaklinger marked this pull request as ready for review November 20, 2020 15:29
@bishax bishax left a comment

A few comments where things are slightly unclear, but good to merge.

nesta/core/orms/general_orm.py
from sqlalchemy.dialects.mysql import TEXT as _TEXT
from functools import partial

TEXT = _TEXT(collation='utf8mb4_unicode_ci')
Contributor

In the ORMs, I missed that the TEXT import was from Nesta and not SQLAlchemy.
Perhaps it should be named something other than TEXT, or by convention be imported as something other than TEXT?

Contributor Author

@jaklinger jaklinger Nov 25, 2020

collation=utf8mb4_[...] is the default in MySQL8+, so this module is effectively deprecated on porting to daps2, so probs not such a big deal!

Comment on lines 62 to 75
# NiH run conditions
nih_pk = NihProject.application_id
nih_core = NihProject.base_core_project_num
nih_is_null = nih_core == None

# Iterate over run params
# Crunchbase
params = (('companies', CrunchbaseOrg.id, None, {}),
          # NiH Core IDs != Null
          ('nih', nih_core, ~nih_is_null,
           {'using_core_ids': True}),
          # NiH Core IDs == Null
          ('nih', nih_pk, nih_is_null,
           {'using_core_ids': False}))
Contributor

Flagging that this section has the potential to become large and messy as more datasets are added.

Contributor Author

agreed, I'll switch to a config setup- one sec and I'll recommit
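
For illustration only, a config-driven version of the run parameters above might look something like the sketch below. This is not the code that was recommitted: `RunConfig`, `RUN_CONFIGS`, `iter_run_params` and the filter names are hypothetical, and the ORM columns are referred to by string so that the example runs without the Nesta ORMs.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class RunConfig:
    """One row of run parameters (hypothetical structure)."""
    dataset: str                       # value passed as --dataset
    id_field: str                      # dotted name of the ORM id column
    filter_name: Optional[str] = None  # name of a filter to apply, if any
    extra_kwargs: Dict[str, Any] = field(default_factory=dict)


# Adding a dataset means adding a row here, not editing the task logic.
RUN_CONFIGS: List[RunConfig] = [
    RunConfig("companies", "CrunchbaseOrg.id"),
    # NiH Core IDs != Null
    RunConfig("nih", "NihProject.base_core_project_num",
              "core_id_is_not_null", {"using_core_ids": True}),
    # NiH Core IDs == Null
    RunConfig("nih", "NihProject.application_id",
              "core_id_is_null", {"using_core_ids": False}),
]


def iter_run_params(
    configs: List[RunConfig], dataset: Optional[str] = None
) -> Iterator[Tuple[str, str, Optional[str], Dict[str, Any]]]:
    """Yield (dataset, id_field, filter_name, kwargs), optionally for one dataset."""
    for cfg in configs:
        if dataset is None or cfg.dataset == dataset:
            # Copy the kwargs so callers cannot mutate the shared config.
            yield (cfg.dataset, cfg.id_field, cfg.filter_name,
                   dict(cfg.extra_kwargs))
```

The same rows could equally live in a YAML or JSON file; the point is only that the per-dataset conditions become data rather than an ever-growing tuple literal.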

        result = pycountry.countries.get(**query)
        if result is not None:
            return result
    except KeyError:
Contributor

Under what conditions is a KeyError raised here?

Contributor Author

If pycountry.countries.get doesn't find an exact match. I'll add a comment to that effect!
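
A minimal sketch of the pattern being discussed, assuming a lookup that either raises `KeyError` or returns `None` when there is no exact match (which of the two `pycountry.countries.get` does may depend on the installed pycountry version). The function name `lookup_country`, the `lookup` parameter and the default field names are hypothetical and not taken from the PR.

```python
from typing import Any, Callable, Iterable, Optional


def lookup_country(
    name: str,
    lookup: Callable[..., Any],
    fields: Iterable[str] = ("name", "official_name", "common_name"),
) -> Optional[Any]:
    """Try an exact-match lookup on each field in turn.

    `lookup` is any callable accepting one keyword argument, for example
    `pycountry.countries.get` (assumed here, not imported).
    """
    for field_name in fields:
        query = {field_name: name}
        try:
            result = lookup(**query)
        except KeyError:
            # No exact match on this field: fall through to the next one.
            continue
        if result is not None:
            return result
    # No field produced an exact match.
    return None
```

With pycountry installed, this would be called as `lookup_country("France", pycountry.countries.get)`; handling both the exception and a `None` return keeps the helper independent of which behaviour the library exhibits.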

Base automatically changed from 326a_impute to dev November 24, 2020 15:52
@jaklinger jaklinger merged commit 8d1c29e into dev Nov 26, 2020
@jaklinger jaklinger deleted the 326b_nihagg branch November 26, 2020 10:31
Development

Successfully merging this pull request may close these issues.

[EURITO] Factor out dataset-specific processing from NiH
2 participants