Documentation for WGR #235

Merged
merged 49 commits into from
Jun 23, 2020

Conversation

karenfeng
Collaborator

What changes are proposed in this pull request?

Creates documentation for WGR.

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

henrydavidge and others added 30 commits May 15, 2020 09:58
Add Leland's demo notebook
…or WGR (#2)

* blocks

Signed-off-by: kianfar77 <[email protected]>

* test vcf

Signed-off-by: kianfar77 <[email protected]>

* transformer

Signed-off-by: kianfar77 <[email protected]>

* remove extra

Signed-off-by: kianfar77 <[email protected]>

* refactor and conform with ridge namings

Signed-off-by: kianfar77 <[email protected]>

* test

Signed-off-by: kianfar77 <[email protected]>

* test files

Signed-off-by: kianfar77 <[email protected]>

* remove extra file

Signed-off-by: kianfar77 <[email protected]>

* sort_key

Signed-off-by: kianfar77 <[email protected]>
* feat: ridge models for wgr added
Signed-off-by: Leland Barnard ([email protected])

Signed-off-by: Leland Barnard <[email protected]>

* Doc strings added for levels/functions.py
Some typos fixed in ridge_model.py
Signed-off-by: Leland Barnard ([email protected])

Signed-off-by: Leland Barnard <[email protected]>

* ridge_model and RidgeReducer unit tests added
Signed-off-by: Leland Barnard ([email protected])

Signed-off-by: Leland Barnard <[email protected]>

* RidgeRegression unit tests added
test data README added
ridge_udfs.py docstrings added
Signed-off-by: Leland Barnard ([email protected])

Signed-off-by: Leland Barnard <[email protected]>

* Changes made to accessing the sample ID map and more docstrings

The map_normal_eqn and score_models functions previously expected the
sample IDs for a given sample block to be found in the Pandas DataFrame,
which meant we had to join them on before the .groupBy().apply().  These
functions now expect the sample block to sample IDs mapping to be
provided separately as a dict, so that the join is no longer required.
RidgeReducer and RidgeRegression APIs remain unchanged.
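
A minimal sketch of the idea described above, in plain pandas: the per-block function looks up its sample IDs in a separately supplied dict rather than expecting them to be joined onto the DataFrame first. The function name `score_block`, the column names, and the data are hypothetical; this is not the actual signature of `map_normal_eqn` or `score_models`.

```python
import pandas as pd

# Hypothetical mapping: sample_block ID -> ordered list of sample IDs.
sample_blocks = {
    "1": ["s1", "s2"],
    "2": ["s3", "s4", "s5"],
}


def score_block(key, pdf, sample_blocks):
    """Attach sample IDs to one block's values via the dict, with no prior join."""
    sample_ids = sample_blocks[key]
    values = pdf["values"].iloc[0]
    if len(values) != len(sample_ids):
        raise ValueError(f"sample_block {key}: expected {len(sample_ids)} values")
    return pd.DataFrame({"sample_block": key, "sample_id": sample_ids, "value": values})


# One row per sample block; "values" holds one number per sample in that block.
blockdf = pd.DataFrame({
    "sample_block": ["1", "2"],
    "values": [[0.0, 1.0], [2.0, 1.0, 0.0]],
})

# Apply the function per group, passing the mapping alongside the data.
scored = pd.concat(
    [score_block(key, pdf, sample_blocks) for key, pdf in blockdf.groupby("sample_block")],
    ignore_index=True,
)
```

Passing the mapping as an argument keeps the grouped DataFrame narrow, which is the stated motivation for dropping the join.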

docstrings have been added for RidgeReducer and RidgeRegression classes.

Signed-off-by: Leland Barnard ([email protected])
Signed-off-by: Leland Barnard <[email protected]>

* Refactored object names and comments to reflect new terminology

Where 'block' was previously used to refer to the set of columns in a
block, we now use 'header_block'
Where 'group' was previously used to refer to the set of samples in a
block, we now use 'sample_block'

Signed-off-by: Leland Barnard ([email protected])
Signed-off-by: Leland Barnard <[email protected]>
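
To illustrate the terminology in this commit message, here is a small NumPy sketch that cuts a toy matrix into blocks, where a header block is a set of columns and a sample block is a set of rows (samples). The helper `split_blocks`, the block sizes, and the integer block IDs are assumptions made for illustration only.

```python
import numpy as np

# Toy matrix: 6 samples (rows) x 4 columns (headers).
X = np.arange(24, dtype=float).reshape(6, 4)


def split_blocks(X, samples_per_block, headers_per_block):
    """Return {(header_block, sample_block): sub-matrix} for a 2-D array."""
    blocks = {}
    n_samples, n_headers = X.shape
    for sb, row_start in enumerate(range(0, n_samples, samples_per_block)):
        for hb, col_start in enumerate(range(0, n_headers, headers_per_block)):
            blocks[(hb, sb)] = X[row_start:row_start + samples_per_block,
                                 col_start:col_start + headers_per_block]
    return blocks


blocks = split_blocks(X, samples_per_block=3, headers_per_block=2)
```

Each key pairs one header block with one sample block, so every cell of the matrix belongs to exactly one block.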
* WIP

Signed-off-by: Karen Feng <[email protected]>

* existing tests pass

Signed-off-by: Karen Feng <[email protected]>

* rename file

Signed-off-by: Karen Feng <[email protected]>

* Add compat test

Signed-off-by: Karen Feng <[email protected]>

* scalafmt

Signed-off-by: Karen Feng <[email protected]>

* collect minimal columns

Signed-off-by: Karen Feng <[email protected]>

* address comments

Signed-off-by: Karen Feng <[email protected]>

* Test fixup

Signed-off-by: Karen Feng <[email protected]>

* Spark 3 needs more recent PyArrow, reduce mem consumption by removing unnecessary caching

Signed-off-by: Karen Feng <[email protected]>

* PyArrow 0.15.1 only with PySpark 3

Signed-off-by: Karen Feng <[email protected]>

* Don't use toPandas()

Signed-off-by: Karen Feng <[email protected]>

* Upgrade pyarrow

Signed-off-by: Karen Feng <[email protected]>

* Only register once

Signed-off-by: Karen Feng <[email protected]>

* Minimize memory usage

Signed-off-by: Karen Feng <[email protected]>

* Select before head

Signed-off-by: Karen Feng <[email protected]>

* set up/tear down

Signed-off-by: Karen Feng <[email protected]>

* Try limiting pyspark memory

Signed-off-by: Karen Feng <[email protected]>

* No teardown

Signed-off-by: Karen Feng <[email protected]>

* Extend timeout

Signed-off-by: Karen Feng <[email protected]>
* WIP

Signed-off-by: Karen Feng <[email protected]>

* existing tests pass

Signed-off-by: Karen Feng <[email protected]>

* rename file

Signed-off-by: Karen Feng <[email protected]>

* Add compat test

Signed-off-by: Karen Feng <[email protected]>

* scalafmt

Signed-off-by: Karen Feng <[email protected]>

* collect minimal columns

Signed-off-by: Karen Feng <[email protected]>

* start changing for readability

* use input label ordering

* rename create_row_indexer

* undo column sort

* change reduce

Signed-off-by: Henry D <[email protected]>

* further simplify reduce

* sorted alpha names

* remove ordering

* comments

Signed-off-by: Henry D <[email protected]>

* Set arrow env var in build

Signed-off-by: Henry D <[email protected]>

* faster sort

* add test file

* undo test data change

* >=

* formatting

* empty

Co-authored-by: Karen Feng <[email protected]>
* yapf

Signed-off-by: Karen Feng <[email protected]>

* yapf transform

Signed-off-by: Karen Feng <[email protected]>

* Set driver memory

Signed-off-by: Karen Feng <[email protected]>

* Try changing spark mem

Signed-off-by: Karen Feng <[email protected]>

* match java tests

Signed-off-by: Karen Feng <[email protected]>

* whoops

Signed-off-by: Karen Feng <[email protected]>

* remove driver memory flag

Signed-off-by: Karen Feng <[email protected]>
* cleanup

Signed-off-by: Karen Feng <[email protected]>

* whoops

Signed-off-by: Karen Feng <[email protected]>

* cleanup

Signed-off-by: Karen Feng <[email protected]>
* WIP

Signed-off-by: Karen Feng <[email protected]>

* WIP

Signed-off-by: Karen Feng <[email protected]>

* WIP

Signed-off-by: Karen Feng <[email protected]>

* WIP

Signed-off-by: Karen Feng <[email protected]>

* WIP

Signed-off-by: Karen Feng <[email protected]>

* whoops

Signed-off-by: Karen Feng <[email protected]>

* tests

Signed-off-by: Karen Feng <[email protected]>

* simplify tests

Signed-off-by: Karen Feng <[email protected]>

* WIP

Signed-off-by: Karen Feng <[email protected]>

* yapf

Signed-off-by: Karen Feng <[email protected]>

* index map compat

Signed-off-by: Karen Feng <[email protected]>

* Add docs

Signed-off-by: Karen Feng <[email protected]>

* Add more tests

Signed-off-by: Karen Feng <[email protected]>

* pass args as ints

Signed-off-by: Karen Feng <[email protected]>

* Don't roll our own splitter

Signed-off-by: Karen Feng <[email protected]>

* rename sample_index to sample_blocks

Signed-off-by: Karen Feng <[email protected]>
* Add type-checking to APIs

Signed-off-by: Karen Feng <[email protected]>

* Check valid alphas

Signed-off-by: Karen Feng <[email protected]>

* check 0 sig

Signed-off-by: Karen Feng <[email protected]>

* Add to install_requires list

Signed-off-by: Karen Feng <[email protected]>

* cleanup comments

Signed-off-by: Karen Feng <[email protected]>
* Added necessary modifications to accommodate covariates in model fitting.

The initial formulation of the WGR model assumed a form y ~ Xb; however, in general we would like to use a model of the form y ~ Ca + Xb, where C is some matrix of covariates that are separate from the genomic features X.  This PR makes numerous changes to accommodate covariate matrix C.

Adding covariates required the following breaking changes to the APIs:
 * indexdf is now a required argument for RidgeReducer.transform() and RidgeRegression.transform():
   * RidgeReducer.transform(blockdf, labeldf, modeldf) -> RidgeReducer.transform(blockdf, labeldf, indexdf, modeldf)
   * RidgeRegression.transform(blockdf, labeldf, model, cvdf) -> RidgeRegression.transform(blockdf, labeldf, indexdf, model, cvdf)

Additionally, the function signatures for the fit and transform methods of RidgeReducer and RidgeRegression have all been updated to accommodate an optional covariate DataFrame as the final argument.

Two new tests have been added to test_ridge_regression.py to test run modes with covariates:
 * test_ridge_reducer_transform_with_cov
 * test_two_level_regression_with_cov

Signed-off-by: Leland Barnard ([email protected])
Signed-off-by: Leland Barnard <[email protected]>
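
The model form y ~ Ca + Xb from the commit message above can be sketched in NumPy as a ridge fit in which only the genomic coefficients b are penalized and the covariate coefficients a are left unpenalized. This is one common formulation, chosen here as an assumption; it is not necessarily how RidgeReducer or RidgeRegression handle covariates internally, and the function name and simulated data are hypothetical.

```python
import numpy as np


def ridge_with_covariates(C, X, y, alpha):
    """Solve min ||y - C a - X b||^2 + alpha * ||b||^2 (no penalty on a)."""
    Z = np.hstack([C, X])
    # Zero penalty for covariate columns, alpha for genomic feature columns.
    penalty = np.concatenate([np.zeros(C.shape[1]), np.full(X.shape[1], alpha)])
    coef = np.linalg.solve(Z.T @ Z + np.diag(penalty), Z.T @ y)
    return coef[:C.shape[1]], coef[C.shape[1]:]


# Simulated, noise-free data so the fit can be checked against known values.
rng = np.random.default_rng(0)
n = 200
C = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
X = rng.normal(size=(n, 5))                            # stand-in genomic features
a_true = np.array([1.0, -2.0])
b_true = np.array([0.5, 0.0, -0.5, 0.0, 0.25])
y = C @ a_true + X @ b_true

a_hat, b_hat = ridge_with_covariates(C, X, y, alpha=1e-8)      # nearly unpenalized
a_shrunk, b_shrunk = ridge_with_covariates(C, X, y, alpha=1e3)  # strong shrinkage on b
```

With a negligible penalty the fit recovers the simulated coefficients; raising alpha shrinks b while a remains free to absorb the covariate effects.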

* Cleaned up one unnecessary Pandas import
Signed-off-by: Leland Barnard ([email protected])

Signed-off-by: Leland Barnard <[email protected]>

* Small changes for clarity and consistency with the rest of the code.
Signed-off-by: Leland Barnard ([email protected])

Signed-off-by: Leland Barnard <[email protected]>

* Forgot one usage of coalesce
Signed-off-by: Leland Barnard ([email protected])

Signed-off-by: Leland Barnard <[email protected]>

* Added a couple of comments to explain logic and replaced usages of .values with .array
Signed-off-by: Leland Barnard ([email protected])

Signed-off-by: Leland Barnard <[email protected]>

* Fixed one instance of the change .values -> .array where it was made in error.
Signed-off-by: Leland Barnard ([email protected])

Signed-off-by: Leland Barnard <[email protected]>

* Typo in test_ridge_regression.py.
Signed-off-by: Leland Barnard ([email protected])

Signed-off-by: Leland Barnard <[email protected]>

* Style auto-updates with yapfAll
Signed-off-by: Leland Barnard ([email protected])

Signed-off-by: Leland Barnard <[email protected]>

Co-authored-by: Leland Barnard <[email protected]>
Co-authored-by: Karen Feng <[email protected]>
* WIP

Signed-off-by: Karen Feng <[email protected]>

* Clean up tests

Signed-off-by: Karen Feng <[email protected]>

* WIP

Signed-off-by: Karen Feng <[email protected]>

* Order to match labeldf

Signed-off-by: Karen Feng <[email protected]>

* Check we tie-break

Signed-off-by: Karen Feng <[email protected]>

* cleanup

Signed-off-by: Karen Feng <[email protected]>

* tests

Signed-off-by: Karen Feng <[email protected]>

* test var name

Signed-off-by: Karen Feng <[email protected]>

* clean up tests

Signed-off-by: Karen Feng <[email protected]>

* Clean up docs

Signed-off-by: Karen Feng <[email protected]>
karenfeng and others added 11 commits June 22, 2020 08:06
* Rename levels to wgr

Signed-off-by: Karen Feng <[email protected]>

* rename test files

Signed-off-by: Karen Feng <[email protected]>
* headers

* executable

* fix template rendering

* yapf
Signed-off-by: Karen Feng <[email protected]>

codecov bot commented Jun 22, 2020

Codecov Report

Merging #235 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #235   +/-   ##
=======================================
  Coverage   93.75%   93.75%           
=======================================
  Files          90       90           
  Lines        4339     4339           
  Branches      406      406           
=======================================
  Hits         4068     4068           
  Misses        271      271           

Last update 9d3ad87...75ffd4c.

Contributor

@williambrandler left a comment

some comments and clarifications!

docs/source/tertiary/whole-genome-regression.rst (outdated, resolved)

The genotype data may be read from any variant datasource supported by Glow, such as VCF, BGEN or PLINK. The DataFrame
must also include a column ``values`` containing a numeric representation of each genotype. The genotypic values may
not be missing, or equal for every sample in a variant.
Contributor

what does equal mean here? All homozygous reference?

Collaborator Author

Mathematically, we're trying to filter out variants for which all samples have the same calls, so that ``values`` has a variance/stddev of 0 (e.g.,
all hom-ref, all hom-alt, or even all het). I'm not sure what the best way to phrase this is.
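
The condition described in this reply can be sketched in NumPy as follows. The check compares the minimum and maximum of a variant's values, which is equivalent to testing for zero variance but avoids floating-point rounding in the standard deviation. The helper name and the toy data are hypothetical.

```python
import numpy as np


def has_variation(values):
    """True if at least two samples differ, i.e. the variance is non-zero."""
    values = np.asarray(values, dtype=float)
    return bool(values.max() > values.min())


# One entry per variant; each list holds one genotype value per sample.
variants = {
    "all_hom_ref": [0, 0, 0, 0],
    "all_het": [1, 1, 1, 1],
    "all_hom_alt": [2, 2, 2, 2],
    "mixed": [0, 1, 2, 0],
}

# Keep only variants whose values are not identical across samples.
kept = [name for name, values in variants.items() if has_variation(values)]
```

Note that this rule drops the all-het and all-hom-alt variants as well as the all-hom-ref one, which is the distinction raised in the question above.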

- Split multiallelic variants with the ``split_multiallelics`` transformer.
- Calculate the number of alternate alleles for biallelic variants with ``glow.genotype_states``.
- Replace any missing values with the mean of the non-missing values using ``glow.mean_substitute``.
- Filter out all homozygous SNPs.
Contributor

Filter out all SNPs that contain zero non-reference alleles
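
The preprocessing steps quoted above can be illustrated for a single biallelic variant with plain NumPy. These helpers are stand-ins written for this example and are not Glow's ``glow.genotype_states`` or ``glow.mean_substitute``; in particular, the use of -1 as the missing-value sentinel is an assumption.

```python
import numpy as np

MISSING = -1  # assumed sentinel for a missing call or state


def alt_allele_counts(calls):
    """Sum the allele calls per sample; mark the sample missing if any call is missing."""
    calls = np.asarray(calls)
    missing = (calls == MISSING).any(axis=1)
    return np.where(missing, MISSING, calls.sum(axis=1)).astype(float)


def fill_missing_with_mean(states):
    """Replace missing states with the mean of the non-missing states."""
    states = np.asarray(states, dtype=float)
    observed = states != MISSING
    if not observed.any():
        return states
    return np.where(observed, states, states[observed].mean())


# Four samples at one variant: hom-ref, het, hom-alt, and a missing call.
calls = [[0, 0], [0, 1], [1, 1], [-1, -1]]
states = alt_allele_counts(calls)
values = fill_missing_with_mean(states)

# The reviewer's suggested wording: drop variants with zero non-reference alleles.
has_alt = bool((states[states != MISSING] > 0).any())
```

Here the missing sample receives the mean of the three observed counts, and the variant is retained because at least one sample carries an alternate allele.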

docs/source/tertiary/whole-genome-regression.rst (outdated, resolved)
The fields in the model DataFrame are:

- ``header_block``: An ID assigned to the block x0 corresponding to the coefficients in this row.
- ``sample_block``: An ID assigned to the block x0 corresponding to the coefficients in this row.
Contributor

header_block and sample_block have the same description?

docs/source/tertiary/whole-genome-regression.rst (outdated, resolved)
Contributor

@williambrandler left a comment

Is it worth having a comment up front that GlowGR only supports quantitative phenotypes for now, and we plan to implement binary traits in the near future?

Otherwise LGTM

@karenfeng
Collaborator Author

> Is it worth having a comment up front that GlowGR only supports quantitative phenotypes for now, and we plan to implement binary traits in the near future?
>
> Otherwise LGTM

I added a note that this only supports quantitative phenotypes. I'm going to avoid making promises in our docs.

Signed-off-by: Karen Feng <[email protected]>
Contributor

@henrydavidge left a comment

Looks awesome! Thanks @karenfeng !

@karenfeng merged commit e0680a7 into master on Jun 23, 2020
@henrydavidge deleted the wgr-docs branch on August 5, 2020 20:58
5 participants