Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pandas based logistic regression #316

Merged

Conversation

henrydavidge
Copy link
Contributor

@henrydavidge henrydavidge commented Dec 10, 2020

What changes are proposed in this pull request?

  • Moves logic shared by pandas based linear and logistic regression to a common file
  • Adds scaffolding for a pandas based logistic regression test with fallback logic for potentially significant variants
  • Implements a fast multi-pheno, multi-geno score test

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

(Details)

@codecov
Copy link

codecov bot commented Dec 10, 2020

Codecov Report

Merging #316 (3ea40cd) into master (a63306e) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #316   +/-   ##
=======================================
  Coverage   93.64%   93.64%           
=======================================
  Files          95       95           
  Lines        4814     4814           
  Branches      472      472           
=======================================
  Hits         4508     4508           
  Misses        306      306           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a63306e...3ea40cd. Read the comment docs.

@henrydavidge henrydavidge changed the title Logistic regression pandas Pandas based logistic regression Dec 12, 2020
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
Copy link
Collaborator

@karenfeng karenfeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of high-level questions I had when writing a first cut for the approximate firth correction.

phenotype_df: pd.DataFrame,
covariate_df: pd.DataFrame = pd.DataFrame({}),
offset_df: pd.DataFrame = pd.DataFrame({}),
# TODO: fallback is probably not the best name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to fallback (I propose correction as an alternative name), we should expose a parallel to pvalue_threshold below which we perform the correction.

sql_type = gwas_fx._regression_sql_type(dt)
genotype_df = gwas_fx._prepare_genotype_df(genotype_df, values_column, sql_type)
result_fields = [
# TODO: Probably want to put effect size and stderr here for approx-firth
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can still calculate effect size and stderr without the corrections, right? As in: https://github.com/rgcgithub/regenie/blob/247483cd5617f048682062553265837c2b95d6ee/src/Data.cpp#L2456

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw that they compute bhat, but I don't understand what they actually represent. Based on the regenie paper and other resources, it seems that "effect size" universally corresponds to the maximum likelihood coefficient for the genotype feature. If that's the case, how could you know the effect size without fitting a model?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that effect size here simply refers to the difference between means, which is similar to the t-test stat (without the scaling).

Copy link
Collaborator

@kianfar77 kianfar77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice! I had some comments.

@@ -113,50 +86,23 @@ def linear_regression(genotype_df: DataFrame,
np.nan_to_num(Y, copy=False)
_residualize_in_place(Y, Q)

if not offset_df.empty:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably also need to give an error message when the number of rows in phenotype_df and offset_df do not match, similar to what we do with phenotype_df and covariate_df.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We check that they have the same row index, which is actually more strict.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. I just saw the columns comparison.


On the driver node, we fit a logistic regression model based on the covariates for each
phenotype. We broadcast the resulting residuals, gamma vectors
(where gamma is defined as y_hat * (1 - y_hat)), and (C.T gamma C)^-1 matrices. In each task,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably need to write the logit linear expression somewhere before this for notations you use here to make sense.

genotype_df : Spark DataFrame containing genomic data
phenotype_df : Pandas DataFrame containing phenotypic data
covariate_df : An optional Pandas DataFrame containing covariates
offset_df : An optional Pandas DataFrame containing the phenotype offset. The actual phenotype used
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sentence 'The actual phenotype ...` needs to be adjusted for logistic regression context.

X[y_mask],
family=sm.families.Binomial(),
offset=offset,
missing='ignore')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In sm documentation https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.html#statsmodels.genmod.generalized_linear_model.GLM, I see none, drop, and raise for the missing argument. Does 'ignore' work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! It looks like if you specify an invalid option, statsmodels defaults to none, which is actually what I want. I will change to none.



@typechecked
def _assemble_log_reg_state(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you use create for linear regression and assemble here. Perhaps better to use same verb in both.

])
gamma = Y_pred * (1 - Y_pred)
CtGammaC = C.T @ (gamma[:, :, None] * C)
CtGammaC_inv = np.linalg.inv(CtGammaC)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: some places CtGammaC_inv is used and some other places inv_CtGammaC, can we fix on one?

`genotype_df` should have a column with this name and a numeric array type. If a column expression
is provided, the expression should return a numeric array type.
dt : The numpy datatype to use in the linear regression test. Must be `np.float32` or `np.float64`.
'''
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add Returns descriptions?

Copy link
Collaborator

@kianfar77 kianfar77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just a couple of nits.

have one or two levels of indexing. If one level, the index should be the same as the `phenotype_df`.
If two levels, the level 0 index should be the same as the `phenotype_df`, and the level 1 index
offset_df : An optional Pandas DataFrame containing the phenotype offset. This value will be used
as a offset in the covariate only and per variant logistic regression models. The ``offset_df`` may
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: a offset

@@ -23,8 +23,8 @@ def logistic_regression(
phenotype_df: pd.DataFrame,
covariate_df: pd.DataFrame = pd.DataFrame({}),
offset_df: pd.DataFrame = pd.DataFrame({}),
# TODO: fallback is probably not the best name
fallback: str = 'none', # TODO: Make approx-firth default
correction: str = 'none', # TODO: Make approx-firth default
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think the term correction is misleading for this argument. Something like alternative may make more sense.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction is the term used in the regenie paper to refer to Firth/SPA.

Copy link
Collaborator

@karenfeng karenfeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, can you set a numpy random seed? Otherwise, we can end up with test sets with perfect separation (which we should probably have as well, but only when we expect it).

@henrydavidge henrydavidge merged commit e1b52e4 into projectglow:master Dec 23, 2020
bcajes pushed a commit to bcajes/glow that referenced this pull request Sep 27, 2021
* initial work

Signed-off-by: Henry D <[email protected]>

* add file

Signed-off-by: Henry D <[email protected]>

* workign score test

Signed-off-by: Henry D <[email protected]>

* seems to work

Signed-off-by: Henry D <[email protected]>

* continue

Signed-off-by: Henry D <[email protected]>

* offset support; more tests

Signed-off-by: Henry D <[email protected]>

* delete lin_reg.py

Signed-off-by: Henry D <[email protected]>

* add docs, few more tests

Signed-off-by: Henry D <[email protected]>

* add test file

Signed-off-by: Henry D <[email protected]>

* fix last test

Signed-off-by: Henry D <[email protected]>

* Fix docs, tests

Signed-off-by: Henry D <[email protected]>

* memory limit

Signed-off-by: Henry D <[email protected]>

* try explicitly broadcasting

Signed-off-by: Henry D <[email protected]>

* update environment

Signed-off-by: Henry D <[email protected]>

* undo explicit broadcast

Signed-off-by: Henry D <[email protected]>

* fix typo

Signed-off-by: Henry D <[email protected]>

* f97b0a5aee82445baa8bb4770a4a7ed0437dc6b13ormatting; karen's comment

Signed-off-by: Henry D <[email protected]>
Signed-off-by: brian cajes <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants