Pandas based logistic regression #316

henrydavidge · 2020-12-10T21:07:44Z

What changes are proposed in this pull request?

Moves logic shared by the pandas-based linear and logistic regression to a common file
Adds scaffolding for a pandas-based logistic regression test, with fallback logic for potentially significant variants
Implements a fast multi-phenotype, multi-genotype score test

How is this patch tested?

Unit tests
Integration tests
Manual tests

(Details)

Signed-off-by: Henry D <[email protected]>

codecov · 2020-12-10T21:28:20Z

Codecov Report

Merging #316 (3ea40cd) into master (a63306e) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #316   +/-   ##
=======================================
  Coverage   93.64%   93.64%           
=======================================
  Files          95       95           
  Lines        4814     4814           
  Branches      472      472           
=======================================
  Hits         4508     4508           
  Misses        306      306

karenfeng

I left a couple of high-level questions that came up while I was writing a first cut of the approximate Firth correction.

karenfeng · 2020-12-15T19:04:25Z

python/glow/gwas/logistic_regression.py

+        phenotype_df: pd.DataFrame,
+        covariate_df: pd.DataFrame = pd.DataFrame({}),
+        offset_df: pd.DataFrame = pd.DataFrame({}),
+        # TODO: fallback is probably not the best name


In addition to renaming fallback (I propose correction as an alternative name), we should expose a parameter analogous to pvalue_threshold, below which we perform the correction.
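As a rough illustration of the threshold idea in this comment, the sketch below re-tests only the entries whose score-test p-value falls under a cutoff. The names `apply_correction` and `correct_fn` are hypothetical and are not part of this PR; the lambda stands in for an expensive correction such as approximate Firth.

```python
import numpy as np

def apply_correction(pvalues, pvalue_threshold, correct_fn):
    """Re-test only the entries whose score-test p-value is below the threshold.

    `correct_fn` is a hypothetical callable that receives the indices of the
    selected tests and returns corrected p-values for them.
    """
    pvalues = np.asarray(pvalues, dtype=float)
    out = pvalues.copy()
    needs_correction = np.flatnonzero(pvalues < pvalue_threshold)
    if needs_correction.size > 0:
        out[needs_correction] = correct_fn(needs_correction)
    return out, needs_correction

score_pvalues = np.array([0.8, 0.03, 0.2, 0.001])
# Placeholder correction: returns a constant p-value for every selected test.
corrected, idx = apply_correction(score_pvalues, 0.05, lambda i: np.full(i.size, 0.5))
```

Only the second and fourth tests fall below the cutoff here, so the other p-values pass through unchanged and the costly path runs on a small subset.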

karenfeng · 2020-12-15T19:42:52Z

python/glow/gwas/logistic_regression.py

+    sql_type = gwas_fx._regression_sql_type(dt)
+    genotype_df = gwas_fx._prepare_genotype_df(genotype_df, values_column, sql_type)
+    result_fields = [
+        # TODO: Probably want to put effect size and stderr here for approx-firth


I think we can still calculate effect size and stderr without the corrections, right? As in: https://github.com/rgcgithub/regenie/blob/247483cd5617f048682062553265837c2b95d6ee/src/Data.cpp#L2456

henrydavidge · 2020-12-16

I saw that they compute bhat, but I don't understand what it actually represents. Based on the regenie paper and other resources, "effect size" seems to universally mean the maximum likelihood coefficient for the genotype feature. If that's the case, how could you know the effect size without fitting a model?

karenfeng · 2020-12-21

I believe that effect size here simply refers to the difference between means, which is similar to the t-test statistic (without the scaling).

kianfar77

Looks nice! I had some comments.

kianfar77 · 2020-12-14T23:46:55Z

python/glow/gwas/linear_regression.py

@@ -113,50 +86,23 @@ def linear_regression(genotype_df: DataFrame,
    np.nan_to_num(Y, copy=False)
    _residualize_in_place(Y, Q)

-    if not offset_df.empty:


We probably also need to give an error message when the number of rows in phenotype_df and offset_df do not match, similar to what we do with phenotype_df and covariate_df.

henrydavidge · 2020-12-16

We check that they have the same row index, which is actually stricter.

kianfar77 · 2020-12-16

My bad. I just saw the columns comparison.
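To make the "same row index is stricter than same row count" point concrete, here is a minimal sketch. The helper `check_same_index` is hypothetical and is not the check implemented in this PR; it also assumes a single-level index on `offset_df`.

```python
import pandas as pd

def check_same_index(phenotype_df: pd.DataFrame, offset_df: pd.DataFrame) -> None:
    """Hypothetical helper: raise if the two frames do not share a row index."""
    if offset_df.empty:
        return  # no offset supplied, nothing to validate
    # Index.equals requires the same labels in the same order.
    if not phenotype_df.index.equals(offset_df.index):
        raise ValueError('phenotype_df and offset_df must have the same row index')

pheno = pd.DataFrame({'p1': [0, 1, 1]}, index=['s1', 's2', 's3'])
good_offset = pd.DataFrame({'p1': [0.1, 0.2, 0.3]}, index=['s1', 's2', 's3'])
# Same number of rows as pheno, but the samples are in a different order.
bad_offset = pd.DataFrame({'p1': [0.1, 0.2, 0.3]}, index=['s1', 's3', 's2'])
```

`bad_offset` would pass a row-count comparison but fails the index comparison, which is why the index check subsumes the row-count check.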

kianfar77 · 2020-12-15T00:15:30Z

python/glow/gwas/logistic_regression.py

+
+    On the driver node, we fit a logistic regression model based on the covariates for each
+    phenotype. We broadcast the resulting residuals, gamma vectors
+    (where gamma is defined as y_hat * (1 - y_hat)), and (C.T gamma C)^-1 matrices. In each task,


You probably need to write out the logit linear expression somewhere before this, so that the notation used here makes sense.
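One way the requested notation could be written down is sketched below. This is the standard score test for a single added predictor in logistic regression, not text taken from the PR. The symbols $c_i$ (covariate row), $o_i$ (offset), $g$ (genotype vector) and $\beta_g$ are assumed names; $\gamma$ and $C$ follow the docstring.

```latex
\begin{aligned}
\operatorname{logit}\,\Pr(y_i = 1) &= c_i^\top \beta + o_i + g_i\,\beta_g, \\
\hat{y}_i &= \operatorname{expit}\!\left(c_i^\top \hat{\beta} + o_i\right)
  \quad (\text{fit under } \beta_g = 0), \\
\gamma_i &= \hat{y}_i\,(1 - \hat{y}_i), \qquad \Gamma = \operatorname{diag}(\gamma), \\
U &= g^\top (y - \hat{y}), \\
V &= g^\top \Gamma g - g^\top \Gamma C \left(C^\top \Gamma C\right)^{-1} C^\top \Gamma g, \\
T &= U^2 / V \;\sim\; \chi^2_1 \ \text{under the null.}
\end{aligned}
```

Under this formulation the residuals $y - \hat{y}$, the vector $\gamma$ and the matrix $(C^\top \Gamma C)^{-1}$ depend only on the covariate-only fit, which is consistent with the docstring's description of broadcasting them once per phenotype.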

kianfar77 · 2020-12-15T00:17:03Z

python/glow/gwas/logistic_regression.py

+        genotype_df : Spark DataFrame containing genomic data
+        phenotype_df : Pandas DataFrame containing phenotypic data
+        covariate_df : An optional Pandas DataFrame containing covariates
+        offset_df : An optional Pandas DataFrame containing the phenotype offset. The actual phenotype used


The sentence 'The actual phenotype ...' needs to be adjusted for the logistic regression context.

kianfar77 · 2020-12-15T00:38:54Z

python/glow/gwas/logistic_regression.py

+                   X[y_mask],
+                   family=sm.families.Binomial(),
+                   offset=offset,
+                   missing='ignore')


In the statsmodels documentation (https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.html#statsmodels.genmod.generalized_linear_model.GLM), I see only none, drop, and raise for the missing argument. Does 'ignore' work?

henrydavidge · 2020-12-21

Good catch! It looks like statsmodels falls back to none when given an invalid option, which is actually what I want. I will change it to none.

kianfar77 · 2020-12-15T21:45:14Z

python/glow/gwas/logistic_regression.py

+
+
+@typechecked
+def _assemble_log_reg_state(


nit: you use create for linear regression and assemble here. It would be better to use the same verb in both.

kianfar77 · 2020-12-15T21:46:53Z

python/glow/gwas/logistic_regression.py

+    ])
+    gamma = Y_pred * (1 - Y_pred)
+    CtGammaC = C.T @ (gamma[:, :, None] * C)
+    CtGammaC_inv = np.linalg.inv(CtGammaC)


nit: CtGammaC_inv is used in some places and inv_CtGammaC in others. Can we settle on one?
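For readers following the quoted lines, the sketch below shows how the broadcasted expression yields one matrix per phenotype. The shapes are assumptions rather than facts from the PR: `Y_pred` is taken to be laid out as (phenotypes, samples) and `C` as (samples, covariates), and the data are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_covariates, n_phenotypes = 50, 3, 4

C = rng.normal(size=(n_samples, n_covariates))
# Stand-in for fitted probabilities, one row per phenotype (assumed layout).
Y_pred = rng.uniform(0.1, 0.9, size=(n_phenotypes, n_samples))

gamma = Y_pred * (1 - Y_pred)             # (n_phenotypes, n_samples)
CtGammaC = C.T @ (gamma[:, :, None] * C)  # (n_phenotypes, n_covariates, n_covariates)
CtGammaC_inv = np.linalg.inv(CtGammaC)    # inverts each matrix in the stack

# Reference computation, one phenotype at a time, for comparison.
expected = np.stack([C.T @ np.diag(g) @ C for g in gamma])
```

The broadcasted form avoids building an explicit diagonal matrix for each phenotype, and `np.linalg.inv` accepts the resulting stack of square matrices directly.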

kianfar77 · 2020-12-15T21:51:27Z

python/glow/gwas/logistic_regression.py

+                        `genotype_df` should have a column with this name and a numeric array type. If a column expression
+                        is provided, the expression should return a numeric array type.
+        dt : The numpy datatype to use in the linear regression test. Must be `np.float32` or `np.float64`.
+    '''


Can we add Returns descriptions?

kianfar77

Looks great! Just a couple of nits.

kianfar77 · 2020-12-22T17:55:38Z

python/glow/gwas/log_reg.py

-                    have one or two levels of indexing. If one level, the index should be the same as the `phenotype_df`.
-                    If two levels, the level 0 index should be the same as the `phenotype_df`, and the level 1 index
+        offset_df : An optional Pandas DataFrame containing the phenotype offset. This value will be used
+                    as a offset in the covariate only and per variant logistic regression models. The ``offset_df`` may


typo: a offset

kianfar77 · 2020-12-22T18:08:17Z

python/glow/gwas/log_reg.py

@@ -23,8 +23,8 @@ def logistic_regression(
        phenotype_df: pd.DataFrame,
        covariate_df: pd.DataFrame = pd.DataFrame({}),
        offset_df: pd.DataFrame = pd.DataFrame({}),
-        # TODO: fallback is probably not the best name
-        fallback: str = 'none',  # TODO: Make approx-firth default
+        correction: str = 'none',  # TODO: Make approx-firth default


nit: I think the term correction is misleading for this argument. Something like alternative may make more sense.

karenfeng · 2020-12-22

Correction is the term used in the regenie paper to refer to Firth/SPA.

karenfeng

As discussed offline, can you set a numpy random seed? Otherwise, we can end up with test sets with perfect separation (which we should probably have as well, but only when we expect it).
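A minimal sketch of the seeding idea, assuming NumPy's `default_rng` generator (the legacy `np.random.seed` would serve the same purpose). The helper `is_perfectly_separated` is hypothetical and only handles a single feature; real test data with several covariates would need a linear-separability check instead.

```python
import numpy as np

def is_perfectly_separated(x, y):
    """True if a single threshold on x splits the y == 0 and y == 1 groups."""
    x0, x1 = x[y == 0], x[y == 1]
    if x0.size == 0 or x1.size == 0:
        return True  # only one class present; the fit is degenerate as well
    return x0.max() < x1.min() or x1.max() < x0.min()

rng = np.random.default_rng(42)  # fixed seed: the same data on every run
x = rng.normal(size=30)
y = rng.integers(0, 2, size=30)
```

With a fixed seed, any test set that happens to be separable is at least reproducible, and a deliberately separated case can be constructed by hand when that behavior is what the test is meant to cover.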

* initial work
  Signed-off-by: Henry D <[email protected]>
* add file
  Signed-off-by: Henry D <[email protected]>
* workign score test
  Signed-off-by: Henry D <[email protected]>
* seems to work
  Signed-off-by: Henry D <[email protected]>
* continue
  Signed-off-by: Henry D <[email protected]>
* offset support; more tests
  Signed-off-by: Henry D <[email protected]>
* delete lin_reg.py
  Signed-off-by: Henry D <[email protected]>
* add docs, few more tests
  Signed-off-by: Henry D <[email protected]>
* add test file
  Signed-off-by: Henry D <[email protected]>
* fix last test
  Signed-off-by: Henry D <[email protected]>
* Fix docs, tests
  Signed-off-by: Henry D <[email protected]>
* memory limit
  Signed-off-by: Henry D <[email protected]>
* try explicitly broadcasting
  Signed-off-by: Henry D <[email protected]>
* update environment
  Signed-off-by: Henry D <[email protected]>
* undo explicit broadcast
  Signed-off-by: Henry D <[email protected]>
* fix typo
  Signed-off-by: Henry D <[email protected]>
* formatting; karen's comment
  Signed-off-by: Henry D <[email protected]>
  Signed-off-by: brian cajes <[email protected]>

henrydavidge added 5 commits December 1, 2020 11:12

initial work

6a2a976

Signed-off-by: Henry D <[email protected]>

Merge branch 'master' of github.com:projectglow/glow into logistic-re…

770c621

…gression-pandas

add file

9df2d0f

Signed-off-by: Henry D <[email protected]>

workign score test

95241a3

Signed-off-by: Henry D <[email protected]>

seems to work

ad8a137

Signed-off-by: Henry D <[email protected]>

henrydavidge added 3 commits December 10, 2020 21:44

continue

ce7f08f

Signed-off-by: Henry D <[email protected]>

offset support; more tests

13f28a7

Signed-off-by: Henry D <[email protected]>

delete lin_reg.py

8a9a46b

Signed-off-by: Henry D <[email protected]>

henrydavidge requested a review from karenfeng December 11, 2020 21:07

add docs, few more tests

62562e9

Signed-off-by: Henry D <[email protected]>

henrydavidge changed the title ~~Logistic regression pandas~~ Pandas based logistic regression Dec 12, 2020

henrydavidge added 2 commits December 11, 2020 20:17

add test file

c9dfad1

Signed-off-by: Henry D <[email protected]>

fix last test

d753b19

Signed-off-by: Henry D <[email protected]>

karenfeng requested a review from kianfar77 December 14, 2020 21:36

karenfeng reviewed Dec 15, 2020

kianfar77 requested changes Dec 15, 2020

henrydavidge added 4 commits December 21, 2020 11:57

Fix docs, tests

b48975d

Signed-off-by: Henry D <[email protected]>

memory limit

daa8766

Signed-off-by: Henry D <[email protected]>

try explicitly broadcasting

fd2edd0

Signed-off-by: Henry D <[email protected]>

clean up missingness tests

cc47711

Signed-off-by: Henry D <[email protected]>

henrydavidge requested a review from kianfar77 December 22, 2020 16:25

kianfar77 approved these changes Dec 22, 2020

karenfeng reviewed Dec 22, 2020

henrydavidge added 4 commits December 22, 2020 16:30

update environment

97b0a5a

Signed-off-by: Henry D <[email protected]>

undo explicit broadcast

f15c0eb

Signed-off-by: Henry D <[email protected]>

fix typo

138559c

Signed-off-by: Henry D <[email protected]>

formatting; karen's comment

3ea40cd

Signed-off-by: Henry D <[email protected]>

henrydavidge force-pushed the logistic-regression-pandas branch from e05a351 to 3ea40cd Compare December 23, 2020 15:51

henrydavidge merged commit e1b52e4 into projectglow:master Dec 23, 2020
