-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pandas based logistic regression #316
Pandas based logistic regression #316
Conversation
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #316 +/- ##
=======================================
Coverage 93.64% 93.64%
=======================================
Files 95 95
Lines 4814 4814
Branches 472 472
=======================================
Hits 4508 4508
Misses 306 306 Continue to review full report at Codecov.
|
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a couple of high-level questions I had when writing a first cut for the approximate firth correction.
phenotype_df: pd.DataFrame, | ||
covariate_df: pd.DataFrame = pd.DataFrame({}), | ||
offset_df: pd.DataFrame = pd.DataFrame({}), | ||
# TODO: fallback is probably not the best name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to fallback
(I propose correction
as an alternative name), we should expose a parallel to pvalue_threshold
below which we perform the correction.
sql_type = gwas_fx._regression_sql_type(dt) | ||
genotype_df = gwas_fx._prepare_genotype_df(genotype_df, values_column, sql_type) | ||
result_fields = [ | ||
# TODO: Probably want to put effect size and stderr here for approx-firth |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can still calculate effect size and stderr without the corrections, right? As in: https://github.com/rgcgithub/regenie/blob/247483cd5617f048682062553265837c2b95d6ee/src/Data.cpp#L2456
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw that they compute bhat
, but I don't understand what they actually represent. Based on the regenie paper and other resources, it seems that "effect size" universally corresponds to the maximum likelihood coefficient for the genotype feature. If that's the case, how could you know the effect size without fitting a model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that effect size here simply refers to the difference between means, which is similar to the t-test stat (without the scaling).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice! I had some comments.
@@ -113,50 +86,23 @@ def linear_regression(genotype_df: DataFrame, | |||
np.nan_to_num(Y, copy=False) | |||
_residualize_in_place(Y, Q) | |||
|
|||
if not offset_df.empty: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably also need to give an error message when the number of rows in phenotype_df
and offset_df
do not match, similar to what we do with phenotype_df
and covariate_df
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We check that they have the same row index, which is actually more strict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad. I just saw the columns comparison.
|
||
On the driver node, we fit a logistic regression model based on the covariates for each | ||
phenotype. We broadcast the resulting residuals, gamma vectors | ||
(where gamma is defined as y_hat * (1 - y_hat)), and (C.T gamma C)^-1 matrices. In each task, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably need to write the logit linear expression somewhere before this for notations you use here to make sense.
genotype_df : Spark DataFrame containing genomic data | ||
phenotype_df : Pandas DataFrame containing phenotypic data | ||
covariate_df : An optional Pandas DataFrame containing covariates | ||
offset_df : An optional Pandas DataFrame containing the phenotype offset. The actual phenotype used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sentence 'The actual phenotype ...` needs to be adjusted for logistic regression context.
X[y_mask], | ||
family=sm.families.Binomial(), | ||
offset=offset, | ||
missing='ignore') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In sm documentation https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.html#statsmodels.genmod.generalized_linear_model.GLM, I see none
, drop
, and raise
for the missing
argument. Does 'ignore' work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! It looks like if you specify an invalid option, statsmodels defaults to none
, which is actually what I want. I will change to none
.
|
||
|
||
@typechecked | ||
def _assemble_log_reg_state( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you use create
for linear regression and assemble
here. Perhaps better to use same verb in both.
]) | ||
gamma = Y_pred * (1 - Y_pred) | ||
CtGammaC = C.T @ (gamma[:, :, None] * C) | ||
CtGammaC_inv = np.linalg.inv(CtGammaC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: some places CtGammaC_inv
is used and some other places inv_CtGammaC
, can we fix on one?
`genotype_df` should have a column with this name and a numeric array type. If a column expression | ||
is provided, the expression should return a numeric array type. | ||
dt : The numpy datatype to use in the linear regression test. Must be `np.float32` or `np.float64`. | ||
''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add Returns
descriptions?
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Just a couple of nits.
python/glow/gwas/log_reg.py
Outdated
have one or two levels of indexing. If one level, the index should be the same as the `phenotype_df`. | ||
If two levels, the level 0 index should be the same as the `phenotype_df`, and the level 1 index | ||
offset_df : An optional Pandas DataFrame containing the phenotype offset. This value will be used | ||
as a offset in the covariate only and per variant logistic regression models. The ``offset_df`` may |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: a offset
@@ -23,8 +23,8 @@ def logistic_regression( | |||
phenotype_df: pd.DataFrame, | |||
covariate_df: pd.DataFrame = pd.DataFrame({}), | |||
offset_df: pd.DataFrame = pd.DataFrame({}), | |||
# TODO: fallback is probably not the best name | |||
fallback: str = 'none', # TODO: Make approx-firth default | |||
correction: str = 'none', # TODO: Make approx-firth default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think the term correction
is misleading for this argument. Something like alternative
may make more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correction is the term used in the regenie paper to refer to Firth/SPA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, can you set a numpy random seed? Otherwise, we can end up with test sets with perfect separation (which we should probably have as well, but only when we expect it).
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
Signed-off-by: Henry D <[email protected]>
e05a351
to
3ea40cd
Compare
* initial work Signed-off-by: Henry D <[email protected]> * add file Signed-off-by: Henry D <[email protected]> * workign score test Signed-off-by: Henry D <[email protected]> * seems to work Signed-off-by: Henry D <[email protected]> * continue Signed-off-by: Henry D <[email protected]> * offset support; more tests Signed-off-by: Henry D <[email protected]> * delete lin_reg.py Signed-off-by: Henry D <[email protected]> * add docs, few more tests Signed-off-by: Henry D <[email protected]> * add test file Signed-off-by: Henry D <[email protected]> * fix last test Signed-off-by: Henry D <[email protected]> * Fix docs, tests Signed-off-by: Henry D <[email protected]> * memory limit Signed-off-by: Henry D <[email protected]> * try explicitly broadcasting Signed-off-by: Henry D <[email protected]> * update environment Signed-off-by: Henry D <[email protected]> * undo explicit broadcast Signed-off-by: Henry D <[email protected]> * fix typo Signed-off-by: Henry D <[email protected]> * f97b0a5aee82445baa8bb4770a4a7ed0437dc6b13ormatting; karen's comment Signed-off-by: Henry D <[email protected]> Signed-off-by: brian cajes <[email protected]>
What changes are proposed in this pull request?
How is this patch tested?
(Details)