Add documentation for Hail interoperation #310
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
Hail Interoperation Functions | ||
----------------------------- | ||
|
||
Glow includes functionality to enable interoperation with `Hail <https://hail.is/>`_. | ||
|
||
.. automodule:: glow.hail.functions | ||
:members: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
=================== | ||
Hail Interoperation | ||
=================== | ||
|
||
.. invisible-code-block: python | ||
|
||
import glow | ||
import hail as hl | ||
hl.init(spark.sparkContext, idempotent=True, quiet=True) | ||
glow.register(spark) | ||
|
||
vcf = 'test-data/NA12878_21_10002403.vcf' | ||
mt = hl.import_vcf(vcf) | ||
|
||
Glow can convert a | ||
`Hail MatrixTable <https://hail.is/docs/0.2/overview/matrix_table.html>`_ into a Spark DataFrame, similar to one created | ||
with the :ref:`native Glow datasources <variant_data>`. | ||
|
||
Create a Hail cluster | ||
===================== | ||
|
||
To use the Hail interoperation functions, Hail must be installed on the cluster. | ||
On a Databricks cluster, | ||
`install Hail with an environment variable <https://docs.databricks.com/applications/genomics/tertiary/hail.html#create-a-hail-cluster>`_. | ||
See the `Hail installation documentation <https://hail.is/docs/0.2/getting_started.html>`_ to install Hail in other setups. | ||
|
||
Convert to a Glow DataFrame | ||
=========================== | ||
|
||
Convert from a Hail MatrixTable to a Glow-compatible DataFrame with the function ``from_matrix_table``. | ||
|
||
.. code-block:: python | ||
|
||
from glow.hail import functions | ||
df = functions.from_matrix_table(mt, include_sample_ids=True) | ||
|
||
.. invisible-code-block: python | ||
|
||
from pyspark.sql import Row | ||
native_glow_df = spark.read.format('vcf').load(vcf).drop('splitFromMultiAllelic') | ||
assert_rows_equal(df.head(), native_glow_df.head()) | ||
|
||
By default, the genotypes contain sample IDs. To remove the sample IDs, set the parameter ``include_sample_ids=False``. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -101,7 +101,7 @@ def _get_base_cols(row: StructExpression) -> List[Column]: | |
if 'rsid' in row and row.rsid.dtype == tstr: | ||
names_elems.append("rsid") | ||
names_col = fx.expr( | ||
f"filter(nullif(array({','.join(names_elems)}), array()), n -> isnotnull(n))").alias("names") | ||
f"nullif(filter(array({','.join(names_elems)}), n -> isnotnull(n)), array())").alias("names") | ||
|
||
reference_allele_col = fx.element_at("alleles", 1).alias("referenceAllele") | ||
|
||
|
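The hunk above swaps the order of `filter` and `nullif`. A minimal plain-Python sketch of why the order matters follows. It only emulates the two Spark SQL expressions on Python lists; the helper names (`nullif`, `filter_not_null`, `names_old`, `names_new`) are hypothetical and not part of Glow or Spark.

```python
from typing import List, Optional


def nullif(value: Optional[list], sentinel: list) -> Optional[list]:
    """Emulates SQL nullif: return None when value equals sentinel."""
    return None if value == sentinel else value


def filter_not_null(values: Optional[list]) -> Optional[list]:
    """Emulates filter(arr, n -> isnotnull(n)); a null array stays null."""
    if values is None:
        return None
    return [v for v in values if v is not None]


def names_old(elems: List[Optional[str]]) -> Optional[list]:
    # Old order: nullif runs before the nulls are dropped, so an array
    # holding only nulls is not empty yet and ends up as [].
    return filter_not_null(nullif(list(elems), []))


def names_new(elems: List[Optional[str]]) -> Optional[list]:
    # New order: drop nulls first, then turn an empty result into None.
    return nullif(filter_not_null(list(elems)), [])


if __name__ == "__main__":
    print(names_old([None, None]))   # all-missing names under the old order
    print(names_new([None, None]))   # all-missing names under the new order
    print(names_new(["rs123", None]))
```

Under this emulation, a row whose candidate names are all missing yields an empty list with the old order and a null with the new order, which appears to be the behavior the change is after.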
@@ -123,11 +123,9 @@ def _get_other_cols(row: StructExpression) -> List[Column]: | |
if 'qual' in row and row.qual.dtype == tfloat64: | ||
# -10 qual means missing | ||
other_cols.append(fx.expr("if(qual = -10, null, qual)").alias("qual")) | ||
# null filters means missing, [] filters means PASS | ||
# [] filters means PASS, null filters means missing | ||
if 'filters' in row and row.filters.dtype == tset(tstr): | ||
other_cols.append( | ||
fx.expr("if(size(filters) = 0, array('PASS'), if(isnull(filters), array(), filters))"). | ||
alias("filters")) | ||
other_cols.append(fx.expr("if(size(filters) = 0, array('PASS'), filters)").alias("filters")) | ||
# Rename info.* columns to INFO_* | ||
if 'info' in row and isinstance(row.info.dtype, tstruct): | ||
for f in row.info: | ||
|
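The `qual` and `filters` handling in the hunk above can be sketched in plain Python. This is an emulation under the assumption that a null `filters` value does not satisfy `size(filters) = 0` in Spark SQL, so it falls through unchanged; `convert_qual` and `convert_filters` are hypothetical names used only for illustration.

```python
from typing import List, Optional


def convert_qual(qual: Optional[float]) -> Optional[float]:
    """Emulates if(qual = -10, null, qual): -10 marks a missing QUAL."""
    return None if qual == -10 else qual


def convert_filters(filters: Optional[List[str]]) -> Optional[List[str]]:
    """Emulates if(size(filters) = 0, array('PASS'), filters).

    An empty collection means the variant passed all filters; a null
    value means the filters are missing and is returned unchanged.
    """
    if filters is not None and len(filters) == 0:
        return ["PASS"]
    return filters


if __name__ == "__main__":
    print(convert_qual(-10))
    print(convert_filters([]))
    print(convert_filters(None))
    print(convert_filters(["LowQual"]))
```

The simplified expression drops the explicit `isnull` branch from the earlier version, so missing filters stay null instead of becoming an empty array.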
@@ -156,6 +154,8 @@ def from_matrix_table(mt: MatrixTable, include_sample_ids: bool = True) -> DataF | |
""" | ||
Converts a Hail MatrixTable to a Glow DataFrame. | ||
Could you describe what the schema will look like? How is it translated from the Hail schema?
||
|
||
Requires that the MatrixTable rows contain locus and alleles fields. | ||
|
||
Args: | ||
mt : The Hail MatrixTable to convert | ||
include_sample_ids : If true, include sample IDs in the Glow DataFrame | ||
nit: could you document why you might want this to be false?
||
|
nit: make this into a variable `hailPaths` in case we have more Hail-related docs in the future