Add documentation for Hail interoperation #310

Merged: 8 commits, Nov 25, 2020
12 changes: 8 additions & 4 deletions build.sbt
@@ -223,7 +223,11 @@ ThisBuild / installHail := {

lazy val uninstallHail = taskKey[Unit]("Uninstall Hail")
ThisBuild / uninstallHail := {
"conda env remove --name hail" ### "rm -rf hail" !
Seq(
"/bin/bash",
"-c",
"conda env remove --name hail;" + "rm -rf hail"
) !
}

lazy val sparkClasspath = taskKey[String]("sparkClasspath")
@@ -301,11 +305,11 @@ lazy val python =
)
.dependsOn(core % "test->test")

-lazy val hail = (project in file("python/glow/hail"))
+lazy val hail = (project in file("hail"))
.settings(
pythonSettings,
test in Test := {
-hailtest.toTask(" --doctest-modules python/glow/hail/").value
+hailtest.toTask(" --doctest-modules python/glow/hail/ docs/source/etl/hail.rst").value
Review comment (Contributor): nit: make this into a variable hailPaths in case we have more hail related docs in the future

}
)
.dependsOn(core % "test->test", python)
@@ -314,7 +318,7 @@ lazy val docs = (project in file("docs"))
.settings(
pythonSettings,
test in Test := {
-pytest.toTask(" docs").value
+pytest.toTask(" --ignore=docs/source/etl/hail.rst docs").value
}
)
.dependsOn(core % "test->test", python)
7 changes: 7 additions & 0 deletions docs/source/api-docs/hail-functions.rst
@@ -0,0 +1,7 @@
Hail Interoperation Functions
-----------------------------

Glow includes functionality to enable interoperation with `Hail <https://hail.is/>`_.

.. automodule:: glow.hail.functions
:members:
1 change: 1 addition & 0 deletions docs/source/api-docs/index.rst
@@ -8,3 +8,4 @@ Glow's Python API is designed to work seamlessly with PySpark and other tools in
toplevel-functions
pyspark-functions
glowgr
+hail-functions
43 changes: 43 additions & 0 deletions docs/source/etl/hail.rst
@@ -0,0 +1,43 @@
===================
Hail Interoperation
===================

.. invisible-code-block: python

import glow
import hail as hl
hl.init(spark.sparkContext, idempotent=True, quiet=True)
glow.register(spark)

vcf = 'test-data/NA12878_21_10002403.vcf'
mt = hl.import_vcf(vcf)

Glow includes functionality to enable conversion between a
`Hail MatrixTable <https://hail.is/docs/0.2/overview/matrix_table.html>`_ and a Spark DataFrame, similar to one created
with the :ref:`native Glow datasources <variant_data>`.

Create a Hail cluster
=====================

To use the Hail interoperation functions, you need Hail to be installed on the cluster.
On a Databricks cluster,
`install Hail with an environment variable <https://docs.databricks.com/applications/genomics/tertiary/hail.html#create-a-hail-cluster>`_.
See the `Hail installation documentation <https://hail.is/docs/0.2/getting_started.html>`_ to install Hail in other setups.
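
Once Hail is installed, initializing it against the running Spark session looks like the following sketch (this mirrors the doctest preamble used for this page; ``spark`` is the active Spark session):

.. code-block:: python

    import glow
    import hail as hl

    # Attach Hail to the existing Spark context; idempotent=True makes the call
    # safe to repeat if Hail has already been initialized on the cluster
    hl.init(spark.sparkContext, idempotent=True, quiet=True)
    glow.register(spark)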

Convert to a Glow DataFrame
===========================

Convert from a Hail MatrixTable to a Glow-compatible DataFrame with the function ``from_matrix_table``.

.. code-block:: python

from glow.hail import functions
df = functions.from_matrix_table(mt, include_sample_ids=True)

.. invisible-code-block: python

from pyspark.sql import Row
native_glow_df = spark.read.format('vcf').load(vcf).drop('splitFromMultiAllelic')
assert_rows_equal(df.head(), native_glow_df.head())

By default, the genotypes contain sample IDs. To remove the sample IDs, set the parameter ``include_sample_ids=False``.
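
As a minimal sketch (``mt`` is the MatrixTable imported above; one possible motivation for dropping sample IDs, assumed here rather than stated in this page, is keeping identifiers out of downstream data):

.. code-block:: python

    from glow.hail import functions

    # Convert without sample IDs in the genotypes array
    df_no_ids = functions.from_matrix_table(mt, include_sample_ids=False)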
1 change: 1 addition & 0 deletions docs/source/etl/index.rst
@@ -20,4 +20,5 @@ enabling seamless manipulation, filtering, quality control and transformation be
variant-normalization
variant-splitter
merge
+hail
utility-functions
2 changes: 1 addition & 1 deletion docs/source/etl/lift-over.rst
@@ -43,7 +43,7 @@ you can use to download the required file for liftOver from the b37 to the hg38
Coordinate liftOver
====================

-To perform liftOver for genomic coordinates, use the function ``lift_over_coordinates``. ``lift_over_coordinates``, which has
+To perform liftOver for genomic coordinates, use the function ``lift_over_coordinates``. ``lift_over_coordinates`` has
the following parameters.

- chromosome: ``string``
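
As a hedged sketch of calling this function through Spark SQL once ``glow.register(spark)`` has run (the rest of the parameter list is truncated in this excerpt; the input DataFrame, its column names, and the chain file path below are assumptions):

.. code-block:: python

    # input_df is assumed to have contigName, start and end columns; the chain
    # file path is an assumed local copy from the download step above
    lifted_df = input_df.selectExpr(
        "*",
        "lift_over_coordinates(contigName, start, end, '/tmp/b37ToHg38.over.chain') AS lifted")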
10 changes: 5 additions & 5 deletions python/glow/hail/functions.py
@@ -101,7 +101,7 @@ def _get_base_cols(row: StructExpression) -> List[Column]:
if 'rsid' in row and row.rsid.dtype == tstr:
names_elems.append("rsid")
names_col = fx.expr(
f"filter(nullif(array({','.join(names_elems)}), array()), n -> isnotnull(n))").alias("names")
f"nullif(filter(array({','.join(names_elems)}), n -> isnotnull(n)), array())").alias("names")

reference_allele_col = fx.element_at("alleles", 1).alias("referenceAllele")

@@ -123,11 +123,9 @@ def _get_other_cols(row: StructExpression) -> List[Column]:
if 'qual' in row and row.qual.dtype == tfloat64:
# -10 qual means missing
other_cols.append(fx.expr("if(qual = -10, null, qual)").alias("qual"))
-# null filters means missing, [] filters means PASS
+# [] filters means PASS, null filters means missing
if 'filters' in row and row.filters.dtype == tset(tstr):
-other_cols.append(
-  fx.expr("if(size(filters) = 0, array('PASS'), if(isnull(filters), array(), filters))").
-  alias("filters"))
+other_cols.append(fx.expr("if(size(filters) = 0, array('PASS'), filters)").alias("filters"))
# Rename info.* columns to INFO_*
if 'info' in row and isinstance(row.info.dtype, tstruct):
for f in row.info:
@@ -156,6 +154,8 @@ def from_matrix_table(mt: MatrixTable, include_sample_ids: bool = True) -> DataFrame:
"""
Converts a Hail MatrixTable to a Glow DataFrame.
Review comment (Contributor): Could you describe what the schema will look like? How is it translated from the Hail schema?


Requires that the MatrixTable rows contain locus and alleles fields.

Args:
mt : The Hail MatrixTable to convert
include_sample_ids : If true, include sample IDs in the Glow DataFrame
Review comment (Contributor): nit: could you document why you might want this to be false?
