Add documentation for Hail interoperation #310

Merged (8 commits) on Nov 25, 2020
Add schema mapping docs
Signed-off-by: Karen Feng <[email protected]>
karenfeng committed Nov 23, 2020
commit c7139e796b7dbb7c1a288847c0b3e0003095a020
75 changes: 75 additions & 0 deletions docs/source/etl/hail.rst
@@ -41,3 +41,78 @@ Convert from a Hail MatrixTable to a Glow-compatible DataFrame with the function
assert_rows_equal(df.head(), native_glow_df.head())

By default, the genotypes contain sample IDs. To remove the sample IDs, set the parameter ``include_sample_ids=False``.
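
To illustrate why dropping sample IDs shrinks each row, here is a minimal pure-Python sketch. It does not call Hail or Glow: the genotype dictionaries and the helper ``strip_sample_ids`` are hypothetical stand-ins for the genotype structs in a Glow DataFrame row, and JSON length is only a rough proxy for row size.

```python
import json


def strip_sample_ids(genotypes):
    """Return copies of the genotypes without their sample IDs (hypothetical helper)."""
    return [{k: v for k, v in g.items() if k != "sampleId"} for g in genotypes]


# Illustrative genotype structs; real rows come from the converted DataFrame.
genotypes = [
    {"sampleId": "NA12878", "calls": [0, 1], "phased": False},
    {"sampleId": "NA12891", "calls": [1, 1], "phased": False},
]

without_ids = strip_sample_ids(genotypes)

# Serialized length as a rough stand-in for per-row storage cost.
size_with = len(json.dumps(genotypes))
size_without = len(json.dumps(without_ids))
print(size_with, size_without)
```

In practice the same effect is obtained by passing ``include_sample_ids=False`` to the conversion function, so the IDs are never materialized.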

Schema mapping
==============

The Glow DataFrame variant fields are derived from the Hail MatrixTable row fields.

.. list-table::
  :header-rows: 1

  * - Required
    - Glow DataFrame variant field
    - Hail MatrixTable row field
  * - Yes
    - ``contigName``
    - ``locus.contig``
  * - Yes
    - ``start``
    - ``locus.position - 1``
  * - Yes
    - ``end``
    - ``info.END`` or ``locus.position - 1 + len(alleles[0])``
  * - Yes
    - ``referenceAllele``
    - ``alleles[0]``
  * - No
    - ``alternateAlleles``
    - ``alleles[1:]``
  * - No
    - ``names``
    - ``[rsid, varid]``
  * - No
    - ``qual``
    - ``qual``
  * - No
    - ``filters``
    - ``filters``
  * - No
    - ``INFO_<ANY_FIELD>``
    - ``info.<ANY_FIELD>``
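
The variant mapping in the table above can be sketched in plain Python. This is an illustration under stated assumptions, not Glow's implementation: the Hail row is modeled as a nested dictionary, ``hail_row_to_glow_variant`` is a hypothetical helper, and the choices to drop missing ``rsid``/``varid`` values from ``names`` and to flatten every ``info`` field (including ``END``) are assumptions.

```python
def hail_row_to_glow_variant(row):
    """Map a dict modeled on a Hail row to Glow-style variant fields (hypothetical helper)."""
    locus = row["locus"]
    alleles = row["alleles"]
    info = row.get("info") or {}

    # Hail locus positions are 1-based; the Glow start coordinate is 0-based.
    start = locus["position"] - 1
    # Prefer info.END when present, otherwise derive the end from the reference allele length.
    end = info["END"] if info.get("END") is not None else start + len(alleles[0])

    variant = {
        "contigName": locus["contig"],
        "start": start,
        "end": end,
        "referenceAllele": alleles[0],
        "alternateAlleles": alleles[1:],
        # Assumption: missing identifiers are skipped rather than kept as nulls.
        "names": [n for n in (row.get("rsid"), row.get("varid")) if n is not None],
        "qual": row.get("qual"),
        "filters": row.get("filters"),
    }
    # Each info field becomes a top-level INFO_<field> entry.
    for name, value in info.items():
        variant[f"INFO_{name}"] = value
    return variant


example_row = {
    "locus": {"contig": "chr1", "position": 100},
    "alleles": ["AC", "A"],
    "rsid": "rs1",
    "info": {"DP": 30},
}
variant = hail_row_to_glow_variant(example_row)
print(variant["start"], variant["end"])
```

For this example row there is no ``info.END``, so the end coordinate is computed from the two-base reference allele.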

The Glow DataFrame genotype sample IDs are derived from the Hail MatrixTable column fields.

All of the other Glow DataFrame genotype fields are derived from the Hail MatrixTable entry fields.

.. list-table::
  :header-rows: 1

  * - Glow DataFrame genotype field
    - Hail MatrixTable entry field
  * - ``phased``
    - ``GT.phased``
  * - ``calls``
    - ``GT.alleles``
  * - ``depth``
    - ``DP``
  * - ``filters``
    - ``FT``
  * - ``genotypeLikelihoods``
    - ``GL``
  * - ``phredLikelihoods``
    - ``PL``
  * - ``posteriorProbabilities``
    - ``GP``
  * - ``conditionalQuality``
    - ``GQ``
  * - ``haplotypeQualities``
    - ``HQ``
  * - ``expectedAlleleCounts``
    - ``EC``
  * - ``mappingQuality``
    - ``MQ``
  * - ``alleleDepths``
    - ``AD``
  * - ``<ANY_FIELD>``
    - ``<ANY_FIELD>``
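
The genotype mapping above amounts to a rename table plus special handling for ``GT``. The following is a minimal sketch, assuming a Hail entry is modeled as a dictionary and that the sample ID is stored under a ``sampleId`` key; ``ENTRY_TO_GENOTYPE_FIELD`` and ``hail_entry_to_glow_genotype`` are hypothetical names, not part of the Glow API.

```python
# Hail entry field name -> Glow genotype field name, taken from the table above.
ENTRY_TO_GENOTYPE_FIELD = {
    "DP": "depth",
    "FT": "filters",
    "GL": "genotypeLikelihoods",
    "PL": "phredLikelihoods",
    "GP": "posteriorProbabilities",
    "GQ": "conditionalQuality",
    "HQ": "haplotypeQualities",
    "EC": "expectedAlleleCounts",
    "MQ": "mappingQuality",
    "AD": "alleleDepths",
}


def hail_entry_to_glow_genotype(entry, sample_id=None):
    """Map a dict modeled on a Hail entry to a Glow-style genotype (hypothetical helper)."""
    genotype = {}
    # The sample ID comes from the column fields, not from the entry itself.
    if sample_id is not None:
        genotype["sampleId"] = sample_id
    for name, value in entry.items():
        if name == "GT":
            # The call struct is split into two genotype fields.
            genotype["phased"] = value["phased"]
            genotype["calls"] = value["alleles"]
        else:
            # Fields without a known mapping keep their original name.
            genotype[ENTRY_TO_GENOTYPE_FIELD.get(name, name)] = value
    return genotype


entry = {"GT": {"phased": True, "alleles": [0, 1]}, "DP": 12, "XYZ": 5}
genotype = hail_entry_to_glow_genotype(entry, sample_id="s1")
print(genotype)
```

Passing ``sample_id=None`` mirrors the effect of ``include_sample_ids=False``: the resulting genotype carries no sample identifier.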
2 changes: 1 addition & 1 deletion docs/source/etl/variant-data.rst
@@ -56,7 +56,7 @@ You can control the behavior of the VCF reader with a few parameters. All parame
+--------------------------+---------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
| Parameter | Type | Default | Description |
+==========================+=========+=============+=========================================================================================================================================================+
-| ``includeSampleIds`` | boolean | ``true`` | If true, each genotype includes the name of the sample ID it belongs to. Sample names increases the size of each row, both in memory and on storage. |
+| ``includeSampleIds`` | boolean | ``true`` | If true, each genotype includes the name of the sample ID it belongs to. Sample names increase the size of each row, both in memory and on storage. |
+--------------------------+---------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
| ``flattenInfoFields`` | boolean | ``true`` | If true, each info field in the input VCF will be converted into a column in the output DataFrame with each column typed as specified in the VCF header.|
| | | | If false, all info fields will be contained in a single column with a string -> string map of info keys to values. |
7 changes: 5 additions & 2 deletions python/glow/hail/functions.py
@@ -152,13 +152,16 @@ def _require_row_variant_w_struct_locus(mt: MatrixTable) -> NoReturn:

 def from_matrix_table(mt: MatrixTable, include_sample_ids: bool = True) -> DataFrame:
     """
-    Converts a Hail MatrixTable to a Glow DataFrame.
+    Converts a Hail MatrixTable to a Glow DataFrame. The variant fields are derived from the Hail MatrixTable
+    row fields. The sample IDs are derived from the Hail MatrixTable column fields. All other genotype fields are
+    derived from the Hail MatrixTable entry fields.

     Requires that the MatrixTable rows contain locus and alleles fields.

     Args:
         mt : The Hail MatrixTable to convert
-        include_sample_ids : If true, include sample IDs in the Glow DataFrame
+        include_sample_ids : If true (default), include sample IDs in the Glow DataFrame.
+            Sample names increase the size of each row, both in memory and on storage.

     Returns:
         Glow DataFrame converted from the MatrixTable.