Generate notebook sources #301

Merged
29 commits merged on Nov 2, 2020
Changes from 25 commits
20 changes: 17 additions & 3 deletions .circleci/config.yml
@@ -26,6 +26,7 @@ install_pyspark2: &install_pyspark2
conda remove -y pyspark
pip install pyspark==2.4.5


check_clean_repo: &check_clean_repo
run:
name: Verify that repo is clean
@@ -45,7 +46,7 @@ orbs:
codecov: codecov/[email protected]
jobs:

check-links:
check-docs:
<<: *setup_base
steps:
- checkout
@@ -64,6 +65,19 @@ jobs:
export PATH=$HOME/conda/envs/glow/bin:$PATH
cd docs
make linkcheck
- run:
name: Configure Databricks CLI
command: |
printf "[docs-ci]\nhost = https://westus2.azuredatabricks.net\ntoken = ${DATABRICKS_API_TOKEN}\n" > ~/.databrickscfg
- run:
name: Generate notebook source files
command: |
export PATH=$HOME/conda/bin:$PATH
source activate glow
for f in $(find docs/source/_static/notebooks -type f -name '*.html'); do
python docs/dev/gen-nb-src.py --html "${f}"
done
- *check_clean_repo

scala-2_11-tests:
<<: *setup_base
@@ -200,7 +214,7 @@ workflows:
version: 2
test:
jobs:
- check-links
- check-docs
- scala-2_11-tests
- scala-2_12-tests
- spark-3-tests
@@ -213,5 +227,5 @@
only:
- master
jobs:
- check-links
- check-docs
- spark-3-tests
59 changes: 59 additions & 0 deletions docs/dev/gen-nb-src.py
@@ -0,0 +1,59 @@
'''
Transforms a .html notebook into its source .py/.scala/.r/.sql file.

This script is used by the CircleCI job 'check-docs'. Before running this, configure
your Databricks CLI profile.

Example usage:
python3 docs/dev/gen-nb-src.py \
--html docs/source/_static/notebooks/etl/variant-data.html
'''
import click
import subprocess
import os
import uuid

NOTEBOOK_DIR = 'docs/source/_static/notebooks'
SOURCE_DIR = 'docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE'
SOURCE_EXTS = ['scala', 'py', 'r', 'sql']


@click.command()
@click.option('--html', required=True, help='Path of the HTML notebook.')
@click.option('--cli-profile', default='docs-ci', help='Databricks CLI profile name.')
Review comment (Contributor): Might it make more sense to rely on the databricks CLI's default profile?

Reply (Collaborator, author): I found this useful during manual testing, as I already have a default Databricks CLI profile and wanted to test this on the same shard as the CI.

Reply (Contributor): I mean that it makes sense to have this argument, but the default value should be the default profile. If you're running locally, I don't think there's any reason to expect a profile called docs-ci to exist. We can pass the docs-ci argument from our CI script.
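A minimal sketch of the alternative discussed above, assuming the Databricks CLI falls back to its default profile whenever `--profile` is omitted; the `None` default and the verification command are illustrative, not part of this PR:

```
# Illustrative only: default --cli-profile to None and add --profile to the
# command just when a profile is supplied, so local runs fall back to the
# CLI's default profile while CI passes --cli-profile docs-ci explicitly.
import subprocess

import click


@click.command()
@click.option('--cli-profile', default=None,
              help='Databricks CLI profile name; omit to use the default profile.')
def main(cli_profile):
    base_cmd = ['databricks']
    if cli_profile:
        base_cmd += ['--profile', cli_profile]
    # Sanity check: list the workspace root with the chosen profile.
    print(subprocess.check_output(base_cmd + ['workspace', 'ls', '/'],
                                  stderr=subprocess.STDOUT).decode())


if __name__ == '__main__':
    main()
```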

@click.option('--workspace-tmp-dir', default='/tmp/glow-docs-ci', help='Base workspace dir; a temporary directory will be generated under this for import/export.')
def main(html, cli_profile, workspace_tmp_dir):
assert os.path.commonpath([NOTEBOOK_DIR, html]) == NOTEBOOK_DIR, \
f"HTML notebook must be under {NOTEBOOK_DIR} but got {html}."
rel_path = os.path.splitext(os.path.relpath(html, NOTEBOOK_DIR))[0]

if not os.path.exists(html): # html notebook was deleted
print(f"{html} does not exist. Deleting the companion source file...")
for ext in SOURCE_EXTS:
source_path = os.path.join(SOURCE_DIR, rel_path + "." + ext)
if os.path.exists(source_path):
os.remove(source_path)
print(f"\tDeleted {source_path}.")
return

print(f"Generating source file for {html} under {SOURCE_DIR} ...")

def run_cli_workspace_cmd(args):
Review comment (Contributor): Why is this a nested definition?

Reply (Collaborator, author): This is just so we can leverage cli_profile from the main args, but I can split this out if it's clearer that way.

Reply (Contributor): I think that would be a bit cleaner.
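A sketch of the refactor suggested here, assuming the helper is lifted to module level and takes the profile as an explicit argument rather than closing over `main`'s parameters:

```
# Illustrative module-level helper; behavior mirrors the nested version below.
import subprocess


def run_cli_workspace_cmd(cli_profile, args):
    """Run `databricks --profile <cli_profile> workspace <args>`, raising on failure."""
    cmd = ['databricks', '--profile', cli_profile, 'workspace'] + args
    return subprocess.check_output(cmd, stderr=subprocess.STDOUT)
```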

cmd = ['databricks', '--profile', cli_profile, 'workspace'] + args
subprocess.check_output(cmd, stderr=subprocess.STDOUT)

work_dir = os.path.join(workspace_tmp_dir, str(uuid.uuid4()))
workspace_path = os.path.join(work_dir, rel_path)

run_cli_workspace_cmd(['mkdirs', os.path.join(work_dir, os.path.dirname(rel_path))])
try:
# `-l PYTHON` is required by CLI but ignored with `-f HTML`
# This command works for all languages in SOURCE_EXTS
run_cli_workspace_cmd(['import', '-o', '-l', 'PYTHON', '-f', 'HTML', html, workspace_path])
run_cli_workspace_cmd(['export_dir', '-o', work_dir, SOURCE_DIR])
finally:
run_cli_workspace_cmd(['rm', '-r', work_dir])


if __name__ == '__main__':
main()
73 changes: 73 additions & 0 deletions docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/etl/gff.py
@@ -0,0 +1,73 @@
# Databricks notebook source
from pyspark.sql.types import *

# Human genome annotations in GFF3 are available at https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.39_GRCh38.p13/
gff_path = "/databricks-datasets/genomics/gffs/GCF_000001405.39_GRCh38.p13_genomic.gff.bgz"

# COMMAND ----------

# MAGIC %md
# MAGIC ## Read in GFF3 with inferred schema

# COMMAND ----------

# DBTITLE 0,Print inferred schema
original_gff_df = spark.read \
.format("gff") \
.load(gff_path)

original_gff_df.printSchema()

# COMMAND ----------

# DBTITLE 0,Read in the GFF3 with the inferred schema
display(original_gff_df)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Read in GFF3 with user-specified schema

# COMMAND ----------

mySchema = StructType( \
[StructField('seqId', StringType()),
StructField('start', LongType()),
StructField('end', LongType()),
StructField('ID', StringType()),
StructField('Dbxref', ArrayType(StringType())),
StructField('gene', StringType()),
StructField('mol_type', StringType())]
)

original_gff_df = spark.read \
.schema(mySchema) \
.format("gff") \
.load(gff_path)

display(original_gff_df)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Read in GFF3 with user-specified schema including original GFF3 ``attributes`` column

# COMMAND ----------

mySchema = StructType( \
[StructField('seqId', StringType()),
StructField('start', LongType()),
StructField('end', LongType()),
StructField('ID', StringType()),
StructField('Dbxref', ArrayType(StringType())),
StructField('gene', StringType()),
StructField('mol_type', StringType()),
StructField('attributes', StringType())]
)

original_gff_df = spark.read \
.schema(mySchema) \
.format("gff") \
.load(gff_path)

display(original_gff_df)
83 changes: 83 additions & 0 deletions docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/etl/lift-over.py
@@ -0,0 +1,83 @@
# Databricks notebook source
# MAGIC %md
# MAGIC
# MAGIC To perform coordinate or variant liftover, you must download a chain file to each node.
# MAGIC
# MAGIC On a Databricks cluster, an example of a [cluster-scoped init script](https://docs.azuredatabricks.net/clusters/init-scripts.html#cluster-scoped-init-scripts) you can use to download the required file is as follows:
# MAGIC
# MAGIC ```
# MAGIC #!/usr/bin/env bash
# MAGIC set -ex
# MAGIC set -o pipefail
# MAGIC mkdir /opt/liftover
# MAGIC curl https://raw.githubusercontent.com/broadinstitute/gatk/master/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain --output /opt/liftover/b37ToHg38.over.chain
# MAGIC ```
# MAGIC In this demo, we perform coordinate and variant liftover from b37 to hg38.
# MAGIC
# MAGIC To perform variant liftover, you must download a reference file to each node of the cluster. Here, we assume the reference genome is downloaded to
# MAGIC ```/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa```
# MAGIC
# MAGIC If you are using a Databricks cluster with [Databricks Runtime for Genomics](https://docs.databricks.com/applications/genomics/index.html), this can be achieved by setting [environment variable](https://docs.databricks.com/user-guide/clusters/spark-config.html#environment-variables) `refGenomeId=grch38`.

# COMMAND ----------

# DBTITLE 1,Import glow and define path variables
import glow
glow.register(spark)
chain_file = '/opt/liftover/b37ToHg38.over.chain'
reference_file = '/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa'
vcf_file = 'dbfs:/databricks-datasets/genomics/1kg-vcfs/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'

# COMMAND ----------

# DBTITLE 1,First, read in a VCF from a flat file or Delta Lake table.
input_df = spark.read.format("vcf") \
.load(vcf_file) \
.cache()

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC Now apply the `lift_over_coordinates` UDF, with the parameters as follows:
# MAGIC - chromosome (`string`)
# MAGIC - start (`long`)
# MAGIC - end (`long`)
# MAGIC - CONSTANT: chain file (`string`)
# MAGIC - OPTIONAL: minimum fraction of bases that must remap (`double`), defaults to `.95`
# MAGIC
# MAGIC This creates a column with the new coordinates.

# COMMAND ----------

from pyspark.sql.functions import *

# COMMAND ----------

liftover_expr = f"lift_over_coordinates(contigName, start, end, '{chain_file}', .99)"
input_with_lifted_df = input_df.select('contigName', 'start', 'end').withColumn('lifted', expr(liftover_expr))

# COMMAND ----------

# DBTITLE 1,Filter rows for which liftover succeeded and see which rows changed.
changed_with_lifted_df = input_with_lifted_df.filter("lifted is not null").filter("start != lifted.start")
display(changed_with_lifted_df)

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC Now apply the `lift_over_variants` transformer, with the following options.
# MAGIC - `chain_file`: `string`
# MAGIC - `reference_file`: `string`
# MAGIC - `min_match_ratio`: `double` (optional, defaults to `.95`)

# COMMAND ----------

output_df = glow.transform('lift_over_variants', input_df, chain_file=chain_file, reference_file=reference_file)

# COMMAND ----------

# DBTITLE 1,View the rows for which liftover succeeded
lifted_df = output_df.filter('liftOverStatus.success = true').drop('liftOverStatus')
display(lifted_df.select('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles', 'INFO_AC', 'INFO_SwappedAlleles', 'INFO_ReverseComplementedAlleles'))
33 changes: 33 additions & 0 deletions docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/etl/merge-vcf.py
@@ -0,0 +1,33 @@
# Databricks notebook source
from pyspark.sql.functions import *
import glow
glow.register(spark)

# COMMAND ----------

# DBTITLE 1,Split Thousand Genome Project multi-sample VCF into 2 single-sample VCFs
vcf_df = spark.read.format('vcf').load('/databricks-datasets/genomics/1kg-vcfs/*.vcf.gz')
vcf_split1 = vcf_df.withColumn('genotypes', expr('filter(genotypes, (g, idx) -> g.sampleId = genotypes[0].sampleId)'))
vcf_split2 = vcf_df.withColumn('genotypes', expr('filter(genotypes, (g, idx) -> g.sampleId = genotypes[1].sampleId)'))
vcf_split1.write.format('bigvcf').mode('overwrite').save('/tmp/vcf-merge-demo/1.vcf.bgz')
vcf_split2.write.format('bigvcf').mode('overwrite').save('/tmp/vcf-merge-demo/2.vcf.bgz')

# COMMAND ----------

# DBTITLE 1,Show contents before merge
df_to_merge = spark.read.format('vcf').load(['/tmp/vcf-merge-demo/1.vcf.bgz', '/tmp/vcf-merge-demo/2.vcf.bgz'])
display(df_to_merge.select('contigName', 'start', col('genotypes').sampleId).orderBy('contigName', 'start', 'genotypes.sampleId'))

# COMMAND ----------

# DBTITLE 1,Merge genotype arrays
merged = df_to_merge.groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles')\
.agg(sort_array(flatten(collect_list('genotypes'))).alias('genotypes'))
display(merged.orderBy('contigName', 'start').select('contigName', 'start', col('genotypes').sampleId))

# COMMAND ----------

# DBTITLE 1,Merge VCFs and sum INFO_DP
merged = df_to_merge.groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles')\
.agg(sort_array(flatten(collect_list('genotypes'))).alias('genotypes'), sum('INFO_DP').alias('INFO_DP'))
display(merged.orderBy('contigName', 'start').select('contigName', 'start', 'INFO_DP', col('genotypes').sampleId))
@@ -0,0 +1,61 @@
# Databricks notebook source
# DBTITLE 1,Setup
# MAGIC %md
# MAGIC To use the variant normalizer, a copy of the reference genome `.fa/.fasta` file (along with its `.fai` file) must be downloaded to each node of the cluster.
# MAGIC
# MAGIC Here, we assume the reference genome is downloaded to the following path: `/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa`
# MAGIC
# MAGIC If you are using a Databricks cluster with [Databricks Runtime for Genomics](https://docs.databricks.com/applications/genomics/index.html), this can be done by setting the [environment variable](https://docs.databricks.com/user-guide/clusters/spark-config.html#environment-variables) ``refGenomeId=grch38`` for your cluster.

# COMMAND ----------

# DBTITLE 1,Define path variables
import glow
glow.register(spark)
ref_genome_path = '/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa'
vcf_path = '/databricks-datasets/genomics/variant-normalization/test_left_align_hg38.vcf'

# COMMAND ----------

# DBTITLE 1,Load a VCF into a DataFrame
original_variants_df = spark.read\
.format("vcf")\
.option("includeSampleIds", False)\
.load(vcf_path)

# COMMAND ----------

# DBTITLE 1,Display
display(original_variants_df)

# COMMAND ----------

# DBTITLE 1,Normalize variants using normalize_variants transformer with column replacement
normalized_variants_df = glow.transform(\
"normalize_variants",\
original_variants_df,\
reference_genome_path=ref_genome_path
)

display(normalized_variants_df)

# COMMAND ----------

# DBTITLE 1,Normalize variants using normalize_variants transformer without column replacement
normalized_variants_df = glow.transform(\
"normalize_variants",\
original_variants_df,\
reference_genome_path=ref_genome_path,
replace_columns="False"
)

display(normalized_variants_df)

# COMMAND ----------

# DBTITLE 1,Normalize variants using normalize_variant function
from glow.functions import *

normalized_variants_df = original_variants_df.select("*", normalize_variant("contigName", "start", "end", "referenceAllele", "alternateAlleles", ref_genome_path).alias("normalizationResult"))

display(normalized_variants_df)