Commit: Generate notebook sources (#301)
* First try

Signed-off-by: Karen Feng <[email protected]>

* Fiddle with environment

Signed-off-by: Karen Feng <[email protected]>

* Activate glow env in docs steps

Signed-off-by: Karen Feng <[email protected]>

* Try attaching workspace

Signed-off-by: Karen Feng <[email protected]>

* Run subprocess in shell

Signed-off-by: Karen Feng <[email protected]>

* Try breaking checker

Signed-off-by: Karen Feng <[email protected]>

* Try triggering broken flow

Signed-off-by: Karen Feng <[email protected]>

* Figure out what's happening

Signed-off-by: Karen Feng <[email protected]>

* Try again

Signed-off-by: Karen Feng <[email protected]>

* Try again

Signed-off-by: Karen Feng <[email protected]>

* Mess up py again

Signed-off-by: Karen Feng <[email protected]>

* use python

Signed-off-by: Karen Feng <[email protected]>

* Try triggering error

Signed-off-by: Karen Feng <[email protected]>

* Print cg

Signed-off-by: Karen Feng <[email protected]>

* env ar

Signed-off-by: Karen Feng <[email protected]>

* Try again

Signed-off-by: Karen Feng <[email protected]>

* Remove test lines

Signed-off-by: Karen Feng <[email protected]>

* debug dbcli

Signed-off-by: Karen Feng <[email protected]>

* debug again

Signed-off-by: Karen Feng <[email protected]>

* comment

Signed-off-by: Karen Feng <[email protected]>

* more debug

Signed-off-by: Karen Feng <[email protected]>

* Try updating cli

Signed-off-by: Karen Feng <[email protected]>

* use westus2

Signed-off-by: Karen Feng <[email protected]>

* fix gff

Signed-off-by: Karen Feng <[email protected]>

* Clarify workspace-tmp-dir option and use f-based str interpolation

Signed-off-by: Karen Feng <[email protected]>

* Address comments

Signed-off-by: Karen Feng <[email protected]>

* Add conda-forge as channel

Signed-off-by: Karen Feng <[email protected]>

* Try updating conda

Signed-off-by: Karen Feng <[email protected]>

* Reorder deps with Python first

Signed-off-by: Karen Feng <[email protected]>
karenfeng committed Nov 2, 2020
1 parent 796c98e commit 17e4346
Showing 17 changed files with 1,851 additions and 8 deletions.
24 changes: 19 additions & 5 deletions .circleci/config.yml
@@ -9,8 +9,8 @@ install_conda_deps: &install_conda_deps
command: |
export PATH=$HOME/conda/bin:$PATH
if [ ! -d "/home/circleci/conda" ]; then
- wget https://repo.continuum.io/miniconda/Miniconda3-4.3.31-Linux-x86_64.sh
- /bin/bash Miniconda3-4.3.31-Linux-x86_64.sh -b -p $HOME/conda
+ wget https://repo.continuum.io/miniconda/Miniconda3-py37_4.8.3-Linux-x86_64.sh
+ /bin/bash Miniconda3-py37_4.8.3-Linux-x86_64.sh -b -p $HOME/conda
conda env create -f python/environment.yml
else
echo "Conda already installed"
@@ -26,6 +26,7 @@ install_pyspark2: &install_pyspark2
conda remove -y pyspark
pip install pyspark==2.4.5
check_clean_repo: &check_clean_repo
run:
name: Verify that repo is clean
@@ -45,7 +46,7 @@ orbs:
codecov: codecov/[email protected]
jobs:

- check-links:
+ check-docs:
<<: *setup_base
steps:
- checkout
@@ -64,6 +65,19 @@ jobs:
export PATH=$HOME/conda/envs/glow/bin:$PATH
cd docs
make linkcheck
- run:
name: Configure Databricks CLI
command: |
printf "[docs-ci]\nhost = https://westus2.azuredatabricks.net\ntoken = ${DATABRICKS_API_TOKEN}\n" > ~/.databrickscfg
- run:
name: Generate notebook source files
command: |
export PATH=$HOME/conda/bin:$PATH
source activate glow
for f in $(find docs/source/_static/notebooks -type f -name '*.html'); do
python docs/dev/gen-nb-src.py --html "${f}" --cli-profile docs-ci
done
- *check_clean_repo

scala-2_11-tests:
<<: *setup_base
@@ -200,7 +214,7 @@ workflows:
version: 2
test:
jobs:
- - check-links
+ - check-docs
- scala-2_11-tests
- scala-2_12-tests
- spark-3-tests
@@ -213,5 +227,5 @@ workflows:
only:
- master
jobs:
- - check-links
+ - check-docs
- spark-3-tests
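
The new check-docs job can be reproduced locally with the same commands the CI steps run. A minimal sketch, assuming the glow conda environment has already been created and DATABRICKS_API_TOKEN holds a valid token for the westus2 workspace:

# write the 'docs-ci' CLI profile exactly as the CI step does
printf "[docs-ci]\nhost = https://westus2.azuredatabricks.net\ntoken = ${DATABRICKS_API_TOKEN}\n" > ~/.databrickscfg
# regenerate the committed source file for every HTML notebook
source activate glow
for f in $(find docs/source/_static/notebooks -type f -name '*.html'); do
    python docs/dev/gen-nb-src.py --html "${f}" --cli-profile docs-ci
done
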
62 changes: 62 additions & 0 deletions docs/dev/gen-nb-src.py
@@ -0,0 +1,62 @@
'''
Transforms a .html notebook into its source .py/.scala/.r/.sql file.

This script is used by the CircleCI job 'check-docs'. Before running this, configure
your Databricks CLI profile.

Example usage:
    python3 docs/dev/gen-nb-src.py \
        --html docs/source/_static/notebooks/etl/variant-data.html
'''
import click
import subprocess
import os
import uuid

NOTEBOOK_DIR = 'docs/source/_static/notebooks'
SOURCE_DIR = 'docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE'
SOURCE_EXTS = ['scala', 'py', 'r', 'sql']


def run_cli_workspace_cmd(cli_profile, args):
    cmd = ['databricks', '--profile', cli_profile, 'workspace'] + args
    res = subprocess.run(cmd, capture_output=True)
    if res.returncode != 0:
        raise ValueError(res)


@click.command()
@click.option('--html', required=True, help='Path of the HTML notebook.')
@click.option('--cli-profile', default='DEFAULT', help='Databricks CLI profile name.')
@click.option('--workspace-tmp-dir', default='/tmp/glow-docs-ci', help='Base workspace dir; a temporary directory will be generated under this for import/export.')
def main(html, cli_profile, workspace_tmp_dir):
    assert os.path.commonpath([NOTEBOOK_DIR, html]) == NOTEBOOK_DIR, \
        f"HTML notebook must be under {NOTEBOOK_DIR} but got {html}."
    rel_path = os.path.splitext(os.path.relpath(html, NOTEBOOK_DIR))[0]

    if not os.path.exists(html):  # html notebook was deleted
        print(f"{html} does not exist. Deleting the companion source file...")
        for ext in SOURCE_EXTS:
            source_path = os.path.join(SOURCE_DIR, rel_path + "." + ext)
            if os.path.exists(source_path):
                os.remove(source_path)
                print(f"\tDeleted {source_path}.")
        return

    print(f"Generating source file for {html} under {SOURCE_DIR} ...")

    work_dir = os.path.join(workspace_tmp_dir, str(uuid.uuid4()))
    workspace_path = os.path.join(work_dir, rel_path)

    run_cli_workspace_cmd(cli_profile, ['mkdirs', os.path.join(work_dir, os.path.dirname(rel_path))])
    try:
        # `-l PYTHON` is required by the CLI but ignored with `-f HTML`;
        # this command works for all languages in SOURCE_EXTS
        run_cli_workspace_cmd(cli_profile, ['import', '-o', '-l', 'PYTHON', '-f', 'HTML', html, workspace_path])
        run_cli_workspace_cmd(cli_profile, ['export_dir', '-o', work_dir, SOURCE_DIR])
    finally:
        run_cli_workspace_cmd(cli_profile, ['rm', '-r', work_dir])


if __name__ == '__main__':
    main()
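
For a single notebook, the script's round trip reduces to four commands of the legacy databricks CLI's workspace group. A rough equivalent, with <uuid> standing in for the random temporary directory the script creates under --workspace-tmp-dir:

databricks --profile docs-ci workspace mkdirs /tmp/glow-docs-ci/<uuid>/etl
# -l PYTHON is required by the CLI but ignored when importing HTML
databricks --profile docs-ci workspace import -o -l PYTHON -f HTML docs/source/_static/notebooks/etl/variant-data.html /tmp/glow-docs-ci/<uuid>/etl/variant-data
databricks --profile docs-ci workspace export_dir -o /tmp/glow-docs-ci/<uuid> docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE
databricks --profile docs-ci workspace rm -r /tmp/glow-docs-ci/<uuid>
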
73 changes: 73 additions & 0 deletions docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/etl/gff.py
@@ -0,0 +1,73 @@
# Databricks notebook source
from pyspark.sql.types import *

# Human genome annotations in GFF3 are available at https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.39_GRCh38.p13/
gff_path = "/databricks-datasets/genomics/gffs/GCF_000001405.39_GRCh38.p13_genomic.gff.bgz"

# COMMAND ----------

# MAGIC %md
# MAGIC ## Read in GFF3 with inferred schema

# COMMAND ----------

# DBTITLE 0,Print inferred schema
original_gff_df = spark.read \
.format("gff") \
.load(gff_path)

original_gff_df.printSchema()

# COMMAND ----------

# DBTITLE 0,Read in the GFF3 with the inferred schema
display(original_gff_df)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Read in GFF3 with user-specified schema

# COMMAND ----------

mySchema = StructType( \
[StructField('seqId', StringType()),
StructField('start', LongType()),
StructField('end', LongType()),
StructField('ID', StringType()),
StructField('Dbxref', ArrayType(StringType())),
StructField('gene', StringType()),
StructField('mol_type', StringType())]
)

original_gff_df = spark.read \
.schema(mySchema) \
.format("gff") \
.load(gff_path)

display(original_gff_df)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Read in GFF3 with user-specified schema including original GFF3 ``attributes`` column

# COMMAND ----------

mySchema = StructType( \
[StructField('seqId', StringType()),
StructField('start', LongType()),
StructField('end', LongType()),
StructField('ID', StringType()),
StructField('Dbxref', ArrayType(StringType())),
StructField('gene', StringType()),
StructField('mol_type', StringType()),
StructField('attributes', StringType())]
)

original_gff_df = spark.read \
.schema(mySchema) \
.format("gff") \
.load(gff_path)

display(original_gff_df)
83 changes: 83 additions & 0 deletions docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/etl/lift-over.py
@@ -0,0 +1,83 @@
# Databricks notebook source
# MAGIC %md
# MAGIC
# MAGIC To perform coordinate or variant liftover, you must download a chain file to each node.
# MAGIC
# MAGIC On a Databricks cluster, an example of a [cluster-scoped init script](https://docs.azuredatabricks.net/clusters/init-scripts.html#cluster-scoped-init-scripts) you can use to download the required file is as follows:
# MAGIC
# MAGIC ```
# MAGIC #!/usr/bin/env bash
# MAGIC set -ex
# MAGIC set -o pipefail
# MAGIC mkdir /opt/liftover
# MAGIC curl https://raw.githubusercontent.com/broadinstitute/gatk/master/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain --output /opt/liftover/b37ToHg38.over.chain
# MAGIC ```
# MAGIC In this demo, we perform coordinate and variant liftover from b37 to hg38.
# MAGIC
# MAGIC To perform variant liftover, you must download a reference file to each node of the cluster. Here, we assume the reference genome is downloaded to
# MAGIC ```/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa```
# MAGIC
# MAGIC If you are using a Databricks cluster with [Databricks Runtime for Genomics](https://docs.databricks.com/applications/genomics/index.html), this can be achieved by setting [environment variable](https://docs.databricks.com/user-guide/clusters/spark-config.html#environment-variables) `refGenomeId=grch38`.

# COMMAND ----------

# DBTITLE 1,Import glow and define path variables
import glow
glow.register(spark)
chain_file = '/opt/liftover/b37ToHg38.over.chain'
reference_file = '/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa'
vcf_file = 'dbfs:/databricks-datasets/genomics/1kg-vcfs/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'

# COMMAND ----------

# DBTITLE 1,First, read in a VCF from a flat file or Delta Lake table.
input_df = spark.read.format("vcf") \
.load(vcf_file) \
.cache()

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC Now apply the `lift_over_coordinates` UDF, with the parameters as follows:
# MAGIC - chromosome (`string`)
# MAGIC - start (`long`)
# MAGIC - end (`long`)
# MAGIC - CONSTANT: chain file (`string`)
# MAGIC - OPTIONAL: minimum fraction of bases that must remap (`double`), defaults to `.95`
# MAGIC
# MAGIC This creates a column with the new coordinates.

# COMMAND ----------

from pyspark.sql.functions import *

# COMMAND ----------

# the chain file path is interpolated into the SQL expression as a string literal
liftover_expr = f"lift_over_coordinates(contigName, start, end, '{chain_file}', .99)"
input_with_lifted_df = input_df.select('contigName', 'start', 'end').withColumn('lifted', expr(liftover_expr))

# COMMAND ----------

# DBTITLE 1,Filter rows for which liftover succeeded and see which rows changed.
changed_with_lifted_df = input_with_lifted_df.filter("lifted is not null").filter("start != lifted.start")
display(changed_with_lifted_df)

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC Now apply the `lift_over_variants` transformer, with the following options.
# MAGIC - `chain_file`: `string`
# MAGIC - `reference_file`: `string`
# MAGIC - `min_match_ratio`: `double` (optional, defaults to `.95`)

# COMMAND ----------

output_df = glow.transform('lift_over_variants', input_df, chain_file=chain_file, reference_file=reference_file)

# COMMAND ----------

# DBTITLE 1,View the rows for which liftover succeeded
lifted_df = output_df.filter('liftOverStatus.success = true').drop('liftOverStatus')
display(lifted_df.select('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles', 'INFO_AC', 'INFO_SwappedAlleles', 'INFO_ReverseComplementedAlleles'))
33 changes: 33 additions & 0 deletions docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/etl/merge-vcf.py
@@ -0,0 +1,33 @@
# Databricks notebook source
from pyspark.sql.functions import *
import glow
glow.register(spark)

# COMMAND ----------

# DBTITLE 1,Split Thousand Genome Project multi-sample VCF into 2 single-sample VCFs
vcf_df = spark.read.format('vcf').load('/databricks-datasets/genomics/1kg-vcfs/*.vcf.gz')
vcf_split1 = vcf_df.withColumn('genotypes', expr('filter(genotypes, (g, idx) -> g.sampleId = genotypes[0].sampleId)'))
vcf_split2 = vcf_df.withColumn('genotypes', expr('filter(genotypes, (g, idx) -> g.sampleId = genotypes[1].sampleId)'))
vcf_split1.write.format('bigvcf').mode('overwrite').save('/tmp/vcf-merge-demo/1.vcf.bgz')
vcf_split2.write.format('bigvcf').mode('overwrite').save('/tmp/vcf-merge-demo/2.vcf.bgz')

# COMMAND ----------

# DBTITLE 1,Show contents before merge
df_to_merge = spark.read.format('vcf').load(['/tmp/vcf-merge-demo/1.vcf.bgz', '/tmp/vcf-merge-demo/2.vcf.bgz'])
display(df_to_merge.select('contigName', 'start', col('genotypes').sampleId).orderBy('contigName', 'start', 'genotypes.sampleId'))

# COMMAND ----------

# DBTITLE 1,Merge genotype arrays
merged = df_to_merge.groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles')\
.agg(sort_array(flatten(collect_list('genotypes'))).alias('genotypes'))
display(merged.orderBy('contigName', 'start').select('contigName', 'start', col('genotypes').sampleId))

# COMMAND ----------

# DBTITLE 1,Merge VCFs and sum INFO_DP
merged = df_to_merge.groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles')\
.agg(sort_array(flatten(collect_list('genotypes'))).alias('genotypes'), sum('INFO_DP').alias('INFO_DP'))
display(merged.orderBy('contigName', 'start').select('contigName', 'start', 'INFO_DP', col('genotypes').sampleId))
@@ -0,0 +1,61 @@
# Databricks notebook source
# DBTITLE 1,Setup
# MAGIC %md
# MAGIC To use the variant normalizer, a copy of the reference genome `.fa/.fasta` file (along with its `.fai` file) must be downloaded to each node of the cluster.
# MAGIC
# MAGIC Here, we assume the reference genome is downloaded to the following path: `/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa`
# MAGIC
# MAGIC If you are using a Databricks cluster with [Databricks Runtime for Genomics](https://docs.databricks.com/applications/genomics/index.html), this can be done by setting the [environment variable](https://docs.databricks.com/user-guide/clusters/spark-config.html#environment-variables) ``refGenomeId=grch38`` for your cluster.

# COMMAND ----------

# DBTITLE 1,Define path variables
import glow
glow.register(spark)
ref_genome_path = '/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa'
vcf_path = '/databricks-datasets/genomics/variant-normalization/test_left_align_hg38.vcf'

# COMMAND ----------

# DBTITLE 1,Load a VCF into a DataFrame
original_variants_df = spark.read\
.format("vcf")\
.option("includeSampleIds", False)\
.load(vcf_path)

# COMMAND ----------

# DBTITLE 1,Display
display(original_variants_df)

# COMMAND ----------

# DBTITLE 1,Normalize variants using normalize_variants transformer with column replacement
normalized_variants_df = glow.transform(\
"normalize_variants",\
original_variants_df,\
reference_genome_path=ref_genome_path
)

display(normalized_variants_df)

# COMMAND ----------

# DBTITLE 1,Normalize variants using normalize_variants transformer without column replacement
normalized_variants_df = glow.transform(\
"normalize_variants",\
original_variants_df,\
reference_genome_path=ref_genome_path,
replace_columns="False"
)

display(normalized_variants_df)

# COMMAND ----------

# DBTITLE 1,Normalize variants using normalize_variant function
from glow.functions import *

normalized_variants_df = original_variants_df.select("*", normalize_variant("contigName", "start", "end", "referenceAllele", "alternateAlleles", ref_genome_path).alias("normalizationResult"))

display(normalized_variants_df)