Generate notebook sources #301
@@ -26,6 +26,7 @@ install_pyspark2: &install_pyspark2
      conda remove -y pyspark
      pip install pyspark==2.4.5

check_clean_repo: &check_clean_repo
  run:
    name: Verify that repo is clean
@@ -45,7 +46,7 @@ orbs:
  codecov: codecov/[email protected]
jobs:

-  check-links:
+  check-docs:
    <<: *setup_base
    steps:
      - checkout
@@ -64,6 +65,19 @@ jobs:
            export PATH=$HOME/conda/envs/glow/bin:$PATH
            cd docs
            make linkcheck
+      - run:
+          name: Configure Databricks CLI
+          command: |
+            printf "[docs-ci]\nhost = https://westus2.azuredatabricks.net\ntoken = ${DATABRICKS_API_TOKEN}\n" > ~/.databrickscfg
+      - run:
+          name: Generate notebook source files
+          command: |
+            export PATH=$HOME/conda/bin:$PATH
+            source activate glow
+            for f in $(find docs/source/_static/notebooks -type f -name '*.html'); do
+              python docs/dev/gen-nb-src.py --html "${f}"
+            done
+      - *check_clean_repo

  scala-2_11-tests:
    <<: *setup_base
@@ -200,7 +214,7 @@ workflows:
  version: 2
  test:
    jobs:
-      - check-links
+      - check-docs
      - scala-2_11-tests
      - scala-2_12-tests
      - spark-3-tests
@@ -213,5 +227,5 @@ workflows:
            only:
              - master
      jobs:
-        - check-links
+        - check-docs
        - spark-3-tests
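The shell loop above regenerates a source file for every HTML notebook. For running the same generation locally without CircleCI, a rough Python equivalent might look like this (a sketch, assuming the repository layout above and a Databricks CLI profile that gen-nb-src.py can use):

```python
import pathlib
import subprocess

NOTEBOOK_DIR = pathlib.Path('docs/source/_static/notebooks')

# Mirror the CI step: regenerate a source file for every HTML notebook.
for html in sorted(NOTEBOOK_DIR.rglob('*.html')):
    subprocess.check_call(['python', 'docs/dev/gen-nb-src.py', '--html', str(html)])
```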
@@ -0,0 +1,59 @@
'''
Transforms a .html notebook into its source .py/.scala/.r/.sql file.

This script is used by the CircleCI job 'check-docs'. Before running this, configure
your Databricks CLI profile.

Example usage:
  python3 docs/dev/gen-nb-src.py \
    --html docs/source/_static/notebooks/etl/variant-data.html
'''
import click
import subprocess
import os
import uuid

NOTEBOOK_DIR = 'docs/source/_static/notebooks'
SOURCE_DIR = 'docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE'
SOURCE_EXTS = ['scala', 'py', 'r', 'sql']


@click.command()
@click.option('--html', required=True, help='Path of the HTML notebook.')
@click.option('--cli-profile', default='docs-ci', help='Databricks CLI profile name.')
@click.option('--workspace-tmp-dir', default='/tmp/glow-docs-ci', help='Base workspace dir; a temporary directory will be generated under this for import/export.')
def main(html, cli_profile, workspace_tmp_dir):
    assert os.path.commonpath([NOTEBOOK_DIR, html]) == NOTEBOOK_DIR, \
        f"HTML notebook must be under {NOTEBOOK_DIR} but got {html}."
    rel_path = os.path.splitext(os.path.relpath(html, NOTEBOOK_DIR))[0]

    if not os.path.exists(html):  # html notebook was deleted
        print(f"{html} does not exist. Deleting the companion source file...")
        for ext in SOURCE_EXTS:
            source_path = os.path.join(SOURCE_DIR, rel_path + "." + ext)
            if os.path.exists(source_path):
                os.remove(source_path)
                print(f"\tDeleted {source_path}.")
        return

    print(f"Generating source file for {html} under {SOURCE_DIR} ...")

    def run_cli_workspace_cmd(args):
Review comment: Why is this a nested definition?

Review comment: This is just so we can leverage `cli_profile` from the enclosing scope.

Review comment: I think that would be a bit cleaner.
        cmd = ['databricks', '--profile', cli_profile, 'workspace'] + args
        subprocess.check_output(cmd, stderr=subprocess.STDOUT)

    work_dir = os.path.join(workspace_tmp_dir, str(uuid.uuid4()))
    workspace_path = os.path.join(work_dir, rel_path)

    run_cli_workspace_cmd(['mkdirs', os.path.join(work_dir, os.path.dirname(rel_path))])
    try:
        # `-l PYTHON` is required by CLI but ignored with `-f HTML`
        # This command works for all languages in SOURCE_EXTS
        run_cli_workspace_cmd(['import', '-o', '-l', 'PYTHON', '-f', 'HTML', html, workspace_path])
        run_cli_workspace_cmd(['export_dir', '-o', work_dir, SOURCE_DIR])
    finally:
        run_cli_workspace_cmd(['rm', '-r', work_dir])


if __name__ == '__main__':
    main()
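Following up on the inline thread about the nested definition: if the helper were hoisted to module level, as discussed there, it might look roughly like this. This is a sketch of the alternative, not part of this change:

```python
import subprocess


def run_cli_workspace_cmd(cli_profile, args):
    # Top-level helper: the profile is passed in explicitly rather than
    # captured from main()'s enclosing scope.
    cmd = ['databricks', '--profile', cli_profile, 'workspace'] + args
    subprocess.check_output(cmd, stderr=subprocess.STDOUT)


# main() would then call, for example:
# run_cli_workspace_cmd(cli_profile, ['mkdirs', workspace_dir])
```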
@@ -0,0 +1,73 @@
# Databricks notebook source
from pyspark.sql.types import *

# Human genome annotations in GFF3 are available at https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.39_GRCh38.p13/
gff_path = "/databricks-datasets/genomics/gffs/GCF_000001405.39_GRCh38.p13_genomic.gff.bgz"

# COMMAND ----------

# MAGIC %md
# MAGIC ## Read in GFF3 with inferred schema

# COMMAND ----------

# DBTITLE 0,Print inferred schema
original_gff_df = spark.read \
  .format("gff") \
  .load(gff_path)

original_gff_df.printSchema()

# COMMAND ----------

# DBTITLE 0,Read in the GFF3 with the inferred schema
display(original_gff_df)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Read in GFF3 with user-specified schema

# COMMAND ----------

mySchema = StructType(
  [StructField('seqId', StringType()),
   StructField('start', LongType()),
   StructField('end', LongType()),
   StructField('ID', StringType()),
   StructField('Dbxref', ArrayType(StringType())),
   StructField('gene', StringType()),
   StructField('mol_type', StringType())]
)

original_gff_df = spark.read \
  .schema(mySchema) \
  .format("gff") \
  .load(gff_path)

display(original_gff_df)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Read in GFF3 with user-specified schema including original GFF3 ``attributes`` column

# COMMAND ----------

mySchema = StructType(
  [StructField('seqId', StringType()),
   StructField('start', LongType()),
   StructField('end', LongType()),
   StructField('ID', StringType()),
   StructField('Dbxref', ArrayType(StringType())),
   StructField('gene', StringType()),
   StructField('mol_type', StringType()),
   StructField('attributes', StringType())]
)

original_gff_df = spark.read \
  .schema(mySchema) \
  .format("gff") \
  .load(gff_path)

display(original_gff_df)
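As a side note (not part of this notebook), the user-specified schema could also be derived by pruning the inferred schema rather than writing it out by hand. A sketch, reusing the names from the cells above and assuming those cells have already run:

```python
# Keep only a subset of the inferred schema's fields (names from the cells above).
wanted = {'seqId', 'start', 'end', 'ID', 'Dbxref', 'gene', 'mol_type'}
pruned_schema = StructType([f for f in original_gff_df.schema.fields if f.name in wanted])

pruned_gff_df = spark.read.schema(pruned_schema).format("gff").load(gff_path)
display(pruned_gff_df)
```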
@@ -0,0 +1,83 @@
# Databricks notebook source
# MAGIC %md
# MAGIC
# MAGIC To perform coordinate or variant liftover, you must download a chain file to each node.
# MAGIC
# MAGIC On a Databricks cluster, an example of a [cluster-scoped init script](https://docs.azuredatabricks.net/clusters/init-scripts.html#cluster-scoped-init-scripts) you can use to download the required file is as follows:
# MAGIC
# MAGIC ```
# MAGIC #!/usr/bin/env bash
# MAGIC set -ex
# MAGIC set -o pipefail
# MAGIC mkdir /opt/liftover
# MAGIC curl https://raw.githubusercontent.com/broadinstitute/gatk/master/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain --output /opt/liftover/b37ToHg38.over.chain
# MAGIC ```
# MAGIC In this demo, we perform coordinate and variant liftover from b37 to hg38.
# MAGIC
# MAGIC To perform variant liftover, you must download a reference file to each node of the cluster. Here, we assume the reference genome is downloaded to
# MAGIC ```/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa```
# MAGIC
# MAGIC If you are using a Databricks cluster with [Databricks Runtime for Genomics](https://docs.databricks.com/applications/genomics/index.html), this can be achieved by setting the [environment variable](https://docs.databricks.com/user-guide/clusters/spark-config.html#environment-variables) `refGenomeId=grch38`.

# COMMAND ----------

# DBTITLE 1,Import glow and define path variables
import glow
glow.register(spark)
chain_file = '/opt/liftover/b37ToHg38.over.chain'
reference_file = '/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa'
vcf_file = 'dbfs:/databricks-datasets/genomics/1kg-vcfs/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'

# COMMAND ----------

# DBTITLE 1,First, read in a VCF from a flat file or Delta Lake table.
input_df = spark.read.format("vcf") \
  .load(vcf_file) \
  .cache()

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC Now apply the `lift_over_coordinates` UDF, with the parameters as follows:
# MAGIC - chromosome (`string`)
# MAGIC - start (`long`)
# MAGIC - end (`long`)
# MAGIC - CONSTANT: chain file (`string`)
# MAGIC - OPTIONAL: minimum fraction of bases that must remap (`double`), defaults to `.95`
# MAGIC
# MAGIC This creates a column with the new coordinates.

# COMMAND ----------

from pyspark.sql.functions import *

# COMMAND ----------

liftover_expr = f"lift_over_coordinates(contigName, start, end, '{chain_file}', .99)"
input_with_lifted_df = input_df.select('contigName', 'start', 'end').withColumn('lifted', expr(liftover_expr))

# COMMAND ----------

# DBTITLE 1,Filter rows for which liftover succeeded and see which rows changed.
changed_with_lifted_df = input_with_lifted_df.filter("lifted is not null").filter("start != lifted.start")
display(changed_with_lifted_df)

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC Now apply the `lift_over_variants` transformer, with the following options:
# MAGIC - `chain_file`: `string`
# MAGIC - `reference_file`: `string`
# MAGIC - `min_match_ratio`: `double` (optional, defaults to `.95`)

# COMMAND ----------

output_df = glow.transform('lift_over_variants', input_df, chain_file=chain_file, reference_file=reference_file)

# COMMAND ----------

# DBTITLE 1,View the rows for which liftover succeeded
lifted_df = output_df.filter('liftOverStatus.success = true').drop('liftOverStatus')
display(lifted_df.select('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles', 'INFO_AC', 'INFO_SwappedAlleles', 'INFO_ReverseComplementedAlleles'))
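For completeness, the `lift_over_variants` call above relies on the default `min_match_ratio`. Spelling out that optional parameter (documented in the markdown cell above as defaulting to `.95`) would look roughly like this; a sketch, not part of the notebook:

```python
# Same transform as above, with the optional minimum match ratio made explicit.
output_df = glow.transform(
    'lift_over_variants',
    input_df,
    chain_file=chain_file,
    reference_file=reference_file,
    min_match_ratio=0.95)
```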
@@ -0,0 +1,33 @@
# Databricks notebook source
from pyspark.sql.functions import *
import glow
glow.register(spark)

# COMMAND ----------

# DBTITLE 1,Split Thousand Genome Project multi-sample VCF into 2 single-sample VCFs
vcf_df = spark.read.format('vcf').load('/databricks-datasets/genomics/1kg-vcfs/*.vcf.gz')
vcf_split1 = vcf_df.withColumn('genotypes', expr('filter(genotypes, (g, idx) -> g.sampleId = genotypes[0].sampleId)'))
vcf_split2 = vcf_df.withColumn('genotypes', expr('filter(genotypes, (g, idx) -> g.sampleId = genotypes[1].sampleId)'))
vcf_split1.write.format('bigvcf').mode('overwrite').save('/tmp/vcf-merge-demo/1.vcf.bgz')
vcf_split2.write.format('bigvcf').mode('overwrite').save('/tmp/vcf-merge-demo/2.vcf.bgz')

# COMMAND ----------

# DBTITLE 1,Show contents before merge
df_to_merge = spark.read.format('vcf').load(['/tmp/vcf-merge-demo/1.vcf.bgz', '/tmp/vcf-merge-demo/2.vcf.bgz'])
display(df_to_merge.select('contigName', 'start', col('genotypes').sampleId).orderBy('contigName', 'start', 'genotypes.sampleId'))

# COMMAND ----------

# DBTITLE 1,Merge genotype arrays
merged = df_to_merge.groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles')\
  .agg(sort_array(flatten(collect_list('genotypes'))).alias('genotypes'))
display(merged.orderBy('contigName', 'start').select('contigName', 'start', col('genotypes').sampleId))

# COMMAND ----------

# DBTITLE 1,Merge VCFs and sum INFO_DP
merged = df_to_merge.groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles')\
  .agg(sort_array(flatten(collect_list('genotypes'))).alias('genotypes'), sum('INFO_DP').alias('INFO_DP'))
display(merged.orderBy('contigName', 'start').select('contigName', 'start', 'INFO_DP', col('genotypes').sampleId))
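The groupBy/flatten/sort_array pattern above is the core of the merge. One way to package it for reuse is sketched below; `merge_vcf_rows`, `VARIANT_KEYS`, and `info_sum_cols` are hypothetical names introduced for illustration, not part of this PR:

```python
import pyspark.sql.functions as F

# Variant-identifying columns from Glow's VCF schema, as used in this notebook.
VARIANT_KEYS = ['contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles']


def merge_vcf_rows(df, info_sum_cols=()):
    # Concatenate the per-file genotype arrays for each variant, plus optional
    # INFO fields that should be summed across the merged rows.
    aggs = [F.sort_array(F.flatten(F.collect_list('genotypes'))).alias('genotypes')]
    aggs += [F.sum(c).alias(c) for c in info_sum_cols]
    return df.groupBy(*VARIANT_KEYS).agg(*aggs)


merged = merge_vcf_rows(df_to_merge, info_sum_cols=['INFO_DP'])
```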
@@ -0,0 +1,61 @@
# Databricks notebook source
# DBTITLE 1,Setup
# MAGIC %md
# MAGIC To use the variant normalizer, a copy of the reference genome `.fa/.fasta` file (along with its `.fai` file) must be downloaded to each node of the cluster.
# MAGIC
# MAGIC Here, we assume the reference genome is downloaded to the following path: `/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa`
# MAGIC
# MAGIC If you are using a Databricks cluster with [Databricks Runtime for Genomics](https://docs.databricks.com/applications/genomics/index.html), this can be done by setting the [environment variable](https://docs.databricks.com/user-guide/clusters/spark-config.html#environment-variables) ``refGenomeId=grch38`` for your cluster.

# COMMAND ----------

# DBTITLE 1,Define path variables
import glow
glow.register(spark)
ref_genome_path = '/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa'
vcf_path = '/databricks-datasets/genomics/variant-normalization/test_left_align_hg38.vcf'

# COMMAND ----------

# DBTITLE 1,Load a VCF into a DataFrame
original_variants_df = spark.read \
  .format("vcf") \
  .option("includeSampleIds", False) \
  .load(vcf_path)

# COMMAND ----------

# DBTITLE 1,Display
display(original_variants_df)

# COMMAND ----------

# DBTITLE 1,Normalize variants using normalize_variants transformer with column replacement
normalized_variants_df = glow.transform(
  "normalize_variants",
  original_variants_df,
  reference_genome_path=ref_genome_path
)

display(normalized_variants_df)

# COMMAND ----------

# DBTITLE 1,Normalize variants using normalize_variants transformer without column replacement
normalized_variants_df = glow.transform(
  "normalize_variants",
  original_variants_df,
  reference_genome_path=ref_genome_path,
  replace_columns="False"
)

display(normalized_variants_df)

# COMMAND ----------

# DBTITLE 1,Normalize variants using normalize_variant function
from glow.functions import *

normalized_variants_df = original_variants_df.select("*", normalize_variant("contigName", "start", "end", "referenceAllele", "alternateAlleles", ref_genome_path).alias("normalizationResult"))

display(normalized_variants_df)
Review comment: Might it make more sense to rely on the Databricks CLI's default profile?

Review comment: I found this useful during manual testing, as I already have a default Databricks CLI profile and wanted to test this on the same shard as the CI.

Review comment: I mean that it makes sense to have this argument, but the default value should be the default profile. If you're running locally, I don't think there's any reason to expect a profile called docs-ci to exist. We can pass the docs-ci argument from our CI script.
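If that suggestion were adopted, the option could default to the CLI's default profile and the CI job would pass `--cli-profile docs-ci` explicitly. A rough sketch of how gen-nb-src.py might change, where the `None` default and the conditional `--profile` flag are assumptions rather than part of this PR:

```python
import subprocess

import click


@click.command()
@click.option('--html', required=True, help='Path of the HTML notebook.')
@click.option('--cli-profile', default=None,
              help='Databricks CLI profile; falls back to the default profile when omitted.')
def main(html, cli_profile):
    def run_cli_workspace_cmd(args):
        # Only pass --profile when one was given; otherwise the CLI uses its default profile.
        profile_args = ['--profile', cli_profile] if cli_profile else []
        cmd = ['databricks'] + profile_args + ['workspace'] + args
        subprocess.check_output(cmd, stderr=subprocess.STDOUT)

    # ... rest of the import/export logic as in gen-nb-src.py ...


if __name__ == '__main__':
    main()
```

The CI step would then invoke it as `python docs/dev/gen-nb-src.py --html "${f}" --cli-profile docs-ci`, while local runs could omit the flag.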