Commit
* First try
* Fiddle with environment
* Activate glow env in docs steps
* Try attaching workspace
* Run subprocess in shell
* Try breaking checker
* Try triggering broken flow
* Figure out what's happening
* Try again
* Try again
* Mess up py again
* use python
* Try triggering error
* Print cg
* env ar
* Try again
* Remove test lines
* debug dbcli
* debug again
* comment
* more debug
* Try updating cli
* use westus2
* fix gff
* Clarify workspace-tmp-dir option and use f-based str interpolation
* Address comments
* Add conda-forge as channel
* Try updating conda
* Reorder deps with Python first

Signed-off-by: Karen Feng <[email protected]>
Showing 17 changed files with 1,851 additions and 8 deletions.
.circleci/config.yml
@@ -9,8 +9,8 @@ install_conda_deps: &install_conda_deps
     command: |
       export PATH=$HOME/conda/bin:$PATH
       if [ ! -d "/home/circleci/conda" ]; then
-        wget https://repo.continuum.io/miniconda/Miniconda3-4.3.31-Linux-x86_64.sh
-        /bin/bash Miniconda3-4.3.31-Linux-x86_64.sh -b -p $HOME/conda
+        wget https://repo.continuum.io/miniconda/Miniconda3-py37_4.8.3-Linux-x86_64.sh
+        /bin/bash Miniconda3-py37_4.8.3-Linux-x86_64.sh -b -p $HOME/conda
         conda env create -f python/environment.yml
       else
         echo "Conda already installed"
@@ -26,6 +26,7 @@ install_pyspark2: &install_pyspark2
       conda remove -y pyspark
       pip install pyspark==2.4.5
 check_clean_repo: &check_clean_repo
   run:
     name: Verify that repo is clean
@@ -45,7 +46,7 @@ orbs:
   codecov: codecov/[email protected]
 jobs:
-  check-links:
+  check-docs:
     <<: *setup_base
     steps:
       - checkout
@@ -64,6 +65,19 @@ jobs:
             export PATH=$HOME/conda/envs/glow/bin:$PATH
             cd docs
             make linkcheck
+      - run:
+          name: Configure Databricks CLI
+          command: |
+            printf "[docs-ci]\nhost = https://westus2.azuredatabricks.net\ntoken = ${DATABRICKS_API_TOKEN}\n" > ~/.databrickscfg
+      - run:
+          name: Generate notebook source files
+          command: |
+            export PATH=$HOME/conda/bin:$PATH
+            source activate glow
+            for f in $(find docs/source/_static/notebooks -type f -name '*.html'); do
+              python docs/dev/gen-nb-src.py --html "${f}" --cli-profile docs-ci
+            done
+      - *check_clean_repo

   scala-2_11-tests:
     <<: *setup_base
@@ -200,7 +214,7 @@ workflows:
   version: 2
   test:
     jobs:
-      - check-links
+      - check-docs
       - scala-2_11-tests
       - scala-2_12-tests
       - spark-3-tests
@@ -213,5 +227,5 @@ workflows:
           only:
             - master
     jobs:
-      - check-links
+      - check-docs
       - spark-3-tests
docs/dev/gen-nb-src.py
@@ -0,0 +1,62 @@
'''
Transforms a .html notebook into its source .py/.scala/.r/.sql file.
This script is used by the CircleCI job 'check-docs'. Before running this, configure
your Databricks CLI profile.
Example usage:
  python3 docs/dev/gen-nb-src.py \
    --html docs/source/_static/notebooks/etl/variant-data.html
'''
import click
import subprocess
import os
import uuid

NOTEBOOK_DIR = 'docs/source/_static/notebooks'
SOURCE_DIR = 'docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE'
SOURCE_EXTS = ['scala', 'py', 'r', 'sql']


def run_cli_workspace_cmd(cli_profile, args):
    cmd = ['databricks', '--profile', cli_profile, 'workspace'] + args
    res = subprocess.run(cmd, capture_output=True)
    if res.returncode != 0:
        raise ValueError(res)


@click.command()
@click.option('--html', required=True, help='Path of the HTML notebook.')
@click.option('--cli-profile', default='DEFAULT', help='Databricks CLI profile name.')
@click.option('--workspace-tmp-dir', default='/tmp/glow-docs-ci', help='Base workspace dir; a temporary directory will be generated under this for import/export.')
def main(html, cli_profile, workspace_tmp_dir):
    assert os.path.commonpath([NOTEBOOK_DIR, html]) == NOTEBOOK_DIR, \
        f"HTML notebook must be under {NOTEBOOK_DIR} but got {html}."
    rel_path = os.path.splitext(os.path.relpath(html, NOTEBOOK_DIR))[0]

    if not os.path.exists(html):  # html notebook was deleted
        print(f"{html} does not exist. Deleting the companion source file...")
        for ext in SOURCE_EXTS:
            source_path = os.path.join(SOURCE_DIR, rel_path + "." + ext)
            if os.path.exists(source_path):
                os.remove(source_path)
                print(f"\tDeleted {source_path}.")
        return

    print(f"Generating source file for {html} under {SOURCE_DIR} ...")

    work_dir = os.path.join(workspace_tmp_dir, str(uuid.uuid4()))
    workspace_path = os.path.join(work_dir, rel_path)

    run_cli_workspace_cmd(cli_profile, ['mkdirs', os.path.join(work_dir, os.path.dirname(rel_path))])
    try:
        # `-l PYTHON` is required by the CLI but ignored with `-f HTML`;
        # this command works for all languages in SOURCE_EXTS
        run_cli_workspace_cmd(cli_profile, ['import', '-o', '-l', 'PYTHON', '-f', 'HTML', html, workspace_path])
        run_cli_workspace_cmd(cli_profile, ['export_dir', '-o', work_dir, SOURCE_DIR])
    finally:
        run_cli_workspace_cmd(cli_profile, ['rm', '-r', work_dir])


if __name__ == '__main__':
    main()
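As a quick illustration (not part of the commit), the path mapping performed by the script above can be traced as follows; the notebook path in this sketch is only an example:

```python
import os

NOTEBOOK_DIR = 'docs/source/_static/notebooks'
SOURCE_DIR = 'docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE'

html = 'docs/source/_static/notebooks/etl/gff.html'  # example input notebook
rel_path = os.path.splitext(os.path.relpath(html, NOTEBOOK_DIR))[0]  # -> 'etl/gff'

# After the workspace import/export round trip, the Python source lands at:
print(os.path.join(SOURCE_DIR, rel_path + '.py'))
# docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/etl/gff.py
```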
73 changes: 73 additions & 0 deletions
docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/etl/gff.py
@@ -0,0 +1,73 @@
# Databricks notebook source
from pyspark.sql.types import *

# Human genome annotations in GFF3 are available at https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.39_GRCh38.p13/
gff_path = "/databricks-datasets/genomics/gffs/GCF_000001405.39_GRCh38.p13_genomic.gff.bgz"

# COMMAND ----------

# MAGIC %md
# MAGIC ## Read in GFF3 with inferred schema

# COMMAND ----------

# DBTITLE 0,Print inferred schema
original_gff_df = spark.read \
  .format("gff") \
  .load(gff_path)

original_gff_df.printSchema()

# COMMAND ----------

# DBTITLE 0,Read in the GFF3 with the inferred schema
display(original_gff_df)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Read in GFF3 with user-specified schema

# COMMAND ----------

mySchema = StructType(
  [StructField('seqId', StringType()),
   StructField('start', LongType()),
   StructField('end', LongType()),
   StructField('ID', StringType()),
   StructField('Dbxref', ArrayType(StringType())),
   StructField('gene', StringType()),
   StructField('mol_type', StringType())]
)

original_gff_df = spark.read \
  .schema(mySchema) \
  .format("gff") \
  .load(gff_path)

display(original_gff_df)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Read in GFF3 with user-specified schema including original GFF3 ``attributes`` column

# COMMAND ----------

mySchema = StructType(
  [StructField('seqId', StringType()),
   StructField('start', LongType()),
   StructField('end', LongType()),
   StructField('ID', StringType()),
   StructField('Dbxref', ArrayType(StringType())),
   StructField('gene', StringType()),
   StructField('mol_type', StringType()),
   StructField('attributes', StringType())]
)

original_gff_df = spark.read \
  .schema(mySchema) \
  .format("gff") \
  .load(gff_path)

display(original_gff_df)
83 changes: 83 additions & 0 deletions
docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/etl/lift-over.py
@@ -0,0 +1,83 @@
# Databricks notebook source
# MAGIC %md
# MAGIC
# MAGIC To perform coordinate or variant liftover, you must download a chain file to each node.
# MAGIC
# MAGIC On a Databricks cluster, an example of a [cluster-scoped init script](https://docs.azuredatabricks.net/clusters/init-scripts.html#cluster-scoped-init-scripts) you can use to download the required file is as follows:
# MAGIC
# MAGIC ```
# MAGIC #!/usr/bin/env bash
# MAGIC set -ex
# MAGIC set -o pipefail
# MAGIC mkdir /opt/liftover
# MAGIC curl https://raw.githubusercontent.com/broadinstitute/gatk/master/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain --output /opt/liftover/b37ToHg38.over.chain
# MAGIC ```
# MAGIC In this demo, we perform coordinate and variant liftover from b37 to hg38.
# MAGIC
# MAGIC To perform variant liftover, you must download a reference file to each node of the cluster. Here, we assume the reference genome is downloaded to
# MAGIC ```/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa```
# MAGIC
# MAGIC If you are using a Databricks cluster with [Databricks Runtime for Genomics](https://docs.databricks.com/applications/genomics/index.html), this can be achieved by setting the [environment variable](https://docs.databricks.com/user-guide/clusters/spark-config.html#environment-variables) `refGenomeId=grch38`.

# COMMAND ----------

# DBTITLE 1,Import glow and define path variables
import glow
glow.register(spark)
chain_file = '/opt/liftover/b37ToHg38.over.chain'
reference_file = '/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa'
vcf_file = 'dbfs:/databricks-datasets/genomics/1kg-vcfs/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'

# COMMAND ----------

# DBTITLE 1,First, read in a VCF from a flat file or Delta Lake table.
input_df = spark.read.format("vcf") \
  .load(vcf_file) \
  .cache()

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC Now apply the `lift_over_coordinates` UDF, with the parameters as follows:
# MAGIC - chromosome (`string`)
# MAGIC - start (`long`)
# MAGIC - end (`long`)
# MAGIC - CONSTANT: chain file (`string`)
# MAGIC - OPTIONAL: minimum fraction of bases that must remap (`double`), defaults to `.95`
# MAGIC
# MAGIC This creates a column with the new coordinates.

# COMMAND ----------

from pyspark.sql.functions import *

# COMMAND ----------

liftover_expr = f"lift_over_coordinates(contigName, start, end, '{chain_file}', .99)"
input_with_lifted_df = input_df.select('contigName', 'start', 'end').withColumn('lifted', expr(liftover_expr))

# COMMAND ----------

# DBTITLE 1,Filter rows for which liftover succeeded and see which rows changed.
changed_with_lifted_df = input_with_lifted_df.filter("lifted is not null").filter("start != lifted.start")
display(changed_with_lifted_df)

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC Now apply the `lift_over_variants` transformer, with the following options.
# MAGIC - `chain_file`: `string`
# MAGIC - `reference_file`: `string`
# MAGIC - `min_match_ratio`: `double` (optional, defaults to `.95`)

# COMMAND ----------

output_df = glow.transform('lift_over_variants', input_df, chain_file=chain_file, reference_file=reference_file)

# COMMAND ----------

# DBTITLE 1,View the rows for which liftover succeeded
lifted_df = output_df.filter('liftOverStatus.success = true').drop('liftOverStatus')
display(lifted_df.select('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles', 'INFO_AC', 'INFO_SwappedAlleles', 'INFO_ReverseComplementedAlleles'))
33 changes: 33 additions & 0 deletions
docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/etl/merge-vcf.py
@@ -0,0 +1,33 @@
# Databricks notebook source
from pyspark.sql.functions import *
import glow
glow.register(spark)

# COMMAND ----------

# DBTITLE 1,Split Thousand Genomes Project multi-sample VCF into 2 single-sample VCFs
vcf_df = spark.read.format('vcf').load('/databricks-datasets/genomics/1kg-vcfs/*.vcf.gz')
vcf_split1 = vcf_df.withColumn('genotypes', expr('filter(genotypes, (g, idx) -> g.sampleId = genotypes[0].sampleId)'))
vcf_split2 = vcf_df.withColumn('genotypes', expr('filter(genotypes, (g, idx) -> g.sampleId = genotypes[1].sampleId)'))
vcf_split1.write.format('bigvcf').mode('overwrite').save('/tmp/vcf-merge-demo/1.vcf.bgz')
vcf_split2.write.format('bigvcf').mode('overwrite').save('/tmp/vcf-merge-demo/2.vcf.bgz')

# COMMAND ----------

# DBTITLE 1,Show contents before merge
df_to_merge = spark.read.format('vcf').load(['/tmp/vcf-merge-demo/1.vcf.bgz', '/tmp/vcf-merge-demo/2.vcf.bgz'])
display(df_to_merge.select('contigName', 'start', col('genotypes').sampleId).orderBy('contigName', 'start', 'genotypes.sampleId'))

# COMMAND ----------

# DBTITLE 1,Merge genotype arrays
merged = df_to_merge.groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles')\
  .agg(sort_array(flatten(collect_list('genotypes'))).alias('genotypes'))
display(merged.orderBy('contigName', 'start').select('contigName', 'start', col('genotypes').sampleId))

# COMMAND ----------

# DBTITLE 1,Merge VCFs and sum INFO_DP
merged = df_to_merge.groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles')\
  .agg(sort_array(flatten(collect_list('genotypes'))).alias('genotypes'), sum('INFO_DP').alias('INFO_DP'))
display(merged.orderBy('contigName', 'start').select('contigName', 'start', 'INFO_DP', col('genotypes').sampleId))
61 changes: 61 additions & 0 deletions
docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/etl/normalizevariants.py
@@ -0,0 +1,61 @@
# Databricks notebook source
# DBTITLE 1,Setup
# MAGIC %md
# MAGIC To use the variant normalizer, a copy of the reference genome `.fa/.fasta` file (along with its `.fai` file) must be downloaded to each node of the cluster.
# MAGIC
# MAGIC Here, we assume the reference genome is downloaded to the following path: `/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa`
# MAGIC
# MAGIC If you are using a Databricks cluster with [Databricks Runtime for Genomics](https://docs.databricks.com/applications/genomics/index.html), this can be done by setting the [environment variable](https://docs.databricks.com/user-guide/clusters/spark-config.html#environment-variables) ``refGenomeId=grch38`` for your cluster.

# COMMAND ----------

# DBTITLE 1,Define path variables
import glow
glow.register(spark)
ref_genome_path = '/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa'
vcf_path = '/databricks-datasets/genomics/variant-normalization/test_left_align_hg38.vcf'

# COMMAND ----------

# DBTITLE 1,Load a VCF into a DataFrame
original_variants_df = spark.read\
  .format("vcf")\
  .option("includeSampleIds", False)\
  .load(vcf_path)

# COMMAND ----------

# DBTITLE 1,Display
display(original_variants_df)

# COMMAND ----------

# DBTITLE 1,Normalize variants using normalize_variants transformer with column replacement
normalized_variants_df = glow.transform(
  "normalize_variants",
  original_variants_df,
  reference_genome_path=ref_genome_path
)

display(normalized_variants_df)

# COMMAND ----------

# DBTITLE 1,Normalize variants using normalize_variants transformer without column replacement
normalized_variants_df = glow.transform(
  "normalize_variants",
  original_variants_df,
  reference_genome_path=ref_genome_path,
  replace_columns="False"
)

display(normalized_variants_df)

# COMMAND ----------

# DBTITLE 1,Normalize variants using normalize_variant function
from glow.functions import *

normalized_variants_df = original_variants_df.select("*", normalize_variant("contigName", "start", "end", "referenceAllele", "alternateAlleles", ref_genome_path).alias("normalizationResult"))

display(normalized_variants_df)