This repository provides both an example and a scripted procedure for taking raw genomics reads to analysis ready variants following GATK's best practices, using Python.
In particular, we follow the GATK's steps for germline cohort data as outlined in Germline-short-variant-discovery-SNPs-Indels.
This pipeline uses the current latest versions of GATK and hail:
- GATK:
4.2.6.1
- Hail:
0.2.95
All raw data was obtained from 2 resources:
- 1000 Genomes Project and corresponding FTP data server
- Broad Institute and corresponding FTP data server (ftp://[email protected]/bundle/b37, only accessible via FTP client)
including unmapped read files for each sample (BAM), reference genome (Fasta) and known sites (VCF) data.
In order to keep our experiment reproducible, we utilize the GATK docker image available from dockerhub at broadinstitute/gatk:4.2.6.1
.
We use the latest version, which is currently 4.2.6.1
We emphasize the use of tools which promote scalability, including the use of GATK's BETA Spark features (e.g. ReadsPipelineSpark) and Hail for scalable downstream analytics. Both of these tools leverage Spark to enable large scale data processing, which allows for scaling from a local workstation to a large HPC or cloud cluster.
To see an example of the pipeline as markdown with bash commands executed from within the GATK docker image, please see the pipeline_mds
directory.
The main contribution of this repository is to script the pipeline using Python, running each of the GATK steps in docker and automatically handling the stages of the workflow. For example, the following python code snippet will perform the following:
- Download all data
- Sample reads (i.e., a cohort of reads)
- Known sites
- Reference
- Perform data preprocessing
- Filtering, sorting and indexing cohort reads
- Creating indices for known sites
- Creating an index for the reference
- Perform ReadsPipelineSpark for each sample, obtaining analysis-ready variants for each sample of the cohort
- Combine cohort samples into a GenomicsDB
- Perform joint genotyping on the cohort with HaplotypeCaller
- Fit/applies a variant recalibration model to obtain variant quality scores
- Applies VQSR to filter low quality variants
- Docker (tested 20.10.14)
- ~32 GB RAM
- ~20GB SSD space
- Linux recommended (Ubuntu 20.04 tested)
Create the conda environment
CONDA_ENV_NAME=reads_to_variants
conda create -n $CONDA_ENV_NAME python=3.7 -y
Activate the conda environment
conda activate $CONDA_ENV_NAME
Install the python package
pip install .
Install Jupyter Notebook
conda install ipykernel
Add the created conda environment to the ipython kernel
ipython kernel install --user --name=$CONDA_ENV_NAME
(Optional) Open the example notebook Start the Jupyter server from the root of this repository and navigate to the localhost shown.
jupyter notebook
Navigate to the example notebook located at examples/trio_variant_analysis_pipeline.ipynb
Below we duplicate some python code from the example notebook found examples/trio_variant_analysis_pipeline.ipynb
.
In particular, we download raw reads from a father/mother/child trio dataset obtained from the 1000 genomes project,
and use the Python helpers to obtain analysis ready variants that can be used by Hail.
from os.path import join
PHASE_3_1KG_ROOT_FTP = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data"
BROAD_INSTITUTE_B37_ROOT_FTP = "ftp://[email protected]/bundle/b37"
# Reads for father/mother/child
READS_BAM_URIS=[
# Father
join(PHASE_3_1KG_ROOT_FTP, "HG00731/alignment/HG00731.chrom20.ILLUMINA.bwa.PUR.low_coverage.20130422.bam"),
# Mother
join(PHASE_3_1KG_ROOT_FTP, "HG00732/alignment/HG00732.chrom20.ILLUMINA.bwa.PUR.low_coverage.20130422.bam"),
# Child
join(PHASE_3_1KG_ROOT_FTP, "HG00733/alignment/HG00733.chrom20.ILLUMINA.bwa.PUR.low_coverage.20130415.bam"),
]
# Reference genome
REF_FA_GZ_URI = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz"
# Known sites for variant extraction
KNOWN_EXTRACTION_VCF_GZ_URIS = [
join(BROAD_INSTITUTE_B37_ROOT_FTP, "dbsnp_138.b37.vcf.gz")
]
# Known sites for VQSR model
KNOWN_SITES_VQSR_URIS = [
join(BROAD_INSTITUTE_B37_ROOT_FTP, "hapmap_3.3.b37.vcf.gz"),
join(BROAD_INSTITUTE_B37_ROOT_FTP, "1000G_omni2.5.b37.vcf.gz"),
join(BROAD_INSTITUTE_B37_ROOT_FTP, "1000G_phase1.snps.high_confidence.b37.vcf.gz"),
join(BROAD_INSTITUTE_B37_ROOT_FTP, "dbsnp_138.b37.vcf.gz")
]
from src.gatk.extract_sample_variants import ExtractGenomicVariants
variant_discovery = ExtractGenomicVariants(
volume_hostpath='/data/Genomics',
paired_reads_file_bam_uris=READS_BAM_URIS,
ref_file_fasta_uri=REF_FA_GZ_URI,
known_sites_vcf_uris=KNOWN_EXTRACTION_VCF_GZ_URIS,
dry_run=False,
spark_driver_memory= '5g',
spark_executor_memory= '8g',
java_options= '-Xmx48g',
)
# Run the pipeline to obtain the GVCF path for each sample
sample_gvcf_paths = variant_discovery.run_pipeline()
print(sample_gvcf_paths)
# ['/data/sample_gvcfs/human/HG00731.g.vcf', '/data/sample_gvcfs/human/HG00732.g.vcf', '/data/sample_gvcfs/human/HG00733.g.vcf']
from src.gatk.curate_genotype_dataset import CurateGenotypeDataset
curated_genotyped_dataset = CurateGenotypeDataset(
volume_hostpath='/data/Genomics',
intervals=['20'],
gvcf_paths=sample_gvcf_paths,
annotations=[
"QD",
"MQ",
"MQRankSum",
"ReadPosRankSum",
"FS",
"SOR"
],
known_sites_vqsr=KNOWN_SITES_VQSR_URIS,
resources=[
('hapmap,known=false,training=true,truth=true,prior=15.0', 'hapmap_3.3.b37.vcf'),
('omni,known=false,training=true,truth=false,prior=12.0', '1000G_omni2.5.b37.vcf'),
('1000G,known=false,training=true,truth=false,prior=10.0', '1000G_phase1.snps.high_confidence.b37.vcf'),
('dbsnp,known=true,training=false,truth=false,prior=2.0', 'dbsnp_138.b37.vcf'),
],
reference_path=variant_discovery.raw_data_container_paths_map[ExtractGenomicVariants.REFERENCE_DIR],
dry_run=False
)
# Combine sample GVCFs and extract filtered variants for analysis
filtered_vcf_save_path = curated_genotyped_dataset.run_pipeline()
print(filtered_vcf_save_path)
# /data/curated_gvcfs/human/gvcf_sample_HG00731_HG00732_HG00733__n_gvcfs_3__intervals_20__annotations_QD_MQ_MQRankSum_ReadPosRankSum_FS_SOR.filtered_variants.g.vcf
from src.hail.utils import vcf_path_to_mt, hl
mt = vcf_path_to_mt(filtered_vcf_save_path)
mt.GT.show(n_rows=100)
'HG00731' | 'HG00732' | 'HG00733' | ||
locus | alleles | GT | GT | GT |
locus<GRCh37> | array<str> | call | call | call |
20:65900 | ["G","A"] | 1/1 | 1/1 | 1/1 |
20:66370 | ["G","A"] | 1/1 | 1/1 | 1/1 |
20:67500 | ["T","TTGGTATCTAG"] | 1/1 | 1/1 | 1/1 |
20:68749 | ["T","C"] | 0/1 | 0/1 | 1/1 |
20:69094 | ["G","A"] | 0/1 | 0/1 | 0/1 |
20:69481 | ["CT","C"] | 0/0 | 0/1 | 0/0 |
20:69506 | ["G","GACACAC","GACAC"] | 0/1 | 1/2 | 0/1 |
20:72104 | ["TA","T"] | 0/0 | 0/1 | 0/0 |
20:74347 | ["G","A"] | 0/1 | 0/0 | 0/0 |
20:74729 | ["G","GT"] | 0/0 | 0/1 | 0/0 |
showing top 10 rows
dataset = mt.filter_rows(hl.agg.any(hl.is_missing(mt.GT)) == False)
print(f"Num NA rows dropped: {mt.count()[0] - dataset.count()[0]}")
mt.filter_rows(hl.agg.any(hl.is_missing(mt.GT)) == True).GT.show()
# Num NA rows dropped: 3
'HG00731' | 'HG00732' | 'HG00733' | ||
locus | alleles | GT | GT | GT |
locus<GRCh37> | array<str> | call | call | call |
20:37584126 | ["AAAG","A"] | NA | 0/0 | 1/1 |
20:44386269 | ["GTGTATATATATGCATATGTATGTATATATATC","G"] | 0/0 | NA | 1/1 |
20:50252339 | ["A","AGGAG"] | 0/0 | 1|1 | NA |
parents_hom_ref = mt.filter_rows((hl.literal([True, True]) == hl.agg.collect(mt.GT.is_hom_ref())[0:2]) == True)
parents_hom_ref.GT.show()
print(f"Percentage of total: {round(parents_hom_ref.count()[0]/mt.count()[0] * 100, 2)}%")
'HG00731' | 'HG00732' | 'HG00733' | ||
locus | alleles | GT | GT | GT |
locus<GRCh37> | array<str> | call | call | call |
20:114916 | ["G","A"] | 0/0 | 0/0 | 1/1 |
20:142262 | ["C","G"] | 0/0 | 0/0 | 1|1 |
20:142339 | ["T","C"] | 0/0 | 0/0 | 1/1 |
20:221118 | ["G","GAAA"] | 0/0 | 0/0 | 0/1 |
20:248667 | ["G","A"] | 0/0 | 0/0 | 0/1 |
20:254517 | ["C","CAAGAA"] | 0/0 | 0/0 | 0|1 |
20:259563 | ["C","A"] | 0/0 | 0/0 | 1/1 |
20:259818 | ["G","A"] | 0/0 | 0/0 | 0/1 |
20:261059 | ["C","T"] | 0/0 | 0/0 | 0/1 |
20:285148 | ["CA","C"] | 0/0 | 0/0 | 0/1 |
showing top 10 rows
Percentage of total: 1.29%
parents_het = mt.filter_rows((hl.literal([True, True]) == hl.agg.collect(mt.GT.is_het())[0:2]) == True)
parents_het.GT.show()
print(f"Percentage of total: {round(parents_het.count()[0]/mt.count()[0] * 100, 2)}%")
'HG00731' | 'HG00732' | 'HG00733' | ||
locus | alleles | GT | GT | GT |
locus<GRCh37> | array<str> | call | call | call |
20:68749 | ["T","C"] | 0/1 | 0/1 | 1/1 |
20:69094 | ["G","A"] | 0/1 | 0/1 | 0/1 |
20:69506 | ["G","GACACAC","GACAC"] | 0/1 | 1/2 | 0/1 |
20:76962 | ["T","C"] | 0/1 | 0/1 | 0/0 |
20:77965 | ["G","GT","GTT"] | 0/1 | 0/2 | 0/0 |
20:82012 | ["T","C"] | 0/1 | 0/1 | 0/0 |
20:87416 | ["A","C"] | 0/1 | 0/1 | 0/1 |
20:106703 | ["C","CT","CTTT"] | 0/1 | 1/2 | 1/1 |
20:126529 | ["G","A"] | 0/1 | 0/1 | 0/0 |
20:126600 | ["C","A"] | 0/1 | 0/1 | 0/0 |
showing top 10 rows
Percentage of total: 17.52%
child_hom_var = mt.filter_rows((hl.agg.collect(mt.GT.is_hom_var())[2]==True) == True)
child_hom_var.GT.show()
print(f"Percentage of total: {round(child_hom_var.count()[0]/mt.count()[0] * 100, 2)}%")
'HG00731' | 'HG00732' | 'HG00733' | ||
locus | alleles | GT | GT | GT |
locus<GRCh37> | array<str> | call | call | call |
20:65900 | ["G","A"] | 1/1 | 1/1 | 1/1 |
20:66370 | ["G","A"] | 1/1 | 1/1 | 1/1 |
20:67500 | ["T","TTGGTATCTAG"] | 1/1 | 1/1 | 1/1 |
20:68749 | ["T","C"] | 0/1 | 0/1 | 1/1 |
20:80655 | ["A","G"] | 1/1 | 1/1 | 1/1 |
20:106703 | ["C","CT","CTTT"] | 0/1 | 1/2 | 1/1 |
20:114916 | ["G","A"] | 0/0 | 0/0 | 1/1 |
20:127952 | ["C","T"] | 0/1 | 0/1 | 1/1 |
20:128863 | ["G","C"] | 0/1 | 0/1 | 1/1 |
20:129063 | ["A","G"] | 0/1 | 0/1 | 1/1 |
showing top 10 rows
Percentage of total: 27.36%