Reads to Analysis Ready Variants

Overview

This repository provides both an example and a scripted procedure for taking raw genomics reads to analysis ready variants following GATK's best practices, using Python.

In particular, we follow the GATK's steps for germline cohort data as outlined in Germline-short-variant-discovery-SNPs-Indels.

This pipeline uses the current latest versions of GATK and hail:

GATK: 4.2.6.1
Hail: 0.2.95

Methodology

Obtaining data

All raw data was obtained from 2 resources:

1000 Genomes Project and corresponding FTP data server
Broad Institute and corresponding FTP data server (ftp://[email protected]/bundle/b37, only accessible via FTP client)

including unmapped read files for each sample (BAM), reference genome (Fasta) and known sites (VCF) data.

Executing GATK steps in docker

In order to keep our experiment reproducible, we utilize the GATK docker image available from dockerhub at broadinstitute/gatk:4.2.6.1. We use the latest version, which is currently 4.2.6.1

Scalability

We emphasize the use of tools which promote scalability, including the use of GATK's BETA Spark features (e.g. ReadsPipelineSpark) and Hail for scalable downstream analytics. Both of these tools leverage Spark to enable large scale data processing, which allows for scaling from a local workstation to a large HPC or cloud cluster.

Example Pipeline as Markdown

To see an example of the pipeline as markdown with bash commands executed from within the GATK docker image, please see the pipeline_mds directory.

Helper classes for automating the pipeline with Python

The main contribution of this repository is to script the pipeline using Python, running each of the GATK steps in docker and automatically handling the stages of the workflow. For example, the following python code snippet will perform the following:

Download all data
1. Sample reads (i.e., a cohort of reads)
2. Known sites
3. Reference
Perform data preprocessing
1. Filtering, sorting and indexing cohort reads
2. Creating indices for known sites
3. Creating an index for the reference
Perform ReadsPipelineSpark for each sample, obtaining analysis-ready variants for each sample of the cohort
Combine cohort samples into a GenomicsDB
Perform joint genotyping on the cohort with HaplotypeCaller
Fit/applies a variant recalibration model to obtain variant quality scores
Applies VQSR to filter low quality variants

Python package installation

Prerequisites

Docker (tested 20.10.14)
~32 GB RAM
~20GB SSD space
Linux recommended (Ubuntu 20.04 tested)

Setup environment

Create the conda environment

CONDA_ENV_NAME=reads_to_variants
conda create -n $CONDA_ENV_NAME python=3.7 -y

Activate the conda environment

conda activate $CONDA_ENV_NAME

Install the python package

pip install .

Install Jupyter Notebook

conda install ipykernel

Add the created conda environment to the ipython kernel

ipython kernel install --user --name=$CONDA_ENV_NAME

(Optional) Open the example notebook Start the Jupyter server from the root of this repository and navigate to the localhost shown.

jupyter notebook

Navigate to the example notebook located at examples/trio_variant_analysis_pipeline.ipynb

Python Getting Started

Below we duplicate some python code from the example notebook found examples/trio_variant_analysis_pipeline.ipynb. In particular, we download raw reads from a father/mother/child trio dataset obtained from the 1000 genomes project, and use the Python helpers to obtain analysis ready variants that can be used by Hail.

from os.path import join

PHASE_3_1KG_ROOT_FTP = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data"
BROAD_INSTITUTE_B37_ROOT_FTP = "ftp://[email protected]/bundle/b37"

# Reads for father/mother/child
READS_BAM_URIS=[
    # Father
    join(PHASE_3_1KG_ROOT_FTP, "HG00731/alignment/HG00731.chrom20.ILLUMINA.bwa.PUR.low_coverage.20130422.bam"),
    # Mother
    join(PHASE_3_1KG_ROOT_FTP, "HG00732/alignment/HG00732.chrom20.ILLUMINA.bwa.PUR.low_coverage.20130422.bam"),
    # Child
    join(PHASE_3_1KG_ROOT_FTP, "HG00733/alignment/HG00733.chrom20.ILLUMINA.bwa.PUR.low_coverage.20130415.bam"),
]

# Reference genome
REF_FA_GZ_URI = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz"

# Known sites for variant extraction
KNOWN_EXTRACTION_VCF_GZ_URIS = [
    join(BROAD_INSTITUTE_B37_ROOT_FTP, "dbsnp_138.b37.vcf.gz")
]

# Known sites for VQSR model
KNOWN_SITES_VQSR_URIS = [
    join(BROAD_INSTITUTE_B37_ROOT_FTP, "hapmap_3.3.b37.vcf.gz"),
    join(BROAD_INSTITUTE_B37_ROOT_FTP, "1000G_omni2.5.b37.vcf.gz"),
    join(BROAD_INSTITUTE_B37_ROOT_FTP, "1000G_phase1.snps.high_confidence.b37.vcf.gz"),
    join(BROAD_INSTITUTE_B37_ROOT_FTP, "dbsnp_138.b37.vcf.gz")
]

Perform variant discovery to obtain analysis-ready variants

from src.gatk.extract_sample_variants import ExtractGenomicVariants

variant_discovery = ExtractGenomicVariants(
    volume_hostpath='/data/Genomics',
    paired_reads_file_bam_uris=READS_BAM_URIS,
    ref_file_fasta_uri=REF_FA_GZ_URI,
    known_sites_vcf_uris=KNOWN_EXTRACTION_VCF_GZ_URIS,
    dry_run=False,
    spark_driver_memory= '5g',
    spark_executor_memory= '8g',
    java_options= '-Xmx48g',
)

# Run the pipeline to obtain the GVCF path for each sample
sample_gvcf_paths = variant_discovery.run_pipeline()
print(sample_gvcf_paths)
# ['/data/sample_gvcfs/human/HG00731.g.vcf', '/data/sample_gvcfs/human/HG00732.g.vcf', '/data/sample_gvcfs/human/HG00733.g.vcf']

Curate a genotype dataset and filter low quality variants

from src.gatk.curate_genotype_dataset import CurateGenotypeDataset

curated_genotyped_dataset = CurateGenotypeDataset(
    volume_hostpath='/data/Genomics',
    intervals=['20'],
    gvcf_paths=sample_gvcf_paths,
    annotations=[
        "QD",
        "MQ",
        "MQRankSum",
        "ReadPosRankSum",
        "FS",
        "SOR"
    ],
    known_sites_vqsr=KNOWN_SITES_VQSR_URIS,
    resources=[
        ('hapmap,known=false,training=true,truth=true,prior=15.0', 'hapmap_3.3.b37.vcf'),
        ('omni,known=false,training=true,truth=false,prior=12.0', '1000G_omni2.5.b37.vcf'),
        ('1000G,known=false,training=true,truth=false,prior=10.0', '1000G_phase1.snps.high_confidence.b37.vcf'),
        ('dbsnp,known=true,training=false,truth=false,prior=2.0', 'dbsnp_138.b37.vcf'),
    ],
    reference_path=variant_discovery.raw_data_container_paths_map[ExtractGenomicVariants.REFERENCE_DIR],
    dry_run=False
)

# Combine sample GVCFs and extract filtered variants for analysis
filtered_vcf_save_path = curated_genotyped_dataset.run_pipeline()

print(filtered_vcf_save_path)
# /data/curated_gvcfs/human/gvcf_sample_HG00731_HG00732_HG00733__n_gvcfs_3__intervals_20__annotations_QD_MQ_MQRankSum_ReadPosRankSum_FS_SOR.filtered_variants.g.vcf

Load the cohort VCF file into Hail as a MatrixTable

from src.hail.utils import vcf_path_to_mt, hl

mt = vcf_path_to_mt(filtered_vcf_save_path)

Perform analysis

mt.GT.show(n_rows=100)


		'HG00731'	'HG00732'	'HG00733'
locus	alleles	GT	GT	GT
locus<GRCh37>	array<str>	call	call	call
20:65900	["G","A"]	1/1	1/1	1/1
20:66370	["G","A"]	1/1	1/1	1/1
20:67500	["T","TTGGTATCTAG"]	1/1	1/1	1/1
20:68749	["T","C"]	0/1	0/1	1/1
20:69094	["G","A"]	0/1	0/1	0/1
20:69481	["CT","C"]	0/0	0/1	0/0
20:69506	["G","GACACAC","GACAC"]	0/1	1/2	0/1
20:72104	["TA","T"]	0/0	0/1	0/0
20:74347	["G","A"]	0/1	0/0	0/0
20:74729	["G","GT"]	0/0	0/1	0/0

showing top 10 rows

Filter missing rows

dataset = mt.filter_rows(hl.agg.any(hl.is_missing(mt.GT)) == False)
print(f"Num NA rows dropped: {mt.count()[0] - dataset.count()[0]}")
mt.filter_rows(hl.agg.any(hl.is_missing(mt.GT)) == True).GT.show()
# Num NA rows dropped: 3


		'HG00731'	'HG00732'	'HG00733'
locus	alleles	GT	GT	GT
locus<GRCh37>	array<str>	call	call	call
20:37584126	["AAAG","A"]	NA	0/0	1/1
20:44386269	["GTGTATATATATGCATATGTATGTATATATATC","G"]	0/0	NA	1/1
20:50252339	["A","AGGAG"]	0/0	1\|1	NA

Find known_sites where parents are both homozygous reference

parents_hom_ref = mt.filter_rows((hl.literal([True, True]) == hl.agg.collect(mt.GT.is_hom_ref())[0:2]) == True)

parents_hom_ref.GT.show()
print(f"Percentage of total: {round(parents_hom_ref.count()[0]/mt.count()[0] * 100, 2)}%")


		'HG00731'	'HG00732'	'HG00733'
locus	alleles	GT	GT	GT
locus<GRCh37>	array<str>	call	call	call
20:114916	["G","A"]	0/0	0/0	1/1
20:142262	["C","G"]	0/0	0/0	1\|1
20:142339	["T","C"]	0/0	0/0	1/1
20:221118	["G","GAAA"]	0/0	0/0	0/1
20:248667	["G","A"]	0/0	0/0	0/1
20:254517	["C","CAAGAA"]	0/0	0/0	0\|1
20:259563	["C","A"]	0/0	0/0	1/1
20:259818	["G","A"]	0/0	0/0	0/1
20:261059	["C","T"]	0/0	0/0	0/1
20:285148	["CA","C"]	0/0	0/0	0/1

showing top 10 rows

Percentage of total: 1.29%

Find known_sites where parents are both heterozygous

parents_het = mt.filter_rows((hl.literal([True, True]) == hl.agg.collect(mt.GT.is_het())[0:2]) == True)

parents_het.GT.show()
print(f"Percentage of total: {round(parents_het.count()[0]/mt.count()[0] * 100, 2)}%")


		'HG00731'	'HG00732'	'HG00733'
locus	alleles	GT	GT	GT
locus<GRCh37>	array<str>	call	call	call
20:68749	["T","C"]	0/1	0/1	1/1
20:69094	["G","A"]	0/1	0/1	0/1
20:69506	["G","GACACAC","GACAC"]	0/1	1/2	0/1
20:76962	["T","C"]	0/1	0/1	0/0
20:77965	["G","GT","GTT"]	0/1	0/2	0/0
20:82012	["T","C"]	0/1	0/1	0/0
20:87416	["A","C"]	0/1	0/1	0/1
20:106703	["C","CT","CTTT"]	0/1	1/2	1/1
20:126529	["G","A"]	0/1	0/1	0/0
20:126600	["C","A"]	0/1	0/1	0/0

showing top 10 rows

Percentage of total: 17.52%

Sites where child is homozygous variant

child_hom_var = mt.filter_rows((hl.agg.collect(mt.GT.is_hom_var())[2]==True) == True)

child_hom_var.GT.show()
print(f"Percentage of total: {round(child_hom_var.count()[0]/mt.count()[0] * 100, 2)}%")


		'HG00731'	'HG00732'	'HG00733'
locus	alleles	GT	GT	GT
locus<GRCh37>	array<str>	call	call	call
20:65900	["G","A"]	1/1	1/1	1/1
20:66370	["G","A"]	1/1	1/1	1/1
20:67500	["T","TTGGTATCTAG"]	1/1	1/1	1/1
20:68749	["T","C"]	0/1	0/1	1/1
20:80655	["A","G"]	1/1	1/1	1/1
20:106703	["C","CT","CTTT"]	0/1	1/2	1/1
20:114916	["G","A"]	0/0	0/0	1/1
20:127952	["C","T"]	0/1	0/1	1/1
20:128863	["G","C"]	0/1	0/1	1/1
20:129063	["A","G"]	0/1	0/1	1/1

showing top 10 rows

Percentage of total: 27.36%

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
examples		examples
pipeline_mds		pipeline_mds
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Reads to Analysis Ready Variants

Overview

Methodology

Obtaining data

Executing GATK steps in docker

Scalability

Example Pipeline as Markdown

Helper classes for automating the pipeline with Python

Python package installation

Prerequisites

Setup environment

Python Getting Started

Perform variant discovery to obtain analysis-ready variants

Curate a genotype dataset and filter low quality variants

Load the cohort VCF file into Hail as a MatrixTable

Perform analysis

Filter missing rows

Find known_sites where parents are both homozygous reference

Find known_sites where parents are both heterozygous

Sites where child is homozygous variant

About

Releases

Packages

Languages

lajd/reads2variants

Folders and files

Latest commit

History

Repository files navigation

Reads to Analysis Ready Variants

Overview

Methodology

Obtaining data

Executing GATK steps in docker

Scalability

Example Pipeline as Markdown

Helper classes for automating the pipeline with Python

Python package installation

Prerequisites

Setup environment

Python Getting Started

Perform variant discovery to obtain analysis-ready variants

Curate a genotype dataset and filter low quality variants

Load the cohort VCF file into Hail as a MatrixTable

Perform analysis

Filter missing rows

Find known_sites where parents are both homozygous reference

Find known_sites where parents are both heterozygous

Sites where child is homozygous variant

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages