VariantCallingFJD

This pipeline was developed to automate the customized genomic analyses that the Bioinformatics Unit carries out for the Department of Genetics and Genomics of the Instituto de Investigación Sanitaria Fundación Jiménez Díaz (IIS-FJD). It is designed to run on a Slurm Workload Manager system, which is an "open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters".

This pipeline has been developed by the Translational Bioinformatics Lab at the IIS-FJD.

Developers

Main developers

Gonzalo Núñez Moreno
Raquel Romero Fernández
Lorena de la Fuente Lorente

Developers

Ionut-Florin Iancu
Pablo Mínguez Paniagua

Contact

Gonzalo Núñez Moreno ([email protected])

License

VariantCallingFJD source code is provided under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0). VariantCallingFJD includes several third-party packages provided under other open source licenses; please check them for additional details.

Dependencies

Job scheduler:

Slurm Workload Manager: This pipeline has been developed to run on a Slurm Workload Manager system. We use the sbatch command to submit jobs to the queue. We also have the different programs installed as separate modules to prevent incompatibilities.
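As an illustration of the sbatch-based submission described above, the following is a minimal sketch and not the pipeline's actual code. The function names, the log file naming and the example task path are assumptions made for this example; only standard sbatch options (--account, --partition, --job-name, --output, --error, --dependency) are used.

```python
import os
import subprocess


def build_sbatch_command(script, account, partition, job_name, log_dir, dependency=None):
    """Build the argument list for one sbatch submission (hypothetical helper)."""
    cmd = [
        "sbatch",
        "--account=" + account,
        "--partition=" + partition,
        "--job-name=" + job_name,
        # One .out and one .err file per job, as kept in the logfiles directory.
        "--output=" + os.path.join(log_dir, job_name + ".out"),
        "--error=" + os.path.join(log_dir, job_name + ".err"),
    ]
    if dependency is not None:
        # Start this job only after the given job ID has finished successfully.
        cmd.append("--dependency=afterok:" + str(dependency))
    cmd.append(script)
    return cmd


def parse_job_id(sbatch_output):
    """Extract the job ID, assuming the default message 'Submitted batch job <id>'."""
    if isinstance(sbatch_output, bytes):
        sbatch_output = sbatch_output.decode()
    return sbatch_output.split()[3]


def submit(cmd):
    """Run sbatch (requires a Slurm installation) and return the new job ID."""
    return parse_job_id(subprocess.check_output(cmd))
```

For example, build_sbatch_command("tasks/Mosdepth.sh", "my_account", "my_partition", "mosdepth", "logfiles") returns the command list for a single job; passing the ID returned by submit as the dependency of a later job chains the two.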

Programming languages:

Python v2.7.15: All Python scripts were developed using Python v2.7.15. The task "Mosdepth.sh" uses Python v3.6.12, loaded through the module miniconda/3.6, to run mosdepth.
Perl v5.28.0
R v3.5.0

Bioinformatic tools:

bwa v0.7.17
samtools v1.9
picard v2.18.9
gatk v4.2.0
mosdepth 0.2.5
bedtools v2.27.0
bcftools v1.3
annotsv v2.2
PLINK v1.90b6.9 64-bit (4 Mar 2019)
BaseSpaceCLI v1.0.0
bscp v0.6.1.337

R libraries: (these versions are the ones tested)

dplyr v0.8.3
optparse v1.6.6
stringr v1.4.0
CODEX2 v1.3.0
panelcn.mops v1.4.0
cn.mops v1.28.0
ExomeDepth v1.1.15
GenomicRanges v1.34.0

Python libraries

csv
argparse
os
subprocess
sys
re
glob
numpy
pandas
datetime
shutil
ConfigParser
string
json
itertools
operator
collections
time
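A quick way to confirm that the Python libraries listed above are available in the loaded module is to try importing each one. The sketch below is illustrative only; the function name missing_modules is an assumption, and the module names are copied from the list above (ConfigParser is the Python 2 name of that module, so it is expected to be reported as missing under Python 3).

```python
import importlib

# Module names as listed in the "Python libraries" section.
REQUIRED_MODULES = [
    "csv", "argparse", "os", "subprocess", "sys", "re", "glob", "numpy",
    "pandas", "datetime", "shutil", "ConfigParser", "string", "json",
    "itertools", "operator", "collections", "time",
]


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    not_found = missing_modules(REQUIRED_MODULES)
    if not_found:
        print("Missing Python modules: " + ", ".join(not_found))
    else:
        print("All listed Python modules can be imported.")
```

Only numpy and pandas are third-party packages; the rest belong to the Python 2.7 standard library.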

Installation

Install all the dependencies.

1.1. We recommend installing all the above bioinformatics programs and programming languages as modules, as follows:
- Python v2.7.15 --> python/2.7.15
- Perl v5.28.0 --> perl/5.28.0
- R v3.5.0 --> R/R
- bwa v0.7.17 --> bwa/0.7.17
- samtools v1.9 --> samtools/1.9
- picard v2.18.9 --> picard/2.18.9
- gatk v4.2.0 --> gatk/4.2.0
- mosdepth 0.2.5 --> miniconda/3.6
- bedtools v2.27.0 --> bedtools/2.27.0
- bcftools v1.3 --> bcftools/1.3
- annotsv v2.2 --> annotsv/2.2
- vep release 103 --> vep/release103
1.2. Install all Python and R packages

1.3. Download the following binaries and executables:
- The BaseSpace Sequence Hub CLI tool suite binary (optional)
- BaseSpace-copy binary (optional)
- GATK executable jar file
- Picard executable jar file
- PLINK binary (optional)

Clone this repository using:

git clone https://github.com/TBLabFJD/VariantCallingFJD.git

Getting Started

(Optional) BaseSpace credentials setup. To be able to use the -b/--basespace option to automatically download samples from BaseSpace, run this command:

BaseSpace_Sequence_Hub_CLI_tool_suite_binary_path/bs authenticate

This command generates a URL that you copy into a web browser to log in to BaseSpace. After you log in, it creates a .basespace file with the credentials in your directory, so that the pipeline can access the data automatically when you run it.

Configuration file. The pipeline.config file must be filled in with the appropriate information (some entries are optional or required depending on the selected parameters):
- Slurm credentials
  - account
  - partition
- Files
  - Reference genome (in FASTA format)
  - Validated VCF with known sites of common variation. We use a VCF from The HapMap project. Available at: https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/hapmap_3.3.hg38.vcf.gz
  - Validated VCF with known sites of common variation. We use a VCF from The 1000 Genomes Project. Available at: https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_omni2.5.hg38.vcf.gz
  - Index of the reference genome (.fasta.fai) to be used by CoNVaDING (CNV calling algorithm). This file should be a two-column file with the names of the chromosomes (without the "chr" prefix) and their sizes (see the CoNVaDING documentation for more information).
- Directories
  - Temporary directory used for merging FASTQ files.
  - (Optional) Directory to which a copy of the VCF file with the SNVs and INDELs is moved for database creation.
  - (Optional) Directory to which a copy of the BED file with the covered regions is moved for database creation.
- Binaries & Executables
  - The BaseSpace Sequence Hub CLI tool suite binary
  - BaseSpace-copy binary
  - GATK executable (.jar)
  - PICARD executable (.jar)
  - conda binary (the same binary as the one installed for the miniconda module)
  - PLINK binary
  - ANNOTSV executable (.tcl)
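The two-column chromosome index described under Files can be derived from a standard samtools .fai index, whose first two tab-separated columns are the sequence name and its length. The following is a minimal sketch under that assumption; the function names are hypothetical, and the exact layout CoNVaDING expects (for example, the column separator) should be checked against its documentation.

```python
def fai_to_chrom_sizes(fai_lines):
    """Return (chromosome, size) pairs from .fai lines, dropping the 'chr' prefix."""
    rows = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name = fields[0]
        if name.startswith("chr"):
            name = name[3:]
        rows.append((name, int(fields[1])))
    return rows


def write_chrom_sizes(fai_path, out_path):
    """Write the two-column, tab-separated file (separator is an assumption)."""
    with open(fai_path) as fai:
        rows = fai_to_chrom_sizes(fai)
    with open(out_path, "w") as out:
        for name, size in rows:
            out.write("%s\t%d\n" % (name, size))
```

Calling write_chrom_sizes("reference.fasta.fai", "reference.chrom_sizes.txt") (both file names are placeholders) produces a file whose path can then be entered in the configuration file.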

Run the pipeline as follows:

module load python/2.7.15
module load perl/5.28.0

pipeline="path_to_the_downloaded_git_repository/variantDiscoveryFJD_panelWES.py"
input_path="input_directory_with_fastq_files"
output_path="output_directory"
bed_path="path_to_bed_file"

python ${pipeline} -i ${input_path} -o ${output_path} -p ${bed_path} -a all -A

Output

The output directory has the following structure:

bams - directory containing the mapped samples
cnvs - directory containing CNV calling and annotation results
- CODEX2 - directory containing CODEX2 CNV calling results
- CoNVaDING - directory containing CoNVaDING CNV calling results
- ExomeDepth - directory containing ExomeDepth CNV calling results
- Panelcn.MOPS - directory containing Panelcn.MOPS CNV calling results
- run_name.combinedAnnotated.tsv - file containing the annotated CNVs
- run_name.combined.txt - file containing the combined results of the 4 CNV calling algorithms (in TSV format)
- run_name.combined.vcf - file containing the combined results of the 4 CNV calling algorithms (in VCF format)
- run_name.extended.bed - modified BED file used by the CNV calling algorithms
- run_name.final.txt - file containing the final report with the annotated CNVs
- run_name.final.genelist.txt - file containing the final report with the annotated CNVs filtered by the provided gene list
genotyping - directory containing temporary files from the variant calling
logfiles - directory containing the .err and .out logfiles of all jobs
plink - directory containing PLINK's output with information about homozygosity
qc - directory containing quality (coverage) information for SNV and CNV calling
snvs - directory containing SNV and INDEL calling and annotation results
- sample_name.annotated.MAFfiltered.pvm.txt - file containing the final report with the PASS annotated SNVs and INDELs filtered by minor allele frequency (in TSV format)
- sample_name.annotated.MAFfiltered.txt - file containing the PASS annotated SNVs and INDELs filtered by minor allele frequency (in TSV format)
- sample_name.annotated.MAFfiltered.vcf - vcf containing the PASS annotated SNVs and INDELs filtered by minor allele frequency (in VCF format)
- sample_name.annotated.vcf - vcf containing the PASS annotated SNVs and INDELs (in VCF format)
- sample_name.final.vcf - file containing PASS SNVs and INDELs (in VCF format)
- sample_name.gatkLabeled.vcf - vcf containing the SNVs and INDELs (in VCF format)
- sample_name.gatkLabeled.vcf.idx - index of the sample_name.gatkLabeled.vcf file
software_run_name.txt - file containing the software used in each section of the analysis
sophia_clinical_exome_ces_annotated_run_name_10bp.bed - filtered bedfile used in the pipeline
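Given the layout above, the final per-sample SNV/INDEL reports can be collected with a short script. This is a minimal sketch, assuming the reports sit directly inside the snvs directory and follow the sample_name.annotated.MAFfiltered.pvm.txt naming shown above; the function name is hypothetical.

```python
import glob
import os

REPORT_SUFFIX = ".annotated.MAFfiltered.pvm.txt"


def list_final_snv_reports(output_dir):
    """Map each sample name to the path of its final SNV/INDEL report."""
    pattern = os.path.join(output_dir, "snvs", "*" + REPORT_SUFFIX)
    reports = {}
    for path in sorted(glob.glob(pattern)):
        # Strip the fixed suffix from the file name to recover the sample name.
        sample = os.path.basename(path)[:-len(REPORT_SUFFIX)]
        reports[sample] = path
    return reports
```

The returned dictionary can be passed, for instance, to pandas.read_csv with sep="\t" to load each report, since the reports are in TSV format.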
