CALDERA (Calling disease-related genes)

General overview

CALDERA is a tool for identifying causal genes in GWAS loci. Many alternative tools (e.g., L2G, Ei, FLAMES) use complex XGBoost models with >45 predictive features. CALDERA achieves similar or better performance despite using a simple LASSO regression model with only 12 features. This makes it much easier to understand why CALDERA chooses the genes that it does. Unlike other methods, CALDERA attempts to account for potential biases in its ground truth training data. Full details can be found in the preprint.

This repo contains everything needed to run CALDERA given: 1) a PoPS output file, 2) a MAGMA output file, and 3) a file containing credible set information.

How to run CALDERA

First, clone this repo:

git clone https://github.com/kheilbron/caldera.git

Second, open an R session and source the z_caldera.R script:

caldera_path <- "your_local_path_to_the_caldera_repo"
source( file.path( caldera_path, "z_caldera.R" ) )

Finally, run CALDERA by providing the three required input files (described in more detail in the next section):

pops_file  <- "path_to_your_PoPS_output_file"
magma_file <- "path_to_your_MAGMA_output_file"
cs_file    <- "path_to_your_credible_set_file"
caldera( pops_file    = pops_file,
         magma_file   = magma_file,
         cs_file      = cs_file,
         caldera_path = caldera_path )
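
If the result of the call above is assigned to an object, a common next step is to pull out the highest-scoring gene in each locus. The following is a minimal sketch, not part of CALDERA itself. It uses a mock table with invented values and only three of the documented output columns (locus, gene, caldera); the object names res and top_genes are assumptions made for illustration.

```r
# Mock stand-in for the table returned by caldera(); all values are invented.
# The real output contains more columns (see the "CALDERA output" section).
res <- data.frame(
  locus   = c("locus1", "locus1", "locus2", "locus2", "locus2"),
  gene    = c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"),
  caldera = c(0.72, 0.10, 0.05, 0.40, 0.31),
  stringsAsFactors = FALSE
)

# Sort by locus, then by decreasing CALDERA probability
res <- res[order(res$locus, -res$caldera), ]

# Keep the first (highest-probability) row within each locus
top_genes <- res[!duplicated(res$locus), ]
print(top_genes)
```

The same two indexing steps should also work on a data.table, which is the class that caldera() is documented to return.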

CALDERA input files

File	Description
pops_file	PoPS output file (pops.preds). Must contain columns: ENSGID and PoPS_Score.
magma_file	MAGMA output file (.genes.out). Must contain columns: ZSTAT and GENE (containing ENSGID).
cs_file	A (comma/tab/etc. delimited) file containing all credible set (CS) variants across all loci. Used to define loci, the genes therein, and coding PIPs for these genes. One CS variant per row. Must contain columns: "chr" (chromosome), "bp" (GRCh37 position), and "pip" (posterior inclusion probability). Must also contain a "locus" column identifying which credible set each variant belongs to (can be an integer or a string). Optionally, you can include a "c_gene" column containing the ENSGID of the gene whose coding sequence is affected by the given variant (and NA if the variant does not affect any coding sequences). Alternatively, a "SNP" column containing rsIDs can be provided and variants will automatically be annotated using a list of ~76k UK Biobank coding variants. If "SNP" is not provided, variants will be automatically annotated based on "chr" and "bp".
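
To make the credible set format concrete, here is a small sketch that builds a toy table with the required columns, checks it, and writes it as a tab-delimited file. The helper check_cs_columns() is hypothetical and not part of CALDERA, and every value (including the ENSGID) is a placeholder.

```r
# Toy credible set table; all values are made up for illustration
cs <- data.frame(
  locus  = c("locus1", "locus1", "locus2"),
  chr    = c(1L, 1L, 2L),
  bp     = c(1000500L, 1002300L, 25000750L),  # GRCh37 positions (placeholders)
  pip    = c(0.80, 0.15, 0.95),
  c_gene = c("ENSG00000000001", NA, NA),      # optional column; placeholder ENSGID
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of CALDERA): verify the documented required columns
check_cs_columns <- function(df) {
  required <- c("chr", "bp", "pip", "locus")
  missing  <- setdiff(required, names(df))
  if (length(missing) > 0) {
    stop("Missing required columns: ", paste(missing, collapse = ", "))
  }
  # PIPs are probabilities, so they must lie in [0, 1]
  if (any(df$pip < 0 | df$pip > 1, na.rm = TRUE)) {
    stop("pip values must lie between 0 and 1")
  }
  invisible(TRUE)
}

check_cs_columns(cs)

# Write a tab-delimited file that could then be passed as cs_file
cs_file <- tempfile(fileext = ".tsv")
write.table(cs, cs_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Running a check like this before calling caldera() can surface a missing column early, rather than partway through the run.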

CALDERA output

The caldera() R function returns a data.table containing the following columns:

Column	Description
locus	The locus identifier, taken from the input credible set file
locus_pos	A string containing the chromosome, left boundary, and right boundary of the locus (each separated by underscores)
gene	Gene name
caldera	CALDERA-predicted probability that this gene is causal for the GWAS trait
n_genes	Number of protein-coding genes in the locus
dist	Distance between the lead GWAS variant and any part of the gene (bp)
dist_gene_rel	Relative distance (see footnote for an explanation of "relative")
pops_glo	PoPS score for the gene
pops_rel	Relative PoPS score for the gene
magma_glo	MAGMA z-score for the gene
magma_rel	Relative MAGMA z-score for the gene
coding_glo	Sum of posterior inclusion probabilities of coding variants in this credible set affecting this gene
coding_rel	Relative coding PIP
ensgid	ENSEMBL gene ID

Relative scores are computed as the focal gene's score minus the best score in the locus. Note that distances are transformed prior to subtraction using -log10(x+1000), where x is the distance in bp.
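
The relative-score definition can be illustrated with a toy locus. This is a sketch under stated assumptions, not the code used in z_caldera.R: it assumes "best" means the maximum value within the locus (for distance, the maximum of the transformed value, i.e. the closest gene), and the helper relative_to_best() and all values are invented for illustration.

```r
# Toy locus with invented values
locus_genes <- data.frame(
  locus    = c("locus1", "locus1", "locus1"),
  gene     = c("GENE_A", "GENE_B", "GENE_C"),
  pops_glo = c(1.2, 0.3, -0.5),
  dist     = c(0, 9000, 99000),   # distance in bp
  stringsAsFactors = FALSE
)

# Hypothetical helper: focal score minus the best (assumed: maximum) score in the locus
relative_to_best <- function(score, locus) {
  score - ave(score, locus, FUN = max)
}

locus_genes$pops_rel <- relative_to_best(locus_genes$pops_glo, locus_genes$locus)

# Distances are transformed with -log10(x + 1000) before subtraction,
# so closer genes get larger (less negative) transformed values
dist_trans <- -log10(locus_genes$dist + 1000)
locus_genes$dist_gene_rel <- relative_to_best(dist_trans, locus_genes$locus)

print(locus_genes)
```

Under this convention the best gene in a locus always has a relative score of 0, and every other gene has a negative value.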

kheilbron/caldera
