berilerdogdu/SPIT

SPIT

A statistical tool that quantifies the heterogeneity in transcript usage within a population and identifies predominant subgroups along with their distinctive sets of differential transcript usage (DTU) events.

Why use SPIT?

Detecting DTU events for single-gene genetic traits is relatively uncomplicated; however, the heterogeneity of populations with complex diseases presents an intricate challenge due to the presence of diverse causal events and undetermined subtypes. SPIT can detect DTU events exclusive to subgroups as well as DTU events shared amongst all case samples. Downstream of DTU analysis, SPIT applies hierarchical clustering to the detected DTU events to provide insight into potentially hierarchical subgrouping patterns present in complex disease populations.

SPIT is equally effective on relatively homogeneous populations and is applicable to diverse scenarios, including simple genetic disorders, tissue-to-tissue comparisons and other types of DTU studies. SPIT consistently maintains low false discovery rates regardless of the level of dispersion in the datasets.

How to use SPIT?

SPIT is available as a PyPI package and can be installed by running:

pip install spit
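
To confirm that the installation succeeded in the active environment, a quick check along these lines may help. This is a minimal sketch that relies only on the Python standard library; it assumes the installed distribution is named "spit", as in the pip command above, and the helper name `installed_version` is hypothetical and not part of SPIT.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "spit" is assumed to be the distribution name, taken from the pip command.
    found = installed_version("spit")
    if found:
        print(f"spit version: {found}")
    else:
        print("spit is not installed in this environment")
```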

An extensive step-by-step guide that demonstrates the application of SPIT using a mock dataset is provided here.

Users can also directly upload their datasets into this Colab environment and easily run SPIT online.

Parameter-fitting with GNU Parallel

If you would like to optimize hyper-parameters based on the dispersion levels in your own dataset, you can do so using the package module "fit_parameters". This is a computationally expensive process and may take some time, so running it in parallel with GNU Parallel is an option. To run the parameter-fitting module with GNU Parallel, clone this project and follow these steps:

  • Navigate into the "parameter_fitting" directory, and generate your 10 DTU simulations by running:
sh simulate_exps.sh -i [tx_counts_file] -m [tx2gene_file] -l [pheno_file]

Please note that your input files should follow the formatting requirements described in the Colab notebook. For detailed explanations of what these files should contain, please refer to the tutorial.

  • Run SPIT with combinations of b and k parameters to search for the optimal choice for your dataset:
sh run_SPIT_search_params.sh [number_of_threads] -m [tx2gene_file]

The maximum number of threads that can be used is equal to the number of simulated experiments (10). For example, if you would like to run with 10 threads, run:

sh run_SPIT_search_params.sh 10 -m [tx2gene_file]

  • Run the leave-one-out cross-validation (LOOCV) step to find the optimal parameters across all 10 experiments:
python LOOCV.py -m tx2gene_file.txt -P venns.pdf

The output PDF file (venns.pdf) will include the optimal parameters at each iteration of the LOOCV process along with corresponding true positive and false positive rates and F-scores.
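
For readers unfamiliar with LOOCV-based parameter selection, the idea can be sketched as follows. This is an illustration under stated assumptions, not the code in LOOCV.py: the function names, the toy counts, and the (b, k) values are hypothetical placeholders, and the F-score is taken to be the standard F1 score, which may differ from the variant SPIT reports. In each iteration one simulated experiment is held out, the (b, k) pair with the best mean F-score on the remaining experiments is chosen, and that pair is then scored on the held-out experiment.

```python
from statistics import mean


def f_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 score: harmonic mean of precision and recall (0.0 if no true positives)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def loocv_select(scores):
    """scores[experiment][(b, k)] holds the F-score of that pair on that experiment.

    Returns one (held_out, best_pair, held_out_score) tuple per experiment.
    """
    results = []
    for held_out in scores:
        train = [exp for exp in scores if exp != held_out]
        candidates = scores[held_out].keys()
        # Pick the pair with the highest mean F-score on the training experiments.
        best = max(candidates, key=lambda pair: mean(scores[e][pair] for e in train))
        results.append((held_out, best, scores[held_out][best]))
    return results


# Toy (tp, fp, fn) counts for three hypothetical experiments and two
# placeholder (b, k) pairs; these numbers are invented for illustration only.
toy_counts = {
    "exp1": {(1, 1): (8, 2, 2), (2, 2): (5, 1, 5)},
    "exp2": {(1, 1): (7, 3, 3), (2, 2): (6, 0, 4)},
    "exp3": {(1, 1): (9, 1, 1), (2, 2): (4, 2, 6)},
}
toy_scores = {
    exp: {pair: f_score(*counts) for pair, counts in pairs.items()}
    for exp, pairs in toy_counts.items()
}

if __name__ == "__main__":
    for held_out, best, score in loocv_select(toy_scores):
        print(f"held out {held_out}: best (b, k) = {best}, F-score = {score:.3f}")
```

Selecting parameters on the training experiments only, and scoring them on the held-out one, gives an estimate of how well the chosen pair generalizes to data it was not tuned on.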

If you use SPIT, please cite:

Erdogdu, B., Varabyou, A., Hicks, S.C., Salzberg, S.L. & Pertea, M. Detecting differential transcript usage in complex diseases with SPIT. bioRxiv, 2023.07.10.548289 (2023)

Please use this Google group to post your questions, comments, or bug reports.