Skip to content

Transcript assembly tool using multiple change-point inference to improve 3'UTR annotation

Notifications You must be signed in to change notification settings

shenkers/isoscm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 

Repository files navigation

#Support

If you encounter any difficulties running IsoSCM, please create an issue on our github page. To facilitate resolution, issues should adhere to the spirit of the minimal complete verifiable example and include information about what version of IsoSCM you are using, program parameters, complete trace of log/error messages.

#Overview

Major applications of RNA-seq data include studies of how the transcriptome is modulated at the levels of gene expression and RNA processing, and how these events are related to cellular identity, environmental condition, and/or disease status. While many excellent tools have been developed to analyze RNA-seq data, these generally have limited efficacy for annotating 3' UTRs. Existing assembly strategies often fragment long 3' UTRs ( illustrated below, left), and importantly, no de-novo (reference annotation independent) transcript assembly of the algorithms in popular use can identify tandem 3' UTR isoforms generated by alternative cleavage and polyadenylation (APA) using only RNA-seq data (below, right). Consequently, it is often not possible to identify patterns of differential APA using existing transcript assembly tools.

motivation image

To address these limitations, we have developed Isoform Structural Change Model (IsoSCM), a new method for transcript assembly that incorporates change-point analysis to improve the 3' UTR annotation process. If we assume that sequenced reads are distributed proportional to isoform abundance, the boundaries of transcription will be marked by a change in the level of coverage. In instances where short and long isoforms are co-expressed there can be a “step-like” pattern of coverage at the boundary of the nested exon model. To identify terminal exon boundaries, we thus seek change-points that mark transitions in RNA-seq coverage, as illustrated below.

changepoint image

Real RNA-seq data contains biases that can cause the sampling of reads across the transcript body to deviate from a uniform distribution. These biases typically cause short segments of the transcript to be sequenced at a lower frequency than the rest of the transcript, resulting in a local drop in the level of coverage. The conventional (unconstrained) change-point detection procedure identifies these local drops in coverage as distinct segments. However, in the context of annotating transcripts we wish to disregard these local aberrations as they do not correspond to terminal exon boundaries. In order to minimize the effect of local biases on transcript model inference we introduce additional constraints to distinguish localized changes from change-points that mark a sustained decrease in the level of coverage. The difference between constrained and unconstrained formulations is contrasted using a toy example below.

constrained changepoint image

We estimate relative usage of a polyadenylation site by considering the read density in the segments up and downstream of a change-point, such that

usage = 1 - ( covdn / covup )

where covdn/up are estimates of read density downstream/upstream of a change point. In addition to identifying change-points that mark the boundaries of 3'UTRs, IsoSCM can also be used to estimate the relative usage of 3'UTR isoforms, and identify sites with differential usage between two conditions. Conveniently, the change-point detection framework used by IsoSCM can also be used to identify differential usage of tandem 3' UTRs between two samples by performing a joint segmentation, as illustrated below.

differential image

The resulting joint segmentation can be used to quantitate differential usage of tandem 3'UTR isoforms. For the example shown above we can calculate covup and covdn, the level of coverage in the segments up and downstream of the identified change-point, and use these quantities to define differential site usage in each tissue:

usagebrain = 1 - ( covdownstream,brain / covupstream,brain ) ≅ 0.68

usageliver = 1 - ( covdownstream,liver / covupstream,liver ) ≅ 0.91

Δusagebrain-liver = usagebrain-usageliver ≅ -0.23

As for the example above, polyadenylation sites that are differentially used between a pair of conditions will be marked by a change-point that has a non-zero Δusage.

The functions of IsoSCM described above are performed by the assemble, enumerate, compare commands, detailed below.

#Running IsoSCM

The latest version of the IsoSCM executable jar can be downloaded from here.

##assemble command

java -Xmx2048m -jar IsoSCM.jar assemble [optional arguments]

Taking a BAM file as input, the assemble command will construct a splice-graph of exons, and segment terminal exons using the constrained change-point detection procedure. The minimal arguments to run the assemble command require the BAM file, a base-name for generated output files, and the strandedness of the RNA-seq data. IsoSCM is compatible BAM files generated by many short read mappers, and only requires that spliced reads include the XS flag specifying the inferred strandedness of the spliced read. IsoSCM is a Java application, and requires a JRE version >=1.6 to be installed. We recommend running IsoSCM with at least 2GB of memory via the java -Xmx argument, although should be increased as necessary if OutOfMemoryError exceptions are encountered.

Required arguments:

          -bam                    BAM file to be assembled
          
          -base                   The output basename. This will be used as the
                                  prefix for output files.
                                  
          -s                      the strandedness, can be "reverse_forward" or
                                  "unstranded"

IsoSCM has a number of optional arguments with default values pertaining to either splice-graph assembly or change-point detection.

Assembly arguments:

          -coverage               whether or not to calculate coverage of
                                  generated models
                                  Default: true
                                  
          -dir                    The output directory
                                  Default: ./isoscm
                                  
          -insert_size_quantile   If the data is paired, and this value is specified
                                  IsoSCM will attempt to scaffold gaps between segments
                                  spanned by mates separated by at most this quantile
                                  
          -internal               whether or not to identify change-points within 
                                  internal (non-terminal) exons
                                  Default: false
                                  
          -jnct_alpha             The significance level for binomial test to
                                  accept splice junction
                                  Default: 0.05
                                  
          -merge_radius           gaps smaller than this distance (nt) will be merged
                                  Default: 100
                                  
          -merge_segments         Optional: bed file of regions to merge across

Segmentation arguments:

          -min_fold               the minimum fold change between neighboring
                                  segments expressed as a ratio of coverage downstream divided by
                                  coverage upstream. acceptable values in the range [0.0-1.0]
                                  Default: 0.5
                                  
          -min_terminal           terminal segments are extended by
                                  this amount before segmentation
                                  Default: 300
                                  
          -nb_r                   r parameter of the NB(p,r) prior on RNA-seq coverage within a segment.
                                  The p parameter is not specified as it is integrated over.
                                  Default: 10
                                  
          -segment_p              p parameter of NB(p,r) prior on segment length (between change-points)
                                  Default: 0.95
                                  
          -segment_r              r parameter of NB(p,r) prior on segment length (between change-points)
                                  Default: 10
                                  
          -t                      the threshold above which a segment is
                                  considered expressed
                                  Default: 1
                                  
          -w                      the width of window to be used by the SCM
                                  Default: 20

If the BAM file to process is called brain.bam and the sample was sequenced with the stranded dUTP protocol, the minimal call to IsoSCM assemble would look like

java -Xmx2048m -jar IsoSCM.jar assemble -bam brain.bam -base brain -s reverse_forward

##enumerate command

java -Xmx2048m -jar IsoSCM.jar enumerate [arguments]

Taking a splice-graph generated by the assemble command as input, the enumerate command will list transcript isoforms that are compatible with the splice graph. Since the splice-graph may imply a large number of isoforms, there is a parameter to specify the maximum number of isoforms that will be reported for a single transcript. Loci with more than this number of isoforms are reported as "skipped". This function is not required for assessing differential 3'UTR isoform usage, and is provided as a means to visualize assembled models.

Required arguments:

         -max_isoforms            loci with more than this number of isoforms will be
                                  skipped
         -x                       configuration XML file from the assembly step

##compare command

java -Xmx2048m -jar IsoSCM.jar compare [arguments]

Taking the splice-graphs generated by the assemble for two samples as input, the compare command performs joint segmentation of the terminal exons of coexpressed transcripts, and reports the differential usage of each identified change-point.

Required arguments:

          -base                   The output basename. This will be used as the
                                  prefix for output files.
          -x1                     XML configuration file for the assembly step from sample 1
          -x2                     XML configuration file for the assembly step from sample 2

Optional arguments:

          -dir                    The output directory
                                  Default: ./compare

#IsoSCM Output

##Assemble command

Depending on the command line arguments used, the output of the assemble command will be found in

[assemble-dir]/[assemble-basename].gtf

These two files are GTF format files representing assembled exons and splice junctions between them. The first file ([dir]/[basename].gtf) will contain all assembled loci that are inferred to contain at least one splice junction. Each line of this file represents either an exon or a splice-juction, as specified in the 3rd column. Each feature in assembly GTF file will also have attributes for a "locus_id". All exons that either overlap, or are connected by splice junctions are assigned a common locus_id. For convenience, exons are assigned an additional attribute "type" which represents the relative order of that exon in the transcript model, which can be one of "3p_exon", "5p_exon", or "internal_exon". It's important to keep in mind that the transcript model may be fragmented, especially in the case of low-coverage genes. If the assembled transcript model is fragmented these assignments of 3p_exon or 5p_exon will be incorrect. Conversely, if two genes are very proximally located they may instead be innapropriately merged by the assembly process.

Assembled transcripts that do not overlap any spliced reads are reported as transcribed segments in a second file:

[assemble-dir]/[assemble-basename].unspliced.gtf

The format of this file is the same as the GTF representing spliced transcript models. An important thing to keep in mind when analyzing reads generated using unstranded sequencing protocols is that the strandedness of assembled transcripts is inferred from reads containing splice junctions. To represent this ambiguity, the transcribed segments reported in this file will have with a strand of '.' when assembling unstranded reads.

Finally, if the assemble command is run with the -coverage true flag additional files:

[assemble-dir]/[assemble-basename].coverage.gtf [assemble-dir]/[assemble-basename].unspliced.coverage.gtf

will be generated. These files will be the same as the two other GTF files, except that each exon will have an associated "coverage" value, representing a rough estimate of read density over that exon. The coverage represents the median read coverage over each segment of the exon. This quantity is not intended to be used a rigorous metric isoform abundance, but rather as an aid to quickly identify exon models supported by less evidence.

##Compare command

The output of the compare command is reported in tabular format to the file:

[compare-dir]/[compare-basename].txt

Each row in this file corresponds to a change-point, with columns:

samples
the [basename]'s of the samples being compared.
locus_id
locus id identifying the transcript the change-point occurs in from the splice-graph.
changepoint
genomic coordinate of the change-point
confidence
a confidence score representing the likelihood that a change-point occurs at the given position, compared to a null model that no change-point at this position
log_odds
log odds representation of the confidence value
upstream_segment
coordinates of the segment upstream of the change-point
downstream_segment
coordinates of the segment downstream of the change-point
locus
the coordinates of the exon that was segmented to identify this change-point
strand
strand on which this change-point is inferred to have occured
upstream_cov
mean read density in the segment upstream of the change-point (reported in reads/base). Values are listed in the same order as sample names in the first column.
downstream_cov
mean read density in the segment downstream of the change-point (reported in reads/base). Values are listed in the same order as sample names in the first column.
site_usage
estimated site usage in each sample (see Overview). Values are listed in the same order as sample names in the first column.
differential_usage
the difference in estimated site usage (see Overview). This will correspond to site_usage in the first sample minus the site_usage in the second sample.

##Building from source

IsoSCM uses maven as a build system. To build IsoSCM change into the root directory and type:

mvn clean compile assembly:single

The executable .jar can be found in the sub-directory target/IsoSCM-[version].jar

Analytics

About

Transcript assembly tool using multiple change-point inference to improve 3'UTR annotation

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages