MotifFinder: Vignette (Tutorial)

Daniel Wells 2019-03-01

Simulate Data

We create 300 DNA sequences of length 200 with the motif “ATgTT_GtCC” around the center of 50% of the sequences.

library(MotifFinder)
set.seed(42)
simulated_sequences <- simulate_sequences(motif="ATgTT_GtCC")

## [1] "motif start position is 96"

str(simulated_sequences)

## List of 6
##  $ seqs          : Named chr [1:300] "CCTCTTTTGTATGGCCAAGGGGCGTAAAACTTAATCTCCGCTTTGCAGACTAGTGCGATTCGTGCGGCCGCTGGACCCAAGCATCCAGGCTCCCGGATTCCCTTACAAAAA"| __truncated__ "TCGGACTCCGTTGTGCAAACGTTTGCTGATAAGCCCGACCAAGCACTTCAGGAGTTAGTAAGACACATTGGCGAGCATAAGGCAGTTAGTAATAGGGACGAACATAAACGT"| __truncated__ "AGGTCACTATAGAGAATCTAATCCGATTGGGGTACATAATGCTCTTACGATTGGTTTGCCTACAAGTATCTACCTTGCACAGCACTCAAACTCTTACTAAAAGCACGACTA"| __truncated__ "TGTAACTCAGTACATCTACCGGGGTCTATACAAGCCTTCGGGTCATTATTAGGTAGGCCGGCCAATGGGACAAAGAATCTCTAACGCAGGCAATCGCTCGAGGCCTGGTCC"| __truncated__ ...
##   ..- attr(*, "names")= chr [1:300] "1" "2" "3" "4" ...
##  $ whichreg      : int [1:150] 226 94 252 273 147 117 210 220 21 251 ...
##  $ whichpos_s1   : int [1:150] 92 97 90 105 98 95 105 86 106 92 ...
##  $ whichpos      : num [1:150] 92 95 90 87 98 97 105 86 86 100 ...
##  $ whichrevstrand: int [1:150] 21 135 77 17 291 253 280 20 109 284 ...
##  $ truemotif     : chr "ATgTT_GtCC"

Run MotifFinder

We run MotifFinder with a length slightly shorter than the known motif length.

motif_found <- findamotif(simulated_sequences$seqs, len=7, stranded_prior = T, seed = 42)

Plot the Motif(s) Found

We can see that we have recovered the motif.

library(ggseqlogo)
ggseqlogo(get_PWM(motif_found))

ggseqlogo(get_PWM(motif_found, complement=TRUE))

Where is the motif

We can also check the location of the motifs found as well as the location it thinks the motifs are.

plot_motif_location(motif_found)

plot(motif_found$prior)

Check Convergence

It’s a good idea to plot the inferred probability the motif being in each sequence (alpha parameter) for each iteration to check convergence.

plot(motif_found$alphas, ylim=c(0,1))

Other helper functions include downloading PWMs from the Jaspar or Hocomoco databases.

Alx1 <- download_PWM("ALX1_MOUSE.H11MO.0.B")
str(Alx1)

## List of 2
##  $ pwm : num [1:12, 1:4] 0.3795 0.1211 0.6119 0.6785 0.0768 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:4] "A" "C" "G" "T"
##  $ name: chr "ALX1_MOUSE.H11MO.0.B"

Extracting DNA from the genome

To use real DNA instead of simulating sequences you will need to download the genome and extract regions that are specified using a BED file.

wget ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
gunzip motifs/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
samtools faidx motifs/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa

bedtools getfasta -s -fi Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -bed example_locations.bed -name > example_locations.fasta

We can then load these sequences using the helper function load_sequences

genomic_sequences <- load_sequences("example_locations.fasta")

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vignette.md

vignette.md

MotifFinder: Vignette (Tutorial)

Simulate Data

Run MotifFinder

Plot the Motif(s) Found

Where is the motif

Check Convergence

Extracting DNA from the genome

Files

vignette.md

Latest commit

History

vignette.md

File metadata and controls

MotifFinder: Vignette (Tutorial)

Simulate Data

Run MotifFinder

Plot the Motif(s) Found

Where is the motif

Check Convergence

Extracting DNA from the genome