A steganalysis-based approach for genome-wide identification of regulatory DNA sequence elements


Guandong Wang and Weixiong Zhang

 

Genome-wide identification of cis-acting elements, or transcription factor binding motifs (TFBMs), is a challenging problem. We approach it by viewing the regulatory regions of a genome as a stegoscript in which over-represented words, i.e., TFBMs, are embedded in a covertext. We describe the stegoscript with a statistical model consisting of a dictionary and a grammar, and progressively learn a series of such models, which yields an efficient genome-wide motif-finding algorithm called WordSpy. From the promoters of 645 distinct cell-cycle-related genes of S. cerevisiae, WordSpy identifies all known cell-cycle-related TFBMs and ranks them highly under two evaluation methods: a genome-wide Monte Carlo simulation and a gene expression coherence measure. We further apply the method to detect, de novo, putative cell-cycle-related TFBMs of A. thaliana. Several top-ranking motifs resemble the binding motifs of mitotic specific activation (MSA) and E2F transcription factors. WordSpy can also be applied to identify discriminative motifs: using the ChIP-chip data of Lee et al., we predict potential binding motifs of 113 known transcription factors of budding yeast.
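
To make the notion of an "over-represented word" concrete, the following is a minimal sketch, not the WordSpy algorithm itself. WordSpy learns a dictionary and a grammar; this sketch only counts fixed-length words and compares the count of one word with its counts in shuffled copies of the same sequences, loosely in the spirit of a Monte Carlo evaluation. The function names (`count_words`, `shuffle_seq`, `monte_carlo_z`), the shuffling null model, and the default of 200 trials are all assumptions made for illustration and do not come from the paper.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def count_words(seqs, k):
    """Count every overlapping word of length k across all sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def shuffle_seq(seq, rng):
    """Return a shuffled copy that keeps the single-nucleotide composition."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def monte_carlo_z(seqs, word, trials=200, seed=0):
    """Z-score of the observed count of `word` against shuffled controls.

    Assumption: shuffling each sequence is an adequate null model. A real
    analysis would likely use a richer background (e.g. a Markov model).
    """
    rng = random.Random(seed)
    k = len(word)
    observed = count_words(seqs, k)[word]
    null_counts = []
    for _ in range(trials):
        shuffled = [shuffle_seq(s, rng) for s in seqs]
        null_counts.append(count_words(shuffled, k)[word])
    mu = mean(null_counts)
    sigma = pstdev(null_counts)
    if sigma == 0:
        # Degenerate null distribution: report only the direction of the excess.
        return float("inf") if observed > mu else 0.0
    return (observed - mu) / sigma
```

A word planted once in each of a set of otherwise random toy "promoters" should receive a large positive z-score, whereas a typical background word should score near zero. The sketch treats words as exact strings and ignores degenerate positions, reverse complements, and word lengths that are not known in advance, which a dictionary-based model is meant to handle.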