WordSpy - help


About

WordSpy is designed to discover hidden words in a stegoscript. Its primary application is motif finding in genomic regulatory DNA sequences. A stegoscript is a set of sequences in which secret messages (motifs, in biological applications) are embedded within cover-texts (background sequences, in biological applications). The task is to recover the secret messages from the script. Assuming that secret messages and cover-texts have different word distributions, we design a statistical dictionary and grammar model that generates the stegoscript; deciphering then amounts to learning such a model from the script. The algorithm is efficient at searching for a large set of motifs in a large set of sequences, which makes it suitable for genome-wide motif finding. Combined with ChIP-chip experiments, it can also accurately predict the binding motifs of a specific transcription factor.

WordSpy allows you to

  1. Discover all over-represented (degenerate) words in a large set of sequences (DNA, RNA, Protein, or English)

    Over-represented words in a set of sequences usually represent functional elements or significant features of those sequences. Identifying all over-represented words is fundamentally important for many applications. This program automatically detects all over-represented (degenerate) words using a dictionary and a grammar model. It can successfully recover an article hidden inside a long random string without knowing what language the article is written in. The only required inputs from the user are the input sequences (in FASTA format) and the maximum word length.

    WordSpy combines word counting with a statistical model and differs significantly from other word-counting methods. Its counting procedure is progressive and retrospective: it considers words from short to long and can adjust the over-representation of short words after examining longer ones, so that short words that are not truly over-represented can be purged from the dictionary. For example, suppose the word "there" is detected as over-represented in the current iteration and the word "ther" is already in the dictionary. If the occurrences of "ther" are merely due to the occurrences of "there", the significance of "ther" is reduced and it may consequently be purged from the dictionary. As a result, WordSpy produces fewer spurious motifs, has a lower false positive rate, and is able to find words of optimal lengths and at optimal positions. In contrast, most existing word-counting methods enumerate words of different lengths in isolation and thus have high false positive rates. Note that all these properties are naturally embedded in our model, so no extra objective functions are needed.

  2. Identify discriminative words with negative sequence data

    Given two sets of scripts (or sequences), a discriminative word is a word that is over-represented in one script but not in the other. Finding discriminative words is of practical importance for applications such as identifying tissue-specific or condition-specific transcription factor binding motifs (TFBMs) and elucidating differential transcriptional regulation. WordSpy is in essence an algorithm for finding discriminative words, because it models motifs (discriminative words) and background words in a single integrated model. For instance, when looking for TFBMs that are responsible for the expression of a group of genes under certain conditions, we can use that group of genes to form a positive dataset, and choose a group of genes that are not responsive and/or down-regulated under these conditions to construct the background (negative) dataset. The WordSpy algorithm can take positive and negative datasets directly to find discriminative motifs.

  3. Select biologically meaningful DNA motifs using gene expression data

    It is known that co-regulated genes tend to have similar expression patterns across different conditions. We can thus measure the quality of a motif (how likely it is to be biologically meaningful) by the coherence of the expression profiles of the genes whose promoters contain that motif. We use the average pairwise coherence of the gene expression profiles to measure the coherence of a set of expression profiles, and call this measure the G-score, where G stands for genes. A higher G-score therefore indicates a more biologically meaningful motif. Pairwise gene expression coherence can be measured in many ways, such as Euclidean distances and correlation coefficients. Here, we use correlation coefficients.

  4. Evaluate DNA motifs with genome-scale random sampling analysis (in the result page)
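
The retrospective adjustment described in item 1 can be illustrated with a small Python sketch. It is not WordSpy's statistical model: it only counts how many occurrences of a short word remain once the occurrences covered by a longer dictionary word are set aside. The function names and the toy text are assumptions made for this illustration.

```python
def find_all(text, word):
    """Return the start positions of every (possibly overlapping) occurrence."""
    positions = []
    start = text.find(word)
    while start != -1:
        positions.append(start)
        start = text.find(word, start + 1)
    return positions


def residual_count(text, short_word, long_word):
    """Count occurrences of short_word that do not lie inside an occurrence of long_word."""
    long_spans = [(p, p + len(long_word)) for p in find_all(text, long_word)]
    residual = 0
    for p in find_all(text, short_word):
        end = p + len(short_word)
        covered = any(s <= p and end <= e for s, e in long_spans)
        if not covered:
            residual += 1
    return residual


# Hypothetical toy text: "ther" occurs three times, twice as part of "there".
text = "xxtherexxtherxxtherexx"
raw = len(find_all(text, "ther"))
left = residual_count(text, "ther", "there")
print(raw, left)
```

If few occurrences of the short word remain after this discounting, its apparent over-representation was mostly inherited from the longer word, which is the situation in which the text says the short word may be purged.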
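
The G-score described in item 3 can be sketched in Python as the average Pearson correlation over all pairs of expression profiles. The text states only that correlation coefficients are used, so the choice of Pearson correlation, the function names, and the example profiles below are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined for constant profiles)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def g_score(profiles):
    """Average pairwise correlation of the profiles of genes that share a motif."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        raise ValueError("need at least two expression profiles")
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Invented profiles for three genes over four conditions.
coherent = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [0.5, 1.0, 1.5, 2.0]]
mixed = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [1.0, 3.0, 2.0, 4.0]]
print(g_score(coherent))
print(g_score(mixed))
```

In this sketch, a set of genes whose profiles rise and fall together scores close to 1, while a set with unrelated or opposite profiles scores lower.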


Parameters

Each parameter is listed with its default value, followed by its explanation.

maximum word length (default: N/A)
    Set the maximum word length k. The program builds a dictionary containing words of every length up to k.

alphabet set (default: dna)
    The program can handle input sequences of DNA, RNA, protein, or English text. Users are required to specify the alphabet set of their input sequences. By default, the program treats the input sequences as DNA. If the input sequences contain letters that are not in the alphabet set, the program may replace them with randomly chosen letters from the alphabet set.

allow degeneracy (default: no)
    Let the program search for degenerate motifs. By default, the program searches only for exact words. When this option is turned on, the program tries to merge similar over-represented words into degenerate motifs, which are represented by Position Weight Matrices (PWMs). These PWMs are optimized against the input sequences during model optimization.

subtle motifs (default: no)
    Useful only when "allow degeneracy" is turned on. "Subtle motifs" are degenerate motifs in which the degeneracy is uniform over all positions. By default, the program searches for degenerate motifs that have one or two highly conserved core parts with "do-not-care" flanking sequences. When this option is turned on, all words that match in at least m bases are merged, where m is chosen so that the chance that two random words have m matching bases is less than 0.001. We suggest using the default setting, since it is more biologically meaningful.

on both strands (default: no)
    Search for motifs on both strands.

count sequences (default: no)
    Count the word distribution across different sequences. Right now, the program only counts the number of sequences in which each motif appears.

repeatly cleanup (default: no)
    During model optimization, some motifs become less significant. Turning on this option removes these motifs in each iteration. Because the model has to be re-optimized after motifs are removed, which may in turn make other motifs less significant, the cleanup is repeated until no more motifs can be removed.

order of tandem repeats (default: 4)
    Set the maximum length of the repeated unit in low-complexity repeats. For example, 'AAA' is a tandem repeat of order one; 'CTCTCT' is a tandem repeat of order two. All tandem repeats are treated as background words. If you do not want to filter out tandem repeats, set this parameter to 0.

word selection ratio (default: 1.0)
    The program selects over-represented words, including motif words and background words, into the dictionary. The over-representation of a word is defined as the ratio of its observed occurrences in the input sequences to its expected occurrences in random sequences.

minimum z-score (default: 3.0)
    The program quantitatively measures the over-representation of a word by a Z-score. The minimum z-score is used as a threshold to separate motifs from background words: an over-represented word with a z-score higher than the threshold is treated as a motif word; otherwise it is treated as a background word.

minimum occurrences (default: 2)
    The minimum number of occurrences is another threshold for selecting motifs. A word that occurs fewer times than this threshold is not considered a motif word.

maximum motif numbers (default: 10000)
    Restrict the total number of motifs for each word length. The default is large enough that it places virtually no restriction on the number of motifs.
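
To make the interplay of "word selection ratio", "minimum z-score", and "minimum occurrences" concrete, here is a hedged Python sketch. The background model (independent letters with fixed frequencies), the binomial form of the Z-score, the strict comparison against the ratio cutoff, and all function names are assumptions made for illustration; WordSpy's own computation may differ.

```python
from math import sqrt


def background_expectation(word, seq_lengths, letter_freq):
    """Per-position probability and expected count of word under an assumed i.i.d. background."""
    p = 1.0
    for letter in word:
        p *= letter_freq[letter]
    positions = sum(max(n - len(word) + 1, 0) for n in seq_lengths)
    return p, positions * p


def classify(word, observed, seq_lengths, letter_freq,
             ratio_cutoff=1.0, min_z=3.0, min_occurrences=2):
    """Return (ratio, z, label) for one word; the thresholds mirror the defaults listed above."""
    p, expected = background_expectation(word, seq_lengths, letter_freq)
    ratio = observed / expected
    # Assumed binomial standard deviation; it ignores overlapping occurrences.
    z = (observed - expected) / sqrt(expected * (1.0 - p))
    if ratio <= ratio_cutoff:
        label = "not selected"
    elif z > min_z and observed >= min_occurrences:
        label = "motif word"
    else:
        label = "background word"
    return ratio, z, label


# Hypothetical input: ten sequences of 1000 bases with uniform base frequencies.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
lengths = [1000] * 10
for count in (80, 45, 30):
    print(count, classify("ACGT", count, lengths, uniform))
```

Raising `min_z` or `min_occurrences` in this sketch moves words from the motif class to the background class, which is the selectivity trade-off these thresholds control.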

 


Format

1) FASTA

The FASTA format consists of a list of sequences. Each sequence starts with a line containing '>' followed by the sequence name. The following lines contain the amino acids or nucleotides (and gaps) of that sequence. The next sequence starts at the next line beginning with '>'. Below is an example of a FASTA file.

>thrL Escherichia coli K-12 MG1655 complete genome.
atataggcatagcgcacagacagataaaaattacagagtacacaacatcc
>thrA Escherichia coli K-12 MG1655 complete genome.
cctgacagtgcgggctttttttttcgaccaaaggtaacgaggtaacaacc
>thrB Escherichia coli K-12 MG1655 complete genome.
tgtctttgctgatctgctacgtaccctctcatggaagttaggagtctgac
>thrC Escherichia coli K-12 MG1655 complete genome.
ttcatatttgccggctggatacggcgggcgcacgagtactggaaaactaa
>talB Escherichia coli K-12 MG1655 complete genome.
ggcagaccggttacatccccctaacaagctgtttaaagagaaatactatc
>htgA Escherichia coli K-12 MG1655 complete genome.
ggcaggcgatttgcagtacggctggaatcgtcacgcgataggcgctgccg
>dnaK Escherichia coli K-12 MG1655 complete genome.
ttacagactcacaaccacatgatgaccgaatatatagtggagacgtttag
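
As a minimal sketch of how such a file can be read, the following Python function collects each sequence under the first word of its header line, which is the name later matched against the gene expression data. The function name is an assumption, and the sketch keeps only the last record if two sequences share a name.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict mapping sequence ID -> sequence string."""
    records = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # The ID is the first word after '>'; the rest is a free-text description.
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            # A sequence may span several lines; collect them all.
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = """>thrL Escherichia coli K-12 MG1655 complete genome.
atataggcatagcgcacagacagataaaaattacagagtacacaacatcc
>thrA Escherichia coli K-12 MG1655 complete genome.
cctgacagtgcgggctttttttttcgaccaaaggtaacgaggtaacaacc
"""
records = read_fasta(example.splitlines())
print(list(records))
```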

2) Gene expression data

The gene expression data format is a two-dimensional matrix in which the first column contains gene names (or IDs) and the first row contains condition labels. For the gene expression data to be usable, the gene names (or IDs) should match the sequence names in the input sequence file (the first word following '>' on each header line of the FASTA file). The matches do not need to be one-to-one or in the same order. Below is an example of a gene expression data file that matches the above FASTA file example.
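
As a hypothetical illustration of this layout, the following Python sketch reads such a matrix and lists which sequence names from the FASTA example have an expression profile. The tab delimiter, the top-left header cell, the condition labels, and every expression value are assumptions invented for the sketch, not data from WordSpy.

```python
import csv
import io


def read_expression(handle, delimiter="\t"):
    """Read a matrix whose first row holds conditions and first column holds gene IDs."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    conditions = header[1:]
    profiles = {}
    for row in reader:
        if not row:
            continue
        profiles[row[0]] = [float(value) for value in row[1:]]
    return conditions, profiles


# Invented values. Rows are deliberately in a different order from the FASTA
# file, and some sequences have no profile.
example = (
    "gene\tcond1\tcond2\tcond3\n"
    "thrA\t0.12\t-0.40\t1.05\n"
    "dnaK\t1.30\t0.85\t-0.22\n"
    "thrL\t-0.51\t0.07\t0.33\n"
)
conditions, profiles = read_expression(io.StringIO(example))

fasta_ids = ["thrL", "thrA", "thrB", "thrC", "talB", "htgA", "dnaK"]
matched = [name for name in fasta_ids if name in profiles]
print(conditions)
print(matched)
```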




Example

The program is easy to use. We take the target promoters of the yeast ACE2 transcription factor as an example. The input sequences are available in our database.

Step 1: Select and input your positive sequences. In this case, you can select the "yeast ACE2 positive" sequences from our database.

Step 2: Set your options. The only required option is "Maximum word length". In this example, you can enter 8. If you wish to generate degenerate motifs, check the option "allow degeneracy".

Step 3: Input your motif selection criteria. These parameters affect the final size of the dictionary. A higher threshold makes motif identification more selective (fewer false positive motifs) but less sensitive (more false negative motifs). In this case, you can leave these parameters at their default values.

Step 4: Input gene expression data. Since the program needs to connect promoter sequences with gene expression profiles, the names of the promoters should match the names in the gene expression data. In this case, you can select the yeast expression data from our database, which matches all the yeast sequence data in our sequence database.

Step 5: Input negative sequences. You can select the "yeast ACE2 negative" sequences from our database.

Step 6: Submit your job. Give a valid email address, since the result will be sent to you by email.