WordSpy - help
About
The software is designed for discovering hidden words in a stegoscript. Its
primary application is motif finding in genomic regulatory DNA sequences.
A stegoscript is a set of sequences in which secret messages (motifs
in biological applications) are embedded within cover-texts (background
sequences in biological applications). Our task is to recover the secret messages
from the script. Assuming that secret messages and cover-texts have different
word distributions, we design a statistical dictionary and grammar model that
generates the stegoscript. The deciphering problem then reduces to learning such
a model from the script. The algorithm is efficient at searching for a
large set of motifs in a large set of sequences, and is thus suitable for genome-wide
motif finding. When combined with ChIP-chip experiments, it can also accurately
predict the binding motifs of a specific transcription factor.
WordSpy allows you to
- Discover all over-represented (degenerate) words in a large set of sequences (DNA, RNA, protein, or English)
Over-represented words in a set of sequences usually represent functional
elements or significant features of those sequences. Identifying
all over-represented words is fundamentally important for many applications.
This program automatically detects all over-represented (degenerate) words
using a dictionary and a grammar model. It can successfully recover an article
hidden inside a long random string without knowing what language the article
is written in. The only required inputs from the user are the input sequences
(in FASTA format) and the maximum word length.
WordSpy combines a word-counting method with a statistical model, and differs
significantly from other word-counting methods. In particular, WordSpy's counting
procedure is progressive and retrospective. It considers words from short to long
and can adjust the over-representation of short words after it has examined
longer words, so that short words that are not truly over-represented can be purged
from the dictionary. For example, assume that the word "there" was detected
as over-represented in the current iteration and the word "ther" was already in the
dictionary. If the occurrences of "ther" were merely
due to the occurrences of "there", the significance of "ther" would
be reduced, and it might consequently be purged from the dictionary. As a result,
WordSpy produces fewer spurious motifs, has a lower false positive rate, and
is able to find words of optimal lengths at optimal positions. In contrast,
most existing word-counting methods enumerate words of different
lengths in isolation, and thus have high false positive rates. Note that all
these properties are naturally embedded within our model, so no extra objective
functions are needed.
- Identify discriminative words with negative sequence data
Given two sets of scripts (or sequences), a discriminative word is a
word that is over-represented in one script but not the other. Finding discriminative
words is of practical importance for applications such as identifying
tissue- or condition-specific transcription factor binding motifs (TFBMs) and
elucidating differential transcriptional regulation.
WordSpy is in essence an algorithm for finding discriminative words, because it
models motifs (discriminative words) and background
words in a single integrated model. For instance, when looking for TFBMs that are
responsible for the expression of a group of genes under certain conditions,
we can use that group of genes to form a positive dataset, and choose a group
of genes that are not responsive and/or are down-regulated under these conditions
to construct the background (negative) dataset. The WordSpy algorithm can
take positive and negative datasets directly to find discriminative motifs.
- Select biologically meaningful DNA motifs using gene expression data
It is known that co-regulated genes tend to have similar expression patterns
across different conditions. We can thus measure the quality of a motif (that is,
how likely it is to be biologically meaningful) by the coherence of the expression
profiles of the genes whose promoters contain that motif. We use the average
pairwise coherence of the gene expression profiles to measure the coherence of
a set of expression profiles, and call this measure the G-score, where G
stands for genes. A higher G-score therefore indicates a more biologically
meaningful motif. Pairwise coherence can be measured in
many ways, such as Euclidean distances and correlation coefficients. Here,
we use correlation coefficients.
- Evaluate DNA motifs with genome-scale random sampling analysis (in the result page)
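The G-score described in the list above (the average pairwise correlation of expression profiles) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `pearson` and `g_score` are hypothetical, Pearson correlation is assumed as the correlation coefficient, and the profile values are made up. It is not WordSpy's own code.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles.

    Constant profiles (zero variance) are not handled in this sketch.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def g_score(profiles):
    """Average pairwise correlation over a list of expression profiles."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        raise ValueError("need at least two profiles")
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Made-up profiles for genes whose promoters contain some motif.
coherent = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [0.5, 1.5, 2.5, 3.5]]
mixed = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [1.0, 3.0, 2.0, 4.0]]

print(g_score(coherent))  # profiles rise together, so the score is high
print(g_score(mixed))     # profiles disagree, so the score is lower
```

A motif whose target genes behave like `coherent` would rank above one whose targets behave like `mixed`.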
Parameters
| Parameter | Explanation | Default value |
|---|---|---|
| maximum word length | Sets the maximum word length k. The program builds a dictionary that includes words of every length up to k. | N/A |
| alphabet set | The program can handle input sequences of DNA, RNA, protein, or English text. Specify the alphabet set of your input sequences; by default, the program treats them as DNA sequences. If the input sequences contain letters that are not in the alphabet set, the program may replace them with randomly chosen letters from the alphabet set. | dna |
| allow degeneracy | Lets the program search for degenerate motifs. By default, the program searches only for exact words. With this option on, the program tries to merge similar over-represented words into degenerate motifs, which are represented by Position Weight Matrices (PWMs). These PWMs are optimized against the input sequences during model optimization. | no |
| subtle motifs | Useful only when "allow degeneracy" is turned on. "Subtle motifs" are degenerate motifs in which the degeneracy is uniform over all positions. By default, the program searches for degenerate motifs that have one or two highly conserved core parts with "do-not-care" flanking sequences. When this option is on, all words that match in at least m bases are merged, where m is chosen so that the chance of two random words having m base matches is less than 0.001. We suggest keeping the default setting, since it is more biologically meaningful. | no |
| on both strands | Search for motifs on both strands. | no |
| count sequences | Counts the word distribution across sequences. Right now, the program only counts the number of sequences in which each motif appears. | no |
| repeatly cleanup | During model optimization, some motifs become less significant. With this option on, these motifs are removed in each iteration. After they are removed, the model has to be re-optimized, which may in turn make other motifs less significant, so the cleanup is repeated until no more motifs can be removed. | no |
| order of tandem repeats | Sets the maximum length of the identical subsequences in low-complexity repeats. For example, 'AAA' is a tandem repeat of order one; 'CTCTCT' is a tandem repeat of order two. All tandem repeats are treated as background words. If you do not want to filter out tandem repeats, set this parameter to 0. | 4 |
| word selection ratio | The program selects over-represented words, including motif words and background words, into the dictionary. The over-representation of a word is defined as the ratio between its observed occurrences in the input sequences and its expected occurrences in random sequences. | 1.0 |
| minimum z-score | The program quantitatively measures the over-representation of a word by a z-score. The minimum z-score is the threshold that separates motifs from background words: an over-represented word with a z-score above the threshold is treated as a motif word; otherwise it is treated as a background word. | 3.0 |
| minimum occurrences | The minimum number of occurrences is another threshold for selecting motifs. A word that occurs fewer times than this threshold is not considered a motif word. | 2 |
| maximum motif numbers | Restricts the total number of motifs for each word length. The default is large enough that it places virtually no restriction on the number of motifs. | 10000 |
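To make the "subtle motifs" cutoff in the table above more concrete, the sketch below computes a candidate m for a given word length. It rests on two assumptions that the help text does not state: letters are independent and uniformly distributed over the alphabet, and "m base matches" is read as "at least m matches". The function names are hypothetical, and WordSpy's actual computation may differ.

```python
from math import comb


def match_tail_probability(k, m, alphabet_size=4):
    """P(two random length-k words agree in at least m positions).

    Assumes independent, uniformly distributed letters.
    """
    p = 1.0 / alphabet_size  # chance that a single position matches
    return sum(comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(m, k + 1))


def min_matches(k, alphabet_size=4, cutoff=0.001):
    """Smallest m whose tail probability is below the cutoff, or None."""
    for m in range(k + 1):
        if match_tail_probability(k, m, alphabet_size) < cutoff:
            return m
    return None


for k in (4, 6, 8):
    print(k, min_matches(k))
```

Under these assumptions, DNA words of length 8 would need at least 7 matching bases to be merged, while words of length 4 are too short for any m to reach the 0.001 cutoff.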
Format
1) FASTA
The FASTA format consists of a list of sequences. Each sequence starts with
a line beginning with '>' followed by the sequence name. The following lines contain
the amino acids/nucleotides and gaps of that sequence. The next sequence starts
at the next line beginning with '>'. Below is an example of a FASTA file.
>thrL Escherichia coli K-12 MG1655 complete genome.
atataggcatagcgcacagacagataaaaattacagagtacacaacatcc
>thrA Escherichia coli K-12 MG1655 complete genome.
cctgacagtgcgggctttttttttcgaccaaaggtaacgaggtaacaacc
>thrB Escherichia coli K-12 MG1655 complete genome.
tgtctttgctgatctgctacgtaccctctcatggaagttaggagtctgac
>thrC Escherichia coli K-12 MG1655 complete genome.
ttcatatttgccggctggatacggcgggcgcacgagtactggaaaactaa
>talB Escherichia coli K-12 MG1655 complete genome.
ggcagaccggttacatccccctaacaagctgtttaaagagaaatactatc
>htgA Escherichia coli K-12 MG1655 complete genome.
ggcaggcgatttgcagtacggctggaatcgtcacgcgataggcgctgccg
>dnaK Escherichia coli K-12 MG1655 complete genome.
ttacagactcacaaccacatgatgaccgaatatatagtggagacgtttag
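As an illustration of the format just described, here is a minimal FASTA reader. It is a sketch, not WordSpy's parser: the function name `parse_fasta` is hypothetical, and the name of each sequence is taken to be the first word after '>'.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence name to its sequence string.

    The name is the first word after '>'. Sequence lines are joined, so a
    sequence may span several lines. A repeated name overwrites the earlier
    record in this sketch.
    """
    parts_by_name = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            parts_by_name[name] = []
        elif name is not None:
            parts_by_name[name].append(line)
    return {n: "".join(parts) for n, parts in parts_by_name.items()}


example = """\
>thrL Escherichia coli K-12 MG1655 complete genome.
atataggcatagcgcacagacagataaaaattacagagtacacaacatcc
>thrA Escherichia coli K-12 MG1655 complete genome.
cctgacagtgcgggctttttttttcgaccaaaggtaacgaggtaacaacc
"""
records = parse_fasta(example)
for seq_name, seq in records.items():
    print(seq_name, len(seq))
```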
2) Gene expression data
The gene expression data format is a two-dimensional matrix in which the first
column holds gene names (or IDs) and the first row holds condition labels.
For the gene expression data to be usable, the gene
names (or IDs) should match the sequence names (the first word
following '>' in each header line of the FASTA file) in the input sequence file.
The matches do not need to be one-to-one or in the same order. Below is
an example of a gene expression data file that matches the FASTA
file example above.
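The name-matching requirement can be checked before submitting a job. The sketch below assumes a tab-delimited text file (the help text does not specify the delimiter), uses hypothetical function names, and uses made-up expression values whose gene names follow the FASTA example above.

```python
import csv
import io


def read_expression(text, delimiter="\t"):
    """Parse a matrix: first row = conditions, first column = gene names."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    conditions = rows[0][1:]
    profiles = {row[0]: [float(v) for v in row[1:]] for row in rows[1:]}
    return conditions, profiles


def match_names(sequence_names, profiles):
    """Split sequence names into those with and without an expression profile."""
    matched = [n for n in sequence_names if n in profiles]
    unmatched = [n for n in sequence_names if n not in profiles]
    return matched, unmatched


# Made-up values, for illustration only. The row order differs from the
# FASTA file on purpose, since the order does not need to match.
expression_text = (
    "gene\tcond1\tcond2\tcond3\n"
    "thrA\t0.5\t1.2\t-0.3\n"
    "dnaK\t2.1\t0.4\t0.9\n"
    "thrL\t-1.0\t0.0\t1.0\n"
)
conditions, profiles = read_expression(expression_text)
matched, unmatched = match_names(["thrL", "thrA", "thrB", "dnaK"], profiles)
print("matched:", matched)
print("no expression profile:", unmatched)
```

Sequences listed as unmatched would simply have no expression profile to contribute when motifs are scored against expression data.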
Example
The program is easy to use. We take the target promoters of the yeast ACE2 transcription
factor as an example. The input sequences are available in our database.
Step 1: Select and input your positive sequences. In this case,
you can select the "yeast ACE2 positive" sequences
from our database.
Step 2: Give your options. The only required option is "Maximum
word length". In this example, you can input 8.
If you wish to generate degenerate motifs, check the option
"allow degeneracy".
Step 3: Input your motif selection criteria. These
parameters affect the final size of the dictionary. A higher threshold
makes the identified motifs more selective (fewer false positive motifs) but
less sensitive (more false negative motifs). In this case, you can leave these
parameters at their default values.
Step 4: Input gene expression data. Since the
program needs to link promoter sequences to gene expression
profiles, the names of the promoters should match the names in the gene expression
profiles. In this case, you can select the yeast expression
data from our database, which matches all the yeast sequence data
in our sequence database.
Step 5: Input negative sequences. You can select
the "yeast ACE2 negative" sequences from
our database.
Step 6: Submit your job. Give a valid email address, since the results
will be sent to you by email.
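The trade-off mentioned in Step 3 can be illustrated with a small sketch of how the three selection parameters might interact. Everything here is an assumption made for illustration: the function name `classify`, the word counts, the strict comparison against the selection ratio, and in particular the z-score formula (a Poisson-style approximation), since this help page does not say how WordSpy computes its z-score.

```python
from math import sqrt


def classify(observed, expected, min_ratio=1.0, min_z=3.0, min_occurrences=2):
    """Label a word as 'motif', 'background', or 'rejected'.

    Assumption: z = (observed - expected) / sqrt(expected). WordSpy's actual
    z-score may be computed differently.
    """
    ratio = observed / expected
    if ratio <= min_ratio:
        return "rejected"  # not over-represented, so not in the dictionary
    z = (observed - expected) / sqrt(expected)
    if z > min_z and observed >= min_occurrences:
        return "motif"
    return "background"


# Made-up (observed, expected) counts for four hypothetical words.
counts = {
    "ACGCGT": (30, 10.0),
    "CCGGAA": (25, 10.0),
    "AAAAAA": (14, 10.0),
    "TTGACA": (5, 8.0),
}


def motifs(min_z):
    """Words labeled 'motif' at the given minimum z-score."""
    return [w for w, (obs, exp) in counts.items()
            if classify(obs, exp, min_z=min_z) == "motif"]


default_motifs = motifs(3.0)
strict_motifs = motifs(5.0)
print(default_motifs)
print(strict_motifs)
```

With these made-up counts, raising the minimum z-score from 3.0 to 5.0 drops one of the two motif words, which is the selectivity versus sensitivity trade-off in miniature.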