Explanation of Input Format

FASTA

The FASTA format consists of a list of sequences. Each sequence starts with a line containing '>' followed by the sequence name. The following lines contain the amino acids/nucleotides and gaps of that sequence. The next sequence starts at the next line beginning with '>'. Legal sequence symbols are 'a', 'c', 'g', 't', and '-' (case insensitive). Unknown symbols are converted to gaps ('-'). An example with two sequences:

>ZEA.MA-A
NCC-GAGCUCU-GUAGCGAGAGCUUGUAACCCG-----AGCGGG-GGCAUUAAGGUGGUG
UGAAUGCUUUGCGAUGGCU---UUCUGGGCCCU--GGGCUCG-U-UGUGACACUGGCCGG
CUUGCCCAUCCCAAGUUGGUAGU-GUCUGGU-GGGGGCUCUAGCGAAAGCUUUGGGUCUC
U-GCAGACCU-GGAGCGGCAGGAAUGGCGUAAGGCUGGCUUCACAGAGCAGCGAUCACUG
CCG-ACUCCCAACGGUGGGAGGAUAACGAAGCCGCUG-CACU--UUGAGCCUAACUCAG-
GCU-CAGAA----CCUCACU-AAGCAAACCACCA
>HUM.LU-A
GCC--GGUCUUAGCAACGUGGGCCUGUAACCCA-----AGUGGG-GGCAUGUGGGAAAUG
G-GACU-UUG-GGUCAAC----CUAGUGGAU-C--GGGUCCAGUGUUAGCUGCUUACUGG
UCUGCCCAUUCCAAGCCGGGAGUU-GGGCUG-AGUGACCUGGGCGAAGGGC-UGGGUUGC
GCACGUC-CU-AGAGUGGAGGGCAAUGCGUGAGGCUGGCUUCACAGAGCAGCGACUACCU
CC-CGCUCUCGGCAGUGGAAGGAUAACG-GGCCGGUG-CUAC--CUGGGUCCACCAUG-C
UUC-ACUAGG--CUGACUCUUAAUAGGACCAUUU
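
The reading rules above can be sketched in a few lines of Python. This is only an illustration, not the parser used by the program: the function name parse_fasta and the legal parameter are hypothetical. The default legal set follows the symbols listed above; since the example sequences contain 'U', a caller may want to pass legal="acgtu-", but how the program itself treats 'U' is not stated here.

```python
def parse_fasta(text, legal="acgt-"):
    """Return a list of (name, sequence) tuples from FASTA-formatted text.

    Hypothetical helper: symbols outside `legal` (case insensitive)
    are replaced by the gap character '-'.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(
                "".join(c if c.lower() in legal else "-" for c in line)
            )
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

Sequence lines that appear before the first '>' line are ignored in this sketch.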

Explanation of Scoring Methods

Mutual information

This measures the interdependence between pairs of base positions and is meaningful only for a set of aligned sequences (in general, more than 10 sequences). Given an alignment, let f_i(X) be the frequency of base X at aligned position i, and let f_ij(XY) be the frequency of finding X at position i and Y at position j. The mutual information score between positions i and j, M_ij, is calculated as:

M_ij = 1000 * SUM_XY { f_ij(XY) * log2( f_ij(XY) / (f_i(X) * f_j(Y)) ) }, where X, Y = A, C, G, T.
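
As a rough illustration of the formula, the score for one pair of columns could be computed as follows. The function name mutual_information is hypothetical, positions are 0-based, and two details are assumptions rather than documented behavior: frequencies are taken over all sequences in the alignment (including those with a gap at i or j), and terms with f_ij(XY) = 0 contribute nothing.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j, alphabet="ACGT"):
    """Mutual information score M_ij for columns i and j (0-based).

    `alignment` is a list of equal-length strings. Assumption: all
    frequencies are normalized by the total number of sequences.
    """
    n = len(alignment)
    f_i = Counter(seq[i].upper() for seq in alignment)
    f_j = Counter(seq[j].upper() for seq in alignment)
    f_ij = Counter((seq[i].upper(), seq[j].upper()) for seq in alignment)

    total = 0.0
    for x in alphabet:
        for y in alphabet:
            p_xy = f_ij[(x, y)] / n
            if p_xy > 0:  # zero-frequency terms are skipped
                p_x, p_y = f_i[x] / n, f_j[y] / n
                total += p_xy * math.log2(p_xy / (p_x * p_y))
    return 1000 * total
```

Two perfectly covarying columns with four equally frequent bases give the maximum of 2 bits (score 2000), while independent columns score 0.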

Helix plot

This measures the possibility for a base-pair to be in a long helix. The score is the sum of a pair score and a helix bonus. If two positions can form a Watson-Crick or G-U base-pair, the pair is assigned a "good pair score" (default 1); if one of the positions is a gap, it is assigned a "paired gap penalty" (default 3); otherwise it is assigned a "bad pair score" (default 2). If a base-pair is within a helix of sufficient length (default 3), it gets a bonus score depending on the length of the helix (default 2 * helix length). The two parts are then summed and multiplied by 20. When there is a sequence alignment, this score is calculated for each sequence, and the average score for each position pair is used. Let P_ij_k and B_ij_k be the pair score and the helix bonus score of positions i and j in the kth sequence, respectively. The helix plot score between positions i and j, H_ij, is calculated as:

H_ij = 20 * SUM_k (P_ij_k + B_ij_k) / N, where N is the number of sequences, and k = 1...N.
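
The following sketch shows one way to read the definition of H_ij. It is not the program's implementation, and all function names are hypothetical. Two points are assumptions: the gap penalty and the bad pair score are subtracted (the text gives only their magnitudes), and the helix containing (i, j) is taken to be the maximal run of directly stacked good pairs through (i, j). Positions are 0-based.

```python
GOOD = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def _norm(c):
    # Treat T as U so DNA-style input pairs the same way.
    c = c.upper()
    return "U" if c == "T" else c


def can_pair(a, b):
    return (_norm(a), _norm(b)) in GOOD


def pair_score(a, b, good=1, bad=2, gap=3):
    # Assumption: penalties enter the score with a negative sign.
    if a == "-" or b == "-":
        return -gap
    return good if can_pair(a, b) else -bad


def helix_length(seq, i, j):
    """Length of the stacked run of good pairs through (i, j), or 0."""
    if not can_pair(seq[i], seq[j]):
        return 0
    length = 1
    a, b = i - 1, j + 1  # extend outward
    while a >= 0 and b < len(seq) and can_pair(seq[a], seq[b]):
        length, a, b = length + 1, a - 1, b + 1
    a, b = i + 1, j - 1  # extend inward
    while a < b and can_pair(seq[a], seq[b]):
        length, a, b = length + 1, a + 1, b - 1
    return length


def helix_plot_score(alignment, i, j, min_helix=3, bonus_per_pair=2):
    total = 0
    for seq in alignment:
        h = helix_length(seq, i, j)
        bonus = bonus_per_pair * h if h >= min_helix else 0
        total += pair_score(seq[i], seq[j]) + bonus
    return 20 * total / len(alignment)
```

For the single sequence GGGAAACCC, positions 1 and 7 lie in a helix of length 3, so the score is 20 * (1 + 2 * 3) = 140 under these assumptions.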

Extended Helix Plot

This is similar to the helix plot score, with a more sophisticated calculation of pair scores and helix bonus scores. G-C, A-U, and G-U pairs have different scores. Helix bonus scores are calculated from stacking energies. When there is only a single sequence, this provides better accuracy than helix plot scores.

Ratio of mutual information scores and (extended) helix plot scores

The two scoring methods can be combined when both are available (i.e., for a multiple sequence alignment). When the number of sequences is small (< 10), the helix plot should get more weight (a mutual information to helix plot ratio of 1:3 is suggested). Otherwise a ratio of 1:1 is suggested. When there is only a single sequence, mutual information is not calculated, and the extended helix plot is preferred to the helix plot.
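
The suggested ratios can be written down as a small helper. This is a sketch only: the names suggested_ratio and combine_scores are hypothetical, and combining the two scores as a plain weighted sum is an assumption, since the exact combination rule is not spelled out here.

```python
def suggested_ratio(n_sequences):
    """Return (mutual_information_weight, helix_plot_weight)."""
    if n_sequences <= 1:
        # Single sequence: mutual information is not calculated.
        return (0, 1)
    if n_sequences < 10:
        return (1, 3)
    return (1, 1)


def combine_scores(mi_score, helix_score, mi_weight, helix_weight):
    # Assumption: the ratio is applied as weights in a linear combination.
    return mi_weight * mi_score + helix_weight * helix_score
```

For example, with 5 aligned sequences, combine_scores(100, 40, *suggested_ratio(5)) weights the helix plot score three times as heavily as the mutual information score.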

Explanation of Folding Algorithms and Options

Iterated loop matching

See reference [1].

Maximum weighted matching

See references [2] and [3].

Minimum loop length

The minimum number of single-stranded bases (in the aligned sequences) between any two bases to be paired.

Minimum virtual loop length

The minimum number of single-stranded bases (in the aligned sequences) between any two bases to be paired, excluding pseudoknotted bases.

Minimum helix length

The minimum length of a helix (in the aligned sequences).

Number of helices per iteration

The maximum number of helices selected in each iteration. A value of 0 means all helices are selected.

Number of iterations before stop

The maximum number of loop matching iterations. A value of 0 means the iterations continue until no helices are left.

Explanation of Output Formats

text

This is the original form of output. The first, second, and third columns are the nucleotide index, the nucleotide symbol, and the base-pair partner index, respectively.

Bracket notation

This is an extended version of the traditional bracket notation for non-pseudoknotted structures. The first line of output shows the index of each helix, the second line shows the primary sequence, and the remaining lines show the base-pairs in brackets. Depending on the complexity of the pseudoknots, the output may have multiple lines of brackets. Each line contains a set of non-crossing base-pairs, which are represented by pairs of opposing brackets. By breaking pseudoknotted base-pairs into separate lines, this format is able to represent pseudoknots of any complexity.
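
The idea of splitting crossing base-pairs over several bracket lines can be sketched as follows. This is an illustration, not the program's output routine: the names crosses and bracket_lines are hypothetical, pairs are 0-based (i, j) tuples with i < j, the greedy assignment of pairs to lines is just one simple choice, and the use of '.' for unpaired positions is an assumption. The helix index line and the sequence line are omitted.

```python
def crosses(p, q):
    """True if base-pairs p and q cross (form a pseudoknot)."""
    (i, j), (k, l) = p, q
    return i < k < j < l or k < i < l < j


def bracket_lines(length, pairs):
    """Return bracket strings, each holding a set of non-crossing pairs."""
    layers = []
    for pair in sorted(pairs):
        # Put the pair on the first line where it crosses nothing.
        for layer in layers:
            if not any(crosses(pair, other) for other in layer):
                layer.append(pair)
                break
        else:
            layers.append([pair])

    lines = []
    for layer in layers:
        row = ["."] * length
        for i, j in layer:
            row[i], row[j] = "(", ")"
        lines.append("".join(row))
    return lines
```

A nested structure fits on one line, while the two crossing pairs (0, 4) and (2, 6) are placed on separate lines.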

.ct

The '.ct file' contains the nucleic acid sequence and base pairing information from which a structure plot may be computed. It can be directly imported into the RNAviz program for drawing a secondary structure. Pseudoknots are allowed.
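
For orientation, the sketch below writes the commonly used connect-table layout: a header line with the sequence length and a title, then one line per base giving its index, the base, the previous index, the next index (0 for the last base), the pairing partner (0 if unpaired), and the index again. Whether the program's .ct output matches this layout in every detail is an assumption, and the function name to_ct is hypothetical. Pairs are 1-based here.

```python
def to_ct(seq, pairs, title="sequence"):
    """Return connect-table text for `seq` and 1-based pairs (i, j)."""
    n = len(seq)
    partner = [0] * (n + 1)  # partner[k] == 0 means base k is unpaired
    for i, j in pairs:
        partner[i], partner[j] = j, i

    lines = [f"{n} {title}"]
    for k in range(1, n + 1):
        nxt = k + 1 if k < n else 0
        lines.append(f"{k} {seq[k - 1]} {k - 1} {nxt} {partner[k]} {k}")
    return "\n".join(lines)
```

Because each base simply stores its partner index, crossing (pseudoknotted) pairs need no special handling in this layout.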

XRNA

I have not figured out how to make the output consistent with the program, really. The output now contains a primary sequence and a set of helices, which can be separately copy-pasted into the XRNA program. It will complain about the presence of pseudoknots. You can remove those base-pairs first. Then you may be able to add those pseudoknotted base-pairs later (I haven't figured out how to do that though. Please tell me if you know, thanks).

RNAML

The RNAML (RNA Markup Language) was developed by a consortium of investigators and is a proposed syntax for RNA information files. A description was published in 2002:

A. Waugh, P. Gendron, R. Altman, J.W. Brown, D. Case, D. Gautheret, S.C. Harvey, N. Leontis, J. Westbrook, E. Westhof, M. Zuker, & F. Major. RNAML: A standard syntax for exchanging RNA information. RNA 8 (6), 707-717 (2002).

For more information, see the RNAML website. Our syntax is compatible with DTD version 1.1.

dotplot

A dot plot displays a score matrix together with predicted base-pairs. In the upper triangle, the size of each dot represents the relative score of that corresponding base-pair. The base-pair with the maximum score has the largest size. A dot in the lower triangle means that the corresponding base-pair is predicted by the algorithm. Note that the indices correspond to positions in the aligned sequence.
PostScript files can be viewed with Ghostscript, Ghostview and GSview.