Implementation Notes


Input data

The algorithm takes RNA sequences as input. Structural information is optional. When supplied, it can be:
  • given separately for parts of an input sequence.
  • given as more than a single structure for some or all parts of the input sequences.
If experimental information on structure is available, it is recommended to supply the experimental structures as input.

If no structural information is given, the algorithm uses Vienna RNAfold to fold the input RNAs.
  • Folding can be done in segments (possibly overlapping), since folding algorithms do not work well on long sequences.
  • More than a single fold can be provided for each sequence, for example by allowing suboptimal folds of the sequences.
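
Segmented folding can be pictured with a minimal sketch that splits a sequence into overlapping windows, each of which would then be folded on its own. The function name and the window and step values are assumptions made for illustration; they are not the tool's actual parameters or defaults.

```python
def overlapping_segments(sequence, window=200, step=100):
    """Return (start, subsequence) pairs that cover the sequence with overlapping windows.

    window and step are illustrative values, not the tool's defaults.
    """
    if window <= 0 or step <= 0 or step > window:
        raise ValueError("need 0 < step <= window")
    if len(sequence) <= window:
        return [(0, sequence)]
    segments = []
    start = 0
    while start + window < len(sequence):
        segments.append((start, sequence[start:start + window]))
        start += step
    # The final window is anchored at the end so the tail is always covered.
    segments.append((len(sequence) - window, sequence[-window:]))
    return segments
```

Each returned segment keeps its start offset, so structures predicted per segment can be mapped back to positions in the full sequence.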

Negative Examples

If there are examples of sequences in which the motif DOES NOT appear, they can be provided as negative examples. These examples are used for initialization but not for the learning process itself. Negative examples are also used to estimate the predictive power of the motif.

If no negative examples are known, they can be created by di-nucleotide shuffling of the input sequences.
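
As an illustration of di-nucleotide shuffling, the sketch below follows the general idea of the Altschul-Erickson procedure: it rearranges a sequence while keeping every di-nucleotide count, and the first and last bases, unchanged. It is a sketch under stated assumptions and not necessarily the shuffling code used by the tool; all function names are hypothetical.

```python
import random
from collections import Counter


def dinucleotide_counts(seq):
    """Count every adjacent pair of bases in seq."""
    return Counter(zip(seq, seq[1:]))


def _reaches(base, target, final_exit):
    """True if following final_exit links from base arrives at target."""
    seen = set()
    while base != target:
        if base in seen:
            return False
        seen.add(base)
        base = final_exit[base]
    return True


def dinucleotide_shuffle(seq, rng=None):
    """Return a shuffled copy of seq with identical di-nucleotide counts."""
    rng = rng or random.Random()
    if len(seq) < 3:
        return seq
    last = seq[-1]
    # For every base, list the bases that follow it in seq.
    successors = {}
    for a, b in zip(seq, seq[1:]):
        successors.setdefault(a, []).append(b)

    while True:
        # Pick one candidate "final exit" per base (except the last base).
        final_exit = {v: rng.choice(s) for v, s in successors.items() if v != last}
        # Accept only if the final exits lead every base to the last base.
        if all(_reaches(v, last, final_exit) for v in final_exit):
            break

    # Shuffle the remaining successors and put the final exit at the end.
    order = {}
    for v, s in successors.items():
        rest = list(s)
        if v in final_exit:
            rest.remove(final_exit[v])
        rng.shuffle(rest)
        if v in final_exit:
            rest.append(final_exit[v])
        order[v] = rest

    # Walk the shuffled successor lists from the original first base.
    position = {v: 0 for v in order}
    out = [seq[0]]
    for _ in range(len(seq) - 1):
        cur = out[-1]
        out.append(order[cur][position[cur]])
        position[cur] += 1
    return "".join(out)
```

Comparing dinucleotide_counts of the original and the shuffled sequence is a quick way to confirm that the background composition was preserved.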

Initialization

The initialization step produces motif candidates for the EM learning algorithm. These candidates contain only structural information. The entire set of sub-structures of a given length range ([15 .. 75] by default) in the input sequences is extracted. These small sub-structures are called "structural elements".

To simplify this set, stems or loops with close but not identical sizes can be treated as equivalent.

The parameter "stem size flexibility" sets the minimal size difference between stems (paired bases) at which the algorithm treats them as different structures. For example, if this parameter is 3, then stems of size 5 and 7 are considered the same, but stems of size 5 and 8 are considered different structures. Similarly, the parameter "loop size flexibility" sets the minimal size difference between loops (unpaired bases) at which the algorithm treats them as different structures.
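
The flexibility rule can be written as a one-line comparison. This is a minimal sketch of the pairwise rule as described in this section; the function name is hypothetical, and how the tool groups chains of sizes (for example 5, 7 and 9) is not specified here.

```python
def same_size_class(size_a, size_b, flexibility):
    """Two stems (or loops) count as the same structure when their
    size difference is smaller than the flexibility parameter."""
    return abs(size_a - size_b) < flexibility


# The example from the text, with stem size flexibility = 3:
print(same_size_class(5, 7, 3))  # sizes differ by 2
print(same_size_class(5, 8, 3))  # sizes differ by 3
```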

This set is filtered by:
  • Filtering by background distribution: a background distribution describes which motifs are expected in the set. This distribution can be built from the negative set, or a default distribution can be used.
    Using this background distribution, a p-value is assigned to the number of appearances of each motif in the input data. By default, all motifs with p-value > 0.01 are removed. The p-value can be calculated by assuming either a normal or a binomial model for the background distribution.
  • Filtering by position: if a specific structural element ALWAYS appears as part of another, then one of the two can be removed.
    By default the algorithm keeps the larger structural element, but it is also possible to keep the smaller one or both.
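
The background filter under the binomial model could look roughly like the sketch below. It assumes that each count is the number of input sequences containing a structural element, and that the background distribution gives the probability of seeing that element in a single background sequence. Both assumptions, and all names, are illustrative and are not taken from the tool's implementation.

```python
from math import comb


def binomial_upper_tail(count, n, p):
    """P(X >= count) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(count, n + 1))


def filter_by_background(counts, n_sequences, background_freq, threshold=0.01):
    """Keep elements whose count is unexpectedly high under the background.

    counts: element -> number of input sequences containing it (assumed meaning)
    background_freq: element -> background probability per sequence (assumed meaning)
    Elements with p-value > threshold are removed, as described above.
    """
    kept = {}
    for element, count in counts.items():
        p_value = binomial_upper_tail(count, n_sequences, background_freq[element])
        if p_value <= threshold:
            kept[element] = p_value
    return kept
```

An element seen in 9 of 10 input sequences with a background probability of 0.2 passes this filter, while one seen in 3 of 10 with a background probability of 0.3 does not.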

Learning (the EM algorithm)

In this step, each initial structure produced by the initialization step is refined, and some sequence information is added to it.
You can change the number of initial structures that will be passed to this stage (default = 3). The resulting motif will be displayed.

Output

The output includes several parts:
  • A summary table displaying the graphical description of each motif.
  • A file containing the model parameters.
  • An Excel file summarizing the motif positions identified in each input sequence.

Graphical description of a motif

A motif is displayed graphically. This graphical representation gives information about the consensus structure and possibly sequence of the motif. This does NOT mean we expect to always see that structure or sequence, but that this is the structure and sequence with the highest probability according to the model.
  • Structure positions are colored in grey-scale. Darker color means higher probability for a position to be in the same state as in the picture (paired/unpaired).
  • Sequence positions appear only if their probability is higher than 0.5 (by default). They are color-coded according to their probability, on a scale ranging from green (low probability) to red (high probability).

Estimating the predictive power of a motif

This option is not supported in the online tool, but it is available in the executable. It estimates an AUC (area under the curve) value for the motif. A <k>-fold cross-validation scheme is a standard statistical test in which we partition the input set into <k> parts, learn a model from each of the possible combinations of <k>-1 parts, and use this model to assign likelihood scores to the RNAs that were held out while learning it.
A standard ROC curve and its associated area under the curve (AUC) measure make it possible to evaluate the significance of the likelihood scores of the input RNAs compared to those of the negative examples. High AUC scores indicate that the input RNAs share a biological signal which is absent from the negative ones. An AUC score close to 0.5 means that the model was NOT successful in separating positive examples from negative ones. The better the motif distinguishes the input examples from the negative examples, the higher the AUC score and the closer it is to 1.
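
The two ingredients of this evaluation, the <k>-fold partition and the AUC, can be sketched as follows. The function names are hypothetical, the model learning and scoring steps are left out, and the AUC is computed with the standard pairwise (rank-based) formulation rather than whatever routine the executable uses.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_items:
        raise ValueError("need 2 <= k <= n_items")
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def auc(positive_scores, negative_scores):
    """Probability that a random positive scores above a random negative.

    Ties count as half a win, which equals the area under the ROC curve.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

In a full run, a model would be learned on each train split, the held-out RNAs and the negative examples would be scored with it, and the pooled scores would be passed to auc.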

Other details about our algorithm and implementation can be found here: