Input data
The algorithm takes RNA sequences as input. Information about structure is optional. Structural information can be:
If no structural information is given, the algorithm uses Vienna RNAfold to fold the input RNAs.
Negative Examples
If there are examples of sequences in which the motif DOES NOT appear, they can be provided as negative examples. These examples are used for initialization but not for the learning process itself. Negative examples are also used to estimate the predictive power of the motif.
If no negative examples are known, they can be created by dinucleotide shuffling of the input sequences.
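As an illustration of dinucleotide shuffling, here is a minimal Python sketch of a shuffle that keeps every dinucleotide count of the input unchanged, in the style of the Altschul-Erikson method. The function names and the `seed` parameter are hypothetical, and this is not necessarily the procedure the tool itself uses.

```python
import random
from collections import Counter, defaultdict


def dinucleotide_counts(seq):
    """Count every pair of adjacent nucleotides in seq."""
    return Counter(zip(seq, seq[1:]))


def _reaches(vertex, final_edge, last):
    """True if following the chosen final edges from vertex ends at last."""
    seen = set()
    while vertex != last:
        if vertex in seen:  # a cycle that never reaches the last nucleotide
            return False
        seen.add(vertex)
        vertex = final_edge[vertex]
    return True


def dinucleotide_shuffle(seq, seed=None):
    """Return a random sequence with the same dinucleotide counts as seq.

    Illustrative sketch only (hypothetical helper, not the tool's own code).
    """
    rng = random.Random(seed)
    if len(seq) < 3:
        return seq
    # For each nucleotide, list the nucleotides that follow it in seq.
    successors = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        successors[a].append(b)
    last = seq[-1]
    # Pick one "final" edge per nucleotide so that these edges all lead
    # to the last nucleotide; retry until the choice is valid.
    while True:
        final_edge = {v: rng.choice(nxt)
                      for v, nxt in successors.items() if v != last}
        if all(_reaches(v, final_edge, last) for v in final_edge):
            break
    # Shuffle the remaining edges and keep the final edge at the end.
    order = {}
    for v, nxt in successors.items():
        rest = list(nxt)
        if v != last:
            rest.remove(final_edge[v])
        rng.shuffle(rest)
        if v != last:
            rest.append(final_edge[v])
        order[v] = rest
    # Walk the edges in the chosen order, starting at the first nucleotide.
    used = {v: 0 for v in order}
    current = seq[0]
    shuffled = [current]
    for _ in range(len(seq) - 1):
        following = order[current][used[current]]
        used[current] += 1
        shuffled.append(following)
        current = following
    return "".join(shuffled)
```

A shuffled sequence produced this way has the same length, the same first and last nucleotide, and the same dinucleotide composition as the input, so it is a plausible negative example that lacks the original arrangement.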
Initialization
The initialization step produces motif candidates for the EM learning algorithm. These candidates contain only structural information. The entire set of sub-structures of a given length range ([15 .. 75] by default) in the input sequences is extracted. These small sub-structures are called "structural elements".
To simplify this set, stems or loops with close but not identical sizes can be treated as the same.
The parameter "stem size flexibility" indicates the minimal size difference between stems (paired bases) at which the algorithm considers them different structures. For example, if this parameter is 3, then stems of size 5 and 7 are considered the same, but stems of size 5 and 8 are considered different structures. Similarly, the parameter "loop size flexibility" indicates the minimal size difference between loops (unpaired bases) at which the algorithm considers them different structures.
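The flexibility rule can be sketched in Python as a single comparison. The function name is hypothetical, and the sketch only restates the example from the text; it does not describe how the tool groups elements internally.

```python
def treated_as_same(size_a, size_b, flexibility):
    """Return True if two stem (or loop) sizes count as the same structure.

    Two sizes are treated as different only when their difference reaches
    the flexibility parameter (hypothetical helper for illustration).
    """
    return abs(size_a - size_b) < flexibility


# With a stem size flexibility of 3:
print(treated_as_same(5, 7, 3))  # sizes differ by 2, below the threshold
print(treated_as_same(5, 8, 3))  # sizes differ by 3, which reaches the threshold
```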
This set is filtered by:
Learning (the EM algorithm)
In this step, the initial structure produced by the initialization step is refined, and some sequence information is added to it.
You can change the number of initial structures that will be passed to this stage (default = 3). The resulting motif will be displayed.
Output
The output includes several parts:
Graphical description of a motif
A motif is displayed graphically. This graphical representation gives information about the consensus structure and possibly the sequence of the motif. This does NOT mean we expect to always see that structure or sequence, but that this is the structure and sequence with the highest probability according to the model.
Estimating the predictive power of a motif
This option is not supported in the online tool, but is available in the executable. It estimates an AUC (area under the curve) value for the motif. The <k>-fold cross-validation scheme is a standard statistical test in which we partition the input set into <k> parts, learn a model from each of the possible combinations of <k>-1 parts, and use this model to assign likelihood scores to the RNAs that were held out while learning it.
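The partitioning step of the cross-validation scheme can be sketched as follows. This is a minimal Python illustration with a hypothetical function name; learning the model and scoring the held-out RNAs are left out because they depend on the executable.

```python
def k_fold_splits(items, k):
    """Yield (training, held_out) pairs for k-fold cross-validation.

    The items are dealt into k parts; each part is held out once while
    the other k-1 parts form the training set (illustrative sketch).
    """
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    folds = [items[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, held_out


# Example with placeholder RNA identifiers and k = 3.
rnas = ["rna1", "rna2", "rna3", "rna4", "rna5", "rna6", "rna7"]
for training, held_out in k_fold_splits(rnas, 3):
    # A model would be learned from `training` and then used to
    # score the RNAs in `held_out`.
    print(len(training), held_out)
```

Every RNA is held out exactly once, so each input RNA receives a likelihood score from a model that did not see it during learning.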
A standard ROC curve and its associated area under the curve (AUC) measure make it possible to evaluate the significance of the input RNAs' likelihood scores compared to those of the negative examples. High AUC scores indicate that the input RNAs share a biological signal which is absent from the negative ones. AUC scores close to 0.5 mean that the model was NOT successful in separating positive examples from negative ones. The more successful the motif is in distinguishing the input examples from the negative examples, the higher the AUC score will be, and the closer to 1.
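For intuition, the AUC equals the probability that a randomly chosen input RNA receives a higher score than a randomly chosen negative example, with ties counted as one half. The Python sketch below computes it directly from that definition; the function name and the score values are made up for illustration and are not output of the tool.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve from two lists of likelihood scores.

    Computed as the fraction of (positive, negative) pairs in which the
    positive example scores higher; a tie counts as half a pair.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Perfect separation: every positive outscores every negative.
print(auc([0.9, 0.8], [0.1, 0.2]))
# No separation: identical score distributions.
print(auc([1.0, 2.0], [1.0, 2.0]))
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a rank-based computation gives the same value for larger sets.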