- What does the output diagram represent?
The output diagram represents the model that RNApromo predicts. It is not specific to any one sequence, but rather tries to capture the motif that is common to all the input sequences.
- I would like to get the structures of the individual sequences in my training set (as was shown in the PNAS paper for a number of motifs). How can I visualize these structures?
To visualize the structures you will need to use other existing tools such as RNAplot, which is provided as part of the ViennaRNA package (http://www.tbi.univie.ac.at/RNA). This is what we used to produce the images in the paper. RNApromo is a tool for predicting and visualizing motifs, not the structure of a specific sequence.
- What is the difference between Vienna and CONTRAfold? Does it really matter which one I use?
Both Vienna and CONTRAfold are RNA secondary structure prediction tools. Each uses a different algorithm to predict the structure, and therefore in some cases you can get differences between the structures they predict. In most cases these differences will not affect the predicted motif. Vienna is the more standard and well-known of the two, so you might prefer to use it.
- How is the score assigned to each input sequence calculated? And how can I determine whether the motif exists in a given sequence based on this score?
The score is the log likelihood of the motif in each sequence (given the model parameters). You can treat these scores as a ranking of the sequences by how "strong" the motif is in each of them. Therefore, a higher score means a better result. These are not the AUC scores that are used in the paper. To set a threshold on the scores, you can search for the motif you learned in a set of random sequences, and set a limit on the FPR (false positive rate) you are willing to accept, which will give you a threshold on the scores (for example, if you have 100 negative sequences and you want FPR = 5%, then take the 5th highest score among the negative examples as the threshold).
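The thresholding procedure described above can be sketched in a few lines of Python. This is only an illustration, not part of RNApromo: the function name `score_threshold` and the idea of passing the negative-set scores as a plain list are assumptions, and you would need to extract the scores from the program output yourself.

```python
def score_threshold(negative_scores, fpr):
    """Return the k-th highest negative score, where k = fpr * (number of negatives).

    Hypothetical helper: negative_scores is a list of the log-likelihood scores
    that the learned motif assigns to the random (negative) sequences.
    """
    if not negative_scores:
        raise ValueError("need at least one negative score")
    # round() guards against floating-point error in fpr * n; k is at least 1
    k = max(1, int(round(fpr * len(negative_scores))))
    ranked = sorted(negative_scores, reverse=True)
    return ranked[k - 1]


# Example with 100 made-up negative scores (0.0, 1.0, ..., 99.0) and FPR = 5%
negatives = [float(i) for i in range(100)]
threshold = score_threshold(negatives, 0.05)  # the 5th highest negative score
```

Whether a sequence that scores exactly at the threshold counts as containing the motif is a choice the procedure leaves open; requiring a score strictly above the threshold is the more conservative option.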
- The cm_#.tab files contain the learned motif model. What is their format?
The cm_#.tab files contain the parameters of the motif. They include two tables for each "state" of the motif: the transition matrix and the emission matrix. They are useful if you need to search for the motif in other sets. You can find documentation of the output in the Implementation Notes on the website.
- When I run the program I often get several identical motifs. What does this mean?
Each result you get corresponds to a different initial assignment to the model parameters. In many cases different initial assignments will converge to the same final motif during the learning process (the EM algorithm).
- Are the resulting motifs numbered according to their likelihood (1 better than 2)?
The motifs are numbered according to the selection process of initial assignments, and not according to their likelihood. You can easily calculate the likelihood of the final motif by summing up all the scores the motif assigns to the input sequences.
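The summation described above can be sketched as follows. The dictionary layout (motif number mapped to a list of per-sequence scores) is an assumption made for the example; RNApromo does not necessarily provide the scores in this form.

```python
def rank_motifs_by_likelihood(scores_by_motif):
    """Return (motif, total_score) pairs sorted from highest to lowest total.

    Hypothetical helper: scores_by_motif maps a motif number to the list of
    log-likelihood scores that motif assigns to the input sequences.
    """
    totals = {motif: sum(scores) for motif, scores in scores_by_motif.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for two motifs over the same two input sequences
example = {1: [-3.0, -2.0], 2: [-1.0, -1.5]}
ranking = rank_motifs_by_likelihood(example)  # motif 2 comes first here
```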
- RNApromo finds a lot of motifs. How should I decide whether they are significant? Could you give an idea of what a good score is?
Each predicted motif is assigned an AUC score. Generally speaking, motifs that get a higher AUC score (closer to 1) are better. The AUC scores in the output of RNApromo are calculated for the positive and negative sets given as input (if you only provide a positive set, the negative set is generated as random sequences with the di-nucleotide distribution of the positive set). These will tend to be high, since the algorithm uses the positive set as a training set and we can therefore expect over-fitting. If you add up all the individual sequence scores you will get the log likelihood of your entire input set given the motif. The higher that score is, the likelier the motif is. However, these likelihood scores are really hard to compare across different input sets, and even across different motifs. For example, a longer motif can have a higher likelihood than a shorter motif simply because it is longer. Therefore there are some standard statistical scores used to compare motifs. One of them is the AUC score, which we used in the article. I would suggest selecting the motif with the highest AUC score. In some cases all motifs get the same AUC score, and you will need to apply other criteria to choose the best motif: you can choose the one that assigns the highest likelihood to the input set, or you can also take the size of the motif into consideration and select the motif that maximizes the BIC score, for example. You could also run the program with the -auc option to calculate a single AUC score for your set. This AUC is calculated using 5-fold cross validation, and is therefore more statistically meaningful than the AUC calculated for each motif separately, and less affected by over-fitting the input set. If this AUC score is high enough (typically above 0.65, but it is different for each input set) then the input set sequences probably share a common motif.
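For readers unfamiliar with the AUC, the sketch below shows the standard definition: the fraction of (positive, negative) sequence pairs in which the positive sequence gets the higher score, with ties counted as one half. It is meant only to illustrate what the number means; it is not taken from the RNApromo code, and the function name and list inputs are assumptions.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve from two lists of motif scores.

    Standard pairwise definition, shown for illustration only:
    P(positive score > negative score) + 0.5 * P(tie).
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up scores: perfect separation gives 1.0, no separation gives 0.5
perfect = auc([3.0, 2.0], [1.0, 0.0])
chance = auc([2.0, 0.0], [1.0, 1.0])
```

An AUC near 0.5 means the motif scores positive and negative sequences about equally, which is why values only slightly above 0.5 are not convincing.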
- We downloaded the program and ran it on some sequences. What is the format of the output file we get?
The AUC file simply contains the motif serial number (typically 1-5) in the first column, and the AUC score in the second column.
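A minimal sketch of reading such a two-column file and picking the motif with the highest AUC is shown below. It assumes the columns are separated by whitespace (tabs or spaces) and skips any line that does not parse as a number pair; both the delimiter and the helper names are assumptions, so check them against your actual file.

```python
def parse_auc_lines(lines):
    """Map motif serial number -> AUC score from the lines of an AUC file.

    Assumes whitespace-separated columns: motif number first, AUC second.
    """
    aucs = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or incomplete line
        try:
            aucs[int(fields[0])] = float(fields[1])
        except ValueError:
            continue  # header or malformed line
    return aucs


def best_motif(aucs):
    """Return the motif number with the highest AUC score."""
    return max(aucs, key=aucs.get)


# Made-up file contents for illustration
example_lines = ["1\t0.81", "2\t0.93", "", "3\t0.77"]
scores = parse_auc_lines(example_lines)
top = best_motif(scores)
```

To use it on a real file, pass an open file object in place of `example_lines`, since iterating over a file yields its lines.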
- RNApromo gives me the area under the ROC curve (AUC). In the paper, you quoted p-values in addition to AUCs for all discoveries. Are these calculated by RNApromo as well? If so, where can I find them in the output?
The AUC scores in the output of RNApromo are calculated for the positive and negative sets given as input (if you only provide a positive set, the negative set is generated as random sequences with the di-nucleotide distribution of the positive set). These will tend to be high, since the algorithm uses the positive set as a training set and we can therefore expect over-fitting. In the paper we calculated the AUC scores using 5- or 10-fold cross validation of the input set, and got much lower AUC scores. This option is unfortunately not part of the website, but if you download the executable provided on the website, you can ask the program to calculate the cross validation AUC scores (note that these are calculated for an input set, and not for a motif). The p-values in the article were calculated using a background distribution of AUC scores (obtained by selecting random 3'UTRs and treating them as the positive set, which is very similar to your null case trials). These should be calculated for each organism separately, and therefore cannot be calculated directly by RNApromo. I cannot say what AUC scores to expect, since this differs between organisms, depending on base composition and the average length of UTRs in the organism.
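If you build such a background distribution yourself, one common way to turn it into a p-value is the empirical fraction of background AUC scores that are at least as high as the observed one. The sketch below uses that convention with a +1 pseudocount so the p-value is never exactly zero; this is an assumption for illustration and not necessarily the exact calculation used in the article.

```python
def empirical_p_value(observed_auc, background_aucs):
    """Fraction of background AUC scores >= the observed AUC (with pseudocount).

    Hypothetical helper: background_aucs are cross-validation AUC scores
    obtained from randomly selected sets of 3'UTRs from the same organism.
    """
    if not background_aucs:
        raise ValueError("need at least one background AUC score")
    at_least_as_high = sum(1 for a in background_aucs if a >= observed_auc)
    return (at_least_as_high + 1) / (len(background_aucs) + 1)


# Made-up background of four random-set AUC scores
p = empirical_p_value(0.65, [0.50, 0.55, 0.60, 0.70])
```

The smallest p-value this can report is 1 / (N + 1) for N background sets, so the number of random sets you run limits the significance you can claim.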