The following FAQ refers to Learn Motifs from Unaligned Sequences.
About the input
- Can I submit only a positive set?
Yes. Choose the option "Use randomization of input sequences as negative set". Each positive sequence you provide is randomly permuted to generate a negative sequence, so the resulting negative set is the same size as the positive set and maintains its mono-nucleotide distribution.
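As an illustration of what such a randomization does, here is a minimal Python sketch. It is not the server's own code: the function name `shuffled_negative_set` and the `seed` parameter are hypothetical, and it only assumes that each sequence is permuted independently.

```python
import random
from collections import Counter


def shuffled_negative_set(positives, seed=None):
    """Return one randomly permuted copy of each positive sequence.

    Hypothetical helper for illustration only. Permuting the letters of a
    sequence keeps its length and its mono-nucleotide counts unchanged.
    """
    rng = random.Random(seed)
    negatives = []
    for seq in positives:
        letters = list(seq)
        rng.shuffle(letters)  # in-place random permutation
        negatives.append("".join(letters))
    return negatives


positives = ["ACGTACGTGG", "TTTACGCGAA"]
negatives = shuffled_negative_set(positives, seed=0)

# Each negative sequence has the same nucleotide counts as its positive source.
for pos, neg in zip(positives, negatives):
    print(Counter(pos) == Counter(neg))
```

Note that a simple permutation like this preserves mono-nucleotide counts only; di-nucleotide frequencies are generally not preserved.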
- I have a positive set. How do I generate a negative set?
First, you can use the "Use randomization of input sequences as negative set" option mentioned above. Second, you may select a random set of sequences from the genome. If you do so, make sure that the positive sequences are excluded from it, and generate a set that is at least as large as the positive set. We recommend that the negative set preserve the mono- and di-nucleotide distributions of the positive set as much as possible. Third, if your positive set contains the highest-ranking sequences from a ChIP-chip experiment, you can similarly choose the lowest-ranking sequences as a negative set.
About the output
- Each motif has a p-value. What is this p-value?
The motif finder finds discriminatively enriched sets of k-mers, called K-mer set Motif Models (KMMs). The KMM enrichment measure is the set's multidimensional hypergeometric p-value (for more details, see our paper). Based on each KMM, the motif finder extracts aligned putative binding sites, from which an FMM and a PSSM can be learned. We therefore assign the multidimensional hypergeometric p-value of the KMM to both the FMM and the PSSM derived from it.
- What are the GXW output files?
The GXW files contain the XML representations of the motifs. In the FMM case they specify the learned features and their weights. Note that in the FMM GXW files the position indices are 0-based, while they are 1-based in the logo.
- Can the motif finder also detail where putative binding sites are located on my data sequences?
Yes. The motif finder also outputs a list of the best motif hits in the positive sequences (an Excel sheet). For each hit it states the sequence in which the hit was found, the coordinates of the hit within that sequence (reversed if the hit is on the minus strand), the motif for which the hit was found, and a hit score.
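If you process the hit list with a script, the strand can be recovered from the coordinate order. The following Python sketch assumes that "reversed" means the first coordinate is larger than the second for minus-strand hits; the function name `describe_hit` and the returned field names are hypothetical and are not part of the output format.

```python
def describe_hit(start, end):
    """Infer the strand of a hit from the order of its coordinates.

    Assumption (not confirmed by the output specification): a minus-strand
    hit is reported with start > end, a plus-strand hit with start <= end.
    """
    if start > end:
        return {"strand": "-", "left": end, "right": start}
    return {"strand": "+", "left": start, "right": end}


print(describe_hit(10, 17))  # plus-strand hit
print(describe_hit(17, 10))  # same location, reported on the minus strand
```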
- What does the hit score value mean? What does a negative score mean?
For each motif, the score at a given hit location is calculated as follows: SCORE = -1 / log2(motif hit probability), where log2 is the base-2 logarithm. Suppose that the highest-scoring hit of motif M has score S. We linearly normalize all scores for M by dividing them by S, so that all scores lie between 0 and 1. Note that this is done independently for each motif. To mark hits on the minus strand, we then make their scores negative (multiply by -1). A negative score therefore indicates a minus-strand hit, not a poor match.
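The calculation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the server's implementation: the function names and the `(probability, strand)` input format are hypothetical, hit probabilities are assumed to lie strictly between 0 and 1, and the maximum S is assumed to be taken over the hits of one motif on both strands.

```python
import math


def raw_score(hit_probability):
    """SCORE = -1 / log2(p); requires 0 < p < 1 so that log2(p) is negative."""
    return -1.0 / math.log2(hit_probability)


def normalize_scores(hits):
    """Normalize the scores of one motif's hits and sign them by strand.

    `hits` is a list of (probability, strand) pairs, with strand "+" or "-".
    Scores are divided by the motif's highest raw score, then negated for
    minus-strand hits.
    """
    raw = [raw_score(p) for p, _ in hits]
    top = max(raw)
    return [
        -(r / top) if strand == "-" else r / top
        for r, (_, strand) in zip(raw, hits)
    ]


hits = [(0.5, "+"), (0.25, "-"), (0.125, "+")]
normalized = normalize_scores(hits)
print(normalized)
```

In this example the first hit has the highest raw score, so it is normalized to 1; the second hit receives a negative value because it lies on the minus strand.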
- Regarding motif hits, what does it mean: "View in Genomica"?
Genomica is an analysis and visualization tool for genomic data, developed in our lab. A link to its download webpage is provided with the output. We provide a GXP file containing the motif hits data, which can be opened in Genomica. As a quick guide: after you open this file in Genomica, choose Chromosomes -> View Tracks..., and then choose the "Best Motif Hits" track. Plus-strand hits are shown in red, and minus-strand hits are shown in green.