Data Format:
============
The datasets are provided in Consan's preferred format,
STOCKHOLM 1.0.  See XXX for more information on the
STOCKHOLM format.

Conus uses the style of structure notation with base pairing
symbols '>' and '<'.  [For a particular pair the symbols point
to each other.]

Technially we are corrupting the STOCKHOLM format by using it to
annotate and represent single sequences (instead of the multiple
sequence alignments for which it was intended) as well.


Structure format: (STOCHOLM 1.0)
--------------------------------
Using STOCKOLM format to annotate alignments:
Example:

# STOCKHOLM 1.0

#=GS RD0260 DE    GUC PHAGE T5           VIRUS
#=GS RE6781 DE    CPC HORDEUM VULGARE    CY PLA

RD0260             GCGACCGGGGCUGGCUUGGUAAUGGUACUCCCCUGUCACGGGAGAGAAUG
RE6781             UCCGUCGUAGUCUAGGUGGUUAGGAUACUCGGCUCUCACCCGAGAGA-CC
#=GC SS_cons       >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>

RD0260             UGGGUUCAAAUCCCAUCGGUCGCGCCA
RE6781             CGGGUUCGAGUCCCGGCGACGGAACCA
#=GC SS_cons       >>>>.......<<<<<<<<<<<<....
//


Using STOCKOLM format to annotate single sequences:
Example:

# STOCKHOLM 1.0
#=GS DA252 DE DA2521 trna

DA252		GGGGGUAUAGCUCAGUUGGUAGAGCGCUGCCUUUGCAAGGCAGAUGUCAG
#=GR DA252 SS	>>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>

DA252		CGGUUCGAGUCCGCUUUAUCUCCA
#=GR DA252 SS	>>>.......<<<<<.<<<<<<<.
//

The DataSets:
=============
The sequences in the training set (mixed80.stk) were obtained
from the European Ribosomal Database in December 2002.  They
were culled to remove sequences containing more than 5% ambiguous
bases or less than 40% base pairing.  The resulting set was
filtered at 80% identity.

For comparing parameterizations, the Rfam.v7.pub80.stk.

Random 100 sequence in 5S and tRNA family for comparison to
single sequence methods R100.stk.  The same as R100 but with
an informant sequence R100.pairs.stk.

The percent identity balanced set, used as the primary testing
set (percid.stk).

The Dynalign dataset is not redistributable, please contact
David Mathews (David_Mathews@urmc.rochester.edu).

The Stemloc dataset is extracted from Rfam v6.1 to reproduce
the dataset described in Holmes (2005).