RECON: a package for automated de novo identification of repeat families from genomic sequences

Update

08/07/02, bug fix, RECON1.05 is available. Thanks to John Williams for the bug report. The bug only affects you if you used MSPCollect.pl to reformat your initial all-vs-all BLASTN results AND your BLASTN output had scores in scientific notation (e.g., Score = 1.23e+04).

08/05/02, User Tips added.

07/19/02, bug fix, RECON1.04 is available. Thanks to Dr. Gabor Toth for the bug report. The bug does not affect your results if your initial all-vs-all comparison produced no self hits. If it did produce self hits, the program would have crashed.

04/09/02, bug fix, RECON1.03 is available.

02/27/02, A slightly updated version, RECON1.02 is available.

Description

Proper identification of repetitive sequences is an essential step in genome analysis. The RECON package performs de novo identification and classification of repeat sequence families from genomic sequences. The underlying algorithm is based on extensions to the usual approach of single linkage clustering of local pairwise alignments between genomic sequences. Specifically, our extensions use multiple alignment information to define the boundaries of individual copies of the repeats and to distinguish homologous but distinct repeat element families. RECON should be useful for first-pass automatic classification of repeats in newly sequenced genomes.

Download Source Code

RECON is implemented in C and Perl. The source code is freely available under the GNU General Public License. The tar-ball also includes a demo and instructions on installation.

Requirements

There are no specific system requirements, as long as the system supports C and Perl.

Reference

Bao Z. and Eddy S.R. (2002) Automated de novo Identification of Repeat Sequence Families in Sequenced Genomes. Genome Research, 12:1269-1276.

User Tips

Organize the initial all-vs-all pairwise comparison carefully. Doing so may save up to half of the running time. In particular, avoid collecting self hits. Self hits can slow down RECON significantly. In addition, if you are using RECON1.03 or earlier, they may also cause the program to crash.
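The tip above can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of RECON: the `Hit` record, its field names, and both helper functions are hypothetical, and a "self hit" is taken here to mean a sequence aligned to itself over identical coordinates. Running each unordered pair of sequence chunks only once is one plausible reading of how the comparison can be organized to save time.

```python
from itertools import combinations_with_replacement
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """Hypothetical record for one pairwise alignment (not RECON's own format)."""
    query: str
    q_start: int
    q_end: int
    subject: str
    s_start: int
    s_end: int


def is_self_hit(hit: Hit) -> bool:
    # Assumption: a self hit is a sequence aligned to itself over the same span.
    return (
        hit.query == hit.subject
        and hit.q_start == hit.s_start
        and hit.q_end == hit.s_end
    )


def drop_self_hits(hits: Iterable[Hit]) -> List[Hit]:
    """Keep every alignment except the trivial self-to-self ones."""
    return [h for h in hits if not is_self_hit(h)]


def unordered_pairs(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield each pair of chunks once, so A-vs-B is not repeated as B-vs-A."""
    return combinations_with_replacement(list(chunks), 2)
```

With n chunks, `unordered_pairs` schedules n(n+1)/2 comparisons instead of n*n, which approaches half the work as n grows. Hits between two different regions of the same sequence are kept, since those are genuine repeat copies.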

You may need to rename your input sequences. Some sequence names are not properly recognized by the program for no apparent reason. A safe choice would be something like "seq123456", i.e., the string "seq" followed by a number. I'm working on this one.
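A minimal Python sketch of such a renaming step is shown below. It assumes the input is in FASTA format (header lines starting with ">"). The function name, the six-digit zero padding, and the returned mapping are illustrative choices, not something RECON requires.

```python
from typing import Dict, Iterable, List, Tuple


def rename_fasta_headers(
    lines: Iterable[str], prefix: str = "seq"
) -> Tuple[List[str], Dict[str, str]]:
    """Replace each FASTA header with prefix + running number.

    Returns the renamed lines and a mapping from each new name to the
    original header text, so results can be traced back afterwards.
    """
    renamed: List[str] = []
    mapping: Dict[str, str] = {}
    counter = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            counter += 1
            new_name = f"{prefix}{counter:06d}"
            mapping[new_name] = line[1:].strip()  # original header without ">"
            renamed.append(">" + new_name)
        else:
            renamed.append(line)  # sequence lines pass through unchanged
    return renamed, mapping
```

Keeping the mapping makes it possible to translate the safe names back to the original sequence names once the run is finished.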

Typically, I only focus on families with >= 10 copies. I would build consensus sequences for these families, then use the consensuses to annotate the genomic sequences (using RepeatMasker). This way, I can recover older/more divergent members of the families which were not detected in my initial all-vs-all BLAST.
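The copy-number cutoff can be sketched in Python as below. The input is assumed to be a plain list holding one family ID per identified element. How that list is extracted from RECON's output files is not shown here, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def families_with_min_copies(
    family_ids: Iterable[Hashable], min_copies: int = 10
) -> Dict[Hashable, int]:
    """Count elements per family and keep families with >= min_copies copies."""
    counts = Counter(family_ids)
    return {fam: n for fam, n in counts.items() if n >= min_copies}
```

The families returned here would be the ones for which consensus sequences are built. Consensus building and the RepeatMasker run are separate steps outside this sketch.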

If you have more than 30 to 50 Mb of sequence, you should consider taking an incremental approach as described in the RECON paper.

Additional Material

Having a good assessment method is almost as important as having a good solution. Assessing repeat identification is more complicated than the usual measurement of false positive and false negative rates. We provide our treatment of the problem here.

Contact

Comments? Bugs? Please email zhirongbao@gmail.com

Zhirong Bao

First Setup 01/09/02
