--------------------------------------------------------------
tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence

Todd Lowe (1) & Sean Eddy (2)

(1) Dept. of Genetics, Stanford University, Palo Alto, CA
(2) Dept. of Genetics, Washington U. School of Medicine, St. Louis
--------------------------------------------------------------

Current release: 1.21 (October 2000)

Note: An HTML version of this manual can be found on the web at
http://genome.wustl.edu/lowe/tRNAscan-SE-Manual/Manual.html

1. Introduction

A. Brief Description

tRNAscan-SE identifies transfer RNA genes in genomic DNA or RNA sequences. It combines the specificity of the Cove probabilistic RNA prediction package (Eddy & Durbin, 1994) with the speed and sensitivity of tRNAscan 1.3 (Fichant & Burks, 1991) plus an implementation of an algorithm described by Pavesi and colleagues (1994) which searches for eukaryotic pol III tRNA promoters (our implementation is referred to as EufindtRNA). tRNAscan and EufindtRNA are used as first-pass prefilters to identify "candidate" tRNA regions of the sequence. These subsequences are then passed to Cove for further analysis, and output if Cove confirms the initial tRNA prediction.

In this way, tRNAscan-SE attains the best of both worlds:

(1) a false positive rate of less than one per 15 billion nucleotides of random sequence,
(2) the combined sensitivities of tRNAscan and EufindtRNA (detection of 99% of true tRNAs), and
(3) a search speed 1,000 to 3,000 times faster than Cove analysis and 30 to 90 times faster than the original tRNAscan 1.3 (tRNAscan-SE uses both a code-optimized version of tRNAscan 1.3 which gives a 650-fold increase in speed, and a fast C implementation of the Pavesi et al. algorithm).

This program and results of its analysis of a number of genomes have been published in Nucleic Acids Research (4).

B. What is included in this package?
This distribution includes the PERL script tRNAscan-SE, all the files necessary to compile and run the complete COVE package (version 2.4.4), all the files necessary to compile and run the modified version of tRNAscan (version 1.4), and all the files needed to compile and run EufindtRNA 1.0. The Cove programs, tRNAscan 1.4, and EufindtRNA are included for use with the tRNAscan-SE program, but may also be run as stand-alone programs. Installation of the PERL (Practical Extraction and Report Language, Larry Wall) interpreter package version 5.0 or later is required to run the tRNAscan-SE PERL script.

C. Getting Started

First, see the INSTALL file for installation instructions. Once the program is installed, the user may wish to work through several of the included examples (Section 6 of this document) to get a quick feel for the program's operation and some of the most commonly used command line options. A description of the default run mode and output appears in Section 4, and a description of each of the program options is included in Section 5 of this document.

D. Intended Use

tRNAscan-SE was designed to make rapid, sensitive searches of genomic sequence feasible using the selectivity of the Cove analysis package. We have optimized search sensitivity with eukaryote cytoplasmic & eubacterial sequences, but it may be applied more broadly with a slight reduction in sensitivity.

E. Web Resources

For small-scale users and those who are unable to install tRNAscan-SE on a local UNIX platform, a web-based version of the program is available for on-line tRNA analysis at "http://genome.wustl.edu/eddy/tRNAscan-SE/". All of the most frequently used options are available in the web-based version. Links to the most recent release of the program and the tRNAscan-SE Genomic tRNA Database are also available from this page.

2.
Methods

tRNAscan-SE does no tRNA detection itself, but instead combines the strengths of three independent tRNA prediction programs by negotiating the flow of information between them, performing a limited amount of post-processing, and outputting the results.

The program works in three main phases. In the first stage, it runs two independent tRNA detection programs on the input DNA sequence. These relatively fast, first-pass detection programs are a modified, optimized version of tRNAscan 1.3 (1), and EufindtRNA, an implementation of another previously described tRNA search algorithm (3).

tRNAscan 1.3 detects tRNAs by initially looking for short, well conserved intragenic promoter sequences (A & B boxes in eukaryotes) found in the TPC and D arm regions of prototypic tRNAs. Once a specific number of nucleotides in the sequence match the consensus promoter (defined by an arbitrary score threshold), the program then progressively attempts to identify the various stem-loop structures found in the tRNA "clover leaf". As each arm is identified by the presence of base-pairing in the stem, correct loop size, and several invariant and semi-invariant bases, a "general score" counter is incremented. If the final score exceeds an empirically determined threshold, the tRNA location, anticodon, and type are saved.

EufindtRNA, on the other hand, only searches for linear sequence signals. A step-wise algorithm uses newly developed log-odds score matrices to first identify A and B box promoter elements that exceed an empirically determined cutoff. The scores for these A and B boxes are then added to a log-odds score for the nucleotide distance between the A and B boxes to produce an intermediate score. Finally, a log-odds score for the distance to the nearest downstream poly-T pol III termination signal is added to the intermediate score to obtain a final score. If the final score is above a final score cutoff, the tRNA identity and location are saved.
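The additive scoring described for EufindtRNA can be illustrated with a minimal Python sketch. This is not the EufindtRNA source code: the function names, the two-position toy matrix, the uniform background frequency, and the choice of base-2 logarithms are all assumptions made for illustration, and the real A and B box matrices and cutoffs are not reproduced here.

```python
import math


def log_odds(p_site, p_background=0.25):
    """Log-odds score of one base at one position.

    Base-2 logs and a uniform 0.25 background are assumptions for this sketch.
    """
    return math.log2(p_site / p_background)


def score_box(site, matrix):
    """Sum the per-position log-odds scores for a candidate box sequence."""
    if len(site) != len(matrix):
        raise ValueError("site length must match matrix length")
    return sum(column[base] for base, column in zip(site, matrix))


def combine_scores(a_box, b_box, spacing, terminator=None):
    """Add the component scores in the order described in the text.

    Without a terminator score the intermediate score is returned;
    with one, the final score (intermediate + terminator) is returned.
    """
    intermediate = a_box + b_box + spacing
    if terminator is None:
        return intermediate
    return intermediate + terminator


# Hypothetical two-position matrix; real box matrices are much longer.
TOY_MATRIX = [
    {"A": log_odds(0.5), "C": log_odds(0.125), "G": log_odds(0.25), "T": log_odds(0.125)},
    {"A": log_odds(0.125), "C": log_odds(0.125), "G": log_odds(0.5), "T": log_odds(0.25)},
]

if __name__ == "__main__":
    box_score = score_box("AG", TOY_MATRIX)
    print("toy box score:", box_score)
    print("intermediate:", combine_scores(box_score, 1.0, -0.5))
    print("final:", combine_scores(box_score, 1.0, -0.5, terminator=1.5))
```

Because every component is a log-odds value, the components can simply be summed, and each stage can be compared against its own cutoff before the next component is added.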
tRNAscan-SE uses a less selective version of this algorithm that does not look for pol III termination signals, and thus uses the intermediate score as the final cutoff. Also, the intermediate score cutoff is loosened slightly relative to the intermediate cutoff described in the original algorithm (3). These modifications increase the algorithm's sensitivity but greatly reduce EufindtRNA's selectivity. This does not reduce the final selectivity of tRNAscan-SE, since a secondary filter (Cove) is used to eliminate false positives. The sensitivity of EufindtRNA is roughly comparable to that of tRNAscan 1.3, but the two appear to be complementary in that EufindtRNA tends to identify tRNAs missed by tRNAscan 1.3 and vice versa (3). tRNAscan-SE takes advantage of this fact: it saves results from both tRNAscan 1.3 and EufindtRNA, then merges them into one list of non-redundant "candidate" tRNA identifications.

In the second stage, tRNAscan-SE extracts the DNA subsequences identified as possible tRNAs and passes only these segments to an RNA search program in the Cove program suite (covels) for analysis. Cove programs look for tRNAs in a very different way. A probabilistic model for tRNA has been developed by aligning known tRNAs and giving a base-specific probability score to every nucleotide in the tRNA model. Also, Cove uses a special method for capturing RNA secondary structure information using a type of language referred to as a stochastic context-free grammar (SCFG). Cove applies this probabilistic model to the entire windowed sequence, and produces a probability score that the sequence matches the tRNA model. If the score exceeds 20.0 bits, the candidate is considered a true tRNA (based on empirical studies in ref. 2).

In the final phase, tRNAscan-SE takes the confirmed tRNAs and runs another Cove program (coves) that displays RNA secondary structure. The tRNA type is predicted by identifying the anticodon within the structure output.
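As an illustration of how an anticodon determines the tRNA type, the following Python sketch reverse-complements the anticodon to obtain the codon it pairs with and looks that codon up in the standard genetic code. It is a simplified assumption-based example, not the logic used inside tRNAscan-SE: the function names are hypothetical, only the standard code is used, and special cases such as selenocysteine or suppressor tRNAs are not handled.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Stop",  # label chosen for this sketch only
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def anticodon_to_codon(anticodon):
    """Return the codon (5'->3', DNA alphabet) that pairs with an anticodon (5'->3')."""
    dna = anticodon.upper().replace("U", "T")
    return dna.translate(_COMPLEMENT)[::-1]


def predict_type(anticodon):
    """Three-letter amino acid for an anticodon under the standard genetic code."""
    return THREE_LETTER[CODON_TABLE[anticodon_to_codon(anticodon)]]


if __name__ == "__main__":
    for anticodon in ("CAA", "AGA", "GAA", "CGG"):
        print(anticodon, anticodon_to_codon(anticodon), predict_type(anticodon))
```

The four anticodons in the usage loop are the ones shown in the sample tabular output in Section 4, so the printed types can be compared against the 'tRNA Type' column there.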
Introns are also automatically identified from the structure output as runs of five or more consecutive non-consensus nucleotides within the anticodon loop.

tRNAscan-SE uses heuristics to try to distinguish pseudogenes from true tRNAs, primarily based on a lack of tRNA-like secondary structure. A second tRNA covariance model was created from the original 1415-tRNA alignment, under the constraint that no secondary structure is conserved (this model is effectively just a sequence profile, or hidden Markov model). By subtracting a tRNA's similarity score to the primary structure-only model from its score using the complete tRNA model, a secondary structure-only score is obtained. We have observed that tRNAs with low scores for either component of the total score were often pseudogenes. Thus, tRNAs are marked as likely pseudogenes if they have either a score of less than 10 bits for the primary sequence component of the total score, or a score of less than 5 bits for the secondary structure component of the total score. Selenocysteine tRNAs are not checked by these rules since they have atypical primary and secondary structure. Also, use of the -O option (search for organellar tRNAs) disables pseudogene checking since these criteria are geared towards detecting cytoplasmic pseudogenes (some true non-eukaryotic tRNAs are marked as pseudogenes by this analysis).

Final tRNA predictions are then saved in tabular, ACeDB, or secondary structure output format. For more details on the program algorithm & implementation, see the Nucleic Acids Research paper (4).

3. Performance / Requirements

Performance will obviously vary depending on the machine architecture, memory, and OS efficiency. The examples included in this document were run on a Silicon Graphics Indigo2 R4400-200 running IRIX 5.3, with over 32Mb of memory. tRNAscan-SE runs at approximately 20,000 to 45,000 bp/sec in its default operation mode on this machine.

4.
Default Program Operation

Invoking tRNAscan-SE:

The program is invoked by giving it a series of optional command line parameters, then a list of one or more sequence files written in the FASTA format (see appendix A for an example of FASTA format):

tRNAscan-SE [-options] <FASTA file(s)>

By default, the header credits and selected command-line options are printed to the screen via standard error, followed by the final results of the tRNA search written to standard output in a tabular format (see below).

By default, tRNAscan-SE searches for eukaryotic cytoplasmic tRNAs. To search for bacterial, archaeal, or organellar tRNAs, use search mode options -B, -A, or -O, respectively. If the sequences are from more than one phylogenetic domain, the general tRNA model (option -G) may be used with minimal loss of sensitivity and selectivity (the publication describing tRNAscan-SE used the general tRNA model exclusively, ref. 4).

Sequence            tRNA Bounds     tRNA    Anti    Intron Bounds   Cove
Name        tRNA #  Begin   End     Type    Codon   Begin   End     Score
--------    ------  -----   ---     ----    -----   -----   -----   -----
CELF22B7    1       12619   12738   Leu     CAA     12657   12692   60.01
CELF22B7    2       19480   19561   Ser     AGA     0       0       80.44
CELF22B7    3       26367   26439   Phe     GAA     0       0       80.32
CELF22B7    4       26992   26920   Phe     GAA     0       0       80.32
CELF22B7    5       23765   23694   Pro     CGG     0       0       75.76

Each new tRNA in a sequence is consecutively numbered in the 'tRNA #' column. 'tRNA Bounds' specify the starting (5') and ending (3') nucleotide bounds for the tRNA. tRNAs found on the reverse (lower) strand are indicated by having the Begin (5') bound greater than the End (3') bound (see tRNAs #4 & #5 in the output above).

The 'tRNA Type' is the predicted amino acid charged to the tRNA molecule based on the predicted 'Anticodon' (written 5'->3') displayed in the next column. tRNAs that fit the criteria for potential pseudogenes (poor primary or secondary structure, discussed in Methods) will be marked with "Pseudo" in the 'tRNA Type' column.
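For users who post-process the tabular output, the following Python sketch shows one way to read the rows and recover the strand from the bounds. It is an illustration only: the class and function names are hypothetical, and it assumes whitespace-separated rows with the nine columns in the order shown in the sample table above.

```python
from dataclasses import dataclass


@dataclass
class TrnaHit:
    seq_name: str
    number: int
    begin: int
    end: int
    trna_type: str
    anticodon: str
    intron_begin: int
    intron_end: int
    score: float

    @property
    def strand(self):
        # Reverse-strand hits have Begin (5') greater than End (3').
        return "-" if self.begin > self.end else "+"

    @property
    def length(self):
        return abs(self.end - self.begin) + 1

    @property
    def has_intron(self):
        # Both intron columns are zero when no intron is predicted.
        return not (self.intron_begin == 0 and self.intron_end == 0)

    @property
    def is_pseudo(self):
        return "Pseudo" in self.trna_type


def parse_table(text):
    """Parse data rows, skipping header, ruler, and blank lines."""
    hits = []
    for line in text.splitlines():
        f = line.split()
        # Data rows have nine fields and a numeric 'tRNA #' column.
        if len(f) == 9 and f[1].isdigit():
            hits.append(TrnaHit(f[0], int(f[1]), int(f[2]), int(f[3]),
                                f[4], f[5], int(f[6]), int(f[7]), float(f[8])))
    return hits


SAMPLE = """\
Name        tRNA #  Begin   End     Type    Codon   Begin   End     Score
--------    ------  -----   ---     ----    -----   -----   -----   -----
CELF22B7    1       12619   12738   Leu     CAA     12657   12692   60.01
CELF22B7    4       26992   26920   Phe     GAA     0       0       80.32
"""

if __name__ == "__main__":
    for hit in parse_table(SAMPLE):
        print(hit.seq_name, hit.number, hit.strand, hit.length, hit.has_intron)
```

Keeping the original Begin and End values and deriving the strand from them avoids losing the 5'/3' orientation that the output encodes.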
If there is a predicted intron in the tRNA, the next two columns indicate the nucleotide bounds. If there is no predicted intron, both of these columns contain zero.

The final column is the Cove score for the tRNA in bits. Note that this score will vary somewhat depending on the particular tRNA covariance model used in the analysis (the search mode selects which tRNA covariance model will be used: eukaryote-specific, prokaryote-specific, archaea-specific, or general). tRNAscan-SE counts any sequence that attains a score of 20.0 bits or larger as a tRNA (based on empirical studies conducted by Eddy & Durbin in ref. 2).

Temporary files:

In the course of program execution, several temporary files are written to and deleted from the 'TEMPDIR' directory specified in the Makefile on installing the program. Alternatively, the environment variable 'TMPDIR' can be set to another directory, which will override the temporary directory specified in the Makefile. For the average user, /tmp should work fine as the temp file directory. For sequencing centers or users scanning very large individual sequences (>1MBp), or many sequences at once (>4 instances of tRNAscan-SE at once), it might be advisable to use /usr/tmp or some other temporary directory that has sufficient free disk space (at least 10MB free).

Note: If multiple FASTA files are specified on the command line, tRNAscan-SE creates a temporary file 'tscan.mseq' in which all sequence files are concatenated together for ease of processing. Because of this, the temporary directory must have enough room to temporarily save a copy of all the sequence files (at once) that have been specified on the command line. This may be a problem for "power users" who may conceivably scan an entire directory of cosmids totalling many MBp of sequence. In these cases, I would advise the user to either run a smaller set of sequences at once, or make sure the TEMPDIR can handle the large '.mseq' temporary file.

5. Command-line Options

A.
Search Mode Options

By default, the eukaryotic tRNA model is used for tRNA analysis. To select an alternate tRNA model for sequences from other sources (other phylogenetic domains or mitochondria/chloroplasts), use one of the following options:

-B : search for bacterial tRNAs

This option selects the bacterial covariance model for tRNA analysis, and loosens the search parameters for EufindtRNA to improve detection of bacterial tRNAs. Use of this mode with bacterial sequences will also improve bounds prediction of the 3' end (the terminal CCA triplet).

-A : search for archaeal tRNAs

This option selects an archaeal-specific covariance model for tRNA analysis, and slightly loosens the EufindtRNA search cutoffs.

-O : search for organellar (mitochondrial/chloroplast) tRNAs

This parameter bypasses the fast first-pass scanners, which are poor at detecting organellar tRNAs, and runs Cove analysis only. Since true organellar tRNAs have been found to have Cove scores between 15 and 20 bits, the search cutoff is lowered from 20 to 15 bits. Also, pseudogene checking is disabled since it is only applicable to eukaryotic cytoplasmic tRNA pseudogenes. Since Cove-only mode is used, searches will be very slow (see the -C option below) relative to the default mode.

-G : use general tRNA model

This option selects the general tRNA covariance model that was trained on tRNAs from all three phylogenetic domains (archaea, bacteria, & eukarya). This mode can be used when analyzing a mixed collection of sequences from more than one phylogenetic domain, with only slight loss of sensitivity and selectivity. The original publication describing this program and tRNAscan-SE version 1.0 used this general tRNA model exclusively. If you wish to compare scores to those found in the paper or to scans using v1.0, use this option. Use of this option is compatible with all other search mode options described in this section.
-C : search using Cove analysis only (max sensitivity, slow)

Directs tRNAscan-SE to analyze sequences using Cove analysis only. This option allows a slightly more sensitive search than the default tRNAscan + EufindtRNA -> Cove mode, but is much slower (by approx. 250 to 3,000 fold). Output format and other program defaults are otherwise identical to the normal analysis.

-H : show both primary & secondary structure score components to covariance model bit scores

This option displays the breakdown of the two components of the covariance model bit score. Since tRNA pseudogenes often have one very low component (good secondary structure but poor primary sequence similarity to the tRNA model, or vice versa), this information may be useful in deciding whether a low-scoring tRNA is likely to be a pseudogene. The heuristic pseudogene detection filter uses this information to flag possible pseudogenes; use this option to see why a hit is marked as a possible pseudogene. The user may wish to examine score breakdowns from known tRNAs in the organism of interest to get a frame of reference.

-D : disable pseudogene checking

Manually disable checking tRNAs for poor primary or secondary structure scores, which are often indicative of eukaryotic pseudogenes. This will slightly speed up the program & may be necessary for non-eukaryotic sequences that are flagged as possible pseudogenes but are known to be functional tRNAs.

B. Output Options

-o <file> : save final results in <file>

Specify this option to write results to <file> rather than standard output.

-f <file> : save results and Cove tRNA secondary structures to <file>

This option saves results and secondary structure information (as predicted by the coves program) in <file>. Use '$' in place of <file> to send to standard output.
An example of the output format for one tRNA appears below:

CELF22B7.trna4 (26992-26920)    Length: 73 bp
Type: Phe    Anticodon: GAA at 34-36 (26959-26957)    Score: 73.88
         *    |    *    |    *    |    *    |    *    |    *    |    *    |
Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
     ^     ^  ^              ^ ^               ^     ^               ^^     ^
     |     |  |              | |               |     |               ||     |
     +-----+  +--------------+ +---------------+     +---------------++-----+
        |       D-stem/loop        Anticodon           TPC stem/loop     |
        |                          stem/loop                             |
        |                                                                |
        +----------------------------------------------------------------+
                                 Isoacceptor stem

The first line contains the sequence name, tRNA number, tRNA bounds (in parentheses), and length of the tRNA. The next line contains the isoacceptor tRNA Type, Anticodon (with tRNA-relative and sequence-absolute bounds), and the Cove Score. This is the same information that appears in the tabular output format, apart from the anticodon bounds. The next line contains hash marks every 5 and 10 bp to ease position identification in the tRNA sequence that appears on the following line. On the sequence line, nucleotides matching the "consensus" tRNA model used in Cove analysis appear in upper case, while introns and other nucleotides in non-conserved positions are printed in lower-case letters. The last line contains the predicted secondary structure folding of the tRNA, with nested '>' and '<' symbols representing base pairings. The various tRNA features are labelled in this example.

-a : output results in ACeDB output format

This option allows results to be written in ACeDB format instead of the default tabular output format.

-m <file> : save statistics summary for run

This option directs tRNAscan-SE to write a brief summary to <file> which contains the run options selected as well as statistics on the number of tRNAs detected at each phase of the search, search speed, and other statistics.
The following is a description of each of these statistics, followed by an example stats summary file created from scanning the C. elegans cosmid F59C12: ==================== tRNAscan-SE run results (on host ) Started: