Computational genomics: the great game

Even if you don't understand a foreign language, you might get pretty far just by recognizing similarities. You don't have to speak Spanish to guess that oxígeno means oxygen. More distant languages are more difficult, but with more data and context (Welsh ocsigen, sylffwr, fflworin, alwminiwm, ffosfforws...), you start to take advantage of recurring patterns of change. Some baffling words (Welsh plwm) only start to make sense as comparisons across many languages reveal an evolutionary history of word gain, loss, and modification (Latin plumbum, English plumbing; Dutch leiden, English lode and lead). Even languages in unknown scripts can be methodically cracked. The mysterious script Linear B, found on clay tablets from Mycenaean Greece and Crete, was eventually cracked by statistical analysis and recognized as an early writing system for Greek.

We now have genome sequences for many thousands of species, from all across the evolutionary tree of life. By comparing biological sequences, we can recognize conserved genes and conserved regulatory regions. We can make tentative but useful connections to the "words" we've already deciphered in well-studied model organisms like mice, fruit flies, and yeast. As we study genome sequence data more and more systematically, with more and more powerful statistical methods, the meaning and history of genes and genomes are revealing themselves to us, all the way back to the origins of life.

Computational genomics is the greatest of unsolved cryptographic puzzles (better than Linear B, the Phaistos Disk, the Voynich Manuscript, the Kryptos sculpture): a grand collaborative attack on the alien language of life. It's only recently, with the rapid advance of genome sequencing technology, that we molecular biologists have been able to get the texts we needed to play the game.

Probabilistic models of biological sequences

We develop algorithms and software for identifying conserved sequences in large-scale genome sequence analysis. We use comparative genomics and deep multiple sequence alignments to learn statistical models of the conserved features of proteins and RNAs, in order to maximize our ability to identify distantly related sequences in genome sequence database searches. Our methods aim to identify relationships as far back as possible in evolutionary time, often to the last common ancestors of life on the planet, billions of years ago.
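As a rough illustration of what "learning a statistical model from an alignment" can mean, here is a minimal sketch of a position-specific scoring profile. It is a toy, not how HMMER or Infernal work: it assumes an ungapped DNA alignment, a uniform background distribution, and a simple add-one pseudocount, and the function names (build_profile, score, best_hit) are made up for this example.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: DNA, no gaps or ambiguity codes


def build_profile(alignment, pseudocount=1.0):
    """Turn an ungapped alignment into per-column log-odds scores (in bits).

    Each score compares the smoothed column frequency of a residue with a
    uniform background frequency (an assumption made for simplicity).
    """
    background = 1.0 / len(ALPHABET)
    total = len(alignment) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in ALPHABET
        })
    return profile


def score(profile, seq):
    """Sum the column scores for a sequence the same width as the profile."""
    return sum(col[res] for col, res in zip(profile, seq))


def best_hit(profile, target):
    """Slide the profile along a longer sequence; return (best score, offset)."""
    width = len(profile)
    return max(
        (score(profile, target[i:i + width]), i)
        for i in range(len(target) - width + 1)
    )


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGA", "ACGT", "TCGT"]  # invented data
    profile = build_profile(toy_alignment)
    print(best_hit(profile, "GGGACGTGGG"))
```

Conserved columns contribute large positive scores for the conserved residue and negative scores for everything else, so a sequence can score well overall even when it differs from every individual sequence in the alignment. Real profile models add insertions, deletions, and more careful priors on top of this idea.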

The mathematical underpinnings of our methods are typically probabilistic models of biological sequence and structure. Hidden Markov models (stochastic regular grammars) are useful for primary structure analysis of proteins and DNA. Stochastic context-free grammars are ideal for analysis of RNA secondary structure. We are exploring the use of Bayesian networks, statistical physics models called Potts models, and deep neural networks to capture even more complicated correlation structure in biological sequences.
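To make the hidden Markov model idea concrete, here is a small sketch of the forward algorithm, which sums the probability of a sequence over all possible hidden state paths. The two-state model (AT-rich versus GC-rich DNA) and all of its probabilities are invented for illustration, and the code assumes a non-empty sequence and no explicit end state.

```python
from itertools import product

# Hypothetical two-state model; every number here is made up for illustration.
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
EMIT = {
    "AT_rich": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "GC_rich": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def forward(seq):
    """P(seq | model), summed over all state paths by dynamic programming."""
    # f[s] = probability of the prefix seen so far, ending in state s.
    f = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for x in seq[1:]:
        f = {
            s: EMIT[s][x] * sum(f[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(f.values())


def brute_force(seq):
    """The same quantity by enumerating every path; exponential, for checking."""
    total = 0.0
    for path in product(STATES, repeat=len(seq)):
        p = START[path[0]] * EMIT[path[0]][seq[0]]
        for i in range(1, len(seq)):
            p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][seq[i]]
        total += p
    return total


if __name__ == "__main__":
    seq = "GCGCAT"
    print(forward(seq), brute_force(seq))
```

The dynamic programming version does work proportional to the sequence length times the square of the number of states, while the brute-force sum grows exponentially with length. For long sequences the products underflow, so practical implementations work in log space or rescale at each step.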

Theory and algorithm development has little impact on biology unless we can put useful tools and implementations in the hands of the research community. We put heavy emphasis on the engineering of practical and robust software tools, capturing and transferring the best theoretical results in genome sequence analysis, in areas including mathematics, algorithms, computational science, and statistical inference. Two of our best-known software tools are HMMER, for protein and DNA sequence alignment and database homology searches, and Infernal, a profile SCFG search tool for structural RNAs. In collaboration with the Pfam and Rfam Consortia, we participate closely in the development of the Pfam protein domain database and the Rfam RNA families database.

The modern RNA world

We are particularly interested in identifying novel structural and catalytic RNAs. The "ancient RNA world" hypothesis says that an ecosphere of RNA-based life preceded protein/DNA-based life. It is widely argued that many of the RNA genes (tRNA, rRNA, catalytic introns) that we see today are ancient relics of the RNA world. If this is true, we hope that we might be able to learn something about the origins of life by identifying new RNA genes and studying their evolutionary history. Screening for new RNA genes is an interesting challenge: classical genetics can identify new genes based on their functional phenotype, but not based on what material their product is made of. We are taking the approach of identifying new noncoding RNA genes by looking for them directly in genome sequence data, using computational genetics and algorithmic screens. What we seem to be finding is that the relic view of the RNA world is too pessimistic. Far from being a few scattered relics, RNAs are in fact in widespread use in modern organisms in a variety of roles. We have argued for a "modern RNA world" hypothesis: that many of the RNAs we see today are modern inventions, highly adapted to regulatory roles in complex organisms.