Probabilistic models of biological sequences

Much of our work involves algorithm development for computational genome sequence analysis. Many of the tools we develop are based on probabilistic models of biological sequence and structure. We tend to think about these models in the context of the Chomsky hierarchy of formal grammars. Hidden Markov models (stochastic regular grammars) are useful for primary structure analysis of proteins and DNA. Stochastic context-free grammars are ideal for analysis of RNA secondary structure.
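To make the idea of a hidden Markov model concrete, here is a minimal sketch in Python of Viterbi decoding for a toy two-state HMM over DNA. Everything in it is an assumption made for illustration: the state names (AT_rich, GC_rich), the probabilities, and the function name viterbi are hypothetical, and none of it is taken from HMMER or any other tool described here.

```python
import math

# Hypothetical two-state HMM over DNA (illustrative numbers only).
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMIT = {
    "AT_rich": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "GC_rich": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return (log probability, state path) for the most probable state path."""
    if not seq:
        return 0.0, []
    # score[s] holds the best log probability of any path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for symbol in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this position.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][symbol]))
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path
```

Calling viterbi("ATATATATGCGCGCGC") labels the first eight residues AT_rich and the last eight GC_rich. Working in log space avoids numerical underflow on long sequences. Profile HMMs for protein and DNA families have a much richer state architecture than this toy, but the dynamic programming idea is the same.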

Theory and algorithm development have little impact on biology unless useful tools and implementations reach the hands of the research community. We put heavy emphasis on engineering practical, robust software tools that capture the best theoretical results relevant to genome sequence analysis and transfer them into practice, drawing on mathematics, algorithms, computational science, and statistical inference. Two of our best-known software tools are HMMER, for protein and DNA sequence alignment and database homology searches, and Infernal, profile SCFG software for structure-based searches of RNAs. In collaboration with the Pfam and Rfam Consortia, we participate closely in the development of the Pfam protein domain database and the Rfam RNA families database.

The modern RNA world

We are particularly interested in identifying novel structural and catalytic RNAs. The "ancient RNA world" hypothesis says that an ecosphere of RNA-based life preceded protein/DNA-based life. It is widely argued that many of the RNA genes (tRNA, rRNA, catalytic introns) that we see today are ancient relics of the RNA world. If this is true, we might be able to learn something about the origins of life by identifying new RNA genes and studying their evolutionary history. Screening for new RNA genes is an interesting challenge: classical genetics can identify new genes based on their functional phenotype, but not based on what material their product is made of. Our approach is to look for new noncoding RNA genes directly in genome sequence data, using computational genetics and algorithmic screens. What we seem to be finding is that the RNA world model is pessimistic. Far from being a few scattered relics, RNAs are in widespread use in modern organisms in a variety of roles. We have argued for a "modern RNA world" hypothesis: that many of the RNAs we see today are modern inventions, highly adapted to regulatory roles in complex organisms.

How genomes build neural circuits

One of the enduring mysteries in biology is how the genomic specification of complex neural systems evolved: how low-level changes in genome sequence give rise to all the glorious variation in neural circuitry and behavior that selection acts upon. This is an area in which biology still largely lacks a quantitative language for asking precise questions. Our laboratory has begun to explore collaborations with neuroscientists working on the molecular regulatory specification of neural cell types in fly, worm, and mouse.