Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids

A tutorial introduction to hidden Markov models and other probabilistic modelling approaches in computational sequence analysis.


Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison.
Cambridge University Press, 1998.
356 pp.
ISBN 0-521-62041-4 (hardback)
ISBN 0-521-62971-3 (paperback)

The book is available online from various sources, including:
Cambridge University Press: [paperback];
Barnes and Noble: [paperback];
Amazon.com: [paperback] [hardback].

Errata

An up-to-date list of errata is here.

If you discover an error that isn't already on the list, please email: sean@eddylab.org.

Jacket blurb

The face of biology has been changed by the emergence of modern molecular genetics. Among the most exciting advances are large-scale genome sequencing efforts, such as the Human Genome Project, which are producing an immense amount of data. The need to understand the data is becoming ever more pressing. Demands for sophisticated analyses of biological sequences are driving forward the newly created and explosively expanding research area of computational molecular biology, or bioinformatics.

Many of the most powerful sequence analysis methods are now based on principles of probabilistic modelling. Examples of such methods include the use of probabilistically derived score matrices to determine the significance of sequence alignments, the use of hidden Markov models as the basis for profile searches to identify distant members of sequence families, and the inference of phylogenetic trees using maximum likelihood approaches.

This book provides the first unified, up-to-date, and tutorial-level overview of sequence analysis methods, with particular emphasis on probabilistic modelling. Pairwise alignment, hidden Markov models, multiple alignment, profile searches, RNA secondary structure analysis, and phylogenetic inference are treated at length.

Written by an interdisciplinary team of authors, the book is accessible to molecular biologists, computer scientists, and mathematicians with no formal knowledge of each other's fields. It presents the state of the art in this important, new and rapidly developing discipline.

Table of contents

Preface (p. ix)
Introduction (p. 1)
    Sequence similarity, homology, and alignment (p. 2)
    Overview of the book (p. 2)
    Probabilities and probabilistic models (p. 4)
    Further reading (p. 10)
Pairwise alignment (p. 12)
    Introduction (p. 12)
    The scoring model (p. 13)
    Alignment algorithms (p. 17)
    Dynamic programming with more complex models (p. 28)
    Heuristic alignment algorithms (p. 32)
    Linear space alignments (p. 34)
    Significance of scores (p. 36)
    Deriving score parameters from alignment data (p. 41)
    Further reading (p. 45)
Markov chains and hidden Markov models (p. 46)
    Markov chains (p. 48)
    Hidden Markov models (p. 51)
    Parameter estimation for HMMs (p. 62)
    HMM model structure (p. 68)
    More complex Markov chains (p. 72)
    Numerical stability of HMM algorithms (p. 77)
    Further reading (p. 79)
Pairwise alignment using HMMs (p. 80)
    Pair HMMs (p. 81)
    The full probability of x and y, summing over all paths (p. 87)
    Suboptimal alignment (p. 89)
    The posterior probability that x_i is aligned to y_j (p. 91)
    Pair HMMs versus FSAs for searching (p. 95)
    Further reading (p. 98)
Profile HMMs for sequence families (p. 100)
    Ungapped score matrices (p. 102)
    Adding insert and delete states to obtain profile HMMs (p. 102)
    Deriving profile HMMs from multiple alignments (p. 105)
    Searching with profile HMMs (p. 108)
    Profile HMM variants for non-global alignments (p. 113)
    More on estimation of probabilities (p. 115)
    Optimal model construction (p. 122)
    Weighting training sequences (p. 124)
    Further reading (p. 132)
Multiple sequence alignment methods (p. 134)
    What a multiple alignment means (p. 135)
    Scoring a multiple alignment (p. 137)
    Multidimensional dynamic programming (p. 141)
    Progressive alignment methods (p. 143)
    Multiple alignment by profile HMM training (p. 149)
    Further reading (p. 159)
Building phylogenetic trees (p. 160)
    The tree of life (p. 160)
    Background on trees (p. 161)
    Making a tree from pairwise distances (p. 165)
    Parsimony (p. 173)
    Assessing the trees: the bootstrap (p. 179)
    Simultaneous alignment and phylogeny (p. 180)
    Further reading (p. 188)
    Appendix: proof of neighbour-joining theorem (p. 190)
Probabilistic approaches to phylogeny (p. 192)
    Introduction (p. 192)
    Probabilistic models of evolution (p. 193)
    Calculating the likelihood for ungapped alignments (p. 197)
    Using the likelihood for inference (p. 205)
    Towards more realistic evolutionary models (p. 215)
    Comparison of probabilistic and non-probabilistic methods (p. 224)
    Further reading (p. 231)
Transformational grammars (p. 233)
    Transformational grammars (p. 234)
    Regular grammars (p. 237)
    Context-free grammars (p. 242)
    Context-sensitive grammars (p. 247)
    Stochastic grammars (p. 250)
    Stochastic context-free grammars for sequence modelling (p. 252)
    Further reading (p. 259)
RNA structure analysis (p. 260)
    RNA (p. 261)
    RNA secondary structure prediction (p. 267)
    Covariance models: SCFG-based RNA profiles (p. 277)
    Further reading (p. 297)
Background on probability (p. 299)
    Probability distributions (p. 299)
    Entropy (p. 305)
    Inference (p. 311)
    Sampling (p. 314)
    Estimation of probabilities from counts (p. 319)
    The EM algorithm (p. 323)
Bibliography (p. 326)
Author index (p. 345)
Subject index (p. 350)