A tutorial introduction to hidden Markov models and other probabilistic modelling approaches in computational sequence analysis. Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison. Cambridge University Press, 1998. 356 pp. ISBN 0-521-62041-4 (hardback) ISBN 0-521-62971-3 (paperback)
The book is available on line from various sources, including: |
An up-to-date list of errata is here.
If you discover an error that isn't already on the list, please
email:
The face of biology has been changed by the emergence of modern molecular genetics. Among the most exciting advances are large-scale genome sequencing efforts, such as the Human Genome Project, which are producing an immense amount of data. The need to understand the data is becoming ever more pressing. Demands for sophisticated analyses of biological sequences are driving forward the newly created and explosively expanding research area of computational molecular biology, or bioinformatics.
Many of the most powerful sequence analysis methods are now based on principles of probabilistic modeling. Examples of such methods include the use of probabilistically derived score matrices to determine the significance of sequence alignments, the use of hidden Markov models as the basis for profile searches to identify distant members of sequence families, and the inference of phylogenetic trees using maximum likelihood approaches.
This book provides the first unified, up to date, and tutorial level overview of sequence analysis methods, with particular emphasis on probabilistic modelling. Pairwise alignment, hidden Markov models, multiple alignment, profile searches, RNA secondary structure analysis, and phylogenetic inference are treated at length.
Written by an interdisciplinary team of authors, the book is accessible to molecular biologists, computer scientists, and mathematicians with no formal knowledge of each others' fields. It presents the state of the art in this important, new and rapidly developing discipline.
Preface | ix | |
Introduction | 1 | |
Sequence similarity, homology, and alignment | 2 | |
Overview of the book | 2 | |
Probabilities and probabilistic models | 4 | |
Further reading | 10 | |
Pairwise alignment | 12 | |
Introduction | 12 | |
The scoring model | 13 | |
Alignment algorithms | 17 | |
Dynamic programming with more complex models | 28 | |
Heuristic alignment algorithms | 32 | |
Linear space alignments | 34 | |
Significance of scores | 36 | |
Deriving score parameters from alignment data | 41 | |
Further reading | 45 | |
Markov chains and hidden Markov models | 46 | |
Markov chains | 48 | |
Hidden Markov models | 51 | |
Parameter estimation for HMMs | 62 | |
HMM model structure | 68 | |
More complex Markov chains | 72 | |
Numerical stability of HMM algorithms | 77 | |
Further reading | 79 | |
Pairwise alignment using HMMs | 80 | |
Pair HMMs | 81 | |
The full probability of x and y, summing over all paths | 87 | |
Suboptimal alignment | 89 | |
The posterior probability that xi is aligned to yj | 91 | |
Pair HMMs versus FSAs for searching | 95 | |
Further reading | 98 | |
Profile HMMs for sequence families | 100 | |
Ungapped score matrices | 102 | |
Adding insert and delete states to obtain profile HMMs | 102 | |
Deriving profile HMMs from multiple alignments | 105 | |
Searching with profile HMMs | 108 | |
Profile HMM variants for non-global alignments | 113 | |
More on estimation of probabilities | 115 | |
Optimal model construction | 122 | |
Weighting training sequences | 124 | |
Further reading | 132 | |
Multiple sequence alignment methods | 134 | |
What a multiple alignment means | 135 | |
Scoring a multiple alignment | 137 | |
Multidimensional dynamic programming | 141 | |
Progressive alignment methods | 143 | |
Multiple alignment by profile HMM training | 149 | |
Further reading | 159 | |
Building phylogenetic trees | 160 | |
The tree of life | 160 | |
Background on trees | 161 | |
Making a tree from pairwise distances | 165 | |
Parsimony | 173 | |
Assessing the trees: the bootstrap | 179 | |
Simultaneous alignment and phylogeny | 180 | |
Further reading | 188 | |
Appendix: proof of neighbour-joining theorem | 190 | |
Probabilistic approaches to phylogeny | 192 | |
Introduction | 192 | |
Probabilistic models of evolution | 193 | |
Calculating the likelihood for ungapped alignments | 197 | |
Using the likelihood for inference | 205 | |
Towards more realistic evolutionary models | 215 | |
Comparison of probabilistic and non-probabilistic methods | 224 | |
Further reading | 231 | |
Transformational grammars | 233 | |
Transformational grammars | 234 | |
Regular grammars | 237 | |
Context-free grammars | 242 | |
Context-sensitive grammars | 247 | |
Stochastic grammars | 250 | |
Stochastic context-free grammars for sequence modelling | 252 | |
Further reading | 259 | |
RNA structure analysis | 260 | |
RNA | 261 | |
RNA secondary structure prediction | 267 | |
Covariance models: SCFG-based RNA profiles | 277 | |
Further reading | 297 | |
Background on probability | 299 | |
Probability distributions | 299 | |
Entropy | 305 | |
Inference | 311 | |
Sampling | 314 | |
Estimation of probabilities from counts | 319 | |
The EM algorithm | 323 | |
Bibliography | 326 | |
Author index | 345 | |
Subject index | 350 |