Biological Sequence Analysis

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids

A tutorial introduction to hidden Markov models and other probabilistic modelling approaches in computational sequence analysis.

Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison.
Cambridge University Press, 1998.
356 pp.
ISBN 0-521-62041-4 (hardback)
ISBN 0-521-62971-3 (paperback)

The book is available on line from various sources, including:
Cambridge University Press: [paperback] ;
Barnes and Noble: [paperback];
Amazon.com: [paperback] [hardback].

Errata

An up-to-date list of errata is here.

If you discover an error that isn't already on the list, please email: sean@eddylab.org.

Jacket blurb

The face of biology has been changed by the emergence of modern molecular genetics. Among the most exciting advances are large-scale genome sequencing efforts, such as the Human Genome Project, which are producing an immense amount of data. The need to understand the data is becoming ever more pressing. Demands for sophisticated analyses of biological sequences are driving forward the newly created and explosively expanding research area of computational molecular biology, or bioinformatics.

Many of the most powerful sequence analysis methods are now based on principles of probabilistic modeling. Examples of such methods include the use of probabilistically derived score matrices to determine the significance of sequence alignments, the use of hidden Markov models as the basis for profile searches to identify distant members of sequence families, and the inference of phylogenetic trees using maximum likelihood approaches.

This book provides the first unified, up to date, and tutorial level overview of sequence analysis methods, with particular emphasis on probabilistic modelling. Pairwise alignment, hidden Markov models, multiple alignment, profile searches, RNA secondary structure analysis, and phylogenetic inference are treated at length.

Written by an interdisciplinary team of authors, the book is accessible to molecular biologists, computer scientists, and mathematicians with no formal knowledge of each others' fields. It presents the state of the art in this important, new and rapidly developing discipline.

	Preface	ix
Introduction		1
	Sequence similarity, homology, and alignment	2
	Overview of the book	2
	Probabilities and probabilistic models	4
	Further reading	10
Pairwise alignment		12
	Introduction	12
	The scoring model	13
	Alignment algorithms	17
	Dynamic programming with more complex models	28
	Heuristic alignment algorithms	32
	Linear space alignments	34
	Significance of scores	36
	Deriving score parameters from alignment data	41
	Further reading	45
Markov chains and hidden Markov models		46
	Markov chains	48
	Hidden Markov models	51
	Parameter estimation for HMMs	62
	HMM model structure	68
	More complex Markov chains	72
	Numerical stability of HMM algorithms	77
	Further reading	79
Pairwise alignment using HMMs		80
	Pair HMMs	81
	The full probability of x and y, summing over all paths	87
	Suboptimal alignment	89
	The posterior probability that x_i is aligned to y_j	91
	Pair HMMs versus FSAs for searching	95
	Further reading	98
Profile HMMs for sequence families		100
	Ungapped score matrices	102
	Adding insert and delete states to obtain profile HMMs	102
	Deriving profile HMMs from multiple alignments	105
	Searching with profile HMMs	108
	Profile HMM variants for non-global alignments	113
	More on estimation of probabilities	115
	Optimal model construction	122
	Weighting training sequences	124
	Further reading	132
Multiple sequence alignment methods		134
	What a multiple alignment means	135
	Scoring a multiple alignment	137
	Multidimensional dynamic programming	141
	Progressive alignment methods	143
	Multiple alignment by profile HMM training	149
	Further reading	159
Building phylogenetic trees		160
	The tree of life	160
	Background on trees	161
	Making a tree from pairwise distances	165
	Parsimony	173
	Assessing the trees: the bootstrap	179
	Simultaneous alignment and phylogeny	180
	Further reading	188
	Appendix: proof of neighbour-joining theorem	190
Probabilistic approaches to phylogeny		192
	Introduction	192
	Probabilistic models of evolution	193
	Calculating the likelihood for ungapped alignments	197
	Using the likelihood for inference	205
	Towards more realistic evolutionary models	215
	Comparison of probabilistic and non-probabilistic methods	224
	Further reading	231
Transformational grammars		233
	Transformational grammars	234
	Regular grammars	237
	Context-free grammars	242
	Context-sensitive grammars	247
	Stochastic grammars	250
	Stochastic context-free grammars for sequence modelling	252
	Further reading	259
RNA structure analysis		260
	RNA	261
	RNA secondary structure prediction	267
	Covariance models: SCFG-based RNA profiles	277
	Further reading	297
Background on probability		299
	Probability distributions	299
	Entropy	305
	Inference	311
	Sampling	314
	Estimation of probabilities from counts	319
	The EM algorithm	323
Bibliography		326
Author index		345
Subject index		350

Eddy Laboratory