Errata page for Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
Yes, there are errors. Unlike Donald Knuth, we don't dare offer you $1
for each one you find. But we would still appreciate knowing about
them!
Please report errata to
sean@eddylab.org.
The book is now past its 10th printing, and in its third revision.
New errata for printings in the third revision (10th printing and up)
(Page numbers are now relative to the 10th printing, which has a few
off-by-one's with respect to previous printings.)
Chapter 2
- p.19, a previous fix (see below) had added:
"Assume that gapped sequence alignments do not allow gaps in
the second sequence after a gap in the first; that is, allow
alignments of form ABC/A-C and A-CD/AB-D but not AB-D/A-CD.
(This is a natural restriction, because a region between aligned
pairs can be aligned in a large number of uninteresting ways.)"
This contains a typo: change "after" to "before", so that it
reads "do not allow gaps in the second sequence before a gap
in the first". Alternatively, "do not allow inserted residues in
the second sequence after an inserted residue in the first".
[Thanks to Jeffrey Xing, February 2016.]
- p.30, we say, "we assume that a deletion will not be followed
directly by an insertion. This will be true for the optimal path if
-d-e is less than the lowest mismatch score." This is incorrect.
A counterexample: suppose d=10, e=1, lowest mismatch score = -10,
and the alignment:
ccc---
---aaa
It should instead probably read something like,
"This will be true for an optimal path if -2e <= the lowest mismatch
score and d>=e." [Thanks to Martin Frith, CBRC, Japan, Jun 2012].
- p.39, eqn 2.18, the equation for the extreme value
distribution is missing a negative sign in the innermost
exponential, in front of \lambda (x - \mu). [Thanks to Larry Ruzzo,
UW Seattle, Apr 2011].
Chapter 3
- p.52, "normalised by dividing by their length, i.e. as an average
number of bits per molecule" should be "per residue". [Thanks to
Michael Hoffman, EBI, June 2008].
- p.78 (in my 1999 1st rev copy), "The b variables have to be
scaled with the same numbers, so the recursion step in (3.3)
becomes...": two corrections here. First, (3.3) should be
(3.13). Second, change "have to be scaled" to "can be scaled". Using
the same scaling factors is convenient because it leads to the
simplifications in exercises 3.13 and 3.14, but is not necessary, as
Kevin Murphy has pointed out (p.8 of https://www.cs.ubc.ca/~murphyk/Papers/segment.pdf).
[Thanks to Jason Ernst, UCLA, Sept 2022]
Chapter 4
- p.88, the pseudocode for the recursion doesn't correctly
handle the border cases i=0 or j=0.
[Thanks to Lee Newberg, Wadsworth Institute, Oct 2007]
Chapter 8
- p.196 in "... namely purine to purine or pyr-miding to pyrimidine",
an 'i' is missing from the first pyrimidine, right at the line break.
[Thanks to Tommy Kaplan, Dec 2015]
Chapter 10
- pp.270-273. We should make clear that by "Nussinov algorithm" we mean the general class
of base-pair scoring algorithms, as opposed to "Zuker algorithms" that are stacking-
(also said to be loop-)dependent algorithms. There are many different DP algorithms (in detail)
for maximizing the number of base pairs in an RNA sequence, corresponding to
different choices of grammar. The actual algorithm in Nussinov (1978)
corresponds essentially to grammar G5 in Dowell and Eddy, BMC Bioinf 5:71 (2004), whereas the
algorithm in the book corresponds to grammar G1. G5 is an unambiguous grammar,
but is a little tricky to understand if it's the first RNA folding algorithm one
sees. G1 is more intuitive, which is why we use it here, while still calling it a
"Nussinov" algorithm because it's maximizing the number of base pairs. [Thanks to
Georg Sauthoff, Bielefeld University, Germany, Dec 2008.]
- p.299: Two references are missing and appear as ?. This was a formatting error
erroneously introduced into the third revision. The missing references are:
Margalit, H., Shapiro, B. A., Oppenheim, A. B. and Maizel, J. V. 1989.
Detection of common motifs in RNA secondary structures. Nucleic Acids
Research 17:4829-4845.
Shapiro, B. A. and Zhang, K. 1990. Comparing multiple RNA secondary
structures using tree comparisons. Computer Applications in the
Biosciences 6:309-318.
[Thanks to Yiran Guo, Beijing Institute of Genomics; April 2008]
Chapter 11
- p.317: The description of rejection sampling for the Dirichlet is not correct.
The condition g(x,\alpha ,1) \leq f(x) means that f(x) cannot be a normalized
probability distribution, since g(x,\alpha ,1) is.
Actually f(x) = C \rho (x), where \rho(x) is a probability (see Exercise 11.8):
\rho (x) = \frac{\lambda \alpha^\lambda x^{\lambda -1}} {(\alpha^\lambda +x^\lambda )^2},
and the constant C = \frac {4e^{-\alpha}\alpha^\alpha}{\lambda\Gamma (\alpha )}.
Then 'sample from f' should be 'sample from \rho'; and finally
\rho(x) P({\rm rand}[0,1]< g(x)/f(x)) \propto g(x).
[Thanks to Dr. Wei-Mou Zheng, Institute of Theoretical Physics, Academica Sinica, Beijing; April 2008]
Errata that were fixed in the second revision
Chapter 2
Chapter 3
- p55. At the top, the example of calculating a probability for
the sequence CGCG has two typos in it. It should be:
a0,c+ * 1 * ac+,g- * 1 * ag-,c- * 1 * ac-,g+ * 1 * ag+,0
[thanks to Bill Nagy, IBM]
- p65. At the top, it says that Viterbi training "maximises
the contribution to the likelihood P(x1..xn | \theta,
\pi*(x1)..\pi*(xn))." That should read P(x1..xn,
\pi*(x1)..\pi*(xn)): e.g. Viterbi finds an optimal
path by maximizing P(path, seq | model) over possible paths.
[thanks to Shlomo Moran, Technion, Haifa]
- pp 70-71. The comparison of the numbers of transitions between
the "fully downstream connected" model and the "silent state"
model may be confusing. It's less confusing if we change 600
to 800 for the number of transitions in the silent state model
for 200 match states; the figure at the top of p. 71 clearly
shows 4 transitions per match state. The reason it says
600 was that we weren't counting the 200 match->match "main
line" transitions in the comparison (since both
models have to have them), only the extra transitions having
to do with further downstream connectivity. [thanks to
Chun Li, U.Mich.]
Chapter 4
- p 83. The initialisation of the pair-HMM Viterbi algorithm isn't right.
It should be v^M(0,0)=1, v^X(0,0)= v^Y(0,0)=0, and v^*(-1,j)=v^*(i,-1)=0.
The recursion should be over i=0,..,n and j=0,..,m except for (0,0).
As it is, there can be no deletes in the beginning of any of the sequences.
It would be nicer to avoid the -1 row/column and explicitly
initialize v^X(0,j) and v^X(i,0) and similarly for v^Y [and initialize
v^M(0,j)=v^M(i,0)=0, except for v^M(0,0)=1]. Then the recusion
would start at 1 in i and j, and things would start to look more like
chap 3. [ak, 7 Sept 99]
- p.84 eqn 4.2. In two places, q_x_j should be q_y_j. [Thanks to
Michael Hoffman, UT Austin, 6 Aug 03.]
- p.84. The equation for e should have 1-\eta in the denominator,
not 1-\tau. [Thanks to J. Burton, 27 Feb 02.]
- p.85 top. The initialization of V^M(0,0) should be -2 \log \eta,
not 2 \log \eta. (And I'm a little suspicious still: I think this
algorithm doesn't work out right for some special cases, like length 0
alignment, or a length 1 insert-only alignment.) [sre, 2 Jan 03].
- p.85. Where it says "If the exit
transition probabilities from the gap states are set
to (1-epsilon)tau/(1-tau) then c will be zero", that should read
"...set to (1-epsilon)tau/(1-2delta)...".
[Thanks to Youyi Fong, 21 Dec 02]
- p.90. In the example global alignments, two pairs - (1,1),(1,3); (2,3),
(3,1) are identical, but the intention was to show different alternative
global alignments. [Thanks to P. Tung-Ming, 28 Apr 02.]
- p.99. The Zuker suboptimal alignment algorithm (which does
a forward/backward algorithm to calculate the scores
of conditionally optimal alignments that pass though each DP cell,
samples a suboptimal DP cell, then traces the optimal alignment that
passes through that sampled cell) is not really similar to
the Waterman/Eggert algorithm (which recalculates
the DP matrix to erase the influence of the optimal alignment,
then finds the second best independent alignment, and repeats
this process for additional independent suboptimals), except
in the very trivial sense that they are both "suboptimal alignment"
algorithms. [Thanks to M. Zuker.]
Chapter 5
Chapter 6
- p.156 bottom, \hat{a}_{i,\pi_{i}}} should be changed to
\hat{a}_{k,\pi_{i}}} (that is, index on a_i should be a_k)
[AK, 3 Jun 2005]
Chapter 7
- p.172, question 7.10: should say the bridge length, not
twice the bridge length. [thanks to Itai Sharon, 2 Feb 03].
- p.178-179. Last sentence of pp at top of p.179 should read
"Therefore, when we reach a situation with a row of 0s on the right,
we have to advance the leftmost 0 to 1 to make the next step."
Figure 7.11 (ii) needs to be changed accordingly as well.
[Thanks to Bernhard Sonderegger, 14 Jun 2004]
- p.189-191, the Studier/Keppler proof of NJ. Equation 7.10
should read -2 d_lk, not -d_lk (note that d_ij contains a
d_lk segment). This affects subsequent algebra. However,
Graeme believes the proof has other problems. This section
needs to be rethought carefully.
- p.196, around the line "Rescaling does not change the diagonal
form of the matrix", the operation should be "normalization";
and it is not clear that this leaves the matrix diagonalizable -
if that's what this sentence is supposed to mean. GJM wants
to revise here. [GJM, 7 Jun 04]
Chapter 8
- p. 228, in the first equation on the page, there's a missing
summation over a^8; both the first and second part of the equation
are summations over a^8, not just the first.
The LaTeX should read:
\eqas
\sum_{a^8} P(a^1 | a^8, t_1+t_6)P(a^3 | a^8, t_7+t_3)q_{a^8}
&=& \sum_{a^8} P(a^1 | a^8, t_1+t_6)P(a^8 | a^3,t_3+t_7)q_{a^3} \\
&=& P(a^1 | a^3, t_1+t_6+t_3+t_7)q_{a^3} \\
\enas
[thanks to Carlos Gerardo Arroyo at SLAC, 13 Dec 2000]
Chapter 9
Chapter 10
- p 272, simple SCFG for RNA folding: needs one more production
S -> \epsilon, to terminate derivations. (Thanks to R. Giegerich,
10/11/99)
- p 271-272. Algorithm on p. 271 does not give the structure
in figure 10.9, though the structure in fig 10.9 is indeed
a possible optimal structure. Clarify in exercise 10.2: point
this out, and ask for modifications to the traceback to give
the fig 10.9 structure. (Discrepancy caught by
B. Sonderegger, 6/26/02)
- p 273, CYK recursion: for i = 1 to L-1 should read for i = L-1 down to 1.
(Thanks to R. Giegerich, 10/11/99)
- Last paragraph on p 273 (or at some point): We need to discuss grammar ambiguity.
Obviously any useful RNA structure prediction grammar must be
formally ambiguous: there are many possible structures for a
given sequence, and we're going to be interested in the maximum likelihood
structure. It's important to avoid a second level of ambiguity:
we want each structure to have an unambiguous parse tree.
If this is not true, we cannot say that the maximum likelihood
parse tree (which is what CYK is going to give us) corresponds
to the maximum likelihood structure. The grammar in (10.2) is
ambiguous in this way... which is another reason why it's not
a good example of an RNA folding grammar. (Thanks to R. Giegerich, 10/11/99)
- p 274, footnote: The URL's gone out of date (surprise surprise).
A working URL is here.
But we should replace it with a real reference.
[Thanks to Michael Hoffman, UT Austin, 7 Sept 2003.]
- p 280, fig 10.12: On the left branch of the parse tree the
two last L productions before E should be C and G rather than U and C.
Chapter 11
- p.304: "...the Poisson, f(n)=e^-p p^n/n,...", should be
n! in the denominator. (This is a TeX source typo; "\!" does
nothing.) (Thanks to Li Liao at Dupont, 8/29/01.)
- p.312 line 6: "we we" should be "we". (Thanks to Chun Li, U.Mich.,
4/20/00).
- p.315 exercise 11.9: "from (x,y) to (u,w)" should be "from (u,w)
to (x,y)".(Chun Li, 4/20/00).
- p.315, exercise 11.10 line 2: The expression at the end should be
$\cos(2\pi x) \sqrt{\log(1/y^2)}$. (The square root was left out.)
(Chun Li, 4/20/00).
- p.317, Ex.11.12, line 1. The inequality at the end should be
1<=\lambda instead of 1>=\lambda. (Chun Li, 4/20/00).
Bibliography
- p.342: Stormo and Haussler generalized DP reference is 1994, not 1996;
2nd ISMB, not 4th. (sre to ak - doublecheck the bibtex crossrefs
for ismb, make sure this didn't happen elsewhere)
[Thanks to Gary Stormo, 15 Oct 2003.]
Errata for the first edition (fixed in the first revision)
Chapter 1
- pg. 8. In the last equation on the page, after "the denominator is
now an integral rather than a sum", the left hand side
P(\theta) should instead be P(D).
- pg. 8. Exercise 1.4, "Why?" should more specifically say
"Using Bayes' theorem, give a plausible reason for making
this decision." In real life, the decision to take or not take a genetic
test would also depend on
the risk associated with erroneously not treating the disease in
true positives and the risk associated with treating the
disease in false positives.
Chapter 2
- p. 18, eq. 2.7: Remove the "2" under the sqrt:
"\frac{ 2^{2n}}{\sqrt{2 \pi n}}" --> "\frac{ 2^{2n}}{\sqrt{\pi n}}"
- p. 24: H(p||q) should be H(qxq||p) -- rd; and I (sre) think
there's incorrect text too; should read something like "where H(qxq || p)
is the relative entropy of the expected distribution qxq
with respect to the observed distribution p..."
Chapter 3
- p. 51. The equation for S(x) isn't specific enough about what log base is
used to produce the actual example numbers in the table for \Beta.
The numbers in the example table are in log base two (bits). This
appears in the footnote, but it's easy to miss.
- p. 71, point (ii): "add" --> "set"
- p. 76, fig. 3.12, y-axis on lower plot:
"Bits per nucleotide" --> "Bits"
Chapter 4
- In Exercise 4.4, "Figure 4.8" should read "Figure 4.7".
Chapter 5
- p. 118. line 1 (prime is on sum instead of the a)
"$f_{ja} = c_{ja}/\sum_a' c_{ja'}$" --> "$f_{ja} = c_{ja}/\sum_{a'} c_{ja'}$"
- p. 118, eq. 5.4. The definition of f is incompatible with the exercises,
and I suggest changing " = 1" to "- 1 = 0".
- pg. 121. Equation 5.8 should read
$\sum_C \sum_{\mbox{samples } s} \sum_a C_a \log e_s(a)$.
- pg. 123. In figure 5.7, arrows for I-> D transitions were
omitted.
- pg. 123 (bottom of page). The probability $t_{xy}$ should be $a_{xy}$.
Chapter 6
- p. 140, equation 6.5: When we define the SP score (eqn 6.5), we
don't assign a cost to $s(-,-)$, though of course this can arise in
multiple alignments. I presume it should be zero. Additional
note (gjm): actually we do define $s(-,-)$ to be zero, but it
is on p. 146, and in fact we repeat what has been said on p 140 two lines
down "For simple linear gap costs..". We really ought to say s(-,-)=0 on
page 140 and then refer back to this whole passage on page 146.
Chapter 7
- p. 160, Zuckerkandl is misspelled; also in refs and in
index (pp. 344, 349) -> change in bibtex db.
- p. 162, caption to fig. 7.1, ln. 7:
"the alphabetically the" --> "alphabetically the"
- p. 165, Exercise 7.2 should read: "The trees with
three and four leaves in Figure 7.3 all have the same
unlabelled branching pattern. For both rooted and unrooted
trees, how many leaves do there have to be to obtain more than one
unlabelled branching pattern? Find a recurrence relation for the
number of rooted trees. (Hint: Consider the trees formed by joining
two trees at their roots.)"
- pg. 169. In Figure 7.5, the tree on the right is incorrect.
It should be (1(4(23))), not ((14)(23)).
- pg. 171. in the neighbour joining algorithm iteration,
"d_jk = d_ij - d_jk" should read "d_jk = d_ij - d_ik".
- pg. 176. A typo; "applying the nparsimony algorithm" should
read "applying the parsimony algorithm".
- pg. 180. In Fig 7.12 the residue on the small tree should be $x^2^j$,
not $x^2_{j-1}$.
- pg. 188. The short paragraph beginning "Note that in
this example..." doesn't make sense. The topology, qua unrooted tree, is
of course the same if GTT and GT are grouped together. What is true is
that different rootings of the tree give different results for Hein's
algorithm, which is worrying. Actually, the whole section needs
rethinking, because someone has just shown that the use of "dummy edges"
in Hein's algorithm is incorrect.
Chapter 8
- p. 213, line -2: should read "assuming a flat prior on $\exp(-\alpha t)$" -- I think. (But there may be something nastier wrong
with this.) [GJM]
- p. 214: should be a 3 in the denominator of eq 8.23.
Chapter 11
- pg. 301, Equation 11.4. The denominator of the last bit
should read (\sum_k n_k)!. The factorial was inadvertently
dropped.
- p. 302 eqn. 11.5: slight misformatting, parentheses should be
bigger for \delta term
- p. 306 beginning of ln. 6: "is the" --> "is then"
- p. 309 line -2: "neigbouring" --> "neighboring"
- p. 312 exercise 11.6: stray blemish on d of $\sigma^2/(Nd^2)$
- p. 319, second line before equations:
"Using equations (11.3)" --> "Using equation (11.3)"
- p. 324 line 4, "equatilty" --> "equality"