Modelling Is More Versatile Than Shuffling

L. Allison, D. Powell and T. I. Dix, Technical Report 2000/83, School of Computer Science and Software Engineering, Monash University, Australia 3168, September 2000

Abstract: Sequences having low information content cause problems for standard alignment algorithms, e.g. causing false-positive matches. Shuffling is a popular technique of correcting for the abnormally low alignment costs (or high scores) between such sequences. Shuffling cannot be used safely on arbitrary populations of sequences. It is only used "after the fact" to judge the significance of alignments and does not change their rank-order. We seek a better solution.

An alternative alignment methodology is described which directly models the information content of sequences. It can be used with a very large class of statistical models for different populations of sequences. In general, it not only judges the significance of alignments but can change their rank-order, promoting some and demoting others. The populations that the sequences come from can be identified, probably. The new methodology is compared to shuffling for the purpose of judging the significance of optimal alignments.

The methodology described can be incorporated into any alignment algorithm that allows mutation costs to be treated as (-logs of) probabilities.

Keywords:

alignment, data compression, low information, repeats, sequence analysis

Introduction

Low information content sequences cause difficulties for alignment algorithms because costs (scores) may be biased by shared statistical properties that are merely typical of many members of the population of sequences. The `shuffling' technique [Fit83] is often adopted to judge whether an optimal alignment of two such sequences is really significant. A new methodology [All99] builds statistical models of populations into alignment algorithms and is an alternative to shuffling. This paper compares the two techniques. It argues that modelling populations is the more versatile in that it can not only judge if an alignment is `acceptable' but can also change the rank order of alignments and can identify which population(s) the sequences probably came from while being applicable to a wide variety of models of populations of sequences.

...indels...   ...become...
   ||||||         ||||||
...indels...   ...become...

    (i)            (ii)

Figure 1: High and Low Information Matches.
As an informal example, Figure 1 shows two partial matches, both of six characters, that might occur in the alignment of sequences of English text. Both matches are good but 1(i) is better than 1(ii) because `indels' is a very rare subsequence in normal text while `become' is much more common. Therefore match 1(i) should give more benefit to an alignment than 1(ii). If an alignment can only contain 1(i) or 1(ii) but not both then it is better for it to contain 1(i), all other things being equal. So taking account of the local information content of sequences can change the rank order of alignments and, in particular, change which are considered optimal. Note that in the context of papers on bioinformatics, `indels' could well be more common than `become' in which case 1(ii) would become the more important match! One can only talk about the information content of a sequence in the context of a population of sequences.

In the simplest version of the shuffling technique, two sequences, S1 and S2, are aligned and get a certain cost C (or alternatively a score). S1 and S2 are then permuted at random to give S1' and S2' respectively. S1' and S2' are aligned and get a cost C'. The original alignment of S1 and S2 is only considered to be `acceptable' if C is better than C' by some safety margin. The underlying assumption must be that the sequences are generated by zero-order Markov models and that S1 and S1' are typical sequences from a population modelled by the same zero-order model, similarly S2 and S2'. Permutation destroys any higher order statistical structure in S1 and S2. Altschul and Erickson [Alt85] do show how to randomize a sequence while maintaining di-nucleotide and codon usage frequencies, but it is hard to see how to randomize a sequence while maintaining the statistical properties of an arbitrary population model. In the English text example, one would at least want to preserve the appearance of plausible English words and names.

Wootton [Woo97] proposes `masking out' regions of low compositional complexity before performing alignment. Compositional complexity is defined as the entropy of the characters within a sliding window, i.e. a moving average, under a multi-state distribution [Bou69]. Masking out is drastic because low information content regions do provide non-zero amounts of information which should not be thrown away. In fairness, it was proposed as a preprocessing step to searches against large databases where speed is of the essence. In the English text example it amounts to ignoring `become' as an example of a `stop word', as is often done in information retrieval [vRi79,p18].

In what follows, we examine an alternative to shuffling, and to masking out. This alternative calculates probabilities for an alignment of two sequences and for the null hypothesis, i.e. that the sequences are not related by descent. This has previously been achieved for random sequences and recently extended to non-random, that is medium to low-information content, sequences. Here we include models for populations of sequences that S1 and S2 might be drawn from. The resulting algorithms produce alignments that weight unusual `features' of sequences heavily, provide a significance test for alignments, and can identify which populations sequences probably came from.

Algorithms

The generic dynamic programming algorithm (DPA), shown in Figure 2, can be made to calculate the edit-distance [Lev65][Sel74], longest common subsequence [Hir75], most probable alignment, or sum of probability over all alignments, for sequences S1 and S2 by suitable choices for z, c, g and f; later we show how to modify it to take account of the information content of S1 and S2.

 M[0,0] = z
 M[i,0] = f(M[i-1,0], c(S1(S1[i], null)), i=1..|S1|
 M[0,j] = f(M[0,j-1], c(S2[null, S2[j])), j=1..|S2|

 M[i,j] = g(f(M[i-1,j-1], c(S1[i],S2[j])),  --match/change
            f(M[i-1,j  ], c(S1[i], null)),  --delete
            f(M[i  ,j-1], c(null, S2[j])) ) --insert
   i=1..|S1| and j=1..|S2|

 Figure 2: Dynamic Programming Algorithm (DPA).
An optimal alignment for any of the above criteria is recovered by a trace-back of the choices made by g.

Linear and piece-wise linear gap costs [Got82] can be included by elaborating the `state' information stored in M and by changes to f and g. Since there are a great many variations of this kind, we study the simplest case of a 1-state system as the canonical example. The new method of incorporating the information content of the sequences, described in this paper, can clearly be applied to more complex multi-state edit models.

If probabilities are assigned to the basic mutations, a probability can be assigned to two sequences under the hypothesis that they are related by such mutations [All92], see Figure 3. Mutation probabilities can be estimated by an expectation maximization [Bau67][Bau70] process when they are not known in advance.

 z = 1
 g = max
 f = multiply
 c(x,null) = P(delete).P(x)
 c(null,y) = P(insert).P(y)
 c(x,x)    = P(match) .P(x)
 c(x,y)    = P(change).P(x,y), x!=y

Figure 3: For DPA, Most Probable Alignment.
The simplest assumption concerning characters is that any one sequence is random, i.e. that characters are generated independently and uniformly. In the case of DNA, this gives P(x)=1/4 and P(x,y)=1/12, where x and y differ. In practice, algorithms work with the -log to base 2 of probabilities, the natural units being `bits' after Shannon's communication theory [Sha48], and consequently we will write of the `message length' of an alignment, for example.

The null hypothesis in alignment, is that two given sequences are not related by descent, i.e. that they are independent members of the population, here the uniform population. In that case, the probability of a DNA sequence, S, of a given length is 4**(-|S|). The difference in message lengths between an alignment and the null hypothesis is their negative log odds ratio. No alignment with a message length greater than the null hypothesis is acceptable. These notions are extended to non-random sequences below.

Non-Random Sequences

A population of sequence, F, is called `non-random' if there is a compression algorithm, m, such that m-1(m(S))=S for all S in F, and the length of m(S), measured in bits, is less than the length of S, on average for S in F. Many compression algorithms now consist of a `predictor', which yields P(S[i]|S[1..i-1]), and an encoder. The de-compression algorithm uses an identical predictor with a decoder. The encoding/decoding problem has essentially been solved by arithmetic coding [Lan84] which can get arbitrarily close to the theoretical limit of -log(P(I)) bits for an item I of probability P(I). Much data compression research is now concentrated on statistical modelling and on prediction algorithms for various sources. The best model is the one that leads to the greatest data compression. For example, low-order Markov models compress "typical" DNA sequences to about 1.9 bits per character but more sophisticated models (Grumbach and Tahi 1994, Loewenstern and Yianilos 1997, Allison et al 1998) [Gru94][Loe97][All98a] do much better.

Alignment of Non-Random Sequences

In order to find a most probable alignment, the DPA of Figures 2 and 3 needs estimates for the probabilities of characters involved in matches, changes, inserts and deletes. When the given sequences are assumed to come from particular models, the models provide predictions for P(S1[i]|S1[1..i-1]) and for P(S2[j]|S2[1..j-1]) and these predictions should be used in some efficient and effective way. The formulae given in Figure 4 define the simplest method of achieving these aims [All99]. Note in particular that `match' and `change' treat S1 and S2 symmetrically. Renormalisation takes place in a change because two necessarily different character values are involved. Note that there are almost no constraints on how P(S1[i]|S1[1..i-1]), say, is estimated, i.e. on the form of models of populations. The models' predictions for S1[i] and S2[j] can either be computed `on the fly' in the DPA or can be precomputed and cached, whichever is most convenient and efficient.

 P(delete).P(x|S1[1..i-1]), where S1[j]=x

 P(insert).P(y|S2[1..j-1]), where S2[j]=y

 P(match).(P(x|S1[1..i-1])+P(x|S2[1..j-1]))/2,
    where S1[i]=S2[j]=x

 P(change).(P(x|S1[1..i-1]) * P(y|S2[1..j-1])) /
    (1/(1-P(y|S1[1..i-1]))+1/(1-P(x|S2[1..j-1))) / 2,
 where S1[i]=x!=y=S2[j]

Figure 4: DPA Probabilities for Non-Random Sequences.
Note that there is an alternative formulation of the problem [Pow98] to suit the sequence assembly problem although the resulting algorithm is then less general in terms of the models that it can use.

Tests and Discussion

Unrelated pairs and related pairs of artificial DNA sequences were generated from various populations. 100 pairs of unrelated sequences of length 200 were generated from each of the uniform, MMf and MMg models as in Figure 5. These models were chosen purely on the grounds of simplicity and the method is not limited to them. MMf is an (A|T)-rich zero-order model and MMg is an AT-rich first-order model. For a pair of related sequences, one sequence S1 of length 200 was first generated from the relevant model - uniform, MMf or MMg. Sequence S2 was formed by copying S1 with a certain probability of mutation. The probabilities of change, insertion and deletion were kept in the ratios 2:1:1 for simplicity. When a character was inserted at S2[j], it was chosen from the distribution implied by the model given S2[1..j-1]. When a character was changed it was chosen from the distribution renormalised after the original character value was removed. S2 is thus from a similar, but not necessarily identical, model to S1. The probability of mutation was varied from a low 10% to a high 50%. 100 pairs of sequences were generated for each parameter setting.

For each set of pairs of sequences, each pair in the set was aligned with the modified DPA assuming (a) the uniform model for S1 and S2, (b) unknown zero-order models for S1 and S2, and (c) unknown first-order models for S1 and S2. Note that the parameters of the models must be inferred from the data in (b) and (c) and that these parameters must be stated, to optimum accuracy, as part of their respective hypotheses [Bou69]. The information content of the null hypothesis was also calculated under each of these three population assumptions, giving six hypotheses in total. The means and standard deviations of the hypotheses' message lengths, in bits per character, were calculated over a number of trials. The number of times that each hypothesis had the shortest message length, i.e. highest posterior probability, were also counted.

As a comparison, each sequence in a pair was shuffled and the shuffled sequences were aligned under the uniform model. The standard deviation, SD, of the message lengths of the alignments of the shuffled sequences was also calculated. The number of times that an alignment of two original sequences bettered the alignment of their shuffled selves (under the uniform model) by margins of one, two and three times SD, i.e. was considered acceptable by these safety margins, were also recorded. The use of the quantity bits per character allowed for slight differences in sequence lengths. The results of the experiments are summarised in table 1.

   A:9/20 C:1/20 G:1/20 T:9/20

   Figure 5(a): Zero-Order Model MMf.


                A   C    G    T
             .--------------------> S[i]
           A | 1/12 1/12 1/12 9/12
   S[i-1]  C | 9/20 1/20 1/20 9/20
           G | 9/20 1/20 1/20 9/20
           T | 9/12 1/12 1/12 1/12

   Figure 5(b): First-Order Model MMg.

Uniform Data

Data generated from the uniform model are the simplest to deal with. For related sequences, an optimal alignment under the uniform model has the shortest message length up to a mutation probability of 0.4 when the null hypothesis, also under the uniform model, starts to take over. This is consistent with Deken's bounds [Dek83] of 0.55 to 0.72 for the length of an LCS of unrelated sequences from such a population. It can be seen that shuffling, even with a safety margin of 3xS.D., is sometimes inclined to consider alignments of unrelated sequences to be acceptable.

Zero-Order Data

MMf is an (A|T)-rich zero-order model - Figure 5(a). The effective alphabet is less than four, typical sequences being compressible to 1.5 to 1.6 bits per character. Sequences are successfully identified as coming from a zero-order model. Almost all related pairs are identified as being related up to a mutation probability of 0.2. A significant number of optimal alignments are unacceptable at a mutation probability of 0.3 and almost all are unacceptable beyond this level. This is consistent with Deken's bounds of 0.61 to 0.78 for the LCS of unrelated sequences from a ternary alphabet. In contrast, a margin of three standard deviations is needed for shuffling to return almost no false positives on unrelated sequences. With that margin however, over one quarter of alignments are still considered acceptable at a mutation probability of 0.5, which is surely unreasonable given the alphabet's small effective size. Note that MMf is a zero-order model and should suit this instance of shuffling well.

First-Order Data

The message length criterion correctly identifies the population class of data from MMg - Figure 5(b). The transition from optimal alignment to (first-order) null hypothesis being the more probable occurs at a mutation probability of about 0.3. In contrast, shuffling often considers optimal alignments of even unrelated sequences to be acceptable at a safety margin as large as three standard deviations. Of course its assumption here, that the data comes from a zero-order population, is incorrect and this confirms that shuffling is invalid in general when it does not preserve the statistics of the population. This is not surprising but it does raise the question of how to use shuffling when the nature of the population is not simple and known in advance? The message length test can, on the other hand, select between a number of population models, including very general ones, while also giving a significance test for alignments.

Population Models

With reference to the tests above, note that the uniform model is a special case of a zero-order model which is in turn a special case of a first-order model. A zero-order model over DNA has three parameters and a first-order model over DNA has 12. The number of parameters rises with model complexity. Overfitting would be a danger if model costs were not included in the hypotheses. Data from a uniform population will contain some chance correlations that a zero or first-order model will detect but, when the costs of the increasingly complex models are included in their message lengths, these correlations are shown to be chance and not significant. When the sequence lengths are short, say 50 characters, MMg sequences are sometimes classified as zero-order data which is a reasonable outcome when only given two short sequences.

It is natural to ask what is the best model of sequences to incorporate into alignment algorithms. The message length criterion can identify the best out of a given collection of alternatives but it cannot invent the best model out of thin air. To answer that question would in general require a search over all models and this is infeasible, although various subsets can be searched systematically. Fortunately there is usually some prior biological knowledge about the possible nature of a population of sequences and one or more models can be based on this. Some general guidelines can also be given: A model should be simple; in particular the number of parameters to be fitted to the data should be small compared to the length of the sequences. If sequences are expected to be non-homogeneous, e.g. mixtures of uniform and repetitive sections say, then an `adaptive' model that uses recent observations to predict the next character can perform well. Research on the compression of biological sequences (Grumbach and Tahi 1994, Loewenstern and Yianilos 1997, Allison et al 1998) [Gru94][Loe97][All98a] can also suggest models to use.

Conclusion

We have argued that statistical models of populations of sequences should be used in alignment algorithms. Doing so can change the rank order of alignments and, in particular, change which are considered to be optimal. It gives a simple and efficient significance test for alignments of high or low information content sequences. A number of alternative models can be compared when the true nature of the population is not known. Tests show a good ability to infer this information from given data. The method is versatile and can be used with any model of sequences that makes predictions, i.e. yields probability estimates P(S[i]=x|S[1..i-1]) where x ranges over the alphabet. The question of what is a good model to use must depend on the sequences being analysed. Fortunately biologists often have prior knowledge which can be used to design a model. It is not crucial that this is correct in every detail provided that some simple and plausible alternatives are also included.

In contrast, the well known shuffling technique cannot change the rank order of alignments; it only changes the criterion for acceptability after the alignment process. It can also give incorrect results when the assumptions of its implicit model are invalid and this situation can be hard to identify.

The `masking out' technique discards useful information that is present in even low information subsequences; modelling during alignment can use this information.

References

[All98a] Allison, L,, Edgoose, T. and Dix, T. I. (1998). Compression of strings with approximate repeats. Proc. Intelligent Systems in Molecular Biology, ISMB98, AAAI Press, 8-16 (Also see [more].)

[All99] Allison, L., Powell, D. and Dix, T. I. (1999). Compression and approximate matching. Comp. Jrnl. 42(1) 1-10. (Also see [more].)

[All92] Allison, L., Wallace, C. S. and Yee, C. N. (1992). Finite-state models in the alignment of macromolecules. Jrnl. Molec. Evol. 35 77-89. (Also see TR90/148.)

[Alt85] Altschul, S. F. and Erickson, B. W. (1985). Significance of nucleotide sequence alignment: a method for random sequence permutation that preserves dinucleotide and codon usage. Mol. Biol. Evol. 2(6) 526-538.

[Bau67] Baum, L. E. and Eagon, J. E. (1967). An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model of ecology. Bulletin of AMS 73 360-363.

[Bau70] Baum, L. E., Petrie, T., Soules, G. and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals Math. Stat. 41 164-171.

[Bou69] Boulton, D. M. and Wallace, C. S. (1969). The information content of a multistate distribution. J. Theor. Biol. 23 269-278.

[Dek83] Deken, J. (1983). Probabilistic behaviour of the longest common subsequence length. In Time Warps, String Edits and Macromolecules, D. Sankoff and J. B. Kruskal (eds), Addison Wesley, 359-362.

[Fit83] Fitch, W. M. (1983). Random sequences. J. Mol. Biol. 163, 171-176.

[Got82] Gotoh, O. (1982). An improved algorithm for matching biological sequences. Jrnl. Molec. Biol. 162 705-708.

[Gru94] Grumbach, S. and Tahi, F. (1994). A new challenge for compression algorithms: genetic sequences. Inf. Proc. and Management 30(6) 875-886.

[Hir75] Hirschberg, D. S. (1975). A linear space algorithm for computing maximal common subsequences. Comm. Assoc. Comp. Mach. 18(6) 341-343.

[Lan84] Langdon, G. G. (1984). An introduction to arithmetic coding. IBM J. of Res. and Dev. 28(2), 135-149.

[Lev65] Levenshtein, V. I. (1965). Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR 163(4) 845-848 (trans. Soviet Physics Doklady 10(8) 707-710, 1966).

[Loe97] Loewenstern, D. and Yianilos, P. N. (1997). Significantly lower entropy estimates for natural DNA sequences. Data Compression Conference DCC '97, 151-160.

[Pow98] Powell, D., Allison, L., Dix, T. I. and Dowe, D. L. (1998). Alignment of low information sequences. Proc. 4th Australasian Theory Symposium, CATS '98, Perth, 215-229, Springer Verlag.

[vRi79] van Rijsbergen, C. J. (1979). Information Retrieval. Butterworths.

[Sel74] Sellers, P. H. (1974). An algorithm for the distance between two finite sequences. Jrnl. Combinatorial Theory 16 253-258.

[Sha48] Shannon, C. E. (1948). A mathematical theory of communication. Bell Syst. Tech. Jrnl. 27, 379-423 and 623-656.

[Woo97] Wootton, J. C. (1997). Simple sequences of protein and DNA. In DNA and protein Sequences Analysis, M. J. Bishop and C. J. Rawlings (eds), IRL Press, 169-183.

Table 1: Alignment and Null, x 3 Models v. Shuffling

  modelling  
2 x uniform2 x order 02 x order 1shuffling
alignnullalignnullalignnull1SD2SD3SD
uniform data:
10%1.40+/-.062.0 1.44+/-0.62.04+/-.01 1.51+/-.062.09+/-.01 100100100
 [100]    
20% 1.66+/-.062.0 1.70+/-.062.04+/-.01 1.76+/-.062.09+/-.01 100100100
 [100]    
30% 1.84+/-.062.0 1.88+/-.062.04+/-.01 1.94+/-.062.09+/-.01 100100100
 [100]    
40% 1.99+/-.052.0 2.03+/-.052.04+/-.01 2.09+/-.052.09+/-.01 99 93 76
 [61][39]    
50% 2.08+/-.042.0 2.12+/-.042.04+/-.01 2.17+/-.042.09+/-.01 73 43 23
 [4][96]    
unrel2.13+/-.042.0 2.17+/-.042.04+/-.01 2.23+/-.042.09+/-.01 22 5 4
 [100]    
MMf data:
10% 1.39+/-.062.0 1.16+/-.061.53+/-.06 1.23+/-.061.60+/-.06 100100100
 [100]    
20% 1.62+/-.062.0 1.36+/-.071.53+/-.07 1.43+/-.071.60+/-.07 100100100
 [99][1]    
30% 1.78+/-.052.0 1.49+/-.061.53+/-.06 1.57+/-.061.60+/-.06 100100100
 [74][26]    
40% 1.90+/-.042.0 1.60+/-.061.55+/-.06 1.67+/-.061.61+/-.05 99 92 75
 [7][93]    
50% 1.97+/-.042.0 1.66+/-.071.55+/-.07 1.73+/-.071.62+/-.06 81 51 27
 [100]    
unrel2.02+/-.032.0 1.63+/-.061.51+/-.04 1.70+/-.051.58+/-.04 24 15 1
 [100]    
MMg data:
10% 1.39+/-.062.0 1.25+/-.061.68+/-.05 1.14+/-.071.43+/-.08 100100100
  [100]   
20% 1.64+/-.062.0 1.50+/-.061.71+/-.05 1.39+/-.071.51+/-.08 100100100
  [100]    
30% 1.80+/-.062.0 1.65+/-.061.71+/-.05 1.54+/-.071.53+/-.07 100100100
  [40] [60]  
40% 1.92+/-.042.0 1.78+/-.061.73+/-.04 1.67+/-.061.57+/-.06 100 97 95
  [3] [97]  
50% 1.99+/-.042.0 1.84+/-.051.73+/-.05 1.73+/-.061.57+/-.06 92 77 47
  [100]  
unrel1.93+/-.042.0 1.80+/-.051.67+/-.04 1.58+/-.071.36+/-.06 99 93 74
  [100]  
 
Key: unrel=unrelated pairs units: bits/character
10% etc: related Pr(mutation) [##] times best, where non-zero
shuffling: # of "acceptable" alignments by margins of 1, 2 or 3 SD's
length of S1: 200.     100 trials per setting
Table 1: Alignment and Null, x 3 Models v. Shuffling

Also See: Further work on alignment of pairs of sequences and other problems in
[bioinformatics here], and D.R.Powell, L.Allison & T.I.Dix, Modelling-Alignment for Non-Random Sequences, AI2004, pp.203-214, 6-10 Dec 2004.