Stefan Wolfsheimer

Reliability and significance of local sequence alignments

Monday, 18 May 2009, 14:00 - 15:00

Meeting room, Espace Turing

Pairwise sequence alignment is a bioinformatics tool for quantifying the similarity of two biological sequences (DNA, RNA or proteins) and for finding related segments. Most database search tools, such as BLAST, rely on score-based alignment, where the alignment is obtained by maximizing an integer-valued objective function. In some cases, however, the "true" alignment is not the optimum, and one is interested in alternative alignments. I present the model of Miyazawa, which allows one to translate a score-based objective function directly into a probabilistic model. It employs only one additional, temperature-like parameter that controls the importance of the optimum relative to the entropy. The model allows us to draw suboptimal alignments from a Boltzmann distribution and, furthermore, to assess the reliability of the optimum. We studied the so-called linear-logarithmic phase transition for protein alignment within the Miyazawa model. The location of the boundary between the two phases is essential for choosing the scoring parameters and the "temperature" so as to discriminate better between unrelated and related pairs of protein sequences. Using finite-size scaling analysis adopted from statistical mechanics and percolation theory, we determined critical values and the properties of critical fluctuations.
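As a rough illustration of how a single temperature-like parameter turns a score into a probabilistic model, the following Python sketch weights every alignment path by exp(score / T) and sums these weights into a partition function Z. It is not the implementation used in the talk: for brevity it treats global rather than local alignment, and the match/mismatch scores, the linear gap cost and all function names are assumptions made for this example.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def score(a, b, match=1.0, mismatch=-1.0):
    """Toy substitution score (assumed values, not a real scoring matrix)."""
    return match if a == b else mismatch


def alignment_dp(x, y, combine, T=1.0, gap=2.0):
    """Dynamic programming over all global alignment paths of x and y.

    combine=max          -> optimal score divided by T (use T=1 for the score)
    combine=log_sum_exp  -> log of the partition function Z at temperature T
    """
    n, m = len(x), len(y)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -gap * i / T  # leading gaps in y
    for j in range(1, m + 1):
        F[0][j] = -gap * j / T  # leading gaps in x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = combine([
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]) / T,  # aligned pair
                F[i - 1][j] - gap / T,                            # gap in y
                F[i][j - 1] - gap / T,                            # gap in x
            ])
    return F[n][m]


if __name__ == "__main__":
    x, y = "GATTACA", "GCATGCT"
    print("optimal score:", alignment_dp(x, y, max))
    for T in (0.1, 1.0, 5.0):
        print("T =", T, " T*log Z =", T * alignment_dp(x, y, log_sum_exp, T=T))
```

In the limit of small T the quantity T log Z approaches the optimal score, so the optimum dominates; at larger T the many suboptimal paths contribute increasingly to Z.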

In the second part, I discuss importance sampling methods for assessing the statistical significance of observed optimal scores, in particular of the very large scores obtained for two related sequences. The empirical results suggest that the asymptotic Karlin-Altschul theory breaks down in the far right tail of the optimal local score distribution over random sequences. The accurate test statistic is in fact less conservative than the classical one. The discrepancy between the tail distributions reflects another aspect of the linear-logarithmic phase transition.
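For orientation, the classical Karlin-Altschul approximation has a Gumbel (extreme-value) form: for random sequences of lengths m and n, P(S >= s) is approximately 1 - exp(-K m n exp(-lambda s)). The Python sketch below only evaluates this classical formula; it does not implement the importance sampling discussed in the talk. The function name is hypothetical, and the values of lambda and K in the usage lines are placeholders, not fitted parameters.

```python
import math


def karlin_altschul_pvalue(s, m, n, lam, K):
    """Classical Gumbel-type p-value for an optimal local score s.

    m, n : lengths of the two sequences
    lam, K : scale parameters of the score distribution (assumed given)
    """
    expected_hits = K * m * n * math.exp(-lam * s)
    # 1 - exp(-E), written with expm1 to stay accurate when E is tiny
    return -math.expm1(-expected_hits)


if __name__ == "__main__":
    # Placeholder parameters for illustration only.
    for s in (20, 40, 60):
        print(s, karlin_altschul_pvalue(s, m=300, n=300, lam=0.3, K=0.1))
```

Far in the right tail the expected number of hits E is tiny and the p-value is essentially equal to E, which is the regime where the abstract reports deviations from the asymptotic theory.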