If the probabilistic consensus algorithm is used, it is possible to give the expected number of errors in a particular consensus sequence. This is produced by simply summing the error probabilities of the individual bases.

Each confidence value has a known error rate, given by the formula 10^(-confidence / 10.0). We also know how often each confidence value occurs in the consensus sequence, and hence the expected number of errors for each confidence value. Working on the assumption that we are likely to check and fix the consensus bases with the lowest confidence values first, this allows us to report the cumulative number of errors that we would fix by checking every consensus base with a confidence value at or below a particular threshold.
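As a minimal sketch of the formula above (not Gap4's own code), the following Python converts a confidence value into an error probability and multiplies it by the number of bases holding that value. The function names and the example counts are hypothetical, chosen only for illustration.

```python
def error_probability(confidence):
    """Error rate implied by a confidence value: 10^(-confidence / 10.0)."""
    return 10 ** (-confidence / 10.0)


def expected_errors(frequencies):
    """Map each confidence value to its expected number of errors.

    `frequencies` maps a confidence value to the number of consensus
    bases that have that value (hypothetical input format).
    """
    return {conf: count * error_probability(conf)
            for conf, count in frequencies.items()}


# Hypothetical counts: 5 bases at confidence 10, 200 bases at confidence 20.
example = expected_errors({10: 5, 20: 200})
for conf in sorted(example):
    print(f"confidence {conf}: {example[conf]:.2f} expected errors")
```

Summing the values of the returned dictionary gives the expected number of errors over all the bases counted.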

The List Confidence option, in the View menu, provides this ability. The dialogue simply allows selection of one or more contigs. Pressing OK then produces a table similar to the following:

Sequence length = 164068 bases.
Expected errors = 168.80 bases (1/971 error rate).

Value   Frequencies   Expected   Cumulative    Cumulative   Cumulative
                      errors     frequencies   errors       error rate
--------------------------------------------------------------------------
  0          0          0.00          0           0.00        1/971
  1          1          0.79          1           0.79        1/976
  2          0          0.00          1           0.79        1/976
  3          3          1.50          4           2.30        1/985
  4         30         11.94         34          14.24        1/1061
  5          2          0.63         36          14.87        1/1065
  6        263         66.06        299          80.94        1/1867
  7        151         30.13        450         111.06        1/2841
  8        164         25.99        614         137.06        1/5168
  9         96         12.09        710         149.14        1/8344
 10         80          8.00        790         157.14        1/14069

The above table tells us that we have 164068 bases in our consensus sequences with an expected 169 errors (giving us an average error rate of one in 971). Next it lists each confidence value along with the frequency of this value and the expected number of errors. For any particular confidence value, the cumulative columns tell us how many bases in the sequence have that confidence or lower, and how many errors are expected in those bases. The final column gives the expected error rate that would remain if all of these bases were checked and all of their errors fixed.

In the above table we see that there are 790 bases with confidence values of 10 or less. We expect there to be 157 errors in those 790 bases. As we expect there to be about 169 errors in total, this implies that manually checking those 790 bases would leave only 12 undetected errors. Given that the sequence length is 164068 bases, this means an average error rate of 1 in 14069. Note that this error rate could be achieved by checking only 0.48% of the total number of consensus bases. In this particular example, editing the same sequence with a 100% consensus cutoff using either of the frequency-based consensus methods would require checking 25165 bases (15.34%), although the overall error rate would be better.
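The arithmetic behind the cumulative columns can be sketched as follows. This is an illustrative reconstruction in Python, not the code Gap4 itself runs. The frequencies, sequence length, and total expected errors are copied from the example table; the constant and function names are hypothetical. The total (168.80) has to be supplied separately because it also covers confidence values above 10, which the table does not list.

```python
SEQUENCE_LENGTH = 164068
TOTAL_EXPECTED_ERRORS = 168.80  # from the table header; covers all confidence values

# Confidence value -> number of consensus bases, from the example table.
FREQUENCIES = {0: 0, 1: 1, 2: 0, 3: 3, 4: 30, 5: 2,
               6: 263, 7: 151, 8: 164, 9: 96, 10: 80}


def cumulative_rows(frequencies, total_errors, length):
    """Return one tuple per confidence value, lowest confidence first:
    (value, frequency, expected errors, cumulative frequency,
     cumulative errors, N in the remaining 1/N error rate)."""
    rows = []
    cum_freq = 0
    cum_err = 0.0
    for conf in sorted(frequencies):
        count = frequencies[conf]
        errors = count * 10 ** (-conf / 10.0)
        cum_freq += count
        cum_err += errors
        # Errors still expected after checking every base at or below `conf`.
        remaining = total_errors - cum_err
        rate = length / remaining if remaining > 0 else float("inf")
        rows.append((conf, count, errors, cum_freq, cum_err, rate))
    return rows


rows = cumulative_rows(FREQUENCIES, TOTAL_EXPECTED_ERRORS, SEQUENCE_LENGTH)
for conf, count, errors, cum_freq, cum_err, rate in rows:
    print(f"{conf:5d} {count:8d} {errors:9.2f} {cum_freq:8d} {cum_err:9.2f}   1/{int(rate)}")

# Fraction of the consensus that must be checked to reach the last row.
percent_checked = 100.0 * rows[-1][3] / SEQUENCE_LENGTH
print(f"bases to check: {rows[-1][3]} ({percent_checked:.2f}%)")
```

Because the total is taken from the header rounded to two decimal places, the final column can differ from the table by a few units in the last rows, where the remaining error count is small.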

URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/gap4_122.html