first previous next last contents

The Consensus Calculations

There are currently three distinct consensus algorithms that may be used. The choice of the best algorithm will depend on the data that you have available and the purpose for which you are using Gap4.

If you have only base calls, but without any confidence values, then it is best to use method 1 (frequency fractions). This is also a fast algorithm so it may be appopriate for very high depth assemblies or for mutation studies.

If you have confidence values in addition to your base calls then either use method 2 (weighted frequencies) or method 3 (probabilistic).

In all three algorithms, a base with confidence 100 (if you do not have any confidence values Gap4 will have assigned the bases a confidence of 99) is used to force the consensus base to that base type. Additionally the consensus base will have a confidence of 100. If two or more conflicting sequence bases at the same consensus base have a confidence of 100 the consensus base will always be a dash and will have a confidence of 0.

The consensus sequence also has a confidence, even when sequence confidences are not available. The scale and meaning of the consensus confidence changes between consensus algorithms. However the consensus cutoff parameter always has the same meaning. A consensus base with a confidence 'X' will be called as a dash when 'X' is lower than the consensus cutoff, otherwise it is the determined base type.

Both the consensus cutoff and quality cutoff values can be set by using the "Configure cutoffs" command (in the Options menu) of Gap. Within the Contig Editor (see section Editing in gap4) these values can be adjusted by clicking on the "<" and ">" symbols adjacent to the "C:" (consensus cutoff) and "Q:" (quality cutoff) displays in the top left corner of the editor. These buttons are repeating buttons - the values will adjust for as long as the left mouse button is held down. Changing these values lasts only as long as that invocation of the contig editor.

Method 1: Frequency Fractions

In this method each standard base type is given the same weight. The consensus will be the most frequent base type in a given column provided that the consensus cutoff parameter is low enough. All unrecognised base types, including Staden codes and IYB codes are treated as dashes. Dashes are given a weight of 1/10th that of recognised base types. Pads are given a weight of the average of its neighbouring bases.

The confidence of a consensus base for this method is 100 multiplied by the frequency of the consensus base type occurring divided by the number of bases. So for example a column of bases of A, A, A and T will give a consensus base of A and a confidence of 75. Therefore a consensus cutoff of 76 or higher will give a consensus base of `-'.

In the event that more than one base type has the same confidence and this exceeds the consensus cutoff the bases are assigned in descending order of precedence: A, C, G and T.

The quality cutoff parameter has no effect on this method other than as an indicator to choose method 2. A quality cutoff of -1 will use the unweighted Frequency Fractions method whilst a quality value of 0 or higher will use the Weighted Frequency Fractions method.

Method 2: Weighted Frequency Fractions

NOTE: This is still work in progress and the precise details are likely to change.

If we have confidence values for each base we can sum together the confidence values instead of just counting the frequencies. Hence a column of 4 bases A, A, A and T with confidence values 10, 10, 10 and 50 would give combined totals of 30/80A and 50/80T (compared to 3/4A and 1/4T when using frequencies). As with the unweighted frequency method this sets the confidence value of the consensus base to be the the fraction of the chosen base type weights over the total weights (62.5 in the above example).

The quality cutoff parameter controls which bases are used in the calculation. Only bases with quality values less than or equal to the quality cutoff are used, otherwise they are completely ignored and have no effect on either the base type chosen for the consensus or the consensus confidence value. In the above example setting the quality cutoff to 20 would give a T with 100 confidence (100 * 50/50). Setting quality cutoff to -1 switches to the unweighted frequency fraction method.

This is Rule IV of Bonfield,J.K. and Staden,R. The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Research 23, 1406-1410 (1995).

Method 3: Probabilistic

This consensus algorithm is based on the ideas that are put forward in Bonfield, J.K. and Staden, R. The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res. 23, 1406-1410 (1995) (also see Dear, S. and Staden, R. A standard file format for data from DNA sequencing instruments. DNA Sequence 3, 107-110 (1992)) and is currently scaled to use Phred (Gorden, D. Abajian, C. and Green, P. Base-Calling of Automated Sequencer races Using Phred. II. Error Probabilities. Genome Research. Vol 8 no 3. 186-194 (1998)) confidence values.

If the sequences have confidence values that are proportional to -log10(error_rate) then this consensus method will give the most reliable results. Additionally it is the only method where the consensus confidence also has a specific meaning; the expected error rate for the consensus. This algorithm was developed using values output by Phred. These have the scale of -10 * log10(error_rate). So we expect 1 base out of every 100 with a confidence value of 20 to be incorrect. Combining these confidence values together can give us the most probable consensus base combined with a probability of it being correct.

The method used firstly works column by column to derive the consensus base. The sequences within a particular column are categorised by their chemistry and strand. We currently group them into "top strand dye primer", "top strand dye terminator", "bottom strand dye primer" and "bottom strand dye terminator" classes.

Within each class there may be no sequences or many sequences. For sequences in the same class we check for multiple sequences with the same base type. We chose the confidence for the highest base type and add on an additional confidence per extra sequence with the same base type. This is done for each of the four base types. Then we use Bayesian rules to derive the probabilities for each base type.

To further describe the method it is easiest to manually work through an example alignment. In our example we have 5 sequences in one column with the following characteristics.

Dye primer, top strand,        'A', confidence 20
Dye primer, top strand,        'A', confidence 10
Dye primer, top strand,        'T', confidence 20
Dye terminator, top strand,    'T', confidence 10
Dye primer, bottom strand,     'A', confidence 5

So we have three possible sets. Examining the "dye primer top strand" set we see there are three sequences (A, A and T). The highest A is 20. We add to this a fixed quantity to indicate one other occurance of an A in this set. For this example I'll add 5. So we then have confidence 25 for A and confidence 20 for T. This is equivalent to a .997 (approximated for this example) probability of A being correct and .99 probability of T being correct. This conflict is resolved by the following table.

  |   A     C     G     T
--+-----------------------
A | .997  .001  .001  .001
T | .0033 .0033 .0033 .990

A has a probability of .997 and so the remaining .003 is spread amongst the other base types. Similarly for the .01 of the T. Bayesian calculations on this table then give us probabilities of approximately .766A, .00154C, .00154G and .231T.

The other sets gives us probalities of .033 A, C, G and .9 T, and .316A, .228 C, G and T.

To combine the sets together we produce a table for further Bayesian calculation. Once again we fill in the probabilities and spread the remainder evenly amongst the other base types.

           |   A      C      G     T
-----------+--------------------------
Primer Top | .766  .00154 .00154 .231
Term   Top | .0333 .0333  .0333  .9
Primer Bot | .316  .228   .228   .228

This gives us final probabilities of .135A, .0002C, .0002G and .854T. This is intuitively what we'd hope for in that the T signal was present in both dye primer and dye terminator experiments with 1/100 and 1/10 errors rates whilst the A signal was present on both strands with 1/100 and 1/3 error rates. Both method 1 and method 2 would call this an A (just). So the consensus base with method 3 is a T with the consensus confidence at 8.4 (-10*log10(1-.854)).

In the case of pads the calculations are less clear. If a pad is present in a column we consider the pad as a separate base type and then evenly divide our remaining probabilities by 4 instead of 3.

It is clear that having confidence values for all four nucleotides at each base call could significantly improve the consensus reliability. However at present these are not available.

To look for poorly defined consensus bases the contig editor has display modes to highlight bad bases with a dark background and has search modes to look for low confidence consensus and high confidence sequence disagreements. FIXME: add cross reference.

The Quality Calculation

The quality and the consensus calculations are really one and the same. The difference is simply that the output produced by the quality calculation is a measure of how reliable the consensus produced is. This quality is used as the basis for problem searches, such as find next problem, and the quality display within the Template Display (see section Template Display).

One mode of operation for both the quality and consensus calculations is deriving information for each strand separately. This method is useful for comparing one strand with another to check that the sequence has been reliably determined (consistently) on both strands. Setting the configuration (using the main "Configure Cutoffs" command) to treat sequences flagged using the "Special Chemistry" Experiment File line (CH field, bit 0) affects this calculation. When set, the sequence is used in the consensus and quality calculations for both strands, and hence is equivalent to having data on both strands. Currently these parameters have no effect on the probabilistic consensus algorithm as this already takes into account the chemistry and strand information in determining the consensus base.


first previous next last contents
This page is maintained by James Bonfield. Last generated on 2 Febuary 1999.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/gap4_121.html