For major sequencing projects one of the slowest and most labour-intensive parts of the process is resolving disagreements between readings and making the corresponding edits. Such projects use sequencing machines, and ambiguities are resolved by skilled staff comparing the traces from the conflicting readings. Generally, incorrect bases need to be changed in order to give the correct consensus. This is a key point: IT IS THE POOR DATA THAT TAKES UP THE MAJORITY OF THE TIME. Hence we needed a new method that addressed the places where most time was spent and that corrected this apparent absurdity.
We first put forward the idea of using numerical estimates of base calling accuracy in our paper describing the SCF format (Dear, S. and Staden, R. 1992. A standard file format for data from DNA sequencing instruments. DNA Sequence 3, 107-110) and then expanded on their use for editing and assembly in Bonfield, J.K. and Staden, R. 1995. The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res. 23, 1406-1410.
In Bonfield and Staden (1995) we stated "...the most useful outcome of having a sequence reading determined by a computer-controlled instrument would be that each base was assigned a numerical estimate of its probability of having been called correctly... having numerical estimates of base accuracy is the key to further automation of data handling for sequencing projects. ... The simple procedure we propose in this paper is a method of using the numerical estimates of base calling accuracy to obviate much of the tedious and time consuming trace checking currently performed during a sequencing project. In summary we propose that the numerical estimates of base accuracy should be used by software to decide if conflicts between readings require human expertise to help adjudicate. We argue that if the accuracy estimates are reasonably reliable then the majority of conflicts can be ignored... and so the time taken to check and edit a contig will be greatly reduced."
We proposed that the way to achieve this was by making consensus calculations central to the working of our sequence assembly and editing program, and that these calculations would make use of the accuracy estimates to give each consensus base a quality measure. The consensus would not be stored in the assembly project database but would be calculated when required by each function that needed it, and hence would always take into account the current data. In the Contig Editor the consensus should be updated instantly to reflect any change made by the user.
Every base in every reading is given an estimate of its accuracy (in the range 1 to 99). These values are stored in SCF files and copied into gap databases during assembly. The consensus calculation (see section The Consensus Calculation) is the heart of our strategy: it produces a consensus sequence together with a confidence value for the accuracy of each consensus base. If the confidence at any position is below a user-defined threshold, the consensus character there is set to "-", which is equivalent to "N" but easier to see.
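As a rough illustration of how a threshold turns per-base accuracy values into dashes in the consensus, the following Python sketch scores one contig column at a time. The scoring rule used here (summed accuracy of the best-supported base minus the summed accuracy of the conflicting bases, capped at 99) is an assumption made purely for the example and is not the calculation used by Gap4; the names consensus_column and consensus and the threshold of 30 are likewise hypothetical.

```python
def consensus_column(calls, threshold=30):
    """Return (consensus_char, confidence) for one contig column.

    calls is a list of (base, accuracy) pairs, one per reading that
    covers the column. The scoring rule is a toy assumption, not the
    Gap4 consensus algorithm.
    """
    support = {}
    for base, accuracy in calls:
        support[base] = support.get(base, 0) + accuracy
    if not support:
        return "-", 0  # no reading covers this column
    best = max(support, key=support.get)
    conflicting = sum(v for b, v in support.items() if b != best)
    # Agreement raises the confidence, disagreement lowers it.
    confidence = max(0, min(99, support[best] - conflicting))
    return (best if confidence >= threshold else "-"), confidence


def consensus(columns, threshold=30):
    """Build the consensus string for a list of columns."""
    return "".join(consensus_column(c, threshold)[0] for c in columns)


columns = [
    [("A", 40), ("A", 35)],  # good data in agreement
    [("C", 40), ("T", 38)],  # good data in conflict
    [("G", 5), ("G", 8)],    # agreement, but only poor data
    [("T", 45), ("G", 6)],   # poor data conflicting with good data
]
print(consensus(columns))
```

In this toy scheme a dash appears both where good readings disagree and where only poor data is available, while a weak base that conflicts with a strong one is simply outvoted and never needs to be looked at.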
The next important component in our editing strategy is the "Next Problem" function in the Contig Editor (see section Next problem). This function moves the cursor to the next position in the consensus where there is a "problem". The Contig Editor locates the next problem by applying the consensus calculation to the contig. The cursor alights on the first dash ("-") or padding character ("*") found to the right or left of the current location. To edit a contig the user selects "Next Problem" repeatedly, knowing that it will only move to places where there is a conflict between good data or where the data is poor. In this way the time-consuming impact of the poor quality data is greatly reduced.
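The search described above can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the consensus has already been computed as a string; the function name next_problem, the direction argument and the zero-based cursor are hypothetical and do not correspond to the Contig Editor's actual interface.

```python
PROBLEM_CHARS = {"-", "*"}  # dash = low confidence, star = padding


def next_problem(consensus, cursor, direction=1):
    """Return the index of the next problem character, or None.

    The search starts one position away from cursor and runs to the
    right (direction >= 0) or to the left (direction < 0).
    """
    step = 1 if direction >= 0 else -1
    pos = cursor + step
    while 0 <= pos < len(consensus):
        if consensus[pos] in PROBLEM_CHARS:
            return pos
        pos += step
    return None  # no further problems in that direction


seq = "ACG-TT*AC-G"
cursor = 0
while (cursor := next_problem(seq, cursor)) is not None:
    print(cursor, seq[cursor])
```

Repeatedly calling the function from the position it last returned visits every problem once and skips all positions where the consensus is already of sufficient confidence.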
The main result of the new strategy is that far fewer disagreements between data are brought to the attention of the user and fewer traces have to be inspected by eye, and so the whole process is faster. Another consequence of the new strategy is that, as fewer bases need changing to produce the correct consensus, most of what appears on the screen will be the original base calls. Indeed we have taken this a step further and suggest that if a base is judged to be wrong but has a high accuracy estimate, and so conflicts with other good data, then rather than change the character shown on the screen, the user should lower its accuracy value. By so doing more of the original base calls are left unchanged and hence are visible to the user. There is a function within the contig editor to reset the accuracy value for the current base to 0. Alternatively the accuracy value for the base that is thought to be correct can be set within the contig editor to 100.
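The effect of editing accuracy values rather than base characters can be illustrated with a small Python sketch. It assumes a toy column score (support for the best base minus support for all others, compared with a threshold of 30) and assumes that a value of 100 forces the consensus while a value of 0 contributes nothing; these rules and the names column_consensus and set_accuracy are assumptions for the example, not a description of the Gap4 implementation.

```python
def column_consensus(calls, threshold=30):
    """Toy consensus for one column of (base, accuracy) pairs."""
    forced = {base for base, acc in calls if acc == 100}
    if forced:
        # Assumed rule: a single base type marked 100 wins outright,
        # two different base types marked 100 remain a problem.
        return forced.pop() if len(forced) == 1 else "-"
    support = {}
    for base, acc in calls:
        support[base] = support.get(base, 0) + acc
    if not support:
        return "-"
    best = max(support, key=support.get)
    margin = 2 * support[best] - sum(support.values())
    return best if margin >= threshold else "-"


def set_accuracy(calls, index, value):
    """Return a copy of calls with one accuracy changed; bases are kept."""
    edited = list(calls)
    edited[index] = (edited[index][0], value)
    return edited


column = [("A", 45), ("A", 40), ("G", 60)]
print(column_consensus(column))                        # conflict
print(column_consensus(set_accuracy(column, 2, 0)))    # G lowered to 0
print(column_consensus(set_accuracy(column, 0, 100)))  # A raised to 100
```

In both edited versions the three base characters are exactly as they were called, so the original data stays visible even though the consensus is now resolved.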
As mentioned above, a natural and important outcome of the new strategy, though one which may necessitate more, rather than less, work, is that the "Next Problem" function will find not only places where there are disagreements between good readings, but also places where there is no data of sufficient accuracy. Previously some groups may have overlooked such regions if they relied only on locating regions of conflict and were satisfied if the sequence was covered by data on both strands.
We have been urging the manufacturers of sequencing instruments to provide these values for several years, and a number of independent groups have started to develop their own base calling algorithms which will also output numerical estimates of base accuracy. Gap4 has been tested using the accuracy values output from the base calling program Phred (Ewing, B. and Green, P. 1998. Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Res. 8, 186-194).