
Introduction

The Gap contig editor is designed to allow rapid checking and editing of characters in assembled readings. Very large savings in time can be achieved by sophisticated problem-finding procedures which automatically direct the user only to the bases that require attention.

The key to rapid editing is "on the fly" consensus calculation (Bonfield, J.K. and Staden, R. The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Research 23, 1406-1410 (1995)). That is, the contig editor employs the user's chosen consensus algorithm when scanning through the sequence to locate positions that require visual inspection. Ideally each gel reading base should have an accuracy or confidence value assigned prior to assembly, and these values should be employed by the consensus algorithm. Then the editor can locate all consensus positions where the consensus accuracy falls below a given threshold. This is the ideal situation and provides the fastest method of checking. However, even if confidence values are not available, or are not thought to be sufficiently reliable, using an appropriate consensus calculation algorithm as the basis of the search for problems in an assembly is still the best way of ensuring that the final sequence meets the required accuracy standards.

Gap4 includes several consensus calculation algorithms and a number of different ways of searching within the contig editor for doubtful bases in the sequence. The consensus algorithm should allow the user to specify the minimum requirement for a definite base to appear in the consensus (including whether the two strands should both independently be assigned the same base type). Once that is done the user can ignore all positions in the sequence except those where the consensus algorithm cannot assign A, C, G or T.

We are not in favour of "automatic editing" in which a program finds and changes base calls that look doubtful. Doubts about bases should be reflected in their confidence values, and these values used to determine the consensus. In keeping with this philosophy, if, for example, a program to detect compressions were developed independently of the program that assigned confidence values, then we believe that it should alter the relevant confidence values, and not the base calls.
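The idea of locating consensus positions whose accuracy falls below a threshold can be illustrated with a minimal Python sketch. This is not one of the Gap4 consensus algorithms: the scoring rule (the winning base's share of the summed confidence), the 0 to 100 confidence scale, the default cutoff and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

def call_consensus(column, cutoff=65.0):
    """Return (base, score) for one alignment column.

    column: list of (base, confidence) pairs, confidence assumed in 0..100.
    score:  percentage of the summed confidence held by the winning base
            (an illustrative rule, not the Gap4 calculation).
    A "-" is returned when no base reaches the cutoff.
    """
    totals = defaultdict(float)
    for base, conf in column:
        totals[base] += conf
    total = sum(totals.values())
    if total == 0:
        return "-", 0.0
    best = max(totals, key=totals.get)
    score = 100.0 * totals[best] / total
    return (best if score >= cutoff else "-"), score

def problem_positions(columns, cutoff=65.0):
    """Indices of columns where no definite base could be assigned."""
    return [i for i, col in enumerate(columns)
            if call_consensus(col, cutoff)[0] == "-"]

# Hypothetical three-column alignment: only the middle column is doubtful.
columns = [
    [("A", 90), ("A", 80)],
    [("C", 50), ("T", 45)],
    [("G", 70), ("G", 60), ("T", 20)],
]
doubtful = problem_positions(columns)
```

Under these assumptions a user would only need to inspect the positions listed in `doubtful`; every other position has a definite consensus base.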

The contig editor's searches can make use of base accuracy or confidence values, usually by using an "on the fly" consensus calculation, or can find positions where two apparently high quality base calls conflict. Also, each strand of the sequence can be treated independently by the on the fly consensus calculation. In this case searches will reveal positions where the consensus is only reliable on one strand or where the two strands disagree. Usually these searches are performed in a mode in which the program will automatically display the relevant traces for the position where each problem is located. Colour coding of the characters in the editor window to depict their individual confidence values also provides a further indication of which bases need checking. Moreover, each edited position can be made visible by switching on a display mode which highlights edits using colour codes to distinguish between replacements/insertions (pink), deletions (red), padding characters (green) and confidence value changes (purple).
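The two kinds of check described here, strand-by-strand reliability and conflicts between high quality base calls, might be sketched as follows. The data layout (base, confidence, strand), the threshold of 65 and the function names are hypothetical and are not taken from Gap4.

```python
def strand_status(column, min_conf=65):
    """Classify one column by treating each strand independently.

    column: list of (base, confidence, strand) with strand "+" or "-".
    On each strand, only the most confident call at or above min_conf
    is considered (an illustrative simplification).
    """
    best = {}
    for base, conf, strand in column:
        if conf < min_conf:
            continue
        if strand not in best or conf > best[strand][1]:
            best[strand] = (base, conf)
    if not best:
        return "unreliable"
    if len(best) == 1:
        return "one strand only"
    return "ok" if best["+"][0] == best["-"][0] else "strands disagree"

def high_quality_conflict(column, min_conf=65):
    """True when two calls at or above min_conf name different bases."""
    good_bases = {base for base, conf, _ in column if conf >= min_conf}
    return len(good_bases) > 1

# Hypothetical columns illustrating each outcome.
agree = strand_status([("A", 90, "+"), ("A", 80, "-")])
differ = strand_status([("A", 90, "+"), ("C", 85, "-")])
single = strand_status([("A", 90, "+"), ("A", 30, "-")])
```

A search built on such checks would stop at every column whose status is not "ok", which is where the relevant traces would be displayed.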

Another search allows a final scan of the data to check that the edits that have been made are supported by the original data. For every edited position, this search checks that the consensus base type appears in at least one of the readings covering that position. Again the check can either require support from both strands at each position or from either one of the two.
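The support check can be expressed compactly. The sketch below assumes that, for each edited position, the original (unedited) base calls are available as (base, strand) pairs; that representation and the function name are illustrative only.

```python
def edit_is_supported(consensus_base, original_calls, both_strands=False):
    """Check that the consensus base appears in the original data.

    original_calls: list of (base, strand) pairs taken from the unedited
                    readings covering the position, strand "+" or "-".
    both_strands:   when True, require a supporting call on each strand;
                    otherwise one supporting call on either strand is enough.
    """
    supporting = {strand for base, strand in original_calls
                  if base == consensus_base}
    if both_strands:
        return {"+", "-"} <= supporting
    return bool(supporting)

# Hypothetical position: "A" is seen on the forward strand only.
calls = [("A", "+"), ("C", "-")]
any_strand = edit_is_supported("A", calls)
two_strands = edit_is_supported("A", calls, both_strands=True)
```

Here `any_strand` is true and `two_strands` is false, so the position would be reported only by the stricter form of the search.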

The contig editor is mouse-driven and can insert, delete and change characters in readings. It has facilities to display the traces for data from fluorescent sequencing machines and for annotation of readings. In addition it allows the poor quality data from the ends of readings to be viewed and, if required, added to the sequences. The quality or confidence value for each base can be shown in grey scales: light is good quality, dark poor quality.
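As a rough illustration of the grey scale display, a confidence value can be mapped linearly to a grey level so that higher confidence gives a lighter shade. The 0 to 100 range, the linear mapping and the hex colour output are assumptions; the shades actually used by the editor may differ.

```python
def confidence_to_grey(conf, max_conf=100):
    """Map a confidence value to a grey colour: light is good, dark is poor.

    Values outside 0..max_conf are clamped. The result is an "#rrggbb"
    string with equal red, green and blue components.
    """
    conf = max(0, min(conf, max_conf))
    level = round(255 * conf / max_conf)
    return "#{0:02x}{0:02x}{0:02x}".format(level)

best_shade = confidence_to_grey(100)   # lightest shade
worst_shade = confidence_to_grey(0)    # darkest shade
```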

[picture]

The figure shows a typical display from the contig editor. The principal components are the controls at the top; reading names on the left; sequences on the right; and status lines at the bottom.

Along the top of the editor window there is a row of labelled command buttons and menus which are activated by mouse clicks. The rightmost of these displays help. Next to this is the button to exit from the editor. To its left are two menus, one of which is currently in use. To the left of these menus is a button that displays a search dialogue. Pressing it again will then perform the selected search. Next to the left is the undo button: each time the user clicks on this button the program reverses the previous edit command. The next button, which is shown adjacent to a dark box and is hence active, is the cutoffs button. The effect of this button is to toggle between showing and hiding the reading data that is of poor quality or is vector sequence. As can be seen, the ends of some of the aligned sequences are shown in lighter lettering. If cutoffs were not activated these segments of the readings would not be displayed. The next command to the left is the Edit Modes menu, which allows users to select which editing commands they want to enable. The next command toggles between insert and replace and so governs the effect of typing in the edit window.

Next are two pairs of repeater buttons for changing the consensus calculation thresholds. Clicking on a > box will increase the value to the left; clicking on a < will decrease the corresponding value. The left pair (labelled C) controls the overall consensus cutoff, and the right pair (labelled Q) controls the quality cutoff. The precise effect of these cutoffs varies depending on the consensus algorithm used. To change the consensus algorithm use the 'Configure Cutoffs' option in the main Gap4 Configure menu. See section Options Menu. For information on the available consensus algorithms themselves see section The Consensus Calculation.

In the strip below the row of labelled buttons is a horizontal scrollbar and four arrows (<<, <, >, >>) for movement along the contig. Along the top of the aligned sequences are numbers showing the position in the consensus. Below this are separate lines showing the readings that cover this segment of the contig. To the left are the reading numbers, negative numbers indicating those in the reverse orientation to their original readings. Next to these are the reading names. The bottom sequence is the consensus for the readings aligned above. The reading names may be longer than can be displayed in the names display. In this case the horizontal scrollbar just above the names may be used.

Within the readings and on the consensus, annotated segments are shown in colour (here greyed in). One of the reading names has been highlighted by the user and the two others, whose names and numbers are slightly lighter, are the ones whose traces are being displayed (not shown in this picture). Some of the nucleotides are shown in red. These all have quality values less than the quality cutoff, which is 65 in this figure.

Beneath the consensus line are several status lines. The topmost of these details the number of strands this sequence has been computed from. At the leftmost section of the screen we can see that the status line reflects that we only have one reading, in the reverse orientation, covering this region. Below the strand status line are the amino acid translations for each reading frame. Each translation can be shown or removed as wished.

Finally at the bottom of the editor is a status line giving summary details of whatever the mouse is moved over. The contents of this line can be configured by modifying your `.gaprc' file. In this case it contains information on reading number 45.

[picture]

The example above shows a screen dump for data where the base quality values are shown as grey scales: the lighter the colour the better the data. Also activated in this example are the modes that show disagreements between individual bases in readings and the consensus in dark green and edits made in light green. Note that both the readings and consensus are coloured in grey. The dark consensus bases are low quality. The strands status line also backs this up.


This page is maintained by James Bonfield. Last generated on 2 February 1999.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/gap4_45.html