Introduction to read clipping

For most assembly routines to work well, they must be given data of reasonable quality. Sequences produced by machines generally contain poor-quality data at one or both ends because, as yet, the base calling software does not know when to stop. Moreover, base calling software does not produce estimates of the accuracy of each called base, which we believe would be the ideal basis for deciding which data to use in assembly. It may, however, produce sequence with an increasing number of unknown (N or -) bases. The simplest way of deciding where to clip the 3' end of the data is to measure the density of unknown bases and to stop when the value gets too high. This is easily understood but not very reliable. In our package the program clip (see section Clip) is used for this purpose.
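
The density-based idea can be illustrated with a short sketch. This is not the algorithm used by clip; it is a minimal example that assumes a fixed-size sliding window, and the function name clip_right_end and the parameters window and max_unknown are hypothetical, chosen only for illustration. The sequence is scanned from the 5' end, and it is clipped at the first unknown base of the first window that contains more than max_unknown unknown bases.

```python
def clip_right_end(seq, window=20, max_unknown=3, unknown="N-"):
    """Return the position at which to clip the 3' end; keep seq[:position].

    Illustrative sketch only: window and max_unknown are assumed values,
    not defaults taken from any real program.
    """
    seq = seq.upper()
    if not seq:
        return 0
    # Sequences shorter than the window are treated as a single window.
    window = min(window, len(seq))

    # Number of unknown bases in the first window.
    count = sum(1 for base in seq[:window] if base in unknown)

    for start in range(len(seq) - window + 1):
        if start > 0:
            # Slide the window one base to the right.
            count -= int(seq[start - 1] in unknown)
            count += int(seq[start + window - 1] in unknown)
        if count > max_unknown:
            # Density too high: clip at the first unknown base in this window.
            for i in range(start, start + window):
                if seq[i] in unknown:
                    return i
    # The density never exceeded the threshold, so nothing is clipped.
    return len(seq)


if __name__ == "__main__":
    read = "ACGT" * 10 + "ANNCNNGNTN" * 2
    cut = clip_right_end(read)
    print(cut, read[:cut])
```

An isolated unknown base inside otherwise good data does not trigger clipping, because the count within any one window stays at or below the threshold. Choosing the window size and threshold is a matter of judgement, which is one reason the method is not very reliable.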

If the data is reasonable at the 5' end and a primer sufficiently far away from the cloning site is used, then the vector clipping program vector_clip (see section Screening Against Vector Sequences) can be used to decide where to clip the left end of the sequence, and quality clipping is then required only for the right end.

An alternative way to decide where to clip the sequences is to analyse the traces they were derived from. Several groups have worked on this, including Phil Green and LaDeana Hillier, and we have recently added programs for this purpose to the package.

