
General Comments and Tips on Assembly

The program has several methods for assembly and it may not be obvious which is most appropriate for a given problem. The following notes may help. They also contain information on methods for checking the correctness of an assembly.

If you have access to an external program that can generate the order and approximate positions of readings then Directed Assembly can be used. The same is true if the experimental method used generates an ordered set of readings (see section Directed Assembly).

If you have access to an external global assembly program that can produce an assembly and write out correct experiment files then Enter Preassembled Data can be used (see section Enter pre-assembled data).

For routine shotgun assembly of whole data-sets or incremental data-sets Normal Shotgun Assembly can be used. Through the idea of "Masked assembly" this option can also restrict the assembly to particular regions of the consensus (see section Normal shotgun assembly).

If you have a batch of readings that are known to overlap one another, but which, due to repeats, may also match other places in the consensus, then it can be helpful to use Assemble Independently. This will ensure that the readings in the batch are compared only to one another, and hence will not be assembled into the wrong places (see section Assemble independently).

Almost all readings are assembled automatically in their first pass through the assembly routine. Those that are not can be dealt with in two ways. Either they can be put through assembly again with less stringent parameters, or they can be entered using the "Put all readings in new contigs" routine and then joined to the contig they overlap using Find Internal Joins (see section Find Internal Joins). If it is found that readings are not being assembled in their first pass through the assembler, then it is likely that the contigs require some editing to improve the consensus. It may also be that poor quality data is being used, possibly because users are over-interpreting films or traces. In the long term it can be more efficient to stop reading early and save time on editing. For those using fluorescent sequencing machines the unused data can be incorporated after assembly using the Contig Editor and Double Strand.

Our methods for assembling Alu containing sequences are experimental. We apply repe (see section Repe) to produce a list of Alu containing readings and a list of Alu free readings. We sort these lists using the UNIX sort function prior to assembly. The Alu free list (with file name pass) is sorted into ascending order on Alu match score using "sort -n +1 -o pass.sort pass" to produce the file pass.sort; and the Alu list (with file name fail) is sorted into descending order on the amount of non-Alu sequence at either end of the readings using "sort -n -r +3 -o fail.sort fail" to produce the file fail.sort.
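The "+1" and "+3" arguments in the sort commands above use the older zero-based field syntax, which many current sort implementations no longer accept by default (they may treat "+1" as a file name). The following is a minimal sketch of the same two sorts written with the POSIX "-k" option, where "+1" corresponds to "-k 2" and "+3" to "-k 4". The sample file contents are invented for illustration; the real column layout of the repe output is not described here, so the only assumption carried over from the commands is that the sort keys sit in the second and fourth whitespace-separated fields.

```shell
# Work in a scratch directory so that no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical Alu free list: reading name, Alu match score (assumed layout).
printf '%s\n' 'read_a 12' 'read_b 3' 'read_c 7' > pass

# Hypothetical Alu list: field 4 is assumed to hold the amount of
# non-Alu sequence; fields 2 and 3 are placeholders.
printf '%s\n' 'read_x 40 1 120' 'read_y 55 1 310' 'read_z 38 1 45' > fail

# Equivalent of "sort -n +1 -o pass.sort pass":
# ascending numeric sort starting at field 2.
sort -n -k 2 -o pass.sort pass

# Equivalent of "sort -n -r +3 -o fail.sort fail":
# descending numeric sort starting at field 4.
sort -n -r -k 4 -o fail.sort fail
```

With this sample data, pass.sort begins with the lowest scoring reading and fail.sort begins with the reading that has the most non-Alu sequence.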

The Alu free data is assembled first using the file of file names pass.sort. Then the Alu containing data is assembled using the file of file names fail.sort. It is useful to assemble at high stringency (i.e. using a low percentage mismatch); to screen the full length of each reading prior to assembly (see section Screen only); and to sort the resulting file into ascending order on percentage mismatch. It is also useful to use the "Check Assembly" function (see section Check Assembly).

An independent and important check on assembly is obtained by sequencing both ends of templates. Provided the correct information is given in the experiment files, gap can check the positions and orientations of readings from the same template (see section Find read pairs). Any inconsistencies are shown both textually and graphically. In addition this information can be used to find possible joins between contigs.
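The idea behind the read pair check can be illustrated with a small sketch. This is not how gap performs the check, and the input table is a hypothetical layout invented for the example (template, reading, contig, start position, strand); it is not a gap file format. The sketch only compares contig and orientation for the two readings of each template and ignores the distance between them, which a fuller check would also need to consider.

```shell
# Work in a scratch directory so that no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical table: template, reading, contig, start position, strand.
cat > pairs.txt <<'EOF'
t1 t1.f c1 100 +
t1 t1.r c1 1400 -
t2 t2.f c1 500 +
t2 t2.r c1 900 +
t3 t3.f c1 2000 +
t3 t3.r c2 150 -
EOF

awk '
  # First reading of a template: remember its contig and strand.
  !($1 in contig) { contig[$1] = $3; strand[$1] = $5; next }
  # Second reading: compare it with the first.
  {
    if ($3 != contig[$1])      status = "spans_contigs"   # possible join
    else if ($5 == strand[$1]) status = "inconsistent"    # same strand
    else                       status = "consistent"      # opposite strands
    print $1, status
  }
' pairs.txt > pairs.report

cat pairs.report
```

In this sample, template t2 is reported as inconsistent because both of its readings lie on the same strand, and t3 is reported as spanning two contigs, which is the kind of evidence that can suggest a join.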


This page is maintained by James Bonfield. Last generated on 2 February 1999.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/gap4_29.html