
Algorithms

To locate sequencing vector, the program uses a dynamic programming algorithm with two percentage-match cutoffs: one for the 5' end and another for the 3' end. Both searches include the poor quality data at the ends of the readings. This mode writes the SL and SR records in experiment files.
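The idea of a dynamic programming search scored as a percentage match can be illustrated with a minimal sketch. This is not the vector_clip implementation: the function name best_percent_match, the unit edit costs, and the definition of the percentage as 100 * (length - edits) / length are all assumptions made for illustration.

```python
def best_percent_match(segment, reading):
    """Find the best approximate occurrence of `segment` anywhere in `reading`.

    Illustrative only: unit costs for mismatch, insertion and deletion are an
    assumption, as is the percentage definition used for the result.
    Returns (percent_match, end), where `end` is the position in `reading`
    just after the first best-scoring occurrence.
    """
    m = len(segment)
    if m == 0:
        raise ValueError("segment must not be empty")
    # Row 0 is all zeros: an occurrence may start anywhere in the reading.
    prev = [0] * (len(reading) + 1)
    for i in range(1, m + 1):
        cur = [i] + [0] * len(reading)
        for j in range(1, len(reading) + 1):
            cost = 0 if segment[i - 1] == reading[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or mismatch
                         prev[j] + 1,         # segment base missing from reading
                         cur[j - 1] + 1)      # extra base in reading
        prev = cur
    edits = min(prev)
    end = prev.index(edits)
    return 100.0 * (m - edits) / m, end
```

A caller would compare the returned percentage with the cutoff chosen for the end being searched, and use the end position to decide where the vector stops.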

If the user selects the vector_primer file mode of vector_clip, the program searches each reading for all of the forward and reverse sequence segments in the vector_primer file and notes the one that matches best. If this match is above the user-defined threshold, the experiment file is modified accordingly and the reading is searched again for the corresponding sequence from the other side of the cloning site (in order to look for matches at the reading's 3' end). Again, if the user-defined threshold is reached, the experiment file is modified accordingly. This mode writes the SL and SR records in experiment files.
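The control flow of this mode (try every segment, keep the best, then look for its partner) can be sketched as follows. Everything here is hypothetical: the (name, forward, reverse) tuple layout does not reflect the real vector_primer file format, the ungapped percentage identity stands in for the program's own scoring, and the threshold is treated as "reached" when the score is greater than or equal to it.

```python
def ungapped_percent(segment, reading):
    """Best percentage identity of `segment` against any same-length window
    of `reading` (a simple stand-in scoring function, no gaps allowed)."""
    m = len(segment)
    best = 0.0
    for start in range(len(reading) - m + 1):
        window = reading[start:start + m]
        same = sum(a == b for a, b in zip(segment, window))
        best = max(best, 100.0 * same / m)
    return best


def choose_primer_pair(reading, pairs, threshold):
    """`pairs` is a list of (name, forward_segment, reverse_segment) tuples,
    a hypothetical layout chosen for this sketch.

    Returns (name, side, other_side_found), or None when no segment
    reaches the threshold.
    """
    candidates = []
    for name, fwd, rev in pairs:
        candidates.append((ungapped_percent(fwd, reading), name, "forward", rev))
        candidates.append((ungapped_percent(rev, reading), name, "reverse", fwd))
    if not candidates:
        return None
    # Keep only the single best-matching segment over all entries.
    score, name, side, other = max(candidates, key=lambda c: c[0])
    if score < threshold:
        return None
    # Search again for the segment from the other side of the cloning site.
    other_found = ungapped_percent(other, reading) >= threshold
    return name, side, other_found
```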

To locate cloning vector, two algorithms are available, both of which use hashing. The original method needs a "Word length" (word_length), the "Number of diagonals to combine" (num_diags) and a "Cutoff score" (diagonal_score). The word length is the minimum number of consecutive bases that will count as a match. The algorithm treats the problem like a dot matrix comparison. First it finds all matches of length word_length; then it locates the diagonal with the highest normalised score; then it adds the scores for the adjacent diagonals (num_diags). If the combined score is at least "diagonal_score", the experiment file is updated to indicate the location of the vector sequence. The score represents the proportion of a diagonal that contains matching words, and the maximum score for any diagonal is 1.0. This mode writes the CS records in experiment files. If the whole reading is cloning vector, this mode writes a PS record containing "all cloning vector".
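The hashing and diagonal scoring described above can be sketched in a few lines. This is an illustration under stated assumptions, not the program's code: the score is taken to be the number of matching words on a diagonal divided by the number of word positions on it, and "adjacent diagonals" is read as a window of roughly num_diags diagonals centred on the best one. The function names are invented for the example.

```python
from collections import defaultdict


def diagonal_scores(reading, vector, word_length):
    """Return {diagonal: score}; diagonal = reading position - vector position.

    Assumed normalisation: matching words on the diagonal divided by the
    number of word start positions on that diagonal, so the maximum is 1.0.
    """
    # Hash every word of the vector to the positions where it starts.
    index = defaultdict(list)
    for j in range(len(vector) - word_length + 1):
        index[vector[j:j + word_length]].append(j)

    # Count word hits per diagonal, as in a dot matrix comparison.
    hits = defaultdict(int)
    for i in range(len(reading) - word_length + 1):
        for j in index.get(reading[i:i + word_length], ()):
            hits[i - j] += 1

    scores = {}
    for d, n in hits.items():
        length = min(len(reading), len(vector) + d) - max(0, d)
        scores[d] = n / (length - word_length + 1)
    return scores


def combined_score(scores, num_diags):
    """Pick the best diagonal and add the scores of its neighbours.

    The centred window is an assumption about how num_diags is applied.
    """
    if not scores:
        return None, 0.0
    best = max(scores, key=scores.get)
    half = num_diags // 2
    total = sum(scores.get(d, 0.0) for d in range(best - half, best + half + 1))
    return best, total
```

The combined score returned here is what would be compared with diagonal_score.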

A newer method also hashes using "word_length" consecutive bases and accumulates the hits for each diagonal, but instead of using a score cutoff, it decides if there is a match using a probability threshold "P" supplied by the user. For each length of diagonal, vector_clip calculates "E", the score that would be expected for probability "P", and then compares it with the observed score "O". If O>E for any diagonal, a match is declared and expressed as 100(O-E)/E. This new method is an attempt to overcome the problem that, even though the scores on diagonals are normalised to lie in the range 0.0 to 1.0, the scores are still a function of the diagonal length. The probability P hence allows vector_clip to use a different cutoff score for each length of diagonal. Tests have shown that the probability based algorithm is very much more reliable than the older one. By default the program still uses the old algorithm; the probability based one is switched on when the user specifies a probability cutoff (option -P). It is strongly recommended that the probability based method is used, and for our data we have found that a probability of 0.000001 gives good results. This mode writes the CS records in experiment files. If the whole reading is cloning vector, this mode writes a PS record containing "all cloning vector".
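How a probability threshold turns into a length-dependent cutoff can be shown with a small sketch. The way vector_clip actually computes E is not described here, so the model below is purely an assumption: each word position on a diagonal is treated as an independent chance match with probability 0.25 ** word_length, and E is the smallest score whose binomial tail probability is at most P. Only the O>E test and the 100(O-E)/E expression come from the text above.

```python
def expected_score(n_positions, word_length, prob):
    """Smallest normalised score k/n whose chance probability is <= prob.

    Assumed model (not taken from vector_clip): independent word matches,
    equal base frequencies, binomial tail P(X >= k).
    """
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must lie strictly between 0 and 1")
    p = 0.25 ** word_length
    pmf = (1.0 - p) ** n_positions  # P(X = 0)
    tail = 1.0                      # P(X >= 0)
    for k in range(n_positions + 1):
        if tail <= prob:
            return k / n_positions
        tail -= pmf                 # now P(X >= k + 1)
        pmf *= (n_positions - k) / (k + 1) * p / (1.0 - p)
    return 1.0  # no achievable score is significant at this threshold


def assess_diagonal(observed, n_positions, word_length, prob):
    """Return 100(O-E)/E when O > E, otherwise None."""
    expected = expected_score(n_positions, word_length, prob)
    if observed > expected:
        return 100.0 * (observed - expected) / expected
    return None
```

Under this model a short diagonal needs a much higher normalised score than a long one to pass the same threshold, which is the length dependence that the probability based method is meant to handle.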

The search for "vector rearrangements" uses a simple algorithm which looks only for a match of length "minimum match". All readings that contain a string of characters of at least this length that matches a segment of the vector sequence exactly will be classed as "vector rearrangements", and their names will not be written to the file of passed file names. This mode writes a PS record containing "vector rearrangement" in experiment files if a match is found.
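Because any exact match of at least "minimum match" bases must contain an exact match of exactly that length, the test reduces to a word lookup. The sketch below shows this; the function name and the case-sensitive comparison are assumptions for illustration and do not describe how vector_clip itself is written.

```python
def is_vector_rearrangement(reading, vector, min_match):
    """True if `reading` shares an exact run of at least `min_match`
    characters with `vector` (comparison is case-sensitive here)."""
    if min_match <= 0:
        raise ValueError("min_match must be positive")
    # Every word of length min_match that occurs in the vector.
    words = {vector[j:j + min_match]
             for j in range(len(vector) - min_match + 1)}
    return any(reading[i:i + min_match] in words
               for i in range(len(reading) - min_match + 1))
```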


This page is maintained by James Bonfield. Last generated on 2 February 1999.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/vector_clip_3.html