
Positional base preferences

This method for finding protein coding regions (or, more specifically, the reading frame most likely to be coding) can be considered a variant of the codon usage method. Instead of measuring closeness to a table of codon frequencies whose main discriminating power is due to codon preferences, we look for similarity to the codon usage that would be expected from a protein sequence of average amino acid composition but with no codon preference. The method is surprisingly effective: when tested against all the E. coli sequences in the EMBL sequence library, it correctly identified the coding frame for 91% of window positions. (The E. coli sequences were chosen only for technical reasons: we have no reason to think the method would work less well on other organisms with roughly even base composition.)

Staden, R. (1990) Finding protein coding regions in genomic sequences. In Doolittle, R.F. (ed.), Methods in Enzymology, 183, Academic Press, San Diego, CA, 163-180.

The results for each reading frame are plotted in the graphics window, with frame 1 in the top panel, frame 2 in the middle panel and frame 3 in the bottom panel. Frame 1 is the frame of the first base in the active region. At each position along the sequence the program also plots a single dot for the reading frame with the highest score. These dots appear at the midpoints of the three panels and form a continuous line if one reading frame is consistently the highest scoring. The figure below shows a nip plot window containing the results of the positional base preferences method on a sequence from E. coli. Also visible are the cross hairs. Their x position is shown in sequence base numbers in the left hand box above the plot, and their y coordinate, expressed in the score values of the gene search, is shown in the right hand box. Each line in the window has its own colour and can be dragged and dropped to a new location to reorganise the plot. The cursor in the plot can be used to control the position of the cursor in the sequence display.

[picture]

The average amino acid composition used to derive the values in the codon table is that described by McCaldon and Argos (1988), Proteins 4, 99-122.
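As a rough illustration of how such a table could be derived, the Python sketch below spreads each amino acid's frequency evenly over its synonymous codons in the standard genetic code, which is one way to express "no codon preference". The function name expected_codon_table is hypothetical, the uniform composition is only a placeholder (it is not the McCaldon and Argos data), and setting stop codons to zero is an assumption rather than documented nip4 behaviour.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_codon_table(aa_composition):
    """Share each amino acid's frequency equally among its synonymous codons.

    aa_composition maps one-letter amino acid codes to fractions.
    Stop codons are given a value of 0.0 (an assumption of this sketch).
    """
    synonyms = {}
    for codon, aa in CODON_TO_AA.items():
        synonyms.setdefault(aa, []).append(codon)

    table = {}
    for aa, codons in synonyms.items():
        share = 0.0 if aa == "*" else aa_composition[aa] / len(codons)
        for codon in codons:
            table[codon] = share
    return table


# Placeholder composition: every amino acid equally common.
# Substitute published composition values to get a realistic table.
uniform = {aa: 1.0 / 20 for aa in set(AMINO) - {"*"}}
table = expected_codon_table(uniform)
```

With this placeholder, a six-codon amino acid such as leucine contributes one sixth of its frequency to each of its codons, while the single methionine codon ATG receives the whole of methionine's share.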

The version in nip contains many methods of manipulating the table values which are not available in nip4, but which may be made accessible through a small external program. One difference in the nip4 version is that, as with the codon usage method, we now make the values independent of the number of stop codons found in the sequence.

As can be seen in the dialogue below, the user can set the window length in codons (around 67 codons is generally suitable) and can restrict the search to a subregion of the sequence. Note that the window length must be odd.

[picture]
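The windowed search over three reading frames can be sketched in Python as follows. This is only an illustration under stated assumptions: the score used here (the mean log of the codon table value across the window) is a stand-in and is not confirmed to be the formula nip4 uses, and the names frame_scores, best_frames and the floor parameter are hypothetical. The default of 67 codons and the odd-length requirement come from the text above.

```python
import math


def frame_scores(seq, table, window_codons=67, floor=1e-6):
    """Score every window position in each of the three reading frames.

    Returns {frame: [(base_index, score), ...]} with frames numbered 0, 1, 2.
    base_index is the 0-based position of the first base of the window's
    central codon. Table values below `floor` (e.g. stop codons set to 0)
    are raised to `floor` so that the logarithm is defined.
    """
    if window_codons % 2 == 0:
        raise ValueError("window length must be odd")
    seq = seq.upper()
    half = window_codons // 2
    results = {}
    for frame in range(3):
        # Log table value of every complete codon in this frame.
        logs = [math.log(max(table.get(seq[i:i + 3], 0.0), floor))
                for i in range(frame, len(seq) - 2, 3)]
        scores = []
        for centre in range(half, len(logs) - half):
            window = logs[centre - half:centre + half + 1]
            scores.append((frame + 3 * centre, sum(window) / window_codons))
        results[frame] = scores
    return results


def best_frames(results):
    """For each window index, return the frame with the highest score."""
    n = min(len(results[f]) for f in range(3))
    return [max(range(3), key=lambda f: results[f][k][1]) for k in range(n)]
```

The central codon of an odd-length window has the same number of codons on either side, which gives each score an unambiguous position along the sequence. The list returned by best_frames corresponds to the single dot plotted for the highest-scoring frame at each position.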


This page is maintained by James Bonfield. Last generated on 2 February 1999.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/nip4_25.html