The vector_primer files store the data for each vector/primer pair combination as a single record (line) and any number of records can be contained in a file. The items on each line must be separated by spaces or tabs (obviously this precludes the inclusion of items containing spaces and tabs) and a newline character ends the record. The items in a record are:
name, seq_r, seq_f, offset_r, offset_f, offset_c, file_name
name is an arbitrary record name.
seq_r is the sequence between the reverse primer and the cloning site.
seq_f is the sequence between the forward primer and the cloning site.
offset_r is the number of bases between the end of the reverse primer and the cloning site.
offset_f is the number of bases between the end of the forward primer and the cloning site.
offset_c is the number of the base to the right of the cloning site.
file_name is the name of the file containing the complete vector sequence.
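The record layout described above can be read with a few lines of Python. This is only a minimal sketch, not part of vector_clip: the names VectorPrimerRecord, parse_record and parse_file_text are hypothetical, the three offsets are assumed to be integers (as they are in the example records), and blank lines are skipped purely as a convenience.

```python
from typing import NamedTuple


class VectorPrimerRecord(NamedTuple):
    """One line of a vector_primer file (hypothetical helper type)."""
    name: str
    seq_r: str
    seq_f: str
    offset_r: int
    offset_f: int
    offset_c: int
    file_name: str


def parse_record(line: str) -> VectorPrimerRecord:
    """Split one record on spaces or tabs and convert the offsets to int."""
    items = line.split()  # no argument: splits on any run of whitespace
    if len(items) != 7:
        raise ValueError(f"expected 7 items, found {len(items)}")
    name, seq_r, seq_f, off_r, off_f, off_c, file_name = items
    # Assumption: the three offsets are always written as integers.
    return VectorPrimerRecord(
        name, seq_r, seq_f, int(off_r), int(off_f), int(off_c), file_name
    )


def parse_file_text(text: str) -> list[VectorPrimerRecord]:
    """Parse every non-blank line of a file's contents."""
    return [parse_record(line) for line in text.splitlines() if line.strip()]
```

Because the items are split on whitespace, a record with a missing item, or with an item that contains a space, is rejected with an error rather than silently misread.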
An example file containing three entries (for m13mp18, a vector called f1, and a simplified m13 record) is shown below. "\" symbols have been used to denote wrapped lines, so the first record is shown on three lines, the next on one, and the last on two. For clarity a blank line has been introduced between each record, but blank lines should not be included in real data files and, as stated above, real files must contain all the data for each entry on a single line.
m13mp18-20-21 tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc \
ggggatcctctagagtcgacctgcaggcatgcaagcttggc -24 41 6249 \
/pubseq/tables/vectors/m13mp18.seq

f1 CCGGGAATTCGCGGCCGCGTCGACT CTAGACTCGAGTTATGCATGCA -65 27 572 af_clones_vec

1 tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc \
ggggatcctctagagtcgacctgcaggcatgcaagcttggc 1 1 1 1
Refer to the figure below, which contains an annotated section of the m13mp18 vector around the SmaI site, to see how it corresponds to the first record in the vector_primer file. The primers shown are the 16mer reverse(-21) and the 17mer forward(-20).
                                              SmaI
                                               ++++++++++10+
                        ----20--------10--------
                                                123456789012
          r(-21)        432109876543210987654321
        aacagctatgaccatg
acacaggaaacagctatgaccatgattacgaattcgagctcggtacccggggatcctcta
      6210      6220      6230      6240      6250      6260

+++++++20++++++++30++++++++40
34567890123456789012345678901      f(-20)
                             tgaccggcagcaaaatg
gagtcgacctgcaggcatgcaagcttggcactggccgtcgttttacaacgtcgtgactgg
      6270      6280      6290      6300      6310      6320
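The offsets in the first record can be checked against the figure with a little arithmetic. The sketch below is illustrative only and rests on two assumptions: that the position labels in the figure sit under every tenth base, so the first base shown is 6201, and that the forward primer is drawn as the base-by-base complement of the strand shown. All variable names are hypothetical.

```python
# The two sequence lines of the figure, joined. Assumption: the ruler labels
# sit under every tenth base, so the first base shown is position 6201.
REGION_START = 6201
region = (
    "acacaggaaacagctatgaccatgattacgaattcgagctcggtacccggggatcctcta"
    "gagtcgacctgcaggcatgcaagcttggcactggccgtcgttttacaacgtcgtgactgg"
)
reverse_primer = "aacagctatgaccatg"   # r(-21), same sense as the sequence shown
forward_primer = "tgaccggcagcaaaatg"  # f(-20), assumed drawn as the complement
offset_c = 6249                       # base to the right of the cloning site


def complement(seq: str) -> str:
    """Base-by-base complement (not reversed)."""
    return seq.translate(str.maketrans("acgt", "tgca"))


# Vector position of the last base of the reverse primer.
r_index = region.find(reverse_primer)
r_end = REGION_START + r_index + len(reverse_primer) - 1
gap_r = offset_c - r_end - 1          # bases strictly between primer and site

# Vector position of the first base of the forward primer's binding site.
f_index = region.find(complement(forward_primer))
f_start = REGION_START + f_index
gap_f = f_start - offset_c            # bases from offset_c up to the primer

print(gap_r, gap_f)
```

Under these assumptions the two gaps come out as 24 and 41, which agree in magnitude with the offset_r (-24) and offset_f (41) items of the first record; the example record stores the reverse offset as a negative number.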
It is important to realise that although at present vector_clip uses only the sequence segments in the vector_primer file, the other items have been included in case they are needed in the future, and all items must be present in each record. We will soon provide a program to make it easy to add items to vector_primer files but, for now, dummy values can be used for all the extra items in a record, as is shown in the third record. (It is important that the dummy items are of the correct types; as shown, they can all be replaced by 1s.)
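Until such a program exists, a record with dummy values can be assembled by hand or with a short helper. The following is a sketch only; dummy_record is a hypothetical name, and the dummy value 1 is used for every extra item simply because that is what the third example record does.

```python
def dummy_record(name: str, seq_r: str, seq_f: str) -> str:
    """Build one record line, using 1 for every item vector_clip does not yet use."""
    for item in (name, seq_r, seq_f):
        # Items are separated by spaces or tabs, so none may be empty
        # or contain whitespace.
        if not item or any(ch.isspace() for ch in item):
            raise ValueError(f"invalid item: {item!r}")
    # offset_r, offset_f, offset_c and file_name are all replaced by "1".
    return " ".join([name, seq_r, seq_f, "1", "1", "1", "1"])
```

The returned string is a single line with no trailing newline, so the caller adds the newline that ends the record when writing it to a file.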
There are several consequences of using vector_primer files to specify the sequencing vector details.
Firstly, to get the vector segments of readings marked correctly it will no longer be necessary to include the relevant data in their experiment files. However, because gap4 needs the primer type data for read-pair analysis, it is strongly recommended that the data is still included for sequence assembly projects. (Note that in the future we could use the new mode to discover the correct primer-cloning site combination and use vector_clip to add the information to the experiment file. In fact, when analysing data using the new mode we have come across examples of experiment files which specify the primer type and vector incorrectly, presumably due to the misnaming of readings.)
Secondly, because vector_clip compares all the primer-vector pairs in the vector_primer file, it would be inefficient to include very large numbers of records in these files. Instead it would be better to have a master vector_primer file containing all the combinations used in the lab, and then to copy the relevant ones to project-specific files.
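Copying the relevant records out of a master file is easy to script. The sketch below assumes that records are chosen by their name item (the first item on each line); the function name select_records and the idea of selecting by name are illustrative assumptions, not features of vector_clip.

```python
from typing import Iterable


def select_records(master_lines: Iterable[str], wanted_names: Iterable[str]) -> list[str]:
    """Return the master-file lines whose first item (name) is in wanted_names."""
    wanted = set(wanted_names)
    selected = []
    for line in master_lines:
        items = line.split()
        # Skip blank lines; keep the line unchanged if its name is wanted.
        if items and items[0] in wanted:
            selected.append(line)
    return selected
```

In use, the lines would be read from the master file, passed through select_records with the names needed for the project, and written unchanged to the project-specific file, so that each record stays on its own single line.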
Thirdly, the new mode cannot be used for searching the 3' ends of readings for matches when primer walking: because the 5' end will not match any of the vector segments, the program would not know which data to use for the 3' end search.
One intention of the new method was to avoid users having to work out the cloning site and primer site positions, but until we have written the simple program to create vector_primer file records, this will have to continue. Once that is done we can use vector_clip to discover the cloning site and primer information and add it to the experiment files for use by gap4 (hence the extra fields in the vector_primer file).