
Screen_seq

NAME

screen_seq -- filters out sequence readings from contaminating DNA

SYNOPSIS

screen_seq -[lcwmiIsSpft] [-l Length of minimum initial match (25)] [-c Cutoff score or coverage (75)] [-w Window size (20)] [-m Maximum vector length (100000)] [-i Input file of reading file names] [-I Input file of single reading to screen] [-s Input file of sequence file names] [-S Input file of single sequence to screen against] [-p Passed output file of file names] [-f Failed output file of file names] [-t Test only mode]

DESCRIPTION

screen_seq searches sequence readings to filter out those from extraneous DNA such as vector or bacterial sequences. We have separated this task from that of locating and marking the extents of sequencing vector and other cloning vectors. There we require precise identification of the junction between the vectors and the target DNA. The filtering process described here is designed to spot strong matches between readings and a panel of possible contaminating sequences, and it splits readings into passes and fails. Readings that fail have a PS line containing the word "contaminant" added to their experiment file.

Normal usage would be to compare a batch of readings in experiment file format against a batch of possible contaminant sequences stored in (at present) simple text files. Each batch is presented to the program as a file of file names, and the program will write out two new files of file names: one containing the names of the files that do not match any of the contaminant sequences (the passes), and the other those that do match (the fails). It is also possible to compare single readings and single contaminant files by giving their file names (i.e. it is not necessary to use a file of file names for single files).
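As a rough illustration of the file-of-file-names bookkeeping, the Python sketch below reads the input, passed and failed lists and reports any reading that appears in neither output list. It assumes one file name per line, and the helper names read_fofn and unclassified are hypothetical; they are not part of screen_seq.

```python
from pathlib import Path


def read_fofn(path):
    """Return the non-blank lines of a file of file names (assumed one name per line)."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def unclassified(input_names, passed_names, failed_names):
    """Return input names that appear in neither the passed nor the failed list."""
    seen = set(passed_names) | set(failed_names)
    return [name for name in input_names if name not in seen]
```

For example, unclassified(read_fofn("fofn"), read_fofn("p"), read_fofn("f")) lists readings that were not classified. Whether such readings correspond to per-file errors is an assumption to check against the program's output.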

Given the frequent need to compare against the full E. coli genome, the algorithm is designed to be fast. The user controls the speed and sensitivity by supplying three parameters: "min_match" (option -l), "window_size" (option -w) and "coverage" (option -c). The program finds the longest exact match of at least min_match characters, then combines all matches of length 8 and above that lie on diagonals within a distance of window_size/2 either side of it, and calculates what percentage of the length of the reading they cover. If that percentage is at least coverage, the reading is treated as a match. Typical values would be min_match = 20, window_size = 20, coverage = 75.
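The search described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the program's actual implementation: it handles only one strand, ignores clipping, uses a plain dictionary of 8-base words in place of whatever hashing the program uses, and the function names exact_runs and screen_reading are hypothetical.

```python
from collections import defaultdict

WORD = 8  # shortest exact match that contributes to coverage


def exact_runs(reading, target, word=WORD):
    """Return maximal exact matches as (reading_start, diagonal, length) tuples."""
    # Index every word-length substring of the target by its start position.
    index = defaultdict(list)
    for j in range(len(target) - word + 1):
        index[target[j:j + word]].append(j)

    runs = []
    for i in range(len(reading) - word + 1):
        for j in index.get(reading[i:i + word], ()):
            # Skip hits that merely continue a run already started further left.
            if i > 0 and j > 0 and reading[i - 1] == target[j - 1]:
                continue
            n = word
            while (i + n < len(reading) and j + n < len(target)
                   and reading[i + n] == target[j + n]):
                n += 1
            runs.append((i, j - i, n))
    return runs


def screen_reading(reading, target, min_match=25, window_size=20, coverage=75):
    """Return (matched, percent_covered) for one strand of one reading."""
    runs = exact_runs(reading, target)
    if not runs:
        return False, 0.0
    # The longest exact match must reach min_match to trigger a closer look.
    _, best_diag, best_len = max(runs, key=lambda run: run[2])
    if best_len < min_match:
        return False, 0.0

    # Combine all runs on diagonals within window_size/2 of the best one.
    half = window_size // 2
    covered = set()
    for start, diag, length in runs:
        if abs(diag - best_diag) <= half:
            covered.update(range(start, start + length))
    percent = 100.0 * len(covered) / len(reading)
    return percent >= coverage, percent
```

Counting covered reading positions in a set keeps overlapping matches on neighbouring diagonals from being counted twice. Whether screen_seq itself treats overlaps this way is not stated in this manual.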

The search is conducted only over the clipped portion of the readings. On our Alpha machine it takes about 1 second to compare both strands of a reading against the 4.7 million bases of E. coli.

OPTIONS

-l Length of minimum initial match (25)
The length of match required to initiate a closer search.
-c Cutoff score or coverage (75)
The minimum percentage of the reading covered by matches of >7 bases that is required to define an overall match.
-w Window size (20)
The number of diagonals to search either side of the initial match is (Window size) / 2.
-m Maximum vector length (100000)
The maximum length allowed for any sequence that the readings are screened against.
-i Input file of reading file names
-I Input file of single reading to screen
-s Input file of sequence file names
-S Input file of single sequence to screen against
-p Passed output file of file names
-f Failed output file of file names
-t Test only mode
In test mode no experiment files are changed and the results are written to stdout. When not in test mode a dot "." is written to stdout for each comparison, and an exclamation mark "!" for each error detected.
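A wrapper script could tally the progress characters written when not in test mode. The Python sketch below is a minimal example; it assumes the captured text contains no other "." or "!" characters (for instance from file names), and the function name summarise_progress is hypothetical.

```python
def summarise_progress(stdout_text):
    """Count progress markers: '.' per comparison, '!' per error detected."""
    comparisons = stdout_text.count(".")
    errors = stdout_text.count("!")
    return comparisons, errors
```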

EXAMPLES

Usage: screen_seq [options and parameters]
Where options and parameters are:
    [-l minimum match (25)]           [-c coverage (75%)]
    [-w window size (20)]             [-m Max vector length (100000)]
    [-i readings to screen fofn]      [-I reading to screen]
    [-s seqs to screen against fofn]  [-S seq to screen against]
    [-t test only]
    [-p passed fofn]                  [-f failed fofn]

Screen the readings whose names are stored in fofn against a batch of possible contaminant sequences whose names are stored in vnames. Write the names of the readings that pass to file p and those that fail to file f. Increase the maximum sequence length to 5,000,000 characters and require a minimum initial match of 20 and a percentage match of at least 60%. Use the default value for the window size.

screen_seq -i fofn -s vnames -p p -f f -l20 -m5000000 -c60

Screen the single reading stored in xpg33.g1 against a batch of possible contaminant sequences whose names are stored in vnames. If the reading does not match, write its name to file p, otherwise to file f. Increase the maximum sequence length to 5,000,000 characters and require a minimum initial match of 20 and a percentage match of at least 60%. Use the default value for the window size.

screen_seq -I xpg33.g1 -s vnames -p p -f f -l20 -m5000000 -c60

Screen the readings whose names are stored in fofn against a single possible contaminant sequence stored in ecoli.seq. Write the names of the readings that pass to file pass and those that fail to file fails. Increase the maximum sequence length to 5,000,000 characters and require a minimum initial match of 20 and a percentage match of at least 60%. Use the default value for the window size.

screen_seq -i fofn -S ecoli.seq -p pass -f fails -l20 -m5000000 -c60

NOTES

Limits

Screen_seq is currently set to be able to process a maximum of 10,000 readings and 5000 screening sequences in a single run. The maximum length of any screening sequence is 100,000 although this can be overridden by use of the -m parameter (set to 5000000 for E. coli). At present the sequences to screen against must be stored in simple text files containing individual sequences, with no entry names, and <100 characters per line.
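The file-format limits above can be checked before a run. The Python sketch below is one possible pre-flight check, not part of screen_seq: the function name check_screening_sequence is hypothetical, and treating a line that starts with ">" as an entry name is an assumption borrowed from FASTA-style files.

```python
def check_screening_sequence(text, max_length=100000):
    """Return a list of problems found in the text of one screening sequence file."""
    problems = []
    total = 0
    for number, line in enumerate(text.splitlines(), start=1):
        # Assumption: an entry name would appear as a FASTA-style '>' header.
        if line.startswith(">"):
            problems.append(f"line {number}: looks like an entry name")
            continue
        if len(line) >= 100:
            problems.append(f"line {number}: {len(line)} characters (limit is <100)")
        total += len(line.strip())
    if total > max_length:
        problems.append(f"sequence length {total} exceeds maximum {max_length}")
    return problems
```

An empty list means none of these checks failed. For a genome-sized file, pass the same value as the -m parameter, for example max_length=5000000.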

The following errors can occur.

  1. "Failed to open file of file names to screen against". Fatal failure to open the file of file names to screen against.
  2. "Failed to open single file to screen against". Fatal failure to open the file to screen against.
  3. "Failed to open file of file names to screen". Fatal failure to open the file of file names to screen.
  4. "Failed to open single file to screen". Fatal failure to open the file to screen.
  5. "Failed to open file of passed file names". Fatal failure to open the file of file names for readings that do not match.
  6. "Failed to open file of failed file names". Fatal failure to open the file of file names for readings that match.
  7. "Failed to open single file to screen". Fatal failure to open the file to screen.
  8. "Error: could not open vector file". An individual sequence file could not be opened.
  9. "Error: could not read vector file". An individual sequence file could not be read.
  10. "Error: could not hash vector file". An individual sequence file could not be prepared for comparison.
  11. "Error: could not open experiment file". The file does not exist or is unreadable.
  12. "Error: no sequence in experiment file".
  13. "Error: sequence too short". The reading is shorter than the minimum match length.
  14. "Error: could not write to experiment file". The disk is full or the file is write protected.
  15. "Error: hashing problem". An error occurred in the comparison algorithm. Please report to staden-package@mrc-lmb.cam.ac.uk

Inconsistencies in the selection of options, such as selecting both -I and -i, should also cause the usage message (shown under EXAMPLES) to appear and the program to terminate.

A PS record containing the word "contaminant" is added to the experiment file of any reading that matches.
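A downstream script can test for this record. The Python sketch below is a minimal example, assuming that each record line in an experiment file begins with its two-letter type code; the function name is_marked_contaminant is hypothetical.

```python
def is_marked_contaminant(experiment_text):
    """Return True if any PS line in the experiment file text mentions 'contaminant'."""
    for line in experiment_text.splitlines():
        # Assumption: a record line starts with its two-letter type, here "PS".
        if line.startswith("PS") and "contaminant" in line:
            return True
    return False
```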

SEE ALSO

See section Experiment File. See section Screening Against Vector Sequences.


This page is maintained by James Bonfield. Last generated on 2 February 1999.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/manpages_9.html