SEG

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
CONSIDERATIONS
COMMAND-LINE SUMMARY
ACKNOWLEDGEMENT
LOCAL DATA FILES
PARAMETER REFERENCE

FUNCTION

Seg replaces low complexity regions in protein sequences with X characters. If a resulting protein sequence is used as a query for a BLAST search, the regions with X characters are ignored.

DESCRIPTION

The Karlin-Altschul statistics that underlie BLAST assume that the probability of finding a residue at any particular position in a sequence is simply proportional to that residue's overall frequency in the sequence (its composition). Low-complexity regions and homopolymers violate this assumption. Such regions occur frequently in proteins. Query sequences containing low-complexity regions may give highly significant similarity scores when compared to unrelated low-complexity sequences of similar composition.

Seg uses the method of Wootton & Federhen (Computers and Chemistry 17: 149-163 (1993)) to divide a sequence into regions of high and low complexity. The output is identical to the input sequence except that the amino acid characters in any low-complexity regions found are replaced by X's. A BLAST search ignores these X regions.

EXAMPLE

Seg is used to find the regions of low complexity in a human major prion protein precursor.


% seg

 SEG of what input sequence(s) ?  Pir:ujhu

                  Begin (* 1 *) ?
                End (*   253 *) ?

 What should I call the output file (* ujhu.seg *) ?

        PIR1:UJHU   Len:     253

%

OUTPUT

Each output file contains the input sequence with the amino acid characters in low-complexity regions changed into X's. Here is the output file from the session above.


!!AA_SEQUENCE 1.0
  SEG of: a05017  check: 8781  from: 1  to: 253

P1;UJHU - major prion protein precursor - human
N;Alternate names: 11K amyloid protein; 27-30K sialoglycoprotein; PrP 27-30;
 PrP 33-35C; scrapie prion protein
C;Species: Homo sapiens (man)
C;Date: 25-Oct-1987 #sequence_revision 12-Apr-1996 #text_change 05-Sep-1997
C;Accession: A24173; A40372; A05017; S14078; I54322; I68597; I58135; I59184;
 I79633; I79634
R;Kretzschmar, H.A.; Stowring, L.E.; Westaway, D.; Stubblebine, W.H.; Prusiner,
 S.B.; Dearmond, S.J

a05015.seq  Length: 253  October 13, 1998 16:29  Type: P  Check: 4122  ..

       1  MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYX

      51  XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXTHSQWN

     101  KPSKPKTNMK HMXXXXXXXX XXXXXXXXXX XXXXXRPIIH FGSDYEDRYY

     151  RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV NITIKQHXXX XXXXXXXXXX

     201  XDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSXXXX XXXXXXXXXX

     251  XXG
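
To see where the masked regions fall, the X runs in an output file can be listed with a few lines of Python. The sketch below is not part of the Wisconsin Package; the function names read_sequence and masked_regions are hypothetical, and it assumes every sequence line starts with a position label followed by blocks of residues, as in the output above.

```python
import re

# Sequence lines copied from the output file above.
GCG_LINES = """
       1  MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYX
      51  XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXTHSQWN
     101  KPSKPKTNMK HMXXXXXXXX XXXXXXXXXX XXXXXRPIIH FGSDYEDRYY
     151  RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV NITIKQHXXX XXXXXXXXXX
     201  XDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSXXXX XXXXXXXXXX
     251  XXG
"""


def read_sequence(text):
    """Join the residue blocks, dropping the position label on each line."""
    residues = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit():
            residues.extend(fields[1:])
    return "".join(residues)


def masked_regions(sequence):
    """Return 1-based inclusive (begin, end) pairs, one per run of X."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"X+", sequence)]


sequence = read_sequence(GCG_LINES)
regions = masked_regions(sequence)
for begin, end in regions:
    print(f"{begin:>4} {end:>4}  length {end - begin + 1}")
```

Coordinates are reported 1-based and inclusive so that they can be compared directly with the position labels in the output file.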

INPUT FILES

You can specify either a single protein sequence or multiple protein sequences as input to Seg. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. If Seg rejects your protein sequence, see Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

Xnu replaces statistically significant tandem repeats in protein sequences with X characters. If a resulting protein sequence is used as a query for a BLAST search, the regions with X characters are ignored.

RESTRICTIONS

Seg only accepts protein sequences. If you give Seg more than one sequence as input, Seg will not prompt you for begin and end positions or for the output file name.

CONSIDERATIONS

If 20 different characters were distributed randomly and with equal probability along a sequence, then each character would add 4.322 bits of information to the sequence (log(base 2) 20 = 4.322). If a sequence contained only one kind of character, then each character would add 0.0 bits of information to the sequence (log(base 2) 1 = 0).
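
The two limiting values can be checked numerically. The sketch below assumes that complexity is measured as the Shannon entropy of the residue composition, in bits per residue; Equation (3) of Wootton and Federhen is not reproduced in this document, so treat the helper complexity_bits (a hypothetical name) as an illustration rather than as Seg's own code.

```python
import math
from collections import Counter


def complexity_bits(window):
    """Shannon entropy of the residue composition, in bits per residue."""
    counts = Counter(window)
    length = len(window)
    return sum(-(n / length) * math.log2(n / length) for n in counts.values())


# 20 different residues, each used once: the upper limit, log2(20).
print(round(complexity_bits("ACDEFGHIKLMNPQRSTVWY"), 3))

# A single repeated residue: the lower limit, log2(1).
print(round(complexity_bits("QQQQQQQQQQQQ"), 3))
```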

The parameters -LOWcut and -HIGhcut are cutoffs in units of bits/residue that set the "lowness" of complexity of the regions you want to find.

To find all homopolymers of length six or greater, set the window to six and both the cutoffs to zero.
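
The homopolymer case is simple enough to sketch in one stage. The code below is a simplified illustration, not Seg's implementation: it assumes Shannon entropy as the complexity measure, and the names complexity_bits and mask_windows are hypothetical. With a window of six and a cutoff of zero, only windows made of a single residue qualify.

```python
import math
from collections import Counter


def complexity_bits(window):
    """Shannon entropy of the residue composition, in bits per residue."""
    counts = Counter(window)
    length = len(window)
    return sum(-(n / length) * math.log2(n / length) for n in counts.values())


def mask_windows(sequence, window=6, cutoff=0.0):
    """Replace with X every residue covered by a window at or below the cutoff."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        # Windows are always scored on the original, unmasked sequence.
        if complexity_bits(sequence[start:start + window]) <= cutoff:
            masked[start:start + window] = "X" * window
    return "".join(masked)


# The run of seven A's is masked; the run of five Q's is shorter than the window.
print(mask_windows("MKAAAAAAALQQQQQGT"))
```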

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % seg [-INfile=]PIR1:Ujhu -Default

Prompted Parameters: (for single sequences)

-BEGin=1  -END=253    sets the range of interest
[-OUTfile=]ujhu.seg   names the output file

Local Data Files:     None

Optional Parameters:

-BEGin=1  -END=100    sets the range of interest (for multiple sequences)
-WINdow=12            sets the minimum size of first stage segment
-LOWcut=2.2           sets the maximum complexity of a first stage segment
-HIGhcut=2.5          sets the maximum complexity of a second stage segment
-MINhighlen=0         sets the minimum length of a high-complexity segment
-EXTension=.seg       sets the default output file name extension
-LIStfile[=seg.list]  writes a list file of output sequence names
-NOMONitor            suppresses screen monitor of input sequence names
-NOSUMmary            suppresses the screen summary
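
If you drive Seg from a script, the parameters above can be assembled and range-checked before the program is started. The following sketch only builds an argument list and does not run anything; the function seg_command is hypothetical, and the limits it enforces are the ones stated for -LOWcut and -HIGhcut in the PARAMETER REFERENCE topic.

```python
MAX_BITS = 4.322  # upper limit for both cutoffs, in bits/residue


def seg_command(infile, window=12, lowcut=2.2, highcut=2.5, outfile=None):
    """Return an argument list for seg; raise ValueError on bad cutoffs."""
    if not 0.0 <= lowcut <= MAX_BITS:
        raise ValueError("lowcut must be between 0.0 and 4.322")
    if not lowcut <= highcut <= MAX_BITS:
        raise ValueError("highcut must be between lowcut and 4.322")
    args = ["seg", f"-INfile={infile}",
            f"-WINdow={window}", f"-LOWcut={lowcut}", f"-HIGhcut={highcut}"]
    if outfile is not None:
        args.append(f"-OUTfile={outfile}")
    args.append("-Default")
    return args


# Settings for finding homopolymers of length six or greater.
print(" ".join(seg_command("PIR1:Ujhu", window=6, lowcut=0, highcut=0)))
```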

ACKNOWLEDGEMENT

Seg was written by Wootton and Federhen at the National Center for Biotechnology Information (NCBI). Their public-domain program was modified by Scott Rose for distribution with Version 9 of the Wisconsin Package. The document you are now reading was written by John Devereux. We are extremely grateful to Wootton and Federhen for their work on Seg and to NCBI for making this program available to the scientific community.

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-BEGin=1

sets the beginning position for all input sequences. When the beginning position is set from the command line, Seg ignores beginning positions specified for individual sequences in a list file.

-END=100

sets the ending position for all input sequences. When the ending position is set from the command line, Seg ignores ending positions specified for sequences in a list file.

-WINdow=12

To limit the computing required, Seg starts off by looking for low-complexity regions of size 12 or greater. You can set this window size lower if you want to find shorter low-complexity regions.

-LOWcut=2.2

In the first stage of the algorithm, Seg identifies segments having a complexity equal to or less than the cutoff in bits/residue set by this parameter (Equation (3) from Wootton and Federhen). The range of acceptable values for this parameter is from 0.0 to 4.322.

This parameter is K(1) in Wootton and Federhen and is referred to as the trigger segment complexity. See the CONSIDERATIONS topic above.

-HIGhcut=2.5

In the second stage of the algorithm, Seg extends the low-complexity segments found in the first stage into overlapping low-complexity segments that have a complexity equal to or less than the cutoff in bits/residue set by this parameter (Equation (3) from Wootton and Federhen). The complexity of these extension segments can be higher than the complexity for the first stage segments. The range of acceptable values for this parameter is from the lowcut setting to 4.322.

This parameter is K(2) in Wootton and Federhen and is referred to as the extension segment complexity. When this second stage of the algorithm is finished, the resulting extended segment-contigs are referred to as "raw segments". See the CONSIDERATIONS topic above.
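
The interplay of -WINdow, -LOWcut, and -HIGhcut can be sketched as follows. This is a simplified reading of the two stages described above, not the code distributed with the Wisconsin Package: it assumes Shannon entropy as the complexity measure, it stops at the raw segments, and the names complexity_bits and raw_segments are hypothetical.

```python
import math
from collections import Counter


def complexity_bits(window):
    """Shannon entropy of the residue composition, in bits per residue."""
    counts = Counter(window)
    length = len(window)
    return sum(-(n / length) * math.log2(n / length) for n in counts.values())


def raw_segments(sequence, window=12, lowcut=2.2, highcut=2.5):
    """Return 1-based inclusive (begin, end) pairs for the raw segments."""
    scores = [complexity_bits(sequence[i:i + window])
              for i in range(len(sequence) - window + 1)]
    segments = []
    i = 0
    while i < len(scores):
        if scores[i] > highcut:
            i += 1
            continue
        # Stage 2: collect the run of overlapping windows at or below highcut.
        j = i
        triggered = False
        while j < len(scores) and scores[j] <= highcut:
            # Stage 1: the run counts only if it contains a trigger window.
            triggered = triggered or scores[j] <= lowcut
            j += 1
        if triggered:
            segments.append((i + 1, j - 1 + window))
        i = j
    return segments


flank = "ACDEFGHIKLMNPQRSTVWY"
print(raw_segments(flank + "Q" * 15 + flank))
```

Because highcut is never lower than lowcut, every trigger window also passes the extension test, so each raw segment is a trigger window plus whatever overlapping windows extend it on either side.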

-MINhighlen=0

The maximal BLAST bit score of a segment pair is 4.322 times the length of the pair. If you search with a very short region, it may not contain enough total information to reach the cutoff score, even though it is locally complex, and the search will then seem to find nothing at all even when there are related sequence segments in the database. This parameter lets you set a minimum acceptable length for a high-complexity segment. If Seg finds one shorter than this minimum length, it merges it with the adjacent low-complexity segments.
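
The effect of this parameter on an already masked sequence can be illustrated with a short sketch. It assumes that a short high-complexity run is absorbed only when it borders at least one masked region; that rule and the function name absorb_short_high_segments are assumptions made for illustration, not a description of Seg's internals.

```python
import re


def absorb_short_high_segments(masked, min_high_len):
    """Mask unmasked runs shorter than min_high_len that border a masked region."""
    out = list(masked)
    for m in re.finditer(r"[^X]+", masked):
        length = m.end() - m.start()
        # A maximal unmasked run borders a mask unless it spans the whole sequence.
        touches_mask = m.start() > 0 or m.end() < len(masked)
        if length < min_high_len and touches_mask:
            out[m.start():m.end()] = "X" * length
    return "".join(out)


example = "MKTAY" + "X" * 8 + "GS" + "X" * 8 + "LLVDERTQ"
# The two-residue run GS is absorbed; the longer flanking runs are kept.
print(absorb_short_high_segments(example, 5))
```

With the default of 0 no run is shorter than the minimum, so the sequence is returned unchanged.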

-EXTension=.seg

This program normally creates output file names by using the original input file name for the base name and the program name for the name extension. Use this parameter to specify some other file name extension.

-LIStfile=seg.list

writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequence Files and Databases, in the User's Guide). If you don't specify a file name, Seg makes one up using seg for the file name and .list for the file name extension.

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

-SUMmary

writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary is normally displayed at the end of an interactive run; you can suppress it with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: December 9, 1998 16:29 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982-2001 Genetics Computer Group, Inc. A subsidiary of Pharmacopeia, Inc. All rights reserved.

Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com