HTHSCAN

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
INPUT FILES
RELATED PROGRAMS
CONSIDERATIONS
ALGORITHM
COMMAND-LINE SUMMARY
ACKNOWLEDGEMENT
LOCAL DATA FILES
PARAMETER REFERENCE

FUNCTION

HTHScan scans protein sequences for the presence of helix-turn-helix motifs, indicative of sequence-specific DNA-binding structures often associated with gene regulation.

DESCRIPTION

HTHScan predicts helix-turn-helix (H-T-H) motifs in protein sequences. For each sequence, HTHScan prints a list of possible H-T-H motifs sorted in descending order according to score. Associated with each score is the probability of achieving that score in the target sequence by chance using the given family-specific weight matrix. HTHScan has weight matrices for the araC and lysR families of H-T-H motifs and one for homeobox domains.

EXAMPLE

Here is a session with HTHScan that was used to find H-T-Hs in the arabinose operon regulatory protein araC sequence from E. coli:


% hthscan

  HTHScan of what sequence(s)? PIR:Rgeca

                  Begin (* 1 *) ?
                End (*   292 *) ?

  Search using weight matrix for which H-T-H family:

      A.  AraC
      B.  LysR
      C.  Homeobox

     Please choose one: (* A *):

  Only display H-T-Hs whose score exceeds (* 4.0 *) ?

  What should I call the output file (* rgeca.hthscan *) ?

                  Input sequences processed: 1
  Number of sequences with predicted H-T-Hs: 1
                                Output file: rgeca.hthscan
  CPU time (sec): 1.22

%

OUTPUT

Here is the output file:


HTHScan of PIR1:Rgeca  September 29, 1998 10:18

  Weight matrix: GenRunData:htharac.dat
  Minimum score for H-T-Hs (threshold): 4.0

> sequence: pir1:rgeca
      name: rgeca  check: 4061  from: 1  to: 292

   1. 197 IASVAQHVCLSPSRLSHLFR 216
      Score: 39.8
      Probability: 4.031E-12

  Databases searched:
        NBRF, Release 57.0, Released on 30Jun1998, Formatted on 18Aug1998
  Input sequences searched: 1
  Number of sequences with predicted H-T-Hs: 1
  CPU time (sec): 0.68

The N-terminus->C-terminus direction of the predicted H-T-H is from left to right. The position of the first residue in the H-T-H is shown to the left. The position of the last residue in the H-T-H is shown to the right.

Below the H-T-H display is the score computed for the predicted H-T-H and the probability of random occurrence of that score or better in a sequence whose positions are independent of one another and whose residue distribution matches that of the target sequence (or is uniform, if you use -EVEn).
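
If you want to post-process the output file, the hit records shown above are regular enough to parse with a short script. The following is a minimal Python sketch, not part of the Wisconsin Package; the function name parse_hits is hypothetical, and the line patterns are inferred only from the sample output in this section, so other output layouts (for example with -VERbose) may need adjustments.

```python
import re

# Matches a hit line such as "   1. 197 IASVAQHVCLSPSRLSHLFR 216":
# rank, first residue position, motif residues, last residue position.
HIT_LINE = re.compile(r"^\s*(\d+)\.\s+(\d+)\s+([A-Za-z]+)\s+(\d+)\s*$")

def parse_hits(text):
    """Return one dict per predicted H-T-H found in the output text."""
    hits = []
    for line in text.splitlines():
        match = HIT_LINE.match(line)
        if match:
            rank, start, motif, end = match.groups()
            hits.append({"rank": int(rank), "start": int(start),
                         "motif": motif, "end": int(end)})
            continue
        stripped = line.strip()
        # Score and Probability lines belong to the most recent hit line.
        if hits and stripped.startswith("Score:"):
            hits[-1]["score"] = float(stripped.split(":", 1)[1])
        elif hits and stripped.startswith("Probability:"):
            hits[-1]["probability"] = float(stripped.split(":", 1)[1])
    return hits

sample = """\
> sequence: pir1:rgeca
      name: rgeca  check: 4061  from: 1  to: 292

   1. 197 IASVAQHVCLSPSRLSHLFR 216
      Score: 39.8
      Probability: 4.031E-12
"""

hits = parse_hits(sample)
for hit in hits:
    print(hit["start"], hit["end"], hit["motif"], hit["score"], hit["probability"])
```

The sketch does not track which input sequence a hit belongs to; for multi-sequence runs you would also need to record the preceding "> sequence:" line.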

INPUT FILES

The input to HTHScan is one or more protein sequences. If HTHScan rejects your protein sequence, see Appendix VI for information on how to change or set the type of a sequence. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*.

RELATED PROGRAMS

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.

FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.

SPScan scans protein sequences for the presence of secretory signal peptides (SPs).

CoilScan locates coiled-coil segments in protein sequences.

CONSIDERATIONS

Because of the way HTHScan sorts and stores predicted H-T-H motifs during scanning, no particular ordering is guaranteed among H-T-H motifs that have exactly the same score.

Ambiguity codes (such as B or Z) in protein sequences contribute exactly 0 to the score of any sequence window that contains them. As a result, the scores and probabilities of motifs predicted in such a window may differ from the values they would have if the residues were unambiguous. This is rarely a problem in practice, because ambiguity codes are extremely rare in protein sequences.

ALGORITHM

HTHScan uses a log-odds position-weight matrix ("weight matrix") to detect the presence of H-T-H motifs in protein sequences. The weight matrix encodes the H-T-H motif as a set of weights representing how likely each amino acid residue is to appear at each position of the motif. The score reported by HTHScan for each prediction is a measure of the local goodness of fit between the target sequence and the H-T-H signal represented by the weight matrix. This score is the sum of the weights corresponding to the amino acid residues found in the target sequence at each weight matrix position.
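
The scoring rule described above can be illustrated with a short Python sketch. This is not HTHScan's source code: the three-position toy matrix, its weights, and the names make_column, score_window, and scan are all invented for illustration (the real matrices are read from the weight matrix files). The sketch sums one weight per window position, lets any residue that is not a standard amino acid (such as the ambiguity code B) contribute 0, keeps windows whose score exceeds the threshold, and sorts them by descending score.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def make_column(favoured, weight=2.0, other=-1.0):
    """Build one toy matrix column: a weight for each of the 20 residues."""
    column = {res: other for res in AMINO_ACIDS}
    for res in favoured:
        column[res] = weight
    return column

# Hypothetical 3-position matrix; real H-T-H matrices are longer.
TOY_MATRIX = [make_column("A"), make_column("LI"), make_column("KR")]

def score_window(window, matrix):
    """Sum the weight of each residue at its matrix position.

    Residues absent from a column (ambiguity codes) contribute 0.
    """
    return sum(column.get(res, 0.0) for column, res in zip(matrix, window))

def scan(sequence, matrix, threshold):
    """Return (score, first, last, window) for windows scoring above threshold."""
    width = len(matrix)
    hits = []
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = score_window(window, matrix)
        if score > threshold:
            # Positions are reported 1-based, as in the output file.
            hits.append((score, offset + 1, offset + width, window))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits

toy_hits = scan("MALKGAIQB", TOY_MATRIX, threshold=2.0)
for score, first, last, window in toy_hits:
    print(first, window, last, score)
```

With this toy input, the window ALK matches the favoured residue at every position, and AIQ matches at two of three positions; all other windows fall below the threshold.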

The statistical significance of each score is computed as the probability of random occurrence of that score or better in a sequence with the same amino acid residue distribution as the target sequence and whose positions are all independent of each other (Claverie, J.-M. and Audic, S. CABIOS 12(5); 431-439 (1996)).
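
To make the idea of a score probability concrete, here is a minimal Python sketch that computes the chance that a single random window scores at least as well as an observed score when positions are independent. It builds the exact score distribution one matrix position at a time. It is an illustration under stated assumptions and does not attempt to reproduce the published method or HTHScan's implementation: the toy matrix, the rounding of weights to a grid of 0.1 (the scale argument), and the names make_column and tail_probability are all hypothetical.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def make_column(favoured, weight=2.0, other=-1.0):
    """Build one toy matrix column: a weight for each of the 20 residues."""
    column = {res: other for res in AMINO_ACIDS}
    for res in favoured:
        column[res] = weight
    return column

def tail_probability(matrix, frequencies, observed, scale=10):
    """Return P(window score >= observed) for independent positions.

    Weights are rounded to integers after multiplying by `scale`, so that
    equal scores collapse into the same entry of the distribution.
    """
    distribution = {0: 1.0}  # score (scaled) -> probability
    for column in matrix:
        updated = defaultdict(float)
        for partial, prob in distribution.items():
            for res, freq in frequencies.items():
                if freq > 0.0:
                    updated[partial + round(column[res] * scale)] += prob * freq
        distribution = updated
    cutoff = round(observed * scale)
    return sum(prob for score, prob in distribution.items() if score >= cutoff)

TOY_MATRIX = [make_column("A"), make_column("LI"), make_column("KR")]

# Background 1: all 20 residues equally likely.
uniform = {res: 1.0 / len(AMINO_ACIDS) for res in AMINO_ACIDS}
p_uniform = tail_probability(TOY_MATRIX, uniform, observed=6.0)

# Background 2: a made-up composition rich in A, L, and K.
skewed = dict.fromkeys(AMINO_ACIDS, 0.0)
skewed.update(A=0.5, L=0.25, K=0.25)
p_skewed = tail_probability(TOY_MATRIX, skewed, observed=6.0)

print(p_uniform, p_skewed)
```

The same observed score is far less surprising under the skewed background, which is why the residue distribution assumed for the random sequence matters.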

The weight matrices used by HTHScan were prepared using sequence sets taken from Pfam Release 2.0 (Sonnhammer, E.L. et al. Proteins 28; 405-420 (1997)). The Pfam families used were HTH_1 (bacterial regulatory helix-turn-helix proteins, lysR family), HTH_2 (bacterial regulatory helix-turn-helix proteins, araC family), and homeobox (homeobox domain). The log-odds weight matrices were constructed from these sequences with MEME version 2.1 (Bailey, T.L. and Elkan, C. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 28-36 (1994)).

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % hthscan [-INfile=]PIR:Rgeca -Default

Prompted Parameters:

-BEGin=1 -END=292         sets the range of interest
-FAMily=arac              specifies weight matrix by H-T-H family:
                            "arac", "lysr", or "homeobox"
-THRESHold=4.0            sets minimum score for H-T-H detection
[-OUTfile=]rgeca.hthscan  names the output file

Local Data Files:

-DATa=htharac.dat      assigns weight matrix for the araC family H-T-Hs
-DATa=hthlysr.dat      assigns weight matrix for the lysR family H-T-Hs
-DATa=hthhomeobox.dat  assigns weight matrix for the homeobox family H-T-Hs

Optional Parameters:

-NUMTOPscores=3        specifies maximum number of H-T-Hs to report
-EVEn                  assumes even target residue distribution
-NOPROBabilities       doesn't compute score probabilities
-VERbose               uses verbose output
-RSF[=hthscan.rsf]     saves features in the RSF file
-MONitor               displays screen trace of progress
-NOSUMmary             suppresses screen summary at end of the program
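
If you run HTHScan from scripts, it can help to assemble the command line programmatically. The Python sketch below only builds and prints a command line from the parameters listed in the summary above; it does not run the program. The helper name build_command and its keyword arguments are hypothetical conveniences.

```python
import shlex

FAMILIES = ("arac", "lysr", "homeobox")

def build_command(infile, family="arac", threshold=None, outfile=None,
                  num_top=None, even=False, no_probabilities=False):
    """Return an argument list for a non-interactive HTHScan run."""
    if family not in FAMILIES:
        raise ValueError(f"unknown H-T-H family: {family!r}")
    # -Default suppresses prompting, as in the minimal syntax.
    argv = ["hthscan", f"-INfile={infile}", "-Default", f"-FAMily={family}"]
    if threshold is not None:
        argv.append(f"-THRESHold={threshold}")
    if outfile is not None:
        argv.append(f"-OUTfile={outfile}")
    if num_top is not None:
        argv.append(f"-NUMTOPscores={num_top}")
    if even:
        argv.append("-EVEn")
    if no_probabilities:
        argv.append("-NOPROBabilities")
    return argv

command = build_command("PIR:Rgeca", family="lysr", threshold=4.0, num_top=1)
print(shlex.join(command))
```

Keeping the arguments as a list until the last moment avoids quoting problems if a file name contains spaces or shell metacharacters.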

ACKNOWLEDGEMENT

We thank Tim Bailey, Charles Elkan, and Bill Grundy for MEME (http://www.sdsc.edu/MEME), which was used to create the log-odds weight matrices. We thank Erik Sonnhammer, Sean Eddy, and Richard Durbin for the Pfam protein domain family database (http://www.sanger.ac.uk/Software/Pfam/), which was used to create input sequence sets for MEME.

HTHScan was written by Ted Slater.

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.
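
The lookup order described above can be pictured with a small Python sketch. It is an illustration only, not the package's own logic: the function name resolve_data_file is hypothetical, the directories are temporary stand-ins for the public data directory and your working directory, and the sketch assumes that a file named on the command line takes precedence over a same-named file in the working directory, which the text above does not state explicitly.

```python
import tempfile
from pathlib import Path

def resolve_data_file(name, public_dir, working_dir, explicit=None):
    """Pick a data file: explicit name, then working directory, then public."""
    if explicit is not None:
        # A file named on the command line, e.g. -DATa1=myfile.dat.
        return Path(explicit)
    local = Path(working_dir) / name
    if local.is_file():
        # A file with exactly the same name in the working directory.
        return local
    # Otherwise fall back to the public data directory.
    return Path(public_dir) / name

with tempfile.TemporaryDirectory() as public, tempfile.TemporaryDirectory() as work:
    (Path(public) / "htharac.dat").write_text("public copy\n")
    first = resolve_data_file("htharac.dat", public, work)
    used_public = first.parent == Path(public)

    (Path(work) / "htharac.dat").write_text("private copy\n")
    second = resolve_data_file("htharac.dat", public, work)
    used_local = second.parent == Path(work)

    named = resolve_data_file("htharac.dat", public, work, explicit="myfile.dat")

print(used_public, used_local, named)
```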

If you choose to search for the araC family of H-T-H motifs (the default), HTHScan will use the weight matrix file htharac.dat. If you choose to search for the lysR family of H-T-H motifs, HTHScan will use the weight matrix file hthlysr.dat. If you choose to search for the homeobox family of H-T-H motifs, HTHScan will use the weight matrix file hthhomeobox.dat.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-FAMily=arac

allows you to specify the weight matrix by naming the H-T-H motif family. You may specify arac for the araC family of bacterial regulatory proteins (weight matrix file htharac.dat), lysr for the lysR family of bacterial regulatory proteins (weight matrix file hthlysr.dat), or homeobox for the homeobox domain (weight matrix file hthhomeobox.dat).

-THRESHold=4.0

allows you to specify the minimum acceptable score for an H-T-H motif prediction. HTHScan reads the default threshold value from the weight matrix file itself.

-NUMTOPscores=3

specifies the maximum number of predicted H-T-H motifs to report for each sequence scanned. For example, if you use -NUMTOPscores=3, HTHScan will display no more than three of the highest scoring H-T-Hs predicted for each sequence. Use -NUMTOPscores=1 if you want to see only the highest scoring H-T-H in each sequence. By default, HTHScan will display all H-T-Hs predicted for each sequence.
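
The effect of this parameter can be pictured as a sort followed by a cut. The Python sketch below is a hypothetical illustration with made-up hits; top_hits is not a function of the package, and each hit is represented as a (score, first, last) tuple.

```python
def top_hits(hits, num_top=None):
    """Return hits by descending score, limited to num_top if it is given."""
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)
    return ranked if num_top is None else ranked[:num_top]

# Made-up (score, first, last) predictions for one sequence.
predicted = [(5.2, 40, 59), (39.8, 197, 216), (4.4, 120, 139), (7.9, 251, 270)]

print(top_hits(predicted, num_top=3))
print(top_hits(predicted, num_top=1))
print(len(top_hits(predicted)))
```

Hits with exactly equal scores may appear in either order, so a cut that falls between tied hits is not deterministic in general.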

-EVEn

tells HTHScan to assume that amino acid residues are distributed evenly throughout the length of the target sequence for the purpose of calculating score probabilities. This makes HTHScan perform a little faster, because it does not have to compute the actual distribution of residues in each input sequence. However, reliability of the score probability calculations may be adversely affected.
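
The difference between the two background assumptions can be seen in a short Python sketch. It is illustrative only: the function name background is hypothetical, and the example sequence is simply the 20-residue motif from the output shown earlier, used here as a stand-in for a full target sequence.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def background(sequence, even=False):
    """Return residue frequencies used for the score probabilities.

    even=True mimics -EVEn: every residue gets the same frequency.
    Otherwise the frequencies are counted from the sequence itself,
    ignoring anything that is not one of the 20 standard residues.
    """
    if even:
        return {res: 1.0 / len(AMINO_ACIDS) for res in AMINO_ACIDS}
    counts = Counter(res for res in sequence if res in AMINO_ACIDS)
    total = sum(counts.values())
    return {res: counts[res] / total for res in AMINO_ACIDS}

sequence = "IASVAQHVCLSPSRLSHLFR"
actual = background(sequence)
even = background(sequence, even=True)

# Serine is four times as common here as the even assumption implies.
print(actual["S"], even["S"])
```

Counting the residues requires one extra pass over each input sequence, which is the extra work that -EVEn avoids.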

-NOPROBabilities

tells HTHScan to forgo the calculation of the probability of random occurrence of each score. This makes HTHScan run much faster.

-VERbose

tells HTHScan to print more documentation about each sequence to the output file. The number of lines of documentation printed depends upon the value of the % DocLines global switch described in "Using Global Switches" in Chapter 3, Using Programs in the User's Guide.

-RSF=hthscan.rsf

writes an RSF (rich sequence format) file containing the input sequences annotated with features generated from the results of HTHScan. This RSF file is suitable for input to other Wisconsin Package programs that support RSF files. In particular, you can use SeqLab to view the feature annotation graphically. If you don't specify a file name with this parameter, the program creates one using hthscan for the file basename and .rsf for the extension. For more information on RSF files, see "Using Rich Sequence Format (RSF) Files" in Chapter 2 of the User's Guide or "Rich Sequence Format (RSF) Files" in Appendix C of the SeqLab Guide.

-MONitor=100

monitors this program's progress on your screen. Use this parameter to see this same monitor in the log file for a batch process. If the monitor is slowing down the program because your terminal is connected to a slow modem, suppress it with -NOMONitor.

The monitor is updated every time the program processes 100 sequences or files. You can use a value after the parameter to set this monitoring interval to some other number.

-SUMmary

writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: December 9, 1998 16:28 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982-2001 Genetics Computer Group, Inc. A subsidiary of Pharmacopeia, Inc. All rights reserved.

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com