PLOTSIMILARITY(+)

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
INPUT FILES
RELATED PROGRAMS
ALGORITHM
CONSIDERATIONS
SUGGESTIONS
GRAPHICS
<CTRL>C
COMMAND-LINE SUMMARY
LOCAL DATA FILES
PARAMETER REFERENCE

FUNCTION

PlotSimilarity plots the running average of the similarity among the sequences in a multiple sequence alignment.

DESCRIPTION

PlotSimilarity calculates the average similarity among all members of a group of aligned sequences at each position in the alignment, using a user-specified sliding window of comparison. The window of comparison is moved along all sequences, one position at a time, and the average similarity over the entire window is plotted at the middle position of the window. The average similarity across the entire alignment is plotted as a dotted line.
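The windowing step described above can be sketched in Python as follows. This is only an illustration, not the program's own code: the function name window_averages is hypothetical, and two details are assumptions, namely that windows running past either end of the alignment are skipped and that an even-sized window is plotted at the upper of its two middle positions.

```python
def window_averages(scores, window=10):
    """Average per-position scores over a sliding window.

    Returns (middle_position, average) pairs with 1-based positions.
    Windows that would extend past either end are skipped (assumption).
    """
    if window < 1 or window > len(scores):
        return []
    points = []
    total = sum(scores[:window])
    for start in range(len(scores) - window + 1):
        if start > 0:
            # Slide one position: add the entering score, drop the leaving one.
            total += scores[start + window - 1] - scores[start - 1]
        middle = start + window // 2 + 1  # 1-based middle of the window
        points.append((middle, total / window))
    return points
```

For example, window_averages([1, 2, 3, 4, 5], window=3) gives [(2, 2.0), (3, 3.0), (4, 4.0)].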

If you give PlotSimilarity a single input sequence, you can choose the range and strand for that sequence, and then PlotSimilarity prompts you for the name, range, and strand of a second input sequence. In this way, you can plot the average similarity between the two aligned sequences created with % gap -OUT.

EXAMPLE

Here is a session using PlotSimilarity to display the similarity among the group of aligned 70 kd heat shock and heat shock cognate peptide sequences in the file hsp70.msf:


% plotsimilarity

 PLOTSIMILARITY between what sequence(s) ? hsp70.msf{*}

      hsp70.msf{s11448}
      hsp70.msf{s06443}

      /////////////////////

      hsp70.msf{s29261}

 What window to average (* 10 *) ?

 The minimum density for this plot is  646.1 residues/100 platen units.
 What density do you want (* 646.1 *) ?

  When your LaserWriter attached to tty07 is ready, press <Return>.

%

OUTPUT

If you are reading the Program Manual, the plot from this session is shown in the figure below.

INPUT FILES

PlotSimilarity accepts two or more aligned nucleotide sequences or aligned protein sequences as input. The multiple sequence alignment created by the PileUp program can be used as input to PlotSimilarity, as can the gapped output files that the Gap and BestFit programs create with the -OUTfile2 and -OUTfile3 command-line parameters. If the first sequence entered into PlotSimilarity is a single sequence, the program prompts you for the second sequence.

If the input sequences are named in a list file, you can restrict the range of interest for any particular sequence with appropriate sequence attributes like Begin:43 and End:682. You can restrict the range of interest for all sequences in the alignment by including expressions like -BEGin=20 and -END=70 on the command line. The command-line range limiters take precedence over the range limiters for sequences in a list file when both are used. If no range limitation is specified, the entire length of each sequence is used.
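The precedence rules for range limits can be summarized in a short Python sketch. The function effective_range and its argument names are hypothetical, and resolving the beginning and the end independently of each other is an assumption based on the separate -BEGin and -END parameters.

```python
def effective_range(seq_length, list_begin=None, list_end=None,
                    cli_begin=None, cli_end=None):
    """Resolve the range used for one sequence.

    Command-line limits win over list-file attributes; with no limit
    at all, the entire length of the sequence is used.
    """
    if cli_begin is not None:
        begin = cli_begin
    elif list_begin is not None:
        begin = list_begin
    else:
        begin = 1
    if cli_end is not None:
        end = cli_end
    elif list_end is not None:
        end = list_end
    else:
        end = seq_length
    return begin, end
```

With the values used in the text, effective_range(718, 43, 682, 20, 70) returns (20, 70), while effective_range(718, 43, 682) returns (43, 682).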

You can also optionally specify a weight for each sequence in a list file with the weight: sequence attribute. (See "Using List Files" in Chapter 2, Using Sequence Files and Databases in the User's Guide for more information about sequence attributes in list files.)

You can assign weights to sequences in an MSF file by editing the MSF file and modifying the weight on the name/weight line for each sequence. (See "Using Multiple Sequence Format (MSF) Files" in Chapter 2, Using Sequence Files and Databases in the User's Guide for a complete description of MSF files.)

You can assign weights to sequences in an RSF (rich sequence format) file by modifying the weight attribute for each sequence within SeqLab. (See "Using Rich Sequence Format (RSF) Files" in Chapter 2, Using Sequence Files and Databases in the User's Guide for a complete description of RSF files. Also see "Viewing and Editing Sequence Attribute and Reference Information" in Chapter 2, Editing Sequences and Alignments in the SeqLab Guide for more information about modifying the weight attribute for each sequence within an RSF file.)

If a sequence from an MSF or RSF file is listed in a list file with a weight, the sequence weight is taken from the list file (the sequence weight in the MSF or RSF file is ignored). A weight of 1.0 is assumed if none is specified for a sequence. With -WEIGHT=1.0, PlotSimilarity ignores weights specified for individual sequences and gives all of the sequences in the alignment equal weight.
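The order in which a sequence weight is chosen can be written out as a small Python sketch. The function and argument names are hypothetical; the dictionaries stand in for weights read from a list file and from an MSF or RSF file.

```python
def effective_weight(name, list_file_weights, alignment_file_weights,
                     global_weight=None):
    """Pick the weight applied to one sequence."""
    if global_weight is not None:
        # -WEIGHT on the command line overrides every per-sequence weight.
        return global_weight
    if name in list_file_weights:
        # A list-file weight takes the place of the MSF or RSF weight.
        return list_file_weights[name]
    # Fall back to the MSF/RSF weight, or 1.0 if none is specified.
    return alignment_file_weights.get(name, 1.0)
```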

PlotSimilarity tries to read the name of the scoring matrix from the text heading of the input file. If it can't read the matrix name, it uses the default scoring matrix (see the LOCAL DATA FILES topic below). The function of PlotSimilarity depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, see Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

PileUp creates a multiple sequence alignment of a group of related sequences.

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.

BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman.

ProfileMake creates a position-specific scoring table, called a profile, that quantitatively represents the information from a group of aligned sequences. The profile can then be used for database searching (ProfileSearch) or sequence alignment (ProfileGap).

GapShow displays an alignment of two sequences by making a graph that shows the distribution of similarities and gaps.

ALGORITHM

The average similarity at a position in an alignment is the arithmetic average of the scores of all possible pairwise symbol comparisons among the sequence symbols at that position. The comparison score between any two sequence symbols is the comparison value between those symbols in the scoring matrix (see the LOCAL DATA FILES topic below) multiplied by the weight of each of the two sequences. The average similarity across the entire alignment (plotted as a dotted line) is the sum of the separate window similarities divided by the number of windows.
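The calculation for a single alignment position can be sketched in Python as shown here. This is an illustration under stated assumptions rather than the program's code: the function names are hypothetical, the scoring matrix is represented as a dictionary keyed by symbol pairs, and the treatment of gap characters is not modeled.

```python
from itertools import combinations

def column_similarity(symbols, weights, matrix):
    """Average weighted pairwise score for one alignment column.

    matrix maps (symbol, symbol) pairs to comparison values.
    """
    pairs = list(combinations(range(len(symbols)), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        # Matrix value multiplied by the weight of each of the two sequences.
        total += matrix[(symbols[i], symbols[j])] * weights[i] * weights[j]
    return total / len(pairs)

def overall_average(window_values):
    """Sum of the window similarities divided by the number of windows."""
    return sum(window_values) / len(window_values)
```

With a toy matrix that scores 1.0 for A against A and 0.0 for A against G, the column "AAG" with all weights at 1.0 has one matching pair out of three, so its average similarity is 1/3.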

With -IDEntity, the program plots a measure of the level of identity among all sequences in the multiple sequence alignment. The calculations are done exactly as described above, but all identical symbol comparisons are given a value of 1; all other comparisons are given a value of 0.
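When every sequence weight is 1.0, the identity measure at a position reduces to the fraction of sequence pairs that carry the same symbol, and that fraction can be counted without enumerating the pairs. The Python sketch below shows this shortcut; the function name is hypothetical, and treating a gap character as an ordinary symbol is an assumption.

```python
from collections import Counter

def column_identity(symbols):
    """Fraction of identical symbol pairs in one column (all weights 1.0)."""
    n = len(symbols)
    if n < 2:
        return 0.0
    # A symbol seen c times contributes c * (c - 1) / 2 identical pairs.
    same = sum(c * (c - 1) // 2 for c in Counter(symbols).values())
    return same / (n * (n - 1) // 2)
```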

With -PROFile, the program plots a running average of the positional conservation in a profile. The measure of conservation at any position is the difference between the greatest and least values at that position in the profile.
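A minimal Python sketch of the profile measure follows. It assumes each profile position is supplied as a plain list of the comparison values for that position (any gap-penalty columns in a real profile file are assumed to have been removed beforehand); the function names are hypothetical.

```python
def profile_conservation(profile_rows):
    """Greatest minus least value at each profile position."""
    return [max(row) - min(row) for row in profile_rows]

def running_conservation(profile_rows, window=10):
    """Running average of the per-position conservation values."""
    cons = profile_conservation(profile_rows)
    return [sum(cons[i:i + window]) / window
            for i in range(len(cons) - window + 1)]
```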

CONSIDERATIONS

PlotSimilarity does not create the multiple sequence alignment. You can create the alignment using PileUp, Gap, or BestFit (see the INPUT FILES topic above).

SUGGESTIONS

You can plot a measure of identity between all sequences in the alignment using -IDEntity.

You can plot a measure of the level of conservation in a profile created from a multiple sequence alignment using -PROFile. This plot provides similar information to a plot of the similarity among the sequences in the multiple sequence alignment.

GRAPHICS

The Wisconsin Package must be configured for graphics before you run any program with graphics output! If the % setplot command is available in your installation, this is the easiest way to establish your graphics configuration, but you can also use commands like % postscript that correspond to the graphics languages the Wisconsin Package supports. See Chapter 5, Using Graphics in the User's Guide for more information about configuring your process for graphics.

<CTRL>C

If you need to stop this program, use <Ctrl>C to reset your terminal and session as gracefully as possible. Searches and comparisons write out the results from the part of the search that is complete when you use <Ctrl>C. The graphics device should stop plotting the current page and start plotting the next page. If the current page is the last page, plotters should put the pen away and graphic terminals should return to interactive mode.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % plotsimilarity [-INfile1=]hsp70.msf{*} -Default

Prompted Parameters:

-WINdow=10             sets the comparison window size
-DENsity=624.3         sets the number of bases per 100 platen units

Prompted Parameters: (for comparing 2 sequences only)

[-INfile2=]ggamma.gap  names the second input sequence
-BEGin1=1 -END1=1700   sets the range of interest for sequence 1
-BEGin2=1 -END2=1700   sets the range of interest for sequence 2
-REVerse1 -REVerse2    uses the reverse strand of one or both sequences

Local Data Files:

-MATRix=blosum62.cmp    assigns the scoring matrix for proteins
-MATRix=plotsimdna.cmp  assigns the scoring matrix for nucleic acids

Optional Parameters:

-BEGin1=1 -END1=718      sets the range of interest in the alignment
-OUTfile[=hsp70.plotsim] writes the similarity values to a file
-WEIGHT=1                sets the weight for all input sequences
-IDEntity                plots the level of identity among the sequences
-BARgraph                plots a bar graph (rather than a continuous curve)
-PROFile                 plots positional conservation in a profile
-MINScale=0              sets the bottom of the similarity score scale
-MAXScale=2              sets the top of the similarity score scale
-EXPand                  scales plot between observed min and max
                           similarity scores
-NOAVErage               suppresses the plot of overall similarity
-NOPLOt                  suppresses the plot
-CMASK=filename          creates a SeqLab colormask file with grayscale
                           values for levels of similarity

All GCG graphics programs accept these and other switches. See the Using
Graphics chapter of the USERS GUIDE for descriptions.

-FIGure[=FileName]  stores plot in a file for later input to FIGURE
-FONT=3             draws all text on the plot using font 3
-COLor=1            draws entire plot with pen in stall 1
-SCAle=1.2          enlarges the plot by 20 percent (zoom in)
-XPAN=10.0          moves plot to the right 10 platen units (pan right)
-YPAN=10.0          moves plot up 10 platen units (pan up)
-PORtrait           rotates plot 90 degrees
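
The capitalization convention used in the summary above (capital letters mark the part of a parameter name that must be typed) can be illustrated with a short Python check. This is a sketch only: the function name is hypothetical, and it assumes that matching ignores case and that any longer prefix of the full name is also accepted, neither of which is stated in this document. Values following an equals sign are not handled.

```python
import string

def matches_parameter(typed, spec):
    """Check whether typed (such as '-win') is a valid form of spec ('-WINdow')."""
    name = spec.lstrip("-")
    # The leading run of capital letters is the part that must be typed.
    required = len(name) - len(name.lstrip(string.ascii_uppercase))
    word = typed.lstrip("-").upper()
    return len(word) >= required and name.upper().startswith(word)
```

Under these assumptions, "-win" and "-window" both match -WINdow, while "-wi" is too short to match.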

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program default scoring matrix file in a public data directory unless 1) you name a file on the command line with an expression like -MATRix=mymatrix.cmp; 2) a different file is named in the text heading of the input file; 3) you have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 4) you have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData. If you don't include a directory specification when you name a file on the command line with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Chapter 4, Using Data Files in the User's Guide.
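The search order for a matrix file named with -MATRix and no directory specification can be sketched in Python. The function name is hypothetical, and the four directory arguments are stand-ins for the local directory and for the directories behind the logical names MyData, GenMoreData, and GenRunData; how those logical names resolve on a given system is outside this sketch.

```python
from pathlib import Path

def find_matrix(filename, local_dir, my_data, gen_more_data, gen_run_data):
    """Return the first existing file in the documented search order."""
    if Path(filename).parent != Path("."):
        # A directory specification was given, so no search is made.
        return Path(filename)
    for directory in (local_dir, my_data, gen_more_data, gen_run_data):
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None
```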

PlotSimilarity reads a scoring matrix from your local directory or the public database with the values for every possible match. PlotSimilarity tries to read the name of the scoring matrix from the text heading of the input file. If it can't read the matrix name, it uses the default scoring matrix. The default file plotsimdna.cmp (nucleotide sequence alignments) has a 1.0 at every place where the sets of bases implied by the two alphabetic IUB ambiguity codes (see Appendix III) overlap; all of the other locations have zeros. The default file blosum62.cmp (protein sequence alignments) is based on substitutions between amino acid pairs in ungapped blocks of aligned protein segments as measured by Henikoff and Henikoff. The scores in this matrix for pairwise amino acid comparisons range from -4 to +11. You can use Fetch to copy these files and then modify them to suit your own needs.
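The rule behind the nucleotide default matrix can be reproduced with a short Python sketch. It builds a table from the standard IUB ambiguity codes and is not a copy of plotsimdna.cmp itself; the actual file may include symbols not listed here, and the names IUB and overlap_matrix are hypothetical.

```python
# Standard IUB ambiguity codes and the bases each one implies.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def overlap_matrix(codes=IUB):
    """Score 1.0 where the implied base sets share a base, otherwise 0.0."""
    return {(a, b): 1.0 if set(codes[a]) & set(codes[b]) else 0.0
            for a in codes for b in codes}
```

In this table R (A or G) scores 1.0 against A and 0.0 against Y (C or T).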

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-WINdow=10

sets the size of the sequence window within which the average similarity score is calculated for the alignment.

-DENsity=1000

sets the number of bases or amino acids per 100 platen units (PU). This is usually equivalent to the number of bases or amino acids per page. Output from different GCG graphics programs that are run at the same density can be compared by lining up the plots on a light box.
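Because density is defined as residues per 100 platen units, the width a plot occupies follows by simple proportion. The Python helper below is a hypothetical illustration of that arithmetic only; it says nothing about how PlotSimilarity lays out pages.

```python
def plot_width_pu(n_positions, density):
    """Width in platen units of n_positions plotted at the given density."""
    return 100.0 * n_positions / density
```

At -DENsity=1000, an alignment of 500 positions spans 50 platen units.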

-MATRix=mymatrix.cmp

allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.

For more information see the Local Scoring Matrices section.

-BEGin=1

sets the beginning position for all input sequences. When the beginning position is set from the command line, PlotSimilarity ignores beginning positions specified for individual sequences in a list file.

-END=100

sets the ending position for all input sequences. When the ending position is set from the command line, PlotSimilarity ignores ending positions specified for sequences in a list file.

-OUTfile=hsp70.plotsim

writes an output file with the average similarity value at each position in the alignment.

-WEIGHT=1.0

sets the sequence weight for all input sequences. When the weight is set with this parameter, PlotSimilarity ignores weights specified for individual sequences in a list file, MSF file, or RSF file.

-IDEntity

plots the level of identity between the sequences.

-BARgraph

plots the similarity as a bar graph (rather than a continuous curve).

-PROFile

plots a running average of the positional conservation in a profile. The measure of conservation at any position is the difference between the greatest and least values at that position in the profile.

-MINScale=0

sets the bottom of the similarity score scale.

-MAXScale=2

sets the top of the similarity score scale.

-EXPand

scales the plot between the observed minimum and maximum scores, rather than between the minimum and maximum scores in the scoring matrix.
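The effect of -EXPand on the vertical scale can be sketched in Python. The function name is hypothetical, and the interaction with -MINScale and -MAXScale is not described in this document, so it is left out.

```python
def axis_limits(values, matrix_scores, expand=False):
    """Choose the bottom and top of the similarity score scale.

    With expand, use the observed scores; otherwise use the range of
    values present in the scoring matrix.
    """
    source = values if expand else matrix_scores
    return min(source), max(source)
```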

-NOAVErage

suppresses the plot of overall average similarity between the sequences.

-NOPLOt

suppresses the plot.

-CMASK=plotsimilarity.cmask

creates a SeqLab grayscale colormask file according to the average similarity value at each position in the alignment. This file can be used to shade each column of the alignment in the Editor mode of SeqLab.

The parameters below apply to all Wisconsin Package graphics programs. These and many others are described in detail in Chapter 5, Using Graphics of the User's Guide.

-FIGure=programname.figure

writes the plot as a text file of plotting instructions suitable for input to the Figure program instead of sending it to the device specified in your graphics configuration.

-FONT=3

draws all text characters on the plot using Font 3 (see Appendix I).

-COLor=1

draws the entire plot with the pen in stall 1.

The parameters below let you expand or reduce the plot (zoom), move it in either direction (pan), or rotate it 90 degrees (rotate).

-SCAle=1.2

expands the plot by 20 percent by resetting the scaling factor (normally 1.0) to 1.2 (zoom in). You can expand the axes independently with -XSCAle and -YSCAle. Numbers less than 1.0 contract the plot (zoom out).

-XPAN=30.0

moves the plot to the right by 30 platen units (pan right).

-YPAN=30.0

moves the plot up by 30 platen units (pan up).

-PORtrait

rotates the plot 90 degrees. Usually, plots are displayed with the horizontal axis longer than the vertical (landscape). Note that plots are reduced or enlarged, depending on the platen size, to fill the page.

Printed: December 9, 1998 16:22 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982-2001 Genetics Computer Group, Inc. A subsidiary of Pharmacopeia, Inc. All rights reserved.

Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com