PRETTY

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
CALCULATING AND DISPLAYING A CONSENSUS
COMMAND-LINE SUMMARY
ACKNOWLEDGEMENT
LOCAL DATA FILES
PARAMETER REFERENCE

FUNCTION

Pretty displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.

DESCRIPTION

Pretty prints sequences with their columns aligned and can display a consensus for the alignment, allowing you to look at relationships among the sequences. This program can be used for aligned sequences in an MSF (multiple sequence format) or RSF (rich sequence format) file, or for separate sequences that have had gaps added to make them all align.

You can change the alignments displayed by Pretty with a text editor. The output from Pretty can then be separated into individual sequence files by running Pretty with the command-line parameter -UGLy.

EXAMPLE

Gaps were added to a group of picornavirus capsid proteins in the antigenic region by running the Gap program repeatedly with the command-line parameter -OUT, so that the sequences align with each other and with a growing consensus sequence. Here is a session using Pretty to display that alignment and calculate a consensus sequence for the antigenic region of those capsid proteins.


% pretty -CONsensus -CASe

 PRETTY format what sequence(s) ?  @pretty.list

                      fa10.ugly  len: 349  wgt: 0.50
                      fa12.ugly  len: 349  wgt: 0.50

                      //////////////////////////////

                       r14.ugly  len: 349  wgt: 0.50
                        r2.ugly  len: 349  wgt: 0.50

                  Begin (* 1 *) ?
                End (*   349 *) ?

 Find consensus to what minimum plurality (* 2.00 *) ?

 What should I call the output file (* pretty.pretty *) ?

%

OUTPUT

Here is part of the output file:


Plurality: 2.00  Threshold: 4
AveWeight 0.55  AveMatch 2.91  AvMisMatch -2.00

PRETTY of: @pretty.list   October 7, 1998 10:35  ..

           1                                                   50
fa10.ugly  .......... .......... .......... ..TTttGESA D.PvtTtVE.
fa12.ugly  .......... .......... .......... ..TTatGESA D.PvtTtVE.
fo1k.ugly  .......... .......... .......... ..TTsaGESA D.PvtTtVE.
   e.ugly  Gvenae.kgv tEnTna.Tad fvaqpvyLPe .nqT...... kv.Affynrs
 p1m.ugly  GlgqmlEsmI .dnTvreTvg AatsrdaLPn teasGPthSk eiPALTAVET
 p1s.ugly  GlgqmlEsmI .dnTvreTvg AatsrdaLPn teasGPahSk eiPALTAVET
 p2s.ugly  GigdmiEgav .Egitknalv pptstnsLPg hkpsGPahSk eiPALTAVET
 p3s.ugly  Giedliseva .qgal..Tls lpkqqdsLPd tkasGPahSk evPALTAVET
 cb3.ugly  ...gpvEdaI .......T.. Aaigr..vad tvgTGPtnSe aiPALTAaET
 r14.ugly  GlgdelEevI vEkT.kqTv. Asi....... ..ssGPkhtq kvPiLTAnET
  r2.ugly  ...npvEnyI dEvlnevlv. .......vPn inssnPttSn saPALdAaET
Consensus  G-----E--I -E-T---T-- A------LP- --TTGPGESA D-PALTAVET

/////////////////////////////////////////////////////////////////

           301                                               349
fa10.ugly  aElyCPRPll AIkvtsqdRy KqKI.iAPa. ..KQll.... .........
fa12.ugly  aElyCPRPll AIevssqdRh KqKI.iAPg. ..KQll.... .........
fo1k.ugly  aEtyCPRPll AIhpt.eaRh KqKI.vAPv. ..KQTl.... .........
   e.ugly  krvfCPRPtv ffPwpTsG.D Kidmtpragv lmlespnald isrty....
 p1m.ugly  irvWCPRPPR AlaYygpGvD ykdgtltPls tkdlTTy... .........
 p1s.ugly  irvWCPRPPR AvaYygpGvD ykdgtltPls tkdlTTy... .........
 p2s.ugly  VrvWCPRPPR AvPYfgpGvD ykdg.ltPlp ekglTTy... .........
 p3s.ugly  VrvWCPRPPR AvPYygpGvD yrn.nldPls ekglTTy... .........
 cb3.ugly  VkaWiPRPPR lcqYekakn. vnfrssgvtt trqsiTtmtn tgaiwtti.
 r14.ugly  VEaWiPRaPR AlPY.Tsigr tny..pknte pvikkrk.gd i.ksy....
  r2.ugly  VkaWCPRPPR AleY.Trahr tnfkiedrsi qtaivTrpii ttagpsdmy
Consensus  VE-WCPRPPR AIPY-T-GRD K-KI--AP-- --KQTT---- ---------

INPUT FILES

Pretty accepts one or more aligned nucleotide sequences or aligned protein sequences as input. You can specify an MSF file, such as the output file from a session with PileUp, as input to Pretty with a command like % pretty pileup.msf{*}. Similarly, you can specify an RSF file, such as the output file from a session with PileUp in SeqLab, with a command like % pretty pileup.rsf{*}. Weights can be specified for sequences in both MSF and RSF files. (See the Vote Weight discussion below.) Multiple sequence alignments can also be represented with list files. For Pretty, these files may include a vote weight for each sequence with the wgt: sequence attribute.

Here is the input file of sequence names (pretty.list) from the example session:


!!SEQUENCE_LIST 1.0
A multiple sequence alignment represented as a list file for input to
the programs PRETTY, PROFILEMAKE and LINEUP.

7/30/94   ..

GenDocData:fa10.ugly    wgt: 0.5
GenDocData:fa12.ugly    wgt: 0.5
GenDocData:fo1k.ugly    wgt: 1.0
GenDocData:e.ugly       wgt: 1.0
GenDocData:p1m.ugly     wgt: 0.25
GenDocData:p1s.ugly     wgt: 0.25
GenDocData:p2s.ugly     wgt: 0.25
GenDocData:p3s.ugly     wgt: 0.25
GenDocData:cb3.ugly     wgt: 1.0
GenDocData:r14.ugly     wgt: 0.5
GenDocData:r2.ugly      wgt: 0.5
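
The layout of this list file can be read with a few lines of Python. The sketch below is only an illustration, not part of the package: the function name parse_list_file is hypothetical, and treating lines that start with "!" as comments is an assumption. It takes everything after the line ending in two periods (..) as sequence entries and defaults a missing wgt: attribute to 1.0, as the Vote Weight discussion describes.

```python
import re

def parse_list_file(text):
    """Return (name, weight) pairs from list-file text (illustrative parser)."""
    entries = []
    in_body = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_body:
            # Everything up to the line ending in ".." is heading text.
            in_body = stripped.endswith("..")
            continue
        if not stripped or stripped.startswith("!"):
            continue  # skip blank lines and (assumed) comment lines
        name = stripped.split()[0]
        match = re.search(r"wgt:\s*([0-9]*\.?[0-9]+)", stripped)
        # A weight of 1.0 is assumed when no wgt: attribute is present.
        entries.append((name, float(match.group(1)) if match else 1.0))
    return entries

EXAMPLE = """\
!!SEQUENCE_LIST 1.0
A multiple sequence alignment represented as a list file for input to
the programs PRETTY, PROFILEMAKE and LINEUP.

7/30/94   ..

GenDocData:fa10.ugly    wgt: 0.5
GenDocData:fa12.ugly    wgt: 0.5
GenDocData:fo1k.ugly    wgt: 1.0
GenDocData:e.ugly       wgt: 1.0
GenDocData:p1m.ugly     wgt: 0.25
GenDocData:p1s.ugly     wgt: 0.25
GenDocData:p2s.ugly     wgt: 0.25
GenDocData:p3s.ugly     wgt: 0.25
GenDocData:cb3.ugly     wgt: 1.0
GenDocData:r14.ugly     wgt: 0.5
GenDocData:r2.ugly      wgt: 0.5
"""

entries = parse_list_file(EXAMPLE)
weights = [w for _, w in entries]
print(len(entries), sum(weights), round(sum(weights) / len(weights), 2))
```

The eleven weights sum to 6.0, and their mean rounds to 0.55, which matches the AveWeight figure in the example output. This document does not state how AveWeight is computed, so that agreement should be read as a plausibility check only.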

The function of Pretty depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, see Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

PileUp creates a multiple sequence alignment of a group of related sequences. If you run Gap with the command-line parameters for sequence output, it writes sequence files with the sequences expanded by the addition of gaps. LineUp is an editor that allows you to edit multiple sequence alignments.

PrettyBox produces a PostScript file containing a multiple sequence alignment with residues shaded on the basis of agreement to a calculated consensus sequence, allowing you to identify relationships among the sequences.

PlotSimilarity plots the average similarity of two or more aligned sequences at each position in the alignment.

RESTRICTIONS

Pretty displays sequences that have already been aligned. You can use up to 500 sequences, although the total length of all sequences combined must be less than 2,000,000 characters.

CALCULATING AND DISPLAYING A CONSENSUS

If you use one of the command-line parameters -CONsensus, -DIFferences, or -CASe, Pretty calculates a consensus for each column of the alignment using the scoring matrix blosum62.cmp for peptides or prettydna.cmp for nucleic acids. The consensus symbol for a column is determined in two steps:

1) The program finds the symbol whose comparison to all of the symbols in the column (including itself) yields the greatest number of votes. A vote is cast for each symbol comparison whose value is greater than or equal to the threshold. Each vote counts as 1.0 unless the sequence casting it has been assigned a vote weight, in which case it counts as that weight.

2) Among the coalition of symbols that voted for the winning symbol, the most common symbol is chosen as the consensus.

If there is no coalition of votes that is larger than all of the other coalitions, or if the largest coalition of votes is below the minimum plurality, then there is no consensus for the column.

The weights for each sequence and the minimum plurality are floating point numbers. The threshold value is an integer.
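
The two steps and the no-consensus rule above can be sketched in Python for a single column. This is a minimal illustration under stated assumptions, not Pretty's actual code: the comparison values in toy_score are invented and do not come from blosum62.cmp; gap symbols are assumed not to vote; candidate symbols are assumed to be those present in the column; two candidates supported by the same rows are treated as one coalition; and "most common" in step 2 is counted by vote weight.

```python
from collections import defaultdict

GAPS = {".", "-", "~", " "}  # assumed gap symbols; assumed not to vote

def toy_score(a, b):
    """Hypothetical comparison values, NOT taken from blosum62.cmp."""
    similar = {"I", "L", "V"}
    if a == b:
        return 4
    if a in similar and b in similar:
        return 2
    return -2

def column_consensus(column, weights, threshold, plurality, score=toy_score):
    """Return the consensus symbol for one column, or None if there is none."""
    rows = [(i, s.upper()) for i, s in enumerate(column) if s not in GAPS]
    # Step 1: every symbol present is a candidate; its coalition is the set
    # of rows whose symbol scores >= threshold against it.
    coalitions = {}
    for _, candidate in rows:
        members = frozenset(i for i, s in rows if score(candidate, s) >= threshold)
        coalitions[members] = sum(weights[i] for i in members)
    if not coalitions:
        return None
    ranked = sorted(coalitions.values(), reverse=True)
    best = ranked[0]
    # No consensus below the minimum plurality, or when two distinct
    # coalitions tie for first place.
    if best < plurality or (len(ranked) > 1 and ranked[1] == best):
        return None
    winners = next(m for m, v in coalitions.items() if v == best)
    # Step 2: most common symbol inside the winning coalition
    # (counted by vote weight here, which is an assumption).
    tally = defaultdict(float)
    for i in winners:
        tally[column[i].upper()] += weights[i]
    return max(tally, key=tally.get)

print(column_consensus(list("IILA"), [1.0] * 4, threshold=1, plurality=2.0))  # I
print(column_consensus(list("AAGG"), [1.0] * 4, threshold=1, plurality=2.0))  # None
```

In the first call, I and L score at or above the threshold against each other, so three rows form one coalition with 3.0 votes and I is its most common symbol. In the second call two coalitions tie at 2.0 votes, so the column has no consensus.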

If you use -IDEntity, a consensus symbol is chosen only when all of the sequence symbols in a column of the alignment are identical, regardless of their votes.

If you use -CASe, Pretty shows the symbols in a column in uppercase when their comparison value with the consensus symbol meets or exceeds the threshold. All other symbols are in lowercase.

If you use -DIFferences, Pretty shows only those symbols in a column whose comparison value with the consensus symbol is lower than the threshold. These symbols are shown in lowercase; all other positions in the column are left blank.

If you use -CONsensus, Pretty adds a line to your alignment with the consensus sequence.
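
The -CASe and -DIFferences display rules can be sketched for one column as follows. This is an illustration under assumptions rather than the program's implementation: the function names are hypothetical, toy_score uses invented comparison values instead of a real scoring matrix, and a column without a consensus (None) is shown entirely in lowercase under -CASe, which is consistent with the example output but not stated as a rule in this document.

```python
def toy_score(a, b):
    """Hypothetical comparison values, not read from a real .cmp file."""
    return 4 if a == b else -2

def show_case(column, consensus, threshold, score=toy_score):
    """-CASe: uppercase symbols that meet the threshold, lowercase the rest."""
    shown = []
    for sym in column:
        agrees = (consensus is not None and sym.isalpha()
                  and score(consensus, sym.upper()) >= threshold)
        shown.append(sym.upper() if agrees else sym.lower())
    return shown

def show_differences(column, consensus, threshold, filler=" ", score=toy_score):
    """-DIFferences: keep only symbols scoring below the threshold."""
    shown = []
    for sym in column:
        differs = sym.isalpha() and score(consensus, sym.upper()) < threshold
        shown.append(sym.lower() if differs else filler)
    return shown

column = ["T", "t", "a", "."]
print(show_case(column, "T", threshold=1))         # ['T', 'T', 'a', '.']
print(show_differences(column, "T", threshold=1))  # [' ', ' ', 'a', ' ']
```

The filler argument plays the role of the optional character that -DIFferences accepts in place of blank spaces.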

-THReshold=1

determines the scoring matrix value below which a symbol may not vote for a coalition. Pretty chooses a default threshold that is appropriate for the scoring matrix it reads. If you select a different scoring matrix with the -MATRix command-line parameter, the program will adjust the default threshold accordingly. Use -THReshold to specify an alternative threshold if you don't want to accept the default value.

-PLUrality=2.0

defines the number of votes (vote weights) below which there is no consensus.

Vote Weight

If several of your sequences are very similar, you may not want their votes to dominate the consensus for the column. If your input file specification to Pretty is a list file, you can assign each sequence a vote weight with the wgt sequence attribute. The vote weight is the vote that each row casts for the consensus. A weight of 1.0 is assumed if no vote weight is specified. (See the INPUT FILES topic for information about the list file used to run the example above.) Note how each kind of sequence is assigned a vote weight so that their combined impact on the election is never more than one vote. For more information about list files, see "Using List Files" in Chapter 2, Using Sequence Files and Databases in the User's Guide.

You can assign vote weights to sequences in an MSF file by editing the MSF file and modifying the weight on the name/weight line for each sequence at the top of the file. (See "Using Multiple Sequence Format (MSF) Files" in Chapter 2, Using Sequence Files and Databases in the User's Guide for a complete description of MSF files.)

You can assign vote weights to sequences in an RSF (rich sequence format) file by modifying the weight attribute for each sequence within SeqLab. (See "Using Rich Sequence Format (RSF) Files" in Chapter 2, Using Sequence Files and Databases in the User's Guide for a complete description of RSF files. Also see "Viewing and Editing Sequence Attribute and Reference Information" in Chapter 2, Editing Sequences and Alignments in the SeqLab Guide for more information about modifying the weight attribute for each sequence within an RSF file.)

If a sequence from an MSF or RSF file is listed in a list file with a vote weight, the vote weight in the list file is used; the sequence weight in the MSF or RSF file is ignored. If you add -WEIGHT=1.0 to the command line, Pretty ignores weights specified for individual sequences and gives all of the sequences in the alignment equal weight.
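
The precedence among the three sources of a vote weight can be summarized in a short Python sketch. The function and argument names are hypothetical; only the ordering (command line first, then list file, then MSF or RSF file, then the default of 1.0) comes from the text above.

```python
def effective_weight(file_weight=None, list_weight=None, command_line_weight=None):
    """Pick the vote weight for one sequence using the documented precedence."""
    if command_line_weight is not None:
        return command_line_weight   # -WEIGHT overrides every per-sequence weight
    if list_weight is not None:
        return list_weight           # wgt: in a list file beats the MSF/RSF weight
    if file_weight is not None:
        return file_weight           # weight stored in the MSF or RSF file
    return 1.0                       # default when nothing is specified

print(effective_weight(file_weight=0.25, list_weight=0.5))                           # 0.5
print(effective_weight(file_weight=0.25, list_weight=0.5, command_line_weight=1.0))  # 1.0
```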

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % pretty [-INfile=]@pretty.list -Default

Prompted Parameters:

-BEGin=1 -END=349         sets the range of interest
[-OUTfile=]pretty.pretty  names the output file

Local Data Files:

-MATRix=prettydna.cmp  assigns the scoring matrix for nucleotides
-MATRix=blosum62.cmp   assigns the scoring matrix for proteins

Optional Parameters:

-CONsensus         generates a consensus sequence
-IDEntity[=*]      shows only positions of unanimous agreement
                     in the consensus
-DIFferences[="-"] shows only positions disagreeing with the calculated
                     consensus
-CASe              shows positions agreeing with the calculated consensus
                     in uppercase
-THReshold=1       sets minimum comparison value for symbol to vote
                     in consensus
-PLUrality=2.0     defines the minimum number of votes for a consensus
                     to exist
-LINesize=50       sets the number of residues per line
-WEIGHT=1.0        sets the weight for all input sequences
-BLOcksize=10      sets the number of residues per block
-UGLy              writes the individual sequences into new files

ACKNOWLEDGEMENT

We are very grateful to Ann Palmenberg of the UW Biophysics lab for help with the design of Pretty. The sequences in the example were aligned for Dr. Palmenberg's work.

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Chapter 4, Using Data Files in the User's Guide.
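
The search order for a matrix named without a directory specification can be pictured with the Python sketch below. It is an illustration only: find_matrix is a hypothetical helper, and the logical_dirs mapping is a stand-in for however the package resolves the logical names MyData, GenMoreData, and GenRunData to directories.

```python
from pathlib import Path

# Search order after the current directory, as described in the text.
LOGICAL_ORDER = ("MyData", "GenMoreData", "GenRunData")

def find_matrix(filename, logical_dirs, cwd="."):
    """Return the first existing path for filename, or None.

    logical_dirs maps logical names to directories; this mapping is an
    assumption made for the sketch.
    """
    name = Path(filename)
    if name.parent != Path("."):
        # A directory specification was given, so no search is done.
        return name if name.is_file() else None
    search = [Path(cwd)]
    search += [Path(logical_dirs[n]) for n in LOGICAL_ORDER if n in logical_dirs]
    for directory in search:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

A call such as find_matrix("mymatrix.cmp", {"MyData": "/some/dir"}) returns the copy in the current directory if one exists, and otherwise the first copy found in the listed directories.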

If you use one of the command-line parameters -CONsensus, -DIFferences, or -CASe, Pretty calculates a consensus for each column using a scoring matrix (see Chapter 4, Using Data Files in the User's Guide). You can provide your own matrix called either blosum62.cmp for peptides or prettydna.cmp for nucleic acids. You can specify some other matrix with the command-line parameter -MATRix=filename.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-MATRix=mymatrix.cmp

allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.

For more information see the Local Scoring Matrices section.

-CONsensus

causes Pretty to show a consensus sequence for the set of sequences you are displaying. (See the CALCULATING AND DISPLAYING A CONSENSUS topic above for how Pretty finds the consensus.)

-IDEntity=*

causes Pretty to show a consensus indicating where there is complete agreement among all of the sequences. If an optional character is added after the command-line parameter, Pretty uses that character to indicate complete agreement. Otherwise, the consensus contains the completely conserved sequence symbol.

-DIFferences="-"

causes Pretty to print only those symbols in each column whose comparison value with the consensus symbol is lower than the threshold (see -THReshold below), and to print blank spaces at all other positions. If an optional character is added, Pretty prints that character instead of blank spaces. The optional character has to be enclosed in quotes.

-CASe

causes Pretty to print in uppercase all those symbols in each column whose comparison value with the consensus symbol is greater than or equal to the threshold (see -THReshold below), and to print all other symbols in lowercase. This parameter overrides -DIFferences if both are used.

-THReshold=1

determines the scoring matrix value below which a symbol may not vote for a coalition (see the CALCULATING AND DISPLAYING A CONSENSUS topic above). Pretty chooses a default threshold that is appropriate for the scoring matrix it reads. If you select a different scoring matrix with the -MATRix command-line parameter, the program will adjust the default threshold accordingly. Use -THReshold to specify an alternative threshold if you don't want to accept the default value.

-PLUrality=2.0

defines the number of votes (vote weights) below which there is no consensus (see the CALCULATING AND DISPLAYING A CONSENSUS topic above).

-LINesize=50

specifies the number of sequence symbols to display on each line.

-WEIGHT=1.0

sets the sequence weight for all input sequences. When the weight is set from the command line, Pretty ignores weights for individual sequence files in a list file, a multiple sequence format (MSF) file, or a rich sequence format (RSF) file.

-BLOcksize=10

specifies the number of sequence symbols to put into each block.
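
The combined effect of -LINesize and -BLOcksize on the layout can be sketched in Python. The helper name format_blocks is hypothetical and the code is only an illustration of the layout, not the program's own formatting routine; the defaults of 50 and 10 are the documented parameter defaults.

```python
def format_blocks(sequence, linesize=50, blocksize=10):
    """Split a sequence into lines of linesize symbols, spaced every blocksize."""
    lines = []
    for start in range(0, len(sequence), linesize):
        chunk = sequence[start:start + linesize]
        blocks = [chunk[i:i + blocksize] for i in range(0, len(chunk), blocksize)]
        lines.append(" ".join(blocks))
    return lines

# First 50 symbols of fa10.ugly from the example output, written without spaces.
fa10 = "." * 32 + "TTttGESA" + "D.PvtTtVE."
print(format_blocks(fa10)[0])
# .......... .......... .......... ..TTttGESA D.PvtTtVE.
```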

-UGLy

rewrites the sequences in a Pretty output file into individual sequence files in GCG format. The Pretty output file must have a line with two periods (..) separating the text in the heading from the sequences. -UGLy also causes Pretty to write a list file to go with the new sequence files.
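
To show why the line with two periods matters, here is a Python sketch that reads an interleaved Pretty-style layout back into one string per sequence. It is not how -UGLy is implemented; the function name, the sample text, and the choice to skip the Consensus row are all assumptions made for illustration. Heading text is ignored up to the line ending in two periods, and ruler lines and lines of slashes are skipped.

```python
def read_pretty_output(text, skip=("Consensus",)):
    """Collect each named row of an interleaved alignment into one string."""
    sequences = {}
    in_body = False
    for line in text.splitlines():
        if not in_body:
            # The heading ends at the line that finishes with "..".
            in_body = line.rstrip().endswith("..")
            continue
        tokens = line.split()
        if not tokens or tokens[0].isdigit() or tokens[0].startswith("//"):
            continue  # blank line, position ruler, or elision marker
        name, blocks = tokens[0], tokens[1:]
        if name in skip:
            continue
        sequences[name] = sequences.get(name, "") + "".join(blocks)
    return sequences

# Made-up sample in the same layout as the OUTPUT section.
SAMPLE = """\
PRETTY of: @demo.list  ..

           1                  20
   a.ugly  ACDEFGHIKL MNPQRSTVWY
   b.ugly  ACDEF.HIKL MNPQ.STVWY
Consensus  ACDEF-HIKL MNPQ-STVWY

           21  25
   a.ugly  ACDEF
   b.ugly  AC..F
Consensus  AC--F
"""

for name, seq in read_pretty_output(SAMPLE).items():
    print(name, seq)
```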

Printed: December 9, 1998 16:25 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982-2001 Genetics Computer Group, Inc. A subsidiary of Pharmacopeia, Inc. All rights reserved.

Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com