PILEUP(+)

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
SCREEN MONITOR
OUTPUT
DENDROGRAM
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
ALGORITHM
CONSIDERATIONS
SUGGESTIONS
GRAPHICS
<CTRL>C
COMMAND-LINE SUMMARY
ACKNOWLEDGEMENT
LOCAL DATA FILES
PARAMETER REFERENCE

FUNCTION

PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.

DESCRIPTION

PileUp creates a multiple sequence alignment using a simplification of the progressive alignment method of Feng and Doolittle (Journal of Molecular Evolution 25; 351-360 (1987)). The method used is similar to the method described by Higgins and Sharp (CABIOS 5; 151-153 (1989)).

The multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster can then be aligned to the next most related sequence or cluster of aligned sequences. Two clusters of sequences can be aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments that include increasingly dissimilar sequences and clusters, until all sequences have been included in the final pairwise alignment.

Before alignment, the sequences are first clustered by similarity to produce a dendrogram, or tree representation of clustering relationships. It is this dendrogram that directs the order of the subsequent pairwise alignments. PileUp can plot this dendrogram so that you can see the order of the pairwise alignments that created the final alignment.

As a general rule, PileUp can align up to 500 sequences, with any single sequence in the final alignment restricted to a maximum length of 7,000 characters (including gap characters inserted into the sequence by PileUp to create the alignment). However, if you include long sequences in the alignment, the number of sequences PileUp can align decreases. See the RESTRICTIONS topic, below, for a more complete discussion of sequence number and size limitations.

EXAMPLE

Here is a session using PileUp to create a multiple sequence alignment of an unaligned group of 70 kd heat shock and heat shock cognate protein sequences. (Note that to specify a list file, for example hsp70.list, as input, you must precede it with an at (@) sign.)


% pileup

 PileUp of what sequences ?  @hsp70.list

   1      JU0062   675 aa
   2      A25646   634 aa

  ///////////////////////////

  24      B36590   642 aa
  25      S29261   638 aa

 What is the gap creation penalty (* 8 *) ?

 What is the gap extension penalty (* 2 *) ?

 This program can display the clustering relationships graphically.
 Do you want to:

     A) Plot to a FIGURE file called "pileup.figure"
     B) Plot graphics on LaserWriter attached to /dev/tty10
     C) Suppress the plot

 Please choose one (* A *):

 The minimum density for a one-page plot is 20.0 sequences/100 platen units.
 What density do you want (* 20.0 *) ?

 What should I call the output file name (* hsp70.msf *) ?

 Determining pairwise similarity scores...

   1   x     2       3.66
   1   x     3       3.73

 ////////////////////////

  23   x    25       2.29
  24   x    25       2.22

 Aligning...

   1     ................................-....
   2     ................................-.
         ................................-....

 /////////////////////////////////////////////////////////////

  23     ...............................-....
  24     .................................-.
         .................................-....

  FIGURE instructions are now being written into pileup.figure

        Total sequences:         25
       Alignment length:        743
               CPU time:      27.15

            Output file: hsp70.msf

%

SCREEN MONITOR

PileUp names each sequence to be aligned as it is read in. It then displays the message Determining pairwise similarity scores... and shows a quality ratio for every pairwise alignment. This ratio is the alignment's quality divided by the length of the shorter sequence. If x is the number of sequences to be aligned, there are (x(x-1))/2 pairwise alignments whose ratio must be calculated.

Next PileUp displays the message Aligning... as it performs each of the pairwise alignments that together create the final multiple sequence alignment. There are x-1 alignments in this part of the program.
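The two phases grow very differently with the number of sequences. The following minimal Python sketch computes both counts and the quality ratio as described above. The function names are illustrative only and are not part of PileUp.

```python
def similarity_comparisons(x):
    """Pairwise comparisons scored before clustering: x(x-1)/2."""
    return x * (x - 1) // 2


def progressive_alignments(x):
    """Alignments performed during the 'Aligning...' phase: x-1."""
    return x - 1


def quality_ratio(quality, length_a, length_b):
    """Alignment quality divided by the length of the shorter sequence."""
    return quality / min(length_a, length_b)


# For the 25 sequences in the example session:
print(similarity_comparisons(25))   # 300
print(progressive_alignments(25))   # 24
```

Doubling the number of sequences roughly quadruples the first phase but only doubles the second.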

OUTPUT

Below is some of the output file containing the multiple sequence alignment. By default, similar sequences are positioned close to each other in the output file, but if you use -NOSORt, the aligned sequences are listed in the same order as they were presented to the program.


!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: @hsp70.list

 Symbol comparison table: GenRunData:blosum62.cmp  CompCheck: 6430

                   GapWeight: 8
             GapLengthWeight: 2

 hsp70.msf  MSF: 743  Type: P  October 6, 1998 18:23  Check: 7784 ..

 Name: S11448           Len:   743  Check: 3635  Weight:  1.00
 Name: S06443           Len:   743  Check: 5861  Weight:  1.00

 /////////////////////////////////////////////////////////////

 Name: S29261           Len:   743  Check: 7748  Weight:  1.00

//

        1                                                   50
S11448  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
S06443  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
A25398  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTYE GAIGIDLGTT YSCVGVWQNE
S06158  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTYE GAIGIDLGTT YSCVGVWQNE
S42164  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MS KAVGIDLGTT YSCVAHFAND
S20139  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MS KAVGIDLGTT YSCVAHFSND
B36590  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MS KAVGIDLGTT YSCVAHFAND
A25089  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~MAKSEG PAIGIDLGTT YSCVGLWQHD
S03250  ~~~~~~~~~~ ~~~~~~~~~~ ~~~MAGKGEG PAIGIDLGTT YSCVGVWQHD
A27077  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MSKG PAVGIDLGTT YSCVGVFQHG
S07197  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MSKG PAVGIDLGTT YSCVGVFQHG
A25646  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~MSGKG PAIGIDLGTT YSCVGVFQHG
S10859  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~MSARG PAIGIDLGTT YSCVGVFQHG
A29160  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MAKA AAVGIDLGTT YSCVGVFQHG
JH0095  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MAKN TAIGIDLGTT YSCVGVFQHG
A03310  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~MATKG VAVGIDLGTT YSCVGVFQHG
JT0285  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MSKH NAVGIDLGTT YSCVGVFMHG
S09036  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~MQAPRE LAVGIDLGTT YSCVGVFQQG
JU0062  ~~~~~~~~~~ ~~~~~~~~~~ ~~~MAQSVSG YSVGIDLGTT YSCVGVWQND
JU0164  ~~~~~~~~~~ ~~~~~MASAK GSKPNLPESN IAIGIDLGTT YSCVGVWRNE
A34041  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~MAANKG MAIGIDLGTT YSCVGVFQHG
S05776  ~~~~~~~~~~ ~~~~~~~~~~ ~~ADDVENYG TVIGIDLGTT YSCVAVMKNG
S20149  ~~~~~~~~~~ ~~~~~~~~~~ ~~~MAEGVFQ GAIGIDLGTT YSCVATYESS
A32493  MLAAKNILNR SSLSSSFRIA TRLQSTKVQG SVIGIDLGTT NSAVAIMEGK
S29261  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MG KIIGIDLGTT NSCVAIMDGT

The gaps at the ends of each sequence are written as tildes (~) which may represent differences in input sequence lengths rather than missing characters or significant differences in the alignment. Internal gaps in each sequence are written as periods (.). When you create an end-weighted alignment in PileUp by using -ENDWeight, gaps at the ends of each sequence are written as periods since those gaps may represent missing characters or significant differences in the alignment. See Appendix III for more information about the two different gap characters.
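If you want to inspect an alignment outside the Wisconsin Package, the interleaved layout shown above can be reassembled with a short script. The Python sketch below is not a GCG tool; it assumes only the layout visible in the example (Name: lines in the heading, a line containing just // as the divider, and sequence rows that start with the sequence name). The function names read_msf and gap_summary are hypothetical.

```python
import re

NAME_LINE = re.compile(r"^\s*Name:\s+(\S+)\s+Len:\s+(\d+)")


def read_msf(text):
    """Return {name: aligned sequence} from the text of an MSF file."""
    parts = re.split(r"^//[ \t]*$", text, maxsplit=1, flags=re.MULTILINE)
    if len(parts) != 2:
        raise ValueError("no '//' divider found")
    header, body = parts
    rows = {}
    for line in header.splitlines():
        match = NAME_LINE.match(line)
        if match:
            rows[match.group(1)] = []
    # Only lines whose first token is a known name are sequence rows;
    # this skips the numeric ruler above each block.
    for line in body.splitlines():
        tokens = line.split()
        if tokens and tokens[0] in rows:
            rows[tokens[0]].append("".join(tokens[1:]))
    return {name: "".join(chunks) for name, chunks in rows.items()}


def gap_summary(row):
    """Count end gaps (~) and internal gaps (.) in one aligned row."""
    return {"end": row.count("~"), "internal": row.count(".")}
```

Note that gap_summary classifies gaps purely by character, so for an alignment made with -ENDWeight the end gaps would be counted as internal.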

DENDROGRAM

PileUp can plot a dendrogram like the one below that shows the clustering relationships used to determine the order of the pairwise alignments that together create the final multiple sequence alignment. Distance along the vertical axis is proportional to the difference between sequences; distance along the horizontal axis has no significance at all. The interpretation of the dendrogram is discussed in the ALGORITHM topic below.

INPUT FILES

PileUp accepts multiple (two or more) nucleotide sequences or multiple (two or more) protein sequences as input. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. The function of PileUp depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, see Appendix VI for information on how to change or set the type of a sequence.

If the input sequences are named in a list file, you can specify the reverse complement strand of any particular nucleotide sequence in the list as input by using the strand:- sequence attribute. You can restrict the range of interest for any particular sequence with appropriate sequence attributes like Begin:43 and End:682. (See "Using List Files" in Chapter 2, Using Sequence Files and Databases in the User's Guide for more information about sequence attributes in list files.) For example:


This is part of a list file suitable for input to PILEUP.

                   October 6, 1998  ..

PIR:A32493
PIR:S05776        Begin:43 End:682
PIR:B36590

///////////////////////////////////////

You can limit the range of interest for all of the sequences in the alignment by including expressions like -BEGin=20 and -END=70 on the command line. The command-line range limiters take precedence over the range limiters for sequences in a list file when both are used. If no range limitation is specified, the entire length of each sequence is aligned.

You can force the program to align the forward strand of all nucleotide sequences by including -NOREVerse on the command line. Conversely, you can force the program to align the reverse complement strand for all nucleotide sequences by including -REVerse on the command line. The command-line strand specification takes precedence over the strand specifications for sequences in a list file when both are used. If no strands are specified, the forward strands of all nucleotide sequences are aligned.
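The precedence rules above can be summarized in a few lines of Python. This is only a sketch of the rules as stated, not PileUp's own code: the parser assumes the simple layout of the list file shown earlier (a heading ending in .., then one sequence specification per line followed by Key:Value attributes), and the names parse_list_file and effective_settings are hypothetical.

```python
def parse_list_file(text):
    """Return [(sequence_spec, attributes)] for entries below the '..' divider."""
    _, divider, body = text.partition("..")
    if not divider:              # no heading: treat the whole text as entries
        body = text
    entries = []
    for line in body.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        attributes = {}
        for token in tokens[1:]:
            key, _, value = token.partition(":")
            attributes[key.lower()] = value
        entries.append((tokens[0], attributes))
    return entries


def effective_settings(attributes, length, begin=None, end=None, reverse=None):
    """Combine list-file attributes with command-line values.

    begin, end, and reverse stand for -BEGin, -END, and -REVerse/-NOREVerse;
    None means the parameter was not given, in which case the list-file
    attribute (or the default: full length, forward strand) applies.
    """
    first = begin if begin is not None else int(attributes.get("begin", 1))
    last = end if end is not None else int(attributes.get("end", length))
    if reverse is None:
        reverse = attributes.get("strand") == "-"
    return first, last, reverse
```

For the entry PIR:S05776 Begin:43 End:682, this gives the range 43 to 682 unless -BEGin or -END is supplied, in which case the command-line value replaces the list-file value.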

RELATED PROGRAMS

LineUp is a screen editor for editing multiple sequence alignments. You can edit up to 30 sequences simultaneously. New sequences can be typed in by hand or added from existing sequence files. A consensus sequence identifies places where the sequences are in conflict.

Pretty displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.

PlotSimilarity plots the running average of the similarity among the sequences in a multiple sequence alignment.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for new sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between a sequence and a group of aligned sequences represented as a profile.

The Wisconsin Package includes several programs for evolutionary analysis of multiple sequence alignments. Distances creates a matrix of pairwise distances between the sequences in a multiple sequence alignment. Diverge measures the number of synonymous and nonsynonymous substitutions per site of two or more aligned protein coding regions and can output matrices of these values. GrowTree reconstructs a tree from a distance matrix or a matrix of synonymous or nonsynonymous substitutions. PAUPSearch reconstructs phylogenetic trees from a multiple sequence alignment using parsimony, distance, or maximum likelihood criteria; PAUPDisplay can manipulate and display the trees output by PAUPSearch and can also plot the trees output by GrowTree.

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.

RESTRICTIONS

As shipped, PileUp restricts each sequence in the final alignment to a maximum length of 7,000 characters. This maximum length includes the input sequence length plus the total length of all gap characters inserted into the sequence to create the final alignment. By default, each input sequence is restricted to a maximum length of 5,000. Also by default, PileUp can add a maximum of 2,000 gap characters for each sequence in the final alignment.

If you wish to align longer sequences, then you can specify a maximum sequence length of up to 7,000 with -MAXSeg, for example -MAXSeg=6000. If you increase the maximum sequence length in this way, then the maximum amount of allowed gapping is automatically reduced so that the final aligned sequence length cannot exceed 7,000 for any sequence.

If you wish to allow for more gapping in the final alignment, then you can specify a maximum number of gap characters for each sequence with -MAXGap, for example -MAXGap=3000. If you increase the maximum amount of gapping permitted for each sequence in this way, the maximum sequence length is automatically decreased so that the final aligned sequence length cannot exceed 7,000 for any sequence.

As shipped, the total length of all of the sequences read into PileUp (including the gap allowance for each sequence) cannot be greater than 2,000,000. By reducing the gap allowance for each sequence using -MAXGap, you can increase the number of sequences that can be read into the program up to the maximum of 500 sequences.
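The interplay of these limits is easier to see as arithmetic. The Python sketch below encodes the shipped limits quoted above. It makes two assumptions that this manual does not spell out: each sequence is charged its own length plus the full gap allowance toward the 2,000,000 total, and supplying both -MAXSeg and -MAXGap with a sum above 7,000 is treated as an error. The function names are hypothetical.

```python
MAX_ALIGNED_LENGTH = 7_000      # per sequence, gaps included
MAX_TOTAL_LENGTH = 2_000_000    # all sequences, gap allowance included
MAX_SEQUENCES = 500


def adjusted_limits(max_seg=None, max_gap=None):
    """Resolve -MAXSeg and -MAXGap so that their sum never exceeds 7,000."""
    if max_seg is not None and max_gap is not None:
        if max_seg + max_gap > MAX_ALIGNED_LENGTH:
            raise ValueError("MAXSeg + MAXGap exceeds 7,000")
        return max_seg, max_gap
    if max_seg is not None:      # raising MAXSeg reduces the gap allowance
        return max_seg, min(2_000, MAX_ALIGNED_LENGTH - max_seg)
    if max_gap is not None:      # raising MAXGap reduces the sequence length
        return min(5_000, MAX_ALIGNED_LENGTH - max_gap), max_gap
    return 5_000, 2_000


def fits(lengths, max_seg=None, max_gap=None):
    """True if the input sequence lengths stay within the shipped limits."""
    seg, gap = adjusted_limits(max_seg, max_gap)
    total = sum(length + gap for length in lengths)
    return (len(lengths) <= MAX_SEQUENCES
            and all(length <= seg for length in lengths)
            and total <= MAX_TOTAL_LENGTH)
```

Under these assumptions, 500 sequences of length 3,000 do not fit with the default gap allowance (500 x 5,000 = 2,500,000) but do fit with -MAXGap=1000 (500 x 4,000 = 2,000,000).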

The surface of comparison (see the CONSIDERATIONS topic for an explanation) is limited to 2,250,000.

All of these limits are adjustable (see the CONSIDERATIONS topic below).

ALGORITHM

A rigorously optimal alignment of even a small number of short sequences would be intractable, both in terms of memory and time. Therefore, PileUp does a series of progressive, pairwise alignments between sequences and clusters of sequences to generate the final alignment. A cluster consists of two or more already-aligned sequences.

PileUp begins by doing pairwise alignments that score the similarity between every possible pair of sequences. These similarity scores are used to create a clustering order that can be represented as a dendrogram. The clustering strategy represented by the dendrogram is called UPGMA, which stands for unweighted pair-group method using arithmetic averages (Sneath, P.H.A. and Sokal, R.R. (1973) in Numerical Taxonomy (pp. 230-234), W.H. Freeman and Company, San Francisco, California, USA).

The dendrogram shows the order of the pairwise alignments of sequences and clusters of sequences that together generate the final alignment. For example:

PileUp uses this clustering order and first aligns the two most-related sequences to each other in order to produce the first cluster. It then aligns the next most related sequence to this cluster or the next two most-related sequences to each other in order to produce another cluster. A series of such pairwise alignments that includes increasingly dissimilar sequences and clusters of sequences at each iteration produces the final alignment.

In the above example, Seq1 and Seq2 are aligned first. Next, Seq3 and Seq4 are aligned. The cluster of Seq1-aligned-to-Seq2 is then aligned to the cluster of Seq3-aligned-to-Seq4. Finally, Seq5 is aligned to the cluster that now contains Seq1 through Seq4 to generate the final alignment of Seq1 through Seq5.
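The clustering order described above can be reproduced with a small UPGMA sketch in Python. This is an illustration of the general method, not PileUp's implementation: it works directly on similarity scores (always joining the pair of clusters with the highest average similarity), and the scores below are made-up values chosen so that the merge order matches the five-sequence example.

```python
from itertools import combinations


def upgma_order(names, similarity):
    """Return the merges, in order, as (cluster_a, cluster_b) tuples of names.

    similarity maps frozenset({a, b}) to a score for every pair of names;
    a higher score means the two sequences are more alike.
    """
    def average(c1, c2):
        # Unweighted arithmetic average over all cross-cluster pairs.
        total = sum(similarity[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: average(*p))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges


def pair(a, b):
    return frozenset((a, b))


names = ["Seq1", "Seq2", "Seq3", "Seq4", "Seq5"]
similarity = {   # hypothetical scores, for illustration only
    pair("Seq1", "Seq2"): 3.8, pair("Seq3", "Seq4"): 3.5,
    pair("Seq1", "Seq3"): 2.9, pair("Seq1", "Seq4"): 2.8,
    pair("Seq2", "Seq3"): 2.7, pair("Seq2", "Seq4"): 2.8,
    pair("Seq1", "Seq5"): 1.5, pair("Seq2", "Seq5"): 1.6,
    pair("Seq3", "Seq5"): 1.4, pair("Seq4", "Seq5"): 1.5,
}

for left, right in upgma_order(names, similarity):
    print("+".join(left), "aligned to", "+".join(right))
```

With these scores the merges are Seq1 with Seq2, then Seq3 with Seq4, then the two clusters with each other, and finally Seq5 with the cluster of four.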

Each pairwise alignment in PileUp uses the method of Needleman and Wunsch (Journal of Molecular Biology 48; 443-453 (1970)), extended for use with clusters of aligned sequences rather than only individual sequences. For a pairwise alignment of individual sequences, the comparison score between any two sequence symbols is found in a scoring matrix (see the LOCAL DATA FILES topic below). For a pairwise alignment of clusters of sequences, the comparison score between any two positions in those clusters is simply the arithmetic average of the scores for all possible symbol comparisons at those positions. When gaps are inserted into a cluster to produce an alignment, they are inserted at the same position in all of the sequences of the cluster.

CONSIDERATIONS

Because a rigorously optimal alignment of even a small number of short sequences would be intractable, PileUp uses an approach that may not produce the optimal multiple sequence alignment. (See the ALGORITHM topic above for a description of this approach.)

Clustering

The approach used by PileUp is sensitive to the order in which sequences are aligned. A clustering algorithm determines this order from the pairwise similarities calculated before the final alignments are done. The goal of the clustering is to see that very similar sequences are aligned to each other before they are aligned to more distantly related sequences. There is, at present, no way for you to modify the order of these alignments.

Although PileUp calculates the similarity between every pair of sequences, the program does not use this information to weight the sequences. As a result, if several of the sequences are very similar to one another, the final alignment may be constrained to minimize the disruption of these sequences.

The dendrogram is not a phylogenetic reconstruction, although the vertical branch lengths are proportional to the distance between the sequences. Its purpose is to represent the clustering order used to create the final alignment. This order is the only information from the dendrogram used by PileUp. See the RELATED PROGRAMS topic for a description of programs in the Wisconsin Package that you can use to create phylogenetic reconstructions from multiple sequence alignments.

Global Alignment

If you know the difference between Gap and BestFit, consider PileUp an extension of the Gap program for more than two sequences, rather than an extension of the BestFit program. PileUp, like Gap, tries to find a global optimal alignment, while BestFit finds a local optimal alignment.

Because PileUp aligns sequences along their entire lengths, it is not ideally suited to finding the best local region of similarity (such as a shared motif) among all of the sequences. However, PileUp has been used successfully for this purpose.

By default, PileUp does not penalize gaps occurring at the ends of sequences. Therefore, related sequences that differ in the extent of their sequencing can be reasonably aligned by PileUp. You can override this default with -ENDWeight, in which case length differences among the sequences become significant.

Piling Up Unrelated Sequences

PileUp always aligns all of the sequences you specify, even if they are not related. The alignment can be degraded if some of the sequences are not similar to one another.

Arbitrary Gap Placement

In any pairwise alignment, the position of the inserted gaps may be arbitrary; equally optimal alignments can be generated by inserting the gaps differently. PileUp can exaggerate these arbitrary differences if you use either -LOWroad or -HIGhroad. This selection usually affects the final alignment. For the most part, however, the difference between the high road and low road alignments should not be very significant, although you may want to check.

Here is an example showing the difference between high and low road for the alignment of three short sequences. The first pairwise alignment creates an aligned cluster of the two most closely related sequences; the second alignment aligns this cluster to the third sequence creating the final multiple sequence alignment. Although the qualities after the first round alignments are the same, the quality of the final low-road alignment is higher than the high-road one.

             For:       Match = 10       Gap weight = 10
                     Mismatch =  0    Length weight =  0

                HighRoad                          LowRoad

                GACCAT                            GACCAT
Alignment  1    GAG.AT    Quality = 30            GA.GAT    Quality = 30

                GACC.AT                           GAC.CAT
Alignment  2    GAG..AT   Quality = 25            GA..GAT   Quality = 30
                AACGGAT                           AACGGAT

High road alignments shift all of the arbitrary gaps in the second sequence or cluster of aligned sequences to the right and all of the arbitrary gaps in the first sequence or cluster of aligned sequences to the left. Low road alignments do the opposite. When neither high road nor low road is selected, the program tries not to insert a gap whenever that is possible and uses the high road when that is not possible.
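The qualities quoted in the example can be checked by hand or with the short Python sketch below. It scores an existing alignment between two clusters using the averaging rule from the ALGORITHM topic. Two points are assumptions rather than statements from this manual: a gap already present inside a cluster is scored as a mismatch against any symbol, and a column in which one whole cluster is gapped is treated as part of a newly inserted gap. With those assumptions the sketch reproduces all four qualities in the example.

```python
GAP = "."


def column_score(col_a, col_b, match=10, mismatch=0):
    """Average score over every symbol pair between two cluster columns."""
    total = 0
    for a in col_a:
        for b in col_b:
            total += match if (a == b and a != GAP) else mismatch
    return total / (len(col_a) * len(col_b))


def quality(cluster_a, cluster_b, gap_weight=10, length_weight=0):
    """Quality of an alignment between two equally long, gapped clusters."""
    score = 0.0
    in_gap = False
    for i in range(len(cluster_a[0])):
        col_a = [seq[i] for seq in cluster_a]
        col_b = [seq[i] for seq in cluster_b]
        if all(c == GAP for c in col_a) or all(c == GAP for c in col_b):
            if not in_gap:               # a new gap is being opened
                score -= gap_weight
                in_gap = True
            score -= length_weight       # charged for each gapped symbol
        else:
            score += column_score(col_a, col_b)
            in_gap = False
    return score


# Alignment 1 (two single sequences)
print(quality(["GACCAT"], ["GAG.AT"]))               # high road: 30.0
print(quality(["GACCAT"], ["GA.GAT"]))               # low road:  30.0
# Alignment 2 (cluster of two against the third sequence)
print(quality(["GACC.AT", "GAG..AT"], ["AACGGAT"]))  # high road: 25.0
print(quality(["GAC.CAT", "GA..GAT"], ["AACGGAT"]))  # low road:  30.0
```

The difference in the second round comes from the third column: on the low road, C and G from the cluster each get a half-match against the third sequence in separate columns, while on the high road one of those half-matches is lost.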

Scoring Matrices

The default scoring matrices are not necessarily appropriate for all alignments. (See Chapter 4, Using Data Files in the User's Guide for more information.) We provide several alternative scoring matrices suitable for multiple sequence alignments. These matrices are listed in Appendix VII. PileUp chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with -MATRix, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) You can use -GAPweight and -LENgthweight or respond to the program prompts to specify alternative gap penalties if you don't want to accept the default values.

Surface of Comparison

PileUp performs a series of pairwise alignments between clusters of sequences to create the final multiple sequence alignment. Each pairwise alignment requires enough computer memory for a surface of comparison proportional to the product of the lengths of the two clusters being aligned. Since all sequences in an aligned cluster have the same length, the length of a cluster is simply the length of any sequence within that cluster.

PileUp allows you to align sequences, the product of whose lengths is greater than the surface of comparison. In this case, the program limits the total length of gaps that can be inserted into each sequence and calculates the best alignment within this incomplete, or limited, surface of comparison. The program then performs a calculation to determine whether the alignment could possibly be improved if there were no restriction on the total length of gaps in each sequence. If the program cannot rule out this possibility, it displays the message *** Alignment is not guaranteed to be optimal *** . Because the criteria used in the calculation for guaranteeing an optimal alignment are very stringent, a limited alignment often may be optimal even if this message is displayed. In any event, the program continues to completion.
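As a rough guide to when the limited surface comes into play, the following Python sketch compares the product of two cluster lengths with the shipped limit given in the RESTRICTIONS topic. It assumes the surface is exactly the product of the two lengths (the manual says only that it is proportional to it), and the function name is hypothetical.

```python
SURFACE_LIMIT = 2_250_000   # shipped limit from the RESTRICTIONS topic


def uses_limited_surface(length_a, length_b, limit=SURFACE_LIMIT):
    """True when the product of the two lengths exceeds the surface limit."""
    return length_a * length_b > limit


print(uses_limited_surface(743, 743))     # False (552,049)
print(uses_limited_surface(1500, 1500))   # False (exactly 2,250,000)
print(uses_limited_surface(2000, 1500))   # True  (3,000,000)
```

Under this assumption, two clusters of up to 1,500 positions each can always be aligned on a complete surface.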

SUGGESTIONS

Figure Files

By default, PileUp writes instructions for plotting the dendrogram into a figure file named pileup.figure. Such files can be plotted on any supported graphics device using the Figure program.

Batch Queue

PileUp can take more than a few minutes to run, depending upon the length and number of sequences being aligned. Most alignments should probably be run in the batch queue. You can specify that this program run at a later time in the batch queue by using -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Chapter 3, Using Programs in the User's Guide. Very large alignments may exceed the CPU limit set by some systems.

When PileUp is run in batch using -BATch, instructions for plotting the dendrogram are written to a figure file named pileup.figure unless the plot has been directed to a specific file or graphics device from the command line, or has been suppressed with the -NOPLOt command-line parameter.

Editing Multiple Sequence Alignments

PileUp writes the alignment into a multiple sequence format (MSF) file that interleaves the sequences to show their alignment. You can edit the alignments created by PileUp with LineUp using the command % lineup -MSF, but LineUp cannot be used to edit more than 30 sequences simultaneously.

You can also edit the alignment created by PileUp with a regular text editor. Any PileUp alignment that has been modified with a text editor can be put back into GCG's multiple sequence format (MSF) using the command % reformat -MSF.

The Pretty program can calculate a consensus for the multiple sequence alignment and can display the alignment several different ways.

Using the Output from PileUp

PileUp writes the alignment into a multiple sequence format (MSF) file that interleaves the sequences to show their alignment. Any or all of the sequences in this file can be used by any other GCG sequence analysis program. For instance, you could generate a profile from the sequences in an MSF file with a command like % profilemake hsp70.msf{*} and then use that profile to search the database for sequences similar to the sequences in the alignment. (See Chapter 2, Using Sequence Files and Databases in the User's Guide for help specifying sequences in MSF files.)

GRAPHICS

The Wisconsin Package must be configured for graphics before you run any program with graphics output! If the % setplot command is available in your installation, this is the easiest way to establish your graphics configuration, but you can also use commands like % postscript that correspond to the graphics languages the Wisconsin Package supports. See Chapter 5, Using Graphics in the User's Guide for more information about configuring your process for graphics.

<CTRL>C

If you need to stop this program, use <Ctrl>C to reset your terminal and session as gracefully as possible. Searches and comparisons write out the results from the part of the search that is complete when you use <Ctrl>C. The graphics device should stop plotting the current page and start plotting the next page. If the current page is the last page, plotters should put the pen away and graphic terminals should return to interactive mode.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % pileup [-INfile=]@hsp70.list -Default

Prompted Parameters:

-GAPweight=8            sets the gap creation penalty
-LENgthweight=2         sets the gap extension penalty
-DENsity=20.0           sets number of sequences per 100 pu in the dendrogram
[-OUTfile1=]hsp70.msf   names output file for multiple sequence alignment

Local Data Files:

-MATRix=blosum62.cmp   assigns the scoring matrix for proteins
-MATRix=pileupdna.cmp  assigns the scoring matrix for nucleic acids

Optional Parameters:

-BEGin=1     sets beginning position for every sequence to be aligned
-END=100     sets ending position for every sequence to be aligned
-REVerse     uses the reverse strand for each input sequence
-ENDWeight   penalizes end gaps like other gaps
-INSitu      realign a portion of an existing alignment
-HIGhroad    selects "top" alignment path for equally optimal gaps
-LOWroad     selects "bottom" alignment path for equally optimal gaps
-MAXSeg=5000 sets maximum segment length for every input sequence
-MAXGap=2000 sets maximum combined length of all gaps added to a sequence
-NOSORt      presents output sequences in the same order as input
-LINesize=50       sets the number of sequence symbols per line
-BLOcksize=10      sets the number of sequence symbols per block
-DEGap       removes gap characters ('.' and '~') from the input sequences
-NOPLOt      suppresses plot of clustering relationships
-NOMONitor   suppresses screen trace of each alignment
-NOSUMmary   suppresses screen summary at the end of the program
-BATch       submits program to the batch queue

All GCG graphics programs accept these and other switches. See the Using
Graphics chapter of the USERS GUIDE for descriptions.

-FIGure[=FileName]  stores plot in a file for later input to FIGURE
-FONT=3             draws all text on the plot using font 3
-COLor=1            draws entire plot with pen in stall 1
-SCAle=1.2          enlarges the plot by 20 percent (zoom in)
-XPAN=10.0          moves plot to the right 10 platen units (pan right)
-YPAN=10.0          moves plot up 10 platen units (pan up)
-PORtrait           rotates plot 90 degrees

ACKNOWLEDGEMENT

PileUp was written by Irv Edelman.

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Chapter 4, Using Data Files in the User's Guide.
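The search order for a file named with -MATRix can be pictured with the Python sketch below. It is an illustration only: the mapping from the logical names MyData, GenMoreData, and GenRunData to real directories is supplied by the caller, "local" stands for your current working directory, and the function name find_matrix is hypothetical.

```python
from pathlib import Path

SEARCH_ORDER = ("local", "MyData", "GenMoreData", "GenRunData")


def find_matrix(filename, directories):
    """Return the first existing path for filename, searched in SEARCH_ORDER.

    directories maps each name in SEARCH_ORDER to a directory path.
    A filename that already includes a directory is returned without a search.
    """
    if Path(filename).parent != Path("."):
        return Path(filename)
    for name in SEARCH_ORDER:
        directory = directories.get(name)
        if directory is None:
            continue
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None
```

A copy of a matrix placed in your working directory or in MyData is therefore found before the public copy in GenRunData.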

PileUp reads a scoring matrix from your local directory or the public database with the values for every possible symbol comparison. The file pileupdna.cmp has a 10 at every place where the sets of bases implied by the alphabetic IUB ambiguity codes (see Appendix III) overlap. All of the other locations have zeros. The file blosum62.cmp is based on substitutions between amino acid pairs in ungapped blocks of aligned protein segments as measured by Henikoff and Henikoff. The scores in this matrix for pairwise amino acid comparisons range from -4 to +11. You can use the Fetch program to copy these files and then modify them to suit your own needs. (See the CONSIDERATIONS topic for more information about scoring matrices.)
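The rule behind pileupdna.cmp can be expressed in a few lines of Python. The sketch below derives the score from the standard IUB ambiguity codes rather than reading the actual file, so it illustrates the rule only; any symbols in the real matrix beyond those listed here are not covered, and the function name dna_score is hypothetical.

```python
# Bases implied by each IUB nucleotide code.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def dna_score(a, b, match=10, mismatch=0):
    """Score 10 when the base sets of two IUB codes overlap, otherwise 0."""
    overlap = set(IUB[a.upper()]) & set(IUB[b.upper()])
    return match if overlap else mismatch


print(dna_score("R", "A"))   # 10: R (A or G) overlaps A
print(dna_score("R", "Y"))   # 0:  purines and pyrimidines share no base
print(dna_score("N", "T"))   # 10: N overlaps every base
```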

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-GAPweight=8

sets the gap creation penalty that is subtracted from the alignment score whenever a gap is created.

-LENgthweight=2

sets the gap extension penalty that is subtracted from the alignment score for each gapped symbol.

-DENsity=20.0

sets the number of sequence nodes displayed per 100 platen units in the dendrogram.
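
The arithmetic behind -DENsity can be sketched in Python. The sketch assumes that each input sequence contributes one sequence node to the dendrogram; the function name dendrogram_span is made up for this example.

```python
def dendrogram_span(n_sequences, density=20.0):
    """Platen units needed to display n_sequences nodes at a given density.

    density is the number of sequence nodes per 100 platen units.
    """
    return 100.0 * n_sequences / density
```

Under that assumption, 50 sequences at the default density of 20.0 span 250 platen units, and doubling the density halves the span.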

-MATRix=mymatrix.cmp

allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.

For more information see the Local Scoring Matrices section.

-BEGin=1

sets the beginning position for all input sequences. When the beginning position is set from the command line, PileUp ignores beginning positions specified for individual sequences in a list file.

-END=100

sets the ending position for all input sequences. When the ending position is set from the command line, PileUp ignores ending positions specified for sequences in a list file.
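
The way -BEGin and -END override positions from a list file can be sketched in Python. This is an illustration under the assumption that positions are 1-based and inclusive; the function names effective_range and extract are hypothetical.

```python
def effective_range(length, list_range=None, cmd_begin=None, cmd_end=None):
    """Return the (begin, end) positions used for one sequence."""
    # Start from the range given in the list file, or the whole sequence.
    begin, end = list_range if list_range else (1, length)
    # A value given on the command line replaces the list-file value.
    if cmd_begin is not None:
        begin = cmd_begin
    if cmd_end is not None:
        end = cmd_end
    return begin, end

def extract(sequence, begin, end):
    """Cut out positions begin..end (1-based, inclusive)."""
    return sequence[begin - 1:end]
```

For instance, a sequence listed with the range 10 to 200 is read from 10 to 100 when -END=100 is on the command line, because only the ending position is overridden.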

-REVerse

sets the program to use the reverse strand for each input sequence. When -REVerse or -NOREVerse is on the command line, PileUp ignores any strand designation for individual sequences in a list file.

-ENDWeight

causes gaps at the ends of sequences to be penalized in the same way as all other gaps. (The default is not to penalize gaps at the ends of sequences.)

-INSitu

allows you to realign a portion of an existing alignment without changing the remainder of the alignment. You specify the portion to realign with -BEGin and -END. The program removes all gaps (. and ~) from this portion of the alignment, then realigns only this portion, and finally replaces the specified part of the original alignment with the newly realigned part.
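
The three steps of -INSitu (remove the gaps from the chosen portion, realign it, and splice it back) can be sketched in Python. This is a simplified illustration, not PileUp's implementation: the realign argument is a hypothetical stand-in for the alignment step, and pad_right is a trivial placeholder that only pads the segments to equal length.

```python
GAP_CHARS = ".~"

def degap(segment):
    """Remove gap characters (. and ~) from a segment."""
    return "".join(ch for ch in segment if ch not in GAP_CHARS)

def realign_in_situ(rows, begin, end, realign):
    """Realign columns begin..end (1-based, inclusive) of an alignment.

    rows is a list of equal-length aligned strings. realign takes a list of
    ungapped strings and returns equal-length aligned strings (assumed).
    """
    left = [row[:begin - 1] for row in rows]              # untouched prefix
    middle = [degap(row[begin - 1:end]) for row in rows]  # portion to realign
    right = [row[end:] for row in rows]                   # untouched suffix
    new_middle = realign(middle)
    return [l + m + r for l, m, r in zip(left, new_middle, right)]

def pad_right(segments):
    """Placeholder aligner: pad shorter segments with '.' on the right."""
    width = max(len(s) for s in segments)
    return [s.ljust(width, ".") for s in segments]
```

Calling realign_in_situ(["AC.GT.AA", "ACGG.TAA"], 3, 6, pad_right) leaves the first two and last two columns unchanged and replaces only the middle portion, which may end up with a different width than before.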

-HIGhroad and -LOWroad

exaggerate the arbitrary insertion of gaps. (See the CONSIDERATIONS topic for a description of high and low road alignments.)

-MAXSeg=5000

sets the maximum length for each individual input sequence. Setting a higher limit (up to a maximum of 7,000) allows you to align longer sequences while setting a lower limit allows you to add more and longer gaps to each sequence. (See the RESTRICTIONS topic for a more detailed description.)

-MAXGap=2000

sets the maximum combined length of all gaps that can be added to each sequence. Setting a higher limit allows you to add more and longer gaps to each sequence while setting a lower limit allows you to align a greater number of sequences. (See the RESTRICTIONS topic for a more detailed description.)

-NOSORt

writes the aligned sequences in the same order as they were presented to the program, rather than presenting closely aligned sequences close together in the output.

-LINesize=50

specifies the number of sequence symbols to display on each line of the output MSF (multiple sequence format) file.

-BLOcksize=10

specifies the number of sequence symbols to place in each block of the output MSF (multiple sequence format) file.
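
The effect of -LINesize and -BLOcksize on the layout of sequence symbols can be sketched in Python. The sketch covers only the splitting into lines and blocks; it assumes blocks are separated by a single space and omits the sequence names and other labeling that a real MSF file carries. The function name msf_lines is hypothetical.

```python
def msf_lines(symbols, linesize=50, blocksize=10):
    """Split aligned symbols into lines of linesize, in blocks of blocksize."""
    lines = []
    for start in range(0, len(symbols), linesize):
        line = symbols[start:start + linesize]
        # Break the line into blocks and separate them with one space.
        blocks = [line[i:i + blocksize] for i in range(0, len(line), blocksize)]
        lines.append(" ".join(blocks))
    return lines
```

With the defaults, 120 symbols produce three lines: two with five blocks of ten symbols each, and a last line with the remaining two blocks.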

-DEGap

removes gap characters (. and ~) from the input sequences before aligning.

-NOPLOt

suppresses the plot of clustering relationships used to create the multiple sequence alignment.

-MONitor=1,1

shows the progress of PileUp on your screen. Use this parameter to see this same monitor in the log file for a batch process. If the monitor is slowing down the program because your terminal is connected to a slow modem, suppress it by using -NOMONitor.

The screen monitor is updated every time the program determines a pairwise similarity between two sequences (in the first part of the program) and every time the program aligns two clusters of sequences (in the second part of the program). You can append two optional values to -MONitor to set these two monitoring intervals to other numbers. For example, -MONitor=20,10 writes a line to the screen after every 20th pairwise comparison and every 10th alignment.
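
To estimate how much monitor output a run produces, the Python sketch below counts the monitor lines for a given number of sequences. It assumes that every pair of sequences is compared exactly once in the first part and that aligning N sequences takes N - 1 cluster alignments in the second part; neither count is stated explicitly in this manual. The function name monitor_line_counts is made up for this example.

```python
def monitor_line_counts(n_sequences, compare_interval=1, align_interval=1):
    """Return (comparison lines, alignment lines) written by the monitor."""
    # Assumed: all-against-all pairwise comparisons, N * (N - 1) / 2.
    comparisons = n_sequences * (n_sequences - 1) // 2
    # Assumed: each alignment step merges two clusters, so N - 1 steps.
    alignments = max(n_sequences - 1, 0)
    return comparisons // compare_interval, alignments // align_interval
```

Under these assumptions, 10 sequences give 45 comparison lines and 9 alignment lines with -MONitor=1,1, but only 2 and 0 lines with -MONitor=20,10.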

-SUMmary

writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary is typically displayed at the end of an interactive program run. You can suppress the summary for an interactive run with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

-BATch

submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

The parameters below apply to all Wisconsin Package graphics programs. These and many others are described in detail in Chapter 5, Using Graphics of the User's Guide.

-FIGure=programname.figure

writes the plot as a text file of plotting instructions suitable for input to the Figure program instead of sending it to the device specified in your graphics configuration.

-FONT=3

draws all text characters on the plot using Font 3 (see Appendix I).

-COLor=1

draws the entire plot with the pen in stall 1.

The parameters below let you expand or reduce the plot (zoom), move it in either direction (pan), or rotate it 90 degrees (rotate).

-SCAle=1.2

expands the plot by 20 percent by resetting the scaling factor (normally 1.0) to 1.2 (zoom in). You can expand the axes independently with -XSCAle and -YSCAle. Numbers less than 1.0 contract the plot (zoom out).

-XPAN=30.0

moves the plot to the right by 30 platen units (pan right).

-YPAN=30.0

moves the plot up by 30 platen units (pan up).
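
The combined effect of the scaling and panning parameters on a point of the plot can be sketched in Python. This is a guess at the arithmetic rather than the documented behavior: it assumes that scaling is applied about the origin before panning and that -XSCAle and -YSCAle multiply with -SCAle. Chapter 5 of the User's Guide remains the authority. The function name transform is hypothetical.

```python
def transform(x, y, scale=1.0, xscale=1.0, yscale=1.0, xpan=0.0, ypan=0.0):
    """Map a plot coordinate (in platen units) to its new position."""
    # Assumed order: scale first, then pan.
    new_x = x * scale * xscale + xpan
    new_y = y * scale * yscale + ypan
    return new_x, new_y
```

Under these assumptions, a point at (100, 50) moves to about (150, 60) with -SCAle=1.2 and -XPAN=30.0.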

-PORtrait

rotates the plot 90 degrees. Usually, plots are displayed with the horizontal axis longer than the vertical (landscape). Note that plots are reduced or enlarged, depending on the platen size, to fill the page.

Printed: December 9, 1998 16:22 (1162)

Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982-2001 Genetics Computer Group, Inc. A subsidiary of Pharmacopeia, Inc. All rights reserved.

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com