PAUPSearch provides a GCG interface to the tree-searching options in PAUP (Phylogenetic Analysis Using Parsimony). Starting with a set of aligned sequences, you can search for phylogenetic trees that are optimal according to parsimony, distance, or maximum likelihood criteria; reconstruct a neighbor-joining tree; or perform a bootstrap analysis. The program PAUPDisplay can produce a graphical version of a PAUPSearch trees file.
PAUP is the copyrighted property of the Smithsonian Institution. Use the program Fetch to obtain a copy of paup-license.txt to read about rights and limitations for using PAUP.
The version of PAUP that currently ships with the Wisconsin Package is a developmental version. Be sure to check your results carefully.
PAUPSearch is a GCG front-end to Dr. David Swofford's PAUP (Phylogenetic Analysis Using Parsimony) program. It allows you to access most of the tree-searching functions of PAUP from the command line using GCG conventions, or from SeqLab. It takes any group of aligned sequences in GCG format directly as input (without having to convert them first into PAUP's native NEXUS format) and makes it possible to run the program in the batch queue.
PAUPSearch processes your input, checks the aligned sequences used for input, and writes a temporary script file in NEXUS format. (The NEXUS format was designed as a standard file format for exchanging information between programs used in phylogeny and classification. It is described in Maddison et al., Systematic Biology 46: 590-621 (1997).) The script contains the alignment data and commands in the PAUP command language. PAUPSearch then calls PAUP itself, giving it the name of the script file as its input. After PAUP has completed the analysis, control is returned to PAUPSearch. The script is deleted unless you use -SCRIPT=paup.paupscript.
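To make the idea of a NEXUS data block more concrete, here is a minimal Python sketch that writes one from a set of aligned sequences. It is only an illustration and not the code PAUPSearch uses; the function name to_nexus and its arguments are hypothetical, and the block is written non-interleaved for brevity.

```python
def to_nexus(seqs, datatype="protein", gap="."):
    """Return a minimal NEXUS data block for a dict of name -> aligned sequence.

    Illustrative only: to_nexus is a hypothetical helper, not part of GCG or PAUP.
    """
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    nchar = lengths.pop()
    width = max(len(name) for name in seqs)
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(seqs)} nchar={nchar};",
        f"  format datatype={datatype} gap={gap};",
        "  matrix",
    ]
    # One row per taxon: padded name followed by the aligned sequence.
    for name, s in seqs.items():
        lines.append(f"    {name.ljust(width)}  {s}")
    lines += ["  ;", "end;"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_nexus({"Gtr1_Human": "..MEPSSKKL", "Gtr3_Human": "....MGTQKV"}))
```

The length check mirrors the requirement that the input sequences already be in alignment.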
PAUPSearch analyzes sequence alignments. It searches for optimal trees using one of three optimality criteria: maximum parsimony, minimum evolution distance, or maximum likelihood (nucleotide sequences only). The PAUP functions supported by PAUPSearch include searching for optimal trees, neighbor-joining reconstruction of a tree, bootstrap analysis (a method of assigning confidence levels to groupings in the tree), and length analysis (showing the distribution of scores of a random sample of the possible trees that could be derived from the aligned sequences). You can use PAUPDisplay to graphically display the trees obtained from PAUPSearch or to analyze these trees in a number of ways.
This document provides only an overview of the types of analyses that PAUP can do. For detailed information about maximum parsimony, minimum evolution, maximum likelihood, tree reconstruction, and PAUP itself, you can purchase additional copies of the PAUP User's Manual from the publisher, Sinauer Associates, Inc., 23 Plumtree Road, Sunderland MA 01375-0407 USA, phone 413-549-4300, FAX 413-549-1118. Information about the availability of the manual can be obtained on their web site (http://www.sinauer.com in the Systematics section).
Here is a session with PAUPSearch that finds the most parsimonious tree(s) from an alignment of human glucose transporter proteins.
% paupsearch

 What aligned sequences to analyze ?  hum_gtr.msf{*}

 Reading sequences:

      1   hum_gtr.msf{Gtr1_human}     548 aa
      2   hum_gtr.msf{Gtr3_human}     548 aa
      3   hum_gtr.msf{Gtr4_human}     548 aa
      4   hum_gtr.msf{Gtr2_human}     548 aa
      5   hum_gtr.msf{Gtr5_human}     548 aa

 Analysis options:

      1  Heuristic tree search
      2  Branch-and-bound tree search
      3  Exhaustive tree search
      4  Reconstruct neighbor-joining tree
      5  Show parsimony length distribution of a random sample of trees
      6  Bootstrap analysis using heuristic search
      7  Bootstrap analysis using branch-and-bound search
      8  Bootstrap analysis using neighbor-joining distance

 Choose an analysis (* 1 *) :

 Optimality criterion for tree search:

      P  Parsimony
      D  Distance (Minimum Evolution)

 Choose a criterion (* P *) :

 Setting criterion to parsimony.

 What should I call the trees file (* hum_gtr.pauptrees *) ?

 Creating NEXUS file for input to PAUP.

 Calling PAUP to perform analysis...

 P A U P *
 Portable version 4.0.0d55 for Unix
 Fri Oct 23 11:11:32 1998

 --------------------------NOTICE------------------------
     PAUP* is experimental in this release.
     Please check your results carefully!
 --------------------------------------------------------

 Processing of file "paup1038092136141.data" begins...

 Aligned sequences from GCG file(s) 'HumGtr.Msf{*}'

 Data matrix has 5 taxa, 548 characters
 Data read in 'protein' format
 Valid character-state symbols: ACDEFGHIKLMNPQRSTVWY*
 Missing data identified by '?'
 "Equate" macros in effect:
    B,b ==> {DN}
    Z,z ==> {EQ}
 Gaps identified by '.', treated as "missing"

 Heuristic search settings:
   Optimality criterion = maximum parsimony
     Character-status summary:
       Of 548 total characters:
         All characters are of type 'unord'
         All characters have equal weight
         178 characters are constant
         286 variable characters are parsimony-uninformative
         Number of parsimony-informative characters = 84
   Starting tree(s) obtained via stepwise addition
   Addition sequence: simple (reference taxon = Gtr1 Human)
   1 tree held at each step during stepwise addition
   Tree-bisection-reconnection (TBR) branch-swapping performed
   MULPARS option in effect
   Steepest descent option not in effect
   'MaxTrees' setting = 100 (will not be increased)
   Branches collapsed (creating polytomies) if maximum branch length = 0
   Topological constraints not enforced
   Trees are unrooted

 Heuristic search completed
   Total number of rearrangements tried = 28
   Score of best tree(s) found = 738
   Number of trees retained = 2
   Time used = <1 sec (CPU time = 0.02 sec)

 2 trees saved to file "Hum_Gtr.Pauptrees"

%
Here is some of the output file:
#NEXUS

[! Aligned sequences from GCG file(s) 'Hum_Gtr.Msf{*}' ]

[Length: 548  Type: P  October 23, 1998 11:11]

[ Name: Gtr1_Human  Len: 548  Check: 7317  Weight: 1.00]
[ Name: Gtr3_Human  Len: 548  Check: 8193  Weight: 1.00]
[ Name: Gtr4_Human  Len: 548  Check: 6116  Weight: 1.00]
[ Name: Gtr2_Human  Len: 548  Check: 7171  Weight: 1.00]
[ Name: Gtr5_Human  Len: 548  Check: 8485  Weight: 1.00]

begin data;
  dimensions ntax=5 nchar=548;
  format datatype=protein interleave gap=.;
  matrix

[            1                                                   50]
Gtr1_Human   .......... ..MEPSSKKL TGRLMLAVGG AVLG.SLQFG YNTGVINAPQ
Gtr3_Human   .......... ....MGTQKV TPALIFAITV ATIG.SFQFG YNTGVINAPE
Gtr4_Human   MPSGFQQIGS EDGEPPQQRV TGTLVLAVFS AVLG.SLQFG YNIGVINAPQ
Gtr2_Human   .......... ....MTEDKV TGTLVFTVIT AVLG.SFQFG YDIGVINAPQ
Gtr5_Human   .......MEQ QDQSMKEGRL TLVLALATLI AAFGSSFQYG YNVAAVNSPA

//////////////////////////////////////////////////////////////////

[            501                                              548]
Gtr1_Human   VPETKGRTFD EIASGF..RQ GGASQ.SDKT PEELFHPLGA DSQV....
Gtr3_Human   VPETRGRTFE DITRAFEGQA HGADR.SGKD GVMEMNSIEP AKETTTNV
Gtr4_Human   VPETRGRTFD QISAAFHRTP SLLEQ.EVK. PSTELEYLGP DEND....
Gtr2_Human   VPETKGKSFE EIAAEFQKKS GSAHR.P..K AAVEMKFLGA TETV....
Gtr5_Human   VPETKAKTFI EINQIFTKMN KVSEVYPEKE ELKELPPVTS EQ......
  ;
endblock;

begin trees;
[Treefile saved Fri Oct 23 11:11:34 1998]
[!>Heuristic search settings:
>  Optimality criterion = maximum parsimony
>    Character-status summary:
>      Of 548 total characters:
>        All characters are of type 'unord'
>        All characters have equal weight
>        178 characters are constant
>        286 variable characters are parsimony-uninformative
>        Number of parsimony-informative characters = 84
>  Starting tree(s) obtained via stepwise addition
>  Addition sequence: simple (reference taxon = Gtr1 Human)
>  1 tree held at each step during stepwise addition
>  Tree-bisection-reconnection (TBR) branch-swapping performed
>  MULPARS option in effect
>  Steepest descent option not in effect
>  'MaxTrees' setting = 100 (will not be increased)
>  Branches collapsed (creating polytomies) if maximum branch length = 0
>  Topological constraints not enforced
>  Trees are unrooted
>
>Heuristic search completed
>  Total number of rearrangements tried = 28
>  Score of best tree(s) found = 738
>  Number of trees retained = 2
>  Time used = <1 sec (CPU time = 0.02 sec)
]
  translate
    1 Gtr1_Human,
    2 Gtr3_Human,
    3 Gtr4_Human,
    4 Gtr2_Human,
    5 Gtr5_Human
    ;
utree PAUP_1 = (1:82,2:115,(3:100,(4:110,5:182):79):70);
utree PAUP_2 = (1:81,(2:109,(4:107,5:185):80):65,3:111);
end;
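As a quick sanity check on a trees file, you can add up the branch lengths in each tree description. In this example, the branch lengths of both trees happen to add up to the reported score of 738. The sketch below is an assumption-laden illustration: the helper total_branch_length is hypothetical, and it simply pulls every number that follows a colon out of the parenthesized tree string.

```python
import re


def total_branch_length(tree):
    """Sum every ':length' value found in a parenthesized tree description.

    Hypothetical helper for illustration; it does not validate the tree.
    """
    return sum(float(x) for x in re.findall(r":([0-9]+(?:\.[0-9]+)?)", tree))


paup_1 = "(1:82,2:115,(3:100,(4:110,5:182):79):70);"
paup_2 = "(1:81,(2:109,(4:107,5:185):80):65,3:111);"

if __name__ == "__main__":
    for name, tree in (("PAUP_1", paup_1), ("PAUP_2", paup_2)):
        print(name, total_branch_length(tree))
```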
The input to PAUPSearch is a set of four or more aligned sequences. In most cases, you will use a multiple sequence alignment in MSF or RSF format created by PileUp, LineUp, or the SeqLab editor. PAUPSearch also will accept a list file containing the names of four or more individual sequences or a wildcard file specification that represents the names of four or more sequences.
Since PAUPSearch does not create alignments, it is your responsibility to ensure that the sequences specified by a list file or wildcard file specification are in alignment before using them as input to PAUPSearch. One way to verify this is to use Pretty to display the sequences; if the Pretty output shows an acceptable alignment, the sequences are suitable for use with PAUPSearch.
The function of PAUPSearch depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, see Appendix VI for information on how to change or set the type of a sequence.
PileUp creates a multiple sequence alignment of a group of related sequences. LineUp is a screen editor for editing multiple sequence alignments or for manually creating multiple sequence alignments of up to 30 sequences.
The Wisconsin Package includes several programs for evolutionary analysis of multiple sequence alignments. Distances creates a matrix of pairwise distances between the sequences in a multiple sequence alignment. Diverge measures the number of synonymous and nonsynonymous substitutions per site of two or more aligned protein coding regions and can output matrices of these values. GrowTree reconstructs a tree from a distance matrix or a matrix of synonymous or nonsynonymous substitutions. PAUPSearch reconstructs phylogenetic trees from a multiple sequence alignment using parsimony, distance, or maximum likelihood criteria; PAUPDisplay can manipulate and display the trees output by PAUPSearch and can also plot the trees output by GrowTree.
PAUPSearch requires pre-aligned sequence data as input -- it does not perform alignments. There must be at least four sequences in the aligned set.
A PAUP tree search has two components: a method for systematically generating trees that can be constructed from the data and an optimality criterion for evaluating these trees.
Three optimality criteria are available: maximum parsimony, distance, and maximum likelihood. According to the parsimony criterion, the optimal tree is the one that requires the least amount of evolutionary change to explain the data. (Such a tree is also referred to as the most parsimonious tree.) To use distance as a criterion, the program calculates a distance matrix from the aligned sequences and uses the matrix values to compute the sum of the branch lengths for each tree according to the minimum evolution algorithm. The distance criterion regards the optimal tree to be the one with the minimum sum of branch lengths. Maximum likelihood is a statistics-based method. Given a model for evolutionary change, a data set, and one or more trees, this method calculates the likelihood that the data set resulted from each tree. The tree with the highest likelihood is considered to be the optimal tree. For any of these criteria, it is possible that more than one tree will tie for the optimal tree. In that event, all of the optimal trees are reported at the conclusion of the search.
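To illustrate what "least amount of evolutionary change" means in practice, the following sketch counts the minimum number of state changes that one alignment column requires on a given tree, using Fitch's algorithm for unordered characters. This is a simplified illustration, not PAUP's implementation; the nested-tuple tree representation and the function name fitch are assumptions made for the example. The tree is rooted arbitrarily, which does not affect the count for unordered characters.

```python
def fitch(node, states):
    """Return (possible_states, changes) for the subtree at node.

    node is either a taxon name (leaf) or a pair (left, right) of subtrees.
    states maps each taxon name to its character state in one column.
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        # The two subtrees can agree on a state: no extra change needed.
        return common, left_changes + right_changes
    # No shared state: one additional change is required at this node.
    return left_set | right_set, left_changes + right_changes + 1


column = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))   # groups A with B
tree_2 = (("A", "C"), ("B", "D"))   # groups A with C

if __name__ == "__main__":
    print(fitch(tree_1, column)[1], fitch(tree_2, column)[1])
```

For this column, the first tree needs one change and the second needs two, so parsimony prefers the first. The score of a whole tree is the sum of such counts over all columns.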
PAUP provides three methods to generate trees: an exhaustive search, a branch-and-bound search, and a heuristic search. Each of these three methods creates a candidate tree by adding branches to a partial tree in a stepwise-addition process until a complete tree has been constructed. The difference in the methods is in how the branches are added and what happens after a complete tree has been constructed. Here is a greatly simplified description of each method:
* Exhaustive search. This method evaluates every possible tree by brute force, so it is guaranteed to find the optimal tree(s).
* Branch-and-bound search. This method is also guaranteed to find the optimal tree(s). Unlike the exhaustive search, the branch-and-bound algorithm constructs trees with some "intelligence" rather than by brute force. It uses the same stepwise-addition process as the exhaustive search, but it computes the score of the partial tree each time it adds a branch. If the score of the partial tree is worse than that of the best complete tree found so far, the algorithm abandons this pathway and backtracks to a previous partial tree to use as the next starting point for adding branches. When the search is short-circuited in this way, a branch-and-bound search is faster than an exhaustive search.
In practice, the increase in speed is dependent on the data and on the optimality criterion. When the optimality criterion is parsimony, a branch-and-bound search is usually faster than an exhaustive search. When the optimality criterion is not parsimony, branch and bound offers little or no improvement over an exhaustive search. There is no way of predicting if the branch-and-bound algorithm will speed up a search. For some data sets, the branch-and-bound search reverts to an exhaustive search.
* Heuristic search. This method is not guaranteed to find the optimal tree(s). However, it is the fastest type of search and is the only realistic option for large data sets. An initial complete tree is constructed, either by using one of several stepwise-addition methods or by creating a neighbor-joining tree (see "Tree Reconstruction Using Neighbor Joining" below). Next, branches and/or subtrees of this initial tree are swapped to grossly rearrange the tree to see if this improves the score. Several branch-swapping schemes are available. Because the heuristic method provides no guarantees, you should repeat the search using different options for the stepwise-addition and branch-swapping steps to be confident about the results.
Some tips for choosing a method can be found under the CONSIDERATIONS topic. Detailed information about these methods and how to use them effectively can be found in Sinauer's PAUP User's Manual.
Instead of evaluating large numbers of trees, the neighbor-joining method uses a step-by-step algorithm to construct a single tree from distance data. It starts from a star phylogeny -- a completely degenerate tree in which all the branches originate from the same point -- and successively groups pairs of branches until the complete tree is formed. This process is much faster than any of the searching methods. However, it will produce only one tree, even if other trees exist that have equally good scores. Neighbor joining often produces the same tree as a tree search using the distance criterion because it is based on the same premise: the best tree is the one with the minimum sum of branch lengths based on a corrected distance matrix calculated from the aligned sequences. Neighbor joining uses the same distance correction methods that are used by a search using the distance criterion.
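The pair-grouping step can be sketched as follows. This example uses the standard neighbor-joining selection criterion, Q(i,j) = (n - 2) d(i,j) - r(i) - r(j), where r(i) is the sum of distances from sequence i to all others; the pair with the smallest Q is joined first. It shows only the choice of the first pair, and it is an illustration rather than PAUP's code. The function name and the small distance matrix are made up for the example.

```python
def pick_neighbors(dist):
    """Return the index pair (i, j) with the smallest neighbor-joining Q value.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    """
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q


# Hypothetical distances among five sequences (not from the example data set).
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

if __name__ == "__main__":
    print(pick_neighbors(example))
```

Note that the pair chosen here is sequences 0 and 1, even though sequences 3 and 4 have the smallest raw distance; the criterion corrects each distance for how far both sequences are from everything else.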
Detailed information about distance correction methods and the neighbor-joining algorithm can be found in Sinauer's PAUP User's Manual and in the Wisconsin Package Program Manual documentation for the Distances and GrowTree programs.
Bootstrapping is a method of assessing confidence for the groupings in a tree. The original data set is resampled with replacement to create a number of "bootstrap replicate" data sets of the same size as the original data set. This is done by using a random number generator to determine which columns of the original alignment will be included in a replicate data set. A given column can be used more than once in the replicate data set or not at all. Each replicate data set is subjected to the same analysis as the original data set, and a consensus bootstrap tree is created from the trees obtained from each of the bootstrap replicates, using the groupings found in the majority of the replicate trees. The percentage of occurrence of a particular grouping among the replicate trees is a rough estimate of support for that grouping.
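The resampling step can be sketched in a few lines. The example below draws alignment columns at random, with replacement, to build one replicate of the same width as the original. It is a minimal illustration with a hypothetical function name, and it does not reproduce PAUP's random number generator or seed handling.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Build one bootstrap replicate from a dict of name -> aligned sequence.

    Columns are sampled with replacement, so a column may appear several
    times in the replicate or not at all.
    """
    ncol = len(next(iter(alignment.values())))
    cols = [rng.randrange(ncol) for _ in range(ncol)]
    return {
        name: "".join(seq[c] for c in cols)
        for name, seq in alignment.items()
    }


if __name__ == "__main__":
    aln = {"seq1": "ACGTAC", "seq2": "ACGAAC", "seq3": "TCGAAG"}
    rng = random.Random(33)  # fixed seed so the run is repeatable
    print(bootstrap_replicate(aln, rng))
```

Because every sequence is sampled with the same list of column indices, the columns of the replicate stay aligned.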
The single most critical step in tree reconstruction is the sequence alignment. If the alignment is poorly done, no amount of care or tweaking of analysis parameters will guarantee a correct tree. Multiple alignments that are created by computerized methods such as PileUp will need to be inspected and edited by hand, using an editor such as LineUp or SeqLab. Be especially careful with nucleic acid sequences that are coding regions, since computerized alignment methods have no knowledge of codon boundaries. They may insert a gap whose length is not a multiple of three or may insert a gap in the middle of a codon, for example.
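A simple script can flag the codon-boundary problems described above before you inspect the alignment by hand. The sketch below reports interior gaps in one aligned coding sequence whose length is not a multiple of three or that begin in the middle of a codon. It assumes that the first residue of the sequence is the first base of a codon and that '.' and '-' are the gap characters; the function name is hypothetical.

```python
import re


def frame_breaking_gaps(seq, gap_chars=".-"):
    """Return (start, length) for each interior gap that breaks the reading frame.

    Assumes the first non-gap character is codon position 1.
    Terminal gaps (at either end of the sequence) are ignored.
    """
    problems = []
    pattern = "[" + re.escape(gap_chars) + "]+"
    for m in re.finditer(pattern, seq):
        if m.start() == 0 or m.end() == len(seq):
            continue  # leading or trailing gap
        gaps_before = sum(seq.count(g, 0, m.start()) for g in gap_chars)
        residues_before = m.start() - gaps_before
        length = m.end() - m.start()
        if length % 3 != 0 or residues_before % 3 != 0:
            problems.append((m.start(), length))
    return problems


if __name__ == "__main__":
    print(frame_breaking_gaps("ATGGCC...AAATGA"))  # gap of 3 between codons
    print(frame_breaking_gaps("ATGGC..CAAATGA"))   # gap of 2 inside a codon
```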
Once the alignment is satisfactory, you must decide whether to use the entire alignment, or only portions of it. Only homologous regions of the sequences should be used to reconstruct a tree. Any regions of an alignment that contain data for which no homologs occur in the other sequences should be eliminated from consideration. For example, if there are gap characters at the beginning or end of one or more sequences in the alignment, the sequence data at the extremes of the alignment should not be used, since the longer sequences contain regions that have no homologs in the shorter sequences. Similarly, regions in the interior of the alignment that contain gaps in some of the sequences should probably be edited out of the alignment before trying to reconstruct a tree.
Some biological phenomena can interfere with tree reconstruction. Gene duplication is one of them. When genes are duplicated (by polyploidy or by regional duplication), one of the copies often accumulates mutations and either acquires a different function than the original gene or becomes a pseudogene. In this situation, it is often unclear which of the alternative loci will give the correct tree for the functional gene. Another complication is recombination: if recombination has occurred between sequences in the data set, no single tree can correctly explain the data.
Some data sets can also confound the existing methods for tree construction. For example, a set of sequences consisting of mostly closely related sequences with a few very divergent sequences cannot be analyzed using parsimony or a distance method based on an improperly corrected distance matrix. These methods will systematically group the widely diverged sequences together as sister groups, even if they actually belong to different lineages. If you don't want to drop the diverged sequences from the analysis, you will need to add sequences to the alignment that bridge the distance between the more distant sequences and the group of closely related sequences, or use a distance method based on a properly corrected distance matrix.
Depending on the number of sequences you are analyzing, their length, the degree of similarity among them, the type of search, and the optimality criterion, PAUP tree searches can take from less than a second to days or weeks of computer time. In some cases, the analysis may never come to a conclusion.
The number of possible trees grows enormously with the number of sequences, so that four sequences have only three possible trees, seven sequences have 945 possible trees, 10 sequences have over 2 million possible trees, and 11 sequences have over 34 million possible trees. Because of this, the exhaustive and branch-and-bound methods for tree searches ("alltrees" and "bandb" and bootstrap analyses using these methods) should not be attempted for more than 10 or 11 sequences, unless the sequences are very similar or very short, or unless steps are taken that will constrain the search in some way (for example, using a very low upper bound setting for the branch-and-bound search).
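The counts quoted above follow from the standard formula for the number of unrooted, fully bifurcating trees on n sequences: the product of the odd numbers from 3 up to 2n - 5. A short sketch (the function name is chosen for this example):

```python
def unrooted_tree_count(n):
    """Number of distinct unrooted bifurcating trees for n sequences (n >= 3)."""
    if n < 3:
        raise ValueError("need at least three sequences")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 7, 10, 11, 20):
        print(n, unrooted_tree_count(n))
```

Each additional sequence multiplies the count by the next odd number, which is why exhaustive searches become impractical so quickly.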
Searches that use the maximum likelihood criterion can be very slow because of the amount of computation involved. Searches that use the distance criterion and the parsimony criterion are much faster than searches using maximum likelihood. For some data sets, parsimony is faster than distance; for other data sets, the reverse is true. The neighbor-joining algorithm is fastest of all, since it reconstructs a single tree from a star phylogeny rather than creating and evaluating large numbers of trees.
To give you an idea of how quickly PAUP's time requirements can rise in response to increases in data size and changes in program parameters, we performed some sample searches on a DEC Alpha 3000-300. The times listed are the CPU times reported by PAUP. By comparison, construction of a neighbor-joining tree was clocked at 0.0 seconds for each of the data sets used below.
Increasing number of sequences. 246-base alignments of MHC class II DR-beta genes from Pan troglodytes , using parsimony as the optimality criterion:
number of      heuristic search    exhaustive search
sequences      (sec)               (hr:min:sec)

    8          00.02               00:00:00.55
   10          00.03               00:01:56.60
   12          00.07               12:38:23.50
   20          01.00               not done
Different optimality criteria. Heuristic searches of two different data sets (MHC genes as above, and a 1,230-base alignment of eight plant cytochrome b genes):
criterion      8 MHC genes    20 MHC genes    8 cytochrome b genes
               (min:sec)      (min:sec)       (min:sec)

parsimony      0:00.02        0:00.92         0:00.18
distance       0:00.05        0:03.17         0:00.02
likelihood     0:02.52        2:36.80         0:35.52
Because of the large amounts of time and computer resources that a tree search can consume, it is not a good idea to run one of the search methods on a large set of newly aligned sequences "just to see what turns up." A better plan would be to try a faster (and cheaper!) algorithmic reconstruction method first (such as PAUP's neighbor-joining implementation, or Distances followed by GrowTree) and examine the resulting tree. If there are anomalies in this tree, such as a lot of negative or zero branch lengths or relationships that you know from other evidence to be incorrect, there may be problems with the data set, and you should not do a time-consuming tree search before examining the sequences to find the source of the problem.
To get the best nucleotide alignments of coding regions, you also should align the sequences at the protein level and adjust the nucleotide alignment to conform to the amino acid alignment. You can do this manually using LineUp or SeqLab.
One way of detecting the presence of recombination in your sequence set is to reconstruct trees from different sections of the alignment. If different trees are found for different sections, it's possible that recombination has occurred.
To check the distance distribution of your sequences, create an uncorrected distance matrix from the alignment (using Distances) and examine the contents. If there are mostly closely related sequences with a few very divergent sequences, you must either add sequences to the alignment to bridge the distance between the more distant sequences and the group of closely related sequences, or you must use a distance method based on a properly corrected distance matrix.
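If you want a rough look at the distance distribution outside of the package, an uncorrected distance (the fraction of compared positions that differ) is easy to compute. The sketch below skips any position where either sequence has a gap or missing-data symbol; that treatment of gaps is an assumption of this example and may differ from what Distances does. The function names are hypothetical.

```python
def p_distance(a, b, skip=".-?"):
    """Fraction of compared positions at which two aligned sequences differ."""
    compared = diffs = 0
    for x, y in zip(a, b):
        if x in skip or y in skip:
            continue  # ignore positions with a gap or missing data
        compared += 1
        if x.upper() != y.upper():
            diffs += 1
    return diffs / compared if compared else float("nan")


def p_distance_matrix(alignment):
    """Return names and a full matrix of pairwise uncorrected distances."""
    names = list(alignment)
    matrix = [[p_distance(alignment[i], alignment[j]) for j in names]
              for i in names]
    return names, matrix


if __name__ == "__main__":
    aln = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGAAC.A"}
    names, matrix = p_distance_matrix(aln)
    for name, row in zip(names, matrix):
        print(name, ["%.2f" % d for d in row])
```

A matrix in which a few rows are much larger than all the others is the pattern to watch for.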
For large data sets that require long run times, it can be inconvenient to run PAUPSearch interactively. You can specify that this program run at a later time in the batch queue by using -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Chapter 3, Using Programs in the User's Guide.
PAUPSearch bootstrap analyses should almost always be done as a batch job.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
Minimal Syntax: % paupsearch [-INfile=]hum_gtr.msf{*} -Default

Prompted Parameters:

[-OUTfile=]paup.pauptrees  sets primary output filename (trees file or
                           log file)
-ACTion=hsearch            does a heuristic tree search
        BANDb              does a branch-and-bound tree search
        ALLtrees           does an exhaustive tree search
        NJ                 reconstructs a neighbor-joining tree
        RANDtrees          displays a score distribution of randomly
                           generated trees
        BOOTstrap          does a bootstrap analysis to assess confidence
-CRITerion=p               sets parsimony...
           D               distance (minimum evolution)...
           L               likelihood...
                           as the optimality criterion for evaluating trees
                           during tree searches and bootstrap analyses

Local Data Files: None

Optional Parameters:

-BEGin=1                   sets beginning position of alignment to analyze
-END=548                   sets ending position of alignment to analyze
-SCRIPT[=paup.paupscript]  saves the NEXUS file used to run PAUP
-NORUN                     doesn't perform analysis (just saves the script
                           if one is requested)
-NOMONitor                 doesn't display information about analysis
                           parameters used
-BATch                     submits program to the batch queue
-LOGFile=paup.pauplog      saves screen output to this filename

Heuristic Tree Search Optional Parameters:

-MAXTrees=100              sets maximum number of trees saved at any time
                           during the analysis
-AUTOINCrease=100          automatically increases MAXTrees by another 100
                           trees when limit is reached
-STARt=stepwise            sets the method used to create the initial tree
       nj
-ADDSeq=simple             sets the method of choosing the addition sequence
        random             during stepwise addition
        asis
        closest
        furthest
-REFtax=1                  designates the reference taxon for unrooted tree
                           searches (if ADDSeq=simple)
-HREPs=10                  sets number of random-addition-sequence
                           replications to perform (if ADDSeq=random)
-RSEED=5                   sets initial seed for generating random addition
                           sequences (if ADDSeq=random)
-NCHUCK=5                  sets number of trees to keep whose scores are
                           greater than or equal to CHUCKSCORE
                           (if ADDSEQ=random)
-CHUCKScore=700            sets minimum score to save when NCHUCK is set
                           (if ADDSEQ=random)
-ABORTREP                  cancels the current random-addition-sequence
                           replicate if chucking limits are met
                           (if ADDSEQ=random and NCHUCK and CHUCKSCORE
                           are set)
-RSTATus                   displays info on random addition sequence
                           replicates (if ADDSeq=random)
-SWAP=tbr                  selects algorithm used for branch swapping
      spr
      nni
      none
-HOLD=5                    holds the five best trees in memory at each cycle
                           of stepwise addition
-USENONmin                 swaps on nonminimal trees if HOLD > 1
-STEEPest                  uses steepest-descent modification to branch
                           swapping
-KEEP=750                  keeps all trees whose scores are <= 750.0
                           (KEEP=0 [default] keeps only the best trees)
-NBEST=5                   saves the five best trees found during the search
-ALLSwap                   swaps on all the NBEST trees found
-NOMULPARS                 doesn't save all minimal trees found during the
                           search
-COLLapse=maxbrlen         collapses branches to polytomies if maximum
                           branch length = 0
          minbrlen         collapses branches to polytomies if minimum
                           branch length = 0
          ambequal         collapses branches to polytomies if MPR-sets of
                           incident nodes are identical for all characters
          no               doesn't collapse zero-length branches to
                           polytomies
-NOQUICKSwap               doesn't perform nearest neighbor interchange
                           between initial tree and branch swapping
-DSTATus=10                sets the number of seconds between reports of
                           the progress of the search

Branch-and-Bound Tree Search Optional Parameters:

-KEEP=750                  keeps all trees whose scores are <= 750.0
                           (KEEP=0 (default) keeps only the best trees)
-COLLapse=maxbrlen         collapses branches to polytomies if maximum
                           branch length = 0
          minbrlen         collapses branches to polytomies if minimum
                           branch length = 0
          ambequal         collapses branches to polytomies if MPR-sets of
                           incident nodes are identical for all characters
          no               doesn't collapse zero-length branches to
                           polytomies
-MAXTrees=100              sets maximum number of trees saved during the
                           analysis
-AUTOINCrease=100          automatically increases MAXTrees by another 100
                           trees when limit is reached
-NOMULPARS                 doesn't save all minimal trees found during the
                           search
-ADDSeq=furthest           sets how to choose the addition sequence to use
        asis               during stepwise addition
        simple
        maxmini
        kmaxmini
-UPBound=750               sets an upper bound on the score of the best
                           tree(s)
-FDOnly                    outputs only a frequency distribution (no trees)
                           of trees whose scores are <= KEEP value
-FDType=barchart           format of frequency distribution
        HISTogram
-NCLAsses=20               number of histogram divisions in frequency
                           distribution
-DSTATus=10                sets the number of seconds between reports of
                           the progress of the search

Exhaustive Tree Search Optional Parameters:

-FD=barchart               sets format of frequency distribution (or
    HISTogram              suppress the frequency distribution output)
    no
-NCLAsses=20               sets number of histogram divisions in frequency
                           distribution
-FREQfile[=paup.paupfreq]  saves the frequency distribution of scores to a
                           separate file
-KEEP=750                  keeps all trees whose scores are <= 750.0
                           (KEEP=0 [default] keeps only the best trees)
-COLLapse=maxbrlen         collapses branches to polytomies if maximum
                           branch length = 0
          minbrlen         collapses branches to polytomies if minimum
                           branch length = 0
          ambequal         collapses branches to polytomies if MPR-sets of
                           incident nodes are identical for all characters
          no               doesn't collapse zero-length branches to
                           polytomies
-MAXTrees=100              sets maximum number of trees saved during
                           analysis
-AUTOINCrease=100          automatically increases MAXTrees by another 100
                           trees when limit is reached
-DSTATus=10                sets the number of seconds between reports of
                           the progress of the search

Likelihood Options Optional Parameters:

-NST=1                     sets the number of substitution types for the
     2                     substitution model
     6
-TRATio=2.0                sets transition (ti) : transversion (tv) ratio
                           estimate
-RMATrix=1,1,1,1,1         sets rate matrix (6-parameter model) estimate
-VARiant=hky               sets variant for unequal base frequencies
         f84               (2-parameter model)
-BASEFReq=empirical        sets base frequencies to use
          equal
          0.35,0.25,0.2
-RATes=equal               sets model for rate variation across sites
       gamma
-SHApe=0.5                 sets shape parameter of gamma distribution
                           equation
-NCAT=4                    sets number of rate categories for gamma
                           distribution
-REPRAte=mean              sets the representation of the rate categories
         median            for gamma distribution
-LOGITer                   displays the iteration log

Distance and Neighbor-Joining Options Optional Parameters:

-DISTance=mean             sets distance method to use for prot or nuc...
          total
          p
                           ...or for nuc only
          abs
          jc
          tajnei
          k2p
          k3p
          f81
          f84
          hky85
          tamnei
          gtr
          logdet
          ml
          custom

When DISTance=ml, all likelihood optional parameters can be used.

-CLAss=a,a,a,a,a,a         sets substitution classes (when DISTance=custom)
-BASEFReq=empirical        sets base frequencies to use
          equal            (when DISTance=custom)
-MISSDist=infer            sets how to treat gaps and ambiguities
          ignore
-SUBST=all                 calculates distance estimate on all
                           substitutions,
       ti                  transitions only, or
       tv                  transversions only
-RATes=equal               sets the model for the substitution rate
       gamma               variation across sites
-SHApe=0.5                 sets shape parameter of gamma distribution
                           equation
-NEGBRlen=setzero          sets how negative branch lengths are treated
          prohibit
          allow
          setabs
-LOGITer                   displays the iteration log

Random Tree Search Optional Parameters:

-FDType=barchart           sets the format of the frequency distribution
        HISTogram
-NCLAsses=20               sets number of histogram divisions in frequency
                           distribution
-FREQfile[=paup.paupfreq]  saves the frequency distribution of scores to a
                           separate file
-TSEED=13                  sets initial seed for the random number generator
-NREPs=1000                sets number of random trees to evaluate

Bootstrap Analysis Optional Parameters:

-SEArch=h                  sets heuristic...
        B                  branch and bound...
        N                  neighbor joining...
                           as the search algorithm

When SEArch=h, all heuristic search optional parameters can be used.
When SEArch=b, all branch-and-bound search optional parameters can be used.
When SEArch=n, the distance optional parameter DISTance can be used.

-BSEED=33                  sets initial seed for the random number generator
-BREPs=1000                sets number of bootstrap replications
                           (resamplings)
-CONlevel=50               includes in bootstrap consensus tree only groups
                           that occurred in more than 50 percent of
                           bootstrap replicates
-KEEPALL                   retains all groups in the bootstrap consensus
                           that are compatible with all of the groups that
                           are already in the consensus
PAUP is designed and implemented by Dr. David L. Swofford of the Laboratory of Molecular Systematics, Smithsonian Institution. The GCG front-end programs PAUPSearch and PAUPDisplay were written for GCG by Sue Olson in collaboration with Dr. Swofford.
The PAUP (Phylogenetic Analysis Using Parsimony) Software, created by David L. Swofford, is the copyrighted property of the Smithsonian Institution. Sinauer Associates Inc. has exclusive right to publish and sell the PAUP Software. Through a contractual agreement with Dr. Swofford, the Smithsonian Institution and Sinauer Associates Inc., Genetics Computer Group has been granted permission to distribute PAUP Software with the Wisconsin Sequence Analysis Package, at no additional fee to the users of the Wisconsin Sequence Analysis Package. This license allows PAUP Software to be mounted alongside the Wisconsin Sequence Analysis Package for the sole purpose of facilitating scientific research. The PAUP Software may not be copied to other machines without express permission from the Smithsonian Institution and modifications may not be made to the PAUP Software.
PAUP SOFTWARE IS PROVIDED "AS IS" WITHOUT ANY WARRANTY OF ANY KIND. DAVID L. SWOFFORD, THE SMITHSONIAN INSTITUTION, SINAUER ASSOCIATES, INC. AND GENETICS COMPUTER GROUP, INC. DO NOT WARRANT, GUARANTEE, OR MAKE ANY REPRESENTATIONS REGARDING THE USE OR THE RESULTS OF THE PAUP SOFTWARE OR DOCUMENTATION IN TERMS OF THEIR CORRECTNESS, RELIABILITY, CURRENTNESS, OR OTHERWISE. IN NO CASE WILL THESE PARTIES BE LIABLE FOR ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, OR OTHER DAMAGES THAT MAY RESULT FROM USE OF THE PAUP SOFTWARE.
If you cannot abide by the terms of this agreement, do not make use of the PAUP Software.
Copyright (c) David L. Swofford, Illinois Natural History Survey, 1989-1993
Copyright (c) David L. Swofford, 1994
Copyright (c) Smithsonian Institution, 1997
All Rights Reserved.
None.
You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
indicates which type of analysis is to be performed: heuristic tree search (HSearch), branch-and-bound tree search (BANDb), exhaustive tree search (ALLtrees), reconstruction of a neighbor-joining tree (NJ), displaying a score distribution of randomly generated trees (RANDtrees), or bootstrap analysis (BOOTstrap).
sets the criterion to be used to evaluate trees during tree searches: parsimony (P), distance (D), or likelihood (L).
sets the beginning position for all input sequences. When the beginning position is set from the command line, PAUPSearch ignores beginning positions specified for individual sequences in a list file.
sets the ending position for all input sequences. When the ending position is set from the command line, PAUPSearch ignores ending positions specified for sequences in a list file.
saves the NEXUS file that is used as a script to run PAUP. This file can be used for documentation purposes or can be edited and used as input to the PAUP program.
doesn't perform the analysis. This is used in conjunction with -SCRIPT when you want to create a script file and exit without performing the analysis.
doesn't display information about analysis parameters.
submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.
creates a file into which program output is regularly written during a search. When you run a long search in batch, you can view the contents of this file to watch the progress of the search. You can control how often this information is written by assigning a value to -DSTATus. You would also use -LOGFile if you wanted to save the information that is sent to the screen when -LOGITer or -RSTATus are used. This can be useful in analyzing the effectiveness of a tree search.
sets the maximum number of trees that can be saved in memory at any time during the analysis. The default is 100.
automatically increments the value of -MAXTrees by another 100 trees each time the current limit is reached. If -AUTOINCrease is used without a value, the increment value is set to 100. The program default is -NOAUTOINCrease, which means that if the program generates more than the value of -MAXTrees trees at any time, the analysis will be terminated and results up to that point will be reported. Use -AUTOINCrease with caution, because with some data, you may end up with a "run-away" search that will keep generating trees until you run out of memory.
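The interaction between -MAXTrees and -AUTOINCrease can be pictured with a small Python sketch. It only illustrates the behavior described above and is not PAUP's implementation; the function name collect_trees and its arguments are hypothetical, and trees are represented simply by their scores.

```python
def collect_trees(candidate_scores, max_trees=100, auto_increase=None):
    """Save trees until the limit is reached, then stop or raise the limit.

    auto_increase=None models -NOAUTOINCrease (the default);
    an integer models -AUTOINCrease=<increment>.
    Returns (saved_trees, final_limit, terminated_early).
    """
    saved = []
    limit = max_trees
    for score in candidate_scores:
        if len(saved) >= limit:
            if auto_increase is None:
                # Limit reached: report what has been found so far.
                return saved, limit, True
            # Raise the limit and keep going (can "run away" on some data).
            limit += auto_increase
        saved.append(score)
    return saved, limit, False


if __name__ == "__main__":
    print(collect_trees(range(250))[1:])                     # default limit
    print(collect_trees(range(250), auto_increase=100)[1:])  # growing limit
```

With the default settings the search stops once 100 trees are held; with an increment of 100 the limit grows in steps for as long as new trees keep arriving.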
sets the method used to create the starting tree for the heuristic algorithm. When -CRITerion=D, the default value is nj (the neighbor-joining algorithm is used to create the initial tree). Otherwise the default is to create the initial tree by stepwise addition. When the criterion is parsimony, only stepwise can be used to specify a starting tree. But when the criterion is distance or likelihood, you can specify either nj or stepwise as a value for -STARt.
When -STARt=stepwise, -ADDSeq sets the method used to choose the next sequence to add to the growing tree during the stepwise-addition process. There are five methods: simple (the default) picks the sequence that has the next highest "advancement index" according to the "simple algorithm" devised by Farris. asis uses the order in which the sequences appear in the alignment. random picks the next sequence at random from the remaining sequences. closest picks the sequence that will increase the tree score the least and furthest picks the sequence that will increase the tree score the most.
The following seven parameters can be used in conjunction with -ADDSeq.
assigns the first sequence in the alignment to be used as the "reference" sequence when -ADDSeq=simple. The default is sequence number 1; you can assign another sequence to be the reference sequence by specifying its number (order in the alignment) as the -REFtax value. The distance between this sequence and each of the other sequences in the alignment is referred to as an advancement index, which is used to determine the order that sequences will be added to the growing tree.
sets the number of random-addition-sequence replications to perform (only applies if -ADDSeq=random).
sets the initial seed for the random number generator which determines which sequence to choose for stepwise addition (only applies if -ADDSeq=random).
abandons the current random-addition-sequence replicate if chucking limits are met (only applies if -ADDSeq=random and -NCHUCK and -CHUCKScore are set).
keeps in memory no more than five trees whose scores are greater than or equal to the value of -CHUCKScore. If -ADDSeq=random, this applies to the stepwise-addition stage, otherwise it applies to the branch-swapping stage. If the conditions set by -NCHUCK and -CHUCKScore are met before the conditions set by -KEEP, the "chucking" parameters take precedence.
sets the cutoff score for chucking to 700.0. This limits the number of trees (the value of -NCHUCK) that will be held in memory whose scores are greater than or equal to 700.0. If -ADDSeq=random, this applies to the stepwise-addition stage, otherwise it applies to the branch-swapping stage. If the conditions set by -NCHUCK and -CHUCKScore are met before the conditions set by -KEEP, the "chucking" parameters take precedence.
displays information about each random-addition-sequence replicate when -ADDSeq=random. Use -LOGFile also if you want to save this information to a file.
sets the algorithm for branch swapping. In descending order of effectiveness, the options are: tbr (tree bisection-reconnection), spr (subtree pruning-regrafting), nni (nearest-neighbor interchange), and none (no branch swapping after the stepwise-addition stage). none is very fast, but also very ineffective at finding the optimal tree.
holds the five best trees in memory at each cycle of stepwise addition and uses them all in the next cycle of stepwise addition. The default is -HOLD=1. As branches (sequences) are added to the growing tree, the best tree at an intermediate stage may not lead to the best final tree. If some lower scoring trees are saved at each stepwise-addition cycle, it is more likely that the best tree will be found.
swaps on nonoptimal (nonminimal) trees as well as optimal trees if the value of -HOLD is greater than 1. This slows the search, but increases the effectiveness of branch swapping to find the best tree.
uses the steepest-descent modification to branch swapping. This tells PAUP to swap on all optimal starting trees. (Normally, if PAUP finds a better tree during the branch-swapping process, it will discard any starting trees that haven't been swapped yet, and use the new best tree as the starting tree for the next round of branch swapping.) By keeping and swapping branches on all of the optimal starting trees, you may increase the chance of finding other "families" of trees. However, you also may get bogged down finding large numbers of trees that will be found anyway if you perform more replicates of the random-addition sequence during the stepwise-addition stage.
saves all trees (up to the limit set by -MAXTrees) whose score is less than or equal to 750.0. The default is 0, which means that only the best trees are kept.
saves the five best trees found during the search. This is an alternative to -KEEP=750 that is especially useful when the criterion is distance or maximum-likelihood.
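The difference between keeping only the best trees, keeping every tree up to a score cutoff (-KEEP), and keeping a fixed number of best trees (-NBEST) can be sketched in Python as follows. This is an illustrative sketch, not PAUP code: the function select_trees is hypothetical, trees are represented by their scores only, and it assumes that a lower score is a better score.

```python
def select_trees(scores, keep=0.0, nbest=None, max_trees=100):
    """Choose which tree scores to save.

    nbest=<n>  : save the n best (lowest) scores, like -NBEST=<n>.
    keep=<x>   : save every score <= x, like -KEEP=<x>.
    keep=0     : save only the trees tied for the best score (default).
    The result is truncated to max_trees, like -MAXTrees.
    """
    if nbest is not None:
        chosen = sorted(scores)[:nbest]
    elif keep > 0:
        chosen = [s for s in scores if s <= keep]
    else:
        best = min(scores)
        chosen = [s for s in scores if s == best]
    return chosen[:max_trees]


if __name__ == "__main__":
    found = [740.0, 745.0, 750.0, 751.0, 740.0]
    print(select_trees(found))              # best trees only
    print(select_trees(found, keep=750.0))  # everything up to the cutoff
    print(select_trees(found, nbest=2))     # fixed number of best trees
```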
When -NBEST has a value assigned, use -ALLSwap to swap on all of the best trees that you are saving. This increases your chances of finding all of the trees with the best scores, at the expense of longer analysis times. The default is -NOALLSwap.
When -CRITerion=P and a zero-length branch is encountered while finding the most parsimonious reconstruction, you can keep the zero-length branch in order to maintain its parent node as a dichotomous node (a node with two descendants), or you can collapse the branch into its parent node to form a polytomy (a node with more than two descendants). The default, maxbrlen, is a fairly restrictive collapsing method. It will collapse the branch to form a polytomy only if the maximum branch length over all most parsimonious reconstructions is zero. The minbrlen option is the most permissive. It will collapse a branch if it has a length of zero in at least one most parsimonious reconstruction. (Note that this can lead to "over collapsing" and result in a tree that is no longer fully parsimonious.) The ambequal method collapses a branch if the most parsimonious reconstruction sets (MPR sets) for the two incident nodes are identical for all included characters. (An MPR set is the set of state assignments that could be made to any node in at least one most parsimonious reconstruction.) Use -COLLapse=no if you don't want to collapse zero-length branches to yield polytomies.
discards all but one optimal tree at the end of a round of branch swapping. If more than one optimal tree is found at the end of a round of branch swapping, PAUP normally keeps all of them to use as starting trees for the next round of branch swapping (up to the number specified by -MAXTrees). -NOMULPARS is faster than the default -MULPARS, but it reduces the effectiveness of branch swapping in finding all optimal trees. (Note that -NOMULPARS has different meanings for heuristic and branch-and-bound searches.)
Ordinarily, a nearest-neighbor interchange is performed between the stepwise-addition and branch-swapping steps of the heuristic algorithm to see if a better starting tree can be found. This parameter eliminates that interchange. The default is -QUICKSwap.
sets the interval (in number of seconds) between progress reports during the search.
sets the maximum number of trees that can be kept in memory at any time during the analysis. The default is 100.
automatically increments the value of -MAXTrees by another 100 trees each time the current limit is reached. If -AUTOINCrease is used without a value, the increment value is set to 100. The program default is -NOAUTOINCrease, which means that if the program generates more than the value of -MAXTrees trees at any time, the analysis will be terminated and results up to that point will be reported. Use -AUTOINCrease with caution, because with some data, you may end up with a "run-away" search that will keep generating trees until you run out of memory.
sets the method used to choose the next sequence to add to the growing tree during the stepwise-addition process. There are five methods: furthest (the default) picks the sequence that will increase the score the most. asis uses the order in which the sequences appear in the alignment. simple picks the sequence that has the next highest "advancement index" according to the "simple algorithm" devised by Farris. maxmini and kmaxmini are based on the "maximum-of-the-minimum-values algorithm" described in the User's Manual for the MEGA software (Kumar, et al., "MEGA: Molecular Evolutionary Genetic Analysis, version 1.0", Pennsylvania State University, University Park PA).
sets the upper bound on the score of the best tree(s) to 750.0. If a value for -UPBound is not explicitly set, the program will compute an upper bound using stepwise addition. Using -UPBound with a low enough value can speed up the search.
saves all trees (up to the limit set by -MAXTrees) whose score is less than or equal to 750.0. The default is 0, which means that only the best trees are kept.
displays only a frequency distribution in the output (no trees are produced). A value for -KEEP must also be specified if -FDOnly is used.
displays the frequency distribution as a BARchart or as a HISTogram.
sets the number of frequency distribution "bins" when -FDType=HISTogram.
When -CRITerion=P and a zero-length branch is encountered while finding the most parsimonious reconstruction, you can keep the zero-length branch in order to maintain its parent node as a dichotomous node (a node with two descendants), or you can collapse the branch into its parent node to form a polytomy (a node with more than two descendants). The default, maxbrlen, is a fairly restrictive collapsing method. It will collapse the branch to form a polytomy only if the maximum branch length over all most parsimonious reconstructions is zero. The minbrlen option is the most permissive. It will collapse a branch if it has a length of zero in at least one most parsimonious reconstruction. (Note that this can lead to "over collapsing" and result in a tree that is no longer fully parsimonious.) The ambequal method collapses a branch if the most parsimonious reconstruction sets (MPR sets) for the two incident nodes are identical for all included characters. (An MPR set is the set of state assignments that could be made to any node in at least one most parsimonious reconstruction.) Use -COLLapse=no if you don't want to collapse zero-length branches to yield polytomies.
discards all but one optimal tree found during a branch-and-bound search. If more than one optimal tree is found, PAUP normally keeps all of them (up to the number specified by -MAXTrees). -NOMULPARS is faster than the default -MULPARS, and therefore is useful if you only want to know the score of the best tree(s), and not the topology of the tree(s). (Note that -NOMULPARS has different meanings for heuristic and branch-and-bound searches.)
sets the interval (in number of seconds) between progress reports during the search.
sets the maximum number of trees that can be saved in memory at any time during the analysis. The default is 100.
automatically increments the value of -MAXTrees by another 100 trees each time the current limit is reached. If -AUTOINCrease is used without a value, the increment value is set to 100. The program default is -NOAUTOINCrease, which means that if the program generates more than the value of -MAXTrees trees at any time, the analysis will be terminated and results up to that point will be reported. Use -AUTOINCrease with caution, because with some data, you may end up with a "run-away" search that will keep generating trees until you run out of memory.
saves all trees (up to the limit set by -MAXTrees) whose score is less than or equal to 750.0. The default is 0, which means that only the best trees are kept.
When -CRITerion=P and a zero-length branch is encountered while finding the most parsimonious reconstruction, you can keep the zero-length branch in order to maintain its parent node as a dichotomous node (a node with two descendants), or you can collapse the branch into its parent node to form a polytomy (a node with more than two descendants). The default, maxbrlen, is a fairly restrictive collapsing method. It will collapse the branch to form a polytomy only if the maximum branch length over all most parsimonious reconstructions is zero. The minbrlen option is the most permissive. It will collapse a branch if it has a length of zero in at least one most parsimonious reconstruction. (Note that this can lead to "over collapsing" and result in a tree that is no longer fully parsimonious.) The ambequal method collapses a branch if the most parsimonious reconstruction sets (MPR sets) for the two incident nodes are identical for all included characters. (An MPR set is the set of state assignments that could be made to any node in at least one most parsimonious reconstruction.) Use -COLLapse=no if you don't want to collapse zero-length branches to yield polytomies.
displays the frequency distribution graph as a BARchart or as a HISTogram, or no (don't display the distribution).
sets the number of frequency distribution "bins" when -FDType=HISTogram.
creates a separate output file containing the frequency distribution in numerical rather than graphical form. This can be used as input to a spreadsheet program.
sets the interval (in number of seconds) between progress reports during the search.
specifies the number of substitution types for the substitution model. Accepted values are 1, 2 (the default), and 6.
sets the transition (ti) : transversion (tv) ratio. The default ratio is 2.0. In addition to setting the ratio yourself, you can ask the program to estimate it from the sequence data by specifying -TRATio=estimate.
sets the rate matrix when a substitution model with six substitution types is specified. To set the rates yourself, supply a list of five integers or real numbers, separated by commas, after the parameter. These numbers represent the rates for AC, AG, AT, CG, and CT substitutions, respectively. The default is -RMATrix=1,1,1,1,1. In addition to setting the rates yourself, you can ask the program to estimate them by specifying -RMATrix=estimate. When the program estimates the rates, the search will be slower than when you specify the rate matrix.
sets the variant for unequal base frequencies when a substitution model with two substitution types (-NST=2) is specified. The two values for this parameter are hky (Hasegawa-Kishino-Yano's 1985 model) and f84 (Felsenstein's 1984 method).
sets the base frequencies to use. You can supply a list of three real numbers, separated by commas, to represent the fraction of the bases that are A, C, and G, respectively (the fraction of bases that are T is calculated from the other three values), for example, -BASEFReq=0.25,0.33,0.3. Alternatively, you can tell the program that the frequencies are equal (all base frequencies will be set to 0.25) or ask it to calculate the base frequencies from the data in the alignment (empirical, the default).
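The three forms of the base frequency setting can be illustrated with a short Python sketch. The function parse_basefreq is hypothetical and is not part of PAUPSearch; in particular, the empirical branch assumes that only unambiguous A, C, G, and T characters are counted, which may differ from what PAUP actually does with ambiguity codes.

```python
from collections import Counter


def parse_basefreq(value, alignment=None):
    """Return a dict of A, C, G, T frequencies for a -BASEFReq style value.

    value is "equal", "empirical", or three comma-separated numbers giving
    the fractions of A, C, and G; the fraction of T is whatever remains.
    """
    if value == "equal":
        return {base: 0.25 for base in "ACGT"}
    if value == "empirical":
        # Assumption: count unambiguous bases only, pooled over all sequences.
        counts = Counter(
            ch for seq in alignment for ch in seq.upper() if ch in "ACGT"
        )
        total = sum(counts.values())
        return {base: counts[base] / total for base in "ACGT"}
    a, c, g = (float(x) for x in value.split(","))
    t = 1.0 - (a + c + g)
    if min(a, c, g, t) < 0.0:
        raise ValueError("base frequencies must be non-negative and sum to 1")
    return {"A": a, "C": c, "G": g, "T": t}


if __name__ == "__main__":
    print(parse_basefreq("0.25,0.33,0.3"))
    print(parse_basefreq("empirical", ["AACG", "ACGT"]))
```

For the example value given above, 0.25,0.33,0.3, the remaining fraction assigned to T is 0.12.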
sets the model for the substitution rate variation across sites. The substitution rate can be equal at all sites (the default) or can vary according to the gamma distribution.
sets the value of the shape parameter of the gamma distribution equation when -RATes=gamma. The default is 0.5, and the value must be greater than 0.0.
sets the number of rate categories for the discrete gamma distribution when -RATes=gamma. The higher the number of categories, the closer the discrete gamma distribution will conform to the continuous gamma distribution, but at an increasing cost in computer time and memory. The default value of 4 is a good compromise.
sets how the rate categories for the discrete gamma distribution are represented when -RATes=gamma. The rate categories can be represented by the mean (default) or the median value for that category.
displays the iteration log.
-DISTance sets the distance correction method to use. The default is the p distance for nucleic acid sequences and mean distance for protein sequences. The methods that can be used for both types of sequences are total distance and mean distance. The following methods can be used only with nucleic acid sequences: p distance (uncorrected distance), abs (absolute distance, not normalized by the number of sites), jc (Jukes-Cantor), tajnei (Tajima-Nei), k2p (Kimura 2-parameter), f81 (Felsenstein 1981), f84 (Felsenstein 1984), hky85 (Hasegawa-Kishino-Yano, 1985), k3p (Kimura 3-parameter), tamnei (Tamura-Nei), gtr (general time-reversible), logdet (log determinant), ml (maximum likelihood distance), and a custom distance, which allows you to design your own distance correction method by means of the -CLAss and -BASEFReq parameters. When -DISTance=ml, any of the likelihood optional parameters can be used.
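To make the idea of a distance correction concrete, here is a minimal Python sketch of two of the nucleotide methods listed above: the uncorrected p distance and the Jukes-Cantor correction, using the standard formula d = -(3/4) ln(1 - 4p/3). The function names are hypothetical, and the sketch simply skips any site where either sequence has a gap or ambiguity code, which is an assumption made for brevity rather than a description of PAUP's handling.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of compared sites at which two aligned sequences differ."""
    pairs = [
        (a, b)
        for a, b in zip(seq1.upper(), seq2.upper())
        if a in "ACGT" and b in "ACGT"  # assumption: skip gaps/ambiguities
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)


def jc_distance(seq1, seq2):
    """Jukes-Cantor corrected distance: -(3/4) * ln(1 - 4p/3)."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    s1, s2 = "ACGTACGT", "ACGTACGA"
    print(p_distance(s1, s2), jc_distance(s1, s2))
```

The corrected distance is always at least as large as the p distance, because it accounts for multiple substitutions at the same site.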
-CLAss=a,a,a,a,a,a is used when -DISTance=custom to specify which of the six possible substitution types fall into the same class. The order of the substitutions in the parameter list is AC, AG, AT, CG, CT, GT. Classes are designated by the letters a through f. In the default case shown above, all six substitution types are assigned to the same class. To assign transitions (AG and CT) and transversions (AC, AT, CG, GT) to be in two separate classes, you would specify -CLAss=a,b,a,a,b,a.
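The way a class list maps onto the six substitution types can be checked with a few lines of Python. This is only an illustration of the ordering rule described above; parse_classes is a hypothetical helper, not a PAUPSearch function.

```python
SUBSTITUTION_TYPES = ("AC", "AG", "AT", "CG", "CT", "GT")
VALID_LABELS = set("abcdef")


def parse_classes(value):
    """Group the six substitution types by the class letters in `value`."""
    labels = value.split(",")
    if len(labels) != len(SUBSTITUTION_TYPES):
        raise ValueError("exactly six class letters are required")
    groups = {}
    for substitution, label in zip(SUBSTITUTION_TYPES, labels):
        if label not in VALID_LABELS:
            raise ValueError("class letters must be a through f")
        groups.setdefault(label, []).append(substitution)
    return groups


if __name__ == "__main__":
    print(parse_classes("a,a,a,a,a,a"))  # one class for everything
    print(parse_classes("a,b,a,a,b,a"))  # transversions vs. transitions
```

For a,b,a,a,b,a the sketch puts AG and CT (the transitions) in class b and the four transversions in class a.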
-BASEFReq=empirical is used when -DISTance=custom to specify whether the base frequencies in the sequence alignment should be considered equal or if they should be calculated from the data (empirical). The default is empirical.
-MISSDist=infer specifies how to treat gaps and ambiguous bases in an alignment when computing distances between nucleic acid sequences. When -MISSDist=ignore, the program does not take any sites containing gaps or ambiguous bases into account when computing the pairwise distances between sequences. The default value, infer, directs the program to guess which nucleotide a gap or ambiguous base represents based on the composition of the sequence data. When both nucleotides of a sequence pair are maximally ambiguous at a site (gap or N), the site is ignored even if infer is specified.
-SUBST=all estimates distances based on all substitutions (all), transitions only (ti), or transversions only (tv). This parameter is ignored when -DISTance=logdet.
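The distinction between transitions and transversions can be sketched in Python. A transition exchanges two purines (A, G) or two pyrimidines (C, T); any other change is a transversion. The function below is hypothetical and only counts differences between two aligned sequences; it does not reproduce how PAUP turns those counts into a distance estimate, and it skips sites with gaps or ambiguity codes for simplicity.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_substitutions(seq1, seq2, subst="all"):
    """Count differing sites: all of them, transitions (ti), or transversions (tv)."""
    ti = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # identical site, or gap/ambiguity (skipped here)
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            ti += 1
        else:
            tv += 1
    return {"all": ti + tv, "ti": ti, "tv": tv}[subst]


if __name__ == "__main__":
    for mode in ("all", "ti", "tv"):
        print(mode, count_substitutions("AGCT", "GACA", mode))
```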
-RATes=equal sets the model for the substitution rate variation across sites. The substitution rate can be equal at all sites (the default) or can vary according to the gamma distribution.
-SHApe=0.5 sets the value of the shape parameter of the gamma distribution equation when -RATes=gamma. The default is 0.5, and the value must be greater than 0.0.
-NEGBRlen=setzero sets how negative branch lengths are treated if they occur in a tree. You can allow negative branch lengths, or you can specify one of the following: prohibit (branch lengths are optimized under the constraint that they be nonnegative), setzero (sets the negative branch length to 0.0 without affecting any of the other branch lengths), or setabs (resets the negative branch length to its absolute value without affecting any of the other branch lengths).
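The three treatments that act on individual branches can be sketched in Python as shown below. The function adjust_branch_lengths is hypothetical. The prohibit option is deliberately not implemented, because it requires re-optimizing all branch lengths under a non-negativity constraint rather than adjusting single values.

```python
def adjust_branch_lengths(lengths, negbrlen="setzero"):
    """Apply a -NEGBRlen style treatment to a list of branch lengths."""
    if negbrlen == "allow":
        return list(lengths)                 # leave negative values alone
    if negbrlen == "setzero":
        return [max(0.0, x) for x in lengths]  # negative values become 0.0
    if negbrlen == "setabs":
        return [abs(x) for x in lengths]       # negative values flip sign
    if negbrlen == "prohibit":
        raise NotImplementedError(
            "prohibit needs constrained re-optimization of the whole tree"
        )
    raise ValueError("unknown treatment: " + negbrlen)


if __name__ == "__main__":
    branches = [0.1, -0.02, 0.3]
    for mode in ("allow", "setzero", "setabs"):
        print(mode, adjust_branch_lengths(branches, mode))
```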
-LOGITer displays the iteration log.
-TSEED=13 sets the initial seed for the random number generator to 13.
-NREPs=1000 sets the number of random trees to evaluate to 1000.
-FDType=barchart outputs the frequency distribution as a BARchart or as a HISTogram.
-NCLAsses=20 sets the number of frequency distribution "bins" when -FDType=HISTogram.
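As an illustration of what dividing a score distribution into a fixed number of bins means, the following Python sketch counts tree scores in equal-width classes. The function score_histogram is hypothetical, and the choice of equal-width bins spanning the lowest to the highest score is an assumption; the source does not state how PAUP places the class boundaries.

```python
def score_histogram(scores, nclasses=20):
    """Count scores in `nclasses` equal-width bins from min to max score."""
    lowest, highest = min(scores), max(scores)
    # Guard against a zero width when every score is identical.
    width = (highest - lowest) / nclasses or 1.0
    counts = [0] * nclasses
    for score in scores:
        # The highest score would land one past the end, so clamp it.
        index = min(int((score - lowest) / width), nclasses - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    sample = [10, 11, 12, 13, 14, 20]
    for i, n in enumerate(score_histogram(sample, nclasses=5)):
        print(i, "#" * n)
```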
-FREQfile[=paup.paupfreq] creates a separate output file containing the frequency distribution in numerical rather than graphical form. This can be used as input to a spreadsheet program.
-SEArch=h sets the search method to use for bootstrapping. There are three options: H (heuristic search, the default), B (branch-and-bound search), and N (neighbor-joining algorithm). When -SEArch=H, any of the heuristic search optional parameters can be used. When -SEArch=B, any of the branch-and-bound search optional parameters can be used. When -SEArch=N, the distance parameter -DISTance can be used.
-BSEED=33 sets the initial seed for the random number generator that determines which columns of the alignment will be used to form a bootstrap replicate data set.
-BREPs=1000 sets the number of bootstrap replicates (resampled data sets) to create and analyze.
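Bootstrap resampling of an alignment draws columns at random, with replacement, until the replicate has as many columns as the original. The Python sketch below illustrates the idea with hypothetical function names. It uses Python's own random number generator, so a given seed will not reproduce the replicates that PAUP generates from the same seed.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (one replicate)."""
    n_columns = len(alignment[0])
    columns = [rng.randrange(n_columns) for _ in range(n_columns)]
    return ["".join(seq[i] for i in columns) for seq in alignment]


def bootstrap_replicates(alignment, breps=1000, bseed=33):
    """Create `breps` replicate data sets from a seeded generator."""
    rng = random.Random(bseed)  # same seed gives the same replicates
    return [bootstrap_replicate(alignment, rng) for _ in range(breps)]


if __name__ == "__main__":
    aligned = ["ACGT", "AGGT", "ACTT"]
    for replicate in bootstrap_replicates(aligned, breps=3):
        print(replicate)
```

Each replicate would then be analyzed with the chosen search method, and the resulting trees summarized in a consensus.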
-CONlevel=50 includes in the bootstrap consensus tree only those groups that occurred in more than 50 percent of the bootstrap replicates. (If -KEEPALL is used, any value you set with -CONlevel is ignored, in effect setting -CONlevel=50.)
-KEEPALL retains all groups in the bootstrap consensus that are compatible with all of the groups that are already in the consensus. The default is -NOKEEPALL. (-KEEPALL overrides any value you set with -CONlevel, in effect setting -CONlevel=50.)
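The effect of the consensus level and of retaining compatible groups can be sketched in Python. Everything here is a simplified illustration under stated assumptions: groups are treated as rooted clusters of sequence names (two clusters are compatible if they are disjoint or one contains the other), and additional compatible groups are added in descending order of frequency. The function names are hypothetical and the sketch is not PAUP's algorithm.

```python
def compatible(group_a, group_b):
    """Two clusters can share a tree if they are nested or disjoint."""
    return (
        group_a.isdisjoint(group_b)
        or group_a <= group_b
        or group_b <= group_a
    )


def consensus_groups(group_counts, breps, conlevel=50, keepall=False):
    """Pick the groups that go into the bootstrap consensus.

    group_counts maps a frozenset of sequence names to the number of
    replicates in which that group appeared.
    """
    ranked = sorted(group_counts.items(), key=lambda item: -item[1])
    threshold = 50 if keepall else conlevel  # keepall acts like level 50
    chosen = [g for g, n in ranked if 100.0 * n / breps > threshold]
    if keepall:
        for group, _ in ranked:
            if group not in chosen and all(compatible(group, c) for c in chosen):
                chosen.append(group)
    return chosen


if __name__ == "__main__":
    counts = {
        frozenset("AB"): 90,
        frozenset("ABC"): 60,
        frozenset("DE"): 45,
        frozenset("BC"): 40,
    }
    print(consensus_groups(counts, breps=100))
    print(consensus_groups(counts, breps=100, keepall=True))
```

In this toy example the group DE, seen in 45 percent of replicates, is added only when compatible groups are retained, while BC is rejected because it conflicts with AB.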
Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com
Copyright (c) 1982-2001 Genetics Computer Group, Inc. A subsidiary of Pharmacopeia, Inc. All rights reserved.
Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.