Distances creates a table of the pairwise distances within a group of aligned sequences.
Distances writes a matrix of the pairwise evolutionary distances between aligned sequences. The distances are expressed as substitutions per 100 bases or amino acids. Several methods may be chosen to correct the distances for multiple substitutions at a site. For nucleic acid sequences, these methods are Kimura's two-parameter method, the Tajima-Nei method, the Jin-Nei gamma distance method, and the Tamura method; for protein sequences, the Kimura method; and for either type of sequence, the Jukes-Cantor method. It is also possible to obtain an uncorrected distance.
Here is a session using Distances to determine distances between the aligned sequences in the file hum_gtr.msf.
  % distances

  DISTANCES for what aligned sequences ?  hum_gtr.msf{*}

  Reading sequences...

     gtr1_human:  548 total,  548 read
     gtr1_human:  548 total,  548 read
     gtr1_human:  548 total,  548 read
     gtr1_human:  548 total,  548 read
     gtr1_human:  548 total,  548 read

  Distances will be computed for 5 protein sequences.

  Which distance correction method to use ?

     1  Uncorrected distance
     2  Jukes-Cantor distance
     3  Kimura protein distance

  Choose the method to use: (* 3 *)

  What should I call the distance matrix file (* hum_gtr.distances *) ?

  Computing distances using Kimura method...

     1 x 2:   48.61
     1 x 3:   45.50
     1 x 4:   65.74
     1 x 5:  107.70
     2 x 3:   61.53
     2 x 4:   74.57
     2 x 5:  113.82
     3 x 4:   68.93
     3 x 5:  104.43
     4 x 5:  110.86

  Statistics on pairwise distances:

     5 of 10 pairs have distances exceeding 70.0.

  %
Here is the 5 x 5 distance matrix created during the example session:
  DISTANCES between protein sequences in: hum_gtr.msf{*}  October 20, 1998 13:00

  Correction method: Kimura protein distance
  Distances are: estimated number of substitutions per 100 amino acids

  Symmatrix version 1
  Number of matrices: 1

  //
  Matrix 1, dimension: 5

  Key for column and row indices:

     1  gtr1_human
     2  gtr3_human
     3  gtr4_human
     4  gtr2_human
     5  gtr5_human

  Matrix 1: Part 1

                1        2        3        4        5
  ____________________________________________________________ ..
  |  1 |     0.00    48.61    45.50    65.74   107.70
  |  2 |              0.00    61.53    74.57   113.82
  |  3 |                       0.00    68.93   104.43
  |  4 |                                0.00   110.86
  |  5 |                                         0.00
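The summary line printed at the end of the session ("5 of 10 pairs have distances exceeding 70.0") can be checked against the matrix values. The following is a minimal sketch, not part of the Wisconsin Package; the dictionary layout and the function name `count_exceeding` are assumptions made for illustration, and the numbers are copied from the example matrix.

```python
# Upper-triangle values copied from the example matrix
# (substitutions per 100 amino acids), keyed by (row, column).
DISTANCES = {
    (1, 2): 48.61, (1, 3): 45.50, (1, 4): 65.74, (1, 5): 107.70,
    (2, 3): 61.53, (2, 4): 74.57, (2, 5): 113.82,
    (3, 4): 68.93, (3, 5): 104.43,
    (4, 5): 110.86,
}


def count_exceeding(distances, threshold):
    """Return (number of pairs above threshold, total number of pairs)."""
    over = sum(1 for d in distances.values() if d > threshold)
    return over, len(distances)


over, total = count_exceeding(DISTANCES, 70.0)
print(f"{over} of {total} pairs have distances exceeding 70.0.")
```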
Distances accepts multiple sequences (two or more) all of the same type. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. The function of Distances depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, see Appendix VI for information on how to change or set the type of a sequence.
PileUp creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. LineUp creates and edits multiple sequence alignments. Pretty displays multiple sequence alignments.
The Wisconsin Package includes several programs for evolutionary analysis of multiple sequence alignments. Distances creates a matrix of pairwise distances between the sequences in a multiple sequence alignment. Diverge measures the number of synonymous and nonsynonymous substitutions per site of two or more aligned protein coding regions and can output matrices of these values. GrowTree reconstructs a tree from a distance matrix or a matrix of synonymous or nonsynonymous substitutions. PAUPSearch reconstructs phylogenetic trees from a multiple sequence alignment using parsimony, distance, or maximum likelihood criteria; PAUPDisplay can manipulate and display the trees output by PAUPSearch and can also plot the trees output by GrowTree.
The sequences must be aligned properly for Distances to work. Since Distances does not create alignments, it is your responsibility to ensure that the sequences specified by a list file or wild-card file specification are in alignment before using them as input to Distances. One way to verify this is to use Pretty to display the sequences; if the Pretty output shows an acceptable alignment, the sequences are suitable for use with Distances.
Distances examines each pair of aligned sequences symbol-by-symbol and counts the number of exact matches, partial matches, and gap symbols. If the sequences are nucleic acids, transitions (purine-purine or pyrimidine-pyrimidine substitutions) and transversions (purine-pyrimidine substitutions) are also tallied. These counts are used, where appropriate, to compute the distance.
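The symbol-by-symbol tally described above can be sketched in a few lines. This is an illustration only, not the program's actual code: the function name `tally_pair`, the set of gap characters, and the decision to skip ambiguous symbols are all assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES
GAP_SYMBOLS = {"-", ".", "~"}  # assumed gap characters


def tally_pair(seq1, seq2):
    """Count matches, transitions, transversions, and gaps for two aligned
    nucleic acid sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    counts = {"scored": 0, "matches": 0, "transitions": 0,
              "transversions": 0, "gaps": 0}
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in GAP_SYMBOLS or b in GAP_SYMBOLS:
            counts["gaps"] += 1
            continue
        if a not in BASES or b not in BASES:
            continue  # ambiguous symbols are skipped in this sketch
        counts["scored"] += 1
        if a == b:
            counts["matches"] += 1
        elif (a in PURINES) == (b in PURINES):
            counts["transitions"] += 1    # purine-purine or pyrimidine-pyrimidine
        else:
            counts["transversions"] += 1  # purine-pyrimidine
    return counts


print(tally_pair("AAGCT-", "AGTCC."))
```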
When sequences are very closely related, the observed distance and the actual distance between two sequences are nearly equal. As the time since the sequences diverged increases, the probability that more than one substitution occurred at a single site also increases. Therefore, for all but closely related sequences, the observed distance between the sequences underestimates the true distance.
In order to construct a valid tree, the observed distances must be corrected to account for multiple substitutions at a single site. A number of methods have been devised to make this correction. Each makes different assumptions about the substitution process.
The uncorrected distance method computes the observed distance between sequences, with no correction for multiple substitutions. This uncorrected distance is sometimes referred to as the p-distance. It can be used for either nucleic acid or protein sequences, and gap positions can be factored into the calculation or ignored. A match score is computed by summing the number of exact matches. If -AMBIGuous is used, partial matches between ambiguous symbols also contribute to the match score as fractional scores (for example, the nucleotide W matched with A would score 0.5, while N matched with A would score 0.25). The similarity S is the match score divided by the sum of the number of positions scored and the gap term, where the gap term is the number of gap positions times the gap penalty: S = match score / (positions scored + gap positions * gap penalty). The distance is 1 - S. Gaps are ignored unless a nonzero value is specified for -GAPweight. End gaps are penalized as much as internal gaps, so if you choose to apply a gap penalty and gaps exist at the beginning and/or end of some of the sequences in the alignment, make sure to set the beginning and ending coordinates to exclude these regions.
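The calculation can be sketched as follows for nucleic acid sequences. This is a hedged illustration, not the program's code. The names `match_score` and `uncorrected_distance`, the gap characters, and the rule used for partial matches (shared bases divided by the product of the two set sizes) are assumptions; that rule was chosen only because it reproduces the two examples in the text (W with A scores 0.5, N with A scores 0.25). The sketch also assumes that every non-gap position counts as scored.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
GAPS = "-.~"  # assumed gap characters


def match_score(a, b, ambiguous=False):
    """Score one aligned pair of symbols (1.0 for an exact match)."""
    set_a, set_b = IUPAC.get(a, a), IUPAC.get(b, b)
    if a == b and len(set_a) == 1:
        return 1.0
    if not ambiguous:
        return 0.0
    # Assumed partial-match rule: shared bases / (size of a * size of b).
    shared = set(set_a) & set(set_b)
    return len(shared) / (len(set_a) * len(set_b))


def uncorrected_distance(seq1, seq2, gap_weight=0.0, ambiguous=False):
    """Return 1 - S, with S = matches / (scored + gaps * gap_weight)."""
    matches, scored, gaps = 0.0, 0, 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in GAPS or b in GAPS:
            gaps += 1
            continue
        scored += 1
        matches += match_score(a, b, ambiguous)
    denominator = scored + gaps * gap_weight
    if denominator == 0:
        raise ValueError("no positions to score")
    return 1.0 - matches / denominator


print(uncorrected_distance("ACGT--", "ACGAAA"))                  # gaps ignored
print(uncorrected_distance("ACGT--", "ACGAAA", gap_weight=1.0))  # gaps penalized
```

With the default gap weight of 0.0 the two trailing gap positions have no effect; with a gap weight of 1.0 they enlarge the denominator and the distance rises, which is why end gaps should be excluded with the beginning and ending coordinates when a gap penalty is used.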
The Jukes-Cantor method for correcting distances can be used for nucleic acid or protein sequences. Gap positions can be factored into the equation by specifying a nonzero value for -GAPweight, and partial matches between ambiguous symbols can contribute to the match score if -AMBIGuous is used. The uncorrected distance D is computed and then corrected to account for multiple substitutions at a site using the equation $\text{distance} = -b \ln(1 - D/b)$. The parameter b is 3/4 for nucleic acid sequences and 19/20 for protein sequences. End gaps are penalized as much as internal gaps, so if you choose to apply a gap penalty and gaps exist at the beginning and/or end of some of the sequences in the alignment, make sure to set the beginning and ending coordinates to exclude these regions.
The Jukes-Cantor method is based on two assumptions: that substitution occurs at any site along the sequence with equal probability, and that the probability of a change from one nucleotide to any of the other three nucleotides or from one amino acid to any of the other 19 amino acids is the same. These assumptions tend to break down as divergence time increases, so this correction method underestimates the true distance for more distantly related sequences.
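A minimal sketch of the Jukes-Cantor correction, distance = -b ln(1 - D/b), with b = 3/4 for nucleic acids and 19/20 for proteins. The function name `jukes_cantor` and its `protein` flag are assumptions for illustration. D is expressed as a fraction of sites, so multiply the result by 100 to obtain substitutions per 100 positions.

```python
import math


def jukes_cantor(D, protein=False):
    """Correct an uncorrected distance D (fraction of differing sites)."""
    b = 19 / 20 if protein else 3 / 4
    if D >= b:
        raise ValueError("D is too large for the Jukes-Cantor correction")
    return -b * math.log(1.0 - D / b)


# An uncorrected protein distance of 70 per 100 amino acids corresponds to
# a Jukes-Cantor estimate of about 127 per 100.
print(round(100 * jukes_cantor(0.70, protein=True)))
```

The correction grows without bound as D approaches b, which is why the sketch refuses values of D at or above b.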
The Tajima-Nei method applies to nucleic acid sequences only. It uses the same equation as the Jukes-Cantor method, except that the parameters are calculated somewhat differently: the value of the parameter b varies with the base composition of the sequence pairs. In addition, only exact matches are considered in computing the match score, and gap positions are always ignored. In the equations below, A=1, T=2, C=3, G=4.
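$$b = \frac{1}{2}\left[\,1 - \sum_{i=A}^{G} \text{fraction}[i]^{2} + \frac{D^{2}}{h}\right]$$

$$h = \sum_{i=A}^{C}\;\sum_{k=i+1}^{G} \frac{\text{pairfreq}[i,k]^{2}}{2\,\text{fraction}[i]\,\text{fraction}[k]}$$

$$\text{distance} = -b \ln\left(1 - \frac{D}{b}\right)$$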
Tajima and Nei, Mol. Biol. Evol. 1; 269-285 (1984), equation 6.
The Tajima-Nei correction method makes two assumptions: substitution occurs at any site along the sequence with equal probability, and substitution occurs according to the "equal input" model of nucleotide substitution. The equal input model assumes that the rate of substitution to a given nucleotide is the same, regardless of the original nucleotide, i.e., that a change from A to T has the same rate as the change from G to T. If these assumptions do not hold, the method underestimates the true distance as the distance increases.
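The following sketch follows the estimator published by Tajima and Nei (1984): b = (1/2)[1 - sum of fraction[i]^2 + D^2/h], where h sums pairfreq[i,k]^2 / (2 fraction[i] fraction[k]) over the six unordered pairs of different bases. It is an illustration under stated assumptions, not the program's code: the function name `tajima_nei` is hypothetical, fraction[i] is taken as the frequency of base i averaged over both sequences, pairfreq[i,k] as the fraction of scored positions showing the unordered pair i and k, and any position containing a symbol other than A, T, C, or G is skipped.

```python
import math
from collections import Counter

BASES = "ATCG"  # same order as the manual: A=1, T=2, C=3, G=4


def tajima_nei(seq1, seq2):
    """Tajima-Nei distance (substitutions per site) for two aligned sequences."""
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in BASES and b in BASES]
    n = len(pairs)
    if n == 0:
        raise ValueError("no scorable positions")
    # fraction[i]: frequency of base i, averaged over both sequences.
    base_counts = Counter(a for a, _ in pairs) + Counter(b for _, b in pairs)
    fraction = {x: base_counts[x] / (2 * n) for x in BASES}
    # pair_counts: number of positions showing each unordered mismatch pair.
    pair_counts = Counter(frozenset(p) for p in pairs if p[0] != p[1])
    D = sum(pair_counts.values()) / n
    if D == 0:
        return 0.0
    h = 0.0
    for i in range(3):
        for k in range(i + 1, 4):
            x = pair_counts[frozenset((BASES[i], BASES[k]))] / n
            if x > 0:
                h += x * x / (2 * fraction[BASES[i]] * fraction[BASES[k]])
    b = 0.5 * (1 - sum(f * f for f in fraction.values()) + D * D / h)
    if D >= b:
        raise ValueError("D is too large for the Tajima-Nei correction")
    return -b * math.log(1 - D / b)


# Equal base frequencies and equally frequent mismatch types give b = 3/4,
# so the result coincides with the Jukes-Cantor distance.
seq1 = "ATCG" * 3 + "AAATTC"
seq2 = "ATCG" * 3 + "TCGCGG"
print(tajima_nei(seq1, seq2))
```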
The Kimura two-parameter method applies only to nucleic acids and takes into consideration the fact that transition substitutions (purine-purine or pyrimidine-pyrimidine) often occur much more frequently than transversion substitutions (purine-pyrimidine). Gap positions and ambiguous symbols other than R (purine) and Y (pyrimidine) are not scored.
This method gives better distance estimates than the Jukes-Cantor method when the rates of transitional and transversional substitutions are different. However, when the substitution pattern is more complex than this, this method underestimates the true distance for distantly related sequences.
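No equation for the two-parameter correction appears in this section, so the sketch below assumes the commonly cited form of Kimura's two-parameter distance, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the fractions of scored positions showing transitions and transversions. The function name `kimura_2p` is hypothetical, and the program's own implementation may differ in detail.

```python
import math


def kimura_2p(P, Q):
    """Two-parameter distance from transition (P) and transversion (Q) fractions."""
    x = 1.0 - 2.0 * P - Q
    y = 1.0 - 2.0 * Q
    if x <= 0 or y <= 0:
        raise ValueError("P and Q are too large for this correction")
    return -0.5 * math.log(x) - 0.25 * math.log(y)


# When transversions are exactly twice as frequent as transitions (no
# transition bias), the result equals the Jukes-Cantor distance for D = P + Q.
print(kimura_2p(0.1, 0.2))
print(-0.75 * math.log(1 - 0.3 / 0.75))
```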
The Tamura method applies only to nucleic acids and assumes that substitution occurs at any site along the sequence with equal probability. It takes different rates of transitions and transversions into account and also takes into account deviation of G+C content from the expected value of 50 percent. Gap positions and ambiguous symbols are not scored.
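$$\theta_1 = \text{fraction G+C in sequence 1}$$

$$\theta_2 = \text{fraction G+C in sequence 2}$$

$$C = \theta_1 + \theta_2 - 2\,\theta_1\theta_2$$

$$\text{distance} = -C \ln\left(1 - \frac{P}{C} - Q\right) - \frac{1}{2}\,(1 - C)\,\ln(1 - 2Q)$$

Here P and Q are the fractions of scored positions that show transitions and transversions, respectively.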
When there are strong transition-transversion and G+C-content biases, this method can yield better distance estimates than the Jukes-Cantor, Kimura two-parameter, or Tajima-Nei methods. Tamura recommends that it be used only when the corrected distance is "not very large," and implies that estimated distances greater than 50 substitutions per 100 bases may not be accurate.
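A minimal sketch of the Tamura correction, using C = theta1 + theta2 - 2 theta1 theta2 and distance = -C ln(1 - P/C - Q) - 0.5 (1 - C) ln(1 - 2Q), with P and Q the fractions of transitions and transversions. The function name `tamura` and its argument order are assumptions for illustration.

```python
import math


def tamura(P, Q, theta1, theta2):
    """Tamura distance from transition/transversion fractions and G+C content."""
    C = theta1 + theta2 - 2.0 * theta1 * theta2
    if C <= 0:
        raise ValueError("G+C fractions give C = 0; the correction is undefined")
    x = 1.0 - P / C - Q
    y = 1.0 - 2.0 * Q
    if x <= 0 or y <= 0:
        raise ValueError("P and Q are too large for this correction")
    return -C * math.log(x) - 0.5 * (1.0 - C) * math.log(y)


# With G+C content of exactly 50 percent in both sequences, C = 0.5 and the
# expression reduces to -0.5 ln(1 - 2P - Q) - 0.25 ln(1 - 2Q).
print(tamura(0.1, 0.05, 0.5, 0.5))
print(tamura(0.1, 0.05, 0.7, 0.65))
```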
The Jin-Nei gamma distance is another method that applies only to nucleic acids and that takes transitions and transversions into account. Gap positions and ambiguous symbols other than R and Y are not scored. This method is designed to be used when the substitution rate varies extensively from site to site. The shape parameter a is the square of the inverse of the coefficient of variation of the substitution rate among sites.
$$P = \frac{\text{transitions}}{\text{nScored}}$$

$$Q = \frac{\text{transversions}}{\text{nScored}}$$

$$\text{distance} = \frac{1}{2}\,a\left[(1 - 2P - Q)^{-1/a} + \frac{1}{2}\,(1 - 2Q)^{-1/a} - \frac{3}{2}\right]$$
Jin and Nei, Mol. Biol. Evol. 7; 82-102 (1990).
The gamma distance correction is based on the assumption that the nucleotide substitution rate varies from site to site according to the gamma distribution.
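A minimal sketch of the gamma distance, distance = (1/2) a [(1 - 2P - Q)^(-1/a) + (1/2)(1 - 2Q)^(-1/a) - 3/2], with P and Q the fractions of transitions and transversions. The function name `jin_nei_gamma` is an assumption; the default a = 1.0 mirrors the default shown for -APARAMeter.

```python
import math


def jin_nei_gamma(P, Q, a=1.0):
    """Jin-Nei gamma distance for shape parameter a."""
    if a <= 0:
        raise ValueError("the shape parameter a must be positive")
    x = 1.0 - 2.0 * P - Q
    y = 1.0 - 2.0 * Q
    if x <= 0 or y <= 0:
        raise ValueError("P and Q are too large for this correction")
    return 0.5 * a * (x ** (-1.0 / a) + 0.5 * y ** (-1.0 / a) - 1.5)


# Smaller values of a (more rate variation among sites) give larger distances.
# As a grows, the rates become uniform and the value approaches
# -0.5 ln(1 - 2P - Q) - 0.25 ln(1 - 2Q).
for a in (0.5, 1.0, 2.0, 1e4):
    print(a, jin_nei_gamma(0.1, 0.05, a))
```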
The Kimura protein distance method applies only to proteins. The formula calculates distances based on the relationship between observed amino acid substitutions and actual (corrected) substitutions that was derived by Dayhoff and coworkers. Gap positions are ignored, and only exact matches contribute to the match score.
M. Kimura, The Neutral Theory of Molecular Evolution, Cambridge University Press, Cambridge, 1983.
This method overestimates the true distance when the uncorrected distance is greater than about 70 observed substitutions per 100 amino acids (equivalent to a Jukes-Cantor distance estimate of about 127 substitutions per 100 amino acids).
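The formula itself is not reproduced in this section. The sketch below assumes the widely used empirical approximation attributed to Kimura (1983), distance = -ln(1 - D - 0.2 D^2), where D is the uncorrected distance as a fraction of sites; whether the program uses exactly this expression is an assumption. The function name `kimura_protein` is hypothetical.

```python
import math


def kimura_protein(D):
    """Kimura's empirical protein distance (assumed form: -ln(1 - D - 0.2 D^2))."""
    arg = 1.0 - D - 0.2 * D * D
    if arg <= 0:
        raise ValueError("D is too large for this correction")
    return -math.log(arg)


# Compare with the Jukes-Cantor protein correction (b = 19/20) at several
# uncorrected distances, reported per 100 amino acids.
for D in (0.1, 0.3, 0.5, 0.7):
    jc = -0.95 * math.log(1 - D / 0.95)
    print(f"D={D:.1f}  Kimura={100 * kimura_protein(D):6.1f}  Jukes-Cantor={100 * jc:6.1f}")
```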
The single most critical step in tree reconstruction is the sequence alignment. If the alignment is poorly done, no amount of care or tweaking of analysis parameters will guarantee a correct tree. Multiple alignments that are created by computerized methods such as PileUp will need to be inspected and edited by hand, using an editor such as LineUp or SeqLab. Be especially careful with nucleic acid sequences that are coding regions, since computerized alignment methods have no knowledge of codon boundaries. They may insert a gap whose length is not a multiple of three or may insert a gap in the middle of a codon, for example.
Once the alignment is satisfactory, you must decide whether to use the entire alignment, or only portions of it. Only homologous regions of the sequences should be used to reconstruct a tree. Any regions of an alignment that contain data for which no homologs occur in the other sequences should be eliminated from consideration. For example, if there are gap characters at the beginning or end of one or more sequences in the alignment, the sequence data at the extremes of the alignment should not be used, since the longer sequences contain regions that have no homologs in the shorter sequences. Similarly, regions in the interior of the alignment that contain gaps in some of the sequences should probably be edited out of the alignment before trying to reconstruct a tree.
Some biological phenomena can interfere with tree reconstruction. Gene duplication is one of them. When genes are duplicated (by polyploidy or by regional duplication), one of the copies often accumulates mutations and either acquires a different function than the original gene or becomes a pseudogene. In this situation, it is often unclear which of the alternative loci will give the correct tree for the functional gene. Another complication is recombination: if recombination has occurred between sequences in the data set, no single tree can correctly explain the data.
Some data sets can also confound the existing methods for tree construction. For example, a set of sequences consisting of mostly closely related sequences with a few very divergent sequences cannot be analyzed using parsimony or a distance method based on an improperly corrected distance matrix. These methods will systematically group the widely diverged sequences together as sister groups, even if they actually belong to different lineages. If you don't want to drop the diverged sequences from the analysis, you will need to add sequences to the alignment that bridge the distance between the more distant sequences and the group of closely related sequences, or use a distance method based on a properly corrected distance matrix.
Another consideration when computing distances between coding regions is whether to use all three nucleotides in each codon or just the first or second. The substitution rate at the third codon position is usually much higher than that at the other two positions because of the degeneracy of the genetic code. In these cases, it might be best to use just the first position or just the first two positions of each codon to compute the distances.
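Restricting an aligned coding sequence to chosen codon positions can be sketched as below. This is an illustration only: the function name `select_codon_positions` is hypothetical, and the sketch assumes the alignment starts at the first position of a codon and that gaps have been placed in multiples of three so the reading frame is preserved.

```python
def select_codon_positions(seq, positions=(1, 2)):
    """Keep only the given codon positions (1, 2, or 3) of an in-frame sequence."""
    wanted = {p - 1 for p in positions}
    return "".join(c for i, c in enumerate(seq) if i % 3 in wanted)


coding = "ATGGCCTAA"  # codons: ATG GCC TAA
print(select_codon_positions(coding, (1,)))    # first positions only
print(select_codon_positions(coding, (1, 2)))  # first and second positions
```

Apply the same selection to every sequence in the alignment before computing distances, so that the columns stay aligned.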
It is important to use the proper correction method when computing distances, unless the sequences are all very closely related. Some guidelines for choosing a correction method are listed under the SUGGESTIONS topic.
If the aligned sequences are not in an MSF file format, use Pretty to display the aligned sequences you pass to Distances. If they look properly aligned in the Pretty display, they will work sensibly with Distances.
To get the best nucleotide alignments of coding regions, you also should align the sequences at the protein level and adjust the nucleotide alignment to conform to the amino acid alignment. You can do this manually using LineUp or SeqLab.
One way of detecting the presence of recombination in your sequence set is to reconstruct trees from different sections of the alignment. If different trees are found for different sections, it's possible that recombination has occurred.
To check the distance distribution of your sequences, create an uncorrected distance matrix from the alignment (using Distances) and examine the contents. If there are mostly closely related sequences with a few very divergent sequences, you must either add sequences to the alignment to bridge the distance between the more distant sequences and the group of closely related sequences, or you must use a distance method based on a properly corrected distance matrix.
Jin and Nei, Mol. Biol. Evol. 7; 82-102 (1990), give a set of guidelines for choosing a distance correction method for nucleic acid sequences. Here is a summary of their suggestions.
First compute the distances using the Jukes-Cantor method. If all the distances are less than or equal to 10 substitutions per 100 bases, there is no need to use another method (all the correction methods calculate about the same distances for closely related sequences). If the distances are greater than 10 substitutions per 100 bases, choose a correction method based on the following criteria:
- If the Jukes-Cantor distances are between 10 and 30 substitutions and there is a difference in the transition and transversion rates, use the Kimura two-parameter distance.
- If the Jukes-Cantor distances are between 30 and 100 substitutions and there is evidence that the substitution rate varies extensively from site to site, use the Jin-Nei gamma distance with -APARAMeter=1.0. If the distances lie between 30 and 100 and the frequencies of the four nucleotides deviate substantially from equality, use the Tajima-Nei distance.
- If the Jukes-Cantor distance is greater than 100 for many pairs of sequences, the tree that will be constructed from the distance data will not be reliable. Depending on your data, and the reason that you are computing the distances, one of the following suggestions may help:
a. For coding regions, try using just the first codon position or the first and second codon positions when computing the distances.
b. For coding regions, align the protein sequences and compute the distances as amino acid substitutions.
c. If you know that a certain region of the sequence is evolving very rapidly compared to the rest of the sequence, edit the alignment with LineUp to eliminate this region, and recompute the distances.
If there is a strong G+C content bias as well as a difference in transition and transversion rates, use the Tamura distance.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
  Minimal Syntax: % distances [-INfile=]hum_gtr.msf{*} -Default

  Prompted Parameters:

  [-OUTfile=]hum_gtr.distances  names the output file

  Correction Methods for Nucleic Acid Sequences

  -MENu=1  uncorrected distance
        2  Jukes-Cantor distance
        3  Kimura 2-parameter distance
        4  Jin-Nei gamma distance
        5  Tajima-Nei distance
        6  Tamura distance

  Correction Methods for Protein Sequences

  -MENu=1  uncorrected distance
        2  Jukes-Cantor distance
        3  Kimura protein distance

  Local Data Files: None

  Optional Parameters:

  -BEGin=1 -END=100      sets the range of interest
  -FILe=hum_gtr.report   names the table of counts used to calculate distances
  -AMBIGuous             considers partial matches between ambiguous symbols
  -POSition=5            sets base position(s) to consider
  -GAPweight=0.0         sets gap penalty (uncorrected and Jukes-Cantor only)
  -APARAMeter=1.0        sets 'a' parameter (Jin-Nei gamma distance only)
  -NOMONitor             suppresses screen display of the progress of the analysis
Local data files: none.
You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
-MENu=1 sets the distance correction method to use. For nucleic acid sequences, these are (in order): uncorrected distance, Jukes-Cantor distance, Kimura 2-parameter distance, Jin-Nei gamma distance, Tajima-Nei distance, and Tamura distance. For protein sequences, these are: uncorrected distance, Jukes-Cantor distance, and Kimura protein distance.
-BEGin=1 sets the beginning position for all input sequences. When the beginning position is set from the command line, Distances ignores beginning positions specified for individual sequences in a list file.
-END=100 sets the ending position for all input sequences. When the ending position is set from the command line, Distances ignores ending positions specified for sequences in a list file.
-FILe=hum_gtr.report creates a table of the counts used to calculate the distances: number of positions scored, exact matches, ambiguous symbol matches, transitions, transversions, gap positions, etc.
-AMBIGuous considers partial matches between ambiguous symbols when calculating distances (uncorrected and Jukes-Cantor only).
-POSition=5 allows you to consider a single specified codon position (1, 2, or 3), the first and second positions only (4), or all three codon positions (5) when calculating distances between nucleic acid sequences.
-GAPweight=0.0 allows you to assign a gap penalty when using the Jukes-Cantor or uncorrected distance methods.
-APARAMeter=1.0 allows you to vary the value of the shape parameter a in the equation used by the Jin-Nei gamma distance correction method.
-NOMONitor suppresses screen display of the progress of the analysis.
Copyright (c) 1982-2001 Genetics Computer Group, Inc. A subsidiary of Pharmacopeia, Inc. All rights reserved.
Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.