Tree Statistics, Metrics, Summarizations, and Other Calculations¶
Some general tree metrics that are calculated without reference to any particular model or data and general report some tree metadata (e.g., tree length, node ages, etc.) are available as instance methods. More specialized tree statistics, however, are available through functions in various other modules:
The
treemeasure
module provides for calculation of statistics that are typically calculated on a single tree.The
treecompare
module provides for calculation of statistics that are typically calculated between treesThe
treescore
module provides for statistics that typically score a tree under a model and with reference to some sort of data.The
coalescent
module provides for calcuations on trees under the coalescent model.
In addition, see the PhylogeneticDistanceMatrix class for statistics, operations, and inferences based on phylogenetic (taxon-to-taxon) distances, including rapid calculation of MRCA’s, patristics distances, neighbor-joining (NJ) and Unweighted Pair Group with Mathematical Average (UPGMA) trees, phylogenetic community statistics (such as the Mean Pairwise Distance, MPD, or Mean Nearest Taxon Distance, MNTD), and more.
Native Tree Statistic and Metric Methods¶
Basic meta-information about tree structure are available as native Tree
methods.
Tree Length¶
The length
method returns the sum of edge lengths of a Tree
object, with edges that do not have any length assigned being treated as edges with length 0.
The following example shows how to identify the “critical” value for an Archie-Faith-Cranston or PTP test from a sample of Tree
objects, i.e. a tree length equal to or greater than 95% of the trees in the sample:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import dendropy
trees = dendropy.TreeList.get(
path="pythonidae.random.bd0301.tre",
schema="nexus")
tree_lengths = [tree.length() for tree in trees]
tree_lengths.sort()
crit_index_95 = int(0.95 * len(tree_lengths))
crit_index_99 = int(0.99 * len(tree_lengths))
print("95%% critical value: %s" % tree_lengths[crit_index_95])
print("99%% critical value: %s" % tree_lengths[crit_index_99])
Node Ages¶
The calc_node_ages
method calculates the age of a node (i.e., the sum of edge lengths from the node to a tip) and assigns it to the age
attribute. The following example iterates through the post-burnin of an MCMC sample of ultrametric trees, calculating the age of the MRCA of two taxa, and reports the mean age of the node.
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import dendropy
from dendropy.calculate import treemeasure
trees = dendropy.TreeList.get(
path="pythonidae.beast-mcmc.trees",
schema="nexus",
tree_offset=200)
maculosa_childreni_ages = []
for idx, tree in enumerate(trees):
tree.calc_node_ages()
taxon_labels = ["Antaresia maculosa","Antaresia childreni"]
mrca = tree.mrca(taxon_labels=taxon_labels)
maculosa_childreni_ages.append(mrca.age)
print("Mean age of MRCA of 'Antaresia maculosa' and 'Antaresia childreni': %s" \
% (float(sum(maculosa_childreni_ages))/len(maculosa_childreni_ages)))
Number of Lineages at a Particular Time and Lineage Through Time Plots¶
The num_lineages_at
method of the Tree
class returns the number of lineages at a particular time given in terms of distance from the root.
The following example extracts the number of lineages at fixed intervals along the length of the tree to use in an Lineage Through Time (LTT) plot:
import dendropy
tree = dendropy.Tree.get(
path="hiv1.nexus",
schema="nexus")
# Returns distance of node furthest from root, i.e., maximum time available on
# tree
total_time = tree.max_distance_from_root()
# Divide time span into 10 steps
step = float(total_time) / 10
# To store tuples of (time, number of lineages)
ltt = []
# Start at first time step
current_time = step
while current_time <= total_time:
# Get number of lineages at current time
num_lineages = tree.num_lineages_at(current_time)
# Store it
ltt.append( (current_time, num_lineages) )
# Move to next time step
current_time += step
# Get the final number of lineages
# Note: may not be the same as the number of tips if the tree has extinct
# tips/taxa; though, if this were the case, we would not be dealing with an
# ultrametric tree.
if current_time < total_time:
ltt.append( (total_time, tree.num_lineages_at(total_time)) )
# Print results
for t, num_lineages in ltt:
print("{:12.8f}\t{}".format(t, num_lineages))
Unary Tree Statistics and Metrics¶
Numerous specialized statistics and indexes of tree shape and structure (B1, Colless’ imbalance, Pybus-Harvey-Gamma, etc.) are available through the treemeasure
module:
import collections
import dendropy
from dendropy.calculate import treemeasure
from dendropy.calculate import statistics
# Since we do not want to waste memory by keeping the actual trees around
# after we are done calculating the statistics, we use the tree yielder
# instead of:
# dendropy.TreeList.get(
# path="pythonidae.beast-mcmc.trees",
# schema="nexus",
# tree_offset=200)
tree_stats = collections.defaultdict(list)
for tree_idx, tree in enumerate(dendropy.Tree.yield_from_files(
files=["pythonidae.beast-mcmc.trees"],
schema="nexus")):
if tree_idx < 200:
continue # burnin
tree_stats["B1"].append(treemeasure.B1(tree))
tree_stats["colless"].append(treemeasure.colless_tree_imbalance(tree))
tree_stats["PBH"].append(treemeasure.pybus_harvey_gamma(tree))
tree_stats["sackin"].append(treemeasure.sackin_index(tree))
tree_stats["treeness"].append(treemeasure.treeness(tree))
for key in tree_stats:
values = tree_stats[key]
mean, var = statistics.mean_and_sample_variance(values)
hpd = statistics.empirical_hpd(values)
print("{:15}: mean = {}, variance = {}, hpd = ({}, {})".format(key, mean, var, hpd[0], hpd[1]))
Pybus-Harvey Gamma¶
The Pybus-Harvey Gamma statistic is given by the pybus_harvey_gamma
instance method. The following example iterates through the post-burn-in of an MCMC sample of trees, reporting the mean Pybus-Harvey Gamma statistic:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import dendropy
from dendropy.calculate import treemeasure
trees = dendropy.TreeList.get(
path="pythonidae.beast-mcmc.trees",
schema="nexus",
tree_offset=200)
pbhg = []
for idx, tree in enumerate(trees):
pbhg.append(treemeasure.pybus_harvey_gamma(tree))
print("Mean Pybus-Harvey-Gamma: %s" \
% (float(sum(pbhg))/len(pbhg)))
Patristic Distances¶
The PhylogeneticDistanceMatrix
is the most efficient way to calculate the patristic distances between taxa or leaves on a tree, when doing multiple such calculations.
The easiest way to get an object of this class for a particular tree is to call phylogenetic_distance_matrix
.
The object is callable, taking two Taxon
objects as arguments and returning the sum of edge lengths between the two. The following example reports the pairwise distances between all taxa on the input tree:
import dendropy
tree = dendropy.Tree.get(
path="pythonidae.mle.nex",
schema="nexus")
pdc = tree.phylogenetic_distance_matrix()
for i, t1 in enumerate(tree.taxon_namespace[:-1]):
for t2 in tree.taxon_namespace[i+1:]:
print("Distance between '%s' and '%s': %s" % (t1.label, t2.label, pdc(t1, t2)))
Note that the PhylogeneticDistanceMatrix
object does not automatically update if the original Tree
changes: it is essentially a snapshot of Tree
at the point in which it is instantiated.
If the original Tree
changes, you should create a new instance of the corresponding PhylogeneticDistanceMatrix
object.
Comparing and Summarizing Trees¶
Distances Between Trees¶
Unweighted Robinson-Foulds Distance¶
The unweighted Robinson-Foulds distance (often referred to as just the Robinson-Foulds distance) is given by the dendropy.calculate.treecompare.symmetric_difference
function:
import dendropy
from dendropy.calculate import treecompare
s1 = "(a,(b,(c,d)));"
s2 = "(a,(d,(b,c)));"
# establish common taxon namespace
tns = dendropy.TaxonNamespace()
# ensure all trees loaded use common namespace
tree1 = dendropy.Tree.get(
data=s1,
schema='newick',
taxon_namespace=tns)
tree2 = dendropy.Tree.get(
data=s2,
schema='newick',
taxon_namespace=tns)
## Unweighted Robinson-Foulds distance
print(treecompare.symmetric_difference(tree1, tree2))
Note that the two trees must share the same TaxonNamespace
reference, otherwise an error will be raised:
>> import dendropy
>> from dendropy.calculate import treecompare
>> s1 = "(a,(b,(c,d)));"
>> s2 = "(a,(d,(b,c)));"
>> tree1 = dendropy.Tree.get(data=s1, schema='newick')
>> tree2 = dendropy.Tree.get(data=s2, schema='newick')
>> print(treecompare.symmetric_difference(tree1, tree2))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
print(treecompare.symmetric_difference(tree1, tree2))
File "/Users/jeet/Documents/Projects/Phyloinformatics/DendroPy/dendropy/dendropy/calculate/treecompare.py", line 85, in symmetric_difference
is_bipartitions_updated=is_bipartitions_updated)
File "/Users/jeet/Documents/Projects/Phyloinformatics/DendroPy/dendropy/dendropy/calculate/treecompare.py", line 221, in false_positives_and_negatives
raise error.TaxonNamespaceIdentityError(reference_tree, comparison_tree)
dendropy.utility.error.TaxonNamespaceIdentityError: Non-identical taxon namespace references: <TaxonNamespace object at 0x10052d310> is not <TaxonNamespace object at 0x101572210>
Note, too, that results very much depend on the rooting states of the tree:
import dendropy
from dendropy.calculate import treecompare
s1 = "(a,(b,(c,d)));"
s2 = "((a,b),(c,d));"
tns = dendropy.TaxonNamespace()
unrooted_tree1 = dendropy.Tree.get(
data=s1,
schema='newick',
taxon_namespace=tns)
unrooted_tree2 = dendropy.Tree.get(
data=s2,
schema='newick',
taxon_namespace=tns)
rooted_tree1 = dendropy.Tree.get(
data=s1,
schema='newick',
rooting="force-rooted",
taxon_namespace=tns)
rooted_tree2 = dendropy.Tree.get(
data=s2,
schema='newick',
rooting="force-rooted",
taxon_namespace=tns)
## Unweighted Robinson-Foulds distance (rooted) = 2
print(treecompare.symmetric_difference(rooted_tree1, rooted_tree2))
## Unweighted Robinson-Foulds distance (unrooted) = 0
print(treecompare.symmetric_difference(unrooted_tree1, unrooted_tree2))
## Unweighted Robinson-Foulds distance (rooted1 to unrooted2) = 3
print(treecompare.symmetric_difference(rooted_tree1, unrooted_tree2))
## Unweighted Robinson-Foulds distance (unrooted1 to rooted2) = 5
print(treecompare.symmetric_difference(unrooted_tree1, rooted_tree2))
Weighted Robinson-Foulds Distance¶
The weighted Robinson-Foulds distance takes edge lengths into account, and is given by the dendropy.calculate.treecompare.weighted_robinson_foulds_distance
:
import dendropy
from dendropy.calculate import treecompare
s1 = "((t5:0.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):0.028969):0.065840,t3:0.170221):0.383247);"
s2 = "((t5:2.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):1):0.065840,t3:0.170221):0.383247);"
tns = dendropy.TaxonNamespace()
tree1 = dendropy.Tree.get(
data=s1,
schema='newick',
taxon_namespace=tns)
tree2 = dendropy.Tree.get(
data=s2,
schema='newick',
taxon_namespace=tns)
## Weighted Robinson-Foulds distance = 2.971031
print(treecompare.weighted_robinson_foulds_distance(tree1, tree2))
## Compare to unweighted Robinson-Foulds distance: 0
print(treecompare.symmetric_difference(tree1, tree2))
Euclidean Distance¶
The Euclidean distance, like the weighted Robinson-Foulds distance takes edge lengths into account, but squares the edge lengths instead of taking the absolute distance, and is given by the dendropy.calculate.treecompare.euclidean_distance
:
import dendropy
from dendropy.calculate import treecompare
s1 = "((t5:0.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):0.028969):0.065840,t3:0.170221):0.383247);"
s2 = "((t5:2.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):1):0.065840,t3:0.170221):0.383247);"
tns = dendropy.TaxonNamespace()
tree1 = dendropy.Tree.get(
data=s1,
schema='newick',
taxon_namespace=tns)
tree2 = dendropy.Tree.get(
data=s2,
schema='newick',
taxon_namespace=tns)
## Euclidean distance = 2.22326363775
print(treecompare.euclidean_distance(tree1, tree2))
Majority-Rule Consensus Tree from a Collection of Trees¶
To get the majority-rule consensus tree of a TreeList
object, you can call the consensus
instance method.
You can specify the frequency threshold for the consensus tree by the min_freq
argument, which default to 0.5 (i.e., a 50% majority rule tree).
The following example aggregates the post-burn-in trees from four MCMC samples into a single TreeList
object, and prints the 95% majority-rule consensus as a Newick string:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import dendropy
trees = dendropy.TreeList()
for tree_file in ['pythonidae.mb.run1.t',
'pythonidae.mb.run2.t',
'pythonidae.mb.run3.t',
'pythonidae.mb.run4.t']:
trees.read(
path=tree_file,
schema='nexus',
tree_offset=20)
con_tree = trees.consensus(min_freq=0.95)
print(con_tree.as_string(schema='newick'))
Frequency of a Split in a Collection of Trees¶
The frequency_of_split
method of a TreeList
object returns the frequency of occurrence of a single split across all the Tree
objects in the TreeList
.
The split can be specified by passing a split bitmask directly using the split_bitmask
keyword argument, as a list of Taxon
objects using the taxa
keyword argument, or as a list of taxon labels using the labels
keyword argument.
The following example shows how to calculate the frequency of a split defined by two taxa, “Morelia amethistina” and “Morelia tracyae”, from the post-burn-in trees aggregated across four MCMC samples:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import dendropy
trees = dendropy.TreeList()
for tree_file in ['pythonidae.mb.run1.t',
'pythonidae.mb.run2.t',
'pythonidae.mb.run3.t',
'pythonidae.mb.run4.t']:
trees.read(
path=tree_file,
schema='nexus',
tree_offset=20)
split_leaves = ['Morelia amethistina', 'Morelia tracyae']
f = trees.frequency_of_bipartition(labels=split_leaves)
print('Frequency of split %s: %s' % (split_leaves, f))
The Maximum Clade Credibility Tree: The Tree that Maximizes the Product of Split Support¶
The Maximum Clade Credibility Tree (MCCT) is one that maximize the product of split support, and is returned for a collection of trees managed in a TreeList
instance by the maximum_product_of_split_support_tree
method:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import dendropy
trees = dendropy.TreeList()
for tree_file in ['pythonidae.mb.run1.t',
'pythonidae.mb.run2.t',
'pythonidae.mb.run3.t',
'pythonidae.mb.run4.t']:
trees.read(
path=tree_file,
schema='nexus',
tree_offset=20)
mcct = trees.maximum_product_of_split_support_tree()
print("\nTree {} maximizes the product of split support (log product = {}): {}".format(trees.index(mcct), mcct.log_product_of_split_support, mcct))
msst = trees.maximum_sum_of_split_support_tree()
print("\nTree {} maximizes the sum of split support (sum = {}): {}".format(trees.index(msst), msst.sum_of_split_support, msst))
Unfortunately, terminology in usage and literature regarding this type of summary is very confusing, and sometimes the term “MCCT” is used to refer to the tree that maximizes the sum of split support and “MCT” to the tree that maximizes the product of split support.
If the tree that maximizes the sum of split support is the criteria required, then the maximum_sum_of_split_support_tree
method of the TreeList
object should be used.
Scoring Trees Under the Coalescent¶
Probability Under the Coalescent Model¶
The coalescent
module provides a range of methods for simulations and calculations under Kingman’s coalescent framework and related models. For example:
log_probability_of_coalescent_tree
Given a
Tree
object as the first argument, and the haploid population size as the second, returns the log probability of theTree
under the neutral coalescent.
Numbers of Deep Coalescences¶
reconciliation_discordance
Given two
Tree
objects sharing the same leaf-set, this returns the number of deep coalescences resulting from fitting the first tree (e.g., a gene tree) to the second (e.g., a species tree). This is based on the algorithm described Goodman, et al. (Goodman, et al., 1979. Fitting the gene lineage into its species lineage,a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. 19: 99-113).monophyletic_partition_discordance
Given a
Tree
object as the first argument, and a list of lists ofTaxon
objects representing the expected monophyletic partitioning of theTaxonNamespace
of theTree
as the second argument, this returns the number of deep coalescences found in the relationships implied by theTree
object, conditional on the taxon groupings given by the second argument. This statistic corresponds to the Slatkin and Maddison (1989) s statistic, as described here.
Number of Deep Coalescences when Embedding One Tree in Another (e.g. Gene/Species Trees)¶
Imagine we wanted to generate the distribution of the number of deep coalescences under two scenarios: one in which a population underwent sequential or step-wise vicariance, and another when there was simultaneous fragmentation. In this case, the containing tree and the embedded trees have different leaf sets, and there is a many-to-one mapping of embedded tree taxa to containing tree taxa.
The ContainingTree
class is designed to allow for counting deep coalescences in cases like this.
It requires a TaxonNamespaceMapping
object, which provides an association between the embedded taxa and the containing taxa.
The easiest way to get a TaxonNamespaceMapping
object is to call the special factory function create_contained_taxon_mapping
.
This will create a new TaxonNamespace
to manage the gene taxa, and create the associations between the gene taxa and the containing tree taxa for you.
It takes two arguments: the TaxonNamespace
of the containing tree, and the number of genes you want sampled from each species.
If the gene-species associations are more complex, e.g., different numbers of genes per species, we can pass in a list of values as the second argument to create_contained_taxon_mapping
.
This approach should be used with caution if we cannot be certain of the order of taxa (as is the case with data read in Newick formats). In these case, and in more complex cases, we might need to directly instantiate the TaxonNamespaceMapping
object. The API to describe the associations when constructing this object is very similar to that of the TaxonNamespacePartition
object: you can use a function, attribute or dictionary.
The ContainingTree
class has its own native contained coalescent simulator, embed_contained_kingman
, which simulates and embeds a contained coalescent tree at the same time.
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import dendropy
from dendropy.simulate import treesim
from dendropy.model import reconcile
# simulation parameters and output
num_reps = 10
# population tree descriptions
stepwise_tree_str = "[&R](A:120000,(B:80000,(C:40000,D:40000):40000):40000):100000;"
frag_tree_str = "[&R](A:120000,B:120000,C:120000,D:120000):100000;"
# taxa and trees
containing_taxa = dendropy.TaxonNamespace()
stepwise_tree = dendropy.Tree.get(
data=stepwise_tree_str,
schema="newick",
taxon_namespace=containing_taxa)
frag_tree = dendropy.Tree.get(
data=frag_tree_str,
schema="newick",
taxon_namespace=containing_taxa)
# taxon set association
genes_to_species = dendropy.TaxonNamespaceMapping.create_contained_taxon_mapping(
containing_taxon_namespace=containing_taxa,
num_contained=8)
# convert to containing tree
stepwise_tree = reconcile.ContainingTree(stepwise_tree,
contained_taxon_namespace=genes_to_species.domain_taxon_namespace,
contained_to_containing_taxon_map=genes_to_species)
frag_tree = reconcile.ContainingTree(frag_tree,
contained_taxon_namespace=genes_to_species.domain_taxon_namespace,
contained_to_containing_taxon_map=genes_to_species)
# for each rep
for rep in range(num_reps):
stepwise_tree.embed_contained_kingman(default_pop_size=40000)
frag_tree.embed_contained_kingman(default_pop_size=40000)
# write results
# returns dictionary with contained trees as keys
# and number of deep coalescences as values
stepwise_deep_coals = stepwise_tree.deep_coalescences()
stepwise_out = open("stepwise.txt", "w")
for tree in stepwise_deep_coals:
stepwise_out.write("%d\n" % stepwise_deep_coals[tree])
# returns dictionary with contained trees as keys
# and number of deep coalescences as values
frag_deep_coals = frag_tree.deep_coalescences()
frag_out = open("frag.txt", "w")
for tree in frag_deep_coals:
frag_out.write("%d\n" % frag_deep_coals[tree])
If you have used some other method to simulate your trees, you can use embed_tree
to embed the trees and count then number of deep coalescences.
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import dendropy
from dendropy.simulate import treesim
from dendropy.model import reconcile
# simulation parameters and output
num_reps = 10
# population tree descriptions
stepwise_tree_str = "[&R](A:120000,(B:80000,(C:40000,D:40000):40000):40000):100000;"
frag_tree_str = "[&R](A:120000,B:120000,C:120000,D:120000):100000;"
# taxa and trees
containing_taxa = dendropy.TaxonNamespace()
stepwise_tree = dendropy.Tree.get(
data=stepwise_tree_str,
schema="newick",
taxon_namespace=containing_taxa)
frag_tree = dendropy.Tree.get(
data=frag_tree_str,
schema="newick",
taxon_namespace=containing_taxa)
# taxon set association
genes_to_species = dendropy.TaxonNamespaceMapping.create_contained_taxon_mapping(
containing_taxon_namespace=containing_taxa,
num_contained=8)
# convert to containing tree
stepwise_tree = reconcile.ContainingTree(stepwise_tree,
contained_taxon_namespace=genes_to_species.domain_taxon_namespace,
contained_to_containing_taxon_map=genes_to_species)
frag_tree = reconcile.ContainingTree(frag_tree,
contained_taxon_namespace=genes_to_species.domain_taxon_namespace,
contained_to_containing_taxon_map=genes_to_species)
# for each rep
for rep in range(num_reps):
gene_tree1 = treesim.contained_coalescent_tree(containing_tree=stepwise_tree,
gene_to_containing_taxon_map=genes_to_species,
default_pop_size=40000)
stepwise_tree.embed_tree(gene_tree1)
gene_tree2 = treesim.contained_coalescent_tree(containing_tree=frag_tree,
gene_to_containing_taxon_map=genes_to_species,
default_pop_size=40000)
frag_tree.embed_tree(gene_tree2)
# write results
# returns dictionary with contained trees as keys
# and number of deep coalescences as values
stepwise_deep_coals = stepwise_tree.deep_coalescences()
stepwise_out = open("stepwise.txt", "w")
for tree in stepwise_deep_coals:
stepwise_out.write("%d\n" % stepwise_deep_coals[tree])
# returns dictionary with contained trees as keys
# and number of deep coalescences as values
frag_deep_coals = frag_tree.deep_coalescences()
frag_out = open("frag.txt", "w")
for tree in frag_deep_coals:
frag_out.write("%d\n" % frag_deep_coals[tree])
For more details on simulating contained coalescent trees and counting numbers of deep coalescences on them, see “Multispecies Coalescent (“Contained Coalescent” or “Censored Coalescent”) Trees” or “Simulating the Distribution of Number Deep Coalescences Under Different Phylogeographic History Scenarios”.