Collections of Trees¶
Collections of Trees: The TreeList Class¶
TreeList objects are collections of Tree objects constrained to sharing the same TaxonNamespace.
Any Tree object added to a TreeList will have its taxon_namespace attribute assigned to the TaxonNamespace object of the TreeList, and all referenced Taxon objects will be mapped to the same or corresponding Taxon objects of this new TaxonNamespace, with new Taxon objects created if no suitable match is found.
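As a quick illustration, the following minimal sketch (the two-tree Newick string is made up for this example) shows that every tree added to a TreeList ends up referencing the list's own TaxonNamespace:
import dendropy

# Hypothetical two-tree Newick string, used purely for illustration
trees = dendropy.TreeList.get(
    data="(A,(B,C));((A,B),C);",
    schema="newick")
for tree in trees:
    # every member tree shares the TreeList's TaxonNamespace
    assert tree.taxon_namespace is trees.taxon_namespace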
Objects of the TreeList class have an “annotations” attribute, which is an AnnotationSet object, i.e. a collection of Annotation instances tracking metadata.
More information on working with metadata can be found in the “Working with Metadata Annotations” section.
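As a brief sketch of what this looks like (the annotation name and value below are made up; see the “Working with Metadata Annotations” section for the full API):
import dendropy

trees = dendropy.TreeList.get(
    data="(A,(B,C));((A,B),C);",
    schema="newick")
# attach a simple name/value metadata annotation to the collection;
# the name "source" and its value here are purely illustrative
trees.annotations.add_new(name="source", value="example-analysis")
for annotation in trees.annotations:
    print(annotation.name, annotation.value)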
Reading and Writing TreeList Instances¶
The TreeList class supports the “get” factory class method for simultaneously instantiating and populating TreeList instances, taking a data source as the first argument and a schema specification string (“nexus”, “newick”, “nexml”, “fasta”, or “phylip”, etc.) as the second:
import dendropy
treelist = dendropy.TreeList.get(path='pythonidae.mcmc.nex', schema='nexus')
The “read” instance method can be used to add trees from a data source to an existing TreeList instance:
import warnings
import dendropy
warnings.warn(
    "This example is known to be broken! "
    "It will be fixed or removed in the future. "
    "See https://github.com/jeetsukumaran/DendroPy/issues/160 for details. "
    "Patch contributions are welcome.",
)
trees = dendropy.TreeList()
trees.read(path="sometrees.nex", schema="nexus", tree_offset=10)
trees.read(data="(A,(B,C));((A,B),C);", schema="newick")
A TreeList object can be written to an external resource using the “write” method:
import dendropy
treelist = dendropy.TreeList.get(
    path="trees1.nex",
    schema="nexus",
    )
treelist.write(
    path="trees1.newick",
    schema="newick",
    )
It can also be represented as a string using the “as_string” method:
import dendropy
treelist = dendropy.TreeList.get(
    path="trees1.nex",
    schema="nexus",
    )
print(treelist.as_string(schema="newick"))
More information on reading operations is available in the Reading and Writing Phylogenetic Data section.
Using and Managing the Collections of Trees¶
A TreeList behaves very much like a list, supporting iteration, indexing, slicing, removal, sorting, etc.:
import dendropy
from dendropy.calculate import treecompare
trees = dendropy.TreeList.get(
    path="pythonidae.random.bd0301.tre",
    schema="nexus")
for tree in trees:
    print(tree.as_string("newick"))
print(len(trees))
print(trees[4].as_string("nexus"))
print(treecompare.robinson_foulds_distance(trees[0], trees[1]))
print(treecompare.weighted_robinson_foulds_distance(trees[0], trees[1]))
first_10_trees = trees[:10]
last_10_trees = trees[-10:]
# Note that the TaxonNamespace is propagated to slices
assert first_10_trees.taxon_namespace is trees.taxon_namespace
assert last_10_trees.taxon_namespace is trees.taxon_namespace
print(id(trees[4]))
print(id(trees[5]))
trees[4] = trees[5]
print(id(trees[4]))
print(id(trees[5]))
print(trees[4] in trees)
trees.remove(trees[-1])
tx = trees.pop()
print(trees.index(trees[0]))
trees.sort(key=lambda t:t.label)
trees.reverse()
trees.clear()
The TreeList class supports the native Python list interface methods for adding individual Tree instances, such as append, extend, insert, and others, but with the added aspect of taxon namespace migration, as the following example (and the extend/insert sketch after it) shows:
import dendropy
from dendropy.calculate import treecompare
trees = dendropy.TreeList.get(
    path="pythonidae.random.bd0301.tre",
    schema="nexus")
print(len(trees))
tree = dendropy.Tree.get(path="pythonidae.mle.nex", schema="nexus")
# As we did not specify a TaxonNamespace instance to use above, by default
# 'tree' will get its own, distinct TaxonNamespace
original_tree_taxon_namespace = tree.taxon_namespace
print(id(original_tree_taxon_namespace))
assert tree.taxon_namespace is not trees.taxon_namespace
# This operation adds the Tree, 'tree', to the TreeList, 'trees',
# *and* migrates the Taxon objects of the tree over to the TaxonNamespace
# of 'trees'. This will break things if the tree is contained in another
# TreeList with a different TaxonNamespace!
trees.append(tree)
# In contrast to before, the TaxonNamespace of 'tree' is now the same
# as the TaxonNamespace of 'trees'. The Taxon objects have been imported
# and/or remapped based on their labels.
assert tree.taxon_namespace is trees.taxon_namespace
print(id(original_tree_taxon_namespace))
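The same taxon namespace migration applies to extend and insert. A minimal sketch (the Newick strings are made up for illustration):
import dendropy

trees = dendropy.TreeList.get(
    data="(A,(B,C));((A,B),C);",
    schema="newick")
# each of these trees initially has its own, distinct TaxonNamespace
extra_trees = [
    dendropy.Tree.get(data="((A,C),B);", schema="newick"),
    dendropy.Tree.get(data="(C,(A,B));", schema="newick"),
]
# 'extend' migrates the Taxon objects of each added tree into the
# TaxonNamespace of 'trees', just as 'append' does
trees.extend(extra_trees)
# 'insert' behaves the same way for a single tree at a given position
another_tree = dendropy.Tree.get(data="(B,(A,C));", schema="newick")
trees.insert(0, another_tree)
for tree in trees:
    assert tree.taxon_namespace is trees.taxon_namespace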
Cloning/Copying a TreeList¶
You can make a shallow-copy of a TreeList by calling dendropy.datamodel.treecollectionmodel.TreeList.clone with a “depth” argument value of 0 or by slicing:
import dendropy
# original list
s1 = "(A,(B,C));(B,(A,C));(C,(A,B));"
treelist1 = dendropy.TreeList.get(
    data=s1,
    schema="newick")
# shallow copy by calling TreeList.clone(depth=0)
treelist2 = treelist1.clone(depth=0)
# shallow copy by slicing
treelist3 = treelist1[:]
# same tree instances are shared
for t1, t2 in zip(treelist1, treelist2):
    assert t1 is t2
for t1, t2 in zip(treelist1, treelist3):
    assert t1 is t2
# note: (necessarily) sharing same TaxonNamespace
assert treelist2.taxon_namespace is treelist1.taxon_namespace
assert treelist3.taxon_namespace is treelist1.taxon_namespace
With a shallow-copy, the actual Tree instances are shared between lists (as is the TaxonNamespace).
For a taxon namespace-scoped deep-copy, on the other hand, i.e., where the Tree instances are also cloned but the Taxon and TaxonNamespace references are preserved, you can call dendropy.datamodel.treecollectionmodel.TreeList.clone with a “depth” argument value of 1, or use copy construction:
import dendropy
# original list
s1 = "(A,(B,C));(B,(A,C));(C,(A,B));"
treelist1 = dendropy.TreeList.get(
    data=s1,
    schema="newick")
# taxon namespace-scoped deep copy by calling TreeList.clone(depth=1)
# I.e. Everything cloned, but with Taxon and TaxonNamespace references shared
treelist2 = treelist1.clone(depth=1)
# taxon namespace-scoped deep copy by copy-construction
# I.e. Everything cloned, but with Taxon and TaxonNamespace references shared
treelist3 = dendropy.TreeList(treelist1)
# *different* tree instances
for t1, t2, t3 in zip(treelist1, treelist2, treelist3):
    assert t1 is not t2
    assert t1 is not t3
    assert t2 is not t3
# Note: TaxonNamespace is still shared
# I.e. Everything cloned, but with Taxon and TaxonNamespace references shared
assert treelist2.taxon_namespace is treelist1.taxon_namespace
assert treelist3.taxon_namespace is treelist1.taxon_namespace
Finally, for a true and complete deep-copy, where even the Taxon and TaxonNamespace references are copied, call copy.deepcopy:
import copy
import dendropy
# original list
s1 = "(A,(B,C));(B,(A,C));(C,(A,B));"
treelist1 = dendropy.TreeList.get(
    data=s1,
    schema="newick")
# Full deep copy by calling copy.deepcopy()
# I.e. Everything cloned including Taxon and TaxonNamespace instances
treelist2 = copy.deepcopy(treelist1)
# *different* tree instances
for t1, t2 in zip(treelist1, treelist2):
    assert t1 is not t2
# Note: TaxonNamespace is also different
assert treelist2.taxon_namespace is not treelist1.taxon_namespace
for tx1 in treelist1.taxon_namespace:
    assert tx1 not in treelist2.taxon_namespace
for tx2 in treelist2.taxon_namespace:
    assert tx2 not in treelist1.taxon_namespace
Efficiently Iterating Over Trees in a File¶
If you need to process a collection of trees defined in a file source, you can, of course, read the trees into a TreeList object and iterate over the resulting collection:
import dendropy
trees = dendropy.TreeList.get(path='pythonidae.beast-mcmc.trees', schema='nexus')
for tree in trees:
    print(tree.as_string('newick'))
In the above, the entire data source is parsed and stored in the trees object before being processed in the subsequent lines.
In some cases, you might not need to maintain all the trees in memory at the same time.
For example, you might be interested in calculating the distribution of a statistic over a collection of trees, but have no need to refer to any of the trees after the statistic has been calculated.
In this case, it will be more efficient to use the yield_from_files function.
This takes a list or any other iterable of file-like objects or strings (giving file paths) as the first argument (“files”) and a mandatory schema specification string as the second argument (“schema”).
Additional keyword arguments to customize the parsing are the same as those for the general “get” and “read” methods.
For example, the following script reads a model tree from a file, and then iterates over a collection of MCMC trees in a set of files, calculating and storing the symmetric distance between the model tree and each of the MCMC trees one at a time:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import dendropy
from dendropy.calculate import treecompare
distances = []
taxa = dendropy.TaxonNamespace()
mle_tree = dendropy.Tree.get(
    path='pythonidae.mle.nex',
    schema='nexus',
    taxon_namespace=taxa)
burnin = 20
source_files = [
open("pythonidae.mcmc1.nex", "r"), # Note: for 'Tree.yield_from_files',
open("pythonidae.mcmc2.nex", "r"), # sources can be specified as file
"pythonidae.mcmc3.nex", # objects or strings, with strings
"pythonidae.mcmc4.nex", # assumed to specify file paths
]
tree_yielder = dendropy.Tree.yield_from_files(
    files=source_files,
    schema='nexus',
    taxon_namespace=taxa,
    )
for tree_idx, mcmc_tree in enumerate(tree_yielder):
    if tree_idx < burnin:
        # skip burnin
        continue
    distances.append(treecompare.symmetric_difference(mle_tree, mcmc_tree))
print("Mean symmetric distance between MLE and MCMC trees: %f"
      % float(sum(distances)/len(distances)))
Note how a TaxonNamespace object is created and passed to both the get and the yield_from_files functions using the taxon_namespace keyword argument.
This ensures that the corresponding taxa in both sources get mapped to the same Taxon objects in DendroPy object space, so as to enable comparisons of the trees.
If this were not done, each tree would have its own distinct TaxonNamespace object (and associated Taxon objects), making comparisons impossible.
When the number of trees is large, or the trees themselves are large, or both, iterating over trees in files using yield_from_files is almost always going to give the best performance, sometimes orders of magnitude faster.
This is because it avoids the heavy memory usage that can slow down the Python virtual machine itself.