Data Sets¶
The DataSet class provides for objects that allow you to manage multiple types of phylogenetic data.
It has three primary attributes:
taxon_namespaces
A list of all TaxonNamespace objects in the DataSet, in the order that they were added or read, including TaxonNamespace objects added implicitly through being associated with added TreeList or CharacterMatrix objects.
tree_lists
A list of all TreeList objects in the DataSet, in the order that they were added or read.
char_matrices
A list of all CharacterMatrix objects in the DataSet, in the order that they were added or read.
DataSet Creation and Reading¶
Reading and Writing DataSet Objects¶
You can use the get factory class method to simultaneously instantiate and populate a DataSet object. It takes a data source as the first argument and a schema specification string ("nexus", "newick", "nexml", "fasta", "phylip", etc.) as the second:
>>> import dendropy
>>> ds = dendropy.DataSet.get(
...     path='pythonidae.nex',
...     schema='nexus')
The read instance method, which reads additional data into an existing object, is also supported. It takes the same arguments (i.e., a data source, a schema specification string, and optional keyword arguments to customize the parse behavior):
import dendropy
# Create the DataSet to store data
ds = dendropy.DataSet()
# Set it up to manage all data under a single taxon namespace.
# HIGHLY RECOMMENDED!
taxon_namespace = dendropy.TaxonNamespace()
ds.attach_taxon_namespace(taxon_namespace)
# Read from multiple sources
# Add a collection of trees
ds.read(
    path='pythonidae.mle.nex',
    schema='nexus',)
# Add a collection of characters from a Nexus source
ds.read(
    path='pythonidae.chars.nexus',
    schema='nexus',)
# Add a collection of characters from a FASTA source
# Note that with this format, we have to explicitly provide the type of data
ds.read(
    path='pythonidae_cytb.fasta',
    schema='fasta',
    data_type="dna")
# Add a collection of characters from a PHYLIP source
# Note that with this format, we have to explicitly provide the type of data
ds.read(
    path='pythonidae.chars.phylip',
    schema='phylip',
    data_type="dna")
# Add a collection of continuous characters from a NeXML source
ds.read(
    path='pythonidae_continuous.chars.nexml',
    schema='nexml',)
Note
Note how the attach_taxon_namespace method is called before invoking any "read" statements, to ensure that all the taxon references in the data sources get mapped to the same TaxonNamespace instance.
It is HIGHLY recommended that you do this, i.e., manage all data with the same DataSet instance under the same taxonomic namespace, unless you have a special reason to include multiple independent taxon "domains" in the same data set.
The "write" method allows you to write the data of a DataSet to a file-like object or a file path.
The following example aggregates the post-burn-in MCMC samples from a series of NEXUS-formatted tree files into a single TreeList, then adds the TreeList as well as the original character data to a single DataSet object, which is then written out as a NEXUS-formatted file:
import dendropy
taxa = dendropy.TaxonNamespace()
trees = dendropy.TreeList(taxon_namespace=taxa)
trees.read(path='pythonidae.mb.run1.t', schema='nexus', tree_offset=10)
trees.read(path='pythonidae.mb.run2.t', schema='nexus', tree_offset=10)
trees.read(path='pythonidae.mb.run3.t', schema='nexus', tree_offset=10)
trees.read(path='pythonidae.mb.run4.t', schema='nexus', tree_offset=10)
ds = dendropy.DataSet([trees])
ds.read(
    path='pythonidae_cytb.fasta',
    schema='fasta',
    data_type='dna')
ds.write(path='pythonidae_combined.nex', schema='nexus')
If you do not want to actually write to a file, but instead simply need a string representing the data in a particular format, you can call the instance method as_string, passing a schema specification string as the first argument:
import dendropy
ds = dendropy.DataSet()
ds.read(path='pythonidae.cytb.fasta', schema='fasta', data_type='dna')
s = ds.as_string('nexus')
or:
dna1 = dendropy.DataSet.get(file=open("pythonidae.nex"), schema="nexus")
s = dna1.as_string(schema="fasta")
print(s)
In addition, fine-grained control over the reading and writing of data is available through various keyword arguments. More information on reading operations is available in the Reading and Writing Phylogenetic Data section.
Creating a New DataSet from Existing TreeList and CharacterMatrix Objects¶
You can add independently created or parsed data objects to a DataSet by passing them in a list to the constructor:
import dendropy
treelist1 = dendropy.TreeList.get(
    path='pythonidae.mle.nex',
    schema='nexus')
cytb = dendropy.DnaCharacterMatrix.get(
    path='pythonidae_cytb.fasta',
    schema='fasta')
ds = dendropy.DataSet([cytb, treelist1])
ds.unify_taxon_namespaces()
Note how we call the instance method unify_taxon_namespaces after the creation of the DataSet object.
This method will remove all existing TaxonNamespace objects from the DataSet, create and add a new one, and then map all taxon references in all contained TreeList and CharacterMatrix objects to this new, unified TaxonNamespace.
Adding Data to an Existing DataSet¶
You can add independently created or parsed data objects to a DataSet using the add method:
.. literalinclude:: /examples/ds4.py
Here, again, we call the unify_taxon_namespaces method to map all taxon references to the same, common, unified TaxonNamespace.
Taxon Management with Data Sets¶
The DataSet object, representing a meta-collection of phylogenetic data, differs in one important way from all the other phylogenetic data objects discussed so far with respect to taxon management, in that it is not associated with any particular TaxonNamespace object.
Rather, it maintains a list (in the property taxon_namespaces) of all the TaxonNamespace objects referenced by its contained TreeList objects (in the property tree_lists) and CharacterMatrix objects (in the property char_matrices).
With respect to taxon management, DataSet objects operate in one of two modes: “detached taxon set” mode and “attached taxon set” mode.
Detached (Multiple) Taxon Set Mode¶
In the “detached taxon set” mode, which is the default, a DataSet object tracks all TaxonNamespace references of its other data members in the property taxon_namespaces, but no effort is made at taxon management as such.
Thus, every time a data source is read with a “detached taxon set” mode DataSet object, by default, a new TaxonNamespace object will be created and associated with the Tree, TreeList, or CharacterMatrix objects created from each data source, resulting in multiple independent TaxonNamespace references.
As such, “detached taxon set” mode DataSet objects are suitable for handling data with multiple distinct sets of taxa.
For example:
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> ds.read(path="primates.nex", schema="nexus")
>>> ds.read(path="snakes.nex", schema="nexus")
The dataset, ds, will now contain two distinct TaxonNamespace objects, one for the taxa defined in “primates.nex”, and the other for the taxa defined in “snakes.nex”.
In this case, this behavior is correct, as the two files do indeed refer to different sets of taxa.
However, consider the following:
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> ds.read(path="pythonidae_cytb.fasta", schema="fasta", data_type="dna")
>>> ds.read(path="pythonidae_aa.nex", schema="nexus")
>>> ds.read(path="pythonidae_morphological.nex", schema="nexus")
>>> ds.read(path="pythonidae.mle.tre", schema="nexus")
Here, even though all the data files refer to the same set of taxa, the resulting DataSet object will actually have 4 distinct TaxonNamespace objects, one for each of the independent reads, and a taxon with a particular label in the first file (e.g., “Python regius” of “pythonidae_cytb.fasta”) will map to a completely distinct Taxon object from a taxon with the same label in the second file (e.g., “Python regius” of “pythonidae_aa.nex”).
This is not the behavior we want here, and to achieve the correct behavior with a multiple taxon set mode DataSet object, we need to explicitly pass a TaxonNamespace object to each of the subsequent read statements:
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> ds.read(path="pythonidae_cytb.fasta", schema="fasta", data_type="dna")
>>> ds.read(path="pythonidae_aa.nex", schema="nexus", taxon_namespace=ds.taxon_namespaces[0])
>>> ds.read(path="pythonidae_morphological.nex", schema="nexus", taxon_namespace=ds.taxon_namespaces[0])
>>> ds.read(path="pythonidae.mle.tre", schema="nexus", taxon_namespace=ds.taxon_namespaces[0])
>>> ds.write(path="pythonidae_combined.nex", schema="nexus")
In the previous example, the first read statement results in a new TaxonNamespace object, which is added to the taxon_namespaces property of the DataSet object ds.
This TaxonNamespace object gets passed via the taxon_namespace keyword to subsequent read statements, and thus as each of the data sources is processed, the taxon references get mapped to Taxon objects in the same, single, TaxonNamespace object.
While this approach works to ensure correct taxon mapping across multiple data object reads and instantiation, in this context, it is probably more convenient to use the DataSet in “attached taxon set” mode.
In fact, it is highly recommended that DataSet instances always use the “attached taxon set” mode, as, conceptually, there are very few cases where a collection of data should span multiple independent taxon namespaces.
Attached (Single) Taxon Set Mode¶
In the “attached taxon set” mode, DataSet objects ensure that the taxon references of all data objects that are added to them are mapped to the same TaxonNamespace object (rather than to a new one for each independent read or creation operation).
The “attached taxon set” mode is activated by calling the attach_taxon_namespace method on a DataSet and passing in the TaxonNamespace to use:
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> taxa = dendropy.TaxonNamespace(label="global")
>>> ds.attach_taxon_namespace(taxa)
>>> ds.read(path="pythonidae_cytb.fasta", schema="fasta", data_type="dna")
>>> ds.read(path="pythonidae_aa.nex", schema="nexus")
>>> ds.read(path="pythonidae_morphological.nex", schema="nexus")
>>> ds.read(path="pythonidae.mle.tre", schema="nexus")
Switching Between Attached and Detached Taxon Set Modes¶
As noted above, you can use the attach_taxon_namespace method to switch a DataSet object to attached taxon set mode.
To restore it to multiple taxon set mode, you would use the detach_taxon_namespace method:
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> taxa = dendropy.TaxonNamespace(label="global")
>>> ds.attach_taxon_namespace(taxa)
>>> ds.read(path="pythonidae_cytb.fasta", schema="fasta", data_type="dna")
>>> ds.read(path="pythonidae_aa.nex", schema="nexus")
>>> ds.read(path="pythonidae_morphological.nex", schema="nexus")
>>> ds.read(path="pythonidae.mle.tre", schema="nexus")
>>> ds.detach_taxon_namespace()
>>> ds.read(path="primates.nex", schema="nexus")
Here, the same TaxonNamespace object is used to manage taxon references for data parsed from the first four files, while the data from the fifth and final file gets its own, distinct, TaxonNamespace object and associated Taxon object references.