Reading and Writing Phylogenetic Data
Creating New Objects From an External Data Source
The Tree, TreeList, CharacterMatrix-derived (e.g., DnaCharacterMatrix, ProteinCharacterMatrix, StandardCharacterMatrix), and DataSet classes all support a “get” factory class-method that instantiates an object of the given class from a data source. This method takes, at a minimum, two keyword arguments that specify the source of the data and the schema (or format) of the data.
The source must be specified using exactly one of the following:
- a path to a file (specified using the keyword argument “path”)
- a file or a file-like object opened for reading (specified using the keyword argument “file”)
- a string value giving the data directly (specified using the keyword argument “data”)
- a URL (specified using the keyword argument “url”)
The schema is specified using the keyword argument “schema”, and takes a string value that identifies the format of the data.
This “schema specification string” can be one of: “fasta”, “newick”, “nexus”, “nexml”, or “phylip”.
Not all formats are supported for reading, and not all formats make sense for particular objects (for example, it would not make sense to try to instantiate a Tree or TreeList object from a FASTA-formatted data source).
For example:
import dendropy
tree1 = dendropy.Tree.get(path="mle.tre", schema="newick")
tree2 = dendropy.Tree.get(file=open("mle.nex", "r"), schema="nexus")
tree3 = dendropy.Tree.get(data="((A,B),(C,D));", schema="newick")
tree4 = dendropy.Tree.get(url="http://api.opentreeoflife.org/v2/study/pg_1144/tree/tree2324.nex", schema="nexus")
tree_list1 = dendropy.TreeList.get(path="pythonidae.mcmc.nex", schema="nexus")
tree_list2 = dendropy.TreeList.get(file=open("pythonidae.mcmc.nex", "r"), schema="nexus")
tree_list3 = dendropy.TreeList.get(data="(A,(B,C));((A,B),C);", schema="newick")
dna1 = dendropy.DnaCharacterMatrix.get(file=open("pythonidae.fasta"), schema="fasta")
dna2 = dendropy.DnaCharacterMatrix.get(url="http://purl.org/phylo/treebase/phylows/matrix/TB2:M2610?format=nexus", schema="nexus")
aa1 = dendropy.ProteinCharacterMatrix.get(file=open("pythonidae.dat"), schema="phylip")
std1 = dendropy.StandardCharacterMatrix.get(path="python_morph.nex", schema="nexus")
std2 = dendropy.StandardCharacterMatrix.get(data=">t1\n01011\n\n>t2\n11100", schema="fasta")
dataset1 = dendropy.DataSet.get(path="pythonidae.chars_and_trees.nex", schema="nexus")
dataset2 = dendropy.DataSet.get(url="http://purl.org/phylo/treebase/phylows/study/TB2:S1925?format=nexml", schema="nexml")
The “get” method takes a number of other optional keyword arguments that provide control over how the data is interpreted and processed.
Some are general to all classes (e.g., the “label” or “taxon_namespace” arguments), while others are specific to a given class (e.g., the “exclude_trees” argument when instantiating data into a DataSet object, or the “tree_offset” argument when instantiating data into a Tree or TreeList object).
These are all covered in detail in the documentation of the respective “get” method of each class.
Other optional keyword arguments are specific to the schema or format (e.g., the “preserve_underscores” argument when reading Newick or NEXUS data).
These are covered in detail in the DendroPy Schema Guide.
Note
The Tree, TreeList, CharacterMatrix-derived, and DataSet classes also support a “get_from_*()” family of factory class-methods that can be seen as specializations of the “get” method for various types of sources (in fact, the “get” method is a dispatcher that calls one of the methods below to do the actual work):
get_from_stream(src, schema, **kwargs)
Takes a file or file-like object opened for reading the data source as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “get(file=src, schema=schema, ...)”.
get_from_path(src, schema, **kwargs)
Takes a string specifying the path to the data source file as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “get(path=src, schema=schema, ...)”.
get_from_string(src, schema, **kwargs)
Takes a string containing the source data as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “get(data=src, schema=schema, ...)”.
get_from_url(src, schema, **kwargs)
Takes a string containing the URL of the data as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “get(url=src, schema=schema, ...)”.
As with the “get” method, the additional keyword arguments are specific to the given class or schema type.
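The dispatch described in the note above can be pictured with a small, self-contained sketch. This is not DendroPy's implementation: ToyTree, its newick attribute, and the bodies of its “get_from_*()” methods are hypothetical stand-ins, and the URL variant is left out. The only behavior taken from the text is that exactly one source keyword is accepted and that each keyword maps onto one specialized method.

```python
import io


class ToyTree:
    """Hypothetical stand-in for a class with a "get" factory method."""

    def __init__(self, newick):
        self.newick = newick

    @classmethod
    def get_from_stream(cls, src, schema, **kwargs):
        # A real parser would interpret the text according to `schema`;
        # this toy simply stores the stripped text.
        return cls(src.read().strip())

    @classmethod
    def get_from_path(cls, src, schema, **kwargs):
        with open(src, "r") as stream:
            return cls.get_from_stream(stream, schema, **kwargs)

    @classmethod
    def get_from_string(cls, src, schema, **kwargs):
        return cls.get_from_stream(io.StringIO(src), schema, **kwargs)

    @classmethod
    def get(cls, schema, **kwargs):
        # Map each source keyword onto the method that handles it.
        handlers = {
            "file": cls.get_from_stream,
            "path": cls.get_from_path,
            "data": cls.get_from_string,
        }
        given = [key for key in handlers if key in kwargs]
        if len(given) != 1:
            raise TypeError(
                "exactly one of 'file', 'path', or 'data' is required; got %r"
                % (given,)
            )
        source = kwargs.pop(given[0])
        return handlers[given[0]](source, schema, **kwargs)


# The same data reaches the same method through two different keywords.
tree_a = ToyTree.get(data="((A,B),(C,D));", schema="newick")
tree_b = ToyTree.get(file=io.StringIO("((A,B),(C,D));"), schema="newick")
print(tree_a.newick == tree_b.newick)
```

Passing two source keywords at once, or none, raises a TypeError in this sketch, which mirrors the rule that the source must be given in exactly one way.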
Adding Data to Existing Objects from an External Data Source
In addition to the “get” class factory method, the collection classes (TreeList, TreeArray and DataSet) each support a “read” instance method that adds data from external sources to an existing object (as opposed to creating and returning a new object based on an external data source).
This “read” instance method has a signature that parallels the “get” factory method described above, taking:
- A specification of a source using exactly one of the following keyword arguments: “path”, “file”, “data”, “url”.
- A specification of the schema or format of the data.
- Optional keyword arguments to customize/control the parsing and interpretation of the data.
As with the “get” method, the “read” method takes a number of other optional keyword arguments that provide control over how the data is interpreted and processed. These are covered in more detail in the documentation of the respective “read” method of each class, while the schema-specific keyword arguments are covered in detail in the DendroPy Schema Guide.
For example, the following accumulates post-burn-in trees from several different files into a single TreeList object:
>>> import dendropy
>>> post_trees = dendropy.TreeList()
>>> post_trees.read(
... file=open("pythonidae.nex.run1.t", "r"),
... schema="nexus",
... tree_offset=200)
>>> print(len(post_trees))
800
>>> post_trees.read(
... path="pythonidae.nex.run2.t",
... schema="nexus",
... tree_offset=200)
>>> print(len(post_trees))
1600
>>> s = open("pythonidae.nex.run3.t", "r").read()
>>> post_trees.read(
... data=s,
... schema="nexus",
... tree_offset=200)
>>> print(len(post_trees))
2400
while the following accumulates data from a variety of sources into a single DataSet object under the same TaxonNamespace to ensure that they all reference the same set of Taxon objects:
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> tns = ds.new_taxon_namespace()
>>> ds.attach_taxon_namespace(tns)
>>> ds.read(url="http://api.opentreeoflife.org/v2/study/pg_1144/tree/tree2324.nex",
... schema="nexus")
>>> ds.read(file=open("pythonidae.fasta"), schema="fasta")
>>> ds.read(url="http://purl.org/phylo/treebase/phylows/matrix/TB2:M2610?format=nexus",
... schema="nexus")
>>> ds.read(file=open("pythonidae.dat"), schema="phylip")
>>> ds.read(path="python_morph.nex", schema="nexus")
>>> ds.read(data=">t1\n01011\n\n>t2\n11100", schema="fasta")
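The accumulation behavior shown in the TreeList example above, where each “read” call appends to the existing collection and “tree_offset” skips the leading (burn-in) trees of that one source, can be mimicked in a few lines. The sketch below is a toy and not DendroPy code: ToyTreeList is hypothetical, it splits Newick text naively on semicolons, and it accepts only an in-memory string.

```python
class ToyTreeList(list):
    """Hypothetical collection that accumulates Newick strings."""

    def read(self, data, schema, tree_offset=0):
        if schema != "newick":
            raise ValueError("this toy only understands 'newick'")
        # Naive split: one tree per semicolon-terminated statement.
        trees = [part.strip() + ";" for part in data.split(";") if part.strip()]
        # Skip the first `tree_offset` trees of this source and append
        # the rest to whatever the collection already holds.
        self.extend(trees[tree_offset:])


run1 = "(A,(B,C));((A,B),C);((A,C),B);"
run2 = "((A,B),C);((A,B),C);(A,(B,C));"

post_trees = ToyTreeList()
post_trees.read(data=run1, schema="newick", tree_offset=1)
print(len(post_trees))
post_trees.read(data=run2, schema="newick", tree_offset=1)
print(len(post_trees))
```

Each call discards one tree and keeps two, so the collection grows from two trees to four. The offset is applied per source and never touches trees that were read earlier.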
Note
DendroPy 3.xx supported “read_from_*()” methods on Tree and CharacterMatrix-derived classes. This is no longer supported in DendroPy 4 and above. Instead of trying to re-populate an existing Tree or CharacterMatrix-derived object by using “read_from_*()”:
x = dendropy.Tree()
x.read_from_path("tree1.nex", "nexus")
.
.
.
x.read_from_path("tree2.nex", "nexus")
simply rebind the name to the new object returned by “get”:
x = dendropy.Tree.get(path="tree1.nex", schema="nexus")
.
.
.
x = dendropy.Tree.get(path="tree2.nex", schema="nexus")
Note
The TreeList, TreeArray, and DataSet classes also support a “read_from_*()” family of instance methods that can be seen as specializations of the “read” method for various types of sources (in fact, the “read” method is a dispatcher that calls one of the methods below to do the actual work):
read_from_stream(src, schema, **kwargs)
Takes a file or file-like object opened for reading the data source as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “read(file=src, schema=schema, ...)”.
read_from_path(src, schema, **kwargs)
Takes a string specifying the path to the data source file as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “read(path=src, schema=schema, ...)”.
read_from_string(src, schema, **kwargs)
Takes a string containing the source data as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “read(data=src, schema=schema, ...)”.
read_from_url(src, schema, **kwargs)
Takes a string containing the URL of the data as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “read(url=src, schema=schema, ...)”.
As with the “read” method, the additional keyword arguments are specific to the given class or schema type.
Writing Out Phylogenetic Data
The Tree, TreeList, CharacterMatrix-derived (e.g., DnaCharacterMatrix, ProteinCharacterMatrix, StandardCharacterMatrix), and DataSet classes all support a “write” instance method for serialization of data to an external destination.
This method takes two mandatory keyword arguments:
- Exactly one of the following to specify the destination:
  - a path to a file (specified using the keyword argument “path”)
  - a file or a file-like object opened for writing (specified using the keyword argument “file”)
- A “schema specification string” given by the keyword argument “schema”, to identify the schema or format for the output.
Alternatively, Tree, TreeList, CharacterMatrix-derived, or DataSet objects may also be represented as a string by calling the “as_string()” method, which requires one mandatory argument, “schema”, giving the “schema specification string” to identify the format of the output.
In either case, the “schema specification string” can be one of: “fasta”, “newick”, “nexus”, “nexml”, or “phylip”.
For example:
tree.write(path="output.tre", schema="newick")
dest = open("output.xml", "w")
tree_list.write(file=dest, schema="nexml")
print(dna_character_matrix.as_string(schema="fasta"))
As with the “get” and “read” methods, further keyword arguments can be specified to control behavior.
These are covered in detail in the “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” section.
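A file-like destination does not have to be a file on disk. The sketch below uses a hypothetical ToyMatrix class (not part of DendroPy, and limited to a FASTA-like layout) to show one way a “write” method that accepts either “path” or “file” and an “as_string” method can share a single serialization routine, with “as_string” writing into an in-memory io.StringIO buffer. How DendroPy organizes this internally is not stated in this section, so treat the structure as an assumption.

```python
import io


class ToyMatrix:
    """Hypothetical character matrix: maps taxon labels to sequences."""

    def __init__(self, sequences):
        self.sequences = dict(sequences)

    def _write_fasta(self, stream):
        # One ">label" line followed by one sequence line per taxon.
        for label, seq in self.sequences.items():
            stream.write(">%s\n%s\n" % (label, seq))

    def write(self, schema, path=None, file=None):
        if schema != "fasta":
            raise ValueError("this toy only writes 'fasta'")
        # Exactly one destination must be given.
        if (path is None) == (file is None):
            raise TypeError("exactly one of 'path' or 'file' is required")
        if path is not None:
            with open(path, "w") as stream:
                self._write_fasta(stream)
        else:
            self._write_fasta(file)

    def as_string(self, schema):
        # Reuse write() by pointing it at an in-memory buffer.
        buffer = io.StringIO()
        self.write(schema=schema, file=buffer)
        return buffer.getvalue()


matrix = ToyMatrix({"t1": "01011", "t2": "11100"})
print(matrix.as_string(schema="fasta"), end="")
```

Because both entry points go through the same routine, the text returned by “as_string” is identical to what “write” would place in a file.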
Note
The Tree, TreeList, CharacterMatrix-derived, and DataSet classes also support a “write_to_*()” family of instance methods that can be seen as specializations of the “write” method for various types of destinations:
write_to_stream(dest, schema, **kwargs)
Takes a file or file-like object opened for writing the data as the first argument, and a string specifying the schema as the second.
write_to_path(dest, schema, **kwargs)
Takes a string specifying the path to the file as the first argument, and a string specifying the schema as the second.
as_string(schema, **kwargs)
Takes a string specifying the schema as the first argument, and returns a string containing the formatted-representation of the data.