Character Matrices¶
Types of Character Matrices¶
The CharacterMatrix
object represents character data in DendroPy.
In most cases, you will not deal with objects of the CharacterMatrix
class directly, but rather with objects of one of the classes specialized to handle specific data types:
DnaCharacterMatrix
, for DNA nucleotide sequence data
RnaCharacterMatrix
, for RNA nucleotide sequence data
ProteinCharacterMatrix
, for amino acid sequence data
StandardCharacterMatrix
, for discrete-value data
ContinuousCharacterMatrix
, for continuous-valued data
The ContinuousCharacterMatrix
class represents its character values directly.
Typically, all other classes represent their character values as special StateIdentity
instances, not as strings.
So, for example, the DNA character “A” is modeled by a special StateIdentity
instance (created by the DendroPy library).
While it is represented by the string “A”, and can be converted to the string and back again, it is not the same as the string “A”.
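The distinction between a state and its string symbol can be illustrated with a minimal plain-Python sketch. The ToyState class and the TOY_DNA lookup table below are hypothetical stand-ins written for this illustration; they are not DendroPy's StateIdentity or StateAlphabet implementation.

```python
class ToyState(object):
    """Hypothetical stand-in for a state object; not DendroPy's StateIdentity."""

    def __init__(self, symbol):
        self.symbol = symbol

    def __str__(self):
        # Converting the state to a string gives back its symbol.
        return self.symbol


# One shared instance per symbol, loosely analogous to a state alphabet.
TOY_DNA = {symbol: ToyState(symbol) for symbol in "ACGT-"}

a = TOY_DNA["A"]          # string -> state
print(str(a))             # state -> string
print(a == "A")           # the state is not the string itself
print(a is TOY_DNA["A"])  # every "A" maps to the same instance
```

The point of the model is only that the symbol is a label for the state, so comparisons against raw strings should go through an explicit conversion.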
Each discrete CharacterMatrix
instance has one or more StateAlphabet
instances associated with it that manage the collection of letters that make up the character data.
In the case of DNA, RNA, protein, and other specialized discrete data, these are predefined by DendroPy: dendropy.DNA_STATE_ALPHABET
, dendropy.RNA_STATE_ALPHABET
, etc.
In the case of “standard” character data, these are created for each matrix separately. Facilities are provided for the creation of custom state alphabets and for the sharing of state alphabets between different StandardCharacterMatrix
instances.
Reading and Writing Character Data¶
As with most other phylogenetic data objects, objects of the CharacterMatrix
-derived classes support the “get
” factory method to populate objects from a data source.
This method takes a data source as the first keyword argument and a schema specification string (“nexus
”, “newick
”, “nexml
”, “fasta
”, or “phylip
”, etc.) as the second:
import dendropy
dna1 = dendropy.DnaCharacterMatrix.get(file=open("pythonidae.fasta"), schema="fasta")
dna2 = dendropy.DnaCharacterMatrix.get(url="http://purl.org/phylo/treebase/phylows/matrix/TB2:M2610?format=nexus", schema="nexus")
aa1 = dendropy.ProteinCharacterMatrix.get(file=open("pythonidae.dat"), schema="phylip")
std1 = dendropy.StandardCharacterMatrix.get(path="python_morph.nex", schema="nexus")
std2 = dendropy.StandardCharacterMatrix.get(data=">t1\n01011\n\n>t2\n11100", schema="fasta")
The “write
” method allows you to write the data of a CharacterMatrix
to a file-like object or a file path:
dna1 = dendropy.DnaCharacterMatrix.get(file=open("pythonidae.nex"), schema="nexus")
dna1.write(path="out.nexml", schema="nexml")
dna1.write(file=open("out.fasta", "w"), schema="fasta")
You can also represent the data as a string using the as_string
method:
dna1 = dendropy.DnaCharacterMatrix.get(file=open("pythonidae.nex"), schema="nexus")
s = dna1.as_string(schema="fasta")
print(s)
In addition, fine-grained control over the reading and writing of data is available through various keyword arguments as described in the Reading and Writing Phylogenetic Data section.
Creating a Character Data Matrix from a Dictionary of Strings¶
The from_dict
factory method creates a new CharacterMatrix
from a dictionary mapping taxon labels to sequences represented as strings:
import dendropy
d = {
    "s1" : "TCCAA",
    "s2" : "TGCAA",
    "s3" : "TG-AA",
}
dna = dendropy.DnaCharacterMatrix.from_dict(d)
Taxon Management with Character Matrices¶
Taxon management with CharacterMatrix
-derived objects works much the same as it does with Tree
or TreeList
objects: every time a CharacterMatrix
-derived object is independently created or read, a new TaxonNamespace
is created, unless an existing one is specified.
Thus, again, if you are creating multiple character matrices that refer to the same set of taxa, you will want to make sure to pass each of them a common TaxonNamespace
reference:
import dendropy
taxa = dendropy.TaxonNamespace()
dna1 = dendropy.DnaCharacterMatrix.get(
    path="pythonidae_cytb.fasta",
    schema="fasta",
    taxon_namespace=taxa)
prot1 = dendropy.ProteinCharacterMatrix.get(
    path="pythonidae_morph.nex",
    schema="nexus",
    taxon_namespace=taxa)
trees = dendropy.TreeList.get(
    path="pythonidae.trees.nex",
    schema="nexus",
    taxon_namespace=taxa)
Concatenating Multiple Data Matrices¶
A new CharacterMatrix
can be created from multiple existing matrices using the concatenate
factory method, which takes a list or other iterable of CharacterMatrix
instances as an argument. All the CharacterMatrix objects in the list must be of the same type and share the same TaxonNamespace reference. All taxa must be present in all alignments, and all alignments must be of the same length. Component parts will be recorded as character subsets.
For example:
import dendropy
taxa = dendropy.TaxonNamespace()
d1 = dendropy.DnaCharacterMatrix.get(
    path="primates.chars.subsets-1stpos.nexus",
    schema="nexus",
    taxon_namespace=taxa)
print("d1: {} sequences, {} characters".format(len(d1), d1.max_sequence_size))
d2 = dendropy.DnaCharacterMatrix.get(
    path="primates.chars.subsets-2ndpos.nexus",
    schema="nexus",
    taxon_namespace=taxa)
print("d2: {} sequences, {} characters".format(len(d2), d2.max_sequence_size))
d3 = dendropy.DnaCharacterMatrix.get(
    path="primates.chars.subsets-3rdpos.nexus",
    schema="nexus",
    taxon_namespace=taxa)
print("d3: {} sequences, {} characters".format(len(d3), d3.max_sequence_size))
d_all = dendropy.DnaCharacterMatrix.concatenate([d1,d2,d3])
print("d_all: {} sequences, {} characters".format(len(d_all), d_all.max_sequence_size))
print("Subsets: {}".format(d_all.character_subsets))
results in
d1: 12 sequences, 231 characters
d2: 12 sequences, 231 characters
d3: 12 sequences, 231 characters
d_all: 12 sequences, 693 characters
Subsets: {'locus002': <dendropy.datamodel.charmatrixmodel.CharacterSubset object at 0x101d792d0>, 'locus000': <dendropy.datamodel.charmatrixmodel.CharacterSubset object at 0x101d79250>, 'locus001': <dendropy.datamodel.charmatrixmodel.CharacterSubset object at 0x101d79290>}
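The requirements described above (every taxon present in every matrix, aligned sequences, and component parts recorded as subsets) can be made concrete with a plain-Python model in which each matrix is a dict mapping taxon labels to sequence strings. The toy_concatenate helper is hypothetical and only mimics the described behavior; it is not DendroPy's implementation, and its subset labels simply follow the locusNNN pattern seen in the output above.

```python
def toy_concatenate(matrices):
    """Concatenate {label: sequence} dicts, recording each part as a subset."""
    labels = set(matrices[0])
    combined = {label: "" for label in labels}
    subsets = {}
    start = 0
    for index, matrix in enumerate(matrices):
        # Every taxon must be present in every matrix.
        if set(matrix) != labels:
            raise ValueError("matrix {} has a different set of taxa".format(index))
        # All sequences in a matrix must have the same length.
        lengths = set(len(seq) for seq in matrix.values())
        if len(lengths) != 1:
            raise ValueError("matrix {} has unequal sequence lengths".format(index))
        width = lengths.pop()
        for label in labels:
            combined[label] += matrix[label]
        # Record which columns of the result came from this matrix.
        subsets["locus{:03d}".format(index)] = range(start, start + width)
        start += width
    return combined, subsets


pos1 = {"t1": "ACG", "t2": "AGG"}
pos2 = {"t1": "GTT", "t2": "GCT"}
combined, subsets = toy_concatenate([pos1, pos2])
print(combined)
print({name: list(cols) for name, cols in sorted(subsets.items())})
```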
You can instantiate a concatenated matrix from multiple sources using the concatenate_from_paths
or concatenate_from_streams
factory methods:
import dendropy
taxa = dendropy.TaxonNamespace()
paths = [
    "primates.chars.subsets-1stpos.nexus",
    "primates.chars.subsets-2ndpos.nexus",
    "primates.chars.subsets-3rdpos.nexus",
]
d_all = dendropy.DnaCharacterMatrix.concatenate_from_paths(
    paths=paths,
    schema="nexus")
print("d_all: {} sequences, {} characters".format(len(d_all), d_all.max_sequence_size))
print("Subsets: {}".format(d_all.character_subsets))
Sequence Management¶
A range of methods also exists for importing data from another matrix object.
These vary depending on how “new” and “existing” sequences are treated. A “new”
sequence is a sequence in the other matrix associated with a Taxon
object for which there is no sequence defined in the current matrix. An
“existing” sequence is a sequence in the other matrix associated with a
Taxon
object for which there is a sequence defined in the
current matrix.
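| | New Sequences: IGNORED | New Sequences: ADDED |
|---|---|---|
| Existing Sequences: IGNORED | [NO-OP] | CharacterMatrix.add_sequences |
| Existing Sequences: OVERWRITTEN | CharacterMatrix.replace_sequences | CharacterMatrix.update_sequences |
| Existing Sequences: EXTENDED | CharacterMatrix.extend_sequences | CharacterMatrix.extend_matrix |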
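The six combinations in the table above can be modeled in plain Python, with each matrix reduced to a dict mapping taxon labels to sequence strings. The toy_import helper and its new and existing parameters are hypothetical names chosen for this sketch; it illustrates the semantics only and is not part of the DendroPy API.

```python
def toy_import(current, other, new="ignore", existing="ignore"):
    """Return a copy of `current` updated from `other` ({label: sequence} dicts).

    new:      "ignore" or "add"
    existing: "ignore", "overwrite", or "extend"
    """
    result = dict(current)
    for label, seq in other.items():
        if label not in current:
            # A "new" sequence: no sequence for this taxon in the current matrix.
            if new == "add":
                result[label] = seq
        elif existing == "overwrite":
            result[label] = seq
        elif existing == "extend":
            result[label] = current[label] + seq
    return result


current = {"t1": "AAAA", "t2": "CCCC"}
other = {"t2": "GGGG", "t3": "TTTT"}

for new in ("ignore", "add"):
    for existing in ("ignore", "overwrite", "extend"):
        print(new, existing, toy_import(current, other, new, existing))
```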
More information can be found in the source documentation for these methods.
In addition, there are methods for selecting or removing sequences, as well as for “filling out” a matrix by adding columns or rows.
Accessing Data¶
A CharacterMatrix
behaves very much like a dictionary, where the “keys” are Taxon
instances, which can be dereferenced using the instance itself, the taxon label, or the index of the taxon in the collection (note: this is not necessarily the same as the accession index, which is the basis for bipartition collection).
For example:
import dendropy
dna = dendropy.DnaCharacterMatrix.get(
    path="primates.chars.nexus",
    schema="nexus")
# access by dereferencing taxon label
s1 = dna["Macaca sylvanus"]
# access by taxon index
s2 = dna[0]
s3 = dna[4]
s4 = dna[-2]
# access by taxon instance
t = dna.taxon_namespace.get_taxon(label="Macaca sylvanus")
s5 = dna[t]
You can also iterate over the matrix in a number of ways:
import dendropy
dna = dendropy.DnaCharacterMatrix.get(
    path="primates.chars.nexus",
    schema="nexus")
# iterate over taxa
for taxon in dna:
    print("{}: {}".format(taxon.label, dna[taxon]))
# iterate over the sequences
for seq in dna.values():
    print(seq)
# iterate over taxon/sequence pairs
for taxon, seq in dna.items():
    print("{}: {}".format(taxon.label, seq))
The “values” returned by dereferencing the “keys” of a CharacterMatrix
object are CharacterDataSequence
objects.
Objects of this class behave very much like lists, where the elements are either numeric values for ContinuousCharacterMatrix
matrices:
import dendropy
cc = dendropy.ContinuousCharacterMatrix.get(
    path="pythonidae_continuous.chars.nexml",
    schema="nexml")
s1 = cc[0]
print(type(s1))
# <class 'dendropy.datamodel.charmatrixmodel.ContinuousCharacterDataSequence'>
print(len(s1))
# 100
for v in s1:
    print("{}, {}".format(type(v), str(v)))
# <type 'float'>, -0.0230088801573
# <type 'float'>, -0.327376261257
# <type 'float'>, -0.483676644025
# ...
# ...
print(s1.values())
# [-0.0230088801573, -0.327376261257, -0.483676644025, ...
print(s1.symbols_as_list())
# ['-0.0230088801573', '-0.327376261257', '-0.483676644025', ...
print(s1.symbols_as_string())
# -0.0230088801573 -0.327376261257 -0.483676644025 0.0868649474847 ...
or StateIdentity
instances for all other types of matrices:
import dendropy
dna = dendropy.DnaCharacterMatrix.get(
    path="primates.chars.nexus",
    schema="nexus")
s1 = dna[0]
print(type(s1))
# <class 'dendropy.datamodel.charmatrixmodel.DnaCharacterDataSequence'>
print(len(s1))
# 898
for v in s1:
    print("{}, {}".format(repr(v), str(v)))
# <StateIdentity at 0x10134a290: 'A'>, A
# <StateIdentity at 0x10134a290: 'A'>, A
# <StateIdentity at 0x10134a350: 'G'>, G
# ...
# ...
print(s1.values())
# [<StateIdentity at 0x101b4a290: 'A'>, <StateIdentity at 0x101b4a290: 'A'>, <StateIdentity at 0x101b4a350: 'G'>, ...
print(s1.symbols_as_list())
# ['A', 'A', 'G', 'C', 'T', 'T', 'C', 'A', 'T', ...
print(s1.symbols_as_string())
# AAGCTTCATAGGAGCAACCATTCT ...
As can be seen, you can use values
to get a list of the values of the sequence directly, symbols_as_list
to get a list of the values represented as strings, and symbols_as_string
to get the string representation of the whole sequence.