Character Matrices

Types of Character Matrices

The CharacterMatrix object represents character data in DendroPy. In most cases, you will not deal with objects of the CharacterMatrix class directly, but rather with objects of one of the classes specialized to handle specific data types, such as DnaCharacterMatrix, ProteinCharacterMatrix, StandardCharacterMatrix, and ContinuousCharacterMatrix.

The ContinuousCharacterMatrix class represents its character values directly. All other classes typically represent their character values as special StateIdentity instances, not as strings. So, for example, the DNA character “A” is modeled by a special StateIdentity instance (created by the DendroPy library). While it is represented by the string “A”, and can be converted to the string and back again, it is not the same as the string “A”. Each discrete CharacterMatrix instance has one or more StateAlphabet instances associated with it that manage the collection of letters that make up the character data. For DNA, RNA, protein, and other specialized discrete data, these are predefined by DendroPy: dendropy.DNA_STATE_ALPHABET, dendropy.RNA_STATE_ALPHABET, etc. For “standard” character data, they are created separately for each matrix. Facilities are provided for creating custom state alphabets and for sharing state alphabets between different StandardCharacterMatrix instances.
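The difference between a state object and its string symbol can be illustrated with a toy model. The sketch below is plain Python, not DendroPy code: ToyState and ToyAlphabet are hypothetical stand-ins for StateIdentity and StateAlphabet, and the real classes do considerably more than this.

```python
class ToyState:
    """Hypothetical stand-in for a state object: has a symbol, but is not a string."""

    def __init__(self, symbol):
        self.symbol = symbol

    def __str__(self):
        return self.symbol

    def __repr__(self):
        return "<ToyState {!r}>".format(self.symbol)


class ToyAlphabet:
    """Hypothetical stand-in for an alphabet: one shared state object per symbol."""

    def __init__(self, symbols):
        self._states = {s: ToyState(s) for s in symbols}

    def __getitem__(self, symbol):
        return self._states[symbol]

    def encode(self, text):
        # Convert a string into a list of state objects.
        return [self._states[ch] for ch in text]


dna = ToyAlphabet("ACGT-")
seq = dna.encode("AAG")

a = dna["A"]
print(repr(a), str(a))   # the state object and its string form
print(a == "A")          # False: the state is not the string "A"
print(str(a) == "A")     # True: but it converts to the string
print(seq[0] is seq[1])  # True: both positions share one state object
print("".join(str(s) for s in seq))  # back to the string "AAG"
```

In this model, sharing an alphabet between two matrices would simply mean passing the same ToyAlphabet object to both, so that equal symbols map to the very same state objects.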

Reading and Writing Character Data

As with most other phylogenetic data objects, objects of the CharacterMatrix-derived classes support the “get” factory method to populate objects from a data source. This method takes a data source as the first keyword argument and a schema specification string (“nexus”, “newick”, “nexml”, “fasta”, “phylip”, etc.) as the second:

import dendropy
dna1 = dendropy.DnaCharacterMatrix.get(file=open("pythonidae.fasta"), schema="fasta")
dna2 = dendropy.DnaCharacterMatrix.get(url="http://purl.org/phylo/treebase/phylows/matrix/TB2:M2610?format=nexus", schema="nexus")
aa1 = dendropy.ProteinCharacterMatrix.get(file=open("pythonidae.dat"), schema="phylip")
std1 = dendropy.StandardCharacterMatrix.get(path="python_morph.nex", schema="nexus")
std2 = dendropy.StandardCharacterMatrix.get(data=">t1\n01011\n\n>t2\n11100", schema="fasta")

The “write” method allows you to write the data of a CharacterMatrix to a file-like object or a file path:

dna1 = dendropy.DnaCharacterMatrix.get(file=open("pythonidae.nex"), schema="nexus")
dna1.write(path="out.nexml", schema="nexml")
dna1.write(file=open("out.fasta", "w"), schema="fasta")

You can also represent the data as a string using the as_string method:

dna1 = dendropy.DnaCharacterMatrix.get(file=open("pythonidae.nex"), schema="nexus")
s = dna1.as_string(schema="fasta")
print(s)

In addition, fine-grained control over the reading and writing of data is available through various keyword arguments as described in the Reading and Writing Phylogenetic Data section.

Creating a Character Data Matrix from a Dictionary of Strings

The from_dict factory method creates a new CharacterMatrix from a dictionary mapping taxon labels to sequences represented as strings:

import dendropy
d = {
        "s1" : "TCCAA",
        "s2" : "TGCAA",
        "s3" : "TG-AA",
}
dna = dendropy.DnaCharacterMatrix.from_dict(d)

Taxon Management with Character Matrices

Taxon management with CharacterMatrix-derived objects works very much the same as it does with Tree or TreeList objects: every time a CharacterMatrix-derived object is independently created or read, a new TaxonNamespace is created, unless an existing one is specified. Thus, again, if you are creating multiple character matrices that refer to the same set of taxa, you will want to make sure to pass each of them a common TaxonNamespace reference:

import dendropy
taxa = dendropy.TaxonNamespace()
dna1 = dendropy.DnaCharacterMatrix.get(
    path="pythonidae_cytb.fasta",
    schema="fasta",
    taxon_namespace=taxa)
prot1 = dendropy.ProteinCharacterMatrix.get(
    path="pythonidae_morph.nex",
    schema="nexus",
    taxon_namespace=taxa)
trees = dendropy.TreeList.get(
    path="pythonidae.trees.nex",
    schema="nexus",
    taxon_namespace=taxa)
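Why the shared reference matters can be seen in a toy model. The following is a plain-Python sketch, not DendroPy code: ToyTaxon, ToyNamespace, and read_matrix are hypothetical names, and the only behavior modeled is that a fresh namespace creates fresh taxon objects even for labels that have been seen elsewhere.

```python
class ToyTaxon:
    """Hypothetical stand-in for a taxon: compared by object identity, not by label."""

    def __init__(self, label):
        self.label = label


class ToyNamespace:
    """Hypothetical stand-in for a taxon namespace: one taxon object per label."""

    def __init__(self):
        self._taxa = {}

    def require_taxon(self, label):
        # Reuse the existing object for this label, or create a new one.
        if label not in self._taxa:
            self._taxa[label] = ToyTaxon(label)
        return self._taxa[label]


def read_matrix(rows, namespace=None):
    """Build a dict keyed by taxon objects; make a fresh namespace unless one is given."""
    if namespace is None:
        namespace = ToyNamespace()
    return {namespace.require_taxon(label): seq for label, seq in rows.items()}


rows_a = {"Python regius": "ACGT"}
rows_b = {"Python regius": "0110"}

# Independent reads: same label, but two unrelated taxon objects.
m1 = read_matrix(rows_a)
m2 = read_matrix(rows_b)
print(set(m1) == set(m2))  # False

# Shared namespace: both matrices are keyed by the same taxon object.
shared = ToyNamespace()
m3 = read_matrix(rows_a, shared)
m4 = read_matrix(rows_b, shared)
print(set(m3) == set(m4))  # True
```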

Concatenating Multiple Data Matrices

A new CharacterMatrix can be created from multiple existing matrices using the concatenate factory method, which takes a list or other iterable of CharacterMatrix instances as an argument.

All the CharacterMatrix objects in the list must be of the same type and share the same TaxonNamespace reference. All taxa must be present in all alignments, and all alignments must be of the same length. Component parts will be recorded as character subsets.

For example:

import dendropy
taxa = dendropy.TaxonNamespace()
d1 = dendropy.DnaCharacterMatrix.get(
        path="primates.chars.subsets-1stpos.nexus",
        schema="nexus",
        taxon_namespace=taxa)
print("d1: {} sequences, {} characters".format(len(d1), d1.max_sequence_size))
d2 = dendropy.DnaCharacterMatrix.get(
        path="primates.chars.subsets-2ndpos.nexus",
        schema="nexus",
        taxon_namespace=taxa)
print("d2: {} sequences, {} characters".format(len(d2), d2.max_sequence_size))
d3 = dendropy.DnaCharacterMatrix.get(
        path="primates.chars.subsets-3rdpos.nexus",
        schema="nexus",
        taxon_namespace=taxa)
print("d3: {} sequences, {} characters".format(len(d3), d3.max_sequence_size))
d_all = dendropy.DnaCharacterMatrix.concatenate([d1,d2,d3])
print("d_all: {} sequences, {} characters".format(len(d_all), d_all.max_sequence_size))
print("Subsets: {}".format(d_all.character_subsets))

results in

d1: 12 sequences, 231 characters
d2: 12 sequences, 231 characters
d3: 12 sequences, 231 characters
d_all: 12 sequences, 693 characters
Subsets: {'locus002': <dendropy.datamodel.charmatrixmodel.CharacterSubset object at 0x101d792d0>, 'locus000': <dendropy.datamodel.charmatrixmodel.CharacterSubset object at 0x101d79250>, 'locus001': <dendropy.datamodel.charmatrixmodel.CharacterSubset object at 0x101d79290>}

You can instantiate a concatenated matrix from multiple sources using the concatenate_from_paths or concatenate_from_streams factory methods:

import dendropy
taxa = dendropy.TaxonNamespace()
paths = [
        "primates.chars.subsets-1stpos.nexus",
        "primates.chars.subsets-2ndpos.nexus",
        "primates.chars.subsets-3rdpos.nexus",
        ]
d_all = dendropy.DnaCharacterMatrix.concatenate_from_paths(
        paths=paths,
        schema="nexus")
print("d_all: {} sequences, {} characters".format(len(d_all), d_all.max_sequence_size))
print("Subsets: {}".format(d_all.character_subsets))
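The bookkeeping behind concatenation can be sketched in plain Python. This is a toy model rather than DendroPy's implementation: toy_concatenate is a hypothetical function, matrices are modeled as dicts mapping taxon labels to strings, and it assumes the length requirement means that the sequences within each component are equal in length. The "locusNNN" subset names simply mirror the ones visible in the output shown earlier.

```python
def toy_concatenate(matrices):
    """Join matrices column-wise; each is a dict mapping taxon label -> sequence string."""
    labels = set(matrices[0])
    combined = {label: "" for label in matrices[0]}
    subsets = {}
    offset = 0
    for i, m in enumerate(matrices):
        if set(m) != labels:
            raise ValueError("all taxa must be present in all alignments")
        lengths = {len(seq) for seq in m.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within an alignment must be equal in length")
        width = lengths.pop()
        for label in combined:
            combined[label] += m[label]
        # Record which columns of the result came from this component.
        subsets["locus{:03d}".format(i)] = range(offset, offset + width)
        offset += width
    return combined, subsets


first = {"s1": "AC", "s2": "AG"}
second = {"s1": "TTT", "s2": "TCT"}
combined, subsets = toy_concatenate([first, second])
print(combined)
print({name: list(cols) for name, cols in subsets.items()})
```

Each subset is just a range of column indices, which is enough to recover any component from the concatenated sequences later.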

Sequence Management

A range of methods also exists for importing data from another matrix object. These vary depending on how “new” and “existing” sequences are treated. A “new” sequence is a sequence in the other matrix associated with a Taxon object for which there is no sequence defined in the current matrix. An “existing” sequence is a sequence in the other matrix associated with a Taxon object for which there is a sequence defined in the current matrix.

                                 New Sequences: IGNORED   New Sequences: ADDED
Existing Sequences: IGNORED      [NO-OP]                  add_sequences
Existing Sequences: OVERWRITTEN  replace_sequences        update_sequences
Existing Sequences: EXTENDED     extend_sequences         extend_matrix

More information on each of these methods can be found in the source documentation.

In addition, there are methods for selectively removing sequences, as well as for “filling out” a matrix by adding columns or rows.
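The combinations of “new” and “existing” policies listed above can be modeled in a few lines of plain Python. This is only a sketch of the semantics, not DendroPy code: the merge function and the POLICIES mapping are hypothetical, and matrices are modeled as dicts of strings keyed by label rather than by Taxon objects.

```python
def merge(current, other, existing="ignore", new="ignore"):
    """Return a copy of `current` (label -> sequence string) updated from `other`.

    existing: "ignore", "overwrite", or "extend"
    new:      "ignore" or "add"
    """
    result = dict(current)
    for label, seq in other.items():
        if label in current:
            if existing == "overwrite":
                result[label] = seq
            elif existing == "extend":
                result[label] = current[label] + seq
        elif new == "add":
            result[label] = seq
    return result


# The method names from the table, mapped onto the two policies.
POLICIES = {
    "add_sequences":     {"existing": "ignore",    "new": "add"},
    "replace_sequences": {"existing": "overwrite", "new": "ignore"},
    "update_sequences":  {"existing": "overwrite", "new": "add"},
    "extend_sequences":  {"existing": "extend",    "new": "ignore"},
    "extend_matrix":     {"existing": "extend",    "new": "add"},
}

current = {"s1": "AAA", "s2": "CCC"}
other = {"s2": "GGG", "s3": "TTT"}
for name, policy in POLICIES.items():
    print(name, merge(current, other, **policy))
```

Here "s2" plays the role of an existing sequence and "s3" of a new one, so each printed row shows one cell of the table in action.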

Accessing Data

A CharacterMatrix behaves very much like a dictionary, where the “keys” are Taxon instances, which can be dereferenced using the instance itself, the taxon label, or the index of the taxon in the collection (note: this is not necessarily the same as the accession index, which is the basis for bipartition collection).

For example:

import dendropy

dna = dendropy.DnaCharacterMatrix.get(
        path="primates.chars.nexus",
        schema="nexus")

# access by dereferencing taxon label
s1 = dna["Macaca sylvanus"]

# access by taxon index
s2 = dna[0]
s3 = dna[4]
s4 = dna[-2]

# access by taxon instance
t = dna.taxon_namespace.get_taxon(label="Macaca sylvanus")
s5 = dna[t]

You can also iterate over the matrix in a number of ways:

import dendropy

dna = dendropy.DnaCharacterMatrix.get(
        path="primates.chars.nexus",
        schema="nexus")

# iterate over taxa
for taxon in dna:
    print("{}: {}".format(taxon.label, dna[taxon]))

# iterate over the sequences
for seq in dna.values():
    print(seq)

# iterate over taxon/sequence pairs
for taxon, seq in dna.items():
    print("{}: {}".format(taxon.label, seq))

The “values” returned by dereferencing the “keys” of a CharacterMatrix object are CharacterDataSequence objects. Objects of this class behave very much like lists, where the elements are either numeric values for ContinuousCharacterMatrix matrices:

import dendropy

cc = dendropy.ContinuousCharacterMatrix.get(
        path="pythonidae_continuous.chars.nexml",
        schema="nexml")

s1 = cc[0]

print(type(s1))
# <class 'dendropy.datamodel.charmatrixmodel.ContinuousCharacterDataSequence'>

print(len(s1))
# 100

for v in s1:
    print("{}, {}".format(type(v), str(v)))
# <type 'float'>, -0.0230088801573
# <type 'float'>, -0.327376261257
# <type 'float'>, -0.483676644025
# ...
# ...

print(s1.values())
# [-0.0230088801573, -0.327376261257, -0.483676644025, ...

print(s1.symbols_as_list())
# ['-0.0230088801573', '-0.327376261257', '-0.483676644025', ...

print(s1.symbols_as_string())
# -0.0230088801573 -0.327376261257 -0.483676644025 0.0868649474847 ...


or StateIdentity instances for all other types of matrices:

import dendropy

dna = dendropy.DnaCharacterMatrix.get(
        path="primates.chars.nexus",
        schema="nexus")

s1 = dna[0]

print(type(s1))
# <class 'dendropy.datamodel.charmatrixmodel.DnaCharacterDataSequence'>

print(len(s1))
# 898

for v in s1:
    print("{}, {}".format(repr(v), str(v)))
# <StateIdentity at 0x10134a290: 'A'>, A
# <StateIdentity at 0x10134a290: 'A'>, A
# <StateIdentity at 0x10134a350: 'G'>, G
# ...
# ...

print(s1.values())
# [<StateIdentity at 0x101b4a290: 'A'>, <StateIdentity at 0x101b4a290: 'A'>, <StateIdentity at 0x101b4a350: 'G'>, ...

print(s1.symbols_as_list())
# ['A', 'A', 'G', 'C', 'T', 'T', 'C', 'A', 'T', ...

print(s1.symbols_as_string())
# AAGCTTCATAGGAGCAACCATTCT ...


As can be seen, you can use values to get a list of the values of the sequence directly, symbols_as_list to get a list of the values represented as strings, and symbols_as_string to get the string representation of the whole sequence.