dendropy.datamodel.charmatrixmodel: Character Sequences and Matrices

Character Sequences

class dendropy.datamodel.charmatrixmodel.CharacterDataSequence(character_values=None, character_types=None, character_annotations=None)[source]

A sequence of character values for a particular taxon or entry in a data matrix.

Objects of this class can be (almost) treated as simple lists, where the elements are the values of characters (typically, real values in the case of continuous data, and special instances of StateIdentity objects in the case of discrete data).

Character type data (represented by CharacterType instances) and metadata annotations (represented by AnnotationSet instances), if any, are maintained in parallel lists that need to be accessed separately using the index of the value to which the data correspond. So, for example, the AnnotationSet object containing the metadata annotations for the first value in a sequence, s[0], is available through s.annotations_at(0), while the character type information for that first element is available through s.character_type_at(0) and can be set through s.set_character_type_at(0, c).

In most cases where metadata annotations and character type information are not needed, treating objects of this class as a simple list provides all the functionality needed. Where metadata annotations or character type information are required, all the standard list mutation methods (e.g., CharacterDataSequence.insert, CharacterDataSequence.append, CharacterDataSequence.extend) also take optional character_type and character_annotations arguments in addition to the primary character_value argument, thus allowing the value, character type, and annotation set to be set simultaneously. While iteration over character values is available through the standard list iteration interface, the method CharacterDataSequence.cell_iter() provides for iterating over <character-value, character-type, character-annotation-set> triplets.

Parameters:

character_values (iterable of values) – A set of values for this sequence.
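
The parallel-list layout described above can be pictured with a small stand-in written in plain Python. This is only a sketch and not DendroPy's implementation: the class name ToySequence and its private attributes are hypothetical, the character type and annotations are represented by a plain string and a plain dict, and only the method names append, character_type_at, annotations_at, and cell_iter are borrowed from this page.

```python
class ToySequence:
    """Hypothetical stand-in: values plus two parallel lists of metadata."""

    def __init__(self):
        self._values = []
        self._types = []        # one entry per value (None if unset)
        self._annotations = []  # one entry per value (None if unset)

    def append(self, character_value, character_type=None, character_annotations=None):
        # All three lists grow together, so index i always refers to one cell.
        self._values.append(character_value)
        self._types.append(character_type)
        self._annotations.append(character_annotations)

    def __getitem__(self, idx):
        return self._values[idx]

    def __len__(self):
        return len(self._values)

    def __iter__(self):
        # Plain iteration yields the values only.
        return iter(self._values)

    def character_type_at(self, idx):
        return self._types[idx]

    def annotations_at(self, idx):
        return self._annotations[idx]

    def cell_iter(self):
        # Yields (value, type, annotations) triplets.
        return zip(self._values, self._types, self._annotations)


s = ToySequence()
s.append(0.25, character_type="length", character_annotations={"unit": "mm"})
s.append(1.50)

values_only = list(s)
cells = list(s.cell_iter())
```

Treating the object as a list touches only the values; the metadata is reached through the index-based accessors or through cell_iter().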

annotations_at(idx)[source]

Return metadata annotations of character at idx.

Parameters:

idx (integer) – Index of element annotations to return.

Returns:

c (AnnotationSet) – AnnotationSet representing metadata annotations of character at index idx.

append(character_value, character_type=None, character_annotations=None)[source]

Adds a value to self.

Parameters:
  • character_value (object) – Value to be stored.

  • character_type (CharacterType) – Description of character value.

  • character_annotations (AnnotationSet) – Metadata annotations associated with this character.

cell_iter()[source]

Iterate over triplets of character values and associated CharacterType and AnnotationSet instances.

character_type_at(idx)[source]

Return type of character at idx.

Parameters:

idx (integer) – Index of element character type to return.

Returns:

c (CharacterType) – CharacterType associated with character index idx.

extend(character_values, character_types=None, character_annotations=None)[source]

Extends self with values.

Parameters:
  • character_values (iterable of objects) – Values to be stored.

  • character_types (iterable of CharacterType objects) – Descriptions of character values.

  • character_annotations (iterable of AnnotationSet objects) – Metadata annotations associated with characters.

has_annotations_at(idx)[source]

Return True if character at idx has metadata annotations.

Parameters:

idx (integer) – Index of element annotations to check.

Returns:

b (bool) – True if character at idx has metadata annotations, False otherwise.

insert(idx, character_value, character_type=None, character_annotations=None)[source]

Insert value and associated character type and metadata annotations for element at idx.

Parameters:
  • idx (integer) – Index of element to set.

  • character_value (object) – Value to be stored.

  • character_type (CharacterType) – Description of character value.

  • character_annotations (AnnotationSet) – Metadata annotations associated with this character.

set_annotations_at(idx, annotations)[source]

Set metadata annotations of character at idx.

Parameters:

idx (integer) – Index of element annotations to set.

set_at(idx, character_value, character_type=None, character_annotations=None)[source]

Set value and associated character type and metadata annotations for element at idx.

Parameters:
  • idx (integer) – Index of element to set.

  • character_value (object) – Value to be stored.

  • character_type (CharacterType) – Description of character value.

  • character_annotations (AnnotationSet) – Metadata annotations associated with this character.

set_character_type_at(idx, character_type)[source]

Set type of character at idx.

Parameters:

idx (integer) – Index of element character type to set.

symbols_as_list()[source]

Returns list of string representations of values of this vector.

Returns:

v (list) – List of string representations of values making up this vector.

symbols_as_string(sep='')[source]

Returns values of this vector as a single string, with individual value elements separated by sep.

Returns:

s (string) – String representation of values making up this vector.

value_at(idx)[source]

Return value of character at idx.

Parameters:

idx (integer) – Index of element value to return.

Returns:

c (object) – Value of character at index idx.

values()[source]

Returns list of values of this vector.

Returns:

v (list) – List of values making up this vector.

Character Types

class dendropy.datamodel.charmatrixmodel.CharacterType(label=None, state_alphabet=None)[source]

A character format or type of a particular column: i.e., maps a particular set of character state definitions to a column in a character matrix.

property state_alphabet

The StateAlphabet representing the state alphabet for this column: i.e., the collection of symbols and the state identities to which they map.

taxon_namespace_scoped_copy(memo=None)[source]

Cloning level: 1. Taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon objects: these are preserved as references.

Character Subsets

class dendropy.datamodel.charmatrixmodel.CharacterSubset(label=None, character_indices=None)[source]

Tracks definition of a subset of characters.

Parameters:
  • label (str) – Name of this subset.

  • character_indices (iterable of int) – Iterable of 0-based (integer) indices of column positions that constitute this subset.

Character Matrices

The CharacterMatrix Class

class dendropy.datamodel.charmatrixmodel.CharacterMatrix(*args, **kwargs)[source]

A data structure that manages association of operational taxonomic unit concepts to sequences of character state identities or values.

This is a base class that provides general functionality; derived classes specialize for particular data types. You will not be using the class directly, but rather one of the derived classes below, specialized for data types such as DNA, RNA, continuous, etc.

This class and derived classes behave like a dictionary where the keys are Taxon objects and the values are CharacterDataSequence objects. Access to sequences based on taxon labels as well as indexes is also provided. Numerous methods are provided to manipulate and iterate over sequences. Character partitions can be managed through CharacterSubset objects, while management of detailed metadata on character types is available through CharacterType objects.

Objects can be instantiated by reading data from external sources through the usual get_from_stream(), get_from_path(), or get_from_string() functions. In addition, a single matrix object can be instantiated from multiple matrices (concatenate()) or data sources (concatenate_from_paths).

A range of methods also exist for importing data from another matrix object. These vary depending on how “new” and “existing” are treated. A “new” sequence is a sequence in the other matrix associated with a Taxon object for which there is no sequence defined in the current matrix. An “existing” sequence is a sequence in the other matrix associated with a Taxon object for which there is a sequence defined in the current matrix.

                                   New Sequences: IGNORED               New Sequences: ADDED

Existing Sequences: IGNORED        [NO-OP]                              CharacterMatrix.add_sequences

Existing Sequences: OVERWRITTEN    CharacterMatrix.replace_sequences    CharacterMatrix.update_sequences

Existing Sequences: EXTENDED       CharacterMatrix.extend_sequences     CharacterMatrix.extend_matrix
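
The combinations of "new" and "existing" handling listed above can be modeled with plain dictionaries. The sketch below is an illustration only: import_sequences and its existing and add_new parameters are hypothetical names that do not exist in DendroPy, taxa are represented by strings, and the function returns a new dictionary whereas the real methods modify the matrix in place.

```python
def import_sequences(current, other, existing="ignore", add_new=False):
    """Toy model of the import table; `current` and `other` map taxon -> list."""
    result = {taxon: list(seq) for taxon, seq in current.items()}
    for taxon, seq in other.items():
        if taxon in result:
            if existing == "overwrite":
                result[taxon] = list(seq)
            elif existing == "extend":
                result[taxon].extend(seq)
            # existing == "ignore": leave the current sequence untouched
        elif add_new:
            result[taxon] = list(seq)
    return result


current = {"t1": ["A", "C"], "t2": ["G", "T"]}
other = {"t2": ["A", "A"], "t3": ["C", "C"]}

# Each call mirrors one cell of the table.
added = import_sequences(current, other, existing="ignore", add_new=True)       # add_sequences
replaced = import_sequences(current, other, existing="overwrite")               # replace_sequences
updated = import_sequences(current, other, existing="overwrite", add_new=True)  # update_sequences
extended = import_sequences(current, other, existing="extend")                  # extend_sequences
ext_matrix = import_sequences(current, other, existing="extend", add_new=True)  # extend_matrix
```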

If character subsets have been defined, these subsets can be exported to independent matrices.

__delitem__(key)[source]

Removes sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

__getitem__(key)[source]

Retrieves sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with the specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

Returns:

s (CharacterDataSequence) – A sequence associated with the Taxon instance referenced by key.
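
The three kinds of key accepted here can be illustrated with a plain-Python model. Everything in this sketch is hypothetical (ToyTaxon, resolve_key, get_sequence, and the use of KeyError for unknown keys are assumptions, not DendroPy behavior); it only mirrors the rules stated above: integers index the namespace, strings match labels, and a missing sequence is created on first access.

```python
class ToyTaxon:
    """Hypothetical stand-in for a Taxon: just a label."""

    def __init__(self, label):
        self.label = label


def resolve_key(key, namespace):
    """Map an integer, string, or taxon object to a taxon in `namespace` (a list)."""
    if isinstance(key, int):
        return namespace[key]            # index into the namespace
    if isinstance(key, str):
        for taxon in namespace:          # first taxon with a matching label
            if taxon.label == key:
                return taxon
        raise KeyError(key)
    if key in namespace:                 # already a taxon object
        return key
    raise KeyError(key)


namespace = [ToyTaxon("s1"), ToyTaxon("s2")]
sequences = {namespace[0]: list("TCCAA")}


def get_sequence(key):
    taxon = resolve_key(key, namespace)
    # A sequence that does not exist yet is created on first access.
    return sequences.setdefault(taxon, [])


by_index = get_sequence(0)
by_label = get_sequence("s1")
by_taxon = get_sequence(namespace[0])
created = get_sequence("s2")
```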

__iter__()[source]

Returns an iterator over character map’s ordered keys.

__len__()[source]

Number of sequences in matrix.

Returns:

n (integer) – Number of sequences in matrix.

__setitem__(key, values)[source]

Assigns sequence values to taxon specified by key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with the specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

add_character_subset(char_subset)[source]

Adds a CharacterSubset object. Raises an error if one already exists with the same label.

add_sequences(other_matrix)[source]

Adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to add sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self as a shallow-copy.

  4. All other sequences will be ignored.

as_string(schema, **kwargs)

Composes and returns string representation of the data.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Specification of data format (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

character_sequence_type

alias of CharacterDataSequence

clear()[source]

Removes all sequences from matrix.

clone(depth=1)

Creates and returns a copy of self.

Parameters:

depth (integer) –

The depth of the copy:

  • 0: shallow-copy: All member objects are references, except for the annotation_set of the top-level object and member Annotation objects: these are full, independent instances (though any complex objects in the value field of Annotation objects are also just references).

  • 1: taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon instances: these are references.

  • 2: Exhaustive deep-copy: all objects are cloned.
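
The difference between the three depths can be sketched with Python's standard copy module. This is a model under stated assumptions, not DendroPy's code: ToyTaxon and ToyMatrix are hypothetical classes, annotations are left out entirely, and the taxon-namespace-scoped copy is imitated by seeding the deepcopy memo so that the namespace and its taxa are returned as-is instead of being copied.

```python
import copy


class ToyTaxon:
    def __init__(self, label):
        self.label = label


class ToyMatrix:
    def __init__(self, taxa, sequences):
        self.taxon_namespace = taxa   # list of ToyTaxon
        self.sequences = sequences    # dict: ToyTaxon -> list of values


taxa = [ToyTaxon("s1"), ToyTaxon("s2")]
original = ToyMatrix(taxa, {taxa[0]: ["A", "C"], taxa[1]: ["G", "T"]})

# Like depth 0: a new top-level object whose members are shared references.
shallow = copy.copy(original)

# Like depth 1: seed the memo so the namespace and taxa are kept as
# references while everything else is copied.
memo = {id(taxa): taxa}
memo.update({id(t): t for t in taxa})
scoped = copy.deepcopy(original, memo)

# Like depth 2: everything is copied, including the taxa.
deep = copy.deepcopy(original)
```

After these calls, scoped has its own sequence lists but is keyed by the same taxon objects as original, while deep shares nothing with it.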

coerce_values(values)[source]

Converts elements of values to type of matrix.

This method is called by CharacterMatrix.from_dict to create sequences from iterables of values. This method should be overridden by derived classes to ensure that values consists of types compatible with the particular type of matrix. For example, a CharacterMatrix type with a fixed state alphabet (such as DnaCharacterMatrix) would dereference the string elements of values to return a list of StateIdentity objects corresponding to the symbols represented by the strings. If there is no value-type conversion done, then values should be returned as-is. If no value-type conversion is possible (e.g., when the type of a value is dependent on positional information), then a TypeError should be raised.

Parameters:

values (iterable) – Iterable of values to be converted.

Returns:

v (list) – List of values.

classmethod concatenate(char_matrices)[source]

Creates and returns a single character matrix from multiple CharacterMatrix objects specified as a list, ‘char_matrices’. All the CharacterMatrix objects in the list must be of the same type, and share the same TaxonNamespace reference. All taxa must be present in all alignments, and all alignments must be of the same length. Component parts will be recorded as character subsets.
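
How concatenation relates to character subsets can be shown with a dictionary-based model. This is a sketch only: the concatenate function below is a hypothetical stand-in that works on plain dicts of lists, the subset labels are supplied by the caller (the labels DendroPy assigns are not specified here), and it assumes every row within one input has the same length.

```python
def concatenate(matrices, labels):
    """Toy model: `matrices` are dicts mapping taxon label -> list of values."""
    taxa = set(matrices[0])
    combined = {taxon: [] for taxon in matrices[0]}
    subsets = {}
    offset = 0
    for label, matrix in zip(labels, matrices):
        if set(matrix) != taxa:
            raise ValueError("every taxon must be present in every matrix")
        width = len(next(iter(matrix.values())))
        for taxon, seq in matrix.items():
            combined[taxon].extend(seq)
        # Record which 0-based columns of the result came from this part.
        subsets[label] = list(range(offset, offset + width))
        offset += width
    return combined, subsets


gene1 = {"s1": list("ACG"), "s2": list("ACT")}
gene2 = {"s1": list("TT"), "s2": list("TA")}
combined, subsets = concatenate([gene1, gene2], ["gene1", "gene2"])
```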

classmethod concatenate_from_paths(paths, schema, **kwargs)[source]

Read a character matrix from each file path given in paths, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

classmethod concatenate_from_streams(streams, schema, **kwargs)[source]

Read a character matrix from each file object given in streams, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

copy_annotations_from(other, attribute_object_mapper=None)

Copies annotations from other, which must be of Annotable type.

Copies are deep-copies, in that the Annotation objects added to the annotation_set AnnotationSet collection of self are independent copies of those in the annotation_set collection of other. However, dynamic bound-attribute annotations retain references to the original objects as given in other, which may or may not be desirable. This is handled by updating the objects to which attributes are bound via mappings found in attribute_object_mapper. In dynamic bound-attribute annotations, the _value attribute of the annotations object (Annotation._value) is a tuple consisting of “(obj, attr_name)”, which instructs the Annotation object to return “getattr(obj, attr_name)” (via: “getattr(*self._value)”) when returning the value of the Annotation. “obj” is typically the object to which the AnnotationSet belongs (i.e., self). When a copy of Annotation is created, the object reference given in the first element of the _value tuple of dynamic bound-attribute annotations is unchanged, unless the id of the object reference is found in attribute_object_mapper.

Parameters:
  • other (Annotable) – Source of annotations to copy.

  • attribute_object_mapper (dict) – Like the memo of __deepcopy__, maps object id’s to objects. The purpose of this is to update the parent or owner objects of dynamic attribute annotations. If a dynamic attribute Annotation gives object x as the parent or owner of the attribute (that is, the first element of the Annotation._value tuple is other) and id(x) is found in attribute_object_mapper, then in the copy the owner of the attribute is changed to attribute_object_mapper[id(x)]. If attribute_object_mapper is None (default), then the following mapping is automatically inserted: id(other): self. That is, any references to other in any Annotation object will be remapped to self. If really no reattribution mappings are desired, then an empty dictionary should be passed instead.
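
The remapping of dynamic bound-attribute annotations can be illustrated with a small model. The classes ToyAnnotation and ToyObject and the function copy_annotations are hypothetical; only the (obj, attr_name) tuple, the getattr(*self._value) lookup, and the role of attribute_object_mapper follow the description above.

```python
class ToyAnnotation:
    """Hypothetical stand-in for a dynamic bound-attribute annotation."""

    def __init__(self, name, obj, attr_name):
        self.name = name
        self._value = (obj, attr_name)

    @property
    def value(self):
        # Resolved at access time, as in getattr(*self._value).
        return getattr(*self._value)


class ToyObject:
    def __init__(self, label):
        self.label = label
        self.annotations = []


def copy_annotations(source, target, attribute_object_mapper=None):
    if attribute_object_mapper is None:
        # Default: references to the source are remapped to the target.
        attribute_object_mapper = {id(source): target}
    for ann in source.annotations:
        obj, attr_name = ann._value
        obj = attribute_object_mapper.get(id(obj), obj)
        target.annotations.append(ToyAnnotation(ann.name, obj, attr_name))


a = ToyObject("alpha")
a.annotations.append(ToyAnnotation("name", a, "label"))

b = ToyObject("beta")
copy_annotations(a, b)                              # remapped: follows b.label

c = ToyObject("gamma")
copy_annotations(a, c, attribute_object_mapper={})  # no remapping: still a.label

remapped_value = b.annotations[0].value
unmapped_value = c.annotations[0].value
```

Passing an empty dictionary leaves the copied annotation bound to the original owner, so later changes to a.label remain visible through c's copy.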

deep_copy_annotations_from(other, memo=None)

Note that all references to other in any annotation value (and sub-annotation, and sub-sub-sub-annotation, etc.) will be replaced with references to self. This may not always make sense (i.e., a reference to a particular entity may be absolute regardless of context).

description(depth=1, indent=0, itemize='', output=None)[source]

Returns description of object, up to level depth.

discard_sequences(taxa)[source]

Removes sequences associated with Taxon instances specified in taxa if they exist.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

export_character_indices(indices)[source]

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the 0-based indices in indices. Note that this new matrix will still reference the same taxon set.
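
Column selection by 0-based index can be pictured with a dictionary of lists. This is a sketch, not DendroPy code: export_columns is a hypothetical helper, taxa are represented by strings, and keeping the selected columns in their original order is an assumption of this model.

```python
def export_columns(matrix, indices):
    """Toy model: keep only the given 0-based columns of every sequence."""
    keep = set(indices)
    return {
        taxon: [value for i, value in enumerate(seq) if i in keep]
        for taxon, seq in matrix.items()
    }


matrix = {"s1": list("TCCAA"), "s2": list("TGCAA"), "s3": list("TG-AA")}
first_and_third = export_columns(matrix, [0, 2])
```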

export_character_subset(character_subset)[source]

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the CharacterSubset, character_subset. Note that this new matrix will still reference the same taxon set.

extend_matrix(other_matrix)[source]

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. Each sequence associated with a Taxon reference in other_matrix that is not in self will be added to self.

extend_sequences(other_matrix, is_add_new_sequences=False)[source]

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. All other sequences will be ignored.

fill(value, size=None, append=True)[source]

Pads out all sequences in self by adding value to each sequence until its length is size long or equal to the length of the longest sequence if size is not specified.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete characters).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted to the front of each sequence.
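
The padding rule can be modeled on a dictionary of lists. The fill function below is a hypothetical stand-in that only borrows the parameter names from the signature above; it assumes that sequences already at or beyond the target size are left unchanged.

```python
def fill(sequences, value, size=None, append=True):
    """Toy model: pad every list in `sequences` (taxon -> list) in place."""
    if size is None:
        size = max((len(seq) for seq in sequences.values()), default=0)
    for seq in sequences.values():
        missing = size - len(seq)
        if missing <= 0:
            continue                 # already long enough in this model
        padding = [value] * missing
        if append:
            seq.extend(padding)      # pad at the end
        else:
            seq[:0] = padding        # pad at the front


right = {"s1": list("ACGT"), "s2": list("AC")}
fill(right, "-")

left = {"s1": list("ACGT"), "s2": list("AC")}
fill(left, "-", size=6, append=False)
```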

fill_taxa()[source]

Adds a new (empty) sequence for each Taxon instance in current taxon namespace that does not have a sequence.

classmethod from_dict(source_dict, char_matrix=None, case_sensitive_taxon_labels=False, **kwargs)[source]

Populates character matrix from dictionary (or similar mapping type), creating Taxon objects and sequences as needed.

Keys must be strings representing labels of Taxon objects or Taxon objects directly. If a key is specified as a string, then it will be dereferenced to the first existing Taxon object in the current taxon namespace with the same label. If no such Taxon object can be found, then a new Taxon object is created and added to the current namespace. If a key is specified as a Taxon object, then this is used directly. If it is not in the current taxon namespace, it will be added.

Values are the sequences (more generally, iterable of values). If values are of type CharacterDataSequence, then they are added as-is. Otherwise CharacterDataSequence instances are created for them. Values may be coerced into types compatible with particular matrices. The classmethod coerce_values() will be called for this.

Examples

The following creates a DnaCharacterMatrix instance with three sequences:

from dendropy import DnaCharacterMatrix

# Map taxon labels to sequences of DNA symbols.
d = {
    "s1": "TCCAA",
    "s2": "TGCAA",
    "s3": "TG-AA",
}
dna = DnaCharacterMatrix.from_dict(d)

Three Taxon objects will be created, corresponding to the labels ‘s1’, ‘s2’, ‘s3’. Each associated string sequence will be converted to a CharacterDataSequence, with each symbol (“A”, “C”, etc.) being replaced by the DNA state represented by the symbol.

Parameters:
  • source_dict (dict or other mapping type) – Keys must be strings representing labels of Taxon objects or Taxon objects directly. Values are sequences. See above for details.

  • char_matrix (CharacterMatrix) – Instance of CharacterMatrix to populate with data. If not specified, a new one will be created using keyword arguments specified by kwargs.

  • case_sensitive_taxon_labels (boolean) – If True, string labels specified as keys in source_dict will be matched to Taxon objects in the current taxon namespace with case being respected. If False, then case will be ignored.

  • **kwargs (keyword arguments, optional) – Keyword arguments to be passed to constructor of CharacterMatrix when creating new instance to populate, if no target instance is provided via char_matrix.

Returns:

char_matrix (CharacterMatrix) – CharacterMatrix populated by data from source_dict.
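
The way dictionary keys are dereferenced to taxa can be sketched in plain Python. ToyTaxon and taxon_for_key are hypothetical names and the namespace is modeled as a simple list; the logic follows the description above (first existing taxon with a matching label, otherwise a new one is created and registered).

```python
class ToyTaxon:
    def __init__(self, label):
        self.label = label


def taxon_for_key(key, namespace, case_sensitive_taxon_labels=False):
    """Return the taxon for `key`, adding to `namespace` (a list) when needed."""
    if isinstance(key, ToyTaxon):
        if key not in namespace:
            namespace.append(key)
        return key
    wanted = key if case_sensitive_taxon_labels else key.lower()
    for taxon in namespace:
        label = taxon.label if case_sensitive_taxon_labels else taxon.label.lower()
        if label == wanted:
            return taxon             # first existing taxon with this label
    taxon = ToyTaxon(key)            # no match: create and register
    namespace.append(taxon)
    return taxon


namespace = [ToyTaxon("S1")]
matrix = {}
for key, symbols in {"s1": "TCCAA", "s2": "TGCAA"}.items():
    matrix[taxon_for_key(key, namespace)] = list(symbols)

labels = [t.label for t in namespace]
```

With case ignored, the key "s1" resolves to the existing taxon labeled "S1", so only one new taxon is created.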

classmethod get(**kwargs)[source]

Instantiate and return a new character matrix object from a data source.

Mandatory Source-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object of data opened for reading.

  • path (str) – Path to file of data.

  • url (str) – URL of data.

  • data (str) – Data given directly.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Specification of data format (e.g., “nexus”).

Optional General Keyword Arguments:

  • label (str) – Name or identifier to be assigned to the new object; if not given, will be assigned the one specified in the data source, or None otherwise.

  • taxon_namespace (TaxonNamespace) – The TaxonNamespace instance to use to manage the taxon names. If not specified, a new one will be created.

  • matrix_offset (int) – 0-based index of character block or matrix in source to be parsed. If not specified then the first matrix (offset = 0) is assumed.

  • ignore_unrecognized_keyword_arguments (bool) – If True, then unsupported or unrecognized keyword arguments will not result in an error. Default is False: unsupported keyword arguments will result in an error.

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is interpreted and processed, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples:

import dendropy

dna1 = dendropy.DnaCharacterMatrix.get(
        file=open("pythonidae.fasta"),
        schema="fasta")
dna2 = dendropy.DnaCharacterMatrix.get(
        url="http://purl.org/phylo/treebase/phylows/matrix/TB2:M2610?format=nexus",
        schema="nexus")
aa1 = dendropy.ProteinCharacterMatrix.get(
        file=open("pythonidae.dat"),
        schema="phylip")
std1 = dendropy.StandardCharacterMatrix.get(
        path="python_morph.nex",
        schema="nexus")
std2 = dendropy.StandardCharacterMatrix.get(
        data=">t1\n01011\n\n>t2\n11100",
        schema="fasta")

classmethod get_from_path(src, schema, **kwargs)

Factory method to return new object of this class from file specified by string src.

Parameters:
  • src (string) – Full file path to source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_stream(src, schema, **kwargs)

Factory method to return new object of this class from file-like object src.

Parameters:
  • src (file or file-like) – Source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_string(src, schema, **kwargs)

Factory method to return new object of this class from string src.

Parameters:
  • src (string) – Data as a string.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_url(src, schema, strip_markup=False, **kwargs)

Factory method to return a new object of this class from URL given by src.

Parameters:
  • src (string) – URL of location providing source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

items()[source]

Returns character map key, value pairs in key-order.

keep_sequences(taxa)[source]

Discards all sequences not associated with any of the Taxon instances.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

property max_sequence_size

Maximum number of characters across all sequences in matrix.

Returns:

n (integer) – Maximum number of characters across all sequences in matrix.

migrate_taxon_namespace(taxon_namespace, unify_taxa_by_label=True, taxon_mapping_memo=None)

Move this object and all members to a new operational taxonomic unit concept namespace scope.

Current self.taxon_namespace value will be replaced with value given in taxon_namespace if this is not None, or a new TaxonNamespace object. Following this, reconstruct_taxon_namespace() will be called: each distinct Taxon object associated with self or members of self that is not already in taxon_namespace will be replaced with a new Taxon object that will be created with the same label and added to self.taxon_namespace. Calling this method results in the object (and all its member objects) being associated with a new, independent taxon namespace.

Label mapping case sensitivity follows the self.taxon_namespace.is_case_sensitive setting. If False and unify_taxa_by_label is also True, then the establishment of correspondence between Taxon objects in the old and new namespaces will be based on case-insensitive matching of labels. E.g., if there are four Taxon objects with labels ‘Foo’, ‘Foo’, ‘FOO’, and ‘FoO’ in the old namespace, then all objects that reference these will reference a single new Taxon object in the new namespace (with a label that is some existing casing variant of ‘foo’). If True: if unify_taxa_by_label is True, Taxon objects with labels identical except in case will be considered distinct.

Parameters:
  • taxon_namespace (TaxonNamespace) – The TaxonNamespace into the scope of which this object will be moved.

  • unify_taxa_by_label (boolean, optional) – If True, then references to distinct Taxon objects with identical labels in the current namespace will be replaced with a reference to a single Taxon object in the new namespace. If False: references to distinct Taxon objects will remain distinct, even if the labels are the same.

  • taxon_mapping_memo (dictionary) – Similar to memo of deepcopy, this is a dictionary that maps Taxon objects in the old namespace to corresponding Taxon objects in the new namespace. Mostly for internal use when migrating complex data to a new namespace. Note that any mappings here take precedence over all other options: if a Taxon object in the old namespace is found in this dictionary, the counterpart in the new namespace will be whatever value is mapped, regardless of, e.g. label values.
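
The interaction between unify_taxa_by_label and case sensitivity can be modeled as follows. This is a simplified sketch: ToyTaxon and migrate are hypothetical, the new namespace always starts empty, is_case_sensitive is passed as an argument instead of being read from the namespace, and taxon_mapping_memo is left out.

```python
class ToyTaxon:
    def __init__(self, label):
        self.label = label


def migrate(old_taxa, unify_taxa_by_label=True, is_case_sensitive=False):
    """Toy model: build a fresh namespace and an old -> new taxon mapping."""
    new_namespace = []
    mapping = {}
    by_label = {}
    for old in old_taxa:
        key = old.label if is_case_sensitive else old.label.lower()
        if unify_taxa_by_label and key in by_label:
            mapping[old] = by_label[key]   # reuse the taxon created earlier
            continue
        new = ToyTaxon(old.label)
        new_namespace.append(new)
        by_label[key] = new
        mapping[old] = new
    return new_namespace, mapping


old = [ToyTaxon(label) for label in ("Foo", "Foo", "FOO", "FoO")]
unified, unified_map = migrate(old)
case_sensitive, _ = migrate(old, is_case_sensitive=True)
distinct, _ = migrate(old, unify_taxa_by_label=False)
```

In this model the four old taxa collapse to one new taxon when case is ignored, to three when case is respected, and stay as four when unification is turned off.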

Examples

Use this method to move an object from one taxon namespace to another.

For example, to get a copy of an object associated with another taxon namespace and associate it with a different namespace:

# Get handle to the new TaxonNamespace
other_taxon_namespace = some_other_data.taxon_namespace

# Get a taxon-namespace scoped copy of a tree
# in another namespace
t2 = Tree(t1)

# Replace taxon namespace of copy
t2.migrate_taxon_namespace(other_taxon_namespace)

You can also use this method to get a copy of a structure and then move it to a new namespace:

t2 = Tree(t1)
t2.migrate_taxon_namespace(TaxonNamespace())

# Note: the same effect can be achieved by:
# t3 = copy.deepcopy(t1)

new_character_subset(label, character_indices)[source]

Defines a set of characters (columns) that make up a character set. Raises an error if one already exists with the same label. Column indices are 0-based.

new_sequence(taxon, values=None)[source]

Creates a new CharacterDataSequence associated with Taxon taxon, and populates it with values in values.

Parameters:
  • taxon (Taxon) – Taxon instance with which this sequence is associated.

  • values (iterable or None) – An initial set of values with which to populate the new character sequence.

Returns:

s (CharacterDataSequence) – A new CharacterDataSequence associated with Taxon taxon.

pack(value=None, size=None, append=True)[source]

Adds missing sequences for all Taxon instances in current namespace, and then pads out all sequences in self by adding value to each sequence until its length is size long or equal to the length of the longest sequence if size is not specified. A combination of CharacterMatrix.fill_taxa and CharacterMatrix.fill.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete characters).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted to the front of each sequence.
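
The padding behaviour described above can be modelled with plain Python lists. This is only a sketch of the documented semantics, not DendroPy's implementation: the function pack_model, the dictionary-of-lists layout, and the string taxon labels are stand-ins invented for illustration.

```python
def pack_model(sequences, taxa, value=None, size=None, append=True):
    """Toy model of pack(): add empty rows for missing taxa, then pad."""
    # Step 1 (the fill_taxa part): every taxon in the namespace gets a row.
    for taxon in taxa:
        sequences.setdefault(taxon, [])
    # Step 2 (the fill part): pad each row up to `size`, or to the longest row.
    if size is None:
        size = max((len(seq) for seq in sequences.values()), default=0)
    for seq in sequences.values():
        padding = [value] * max(0, size - len(seq))
        if append:
            seq.extend(padding)   # pad at the end
        else:
            seq[:0] = padding     # pad at the front
    return sequences


matrix = {"t1": [0.1, 0.2, 0.3], "t2": [0.4]}
pack_model(matrix, taxa=["t1", "t2", "t3"], value=0.0)
print(matrix)
```

In this model "t3" gains a row and every row ends up as long as the longest one; rows already at or beyond the target length are left unchanged.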

poll_taxa(taxa=None)[source]

Returns a set populated with all Taxon instances associated with self.

Parameters:

taxa (set()) – Set to populate. If not specified, a new one will be created.

Returns:

taxa (set[Taxon]) – Set of taxa associated with self.

purge_taxon_namespace()

Remove all Taxon instances in self.taxon_namespace that are not associated with self or any item in self.

reconstruct_taxon_namespace(unify_taxa_by_label=True, taxon_mapping_memo=None)[source]

See TaxonNamespaceAssociated.reconstruct_taxon_namespace.

reindex_subcomponent_taxa()[source]

Synchronizes Taxon objects of map to taxon_namespace of self.

reindex_taxa(taxon_namespace=None, clear=False)

DEPRECATED: Use migrate_taxon_namespace() instead. Rebuilds taxon_namespace from scratch, or assigns Taxon objects from given TaxonNamespace object taxon_namespace based on label values.

remove_sequences(taxa)[source]

Removes sequences associated with Taxon instances specified in taxa. A KeyError is raised if a Taxon instance is specified for which there is no associated sequence.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

replace_sequences(other_matrix)[source]

Replaces sequences for Taxon objects shared between self and other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to replace sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.

  4. All other sequences will be ignored.

property sequence_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

sequences()[source]

List of all sequences in self.

Returns:

s (list of CharacterDataSequence objects in self)

taxon_namespace_scoped_copy(memo=None)[source]

Cloning level: 1. Taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon objects: these are preserved as references.

update_sequences(other_matrix)[source]

Replaces sequences for Taxon objects shared between self and other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to update sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.

  4. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.
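
The difference between replace_sequences and update_sequences can be pictured with ordinary dictionaries. The following is a sketch of the semantics described in the notes only; the two *_model functions and the label-keyed dictionaries are hypothetical stand-ins, not DendroPy code.

```python
import copy


def replace_sequences_model(self_map, other_map):
    """Overwrite only the rows whose taxon is present in both mappings."""
    for taxon in self_map:
        if taxon in other_map:
            self_map[taxon] = copy.copy(other_map[taxon])  # shallow copy


def update_sequences_model(self_map, other_map):
    """Overwrite shared rows and also add rows found only in other_map."""
    for taxon, seq in other_map.items():
        self_map[taxon] = copy.copy(seq)


a = {"t1": [1, 1], "t2": [2, 2]}
b = {"t2": [9, 9], "t3": [3, 3]}

replaced = dict(a)
replace_sequences_model(replaced, b)   # "t3" is ignored

updated = dict(a)
update_sequences_model(updated, b)     # "t3" is added
```

In both cases the row for "t1", which does not occur in the other mapping, is left as it was.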

update_taxon_namespace()[source]

All Taxon objects in self that are not in self.taxon_namespace will be added.

values()[source]

Iterates over values (i.e. sequences) in this matrix.

property vector_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

write(**kwargs)

Writes out self in schema format.

Mandatory Destination-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object opened for writing.

  • path (str) – Path to file to which to write.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Identifier of the data format (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples

# Using a file path:
d.write(path="path/to/file.dat", schema="nexus")

# Using an open file:
with open("path/to/file.dat", "w") as f:
    d.write(file=f, schema="nexus")

write_to_path(dest, schema, **kwargs)

Writes to file specified by dest.

write_to_stream(dest, schema, **kwargs)

Writes to file-like object dest.

ContinuousCharacterMatrix: Continuous Data

class dendropy.datamodel.charmatrixmodel.ContinuousCharacterMatrix(*args, **kwargs)[source]

Specializes CharacterMatrix for continuous data.

Sequences are stored using ContinuousCharacterDataSequence, with values of elements assumed to be float.

__delitem__(key)

Removes sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

__getitem__(key)

Retrieves sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

Returns:

s (CharacterDataSequence) – A sequence associated with the Taxon instance referenced by key.
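
The three accepted key forms can be illustrated with a small stand-alone model. This is a sketch under stated assumptions: the Taxon class, resolve_taxon, and get_sequence below are toy stand-ins, and raising KeyError for an undefined taxon is a choice made for the sketch rather than documented behaviour.

```python
class Taxon:
    def __init__(self, label):
        self.label = label


def resolve_taxon(namespace, key):
    """Map an index, a label, or a Taxon to a Taxon already in `namespace`."""
    if isinstance(key, int):
        return namespace[key]            # index into the namespace
    if isinstance(key, str):
        for taxon in namespace:
            if taxon.label == key:
                return taxon             # first taxon carrying this label
        raise KeyError(key)
    if key in namespace:
        return key                       # already a Taxon in the namespace
    raise KeyError(key)


def get_sequence(sequences, namespace, key):
    """Return the row for `key`, creating an empty one on first access."""
    taxon = resolve_taxon(namespace, key)
    return sequences.setdefault(taxon, [])


namespace = [Taxon("a"), Taxon("b")]
sequences = {}
by_label = get_sequence(sequences, namespace, "b")
by_index = get_sequence(sequences, namespace, 1)
by_taxon = get_sequence(sequences, namespace, namespace[1])
```

All three lookups return the same list object, and the first one is what creates it.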

__iter__()

Returns an iterator over character map’s ordered keys.

__len__()

Number of sequences in matrix.

Returns:

n (integer) – Number of sequences in matrix.

__setitem__(key, values)

Assigns sequence values to taxon specified by key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

add_character_subset(char_subset)

Adds a CharacterSubset object. Raises an error if one already exists with the same label.

add_sequences(other_matrix)

Adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to add sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self as a shallow-copy.

  4. All other sequences will be ignored.

as_string(schema, **kwargs)

Composes and returns string representation of the data.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Identifier of the data format (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

character_sequence_type

alias of ContinuousCharacterDataSequence

clear()

Removes all sequences from matrix.

clone(depth=1)

Creates and returns a copy of self.

Parameters:

depth (integer) –

The depth of the copy:

  • 0: shallow-copy: All member objects are references, except for annotation_set of top-level object and member Annotation objects: these are full, independent instances (though any complex objects in the value field of Annotation objects are also just references).

  • 1: taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon instances: these are references.

  • 2: Exhaustive deep-copy: all objects are cloned.
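
The three depths can be pictured with the standard copy module. A minimal sketch, assuming toy Taxon and Matrix classes invented here; seeding the deepcopy memo is simply one way to obtain the depth-1 behaviour and is not claimed to be how clone() is implemented. The special handling of annotations at depth 0 is left out.

```python
import copy


class Taxon:
    def __init__(self, label):
        self.label = label


class Matrix:
    def __init__(self, taxa):
        self.taxon_namespace = taxa              # list standing in for a TaxonNamespace
        self.sequences = {t: [] for t in taxa}


t1 = Taxon("t1")
original = Matrix([t1])
original.sequences[t1] = [0.5, 1.5]

# Depth 0: shallow copy, member objects are shared references.
shallow = copy.copy(original)

# Depth 1: deep copy everything except the namespace and its taxa, by telling
# deepcopy that those objects have already been "copied" as themselves.
memo = {id(original.taxon_namespace): original.taxon_namespace, id(t1): t1}
scoped = copy.deepcopy(original, memo)

# Depth 2: exhaustive deep copy, taxa included.
deep = copy.deepcopy(original)
```

Here scoped has its own sequence lists but still keys them by the original t1, whereas deep holds a brand-new Taxon with the same label.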

coerce_values(values)

Converts elements of values to type of matrix.

This method is called by CharacterMatrix.from_dict to create sequences from iterables of values. This method should be overridden by derived classes to ensure that values consists of types compatible with the particular type of matrix. For example, a CharacterMatrix type with a fixed state alphabet (such as DnaCharacterMatrix) would dereference the string elements of values to return a list of StateIdentity objects corresponding to the symbols represented by the strings. If there is no value-type conversion done, then values should be returned as-is. If no value-type conversion is possible (e.g., when the type of a value is dependent on positional information), then a TypeError should be raised.

Parameters:

values (iterable) – Iterable of values to be converted.

Returns:

v (list of values.)
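
As an illustration of the kind of conversion described above, the sketch below coerces symbols to shared state objects for a fixed alphabet, and raw values to floats for continuous data. ToyState, TOY_ALPHABET, and both coerce_* functions are hypothetical names used only here; they are not part of DendroPy.

```python
class ToyState:
    """Stand-in for a StateIdentity: one shared object per symbol."""

    def __init__(self, symbol):
        self.symbol = symbol


TOY_ALPHABET = {symbol: ToyState(symbol) for symbol in "ACGT-"}


def coerce_discrete(values):
    # Dereference each symbol to its shared state object.
    return [TOY_ALPHABET[symbol] for symbol in values]


def coerce_continuous(values):
    # Continuous data: every element becomes a float.
    return [float(v) for v in values]


states = coerce_discrete("TG-AA")
numbers = coerce_continuous(["0.5", 2, "3e-1"])
```

Because the alphabet holds one object per symbol, the two trailing "A" entries in states refer to the same ToyState instance.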

classmethod concatenate(char_matrices)

Creates and returns a single character matrix from multiple CharacterMatrix objects specified as a list, ‘char_matrices’. All the CharacterMatrix objects in the list must be of the same type, and share the same TaxonNamespace reference. All taxa must be present in all alignments, and all alignments must be of the same length. Component parts will be recorded as character subsets.
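
The idea of joining alignments while recording each component as a character subset can be sketched with dictionaries of lists. Everything named below (concatenate_model, the "partN" subset labels, the label-keyed dictionaries) is an assumption made for illustration, and the sketch additionally assumes that each component is internally aligned.

```python
def concatenate_model(matrices):
    """Join per-taxon rows end to end and record each part's column span."""
    taxa = set(matrices[0])
    combined = {taxon: [] for taxon in matrices[0]}
    subsets = {}
    offset = 0
    for i, matrix in enumerate(matrices):
        if set(matrix) != taxa:
            raise ValueError("all taxa must be present in every matrix")
        lengths = {len(seq) for seq in matrix.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a matrix must be aligned")
        width = lengths.pop()
        for taxon, seq in matrix.items():
            combined[taxon].extend(seq)
        # Remember which 0-based columns came from this component.
        subsets["part{}".format(i + 1)] = list(range(offset, offset + width))
        offset += width
    return combined, subsets


gene_a = {"t1": list("ACGT"), "t2": list("ACGA")}
gene_b = {"t1": list("GG"), "t2": list("GC")}
combined, subsets = concatenate_model([gene_a, gene_b])
```

The recorded column indices make it possible to recover each component from the combined rows later.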

classmethod concatenate_from_paths(paths, schema, **kwargs)

Read a character matrix from each file path given in paths, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

classmethod concatenate_from_streams(streams, schema, **kwargs)

Read a character matrix from each file object given in streams, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

copy_annotations_from(other, attribute_object_mapper=None)

Copies annotations from other, which must be of Annotable type.

Copies are deep-copies, in that the Annotation objects added to the annotation_set AnnotationSet collection of self are independent copies of those in the annotation_set collection of other. However, dynamic bound-attribute annotations retain references to the original objects as given in other, which may or may not be desirable. This is handled by updating the objects to which attributes are bound via mappings found in attribute_object_mapper. In dynamic bound-attribute annotations, the _value attribute of the annotations object (Annotation._value) is a tuple consisting of “(obj, attr_name)”, which instructs the Annotation object to return “getattr(obj, attr_name)” (via: “getattr(*self._value)”) when returning the value of the Annotation. “obj” is typically the object to which the AnnotationSet belongs (i.e., self). When a copy of Annotation is created, the object reference given in the first element of the _value tuple of dynamic bound-attribute annotations is unchanged, unless the id of the object reference is found in attribute_object_mapper.

Parameters:
  • other (Annotable) – Source of annotations to copy.

  • attribute_object_mapper (dict) – Like the memo of __deepcopy__, maps object id’s to objects. The purpose of this is to update the parent or owner objects of dynamic attribute annotations. If a dynamic attribute Annotation gives object x as the parent or owner of the attribute (that is, the first element of the Annotation._value tuple is other) and id(x) is found in attribute_object_mapper, then in the copy the owner of the attribute is changed to attribute_object_mapper[id(x)]. If attribute_object_mapper is None (default), then the following mapping is automatically inserted: id(other): self. That is, any references to other in any Annotation object will be remapped to self. If really no reattribution mappings are desired, then an empty dictionary should be passed instead.

deep_copy_annotations_from(other, memo=None)

Note that all references to other in any annotation value (and sub-annotation, and sub-sub-sub-annotation, etc.) will be replaced with references to self. This may not always make sense (i.e., a reference to a particular entity may be absolute regardless of context).

description(depth=1, indent=0, itemize='', output=None)

Returns description of object, up to level depth.

discard_sequences(taxa)

Removes sequences associated with Taxon instances specified in taxa if they exist.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

export_character_indices(indices)

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the 0-based indices in indices. Note that this new matrix will still reference the same taxon set.

export_character_subset(character_subset)

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the CharacterSubset, character_subset. Note that this new matrix will still reference the same taxon set.
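
Both export methods amount to selecting columns, either by explicit indices or by the indices stored in a character subset. The sketch below models that with plain dictionaries; export_character_indices_model and the "codon_pos1" subset label are hypothetical names used only for illustration.

```python
def export_character_indices_model(matrix, indices):
    """New mapping holding only the requested 0-based columns, same taxon keys."""
    return {taxon: [seq[i] for i in indices] for taxon, seq in matrix.items()}


matrix = {"t1": list("ACGTAC"), "t2": list("TGCATG")}
character_subsets = {"codon_pos1": [0, 3]}   # hypothetical label -> column indices

by_indices = export_character_indices_model(matrix, [1, 2])
by_subset = export_character_indices_model(matrix, character_subsets["codon_pos1"])
```

The taxon keys are reused rather than copied, mirroring the note that the exported matrix still references the same taxon set.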

extend_matrix(other_matrix)

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.

extend_sequences(other_matrix, is_add_new_sequences=False)

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. All other sequences will be ignored.
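
The contrast between extend_sequences and extend_matrix, as their descriptions read, can be sketched as follows. The function extend_sequences_model and its add_new switch belong to this sketch only; whether the is_add_new_sequences parameter behaves identically is an assumption, since it is not described here.

```python
def extend_sequences_model(self_map, other_map, add_new=False):
    """Append columns to shared rows; optionally add rows missing from self_map."""
    for taxon, seq in other_map.items():
        if taxon in self_map:
            self_map[taxon].extend(seq)     # shared taxon: append characters
        elif add_new:
            self_map[taxon] = list(seq)     # extend_matrix-style: add the new row
        # otherwise the row is ignored


other = {"t2": [8], "t3": [9]}

extended = {"t1": [1], "t2": [2]}
extend_sequences_model(extended, other)

extended_with_new = {"t1": [1], "t2": [2]}
extend_sequences_model(extended_with_new, other, add_new=True)
```

Rows in self with no counterpart in the other mapping, such as "t1" here, keep their original length, so the result may need padding afterwards.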

fill(value, size=None, append=True)

Pads out all sequences in self by adding value to each sequence until its length is size long or equal to the length of the longest sequence if size is not specified.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete characters).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted to the front of each sequence.

fill_taxa()

Adds a new (empty) sequence for each Taxon instance in current taxon namespace that does not have a sequence.

classmethod from_dict(source_dict, char_matrix=None, case_sensitive_taxon_labels=False, **kwargs)

Populates character matrix from dictionary (or similar mapping type), creating Taxon objects and sequences as needed.

Keys must be strings representing labels of Taxon objects or Taxon objects directly. If a key is specified as a string, then it will be dereferenced to the first existing Taxon object in the current taxon namespace with the same label. If no such Taxon object can be found, then a new Taxon object is created and added to the current namespace. If a key is specified as a Taxon object, then this is used directly. If it is not in the current taxon namespace, it will be added.

Values are the sequences (more generally, iterable of values). If values are of type CharacterDataSequence, then they are added as-is. Otherwise CharacterDataSequence instances are created for them. Values may be coerced into types compatible with particular matrices. The classmethod coerce_values() will be called for this.

Examples

The following creates a DnaCharacterMatrix instance with three sequences:

d = {
        "s1" : "TCCAA",
        "s2" : "TGCAA",
        "s3" : "TG-AA",
}
dna = DnaCharacterMatrix.from_dict(d)

Three Taxon objects will be created, corresponding to the labels ‘s1’, ‘s2’, ‘s3’. Each associated string sequence will be converted to a CharacterDataSequence, with each symbol (“A”, “C”, etc.) being replaced by the DNA state represented by the symbol.

Parameters:
  • source_dict (dict or other mapping type) – Keys must be strings representing labels of Taxon objects or Taxon objects directly. Values are sequences. See above for details.

  • char_matrix (CharacterMatrix) – Instance of CharacterMatrix to populate with data. If not specified, a new one will be created using keyword arguments specified by kwargs.

  • case_sensitive_taxon_labels (boolean) – If True, string labels specified as keys in source_dict will be matched to Taxon objects in the current taxon namespace with case being respected. If False, then case will be ignored.

  • **kwargs (keyword arguments, optional) – Keyword arguments to be passed to constructor of CharacterMatrix when creating new instance to populate, if no target instance is provided via char_matrix.

Returns:

char_matrix (CharacterMatrix) – CharacterMatrix populated by data from source_dict.

classmethod get(**kwargs)

Instantiate and return a new character matrix object from a data source.

Mandatory Source-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object of data opened for reading.

  • path (str) – Path to file of data.

  • url (str) – URL of data.

  • data (str) – Data given directly.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Identifier of the data format (e.g., “nexus”).

Optional General Keyword Arguments:

  • label (str) – Name or identifier to be assigned to the new object; if not given, will be assigned the one specified in the data source, or None otherwise.

  • taxon_namespace (TaxonNamespace) – The TaxonNamespace instance to use to manage the taxon names. If not specified, a new one will be created.

  • matrix_offset (int) – 0-based index of character block or matrix in source to be parsed. If not specified then the first matrix (offset = 0) is assumed.

  • ignore_unrecognized_keyword_arguments (bool) – If True, then unsupported or unrecognized keyword arguments will not result in an error. Default is False: unsupported keyword arguments will result in an error.

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is interpreted and processed, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples:

dna1 = dendropy.DnaCharacterMatrix.get(
        file=open("pythonidae.fasta"),
        schema="fasta")
dna2 = dendropy.DnaCharacterMatrix.get(
        url="http://purl.org/phylo/treebase/phylows/matrix/TB2:M2610?format=nexus",
        schema="nexus")
aa1 = dendropy.ProteinCharacterMatrix.get(
        file=open("pythonidae.dat"),
        schema="phylip")
std1 = dendropy.StandardCharacterMatrix.get(
        path="python_morph.nex",
        schema="nexus")
std2 = dendropy.StandardCharacterMatrix.get(
        data=">t1\n01011\n\n>t2\n11100",
        schema="fasta")

classmethod get_from_path(src, schema, **kwargs)

Factory method to return new object of this class from file specified by string src.

Parameters:
  • src (string) – Full file path to source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_stream(src, schema, **kwargs)

Factory method to return new object of this class from file-like object src.

Parameters:
  • src (file or file-like) – Source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_string(src, schema, **kwargs)

Factory method to return new object of this class from string src.

Parameters:
  • src (string) – Data as a string.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_url(src, schema, strip_markup=False, **kwargs)

Factory method to return a new object of this class from URL given by src.

Parameters:
  • src (string) – URL of location providing source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

items()

Returns character map key, value pairs in key-order.

keep_sequences(taxa)

Discards all sequences not associated with any of the Taxon instances.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

property max_sequence_size

Maximum number of characters across all sequences in matrix.

Returns:

n (integer) – Maximum number of characters across all sequences in matrix.

migrate_taxon_namespace(taxon_namespace, unify_taxa_by_label=True, taxon_mapping_memo=None)

Move this object and all members to a new operational taxonomic unit concept namespace scope.

Current self.taxon_namespace value will be replaced with the value given in taxon_namespace if this is not None, or with a new TaxonNamespace object otherwise. Following this, reconstruct_taxon_namespace() will be called: each distinct Taxon object associated with self or members of self that is not already in taxon_namespace will be replaced with a new Taxon object that will be created with the same label and added to self.taxon_namespace. Calling this method results in the object (and all its member objects) being associated with a new, independent taxon namespace.

Label mapping case sensitivity follows the self.taxon_namespace.is_case_sensitive setting. If False and unify_taxa_by_label is also True, then the establishment of correspondence between Taxon objects in the old and new namespaces will be based on case-insensitive matching of labels. E.g., if there are four Taxon objects with labels ‘Foo’, ‘Foo’, ‘FOO’, and ‘FoO’ in the old namespace, then all objects that reference these will reference a single new Taxon object in the new namespace (whose label is one of the existing casing variants of ‘foo’). If True and unify_taxa_by_label is also True, Taxon objects with labels identical except in case will be considered distinct.

Parameters:
  • taxon_namespace (TaxonNamespace) – The TaxonNamespace into the scope of which this object will be moved.

  • unify_taxa_by_label (boolean, optional) – If True, then references to distinct Taxon objects with identical labels in the current namespace will be replaced with a reference to a single Taxon object in the new namespace. If False: references to distinct Taxon objects will remain distinct, even if the labels are the same.

  • taxon_mapping_memo (dictionary) – Similar to memo of deepcopy, this is a dictionary that maps Taxon objects in the old namespace to corresponding Taxon objects in the new namespace. Mostly for internal use when migrating complex data to a new namespace. Note that any mappings here take precedence over all other options: if a Taxon object in the old namespace is found in this dictionary, the counterpart in the new namespace will be whatever value is mapped, regardless of, e.g. label values.
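
The interaction of unify_taxa_by_label, case sensitivity, and the mapping memo can be sketched with a toy model. The Taxon class and migrate_model below are hypothetical stand-ins written for this illustration; they only mimic the rules described above and are not DendroPy's implementation.

```python
class Taxon:
    def __init__(self, label):
        self.label = label


def migrate_model(old_taxa, unify_by_label=True, case_sensitive=False, memo=None):
    """Return (new_namespace, mapping from old Taxon to new Taxon)."""
    memo = {} if memo is None else memo
    new_namespace = []
    by_label = {}
    for old in old_taxa:
        if old in memo:
            continue                          # explicit mappings take precedence
        key = old.label if case_sensitive else old.label.lower()
        if unify_by_label and key in by_label:
            memo[old] = by_label[key]         # reuse the taxon created for this label
            continue
        new = Taxon(old.label)                # same label, new object
        new_namespace.append(new)
        by_label[key] = new
        memo[old] = new
    return new_namespace, memo


old = [Taxon(label) for label in ("Foo", "Foo", "FOO", "FoO")]
unified, mapping = migrate_model(old)
case_kept, _ = migrate_model(old, case_sensitive=True)
distinct, _ = migrate_model(old, unify_by_label=False)
```

With the defaults of this model all four old taxa map to one new taxon; respecting case yields three, and disabling unification yields four.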

Examples

Use this method to move an object from one taxon namespace to another.

For example, to get a copy of an object associated with another taxon namespace and associate it with a different namespace:

# Get handle to the new TaxonNamespace
other_taxon_namespace = some_other_data.taxon_namespace

# Get a taxon-namespace scoped copy of a tree
# in another namespace
t2 = Tree(t1)

# Replace taxon namespace of copy
t2.migrate_taxon_namespace(other_taxon_namespace)

You can also use this method to get a copy of a structure and then move it to a new namespace:

t2 = Tree(t1)
t2.migrate_taxon_namespace(TaxonNamespace())

# Note: the same effect can be achieved by:
t3 = copy.deepcopy(t1)

new_character_subset(label, character_indices)

Defines a set of characters (columns) that make up a character set. Raises an error if one already exists with the same label. Column indices are 0-based.

new_sequence(taxon, values=None)

Creates a new CharacterDataSequence associated with Taxon taxon, and populates it with values in values.

Parameters:
  • taxon (Taxon) – Taxon instance with which this sequence is associated.

  • values (iterable or None) – An initial set of values with which to populate the new character sequence.

Returns:

s (CharacterDataSequence) – A new CharacterDataSequence associated with Taxon taxon.

pack(value=None, size=None, append=True)

Adds missing sequences for all Taxon instances in the current namespace, and then pads out all sequences in self by adding value to each sequence until its length equals size, or the length of the longest sequence if size is not specified. A combination of CharacterMatrix.fill_taxa and CharacterMatrix.fill.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete characters).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted to the front of each sequence.

poll_taxa(taxa=None)

Returns a set populated with all Taxon instances associated with self.

Parameters:

taxa (set()) – Set to populate. If not specified, a new one will be created.

Returns:

taxa (set[Taxon]) – Set of taxa associated with self.

purge_taxon_namespace()

Remove all Taxon instances in self.taxon_namespace that are not associated with self or any item in self.

reconstruct_taxon_namespace(unify_taxa_by_label=True, taxon_mapping_memo=None)

See TaxonNamespaceAssociated.reconstruct_taxon_namespace.

reindex_subcomponent_taxa()

Synchronizes Taxon objects of map to taxon_namespace of self.

reindex_taxa(taxon_namespace=None, clear=False)

DEPRECATED: Use migrate_taxon_namespace() instead. Rebuilds taxon_namespace from scratch, or assigns Taxon objects from given TaxonNamespace object taxon_namespace based on label values.

remove_sequences(taxa)

Removes sequences associated with Taxon instances specified in taxa. A KeyError is raised if a Taxon instance is specified for which there is no associated sequence.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

replace_sequences(other_matrix)

Replaces sequences for Taxon objects shared between self and other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to replace sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.

  4. All other sequences will be ignored.

property sequence_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

sequences()

List of all sequences in self.

Returns:

s (list of CharacterDataSequence objects in self)

taxon_namespace_scoped_copy(memo=None)

Cloning level: 1. Taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon objects: these are preserved as references.

update_sequences(other_matrix)

Replaces sequences for Taxon objects shared between self and other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to update sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.

  4. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.

update_taxon_namespace()

All Taxon objects in self that are not in self.taxon_namespace will be added.

values()

Iterates over values (i.e. sequences) in this matrix.

property vector_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

write(**kwargs)

Writes out self in schema format.

Mandatory Destination-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object opened for writing.

  • path (str) – Path to file to which to write.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Identifier of the data format (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples

# Using a file path:
d.write(path="path/to/file.dat", schema="nexus")

# Using an open file:
with open("path/to/file.dat", "w") as f:
    d.write(file=f, schema="nexus")

write_to_path(dest, schema, **kwargs)

Writes to file specified by dest.

write_to_stream(dest, schema, **kwargs)

Writes to file-like object dest.

DnaCharacterMatrix: DNA Data

class dendropy.datamodel.charmatrixmodel.DnaCharacterMatrix(*args, **kwargs)[source]

Specializes CharacterMatrix for DNA data.

__delitem__(key)

Removes sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

__getitem__(key)

Retrieves sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with the specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

Returns:

s (CharacterDataSequence) – A sequence associated with the Taxon instance referenced by key.
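The three kinds of key described above can be pictured with a small stand-alone sketch. ToyTaxon and resolve_key are hypothetical names, the namespace is modelled as a plain list, and raising KeyError for an undefined taxon is an assumption of the sketch rather than documented behaviour.

```python
class ToyTaxon:
    """Hypothetical stand-in for a Taxon: it only carries a label."""

    def __init__(self, label):
        self.label = label


def resolve_key(key, namespace):
    """Map an index, a label, or a taxon object to a taxon in `namespace`."""
    if isinstance(key, int):
        # Integer: position of the taxon in the namespace.
        return namespace[key]
    if isinstance(key, str):
        # String: label of a taxon in the namespace.
        for taxon in namespace:
            if taxon.label == key:
                return taxon
        raise KeyError(key)
    # Anything else is treated as the taxon itself, which must be defined.
    if key in namespace:
        return key
    raise KeyError(key)


namespace = [ToyTaxon("s1"), ToyTaxon("s2")]
first = resolve_key(0, namespace)
second = resolve_key("s2", namespace)
```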

__iter__()

Returns an iterator over character map’s ordered keys.

__len__()

Number of sequences in matrix.

Returns:

n (integer) – Number of sequences in matrix.

__setitem__(key, values)

Assigns sequence values to taxon specified by key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with the specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

add_character_subset(char_subset)

Adds a CharacterSubset object. Raises an error if one already exists with the same label.

add_sequences(other_matrix)

Adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to add sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self as a shallow-copy.

  4. All other sequences will be ignored.

as_string(schema, **kwargs)

Composes and returns string representation of the data.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Identifier of the format in which to compose the string (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

character_sequence_type

alias of DnaCharacterDataSequence

clear()

Removes all sequences from matrix.

clone(depth=1)

Creates and returns a copy of self.

Parameters:

depth (integer) –

The depth of the copy:

  • 0: shallow-copy: All member objects are references, except for the annotation_set attribute of the top-level object and member Annotation objects: these are full, independent instances (though any complex objects in the value field of Annotation objects are also just references).

  • 1: taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon instances: these are references.

  • 2: Exhaustive deep-copy: all objects are cloned.
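The difference between the three depths can be illustrated with an analogy built from standard-library copying. This is only a sketch: ToyTaxon is a hypothetical stand-in for Taxon, a dict stands in for the matrix, and annotations (which depth 0 treats specially) are not modelled.

```python
import copy


class ToyTaxon:
    """Hypothetical stand-in for a Taxon."""

    def __init__(self, label):
        self.label = label


t1, t2 = ToyTaxon("t1"), ToyTaxon("t2")
matrix = {t1: list("ACGT"), t2: list("TTTT")}

# Depth 0 analogue: new container, but sequences and taxa are shared.
shallow = dict(matrix)

# Depth 1 analogue: sequences are independent copies, taxa are shared.
scoped = {taxon: list(seq) for taxon, seq in matrix.items()}

# Depth 2 analogue: everything, including the taxa, is duplicated.
deep = copy.deepcopy(matrix)

# Mutating the original shows which copies are independent.
matrix[t1][0] = "G"
```

After the mutation, only the depth 0 analogue sees the change, and only the depth 2 analogue no longer contains the original taxon objects.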

coerce_values(values)

Converts elements of values to type of matrix.

This method is called by CharacterMatrix.from_dict to create sequences from iterables of values. This method should be overridden by derived classes to ensure that values consist of types compatible with the particular type of matrix. For example, a CharacterMatrix type with a fixed state alphabet (such as DnaCharacterMatrix) would dereference the string elements of values to return a list of StateIdentity objects corresponding to the symbols represented by the strings. If there is no value-type conversion done, then values should be returned as-is. If no value-type conversion is possible (e.g., when the type of a value is dependent on positional information), then a TypeError should be raised.

Parameters:

values (iterable) – Iterable of values to be converted.

Returns:

v (list) – List of values.

classmethod concatenate(char_matrices)

Creates and returns a single character matrix from multiple CharacterMatrix objects specified as a list, ‘char_matrices’. All the CharacterMatrix objects in the list must be of the same type, and share the same TaxonNamespace reference. All taxa must be present in all alignments, and all alignments must be of the same length. Component parts will be recorded as character subsets.
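As a rough picture of what concatenation with recorded subsets means, the following stand-alone sketch joins alignments held as dictionaries of strings and remembers the 0-based column range each part occupies. The function concatenate_model and the subset labels "part1", "part2" are hypothetical and do not reflect how DendroPy names or stores its character subsets.

```python
def concatenate_model(alignments):
    """Join per-taxon sequences and record each part's column indices."""
    taxa = set(alignments[0])
    combined = {taxon: "" for taxon in alignments[0]}
    subsets = {}
    offset = 0
    for number, aln in enumerate(alignments, start=1):
        if set(aln) != taxa:
            raise ValueError("all taxa must be present in all alignments")
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in an alignment must have equal length")
        width = lengths.pop()
        for taxon in combined:
            combined[taxon] += aln[taxon]
        # Remember which columns of the result came from this part.
        subsets["part%d" % number] = list(range(offset, offset + width))
        offset += width
    return combined, subsets


combined, subsets = concatenate_model([
    {"t1": "ACG", "t2": "ACC"},
    {"t1": "TTA", "t2": "TAA"},
])
```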

classmethod concatenate_from_paths(paths, schema, **kwargs)

Read a character matrix from each file path given in paths, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

classmethod concatenate_from_streams(streams, schema, **kwargs)

Read a character matrix from each file object given in streams, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

copy_annotations_from(other, attribute_object_mapper=None)

Copies annotations from other, which must be of Annotable type.

Copies are deep-copies, in that the Annotation objects added to the annotation_set AnnotationSet collection of self are independent copies of those in the annotation_set collection of other. However, dynamic bound-attribute annotations retain references to the original objects as given in other, which may or may not be desirable. This is handled by updating the objects to which attributes are bound via mappings found in attribute_object_mapper. In dynamic bound-attribute annotations, the _value attribute of the annotations object (Annotation._value) is a tuple consisting of “(obj, attr_name)”, which instructs the Annotation object to return “getattr(obj, attr_name)” (via: “getattr(*self._value)”) when returning the value of the Annotation. “obj” is typically the object to which the AnnotationSet belongs (i.e., self). When a copy of Annotation is created, the object reference given in the first element of the _value tuple of dynamic bound-attribute annotations is unchanged, unless the id of the object reference is fo

Parameters:
  • other (Annotable) – Source of annotations to copy.

  • attribute_object_mapper (dict) – Like the memo of __deepcopy__, maps object id’s to objects. The purpose of this is to update the parent or owner objects of dynamic attribute annotations. If a dynamic attribute Annotation gives object x as the parent or owner of the attribute (that is, the first element of the Annotation._value tuple is other) and id(x) is found in attribute_object_mapper, then in the copy the owner of the attribute is changed to attribute_object_mapper[id(x)]. If attribute_object_mapper is None (default), then the following mapping is automatically inserted: id(other): self. That is, any references to other in any Annotation object will be remapped to self. If really no reattribution mappings are desired, then an empty dictionary should be passed instead.
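The re-binding of dynamic bound-attribute annotations can be sketched in a few lines. ToyAnnotation, Thing, and copy_annotation are hypothetical names used only for illustration; the sketch mirrors the “(obj, attr_name)” tuple and the id-keyed mapping described above, not DendroPy's actual classes.

```python
class ToyAnnotation:
    """Hypothetical stand-in for a dynamic bound-attribute annotation."""

    def __init__(self, obj, attr_name):
        self._value = (obj, attr_name)

    @property
    def value(self):
        # The attribute is looked up on the owner each time it is requested.
        return getattr(*self._value)


class Thing:
    """Hypothetical annotated object with a single attribute."""

    def __init__(self, label):
        self.label = label


def copy_annotation(annotation, attribute_object_mapper):
    """Copy an annotation, re-binding its owner when a mapping exists."""
    obj, attr_name = annotation._value
    new_owner = attribute_object_mapper.get(id(obj), obj)
    return ToyAnnotation(new_owner, attr_name)


other, me = Thing("other"), Thing("self")
original = ToyAnnotation(other, "label")
remapped = copy_annotation(original, {id(other): me})  # default-style mapping
unmapped = copy_annotation(original, {})               # no reattribution
```

Here remapped follows the attribute on the new owner, while unmapped keeps reporting the attribute of the original object.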

deep_copy_annotations_from(other, memo=None)

Note that all references to other in any annotation value (and sub-annotation, and sub-sub-sub-annotation, etc.) will be replaced with references to self. This may not always make sense (i.e., a reference to a particular entity may be absolute regardless of context).

description(depth=1, indent=0, itemize='', output=None)

Returns description of object, up to level depth.

discard_sequences(taxa)

Removes sequences associated with Taxon instances specified in taxa if they exist.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

export_character_indices(indices)

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the 0-based indices in indices. Note that this new matrix will still reference the same taxon set.
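Column selection by 0-based index can be pictured with a plain-dictionary model. The helper export_columns is hypothetical and works on lists of symbols rather than on a real CharacterMatrix; the taxon keys are reused unchanged, loosely mirroring the shared taxon set.

```python
def export_columns(matrix, indices):
    """Return a new mapping holding only the columns at the given indices."""
    return {taxon: [seq[i] for i in indices] for taxon, seq in matrix.items()}


matrix = {"t1": list("ACGT"), "t2": list("TGCA")}
subset = export_columns(matrix, [0, 3])
```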

export_character_subset(character_subset)

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the CharacterSubset, character_subset. Note that this new matrix will still reference the same taxon set.

extend_matrix(other_matrix)

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.

extend_sequences(other_matrix, is_add_new_sequences=False)

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. All other sequences will be ignored.
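A dictionary-based model makes the difference between extending only shared taxa and also adding new ones concrete. The helper extend_sequences_model and its add_new flag are hypothetical; in particular, treating add_new as the analogue of is_add_new_sequences is an assumption, since that parameter is not described above.

```python
def extend_sequences_model(target, other, add_new=False):
    """Append characters for shared taxa; optionally add unseen taxa."""
    for taxon, seq in other.items():
        if taxon in target:
            target[taxon].extend(seq)
        elif add_new:
            target[taxon] = list(seq)
        # Otherwise the sequence is ignored.
    return target


def fresh():
    """Build a new toy matrix so each call starts from the same state."""
    return {"t1": list("AC"), "t2": list("GG")}


b = {"t2": list("TT"), "t3": list("CA")}
only_shared = extend_sequences_model(fresh(), b)
with_new = extend_sequences_model(fresh(), b, add_new=True)
```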

fill(value, size=None, append=True)

Pads out all sequences in self by adding value to each sequence until its length equals size, or equals the length of the longest sequence if size is not specified.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete characters).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted to the front of each sequence.
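The padding rule can be sketched on plain lists. The helper fill_model is hypothetical and only mirrors the three parameters described above; it is not DendroPy's implementation.

```python
def fill_model(matrix, value, size=None, append=True):
    """Pad every sequence with `value` up to `size` (or the longest length)."""
    if size is None:
        size = max(len(seq) for seq in matrix.values())
    for seq in matrix.values():
        # A non-positive difference yields an empty padding list.
        padding = [value] * (size - len(seq))
        if append:
            seq.extend(padding)
        else:
            seq[:0] = padding
    return matrix


ragged = {"t1": list("ACGT"), "t2": list("AC")}
fill_model(ragged, "-")
front = fill_model({"t1": list("AC")}, "?", size=4, append=False)
```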

fill_taxa()

Adds a new (empty) sequence for each Taxon instance in current taxon namespace that does not have a sequence.

folded_site_frequency_spectrum(is_pad_vector_to_unfolded_length=False)

Returns the folded or minor site/allele frequency spectrum.

Given $N$ chromosomes, the site frequency spectrum is a vector $(f_0, f_1, f_2, …, f_N)$, where the value $f_i$ is the number of sites where $i$ derived alleles are segregating in the sample: 0 alleles, 1 allele, 2 alleles, etc.

The folded site frequency spectrum is a vector $(f_0, f_1, f_2, \ldots, f_m)$, $m = \lceil \frac{N}{2} \rceil$, where the value $f_i$ is the number of sites with $i$ copies of the minor allele.

Parameters:

is_pad_vector_to_unfolded_length (bool) – If False (default), then the vector length will be $\lceil \frac{N}{2} \rceil$, where $N$ is the number of taxa. If True, the length of the vector will be the number of taxa + 1, with the first element the number of monomorphic sites not contributing to the site frequency spectrum.

Returns:

v (list[int]) – A vector of integers representing the folded site frequency spectrum.
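To make the definition concrete, here is a small stand-alone sketch that counts minor alleles per column. It is not the DendroPy implementation: folded_sfs_model is a hypothetical helper, sequences are plain strings of equal length, the result is indexed from 0 to $\lceil N/2 \rceil$, and columns with more than two alleles are simply skipped (an assumption of the sketch).

```python
import math
from collections import Counter


def folded_sfs_model(sequences):
    """Count sites by the number of copies of their minor allele."""
    n = len(sequences)
    m = math.ceil(n / 2)
    spectrum = [0] * (m + 1)
    for column in zip(*sequences):
        counts = Counter(column)
        if len(counts) == 1:
            # Monomorphic site: zero copies of a minor allele.
            spectrum[0] += 1
        elif len(counts) == 2:
            # Biallelic site: the rarer allele is the minor allele.
            spectrum[min(counts.values())] += 1
        # Sites with more than two alleles are ignored in this sketch.
    return spectrum


sfs = folded_sfs_model(["AAAA", "AATA", "ATTA", "ATTT"])
```

For these four sequences the first column is monomorphic, the second splits 2/2, and the last two each carry a single copy of the minor allele.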

classmethod from_dict(source_dict, char_matrix=None, case_sensitive_taxon_labels=False, **kwargs)

Populates character matrix from dictionary (or similar mapping type), creating Taxon objects and sequences as needed.

Keys must be strings representing labels of Taxon objects or Taxon objects directly. If a key is specified as a string, then it will be dereferenced to the first existing Taxon object in the current taxon namespace with the same label. If no such Taxon object can be found, then a new Taxon object is created and added to the current namespace. If a key is specified as a Taxon object, then this is used directly. If it is not in the current taxon namespace, it will be added.

Values are the sequences (more generally, iterable of values). If values are of type CharacterDataSequence, then they are added as-is. Otherwise CharacterDataSequence instances are created for them. Values may be coerced into types compatible with particular matrices. The classmethod coerce_values() will be called for this.

Examples

The following creates a DnaCharacterMatrix instance with three sequences:

from dendropy import DnaCharacterMatrix

d = {
    "s1" : "TCCAA",
    "s2" : "TGCAA",
    "s3" : "TG-AA",
}
dna = DnaCharacterMatrix.from_dict(d)

Three Taxon objects will be created, corresponding to the labels ‘s1’, ‘s2’, ‘s3’. Each associated string sequence will be converted to a CharacterDataSequence, with each symbol (“A”, “C”, etc.) being replaced by the DNA state represented by the symbol.

Parameters:
  • source_dict (dict or other mapping type) – Keys must be strings representing labels of Taxon objects or Taxon objects directly. Values are sequences. See above for details.

  • char_matrix (CharacterMatrix) – Instance of CharacterMatrix to populate with data. If not specified, a new one will be created using keyword arguments specified by kwargs.

  • case_sensitive_taxon_labels (boolean) – If True, string labels specified as keys in source_dict will be matched to Taxon objects in the current taxon namespace with case being respected. If False, then case will be ignored.

  • **kwargs (keyword arguments, optional) – Keyword arguments to be passed to constructor of CharacterMatrix when creating new instance to populate, if no target instance is provided via char_matrix.

Returns:

char_matrix (CharacterMatrix) – CharacterMatrix populated by data from source_dict.

classmethod get(**kwargs)

Instantiate and return a new character matrix object from a data source.

Mandatory Source-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object of data opened for reading.

  • path (str) – Path to file of data.

  • url (str) – URL of data.

  • data (str) – Data given directly.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Identifier of the format of the data source (e.g., “fasta”, “nexus”, “phylip”).

Optional General Keyword Arguments:

  • label (str) – Name or identifier to be assigned to the new object; if not given, will be assigned the one specified in the data source, or None otherwise.

  • taxon_namespace (TaxonNamespace) – The TaxonNamespace instance to use to manage the taxon names. If not specified, a new one will be created.

  • matrix_offset (int) – 0-based index of character block or matrix in source to be parsed. If not specified then the first matrix (offset = 0) is assumed.

  • ignore_unrecognized_keyword_arguments (bool) – If True, then unsupported or unrecognized keyword arguments will not result in an error. Default is False: unsupported keyword arguments will result in an error.

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is interpreted and processed, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples:

import dendropy

dna1 = dendropy.DnaCharacterMatrix.get(
        file=open("pythonidae.fasta"),
        schema="fasta")
dna2 = dendropy.DnaCharacterMatrix.get(
        url="http://purl.org/phylo/treebase/phylows/matrix/TB2:M2610?format=nexus",
        schema="nexus")
aa1 = dendropy.ProteinCharacterMatrix.get(
        file=open("pythonidae.dat"),
        schema="phylip")
std1 = dendropy.StandardCharacterMatrix.get(
        path="python_morph.nex",
        schema="nexus")
std2 = dendropy.StandardCharacterMatrix.get(
        data=">t1\n01011\n\n>t2\n11100",
        schema="fasta")

classmethod get_from_path(src, schema, **kwargs)

Factory method to return new object of this class from file specified by string src.

Parameters:
  • src (string) – Full file path to source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_stream(src, schema, **kwargs)

Factory method to return new object of this class from file-like object src.

Parameters:
  • src (file or file-like) – Source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_string(src, schema, **kwargs)

Factory method to return new object of this class from string src.

Parameters:
  • src (string) – Data as a string.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_url(src, schema, strip_markup=False, **kwargs)

Factory method to return a new object of this class from URL given by src.

Parameters:
  • src (string) – URL of location providing source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

items()

Returns character map key, value pairs in key-order.

keep_sequences(taxa)

Discards all sequences not associated with any of the Taxon instances.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

property max_sequence_size

Maximum number of characters across all sequences in matrix.

Returns:

n (integer) – Maximum number of characters across all sequences in matrix.

migrate_taxon_namespace(taxon_namespace, unify_taxa_by_label=True, taxon_mapping_memo=None)

Move this object and all members to a new operational taxonomic unit concept namespace scope.

Current self.taxon_namespace value will be replaced with value given in taxon_namespace if this is not None, or a new TaxonNamespace object. Following this, reconstruct_taxon_namespace() will be called: each distinct Taxon object associated with self or members of self that is not already in taxon_namespace will be replaced with a new Taxon object that will be created with the same label and added to self.taxon_namespace. Calling this method results in the object (and all its member objects) being associated with a new, independent taxon namespace.

Label mapping case sensitivity follows the self.taxon_namespace.is_case_sensitive setting. If False and unify_taxa_by_label is also True, then the establishment of correspondence between Taxon objects in the old and new namespaces will be based on case-insensitive matching of labels. E.g., if there are four Taxon objects with labels ‘Foo’, ‘Foo’, ‘FOO’, and ‘FoO’ in the old namespace, then all objects that reference these will reference a single new Taxon object in the new namespace (with a label that is some existing casing variant of ‘foo’). If True and unify_taxa_by_label is True, Taxon objects with labels identical except in case will be considered distinct.

Parameters:
  • taxon_namespace (TaxonNamespace) – The TaxonNamespace into the scope of which this object will be moved.

  • unify_taxa_by_label (boolean, optional) – If True, then references to distinct Taxon objects with identical labels in the current namespace will be replaced with a reference to a single Taxon object in the new namespace. If False: references to distinct Taxon objects will remain distinct, even if the labels are the same.

  • taxon_mapping_memo (dictionary) – Similar to memo of deepcopy, this is a dictionary that maps Taxon objects in the old namespace to corresponding Taxon objects in the new namespace. Mostly for internal use when migrating complex data to a new namespace. Note that any mappings here take precedence over all other options: if a Taxon object in the old namespace is found in this dictionary, the counterpart in the new namespace will be whatever value is mapped, regardless of, e.g. label values.
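The interaction between label unification and case sensitivity can be modelled without DendroPy. In the sketch below, ToyTaxon and map_taxa are hypothetical, and the case_sensitive argument plays the role of the namespace's is_case_sensitive setting; the function returns the old-to-new mapping that such a migration would establish.

```python
class ToyTaxon:
    """Hypothetical stand-in for a Taxon."""

    def __init__(self, label):
        self.label = label


def map_taxa(old_taxa, case_sensitive=False, unify_by_label=True):
    """Return {old taxon: new taxon} for a fresh, hypothetical namespace."""
    by_label = {}
    mapping = {}
    for taxon in old_taxa:
        key = taxon.label if case_sensitive else taxon.label.lower()
        if unify_by_label and key in by_label:
            # Reuse the new taxon already created for this label.
            mapping[taxon] = by_label[key]
        else:
            new_taxon = ToyTaxon(taxon.label)
            by_label.setdefault(key, new_taxon)
            mapping[taxon] = new_taxon
    return mapping


def count_distinct(mapping):
    """Number of distinct new taxon objects in a mapping."""
    return len({id(taxon) for taxon in mapping.values()})


old = [ToyTaxon(label) for label in ("Foo", "Foo", "FOO", "FoO")]
insensitive = map_taxa(old, case_sensitive=False)
sensitive = map_taxa(old, case_sensitive=True)
distinct = map_taxa(old, unify_by_label=False)
```

With the four labels from the text, case-insensitive unification yields one new taxon, case-sensitive unification yields three, and disabling unification keeps all four distinct.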

Examples

Use this method to move an object from one taxon namespace to another.

For example, to get a copy of an object associated with another taxon namespace and associate it with a different namespace:

# Get handle to the new TaxonNamespace
other_taxon_namespace = some_other_data.taxon_namespace

# Get a taxon-namespace scoped copy of a tree
# in another namespace
t2 = Tree(t1)

# Replace taxon namespace of copy
t2.migrate_taxon_namespace(other_taxon_namespace)

You can also use this method to get a copy of a structure and then move it to a new namespace:

t2 = Tree(t1)
t2.migrate_taxon_namespace(TaxonNamespace())

# Note: the same effect can be achieved by:
t3 = copy.deepcopy(t1)

new_character_subset(label, character_indices)

Defines a set of characters (columns) that make up a character set. Raises an error if one already exists with the same label. Column indices are 0-based.

new_sequence(taxon, values=None)

Creates a new CharacterDataSequence associated with Taxon taxon, and populates it with values in values.

Parameters:
  • taxon (Taxon) – Taxon instance with which this sequence is associated.

  • values (iterable or None) – An initial set of values with which to populate the new character sequence.

Returns:

s (CharacterDataSequence) – A new CharacterDataSequence associated with Taxon taxon.

pack(value=None, size=None, append=True)

Adds missing sequences for all Taxon instances in current namespace, and then pads out all sequences in self by adding value to each sequence until its length equals size, or equals the length of the longest sequence if size is not specified. A combination of CharacterMatrix.fill_taxa and CharacterMatrix.fill.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete characters).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted to the front of each sequence.

poll_taxa(taxa=None)

Returns a set populated with all of the Taxon instances associated with self.

Parameters:

taxa (set) – Set to populate. If not specified, a new one will be created.

Returns:

taxa (set[Taxon]) – Set of taxa associated with self.

purge_taxon_namespace()

Remove all Taxon instances in self.taxon_namespace that are not associated with self or any item in self.

reconstruct_taxon_namespace(unify_taxa_by_label=True, taxon_mapping_memo=None)

See TaxonNamespaceAssociated.reconstruct_taxon_namespace.

reindex_subcomponent_taxa()

Synchronizes Taxon objects of map to taxon_namespace of self.

reindex_taxa(taxon_namespace=None, clear=False)

DEPRECATED: Use migrate_taxon_namespace() instead. Rebuilds taxon_namespace from scratch, or assigns Taxon objects from given TaxonNamespace object taxon_namespace based on label values.

remap_to_default_state_alphabet_by_symbol(purge_other_state_alphabets=True)

All entities with any reference to a state alphabet will have the reference reassigned to the default state alphabet, and all entities with any reference to a state alphabet element will have the reference reassigned to the state alphabet element in the default state alphabet that has the same symbol. Raises ValueError if no matching symbol can be found.

remap_to_state_alphabet_by_symbol(state_alphabet, purge_other_state_alphabets=True)

All entities with any reference to a state alphabet will have the reference reassigned to state_alphabet, and all entities with any reference to a state alphabet element will have the reference reassigned to the state alphabet element in state_alphabet that has the same symbol. Raises KeyError if no matching symbol can be found.

remove_sequences(taxa)

Removes sequences associated with Taxon instances specified in taxa. A KeyError is raised if a Taxon instance is specified for which there is no associated sequence.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.
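The contrast between remove_sequences, discard_sequences, and keep_sequences resembles the usual dictionary idioms. The three helpers below are hypothetical models that operate on a plain dict keyed by string labels rather than on Taxon objects.

```python
def remove_sequences_model(matrix, taxa):
    """Strict removal: a missing entry raises KeyError."""
    for taxon in taxa:
        del matrix[taxon]


def discard_sequences_model(matrix, taxa):
    """Lenient removal: missing entries are silently skipped."""
    for taxon in taxa:
        matrix.pop(taxon, None)


def keep_sequences_model(matrix, taxa):
    """Inverse selection: drop everything not listed in `taxa`."""
    wanted = set(taxa)
    for taxon in list(matrix):
        if taxon not in wanted:
            del matrix[taxon]


matrix = {"t1": "ACGT", "t2": "AAAA", "t3": "CCCC"}
discard_sequences_model(matrix, ["t1", "missing"])  # no error raised
keep_sequences_model(matrix, ["t2"])
```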

replace_sequences(other_matrix)

Replaces sequences for Taxon objects shared between self and other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to replace sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.

  4. All other sequences will be ignored.

property sequence_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

sequences()

List of all sequences in self.

Returns:

s (list) – List of CharacterDataSequence objects in self.

taxon_namespace_scoped_copy(memo=None)

Cloning level: 1. Taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon objects: these are preserved as references.

taxon_state_sets_map(char_indices=None, gaps_as_missing=True, gap_state=None, no_data_state=None)

Returns a dictionary that maps taxon objects to lists of sets of fundamental state indices.

Parameters:
  • char_indices (iterable of ints) – An iterable of indexes of characters to include (by column). If not given or None [default], then all characters are included.

  • gaps_as_missing (boolean) – If True [default] then gap characters will be treated as missing data values. If False, then they will be treated as an additional (fundamental) state.

Returns:

d (dict) – A dictionary with Taxon objects as keys and a list of sets of fundamental state indexes as values.

E.g., Given the following matrix of DNA characters:

T1 AGN
T2 C-T
T3 GC?

Return with gaps_as_missing==True

{
    <T1> : [ set([0]), set([2]),        set([0,1,2,3]) ],
    <T2> : [ set([1]), set([0,1,2,3]),  set([3]) ],
    <T3> : [ set([2]), set([1]),        set([0,1,2,3]) ],
}

Return with gaps_as_missing==False

{
    <T1> : [ set([0]), set([2]),        set([0,1,2,3]) ],
    <T2> : [ set([1]), set([4]),        set([3]) ],
    <T3> : [ set([2]), set([1]),        set([0,1,2,3,4]) ],
}

Note that when gaps are treated as a fundamental state, not only does ‘-’ map to a distinct and unique state (4), but ‘?’ (missing data) maps to a set consisting of all bases and the gap state, whereas ‘N’ maps to a set of all bases but not including the gap state.

When gaps are treated as missing, on the other hand, then ‘?’ and ‘N’ and ‘-’ all map to the same set, i.e. of all the bases.
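The two return values shown above can be reproduced with a small stand-alone model. The helper state_sets and the index assignment (A=0, C=1, G=2, T=3, gap=4) are inferred from the example output rather than taken from DendroPy's state alphabet, and string labels stand in for Taxon objects.

```python
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
GAP_INDEX = 4


def state_sets(sequence, gaps_as_missing=True):
    """Translate DNA symbols into sets of fundamental state indices."""
    all_bases = set(BASES.values())
    result = []
    for symbol in sequence:
        if symbol in BASES:
            result.append({BASES[symbol]})
        elif symbol == "N":
            # Any base, but never the gap state.
            result.append(set(all_bases))
        elif symbol == "-":
            result.append(set(all_bases) if gaps_as_missing else {GAP_INDEX})
        elif symbol == "?":
            # Missing data: any base, plus the gap state when gaps are a state.
            extra = set() if gaps_as_missing else {GAP_INDEX}
            result.append(all_bases | extra)
        else:
            raise ValueError("unsupported symbol: %r" % symbol)
    return result


matrix = {"T1": "AGN", "T2": "C-T", "T3": "GC?"}
as_missing = {t: state_sets(s, gaps_as_missing=True) for t, s in matrix.items()}
as_state = {t: state_sets(s, gaps_as_missing=False) for t, s in matrix.items()}
```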

update_sequences(other_matrix)

Replaces sequences for Taxon objects shared between self and other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to update sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.

  4. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.

update_taxon_namespace()

All Taxon objects in self that are not in self.taxon_namespace will be added.

values()

Iterates values (i.e. sequences) in this matrix.

property vector_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

write(**kwargs)

Writes out self in schema format.

Mandatory Destination-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object opened for writing.

  • path (str) – Path to file to which to write.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Identifier of the format in which to write the data (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples

# Using a file path:
d.write(path="path/to/file.dat", schema="nexus")

# Using an open file:
with open("path/to/file.dat", "w") as f:
    d.write(file=f, schema="nexus")

write_to_path(dest, schema, **kwargs)

Writes to file specified by dest.

write_to_stream(dest, schema, **kwargs)

Writes to file-like object dest.

RnaCharacterMatrix: RNA Data

class dendropy.datamodel.charmatrixmodel.RnaCharacterMatrix(*args, **kwargs)[source]

Specializes CharacterMatrix for RNA data.

__delitem__(key)

Removes sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

__getitem__(key)

Retrieves sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with the specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

Returns:

s (CharacterDataSequence) – A sequence associated with the Taxon instance referenced by key.
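
The key-resolution rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not DendroPy's implementation: the Taxon class and the resolve_key function are hypothetical stand-ins, and the taxon namespace is modeled as a simple list.

```python
class Taxon:
    """Hypothetical stand-in for a taxon: it only carries a label."""

    def __init__(self, label):
        self.label = label


def resolve_key(key, namespace):
    """Map an integer index, a string label, or a Taxon to a Taxon in namespace."""
    if isinstance(key, int):
        return namespace[key]        # integer: position in the namespace
    if isinstance(key, str):
        for taxon in namespace:      # string: first taxon carrying that label
            if taxon.label == key:
                return taxon
        raise KeyError(key)
    if key in namespace:             # otherwise: a Taxon that is already defined
        return key
    raise KeyError(key)


namespace = [Taxon("s1"), Taxon("s2")]
by_index = resolve_key(1, namespace)
by_label = resolve_key("s2", namespace)
by_taxon = resolve_key(namespace[1], namespace)
```

All three calls resolve to the same object, which is why the three key forms are interchangeable as long as the taxon is already in the namespace.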

__iter__()

Returns an iterator over character map’s ordered keys.

__len__()

Number of sequences in matrix.

Returns:

n (integer) – Number of sequences in matrix.

__setitem__(key, values)

Assigns sequence values to taxon specified by key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with the specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

add_character_subset(char_subset)

Adds a CharacterSubset object. Raises an error if one already exists with the same label.

add_sequences(other_matrix)

Adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to add sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self as a shallow-copy.

  4. All other sequences will be ignored.

as_string(schema, **kwargs)

Composes and returns string representation of the data.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Specification of data format (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

character_sequence_type

alias of RnaCharacterDataSequence

clear()

Removes all sequences from matrix.

clone(depth=1)

Creates and returns a copy of self.

Parameters:

depth (integer) –

The depth of the copy:

  • 0: shallow-copy: All member objects are references, except for the annotation_set of the top-level object and member Annotation objects: these are full, independent instances (though any complex objects in the value field of Annotation objects are also just references).

  • 1: taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon instances: these are references.

  • 2: Exhaustive deep-copy: all objects are cloned.
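
The three copy depths can be illustrated with the standard copy module. This is a sketch only, not DendroPy's implementation: Taxon, Matrix, and this clone function are hypothetical, the taxa list stands in for a TaxonNamespace, and annotations are left out. The depth-1 case pre-seeds the deepcopy memo so that taxa are kept as shared references.

```python
import copy


class Taxon:
    """Hypothetical stand-in for a taxon."""

    def __init__(self, label):
        self.label = label


class Matrix:
    """Hypothetical matrix: a taxa list plus a taxon-to-sequence map."""

    def __init__(self, taxa):
        self.taxa = taxa
        self.sequences = {taxon: [] for taxon in taxa}


def clone(matrix, depth=1):
    if depth == 0:
        return copy.copy(matrix)                 # members are shared references
    if depth == 1:
        memo = {id(matrix.taxa): matrix.taxa}    # keep the namespace shared
        for taxon in matrix.taxa:
            memo[id(taxon)] = taxon              # keep each taxon shared
        return copy.deepcopy(matrix, memo)       # everything else is copied
    return copy.deepcopy(matrix)                 # exhaustive deep copy


m = Matrix([Taxon("t1")])
m.sequences[m.taxa[0]].append("A")
shallow, scoped, deep = clone(m, 0), clone(m, 1), clone(m, 2)
```

Here shallow shares its sequence map with m, scoped has independent sequences but the same Taxon objects, and deep shares nothing with m.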

coerce_values(values)

Converts elements of values to type of matrix.

This method is called by CharacterMatrix.from_dict to create sequences from iterables of values. This method should be overridden by derived classes to ensure that values consists of types compatible with the particular type of matrix. For example, a CharacterMatrix type with a fixed state alphabet (such as DnaCharacterMatrix) would dereference the string elements of values to return a list of StateIdentity objects corresponding to the symbols represented by the strings. If there is no value-type conversion done, then values should be returned as-is. If no value-type conversion is possible (e.g., when the type of a value is dependent on positional information), then a TypeError should be raised.

Parameters:

values (iterable) – Iterable of values to be converted.

Returns:

v (list) – List of values.

classmethod concatenate(char_matrices)

Creates and returns a single character matrix from multiple CharacterMatrix objects specified as a list, ‘char_matrices’. All the CharacterMatrix objects in the list must be of the same type, and share the same TaxonNamespace reference. All taxa must be present in all alignments, and all alignments must be of the same length. Component parts will be recorded as character subsets.
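
The idea of concatenating alignments while recording each component as a character subset can be sketched as follows. This is a plain-Python illustration rather than DendroPy's code: matrices are modeled as dicts mapping taxon labels to sequences, the subset labels ("part0", "part1", ...) are made up for the example, and the length check assumes that sequences within each component matrix are aligned.

```python
def concatenate(matrices):
    """Join matrices taxon by taxon and record the columns of each part."""
    taxa = list(matrices[0])
    combined = {taxon: [] for taxon in taxa}
    subsets = {}
    offset = 0
    for i, matrix in enumerate(matrices):
        if set(matrix) != set(taxa):
            raise ValueError("all taxa must be present in all matrices")
        lengths = {len(seq) for seq in matrix.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a matrix must be aligned")
        (length,) = lengths
        for taxon in taxa:
            combined[taxon].extend(matrix[taxon])
        # 0-based column indices contributed by this component
        subsets["part%d" % i] = list(range(offset, offset + length))
        offset += length
    return combined, subsets


m1 = {"t1": "AC", "t2": "GT"}
m2 = {"t1": "G", "t2": "A"}
combined, subsets = concatenate([m1, m2])
```

The recorded index lists are what would let a caller later recover each original component from the combined matrix.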

classmethod concatenate_from_paths(paths, schema, **kwargs)

Read a character matrix from each file path given in paths, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

classmethod concatenate_from_streams(streams, schema, **kwargs)

Read a character matrix from each file object given in streams, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

copy_annotations_from(other, attribute_object_mapper=None)

Copies annotations from other, which must be of Annotable type.

Copies are deep-copies, in that the Annotation objects added to the annotation_set AnnotationSet collection of self are independent copies of those in the annotation_set collection of other. However, dynamic bound-attribute annotations retain references to the original objects as given in other, which may or may not be desirable. This is handled by updating the objects to which attributes are bound via mappings found in attribute_object_mapper.

In dynamic bound-attribute annotations, the _value attribute of the annotation object (Annotation._value) is a tuple consisting of “(obj, attr_name)”, which instructs the Annotation object to return “getattr(obj, attr_name)” (via “getattr(*self._value)”) when returning the value of the Annotation. “obj” is typically the object to which the AnnotationSet belongs (i.e., self). When a copy of an Annotation is created, the object reference given in the first element of the _value tuple of a dynamic bound-attribute annotation is unchanged, unless the id of the object reference is found in attribute_object_mapper.

Parameters:
  • other (Annotable) – Source of annotations to copy.

  • attribute_object_mapper (dict) – Like the memo of __deepcopy__, maps object ids to objects. The purpose of this is to update the parent or owner objects of dynamic attribute annotations. If a dynamic attribute Annotation gives object x as the parent or owner of the attribute (that is, the first element of the Annotation._value tuple is x) and id(x) is found in attribute_object_mapper, then in the copy the owner of the attribute is changed to attribute_object_mapper[id(x)]. If attribute_object_mapper is None (default), then the following mapping is automatically inserted: id(other): self. That is, any references to other in any Annotation object will be remapped to self. If no reattribution mappings are desired at all, then an empty dictionary should be passed instead.
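
The owner-remapping behaviour described above can be sketched in a few lines. This is an illustration under stated assumptions, not DendroPy's implementation: the Annotation and Record classes and the copy_annotations function are hypothetical, and only the “(obj, attr_name)” value convention is taken from the description.

```python
class Annotation:
    """Hypothetical annotation holding a dynamic bound-attribute value."""

    def __init__(self, name, value):
        self.name = name
        self._value = value              # an (obj, attr_name) tuple

    @property
    def value(self):
        return getattr(*self._value)     # resolved each time it is read


class Record:
    """Hypothetical annotated object."""

    def __init__(self, label):
        self.label = label
        self.annotations = []


def copy_annotations(source, target, attribute_object_mapper=None):
    if attribute_object_mapper is None:
        # Default mapping: references to the source are remapped to the target.
        attribute_object_mapper = {id(source): target}
    for ann in source.annotations:
        obj, attr_name = ann._value
        obj = attribute_object_mapper.get(id(obj), obj)   # remap owner if listed
        target.annotations.append(Annotation(ann.name, (obj, attr_name)))


original = Record("A")
original.annotations.append(Annotation("label", (original, "label")))
remapped, pinned = Record("B"), Record("C")
copy_annotations(original, remapped)       # default mapping applies
copy_annotations(original, pinned, {})     # empty dict: owner left unchanged
```

With the default mapping the copied annotation follows its new owner; with an empty dictionary it keeps reporting the attribute of the original object.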

deep_copy_annotations_from(other, memo=None)

Note that all references to other in any annotation value (and sub-annotation, and sub-sub-sub-annotation, etc.) will be replaced with references to self. This may not always make sense (i.e., a reference to a particular entity may be absolute regardless of context).

description(depth=1, indent=0, itemize='', output=None)

Returns description of object, up to level depth.

discard_sequences(taxa)

Removes sequences associated with Taxon instances specified in taxa if they exist.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

export_character_indices(indices)

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the 0-based indices in indices. Note that this new matrix will still reference the same taxon set.

export_character_subset(character_subset)

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the CharacterSubset, character_subset. Note that this new matrix will still reference the same taxon set.
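
Both export methods amount to selecting columns by 0-based index. A minimal plain-Python sketch, with matrices modeled as dicts of sequences (the function below is illustrative and is not DendroPy's implementation):

```python
def export_character_indices(matrix, indices):
    """Return a new matrix holding only the given 0-based columns."""
    return {
        taxon: [sequence[i] for i in indices]
        for taxon, sequence in matrix.items()
    }


matrix = {"t1": "ACGT", "t2": "TGCA"}
first_and_third = export_character_indices(matrix, [0, 2])
```

Exporting a character subset works the same way once the subset has been reduced to its list of column indices.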

extend_matrix(other_matrix)

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.

extend_sequences(other_matrix, is_add_new_sequences=False)

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. All other sequences will be ignored.
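
The difference between extend_sequences and extend_matrix, as described above, can be sketched with plain dicts of lists. The two functions below are illustrative stand-ins, not DendroPy's code; taxa are represented by their labels.

```python
def extend_sequences(target, other):
    """Append characters only for taxa that target already has."""
    for taxon, sequence in other.items():
        if taxon in target:
            target[taxon].extend(sequence)


def extend_matrix(target, other):
    """Append characters for shared taxa and add sequences for new taxa."""
    for taxon, sequence in other.items():
        if taxon in target:
            target[taxon].extend(sequence)
        else:
            target[taxon] = list(sequence)


other = {"t1": list("GT"), "t2": list("AA")}

a = {"t1": list("AC")}
extend_sequences(a, other)   # t2 is ignored

b = {"t1": list("AC")}
extend_matrix(b, other)      # t2 is added
```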

fill(value, size=None, append=True)

Pads out all sequences in self by adding value to each sequence until its length is size long or equal to the length of the longest sequence if size is not specified.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete characters).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted to the front of each sequence.
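
The padding rule can be sketched in plain Python. This is an illustration of the behaviour described above, not DendroPy's implementation; the matrix is modeled as a dict of lists.

```python
def fill(matrix, value, size=None, append=True):
    """Pad every sequence with value until it is size long."""
    if size is None:
        # Default target: the length of the longest sequence.
        size = max((len(seq) for seq in matrix.values()), default=0)
    for seq in matrix.values():
        padding = [value] * (size - len(seq))   # empty if already long enough
        if append:
            seq.extend(padding)                 # pad at the end
        else:
            seq[:0] = padding                   # pad at the front


back = {"t1": list("ACG"), "t2": list("A")}
fill(back, "-")

front = {"t1": list("A")}
fill(front, "?", size=3, append=False)
```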

fill_taxa()

Adds a new (empty) sequence for each Taxon instance in current taxon namespace that does not have a sequence.

folded_site_frequency_spectrum(is_pad_vector_to_unfolded_length=False)

Returns the folded or minor site/allele frequency spectrum.

Given $N$ chromosomes, the site frequency spectrum is a vector $(f_0, f_1, f_2, \ldots, f_N)$, where the value $f_i$ is the number of sites where $i$ derived alleles are segregating in the sample: 0 alleles, 1 allele, 2 alleles, etc.

The folded site frequency spectrum is a vector $(f_0, f_1, f_2, \ldots, f_m)$, $m = \lceil \frac{N}{2} \rceil$, where the values are the number of minor alleles in the site.

Parameters:

is_pad_vector_to_unfolded_length (bool) – If False (default), then the vector length will be $\lceil \frac{N}{2} \rceil$, where $N$ is the number of taxa. If True, the length of the vector will be the number of taxa + 1, with the first element being the number of monomorphic sites not contributing to the site frequency spectrum.

Returns:

v (list[int]) – A vector of integers representing the folded site frequency spectrum.
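
The definition of the folded spectrum can be made concrete with a small plain-Python sketch. It is not DendroPy's implementation and it makes simplifying assumptions: sequences are equal-length strings, every site has at most two states, there is no missing data, and the result is indexed by minor-allele count from $f_0$ to $f_m$ as in the definition above. The function name folded_sfs is made up for the example.

```python
import math
from collections import Counter


def folded_sfs(sequences):
    """Count sites by the number of copies of the less frequent state."""
    n = len(sequences)
    m = math.ceil(n / 2)
    spectrum = [0] * (m + 1)                 # f_0 .. f_m
    for column in zip(*sequences):
        counts = Counter(column)
        if len(counts) > 2:
            raise ValueError("sketch handles at most two states per site")
        minor = n - max(counts.values())     # 0 for a monomorphic site
        spectrum[minor] += 1
    return spectrum


sequences = ["AAGT", "AAGT", "ACGA", "ACGT"]
spectrum = folded_sfs(sequences)
```

In this sample, two sites are monomorphic, one has a single minor-allele copy, and one has two.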

classmethod from_dict(source_dict, char_matrix=None, case_sensitive_taxon_labels=False, **kwargs)

Populates character matrix from dictionary (or similar mapping type), creating Taxon objects and sequences as needed.

Keys must be strings representing labels of Taxon objects, or Taxon objects directly. If a key is specified as a string, then it will be dereferenced to the first existing Taxon object in the current taxon namespace with the same label. If no such Taxon object can be found, then a new Taxon object is created and added to the current namespace. If a key is specified as a Taxon object, then this is used directly. If it is not in the current taxon namespace, it will be added.

Values are the sequences (more generally, iterable of values). If values are of type CharacterDataSequence, then they are added as-is. Otherwise CharacterDataSequence instances are created for them. Values may be coerced into types compatible with particular matrices. The classmethod coerce_values() will be called for this.

Examples

The following creates a DnaCharacterMatrix instance with three sequences:

from dendropy import DnaCharacterMatrix

# Map taxon labels to sequence strings
d = {
    "s1": "TCCAA",
    "s2": "TGCAA",
    "s3": "TG-AA",
}
dna = DnaCharacterMatrix.from_dict(d)

Three Taxon objects will be created, corresponding to the labels ‘s1’, ‘s2’, ‘s3’. Each associated string sequence will be converted to a CharacterDataSequence, with each symbol (“A”, “C”, etc.) being replaced by the DNA state represented by the symbol.

Parameters:
  • source_dict (dict or other mapping type) – Keys must be strings representing labels of Taxon objects, or Taxon objects directly. Values are sequences. See above for details.

  • char_matrix (CharacterMatrix) – Instance of CharacterMatrix to populate with data. If not specified, a new one will be created using keyword arguments specified by kwargs.

  • case_sensitive_taxon_labels (boolean) – If True, string labels specified as keys in source_dict will be matched to Taxon objects in the current taxon namespace with case being respected. If False, then case will be ignored.

  • **kwargs (keyword arguments, optional) – Keyword arguments to be passed to constructor of CharacterMatrix when creating new instance to populate, if no target instance is provided via char_matrix.

Returns:

char_matrix (CharacterMatrix) – CharacterMatrix populated by data from source_dict.

classmethod get(**kwargs)

Instantiate and return a new character matrix object from a data source.

Mandatory Source-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object of data opened for reading.

  • path (str) – Path to file of data.

  • url (str) – URL of data.

  • data (str) – Data given directly.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Specification of data format (e.g., “nexus”).

Optional General Keyword Arguments:

  • label (str) – Name or identifier to be assigned to the new object; if not given, will be assigned the one specified in the data source, or None otherwise.

  • taxon_namespace (TaxonNamespace) – The TaxonNamespace instance to use to manage the taxon names. If not specified, a new one will be created.

  • matrix_offset (int) – 0-based index of character block or matrix in source to be parsed. If not specified then the first matrix (offset = 0) is assumed.

  • ignore_unrecognized_keyword_arguments (bool) – If True, then unsupported or unrecognized keyword arguments will not result in an error. Default is False: unsupported keyword arguments will result in an error.

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is interpreted and processed, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples:

import dendropy

dna1 = dendropy.DnaCharacterMatrix.get(
        file=open("pythonidae.fasta"),
        schema="fasta")
dna2 = dendropy.DnaCharacterMatrix.get(
        url="http://purl.org/phylo/treebase/phylows/matrix/TB2:M2610?format=nexus",
        schema="nexus")
aa1 = dendropy.ProteinCharacterMatrix.get(
        file=open("pythonidae.dat"),
        schema="phylip")
std1 = dendropy.StandardCharacterMatrix.get(
        path="python_morph.nex",
        schema="nexus")
std2 = dendropy.StandardCharacterMatrix.get(
        data=">t1\n01011\n\n>t2\n11100",
        schema="fasta")

classmethod get_from_path(src, schema, **kwargs)

Factory method to return new object of this class from file specified by string src.

Parameters:
  • src (string) – Full file path to source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_stream(src, schema, **kwargs)

Factory method to return new object of this class from file-like object src.

Parameters:
  • src (file or file-like) – Source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_string(src, schema, **kwargs)

Factory method to return new object of this class from string src.

Parameters:
  • src (string) – Data as a string.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_url(src, schema, strip_markup=False, **kwargs)

Factory method to return a new object of this class from URL given by src.

Parameters:
  • src (string) – URL of location providing source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

items()

Returns character map key, value pairs in key-order.

keep_sequences(taxa)

Discards all sequences not associated with any of the Taxon instances specified in taxa.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

property max_sequence_size

Maximum number of characters across all sequences in matrix.

Returns:

n (integer) – Maximum number of characters across all sequences in matrix.

migrate_taxon_namespace(taxon_namespace, unify_taxa_by_label=True, taxon_mapping_memo=None)

Move this object and all members to a new operational taxonomic unit concept namespace scope.

Current self.taxon_namespace value will be replaced with value given in taxon_namespace if this is not None, or a new TaxonNamespace object. Following this, reconstruct_taxon_namespace() will be called: each distinct Taxon object associated with self or members of self that is not already in taxon_namespace will be replaced with a new Taxon object that will be created with the same label and added to self.taxon_namespace. Calling this method results in the object (and all its member objects) being associated with a new, independent taxon namespace.

Label mapping case sensitivity follows the self.taxon_namespace.is_case_sensitive setting. If False and unify_taxa_by_label is also True, then the establishment of correspondence between Taxon objects in the old and new namespaces will be based on case-insensitive matching of labels. E.g., if there are four Taxon objects with labels ‘Foo’, ‘Foo’, ‘FOO’, and ‘FoO’ in the old namespace, then all objects that reference these will reference a single new Taxon object in the new namespace (whose label is some existing casing variant of ‘foo’). If True and unify_taxa_by_label is True, Taxon objects with labels identical except in case will be considered distinct.
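
The effect of case sensitivity on label unification can be sketched as a simple grouping of labels. This is an illustration only, not DendroPy's implementation: the unify_labels function is hypothetical, and choosing the first casing seen as the representative label is an assumption made for the example.

```python
def unify_labels(labels, is_case_sensitive=False):
    """Map each old label to the label that represents it after unification."""
    representative = {}
    mapping = {}
    for label in labels:
        key = label if is_case_sensitive else label.lower()
        representative.setdefault(key, label)   # first casing seen wins (assumption)
        mapping[label] = representative[key]
    return mapping


labels = ["Foo", "Foo", "FOO", "FoO"]
insensitive = unify_labels(labels)
sensitive = unify_labels(labels, is_case_sensitive=True)
```

With case-insensitive matching all four labels collapse onto one representative; with case-sensitive matching the three distinct spellings stay distinct.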

Parameters:
  • taxon_namespace (TaxonNamespace) – The TaxonNamespace into the scope of which this object will be moved.

  • unify_taxa_by_label (boolean, optional) – If True, then references to distinct Taxon objects with identical labels in the current namespace will be replaced with a reference to a single Taxon object in the new namespace. If False: references to distinct Taxon objects will remain distinct, even if the labels are the same.

  • taxon_mapping_memo (dictionary) – Similar to memo of deepcopy, this is a dictionary that maps Taxon objects in the old namespace to corresponding Taxon objects in the new namespace. Mostly for internal use when migrating complex data to a new namespace. Note that any mappings here take precedence over all other options: if a Taxon object in the old namespace is found in this dictionary, the counterpart in the new namespace will be whatever value is mapped, regardless of, e.g., label values.

Examples

Use this method to move an object from one taxon namespace to another.

For example, to get a copy of an object associated with another taxon namespace and associate it with a different namespace:

# Get handle to the new TaxonNamespace
other_taxon_namespace = some_other_data.taxon_namespace

# Get a taxon-namespace scoped copy of a tree
# in another namespace
t2 = Tree(t1)

# Replace taxon namespace of copy
t2.migrate_taxon_namespace(other_taxon_namespace)

You can also use this method to get a copy of a structure and then move it to a new namespace:

t2 = Tree(t1)
t2.migrate_taxon_namespace(TaxonNamespace())

# Note: the same effect can be achieved by:
t3 = copy.deepcopy(t1)

new_character_subset(label, character_indices)

Defines a set of characters (columns) that make up a character set. Raises an error if one already exists with the same label. Column indices are 0-based.

new_sequence(taxon, values=None)

Creates a new CharacterDataSequence associated with Taxon taxon, and populates it with values in values.

Parameters:
  • taxon (Taxon) – Taxon instance with which this sequence is associated.

  • values (iterable or None) – An initial set of values with which to populate the new character sequence.

Returns:

s (CharacterDataSequence) – A new CharacterDataSequence associated with Taxon taxon.

pack(value=None, size=None, append=True)

Adds missing sequences for all Taxon instances in current namespace, and then pads out all sequences in self by adding value to each sequence until its length is size long or equal to the length of the longest sequence if size is not specified. A combination of CharacterMatrix.fill_taxa and CharacterMatrix.fill.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete characters).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted to the front of each sequence.

poll_taxa(taxa=None)

Returns a set populated with all of Taxon instances associated with self.

Parameters:

taxa (set) – Set to populate. If not specified, a new one will be created.

Returns:

taxa (set[Taxon]) – Set of taxa associated with self.

purge_taxon_namespace()

Remove all Taxon instances in self.taxon_namespace that are not associated with self or any item in self.

reconstruct_taxon_namespace(unify_taxa_by_label=True, taxon_mapping_memo=None)

See TaxonNamespaceAssociated.reconstruct_taxon_namespace.

reindex_subcomponent_taxa()

Synchronizes Taxon objects of map to taxon_namespace of self.

reindex_taxa(taxon_namespace=None, clear=False)

DEPRECATED: Use migrate_taxon_namespace() instead. Rebuilds taxon_namespace from scratch, or assigns Taxon objects from given TaxonNamespace object taxon_namespace based on label values.

remap_to_default_state_alphabet_by_symbol(purge_other_state_alphabets=True)

All entities with any reference to a state alphabet will have the reference reassigned to the default state alphabet, and all entities with any reference to a state alphabet element will have the reference reassigned to any state alphabet element in the default state alphabet that has the same symbol. Raises ValueError if no matching symbol can be found.

remap_to_state_alphabet_by_symbol(state_alphabet, purge_other_state_alphabets=True)

All entities with any reference to a state alphabet will have the reference reassigned to state_alphabet, and all entities with any reference to a state alphabet element will have the reference reassigned to any state alphabet element in state_alphabet that has the same symbol. Raises KeyError if no matching symbol can be found.

remove_sequences(taxa)

Removes sequences associated with Taxon instances specified in taxa. A KeyError is raised if a Taxon instance is specified for which there is no associated sequence.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

replace_sequences(other_matrix)

Replaces sequences for Taxon objects shared between self and other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to replace sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.

  4. All other sequences will be ignored.

property sequence_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

sequences()

List of all sequences in self.

Returns:

s (list[CharacterDataSequence]) – List of all sequences in self.

taxon_namespace_scoped_copy(memo=None)

Cloning level: 1. Taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon objects: these are preserved as references.

taxon_state_sets_map(char_indices=None, gaps_as_missing=True, gap_state=None, no_data_state=None)

Returns a dictionary that maps taxon objects to lists of sets of fundamental state indices.

Parameters:
  • char_indices (iterable of ints) – An iterable of indexes of characters to include (by column). If not given or None [default], then all characters are included.

  • gaps_as_missing (boolean) – If True [default] then gap characters will be treated as missing data values. If False, then they will be treated as an additional (fundamental) state.

Returns:

d (dict) – A dictionary with Taxon objects as keys and a list of sets of fundamental state indexes as values.

E.g., Given the following matrix of DNA characters:

T1 AGN
T2 C-T
T3 GC?

Return with gaps_as_missing==True

{
    <T1> : [ set([0]), set([2]),        set([0,1,2,3]) ],
    <T2> : [ set([1]), set([0,1,2,3]),  set([3]) ],
    <T3> : [ set([2]), set([1]),        set([0,1,2,3]) ],
}

Return with gaps_as_missing==False

{
    <T1> : [ set([0]), set([2]),        set([0,1,2,3]) ],
    <T2> : [ set([1]), set([4]),        set([3]) ],
    <T3> : [ set([2]), set([1]),        set([0,1,2,3,4]) ],
}

Note that when gaps are treated as a fundamental state, not only does ‘-’ map to a distinct and unique state (4), but ‘?’ (missing data) maps to a set consisting of all bases and the gap state, whereas ‘N’ maps to a set of all bases but not including the gap state.

When gaps are treated as missing, on the other hand, then ‘?’ and ‘N’ and ‘-’ all map to the same set, i.e. of all the bases.
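
The mapping in the example above can be reproduced with a short plain-Python sketch. It is not DendroPy's implementation: the state indices (A=0, C=1, G=2, T=3, gap=4) are taken from the example output, and only the symbols A, C, G, T, N, ‘-’ and ‘?’ are handled.

```python
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
GAP = 4   # index of the gap state when it is treated as fundamental


def state_sets(sequence, gaps_as_missing=True):
    """Translate symbols into sets of fundamental state indices."""
    all_bases = set(BASES.values())
    result = []
    for symbol in sequence:
        if symbol in BASES:
            result.append({BASES[symbol]})
        elif symbol == "N":
            result.append(set(all_bases))            # any base, never the gap
        elif symbol == "-":
            result.append(set(all_bases) if gaps_as_missing else {GAP})
        elif symbol == "?":
            result.append(set(all_bases) if gaps_as_missing else all_bases | {GAP})
        else:
            raise ValueError("unsupported symbol: %r" % symbol)
    return result


matrix = {"T1": "AGN", "T2": "C-T", "T3": "GC?"}
as_missing = {t: state_sets(s) for t, s in matrix.items()}
as_state = {t: state_sets(s, gaps_as_missing=False) for t, s in matrix.items()}
```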

update_sequences(other_matrix)

Replaces sequences for Taxon objects shared between self and other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to update sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.

  4. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.
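
The relationship between add_sequences, replace_sequences, and update_sequences can be sketched with plain dicts of lists. The functions below are illustrative stand-ins rather than DendroPy's code; taxa are represented by their labels, and list() stands in for the shallow copy of a sequence.

```python
def add_sequences(target, other):
    """Copy in sequences only for taxa missing from target."""
    for taxon, sequence in other.items():
        if taxon not in target:
            target[taxon] = list(sequence)


def replace_sequences(target, other):
    """Overwrite sequences only for taxa present in both matrices."""
    for taxon, sequence in other.items():
        if taxon in target:
            target[taxon] = list(sequence)


def update_sequences(target, other):
    """Replace shared sequences, then add the missing ones."""
    replace_sequences(target, other)
    add_sequences(target, other)


a = {"t1": list("AC"), "t2": list("GG")}
b = {"t2": list("TT"), "t3": list("CA")}
update_sequences(a, b)
```

After the call, t1 is untouched, t2 carries the sequence from the other matrix, and t3 has been added.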

update_taxon_namespace()

All Taxon objects in self that are not in self.taxon_namespace will be added.

values()

Iterates over values (i.e. sequences) in this matrix.

property vector_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

write(**kwargs)

Writes out self in schema format.

Mandatory Destination-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object opened for writing.

  • path (str) – Path to file to which to write.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Specification of data format (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples

# Using a file path:
d.write(path="path/to/file.dat", schema="nexus")

# Using an open file:
with open("path/to/file.dat", "w") as f:
    d.write(file=f, schema="nexus")

write_to_path(dest, schema, **kwargs)

Writes to file specified by dest.

write_to_stream(dest, schema, **kwargs)

Writes to file-like object dest.

ProteinCharacterMatrix: Protein (Amino Acid) Data

class dendropy.datamodel.charmatrixmodel.ProteinCharacterMatrix(*args, **kwargs)[source]

Specializes CharacterMatrix for protein or amino acid data.

__delitem__(key)

Removes sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

__getitem__(key)

Retrieves sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with the specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

Returns:

s (CharacterDataSequence) – A sequence associated with the Taxon instance referenced by key.

__iter__()

Returns an iterator over character map’s ordered keys.

__len__()

Number of sequences in matrix.

Returns:

n (integer) – Number of sequences in matrix.

__setitem__(key, values)

Assigns sequence values to taxon specified by key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with the specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

add_character_subset(char_subset)

Adds a CharacterSubset object. Raises an error if one already exists with the same label.

add_sequences(other_matrix)

Adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to add sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self as a shallow-copy.

  4. All other sequences will be ignored.

as_string(schema, **kwargs)

Composes and returns string representation of the data.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Specification of data format (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

character_sequence_type

alias of ProteinCharacterDataSequence

clear()

Removes all sequences from matrix.

clone(depth=1)

Creates and returns a copy of self.

Parameters:

depth (integer) –

The depth of the copy:

  • 0: shallow-copy: All member objects are references, except for the annotation_set of the top-level object and member Annotation objects: these are full, independent instances (though any complex objects in the value field of Annotation objects are also just references).

  • 1: taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon instances: these are references.

  • 2: Exhaustive deep-copy: all objects are cloned.
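The difference between depth 1 and depth 2 can be illustrated with the standard copy module. The sketch below is not how DendroPy implements clone(); Taxon and ToyMatrix are hypothetical toy classes. It relies on one documented behavior of copy.deepcopy: objects whose id is already present in the memo dictionary are reused rather than copied.

```python
import copy


class Taxon:
    """Hypothetical stand-in for a taxon."""
    def __init__(self, label):
        self.label = label


class ToyMatrix:
    """Hypothetical container mapping Taxon objects to lists of values."""
    def __init__(self, taxa):
        self.taxa = taxa
        self.sequences = {t: [] for t in taxa}


taxa = [Taxon("s1"), Taxon("s2")]
original = ToyMatrix(taxa)
original.sequences[taxa[0]].extend("ACGT")

# Depth-1 style copy: pre-seed the memo so the taxa (and the list holding
# them) are kept as references while everything else is copied.
memo = {id(t): t for t in taxa}
memo[id(taxa)] = taxa
scoped = copy.deepcopy(original, memo)

# Depth-2 style copy: nothing is shared with the original.
full = copy.deepcopy(original)
```

In scoped, the sequences are independent lists but the keys are the original Taxon objects; in full, the taxa themselves are new objects.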

coerce_values(values)

Converts elements of values to type of matrix.

This method is called by CharacterMatrix.from_dict to create sequences from iterables of values. This method should be overridden by derived classes to ensure that values consist of types compatible with the particular type of matrix. For example, a CharacterMatrix type with a fixed state alphabet (such as DnaCharacterMatrix) would dereference the string elements of values to return a list of StateIdentity objects corresponding to the symbols represented by the strings. If there is no value-type conversion done, then values should be returned as-is. If no value-type conversion is possible (e.g., when the type of a value is dependent on positional information), then a TypeError should be raised.

Parameters:

values (iterable) – Iterable of values to be converted.

Returns:

v (list) – List of converted values.
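As a rough illustration of the kind of override described above, the following sketch converts symbol strings to states for a fixed alphabet. It is an assumption-laden toy: DNA_STATES is a hypothetical alphabet, and plain integers stand in for the StateIdentity objects a real matrix type would return.

```python
# Hypothetical fixed alphabet: symbol -> integer standing in for a state object.
DNA_STATES = {symbol: index for index, symbol in enumerate("ACGT")}


def coerce_values(values):
    """Convert an iterable of symbol strings to a list of state indices."""
    coerced = []
    for value in values:
        if not isinstance(value, str):
            # No sensible conversion for non-string values in this toy alphabet.
            raise TypeError("expected a symbol string, got %r" % (value,))
        coerced.append(DNA_STATES[value.upper()])   # KeyError for unknown symbols
    return coerced


states = coerce_values("acgt")
```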

classmethod concatenate(char_matrices)

Creates and returns a single character matrix from multiple CharacterMatrix objects specified as a list, ‘char_matrices’. All the CharacterMatrix objects in the list must be of the same type, and share the same TaxonNamespace reference. All taxa must be present in all alignments, and all alignments must be of the same length. Component parts will be recorded as character subsets.
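The idea of concatenating alignments while recording each component as a character subset can be sketched with plain dictionaries. This is not DendroPy's code: the concatenate function below is a hypothetical helper, alignments are modeled as label-to-string dictionaries, subsets are modeled as half-open column ranges, and the length requirement is read here as "sequences within each alignment have equal length", which is an assumption.

```python
def concatenate(alignments):
    """Join alignments column-wise and record the column range of each part."""
    labels = set(alignments[0])
    combined = {label: "" for label in labels}
    subsets = []                       # (start, end) half-open ranges, one per input
    offset = 0
    for alignment in alignments:
        if set(alignment) != labels:
            raise ValueError("all taxa must be present in all alignments")
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within an alignment must be the same length")
        width = lengths.pop()
        for label, seq in alignment.items():
            combined[label] += seq
        subsets.append((offset, offset + width))
        offset += width
    return combined, subsets


part1 = {"s1": "AC", "s2": "AG"}
part2 = {"s1": "TTT", "s2": "TTA"}
combined, subsets = concatenate([part1, part2])
```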

classmethod concatenate_from_paths(paths, schema, **kwargs)

Read a character matrix from each file path given in paths, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

classmethod concatenate_from_streams(streams, schema, **kwargs)

Read a character matrix from each file object given in streams, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

copy_annotations_from(other, attribute_object_mapper=None)

Copies annotations from other, which must be of Annotable type.

Copies are deep-copies, in that the Annotation objects added to the annotation_set AnnotationSet collection of self are independent copies of those in the annotation_set collection of other. However, dynamic bound-attribute annotations retain references to the original objects as given in other, which may or may not be desirable. This is handled by updating the objects to which attributes are bound via mappings found in attribute_object_mapper.

In dynamic bound-attribute annotations, the _value attribute of the annotations object (Annotation._value) is a tuple consisting of “(obj, attr_name)”, which instructs the Annotation object to return “getattr(obj, attr_name)” (via: “getattr(*self._value)”) when returning the value of the Annotation. “obj” is typically the object to which the AnnotationSet belongs (i.e., self). When a copy of Annotation is created, the object reference given in the first element of the _value tuple of dynamic bound-attribute annotations is unchanged, unless the id of the object reference is found in attribute_object_mapper (see the parameter description below).

Parameters:
  • other (Annotable) – Source of annotations to copy.

  • attribute_object_mapper (dict) – Like the memo of __deepcopy__, maps object ids to objects. The purpose of this is to update the parent or owner objects of dynamic attribute annotations. If a dynamic attribute Annotation gives object x as the parent or owner of the attribute (that is, the first element of the Annotation._value tuple is x) and id(x) is found in attribute_object_mapper, then in the copy the owner of the attribute is changed to attribute_object_mapper[id(x)]. If attribute_object_mapper is None (default), then the following mapping is automatically inserted: id(other): self. That is, any references to other in any Annotation object will be remapped to self. If no reattribution mappings are desired at all, then an empty dictionary should be passed instead.
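The remapping rule for dynamic bound-attribute annotations can be illustrated with a small toy model. ToyAnnotation, Node, and copy_with_mapper are hypothetical names invented for this sketch; only the (obj, attr_name) tuple and the id-keyed mapper come from the description above.

```python
class ToyAnnotation:
    """Hypothetical annotation whose value is read from an (obj, attr_name) pair."""
    def __init__(self, obj, attr_name):
        self._value = (obj, attr_name)

    @property
    def value(self):
        return getattr(*self._value)

    def copy_with_mapper(self, mapper):
        obj, attr_name = self._value
        # Swap the owner if its id is in the mapper; otherwise keep it unchanged.
        return ToyAnnotation(mapper.get(id(obj), obj), attr_name)


class Node:
    """Hypothetical annotated object."""
    def __init__(self, label):
        self.label = label


other, this = Node("other"), Node("self")
ann = ToyAnnotation(other, "label")

remapped = ann.copy_with_mapper({id(other): this})   # like the default mapping
unchanged = ann.copy_with_mapper({})                 # empty dict: no remapping
```

With the id(other)-to-self mapping the copied annotation reads its value from the new owner; with an empty dictionary it keeps reading from the original object.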

deep_copy_annotations_from(other, memo=None)

Note that all references to other in any annotation value (and sub-annotation, and sub-sub-sub-annotation, etc.) will be replaced with references to self. This may not always make sense (i.e., a reference to a particular entity may be absolute regardless of context).

description(depth=1, indent=0, itemize='', output=None)

Returns description of object, up to level depth.

discard_sequences(taxa)

Removes sequences associated with Taxon instances specified in taxa if they exist.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

export_character_indices(indices)

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the 0-based indices in indices. Note that this new matrix will still reference the same taxon set.
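Column export by 0-based index amounts to selecting the same positions from every sequence. The following sketch shows this on a plain dictionary of lists; export_columns is a hypothetical helper, not the library method.

```python
def export_columns(matrix, indices):
    """Return a new dict holding only the 0-based columns listed in `indices`."""
    return {label: [seq[i] for i in indices] for label, seq in matrix.items()}


matrix = {"s1": list("TCCAA"), "s2": list("TGCAA")}
subset = export_columns(matrix, [0, 1, 4])
```

The original mapping is left untouched, and the keys (standing in for taxa) are shared between the two structures.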

export_character_subset(character_subset)

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the CharacterSubset, character_subset. Note that this new matrix will still reference the same taxon set.

extend_matrix(other_matrix)

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.
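Following the method summary (extend shared taxa, add taxa found only in other_matrix), the behavior can be sketched with dictionaries of lists. This is a toy model with a hypothetical helper function, not the library's implementation.

```python
def extend_matrix(target, other):
    """Append to sequences of shared labels and add sequences for new labels."""
    for label, seq in other.items():
        if label in target:
            target[label].extend(seq)      # shared taxon: append characters
        else:
            target[label] = list(seq)      # new taxon: add a copy of the sequence
    return target


a = {"s1": list("AC"), "s2": list("AG")}
b = {"s2": list("TT"), "s3": list("GG")}
extend_matrix(a, b)
```

Dropping the else branch gives the behavior described for extend_sequences, where sequences for taxa not already in self are ignored.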

extend_sequences(other_matrix, is_add_new_sequences=False)

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. All other sequences will be ignored.

fill(value, size=None, append=True)

Pads out all sequences in self by adding value to each sequence until its length equals size, or the length of the longest sequence if size is not specified.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete characters).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted at the front of each sequence.
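The padding rule and the effect of the append flag can be sketched as follows. The fill function here is a hypothetical stand-alone helper operating on a dictionary of lists, written only to illustrate the parameters described above.

```python
def fill(sequences, value, size=None, append=True):
    """Pad each list in `sequences` in place with `value` up to `size`."""
    if size is None:
        size = max((len(seq) for seq in sequences.values()), default=0)
    for seq in sequences.values():
        padding = [value] * (size - len(seq))   # empty if already long enough
        if append:
            seq.extend(padding)                 # pad at the end
        else:
            seq[:0] = padding                   # pad at the front
    return sequences


seqs = {"s1": list("ACGT"), "s2": list("AC")}
fill(seqs, "-")
front = fill({"s1": list("AC")}, "?", size=4, append=False)
```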

fill_taxa()

Adds a new (empty) sequence for each Taxon instance in current taxon namespace that does not have a sequence.

folded_site_frequency_spectrum(is_pad_vector_to_unfolded_length=False)

Returns the folded or minor site/allele frequency spectrum.

Given $N$ chromosomes, the site frequency spectrum is a vector $(f_0, f_1, f_2, \ldots, f_N)$, where the value $f_i$ is the number of sites where $i$ derived alleles are segregating in the sample: 0 alleles, 1 allele, 2 alleles, etc.

The folded site frequency spectrum is a vector $(f_0, f_1, f_2, \ldots, f_m)$, $m = \lceil \frac{N}{2} \rceil$, where the value $f_i$ is the number of sites at which the minor allele occurs $i$ times.

Parameters:

is_pad_vector_to_unfolded_length (bool) – If False (default), then the vector length will be $\lceil \frac{N}{2} \rceil$, where $N$ is the number of taxa. If True, the length of the vector will be the number of taxa + 1, with the first element the number of monomorphic sites not contributing to the site frequency spectrum.

Returns:

v (list[int]) – A vector of integers representing the folded site frequency spectrum.
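A minimal sketch of the folded spectrum, under several simplifying assumptions that are not taken from the library: sequences are plain equal-length strings, gaps and missing data are not treated specially, sites with more than two states are skipped, and the vector is indexed from $f_0$ to $f_m$ as in the definition above. The function name folded_sfs is hypothetical.

```python
import math
from collections import Counter


def folded_sfs(sequences):
    """Count sites by minor-allele count for equal-length string sequences."""
    n = len(sequences)
    m = math.ceil(n / 2)
    spectrum = [0] * (m + 1)               # entries f_0 .. f_m
    for column in zip(*sequences):
        counts = Counter(column)
        if len(counts) > 2:
            continue                       # assumption: skip multi-allelic sites
        # Monomorphic sites have no minor allele and are counted in f_0.
        minor = min(counts.values()) if len(counts) == 2 else 0
        spectrum[minor] += 1
    return spectrum


sfs = folded_sfs(["AAC", "AAC", "ATC", "ATG"])
```

In this example the first site is monomorphic, the second has a minor allele count of 2, and the third has a minor allele count of 1.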

classmethod from_dict(source_dict, char_matrix=None, case_sensitive_taxon_labels=False, **kwargs)

Populates character matrix from dictionary (or similar mapping type), creating Taxon objects and sequences as needed.

Keys must be strings representing labels of Taxon objects or Taxon objects directly. If a key is specified as a string, then it will be dereferenced to the first existing Taxon object in the current taxon namespace with the same label. If no such Taxon object can be found, then a new Taxon object is created and added to the current namespace. If a key is specified as a Taxon object, then this is used directly. If it is not in the current taxon namespace, it will be added.

Values are the sequences (more generally, iterable of values). If values are of type CharacterDataSequence, then they are added as-is. Otherwise CharacterDataSequence instances are created for them. Values may be coerced into types compatible with particular matrices. The classmethod coerce_values() will be called for this.

Examples

The following creates a DnaCharacterMatrix instance with three sequences:

from dendropy import DnaCharacterMatrix

d = {
    "s1" : "TCCAA",
    "s2" : "TGCAA",
    "s3" : "TG-AA",
}
dna = DnaCharacterMatrix.from_dict(d)

Three Taxon objects will be created, corresponding to the labels ‘s1’, ‘s2’, ‘s3’. Each associated string sequence will be converted to a CharacterDataSequence, with each symbol (“A”, “C”, etc.) being replaced by the DNA state represented by the symbol.

Parameters:
  • source_dict (dict or other mapping type) – Keys must be strings representing labels of Taxon objects or Taxon objects directly. Values are sequences. See above for details.

  • char_matrix (CharacterMatrix) – Instance of CharacterMatrix to populate with data. If not specified, a new one will be created using keyword arguments specified by kwargs.

  • case_sensitive_taxon_labels (boolean) – If True, string labels specified as keys in source_dict will be matched to Taxon objects in the current taxon namespace with case being respected. If False, then case will be ignored.

  • **kwargs (keyword arguments, optional) – Keyword arguments to be passed to constructor of CharacterMatrix when creating new instance to populate, if no target instance is provided via char_matrix.

Returns:

char_matrix (CharacterMatrix) – CharacterMatrix populated by data from source_dict.

classmethod get(**kwargs)

Instantiate and return a new character matrix object from a data source.

Mandatory Source-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object of data opened for reading.

  • path (str) – Path to file of data.

  • url (str) – URL of data.

  • data (str) – Data given directly.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Specification of data format (e.g., “nexus”).

Optional General Keyword Arguments:

  • label (str) – Name or identifier to be assigned to the new object; if not given, will be assigned the one specified in the data source, or None otherwise.

  • taxon_namespace (TaxonNamespace) – The TaxonNamespace instance to use to manage the taxon names. If not specified, a new one will be created.

  • matrix_offset (int) – 0-based index of character block or matrix in source to be parsed. If not specified then the first matrix (offset = 0) is assumed.

  • ignore_unrecognized_keyword_arguments (bool) – If True, then unsupported or unrecognized keyword arguments will not result in an error. Default is False: unsupported keyword arguments will result in an error.

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is interpreted and processed, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples:

import dendropy

dna1 = dendropy.DnaCharacterMatrix.get(
        file=open("pythonidae.fasta"),
        schema="fasta")
dna2 = dendropy.DnaCharacterMatrix.get(
        url="http://purl.org/phylo/treebase/phylows/matrix/TB2:M2610?format=nexus",
        schema="nexus")
aa1 = dendropy.ProteinCharacterMatrix.get(
        file=open("pythonidae.dat"),
        schema="phylip")
std1 = dendropy.StandardCharacterMatrix.get(
        path="python_morph.nex",
        schema="nexus")
std2 = dendropy.StandardCharacterMatrix.get(
        data=">t1\n01011\n\n>t2\n11100",
        schema="fasta")

classmethod get_from_path(src, schema, **kwargs)

Factory method to return new object of this class from file specified by string src.

Parameters:
  • src (string) – Full file path to source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_stream(src, schema, **kwargs)

Factory method to return new object of this class from file-like object src.

Parameters:
  • src (file or file-like) – Source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_string(src, schema, **kwargs)

Factory method to return new object of this class from string src.

Parameters:
  • src (string) – Data as a string.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_url(src, schema, strip_markup=False, **kwargs)

Factory method to return a new object of this class from URL given by src.

Parameters:
  • src (string) – URL of location providing source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

items()

Returns character map key, value pairs in key-order.

keep_sequences(taxa)

Discards all sequences not associated with any of the Taxon instances in taxa.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

property max_sequence_size

Maximum number of characters across all sequences in matrix.

Returns:

n (integer) – Maximum number of characters across all sequences in matrix.

migrate_taxon_namespace(taxon_namespace, unify_taxa_by_label=True, taxon_mapping_memo=None)

Move this object and all members to a new operational taxonomic unit concept namespace scope.

The current self.taxon_namespace value will be replaced with the value given in taxon_namespace if this is not None, or with a new TaxonNamespace object otherwise. Following this, reconstruct_taxon_namespace() will be called: each distinct Taxon object associated with self or members of self that is not already in taxon_namespace will be replaced with a new Taxon object that will be created with the same label and added to self.taxon_namespace. Calling this method results in the object (and all its member objects) being associated with a new, independent taxon namespace.

Label mapping case sensitivity follows the self.taxon_namespace.is_case_sensitive setting. If False and unify_taxa_by_label is also True, then the establishment of correspondence between Taxon objects in the old and new namespaces will be based on case-insensitive matching of labels. E.g., if there are four Taxon objects with labels ‘Foo’, ‘Foo’, ‘FOO’, and ‘FoO’ in the old namespace, then all objects that reference these will reference a single new Taxon object in the new namespace (whose label is one of the existing case variants of ‘foo’). If True and unify_taxa_by_label is True, Taxon objects with labels identical except in case will be considered distinct.

Parameters:
  • taxon_namespace (TaxonNamespace) – The TaxonNamespace into the scope of which this object will be moved.

  • unify_taxa_by_label (boolean, optional) – If True, then references to distinct Taxon objects with identical labels in the current namespace will be replaced with a reference to a single Taxon object in the new namespace. If False: references to distinct Taxon objects will remain distinct, even if the labels are the same.

  • taxon_mapping_memo (dictionary) – Similar to memo of deepcopy, this is a dictionary that maps Taxon objects in the old namespace to corresponding Taxon objects in the new namespace. Mostly for internal use when migrating complex data to a new namespace. Note that any mappings here take precedence over all other options: if a Taxon object in the old namespace is found in this dictionary, the counterpart in the new namespace will be whatever value is mapped, regardless of, e.g., label values.
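The interaction between unify_taxa_by_label and case sensitivity can be sketched with a toy mapping builder. Everything here is hypothetical (the Taxon class, build_taxon_mapping, and the list used as a namespace); it only illustrates the ‘Foo’/‘FOO’ example from the description.

```python
class Taxon:
    """Hypothetical stand-in for a taxon."""
    def __init__(self, label):
        self.label = label


def build_taxon_mapping(old_taxa, unify_by_label=True, case_sensitive=False):
    """Map each old Taxon to a Taxon in a new namespace (a plain list here)."""
    new_namespace, mapping, by_label = [], {}, {}
    for old in old_taxa:
        key = old.label if case_sensitive else old.label.lower()
        if unify_by_label and key in by_label:
            mapping[old] = by_label[key]       # reuse the taxon already created
            continue
        new = Taxon(old.label)
        new_namespace.append(new)
        by_label.setdefault(key, new)
        mapping[old] = new
    return new_namespace, mapping


old = [Taxon("Foo"), Taxon("Foo"), Taxon("FOO"), Taxon("FoO")]
unified, unified_map = build_taxon_mapping(old)
respecting_case, _ = build_taxon_mapping(old, case_sensitive=True)
distinct, _ = build_taxon_mapping(old, unify_by_label=False)
```

With case-insensitive unification the four old taxa collapse to one new taxon; with case respected, the two ‘Foo’ taxa collapse while ‘FOO’ and ‘FoO’ stay separate; without unification all four remain distinct.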

Examples

Use this method to move an object from one taxon namespace to another.

For example, to get a copy of an object associated with another taxon namespace and associate it with a different namespace:

# Get handle to the new TaxonNamespace
other_taxon_namespace = some_other_data.taxon_namespace

# Get a taxon-namespace scoped copy of a tree
# in another namespace
t2 = Tree(t1)

# Replace taxon namespace of copy
t2.migrate_taxon_namespace(other_taxon_namespace)

You can also use this method to get a copy of a structure and then move it to a new namespace:

t2 = Tree(t1)
t2.migrate_taxon_namespace(TaxonNamespace())

# Note: the same effect can be achieved by:
t3 = copy.deepcopy(t1)

new_character_subset(label, character_indices)

Defines a set of characters (columns) that make up a character set. Raises an error if one already exists with the same label. Column indices are 0-based.

new_sequence(taxon, values=None)

Creates a new CharacterDataSequence associated with Taxon taxon, and populates it with values in values.

Parameters:
  • taxon (Taxon) – Taxon instance with which this sequence is associated.

  • values (iterable or None) – An initial set of values with which to populate the new character sequence.

Returns:

s (CharacterDataSequence) – A new CharacterDataSequence associated with Taxon taxon.

pack(value=None, size=None, append=True)

Adds missing sequences for all Taxon instances in the current namespace, and then pads out all sequences in self by adding value to each sequence until its length equals size, or the length of the longest sequence if size is not specified. A combination of CharacterMatrix.fill_taxa and CharacterMatrix.fill.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete characters).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted at the front of each sequence.
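The two steps that pack combines can be sketched on plain Python data. This is an illustrative toy, not the library method: the namespace is a list of labels, sequences are lists in a dictionary, and only the default append mode is shown.

```python
def pack(sequences, namespace, value=None, size=None):
    """Add empty sequences for missing labels, then pad all to a common length."""
    for label in namespace:
        sequences.setdefault(label, [])            # the fill_taxa step
    if size is None:
        size = max((len(seq) for seq in sequences.values()), default=0)
    for seq in sequences.values():                 # the fill step (append mode)
        seq.extend([value] * (size - len(seq)))
    return sequences


packed = pack({"s1": list("ACG")}, ["s1", "s2"], value="-")
```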

poll_taxa(taxa=None)

Returns a set populated with all of the Taxon instances associated with self.

Parameters:

taxa (set) – Set to populate. If not specified, a new one will be created.

Returns:

taxa (set[Taxon]) – Set of taxa associated with self.

purge_taxon_namespace()

Remove all Taxon instances in self.taxon_namespace that are not associated with self or any item in self.

reconstruct_taxon_namespace(unify_taxa_by_label=True, taxon_mapping_memo=None)

See TaxonNamespaceAssociated.reconstruct_taxon_namespace.

reindex_subcomponent_taxa()

Synchronizes Taxon objects of map to taxon_namespace of self.

reindex_taxa(taxon_namespace=None, clear=False)

DEPRECATED: Use migrate_taxon_namespace() instead. Rebuilds taxon_namespace from scratch, or assigns Taxon objects from given TaxonNamespace object taxon_namespace based on label values.

remap_to_default_state_alphabet_by_symbol(purge_other_state_alphabets=True)

All entities with any reference to a state alphabet will have the reference reassigned to the default state alphabet, and all entities with any reference to a state alphabet element will have the reference reassigned to any state alphabet element in the default state alphabet that has the same symbol. Raises ValueError if no matching symbol can be found.

remap_to_state_alphabet_by_symbol(state_alphabet, purge_other_state_alphabets=True)

All entities with any reference to a state alphabet will have the reference reassigned to state_alphabet, and all entities with any reference to a state alphabet element will have the reference reassigned to any state alphabet element in state_alphabet that has the same symbol. Raises KeyError if no matching symbol can be found.

remove_sequences(taxa)

Removes sequences associated with Taxon instances specified in taxa. A KeyError is raised if a Taxon instance is specified for which there is no associated sequence.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.
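The contrast between remove_sequences (raises KeyError for a taxon without a sequence) and discard_sequences (silently skips it) maps onto two familiar dictionary idioms. The functions below are hypothetical stand-ins that operate on a plain dictionary.

```python
def remove_sequences(sequences, taxa):
    """Delete entries strictly: a missing key raises KeyError."""
    for taxon in taxa:
        del sequences[taxon]


def discard_sequences(sequences, taxa):
    """Delete entries leniently: missing keys are ignored."""
    for taxon in taxa:
        sequences.pop(taxon, None)


seqs = {"s1": "AC", "s2": "AG"}
discard_sequences(seqs, ["s2", "s3"])      # "s3" is absent and is skipped

try:
    remove_sequences(seqs, ["s3"])
    raised = False
except KeyError:
    raised = True
```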

replace_sequences(other_matrix)

Replaces sequences for Taxon objects shared between self and other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to replace sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.

  4. All other sequences will be ignored.

property sequence_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

sequences()

List of all sequences in self.

Returns:

s (list[CharacterDataSequence]) – List of CharacterDataSequence objects in self.

taxon_namespace_scoped_copy(memo=None)

Cloning level: 1. Taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon objects: these are preserved as references.

taxon_state_sets_map(char_indices=None, gaps_as_missing=True, gap_state=None, no_data_state=None)

Returns a dictionary that maps taxon objects to lists of sets of fundamental state indices.

Parameters:
  • char_indices (iterable of ints) – An iterable of indexes of characters to include (by column). If not given or None [default], then all characters are included.

  • gaps_as_missing (boolean) – If True [default] then gap characters will be treated as missing data values. If False, then they will be treated as an additional (fundamental) state.

Returns:

d (dict) – A dictionary with Taxon objects as keys and a list of sets of fundamental state indexes as values.

E.g., Given the following matrix of DNA characters:

T1 AGN
T2 C-T
T3 GC?

Return with gaps_as_missing==True

{
    <T1> : [ set([0]), set([2]),        set([0,1,2,3]) ],
    <T2> : [ set([1]), set([0,1,2,3]),  set([3]) ],
    <T3> : [ set([2]), set([1]),        set([0,1,2,3]) ],
}

Return with gaps_as_missing==False

{
    <T1> : [ set([0]), set([2]),        set([0,1,2,3]) ],
    <T2> : [ set([1]), set([4]),        set([3]) ],
    <T3> : [ set([2]), set([1]),        set([0,1,2,3,4]) ],
}

Note that when gaps are treated as a fundamental state, not only does ‘-’ map to a distinct and unique state (4), but ‘?’ (missing data) maps to a set consisting of all bases and the gap state, whereas ‘N’ maps to a set of all bases but not including the gap state.

When gaps are treated as missing, on the other hand, then ‘?’ and ‘N’ and ‘-’ all map to the same set, i.e. of all the bases.
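The mapping rules in the example above can be reproduced with a short sketch. It assumes the state indices shown in the example (A=0, C=1, G=2, T=3, gap=4) and handles only the symbols that appear there; BASES, GAP_INDEX, and state_sets are hypothetical names, and string labels stand in for Taxon objects.

```python
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
GAP_INDEX = 4


def state_sets(sequence, gaps_as_missing=True):
    """Return a list of sets of fundamental state indices, one per character."""
    all_bases = set(BASES.values())
    result = []
    for symbol in sequence:
        if symbol in BASES:
            result.append({BASES[symbol]})
        elif symbol == "N":
            result.append(set(all_bases))              # any base, never the gap
        elif symbol == "-":
            result.append(set(all_bases) if gaps_as_missing else {GAP_INDEX})
        elif symbol == "?":
            result.append(set(all_bases) if gaps_as_missing
                          else all_bases | {GAP_INDEX})
        else:
            raise ValueError("unsupported symbol: %r" % symbol)
    return result


matrix = {"T1": "AGN", "T2": "C-T", "T3": "GC?"}
as_missing = {label: state_sets(seq) for label, seq in matrix.items()}
as_state = {label: state_sets(seq, gaps_as_missing=False)
            for label, seq in matrix.items()}
```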

update_sequences(other_matrix)

Replaces sequences for Taxon objects shared between self and other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to update sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.

  4. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.
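The relationship between the add, replace, and update operations can be sketched with dictionaries, where string labels stand in for Taxon objects. These helper functions are hypothetical and only model which sequences are touched; they do not reproduce the library's type and namespace checks.

```python
def add_sequences(target, other):
    """Copy in sequences only for labels missing from `target`."""
    for label, seq in other.items():
        if label not in target:
            target[label] = list(seq)


def replace_sequences(target, other):
    """Overwrite sequences only for labels present in both."""
    for label, seq in other.items():
        if label in target:
            target[label] = list(seq)


def update_sequences(target, other):
    """Replace shared sequences, then add the ones that are new."""
    replace_sequences(target, other)
    add_sequences(target, other)


mine = {"s1": list("AC"), "s2": list("AG")}
theirs = {"s2": list("TT"), "s3": list("GG")}
update_sequences(mine, theirs)
```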

update_taxon_namespace()

All Taxon objects in self that are not in self.taxon_namespace will be added.

values()

Iterates values (i.e. sequences) in this matrix.

property vector_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

write(**kwargs)

Writes out self in schema format.

Mandatory Destination-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object opened for writing.

  • path (str) – Path to file to which to write.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Specification of data format (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples

# Using a file path:
d.write(path="path/to/file.dat", schema="nexus")

# Using an open file:
with open("path/to/file.dat", "w") as f:
    d.write(file=f, schema="nexus")

write_to_path(dest, schema, **kwargs)

Writes to file specified by dest.

write_to_stream(dest, schema, **kwargs)

Writes to file-like object dest.

RestrictionSitesCharacterMatrix: Restriction Sites Data

class dendropy.datamodel.charmatrixmodel.RestrictionSitesCharacterMatrix(*args, **kwargs)[source]

Specializes CharacterMatrix for restriction site data.

__delitem__(key)

Removes sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be a Taxon instance. In all cases, the Taxon object must already be defined in the current taxon namespace.

__getitem__(key)

Retrieves sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with the specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be a Taxon instance. In all cases, the Taxon object must already be defined in the current taxon namespace.

Returns:

s (CharacterDataSequence) – A sequence associated with the Taxon instance referenced by key.

__iter__()

Returns an iterator over character map’s ordered keys.

__len__()

Number of sequences in matrix.

Returns:

n (integer) – Number of sequences in matrix.

__setitem__(key, values)

Assigns sequence values to the taxon specified by key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with the specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be a Taxon instance. In all cases, the Taxon object must already be defined in the current taxon namespace.

add_character_subset(char_subset)

Adds a CharacterSubset object. Raises an error if one already exists with the same label.

add_sequences(other_matrix)

Adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to add sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self as a shallow-copy.

  4. All other sequences will be ignored.

as_string(schema, **kwargs)

Composes and returns string representation of the data.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Specification of data format (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

character_sequence_type

alias of RestrictionSitesCharacterDataSequence

clear()

Removes all sequences from matrix.

clone(depth=1)

Creates and returns a copy of self.

Parameters:

depth (integer) –

The depth of the copy:

  • 0: shallow-copy: All member objects are references, except for annotation_set of top-level object and member Annotation objects: these are full, independent instances (though any complex objects in the value field of Annotation objects are also just references).

  • 1: taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon instances: these are references.

  • 2: Exhaustive deep-copy: all objects are cloned.
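
As a rough analogy for the three depths, the standard copy module can reproduce the same sharing patterns on a toy structure. This is a sketch under stated assumptions, not how clone() is implemented: the Taxon class and the dictionary standing in for a matrix are hypothetical, and annotation handling at depth 0 is not modelled.

```python
import copy

class Taxon:
    """Hypothetical stand-in for a taxon object."""
    def __init__(self, label):
        self.label = label

taxa = [Taxon("s1"), Taxon("s2")]
matrix = {taxa[0]: ["A", "C"], taxa[1]: ["G", "T"]}

# Depth 0 analogue: a new container whose members are shared references.
shallow = copy.copy(matrix)

# Depth 1 analogue: pre-seeding the deepcopy memo with the taxa makes
# deepcopy reuse them, while the sequences become independent copies.
memo = {id(taxon): taxon for taxon in taxa}
scoped = copy.deepcopy(matrix, memo)

# Depth 2 analogue: everything, including the taxa, is duplicated.
deep = copy.deepcopy(matrix)
```

After this runs, shallow shares its sequence lists with matrix, scoped has its own lists keyed by the original taxa, and deep is keyed by new Taxon objects with the same labels.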

coerce_values(values)

Converts elements of values to type of matrix.

This method is called by CharacterMatrix.from_dict to create sequences from iterables of values. This method should be overridden by derived classes to ensure that values consists of types compatible with the particular type of matrix. For example, a CharacterMatrix type with a fixed state alphabet (such as DnaCharacterMatrix) would dereference the string elements of values to return a list of StateIdentity objects corresponding to the symbols represented by the strings. If there is no value-type conversion done, then values should be returned as-is. If no value-type conversion is possible (e.g., when the type of a value is dependent on positional information), then a TypeError should be raised.

Parameters:

values (iterable) – Iterable of values to be converted.

Returns:

v (list) – List of values.
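
The symbol-to-state dereferencing described for fixed-alphabet matrices might look like the following toy override. It is a sketch only: the TOY_DNA_STATES table and its integer codes are assumptions made for illustration and do not correspond to DendroPy's StateIdentity objects.

```python
# Hypothetical alphabet: symbols mapped to toy integer state codes.
TOY_DNA_STATES = {"A": 0, "C": 1, "G": 2, "T": 3, "-": 4}

def coerce_values(values):
    """Convert an iterable of symbols into a list of toy state codes."""
    return [TOY_DNA_STATES[str(value).upper()] for value in values]

coerced = coerce_values("TG-AA")
```

A matrix type that performs no conversion would simply return values unchanged.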

classmethod concatenate(char_matrices)

Creates and returns a single character matrix from multiple CharacterMatrix objects specified as a list, ‘char_matrices’. All the CharacterMatrix objects in the list must be of the same type, and share the same TaxonNamespace reference. All taxa must be present in all alignments, and all alignments must be of the same length. Component parts will be recorded as character subsets.
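
The bookkeeping involved can be sketched with plain dictionaries. This is an illustration, not the library's code: the concatenate function below, the "part0", "part1" subset labels, and the reading of the length requirement as "all sequences within one matrix are equally long" are assumptions.

```python
def concatenate(matrices):
    """Join matrices column-wise; return (combined, character_subsets)."""
    taxa = set(matrices[0])
    combined = {taxon: [] for taxon in matrices[0]}
    subsets = {}
    offset = 0
    for index, matrix in enumerate(matrices):
        if set(matrix) != taxa:
            raise ValueError("all taxa must be present in all matrices")
        lengths = {len(seq) for seq in matrix.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in a matrix must have equal length")
        width = lengths.pop()
        for taxon, seq in matrix.items():
            combined[taxon].extend(seq)
        # Record which 0-based columns came from this component.
        subsets["part{}".format(index)] = list(range(offset, offset + width))
        offset += width
    return combined, subsets

gene1 = {"s1": list("ACG"), "s2": list("ACT")}
gene2 = {"s1": list("TT"), "s2": list("TA")}
combined, subsets = concatenate([gene1, gene2])
```

Each component keeps its identity in the result through its column indices, which is the role character subsets play in the combined matrix.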

classmethod concatenate_from_paths(paths, schema, **kwargs)

Read a character matrix from each file path given in paths, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

classmethod concatenate_from_streams(streams, schema, **kwargs)

Read a character matrix from each file object given in streams, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

copy_annotations_from(other, attribute_object_mapper=None)

Copies annotations from other, which must be of Annotable type.

Copies are deep-copies, in that the Annotation objects added to the annotation_set AnnotationSet collection of self are independent copies of those in the annotation_set collection of other. However, dynamic bound-attribute annotations retain references to the original objects as given in other, which may or may not be desirable. This is handled by updating the objects to which attributes are bound via mappings found in attribute_object_mapper. In dynamic bound-attribute annotations, the _value attribute of the annotation object (Annotation._value) is a tuple consisting of “(obj, attr_name)”, which instructs the Annotation object to return “getattr(obj, attr_name)” (via: “getattr(*self._value)”) when returning the value of the Annotation. “obj” is typically the object to which the AnnotationSet belongs (i.e., self). When a copy of Annotation is created, the object reference given in the first element of the _value tuple of dynamic bound-attribute annotations is unchanged, unless the id of the object reference is fo

Parameters:
  • other (Annotable) – Source of annotations to copy.

  • attribute_object_mapper (dict) – Like the memo of __deepcopy__, maps object id’s to objects. The purpose of this is to update the parent or owner objects of dynamic attribute annotations. If a dynamic attribute Annotation gives object x as the parent or owner of the attribute (that is, the first element of the Annotation._value tuple is other) and id(x) is found in attribute_object_mapper, then in the copy the owner of the attribute is changed to attribute_object_mapper[id(x)]. If attribute_object_mapper is None (default), then the following mapping is automatically inserted: id(other): self. That is, any references to other in any Annotation object will be remapped to self. If really no reattribution mappings are desired, then an empty dictionary should be passed instead.
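
The remapping of dynamic bound-attribute annotations can be illustrated with a small toy model. Only the “(obj, attr_name)” tuple and the “getattr(*self._value)” lookup come from the description above; the Annotation and Thing classes, the is_attribute flag, and the copy_annotations function are hypothetical simplifications, not the library's classes.

```python
class Annotation:
    """Toy annotation; a bound-attribute value is an (obj, attr_name) tuple."""
    def __init__(self, name, value, is_attribute=False):
        self.name = name
        self._value = value
        self.is_attribute = is_attribute

    @property
    def value(self):
        if self.is_attribute:
            return getattr(*self._value)   # look up the attribute on demand
        return self._value

class Thing:
    """Toy annotable object."""
    def __init__(self, label):
        self.label = label
        self.annotations = []

def copy_annotations(source, target, attribute_object_mapper=None):
    if attribute_object_mapper is None:
        # Default: references to the source are re-pointed at the target.
        attribute_object_mapper = {id(source): target}
    for ann in source.annotations:
        value = ann._value
        if ann.is_attribute:
            obj, attr_name = value
            obj = attribute_object_mapper.get(id(obj), obj)
            value = (obj, attr_name)
        target.annotations.append(Annotation(ann.name, value, ann.is_attribute))

src = Thing("source")
src.annotations.append(Annotation("label", (src, "label"), is_attribute=True))
dst = Thing("copy")
copy_annotations(src, dst)          # default mapping: src -> dst
kept = Thing("kept")
copy_annotations(src, kept, {})     # empty mapping: no reattribution
```

With the default mapping the copied annotation reports the label of dst; with an empty dictionary it keeps reporting the label of src.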

deep_copy_annotations_from(other, memo=None)

Note that all references to other in any annotation value (and sub-annotation, and sub-sub-sub-annotation, etc.) will be replaced with references to self. This may not always make sense (i.e., a reference to a particular entity may be absolute regardless of context).

description(depth=1, indent=0, itemize='', output=None)

Returns description of object, up to level depth.

discard_sequences(taxa)

Removes sequences associated with Taxon instances specified in taxa if they exist.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

export_character_indices(indices)

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the 0-based indices in indices. Note that this new matrix will still reference the same taxon set.

export_character_subset(character_subset)

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the CharacterSubset, character_subset. Note that this new matrix will still reference the same taxon set.

extend_matrix(other_matrix)

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.

extend_sequences(other_matrix, is_add_new_sequences=False)

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. All other sequences will be ignored.
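
The difference between extend_sequences and extend_matrix, as their summaries describe them, can be sketched with dictionaries mapping taxon labels to lists. This is a plain-Python illustration under that assumption, not the library's implementation, and it leaves the is_add_new_sequences option out.

```python
def extend_sequences(target, other):
    """Append characters for shared taxa; ignore taxa missing from target."""
    for taxon, seq in other.items():
        if taxon in target:
            target[taxon].extend(seq)

def extend_matrix(target, other):
    """Append characters for shared taxa and add sequences for new taxa."""
    for taxon, seq in other.items():
        if taxon in target:
            target[taxon].extend(seq)
        else:
            target[taxon] = list(seq)

other = {"s1": list("TT"), "s3": list("AA")}

a = {"s1": list("AC"), "s2": list("GT")}
extend_sequences(a, other)

b = {"s1": list("AC"), "s2": list("GT")}
extend_matrix(b, other)
```

After these calls, a has a longer "s1" but no "s3", while b has both.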

fill(value, size=None, append=True)

Pads out all sequences in self by adding value to each sequence until its length is size, or equal to the length of the longest sequence if size is not specified.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete character).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted to the front of each sequence.
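
The padding behaviour described by these parameters can be sketched as follows. The function is a plain-Python illustration operating on a dictionary of lists, not the method itself; leaving sequences that are already longer than size untouched is an assumption.

```python
def fill(sequences, value, size=None, append=True):
    """Pad every sequence in place with value up to size."""
    if size is None:
        # Default target: the length of the longest sequence.
        size = max((len(seq) for seq in sequences.values()), default=0)
    for seq in sequences.values():
        padding = [value] * (size - len(seq))   # empty if already long enough
        if append:
            seq.extend(padding)                 # pad at the end
        else:
            seq[:0] = padding                   # pad at the front

end_padded = {"s1": list("ACGT"), "s2": list("AC")}
fill(end_padded, "-")

front_padded = {"s1": list("A")}
fill(front_padded, "-", size=3, append=False)
```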

fill_taxa()

Adds a new (empty) sequence for each Taxon instance in current taxon namespace that does not have a sequence.

folded_site_frequency_spectrum(is_pad_vector_to_unfolded_length=False)

Returns the folded or minor site/allele frequency spectrum.

Given $N$ chromosomes, the site frequency spectrum is a vector $(f_0, f_1, f_2, …, f_N)$, where the value $f_i$ is the number of sites where $i$ derived alleles are segregating in the sample: 0 alleles, 1 allele, 2 alleles, etc.

The folded site frequency spectrum is a vector $(f_0, f_1, f_2, \ldots, f_m)$, $m = \lceil \frac{N}{2} \rceil$, where the value $f_i$ is the number of sites at which $i$ copies of the minor allele are present.

Parameters:

is_pad_vector_to_unfolded_length (bool) – If False (default), then the vector length will be $\lceil \frac{N}{2} \rceil$, where $N$ is the number of taxa. If True, the length of the vector will be number of taxa + 1, with the first element the number of monomorphic sites not contributing to the site frequency spectrum.

Returns:

v (list[int]) – A vector of integers representing the folded site frequency spectrum.
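
As a worked illustration of the definition, the folded spectrum can be computed by hand for a small 0/1 matrix. This sketch assumes that 1 marks one allele and 0 the other, and returns the vector $(f_0, \ldots, f_m)$ with $m = \lceil N/2 \rceil$; the exact vector length and padding conventions of the library method may differ.

```python
import math

def folded_sfs(sequences):
    """Count sites by minor-allele count for equally long 0/1 sequences."""
    n = len(sequences)
    m = math.ceil(n / 2)
    spectrum = [0] * (m + 1)
    for column in zip(*sequences):          # iterate over sites
        ones = sum(1 for state in column if state == 1)
        minor = min(ones, n - ones)         # fold: count the rarer allele
        spectrum[minor] += 1
    return spectrum

samples = [
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
]
spectrum = folded_sfs(samples)
```

Here the first site is monomorphic, the third has a single copy of the minor allele, and the second and fourth have two copies each.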

classmethod from_dict(source_dict, char_matrix=None, case_sensitive_taxon_labels=False, **kwargs)

Populates character matrix from dictionary (or similar mapping type), creating Taxon objects and sequences as needed.

Keys must be strings representing labels of Taxon objects or Taxon objects directly. If a key is specified as a string, then it will be dereferenced to the first existing Taxon object in the current taxon namespace with the same label. If no such Taxon object can be found, then a new Taxon object is created and added to the current namespace. If a key is specified as a Taxon object, then this is used directly. If it is not in the current taxon namespace, it will be added.

Values are the sequences (more generally, iterable of values). If values are of type CharacterDataSequence, then they are added as-is. Otherwise CharacterDataSequence instances are created for them. Values may be coerced into types compatible with particular matrices. The classmethod coerce_values() will be called for this.

Examples

The following creates a DnaCharacterMatrix instance with three sequences:

d = {
        "s1" : "TCCAA",
        "s2" : "TGCAA",
        "s3" : "TG-AA",
}
dna = DnaCharacterMatrix.from_dict(d)

Three Taxon objects will be created, corresponding to the labels ‘s1’, ‘s2’, ‘s3’. Each associated string sequence will be converted to a CharacterDataSequence, with each symbol (“A”, “C”, etc.) being replaced by the DNA state represented by the symbol.

Parameters:
  • source_dict (dict or other mapping type) – Keys must be strings representing labels of Taxon objects or Taxon objects directly. Values are sequences. See above for details.

  • char_matrix (CharacterMatrix) – Instance of CharacterMatrix to populate with data. If not specified, a new one will be created using keyword arguments specified by kwargs.

  • case_sensitive_taxon_labels (boolean) – If True, string labels specified as keys in source_dict will be matched to Taxon objects in the current taxon namespace with case being respected. If False, then case will be ignored.

  • **kwargs (keyword arguments, optional) – Keyword arguments to be passed to constructor of CharacterMatrix when creating new instance to populate, if no target instance is provided via char_matrix.

Returns:

char_matrix (CharacterMatrix) – CharacterMatrix populated by data from source_dict.

classmethod get(**kwargs)

Instantiate and return a new character matrix object from a data source.

Mandatory Source-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object of data opened for reading.

  • path (str) – Path to file of data.

  • url (str) – URL of data.

  • data (str) – Data given directly.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Specification of data format (e.g., “nexus”).

Optional General Keyword Arguments:

  • label (str) – Name or identifier to be assigned to the new object; if not given, will be assigned the one specified in the data source, or None otherwise.

  • taxon_namespace (TaxonNamespace) – The TaxonNamespace instance to use to manage the taxon names. If not specified, a new one will be created.

  • matrix_offset (int) – 0-based index of character block or matrix in source to be parsed. If not specified then the first matrix (offset = 0) is assumed.

  • ignore_unrecognized_keyword_arguments (bool) – If True, then unsupported or unrecognized keyword arguments will not result in an error. Default is False: unsupported keyword arguments will result in an error.

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is interpreted and processed, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples:

dna1 = dendropy.DnaCharacterMatrix.get(
        file=open("pythonidae.fasta"),
        schema="fasta")
dna2 = dendropy.DnaCharacterMatrix.get(
        url="http://purl.org/phylo/treebase/phylows/matrix/TB2:M2610?format=nexus",
        schema="nexus")
aa1 = dendropy.ProteinCharacterMatrix.get(
        file=open("pythonidae.dat"),
        schema="phylip")
std1 = dendropy.StandardCharacterMatrix.get(
        path="python_morph.nex",
        schema="nexus")
std2 = dendropy.StandardCharacterMatrix.get(
        data=">t1\n01011\n\n>t2\n11100",
        schema="fasta")

classmethod get_from_path(src, schema, **kwargs)

Factory method to return new object of this class from file specified by string src.

Parameters:
  • src (string) – Full file path to source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_stream(src, schema, **kwargs)

Factory method to return new object of this class from file-like object src.

Parameters:
  • src (file or file-like) – Source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_string(src, schema, **kwargs)

Factory method to return new object of this class from string src.

Parameters:
  • src (string) – Data as a string.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_url(src, schema, strip_markup=False, **kwargs)

Factory method to return a new object of this class from URL given by src.

Parameters:
  • src (string) – URL of location providing source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

items()

Returns character map key, value pairs in key-order.

keep_sequences(taxa)

Discards all sequences not associated with any of the Taxon instances specified in taxa.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

property max_sequence_size

Maximum number of characters across all sequences in matrix.

Returns:

n (integer) – Maximum number of characters across all sequences in matrix.

migrate_taxon_namespace(taxon_namespace, unify_taxa_by_label=True, taxon_mapping_memo=None)

Move this object and all members to a new operational taxonomic unit concept namespace scope.

Current self.taxon_namespace value will be replaced with value given in taxon_namespace if this is not None, or a new TaxonNamespace object. Following this, reconstruct_taxon_namespace() will be called: each distinct Taxon object associated with self or members of self that is not already in taxon_namespace will be replaced with a new Taxon object that will be created with the same label and added to self.taxon_namespace. Calling this method results in the object (and all its member objects) being associated with a new, independent taxon namespace.

Label mapping case sensitivity follows the self.taxon_namespace.is_case_sensitive setting. If False and unify_taxa_by_label is also True, then the establishment of correspondence between Taxon objects in the old and new namespaces will be based on case-insensitive matching of labels. E.g., if there are four Taxon objects with labels ‘Foo’, ‘Foo’, ‘FOO’, and ‘FoO’ in the old namespace, then all objects that reference these will reference a single new Taxon object in the new namespace (with a label that is some existing casing variant of ‘foo’). If True, then even when unify_taxa_by_label is True, Taxon objects with labels identical except in case will be considered distinct.

Parameters:
  • taxon_namespace (TaxonNamespace) – The TaxonNamespace into the scope of which this object will be moved.

  • unify_taxa_by_label (boolean, optional) – If True, then references to distinct Taxon objects with identical labels in the current namespace will be replaced with a reference to a single Taxon object in the new namespace. If False: references to distinct Taxon objects will remain distinct, even if the labels are the same.

  • taxon_mapping_memo (dictionary) – Similar to memo of deepcopy, this is a dictionary that maps Taxon objects in the old namespace to corresponding Taxon objects in the new namespace. Mostly for internal use when migrating complex data to a new namespace. Note that any mappings here take precedence over all other options: if a Taxon object in the old namespace is found in this dictionary, the counterpart in the new namespace will be whatever value is mapped, regardless of, e.g. label values.

Examples

Use this method to move an object from one taxon namespace to another.

For example, to get a copy of an object associated with another taxon namespace and associate it with a different namespace:

# Get handle to the new TaxonNamespace
other_taxon_namespace = some_other_data.taxon_namespace

# Get a taxon-namespace scoped copy of a tree
# in another namespace
t2 = Tree(t1)

# Replace taxon namespace of copy
t2.migrate_taxon_namespace(other_taxon_namespace)

You can also use this method to get a copy of a structure and then move it to a new namespace:

t2 = Tree(t1)
t2.migrate_taxon_namespace(TaxonNamespace())

# Note: the same effect can be achieved by:
t3 = copy.deepcopy(t1)
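
The label-unification rules described for this method can be illustrated with a toy model. This is a sketch under stated assumptions rather than the library's algorithm: the Taxon class, the migrate function, and the case_sensitive argument (standing in for the namespace's is_case_sensitive setting) are hypothetical, and a list plays the role of the new namespace.

```python
class Taxon:
    """Hypothetical stand-in: an object with a label."""
    def __init__(self, label):
        self.label = label

def migrate(old_taxa, new_namespace, unify_taxa_by_label=True,
            case_sensitive=False):
    """Map each old Taxon to a Taxon in new_namespace, creating as needed."""
    def norm(label):
        return label if case_sensitive else label.lower()

    by_label = {}
    for taxon in new_namespace:
        by_label.setdefault(norm(taxon.label), taxon)
    mapping = {}
    for old in old_taxa:
        key = norm(old.label)
        if unify_taxa_by_label and key in by_label:
            mapping[old] = by_label[key]      # reuse the matching taxon
        else:
            new = Taxon(old.label)            # same label, new object
            new_namespace.append(new)
            by_label.setdefault(key, new)
            mapping[old] = new
    return mapping

old = [Taxon("Foo"), Taxon("Foo"), Taxon("FOO"), Taxon("FoO")]
unified = []
migrate(old, unified)
cased = []
migrate(old, cased, case_sensitive=True)
distinct = []
migrate(old, distinct, unify_taxa_by_label=False)
```

With the four labels from the description above, case-insensitive unification yields one new taxon, case-sensitive unification yields three, and disabling unification yields four.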

new_character_subset(label, character_indices)

Defines a set of characters (columns) that make up a character set. Raises an error if one already exists with the same label. Column indices are 0-based.

new_sequence(taxon, values=None)

Creates a new CharacterDataSequence associated with Taxon taxon, and populates it with values in values.

Parameters:
  • taxon (Taxon) – Taxon instance with which this sequence is associated.

  • values (iterable or None) – An initial set of values with which to populate the new character sequence.

Returns:

s (CharacterDataSequence) – A new CharacterDataSequence associated with Taxon taxon.

pack(value=None, size=None, append=True)

Adds missing sequences for all Taxon instances in current namespace, and then pads out all sequences in self by adding value to each sequence until its length is size, or equal to the length of the longest sequence if size is not specified. A combination of CharacterMatrix.fill_taxa and CharacterMatrix.fill.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete character).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted to the front of each sequence.

poll_taxa(taxa=None)

Returns a set populated with all of the Taxon instances associated with self.

Parameters:

taxa (set()) – Set to populate. If not specified, a new one will be created.

Returns:

taxa (set[Taxon]) – Set of taxa associated with self.

purge_taxon_namespace()

Remove all Taxon instances in self.taxon_namespace that are not associated with self or any item in self.

reconstruct_taxon_namespace(unify_taxa_by_label=True, taxon_mapping_memo=None)

See TaxonNamespaceAssociated.reconstruct_taxon_namespace.

reindex_subcomponent_taxa()

Synchronizes Taxon objects of map to taxon_namespace of self.

reindex_taxa(taxon_namespace=None, clear=False)

DEPRECATED: Use migrate_taxon_namespace() instead. Rebuilds taxon_namespace from scratch, or assigns Taxon objects from given TaxonNamespace object taxon_namespace based on label values.

remap_to_default_state_alphabet_by_symbol(purge_other_state_alphabets=True)

All entities with any reference to a state alphabet will have the reference reassigned to the default state alphabet, and all entities with any reference to a state alphabet element will have the reference reassigned to any state alphabet element in the default state alphabet that has the same symbol. Raises ValueError if no matching symbol can be found.

remap_to_state_alphabet_by_symbol(state_alphabet, purge_other_state_alphabets=True)

All entities with any reference to a state alphabet will have the reference reassigned to state alphabet state_alphabet, and all entities with any reference to a state alphabet element will have the reference reassigned to any state alphabet element in state_alphabet that has the same symbol. Raises KeyError if no matching symbol can be found.

remove_sequences(taxa)

Removes sequences associated with Taxon instances specified in taxa. A KeyError is raised if a Taxon instance is specified for which there is no associated sequence.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.
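
The contrast between remove_sequences, discard_sequences, and keep_sequences parallels familiar dictionary operations. The following is a plain-Python analogy on a dictionary keyed by taxon label, not the library's implementation.

```python
def remove_sequences(sequences, taxa):
    """Strict removal: a missing taxon raises KeyError."""
    for taxon in taxa:
        del sequences[taxon]

def discard_sequences(sequences, taxa):
    """Lenient removal: a missing taxon is silently skipped."""
    for taxon in taxa:
        sequences.pop(taxon, None)

def keep_sequences(sequences, taxa):
    """Drop every sequence whose taxon is not listed in taxa."""
    wanted = set(taxa)
    for taxon in list(sequences):
        if taxon not in wanted:
            del sequences[taxon]

m = {"s1": list("AC"), "s2": list("GT"), "s3": list("TT")}
discard_sequences(m, ["s3", "absent"])   # "absent" is ignored
keep_sequences(m, ["s1"])                # only "s1" survives
```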

replace_sequences(other_matrix)

Replaces sequences for Taxon objects shared between self and other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to replace sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.

  4. All other sequences will be ignored.

property sequence_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

sequences()

List of all sequences in self.

Returns:

s (list[CharacterDataSequence]) – List of all CharacterDataSequence objects in self.

taxon_namespace_scoped_copy(memo=None)

Cloning level: 1. Taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon objects: these are preserved as references.

taxon_state_sets_map(char_indices=None, gaps_as_missing=True, gap_state=None, no_data_state=None)

Returns a dictionary that maps taxon objects to lists of sets of fundamental state indices.

Parameters:
  • char_indices (iterable of ints) – An iterable of indexes of characters to include (by column). If not given or None [default], then all characters are included.

  • gaps_as_missing (boolean) – If True [default] then gap characters will be treated as missing data values. If False, then they will be treated as an additional (fundamental) state.

Returns:

d (dict) – A dictionary with Taxon objects as keys and a list of sets of fundamental state indexes as values.

E.g., Given the following matrix of DNA characters:

T1 AGN
T2 C-T
T3 GC?

Return with gaps_as_missing==True

{
    <T1> : [ set([0]), set([2]),        set([0,1,2,3]) ],
    <T2> : [ set([1]), set([0,1,2,3]),  set([3]) ],
    <T3> : [ set([2]), set([1]),        set([0,1,2,3]) ],
}

Return with gaps_as_missing==False

{
    <T1> : [ set([0]), set([2]),        set([0,1,2,3]) ],
    <T2> : [ set([1]), set([4]),        set([3]) ],
    <T3> : [ set([2]), set([1]),        set([0,1,2,3,4]) ],
}

Note that when gaps are treated as a fundamental state, not only does ‘-’ map to a distinct and unique state (4), but ‘?’ (missing data) maps to set consisting of all bases and the gap state, whereas ‘N’ maps to a set of all bases but not including the gap state.

When gaps are treated as missing, on the other hand, then ‘?’ and ‘N’ and ‘-’ all map to the same set, i.e. of all the bases.
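
The mapping in the example above can be reproduced with a short plain-Python sketch. The index assignment (A=0, C=1, G=2, T=3, gap=4) follows the example; the function names and the use of string labels in place of Taxon objects are assumptions made for illustration.

```python
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
ALL_BASES = frozenset(BASES.values())
GAP_INDEX = 4

def state_set(symbol, gaps_as_missing=True):
    """Return the set of fundamental state indices for one symbol."""
    if symbol in BASES:
        return {BASES[symbol]}
    if symbol == "N":                        # any base, never the gap
        return set(ALL_BASES)
    if symbol == "-":
        return set(ALL_BASES) if gaps_as_missing else {GAP_INDEX}
    if symbol == "?":                        # missing: any state at all
        if gaps_as_missing:
            return set(ALL_BASES)
        return set(ALL_BASES) | {GAP_INDEX}
    raise ValueError("unrecognized symbol: {!r}".format(symbol))

def state_sets_map(matrix, gaps_as_missing=True):
    return {
        taxon: [state_set(symbol, gaps_as_missing) for symbol in seq]
        for taxon, seq in matrix.items()
    }

matrix = {"T1": "AGN", "T2": "C-T", "T3": "GC?"}
as_missing = state_sets_map(matrix, gaps_as_missing=True)
as_state = state_sets_map(matrix, gaps_as_missing=False)
```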

update_sequences(other_matrix)

Replaces sequences for Taxon objects shared between self and other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to update sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.

  4. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.
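
The relationship between add_sequences, replace_sequences, and update_sequences is the same as between three ways of merging dictionaries. The sketch below uses dictionaries keyed by taxon label as stand-ins for matrices; it illustrates the documented semantics and is not the library's code.

```python
def add_sequences(target, other):
    """Copy in sequences only for taxa missing from target."""
    for taxon, seq in other.items():
        if taxon not in target:
            target[taxon] = list(seq)      # shallow copy of the sequence

def replace_sequences(target, other):
    """Overwrite sequences only for taxa present in both."""
    for taxon, seq in other.items():
        if taxon in target:
            target[taxon] = list(seq)

def update_sequences(target, other):
    """Overwrite shared taxa and add missing ones."""
    for taxon, seq in other.items():
        target[taxon] = list(seq)

other = {"s1": list("TT"), "s3": list("AA")}

added = {"s1": list("AC"), "s2": list("GT")}
add_sequences(added, other)

replaced = {"s1": list("AC"), "s2": list("GT")}
replace_sequences(replaced, other)

updated = {"s1": list("AC"), "s2": list("GT")}
update_sequences(updated, other)
```

update_sequences is therefore equivalent to applying replace_sequences and add_sequences together.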

update_taxon_namespace()

All Taxon objects in self that are not in self.taxon_namespace will be added.

values()

Iterates values (i.e. sequences) in this matrix.

property vector_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

write(**kwargs)

Writes out self in schema format.

Mandatory Destination-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object opened for writing.

  • path (str) – Path to file to which to write.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Specification of data format (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples

# Using a file path:
d.write(path="path/to/file.dat", schema="nexus")

# Using an open file:
with open("path/to/file.dat", "w") as f:
    d.write(file=f, schema="nexus")

write_to_path(dest, schema, **kwargs)

Writes to file specified by dest.

write_to_stream(dest, schema, **kwargs)

Writes to file-like object dest.

InfiniteSitesCharacterMatrix : Infinite Sites Data

class dendropy.datamodel.charmatrixmodel.InfiniteSitesCharacterMatrix(*args, **kwargs)[source]

Specializes CharacterMatrix for infinite sites data.

__delitem__(key)

Removes sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

__getitem__(key)

Retrieves sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

Returns:

s (CharacterDataSequence) – A sequence associated with the Taxon instance referenced by key.

__iter__()

Returns an iterator over character map’s ordered keys.

__len__()

Number of sequences in matrix.

Returns:

n (integer) – Number of sequences in matrix.

__setitem__(key, values)

Assigns sequence values to taxon specified by key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

add_character_subset(char_subset)

Adds a CharacterSubset object. Raises an error if one already exists with the same label.

add_sequences(other_matrix)

Adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to add sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self as a shallow-copy.

  4. All other sequences will be ignored.

as_string(schema, **kwargs)

Composes and returns string representation of the data.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Specification of data format (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

character_sequence_type

alias of InfiniteSitesCharacterDataSequence

clear()

Removes all sequences from matrix.

clone(depth=1)

Creates and returns a copy of self.

Parameters:

depth (integer) –

The depth of the copy:

  • 0: shallow-copy: All member objects are references, except for annotation_set of top-level object and member Annotation objects: these are full, independent instances (though any complex objects in the value field of Annotation objects are also just references).

  • 1: taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon instances: these are references.

  • 2: Exhaustive deep-copy: all objects are cloned.
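The three copy depths can be pictured with the standard copy module. The following is a minimal sketch, not DendroPy's implementation: Taxon and ToyMatrix are hypothetical stand-in classes, and pre-seeding the deepcopy memo is just one way to keep taxa as shared references.

```python
import copy

class Taxon:
    """Hypothetical stand-in for a taxon object."""
    def __init__(self, label):
        self.label = label

class ToyMatrix:
    """Hypothetical stand-in for a character matrix; not a DendroPy class."""
    def __init__(self, taxa, sequences):
        self.taxa = taxa            # plays the role of the taxon namespace
        self.sequences = sequences  # maps Taxon -> list of character values

taxa = [Taxon("s1"), Taxon("s2")]
original = ToyMatrix(taxa, {taxa[0]: list("TCCAA"), taxa[1]: list("TGCAA")})

# Depth 0 analogue: a new top-level object whose members are shared references.
shallow = copy.copy(original)

# Depth 1 analogue: pre-seed the deepcopy memo so the taxon list and each
# Taxon are reused by reference while everything else is copied.
memo = {id(taxa): taxa}
memo.update((id(t), t) for t in taxa)
scoped = copy.deepcopy(original, memo)

# Depth 2 analogue: everything is copied, taxa included.
deep = copy.deepcopy(original)
```

In this sketch, scoped.sequences is a new dictionary with new value lists, but its keys are the very same Taxon objects as in original.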

coerce_values(values)

Converts elements of values to type of matrix.

This method is called by CharacterMatrix.from_dict to create sequences from iterables of values. This method should be overridden by derived classes to ensure that values consists of types compatible with the particular type of matrix. For example, a CharacterMatrix type with a fixed state alphabet (such as DnaCharacterMatrix) would dereference the string elements of values to return a list of StateIdentity objects corresponding to the symbols represented by the strings. If there is no value-type conversion done, then values should be returned as-is. If no value-type conversion is possible (e.g., when the type of a value is dependent on positional information), then a TypeError should be raised.

Parameters:

values (iterable) – Iterable of values to be converted.

Returns:

v (list of values.)
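The three behaviours described above (convert, pass through, refuse) can be sketched as plain functions. This is an illustration only: TOY_DNA_STATES and the function names are hypothetical, and plain strings stand in for StateIdentity objects.

```python
# Toy state alphabet: symbol -> state object. Plain strings stand in for
# StateIdentity instances; this mapping is an assumption for illustration.
TOY_DNA_STATES = {
    "A": "state:A", "C": "state:C", "G": "state:G", "T": "state:T",
    "-": "state:gap",
}

def coerce_fixed_alphabet(values, states=TOY_DNA_STATES):
    """Dereference each symbol to its state; unknown symbols raise KeyError."""
    return [states[str(v).upper()] for v in values]

def coerce_passthrough(values):
    """No conversion needed (e.g. continuous data): return values as-is."""
    return values

def coerce_unsupported(values):
    """Conversion is impossible without positional information: refuse."""
    raise TypeError("cannot coerce values without positional information")

print(coerce_fixed_alphabet("TG-aa"))
```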

classmethod concatenate(char_matrices)

Creates and returns a single character matrix from multiple CharacterMatrix objects specified as a list, ‘char_matrices’. All the CharacterMatrix objects in the list must be of the same type, and share the same TaxonNamespace reference. All taxa must be present in all alignments, and all alignments must be of the same length. Component parts will be recorded as character subsets.
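The bookkeeping behind concatenation can be sketched with dictionaries of strings. This is a rough model under stated assumptions: concatenate_alignments and the "locusNNN" subset labels are hypothetical, and the sketch only checks that each component is rectangular and contains the same taxa.

```python
def concatenate_alignments(alignments):
    """alignments: list of dicts mapping taxon label -> sequence string.

    Returns (combined, subsets), where subsets maps a generated label to the
    0-based column indices contributed by each component alignment.
    """
    labels = set(alignments[0])
    combined = {label: "" for label in alignments[0]}
    subsets = {}
    offset = 0
    for i, aln in enumerate(alignments):
        if set(aln) != labels:
            raise ValueError("all taxa must be present in all alignments")
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within an alignment must have equal length")
        width = lengths.pop()
        for label in combined:
            combined[label] += aln[label]
        # Record which columns of the combined matrix came from this component.
        subsets["locus{:03d}".format(i)] = list(range(offset, offset + width))
        offset += width
    return combined, subsets

combined, subsets = concatenate_alignments([
    {"s1": "AC", "s2": "AG"},
    {"s1": "TTT", "s2": "TGA"},
])
```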

classmethod concatenate_from_paths(paths, schema, **kwargs)

Read a character matrix from each file path given in paths, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

classmethod concatenate_from_streams(streams, schema, **kwargs)

Read a character matrix from each file object given in streams, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

copy_annotations_from(other, attribute_object_mapper=None)

Copies annotations from other, which must be of Annotable type.

Copies are deep-copies, in that the Annotation objects added to the annotation_set AnnotationSet collection of self are independent copies of those in the annotation_set collection of other. However, dynamic bound-attribute annotations retain references to the original objects as given in other, which may or may not be desirable. This is handled by updating the objects to which attributes are bound via mappings found in attribute_object_mapper. In dynamic bound-attribute annotations, the _value attribute of the annotations object (Annotation._value) is a tuple consisting of “(obj, attr_name)”, which instructs the Annotation object to return “getattr(obj, attr_name)” (via: “getattr(*self._value)”) when returning the value of the Annotation. “obj” is typically the object to which the AnnotationSet belongs (i.e., self). When a copy of Annotation is created, the object reference given in the first element of the _value tuple of dynamic bound-attribute annotations is unchanged, unless the id of the object reference is fo

Parameters:
  • other (Annotable) – Source of annotations to copy.

  • attribute_object_mapper (dict) – Like the memo of __deepcopy__, maps object id’s to objects. The purpose of this is to update the parent or owner objects of dynamic attribute annotations. If a dynamic attribute Annotation gives object x as the parent or owner of the attribute (that is, the first element of the Annotation._value tuple is other) and id(x) is found in attribute_object_mapper, then in the copy the owner of the attribute is changed to attribute_object_mapper[id(x)]. If attribute_object_mapper is None (default), then the following mapping is automatically inserted: id(other): self. That is, any references to other in any Annotation object will be remapped to self. If really no reattribution mappings are desired, then an empty dictionary should be passed instead.
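The remapping of bound-attribute owners can be sketched as follows. ToyAnnotation, Thing, and copy_bound_annotation are hypothetical names used only to model the “(obj, attr_name)” tuple and the id-keyed mapper described above; they are not DendroPy classes.

```python
class ToyAnnotation:
    """Hypothetical dynamic bound-attribute annotation: value is (obj, attr_name)."""
    def __init__(self, obj, attr_name):
        self._value = (obj, attr_name)

    @property
    def value(self):
        # Resolved at access time, so it always reflects the current owner.
        return getattr(*self._value)

def copy_bound_annotation(annotation, attribute_object_mapper):
    obj, attr_name = annotation._value
    # Re-point the owner only if its id is present in the mapper.
    new_obj = attribute_object_mapper.get(id(obj), obj)
    return ToyAnnotation(new_obj, attr_name)

class Thing:
    def __init__(self, label):
        self.label = label

other, me = Thing("other"), Thing("self")
ann = ToyAnnotation(other, "label")

remapped = copy_bound_annotation(ann, {id(other): me})  # default-style mapping
unchanged = copy_bound_annotation(ann, {})              # empty dict: no reattribution
```

Here remapped.value reads the label of me, while unchanged.value still reads the label of other.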

deep_copy_annotations_from(other, memo=None)

Note that all references to other in any annotation value (and sub-annotation, and sub-sub-sub-annotation, etc.) will be replaced with references to self. This may not always make sense (i.e., a reference to a particular entity may be absolute regardless of context).

description(depth=1, indent=0, itemize='', output=None)

Returns description of object, up to level depth.

discard_sequences(taxa)

Removes sequences associated with Taxon instances specified in taxa if they exist.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

export_character_indices(indices)

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the 0-based indices in indices. Note that this new matrix will still reference the same taxon set.

export_character_subset(character_subset)

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the CharacterSubset, character_subset. Note that this new matrix will still reference the same taxon set.
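Column export amounts to selecting the same 0-based positions from every sequence. The sketch below uses a dictionary of lists as a toy matrix; export_columns and the subset labels are hypothetical, and keeping columns in the order listed is an assumption of this sketch.

```python
def export_columns(sequences, indices):
    """Return a new mapping holding only the given 0-based columns."""
    return {taxon: [seq[i] for i in indices] for taxon, seq in sequences.items()}

matrix = {"s1": list("TCCAA"), "s2": list("TGCAA"), "s3": list("TG-AA")}

# A character subset is modeled here as a label mapped to column indices.
character_subsets = {"first_two": [0, 1], "third": [2]}

by_indices = export_columns(matrix, [1, 2])
by_subset = export_columns(matrix, character_subsets["first_two"])
```

The source mapping is left untouched; only the new mapping is restricted to the chosen columns.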

extend_matrix(other_matrix)

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.

extend_sequences(other_matrix, is_add_new_sequences=False)

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. All other sequences will be ignored.
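The difference between adding, extending sequences, and extending the matrix can be sketched with dictionaries of lists. The functions below are toy models that mirror the notes for each method; they are not the library's code.

```python
def add_sequences(self_seqs, other_seqs):
    # Only taxa missing from self are added (as shallow copies).
    for taxon, seq in other_seqs.items():
        if taxon not in self_seqs:
            self_seqs[taxon] = list(seq)

def extend_sequences(self_seqs, other_seqs):
    # Only taxa already in self are extended; others are ignored.
    for taxon, seq in other_seqs.items():
        if taxon in self_seqs:
            self_seqs[taxon].extend(seq)

def extend_matrix(self_seqs, other_seqs):
    # Shared taxa are extended, and missing taxa are added.
    for taxon, seq in other_seqs.items():
        if taxon in self_seqs:
            self_seqs[taxon].extend(seq)
        else:
            self_seqs[taxon] = list(seq)

other = {"s2": ["T"], "s3": ["A"]}

def fresh():
    return {"s1": ["A", "C"], "s2": ["G"]}

added = fresh()
add_sequences(added, other)

extended = fresh()
extend_sequences(extended, other)

merged = fresh()
extend_matrix(merged, other)
```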

fill(value, size=None, append=True)

Pads out all sequences in self by adding value to each sequence until its length equals size or, if size is not specified, the length of the longest sequence.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete characters).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted at the front of each sequence.
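The padding rule can be sketched on a dictionary of lists. The function below is a toy model of the behaviour described by the three parameters, not the method itself.

```python
def fill(sequences, value, size=None, append=True):
    """Pad every sequence in place up to size (or the longest sequence)."""
    if size is None:
        size = max((len(s) for s in sequences.values()), default=0)
    for seq in sequences.values():
        shortfall = size - len(seq)
        if shortfall <= 0:
            continue  # already long enough; never truncated
        padding = [value] * shortfall
        if append:
            seq.extend(padding)
        else:
            seq[:0] = padding  # insert at the front

ragged = {"s1": list("AC"), "s2": list("ACGT"), "s3": []}
fill(ragged, "-")

front_padded = {"s1": list("AC")}
fill(front_padded, "?", size=4, append=False)
```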

fill_taxa()

Adds a new (empty) sequence for each Taxon instance in current taxon namespace that does not have a sequence.

folded_site_frequency_spectrum(is_pad_vector_to_unfolded_length=False)

Returns the folded or minor site/allele frequency spectrum.

Given $N$ chromosomes, the site frequency spectrum is a vector $(f_0, f_1, f_2, \ldots, f_N)$, where the value $f_i$ is the number of sites where $i$ derived alleles are segregating in the sample: 0 alleles, 1 allele, 2 alleles, etc.

The folded site frequency spectrum is a vector $(f_0, f_1, f_2, \ldots, f_m)$, $m = \lceil \frac{N}{2} \rceil$, where $f_i$ is the number of sites at which the minor allele is present in $i$ chromosomes.

Parameters:

is_pad_vector_to_unfolded_length (bool) – If False (default), then the vector length will be $\lceil \frac{N}{2} \rceil$, where $N$ is the number of taxa. If True, the length of the vector will be the number of taxa + 1, with the first element the number of monomorphic sites not contributing to the site frequency spectrum.

Returns:

v (list[int]) – A vector of integers representing the folded site frequency spectrum.
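The folding step can be sketched for biallelic sites coded as 0/1 with no missing data. This is an illustration under stated assumptions, not the method's implementation: folded_sfs is a hypothetical function, and it sizes the unpadded vector to hold indices 0 through $m$.

```python
import math

def folded_sfs(sequences, pad_to_unfolded_length=False):
    """sequences: equal-length lists of 0/1 alleles, one per sampled chromosome."""
    n = len(sequences)
    num_sites = len(sequences[0])
    # Assumption: the unpadded vector has entries f_0 .. f_m with m = ceil(n / 2).
    length = n + 1 if pad_to_unfolded_length else math.ceil(n / 2) + 1
    spectrum = [0] * length
    for site in range(num_sites):
        ones = sum(seq[site] for seq in sequences)
        minor = min(ones, n - ones)  # fold: count the rarer allele
        spectrum[minor] += 1
    return spectrum

samples = [
    [0, 0, 1, 0, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 0],
]
print(folded_sfs(samples))        # [1, 3, 1]
print(folded_sfs(samples, True))  # [1, 3, 1, 0, 0]
```

Index 0 counts the single monomorphic site, index 1 the three sites where the rarer allele appears once, and index 2 the one site split two and two.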

classmethod from_dict(source_dict, char_matrix=None, case_sensitive_taxon_labels=False, **kwargs)

Populates character matrix from dictionary (or similar mapping type), creating Taxon objects and sequences as needed.

Keys must be strings representing labels of Taxon objects or Taxon objects directly. If a key is specified as a string, then it will be dereferenced to the first existing Taxon object in the current taxon namespace with the same label. If no such Taxon object can be found, then a new Taxon object is created and added to the current namespace. If a key is specified as a Taxon object, then this is used directly. If it is not in the current taxon namespace, it will be added.

Values are the sequences (more generally, iterable of values). If values are of type CharacterDataSequence, then they are added as-is. Otherwise CharacterDataSequence instances are created for them. Values may be coerced into types compatible with particular matrices. The classmethod coerce_values() will be called for this.

Examples

The following creates a DnaCharacterMatrix instance with three sequences:

from dendropy import DnaCharacterMatrix

d = {
    "s1" : "TCCAA",
    "s2" : "TGCAA",
    "s3" : "TG-AA",
}
dna = DnaCharacterMatrix.from_dict(d)

Three Taxon objects will be created, corresponding to the labels ‘s1’, ‘s2’, ‘s3’. Each associated string sequence will be converted to a CharacterDataSequence, with each symbol (“A”, “C”, etc.) being replaced by the DNA state represented by the symbol.

Parameters:
  • source_dict (dict or other mapping type) – Keys must be strings representing labels of Taxon objects or Taxon objects directly. Values are sequences. See above for details.

  • char_matrix (CharacterMatrix) – Instance of CharacterMatrix to populate with data. If not specified, a new one will be created using keyword arguments specified by kwargs.

  • case_sensitive_taxon_labels (boolean) – If True, string labels specified as keys in source_dict will be matched to Taxon objects in the current taxon namespace with case being respected. If False, then case will be ignored.

  • **kwargs (keyword arguments, optional) – Keyword arguments to be passed to constructor of CharacterMatrix when creating new instance to populate, if no target instance is provided via char_matrix.

Returns:

char_matrix (CharacterMatrix) – CharacterMatrix populated by data from source_dict.

classmethod get(**kwargs)

Instantiate and return a new character matrix object from a data source.

Mandatory Source-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object of data opened for reading.

  • path (str) – Path to file of data.

  • url (str) – URL of data.

  • data (str) – Data given directly.

Mandatory Schema-Specification Keyword Argument:

Optional General Keyword Arguments:

  • label (str) – Name or identifier to be assigned to the new object; if not given, will be assigned the one specified in the data source, or None otherwise.

  • taxon_namespace (TaxonNamespace) – The TaxonNamespace instance to use to manage the taxon names. If not specified, a new one will be created.

  • matrix_offset (int) – 0-based index of character block or matrix in source to be parsed. If not specified then the first matrix (offset = 0) is assumed.

  • ignore_unrecognized_keyword_arguments (bool) – If True, then unsupported or unrecognized keyword arguments will not result in an error. Default is False: unsupported keyword arguments will result in an error.

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is interpreted and processed, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples:

import dendropy

dna1 = dendropy.DnaCharacterMatrix.get(
        file=open("pythonidae.fasta"),
        schema="fasta")
dna2 = dendropy.DnaCharacterMatrix.get(
        url="http://purl.org/phylo/treebase/phylows/matrix/TB2:M2610?format=nexus",
        schema="nexus")
aa1 = dendropy.ProteinCharacterMatrix.get(
        file=open("pythonidae.dat"),
        schema="phylip")
std1 = dendropy.StandardCharacterMatrix.get(
        path="python_morph.nex",
        schema="nexus")
std2 = dendropy.StandardCharacterMatrix.get(
        data=">t1\n01011\n\n>t2\n11100",
        schema="fasta")

classmethod get_from_path(src, schema, **kwargs)

Factory method to return new object of this class from file specified by string src.

Parameters:
  • src (string) – Full file path to source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_stream(src, schema, **kwargs)

Factory method to return new object of this class from file-like object src.

Parameters:
  • src (file or file-like) – Source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_string(src, schema, **kwargs)

Factory method to return new object of this class from string src.

Parameters:
  • src (string) – Data as a string.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_url(src, schema, strip_markup=False, **kwargs)

Factory method to return a new object of this class from URL given by src.

Parameters:
  • src (string) – URL of location providing source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

items()

Returns character map key, value pairs in key-order.

keep_sequences(taxa)

Discards all sequences not associated with any of the Taxon instances specified in taxa.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

property max_sequence_size

Maximum number of characters across all sequences in matrix.

Returns:

n (integer) – Maximum number of characters across all sequences in matrix.

migrate_taxon_namespace(taxon_namespace, unify_taxa_by_label=True, taxon_mapping_memo=None)

Move this object and all members to a new operational taxonomic unit concept namespace scope.

Current self.taxon_namespace value will be replaced with the value given in taxon_namespace if this is not None, or with a new TaxonNamespace object otherwise. Following this, reconstruct_taxon_namespace() will be called: each distinct Taxon object associated with self or members of self that is not already in taxon_namespace will be replaced with a new Taxon object that will be created with the same label and added to self.taxon_namespace. Calling this method results in the object (and all its member objects) being associated with a new, independent taxon namespace.

Label mapping case sensitivity follows the self.taxon_namespace.is_case_sensitive setting. If False and unify_taxa_by_label is also True, then the establishment of correspondence between Taxon objects in the old and new namespaces will be based on case-insensitive matching of labels. E.g., if there are four Taxon objects with labels ‘Foo’, ‘Foo’, ‘FOO’, and ‘FoO’ in the old namespace, then all objects that reference these will reference a single new Taxon object in the new namespace (with a label that is some existing casing variant of ‘foo’). If True and unify_taxa_by_label is True, Taxon objects with labels identical except in case will be considered distinct.

Parameters:
  • taxon_namespace (TaxonNamespace) – The TaxonNamespace into the scope of which this object will be moved.

  • unify_taxa_by_label (boolean, optional) – If True, then references to distinct Taxon objects with identical labels in the current namespace will be replaced with a reference to a single Taxon object in the new namespace. If False: references to distinct Taxon objects will remain distinct, even if the labels are the same.

  • taxon_mapping_memo (dictionary) – Similar to memo of deepcopy, this is a dictionary that maps Taxon objects in the old namespace to corresponding Taxon objects in the new namespace. Mostly for internal use when migrating complex data to a new namespace. Note that any mappings here take precedence over all other options: if a Taxon object in the old namespace is found in this dictionary, the counterpart in the new namespace will be whatever value is mapped, regardless of, e.g., label values.
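How the three options interact can be sketched with a small mapping function. Taxon here is a hypothetical stand-in class and map_taxa is an illustrative helper, not part of DendroPy; it models only the label-matching rules described above.

```python
class Taxon:
    """Hypothetical stand-in for a taxon object."""
    def __init__(self, label):
        self.label = label

def map_taxa(old_taxa, unify_by_label=True, case_sensitive=False, memo=None):
    """Return a dict mapping each old Taxon to its counterpart in a new namespace."""
    memo = {} if memo is None else memo
    by_label = {}
    for old in old_taxa:
        if old in memo:  # explicit mappings take precedence
            continue
        if not unify_by_label:
            memo[old] = Taxon(old.label)  # one new Taxon per old Taxon
            continue
        key = old.label if case_sensitive else old.label.lower()
        if key not in by_label:
            by_label[key] = Taxon(old.label)
        memo[old] = by_label[key]
    return memo

old = [Taxon(label) for label in ("Foo", "Foo", "FOO", "FoO")]

unified = map_taxa(old)                           # one shared new Taxon
cased = map_taxa(old, case_sensitive=True)        # 'Foo', 'FOO', 'FoO' stay apart
separate = map_taxa(old, unify_by_label=False)    # four new Taxon objects
```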

Examples

Use this method to move an object from one taxon namespace to another.

For example, to get a copy of an object associated with another taxon namespace and associate it with a different namespace:

# Get handle to the new TaxonNamespace
other_taxon_namespace = some_other_data.taxon_namespace

# Get a taxon-namespace scoped copy of a tree
# in another namespace
t2 = Tree(t1)

# Replace taxon namespace of copy
t2.migrate_taxon_namespace(other_taxon_namespace)

You can also use this method to get a copy of a structure and then move it to a new namespace:

t2 = Tree(t1)
t2.migrate_taxon_namespace(TaxonNamespace())

# Note: the same effect can be achieved by:
t3 = copy.deepcopy(t1)

new_character_subset(label, character_indices)

Defines a set of characters (columns) that make up a character set. Raises an error if one already exists with the same label. Column indices are 0-based.

new_sequence(taxon, values=None)

Creates a new CharacterDataSequence associated with Taxon taxon, and populates it with values in values.

Parameters:
  • taxon (Taxon) – Taxon instance with which this sequence is associated.

  • values (iterable or None) – An initial set of values with which to populate the new character sequence.

Returns:

s (CharacterDataSequence) – A new CharacterDataSequence associated with Taxon taxon.

pack(value=None, size=None, append=True)

Adds missing sequences for all Taxon instances in the current namespace, and then pads out all sequences in self by adding value to each sequence until its length equals size or, if size is not specified, the length of the longest sequence. A combination of CharacterMatrix.fill_taxa and CharacterMatrix.fill.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete characters).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted at the front of each sequence.

poll_taxa(taxa=None)

Returns a set populated with all of Taxon instances associated with self.

Parameters:

taxa (set()) – Set to populate. If not specified, a new one will be created.

Returns:

taxa (set[Taxon]) – Set of taxa associated with self.

purge_taxon_namespace()

Remove all Taxon instances in self.taxon_namespace that are not associated with self or any item in self.

reconstruct_taxon_namespace(unify_taxa_by_label=True, taxon_mapping_memo=None)

See TaxonNamespaceAssociated.reconstruct_taxon_namespace.

reindex_subcomponent_taxa()

Synchronizes Taxon objects of map to taxon_namespace of self.

reindex_taxa(taxon_namespace=None, clear=False)

DEPRECATED: Use migrate_taxon_namespace() instead. Rebuilds taxon_namespace from scratch, or assigns Taxon objects from given TaxonNamespace object taxon_namespace based on label values.

remap_to_default_state_alphabet_by_symbol(purge_other_state_alphabets=True)

All entities with any reference to a state alphabet will have the reference reassigned to the default state alphabet, and all entities with any reference to a state alphabet element will have the reference reassigned to the element of the default state alphabet that has the same symbol. Raises ValueError if no matching symbol can be found.

remap_to_state_alphabet_by_symbol(state_alphabet, purge_other_state_alphabets=True)

All entities with any reference to a state alphabet will have the reference reassigned to state_alphabet, and all entities with any reference to a state alphabet element will have the reference reassigned to the element of state_alphabet that has the same symbol. Raises KeyError if no matching symbol can be found.

remove_sequences(taxa)

Removes sequences associated with Taxon instances specified in taxa. A KeyError is raised if a Taxon instance is specified for which there is no associated sequence.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.
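The contrast between remove_sequences, discard_sequences, and keep_sequences parallels removal from a dictionary with and without a default. The functions below are toy models on a plain dict and are not the library's implementation.

```python
def remove_sequences(seqs, taxa):
    for taxon in taxa:
        del seqs[taxon]        # KeyError if the taxon has no sequence

def discard_sequences(seqs, taxa):
    for taxon in taxa:
        seqs.pop(taxon, None)  # silently skips taxa with no sequence

def keep_sequences(seqs, taxa):
    wanted = set(taxa)
    for taxon in list(seqs):   # copy the keys: the dict is modified in the loop
        if taxon not in wanted:
            del seqs[taxon]

def fresh():
    return {"s1": "TCCAA", "s2": "TGCAA", "s3": "TG-AA"}

discarded = fresh()
discard_sequences(discarded, ["s2", "missing"])

kept = fresh()
keep_sequences(kept, ["s1", "s3"])
```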

replace_sequences(other_matrix)

Replaces sequences for Taxon objects shared between self and other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to replace sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.

  4. All other sequences will be ignored.

property sequence_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

sequences()

List of all sequences in self.

Returns:

s (list of CharacterDataSequence objects in self)

taxon_namespace_scoped_copy(memo=None)

Cloning level: 1. Taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon objects: these are preserved as references.

taxon_state_sets_map(char_indices=None, gaps_as_missing=True, gap_state=None, no_data_state=None)

Returns a dictionary that maps taxon objects to lists of sets of fundamental state indices.

Parameters:
  • char_indices (iterable of ints) – An iterable of indexes of characters to include (by column). If not given or None [default], then all characters are included.

  • gaps_as_missing (boolean) – If True [default] then gap characters will be treated as missing data values. If False, then they will be treated as an additional (fundamental) state.

Returns:

d (dict) – A dictionary with Taxon objects as keys and a list of sets of fundamental state indexes as values.

E.g., Given the following matrix of DNA characters:

T1 AGN
T2 C-T
T3 GC?

Return with gaps_as_missing==True

{
    <T1> : [ set([0]), set([2]),        set([0,1,2,3]) ],
    <T2> : [ set([1]), set([0,1,2,3]),  set([3]) ],
    <T3> : [ set([2]), set([1]),        set([0,1,2,3]) ],
}

Return with gaps_as_missing==False

{
    <T1> : [ set([0]), set([2]),        set([0,1,2,3]) ],
    <T2> : [ set([1]), set([4]),        set([3]) ],
    <T3> : [ set([2]), set([1]),        set([0,1,2,3,4]) ],
}

Note that when gaps are treated as a fundamental state, not only does ‘-’ map to a distinct and unique state (4), but ‘?’ (missing data) maps to set consisting of all bases and the gap state, whereas ‘N’ maps to a set of all bases but not including the gap state.

When gaps are treated as missing, on the other hand, then ‘?’ and ‘N’ and ‘-’ all map to the same set, i.e. of all the bases.
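The mapping in the example above can be reproduced with a short stand-alone function. This is a sketch that hard-codes the index assignment A=0, C=1, G=2, T=3 and gap=4 taken from the example; state_sets is a hypothetical helper, and strings stand in for Taxon keys.

```python
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
GAP_INDEX = 4

def state_sets(sequence, gaps_as_missing=True):
    """Map each symbol of a DNA string to its set of fundamental state indices."""
    all_bases = set(BASES.values())
    result = []
    for symbol in sequence:
        if symbol in BASES:
            result.append({BASES[symbol]})
        elif symbol == "N":
            result.append(set(all_bases))           # any base, never the gap
        elif symbol == "-":
            result.append(set(all_bases) if gaps_as_missing else {GAP_INDEX})
        elif symbol == "?":
            result.append(set(all_bases) if gaps_as_missing
                          else all_bases | {GAP_INDEX})
        else:
            raise ValueError("unsupported symbol: {!r}".format(symbol))
    return result

matrix = {"T1": "AGN", "T2": "C-T", "T3": "GC?"}
as_missing = {t: state_sets(s, True) for t, s in matrix.items()}
as_state = {t: state_sets(s, False) for t, s in matrix.items()}
```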

update_sequences(other_matrix)

Replaces sequences for Taxon objects shared between self and other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to update sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.

  4. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.
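The difference between replacing and updating can be sketched with dictionaries of lists. These toy functions mirror the notes for replace_sequences and update_sequences; they are illustrative and not the library's code.

```python
def replace_sequences(self_seqs, other_seqs):
    # Only taxa present in both are touched; each gets a shallow copy.
    for taxon, seq in other_seqs.items():
        if taxon in self_seqs:
            self_seqs[taxon] = list(seq)

def update_sequences(self_seqs, other_seqs):
    # Shared taxa are replaced and missing taxa are added, like dict.update.
    for taxon, seq in other_seqs.items():
        self_seqs[taxon] = list(seq)

other = {"s2": ["T", "T"], "s3": ["C"]}

replaced = {"s1": ["A"], "s2": ["G"]}
replace_sequences(replaced, other)

updated = {"s1": ["A"], "s2": ["G"]}
update_sequences(updated, other)
```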

update_taxon_namespace()

All Taxon objects in self that are not in self.taxon_namespace will be added.

values()

Iterates values (i.e. sequences) in this matrix.

property vector_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

write(**kwargs)

Writes out self in schema format.

Mandatory Destination-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object opened for writing.

  • path (str) – Path to file to which to write.

Mandatory Schema-Specification Keyword Argument:

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples

# Using a file path:
d.write(path="path/to/file.dat", schema="nexus")

# Using an open file:
with open("path/to/file.dat", "w") as f:
    d.write(file=f, schema="nexus")

write_to_path(dest, schema, **kwargs)

Writes to file specified by dest.

write_to_stream(dest, schema, **kwargs)

Writes to file-like object dest.

StandardCharacterMatrix: “Standard” Data

class dendropy.datamodel.charmatrixmodel.StandardCharacterMatrix(*args, **kwargs)[source]

Specializes CharacterMatrix for “standard” data (i.e., generic discrete character data).

A default state alphabet consisting of state symbols of 0-9 will automatically be created unless default_state_alphabet=None is passed in. To specify a different default state alphabet:

default_state_alphabet=dendropy.new_standard_state_alphabet("abc")
default_state_alphabet=dendropy.new_standard_state_alphabet("ij")

__delitem__(key)

Removes sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

__getitem__(key)

Retrieves sequence for key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

Returns:

s (CharacterDataSequence) – A sequence associated with the Taxon instance referenced by key.

__iter__()

Returns an iterator over character map’s ordered keys.

__len__()

Number of sequences in matrix.

Returns:

n (integer) – Number of sequences in matrix.

__setitem__(key, values)

Assigns sequence values to taxon specified by key, which can be an index or a label of a Taxon instance in the current taxon namespace, or a Taxon instance directly.

If no sequence is currently associated with specified Taxon, a new one will be created. Note that the Taxon object must have already been defined in the current taxon namespace.

Parameters:

key (integer, string, or Taxon) – If an integer, assumed to be an index of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. If a string, assumed to be a label of a Taxon object in the current TaxonNamespace object of self.taxon_namespace. Otherwise, assumed to be Taxon instance directly. In all cases, the Taxon object must be (already) defined in the current taxon namespace.

add_character_subset(char_subset)

Adds a CharacterSubset object. Raises an error if one already exists with the same label.

add_sequences(other_matrix)

Adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to add sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self as a shallow-copy.

  4. All other sequences will be ignored.

as_string(schema, **kwargs)

Composes and returns string representation of the data.

Mandatory Schema-Specification Keyword Argument:

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

character_sequence_type

alias of StandardCharacterDataSequence

clear()

Removes all sequences from matrix.

clone(depth=1)

Creates and returns a copy of self.

Parameters:

depth (integer) –

The depth of the copy:

  • 0: shallow-copy: All member objects are references, except for the annotation_set of the top-level object and member Annotation objects: these are full, independent instances (though any complex objects in the value field of Annotation objects are also just references).

  • 1: taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon instances: these are references.

  • 2: Exhaustive deep-copy: all objects are cloned.

coerce_values(values)[source]

Converts elements of values to type of matrix.

This method is called by CharacterMatrix.from_dict to create sequences from iterables of values. This method should be overridden by derived classes to ensure that values consists of types compatible with the particular type of matrix. For example, a CharacterMatrix type with a fixed state alphabet (such as DnaCharacterMatrix) would dereference the string elements of values to return a list of StateIdentity objects corresponding to the symbols represented by the strings. If there is no value-type conversion done, then values should be returned as-is. If no value-type conversion is possible (e.g., when the type of a value is dependent on positional information), then a TypeError should be raised.

Parameters:

values (iterable) – Iterable of values to be converted.

Returns:

v (list of values.)
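As an illustration of the kind of override described above, here is a minimal plain-Python sketch for a fixed alphabet. The State class and the DNA_STATES table are hypothetical stand-ins for StateIdentity objects and a state alphabet, and raising TypeError for an unrecognized element is a choice made for this sketch only.

```python
class State:
    """Hypothetical stand-in for a state identity object."""
    def __init__(self, symbol):
        self.symbol = symbol

# A fixed alphabet: one shared State object per symbol.
DNA_STATES = {sym: State(sym) for sym in "ACGT-?N"}

def coerce_values(values):
    """Dereference symbols to shared State objects; pass State objects through."""
    coerced = []
    for v in values:
        if isinstance(v, State):
            coerced.append(v)
        elif isinstance(v, str) and v.upper() in DNA_STATES:
            coerced.append(DNA_STATES[v.upper()])
        else:
            # Assumption of this sketch: anything else cannot be converted.
            raise TypeError(f"cannot convert {v!r} to a DNA state")
    return coerced

states = coerce_values("TG-AA")
```

Because every symbol is dereferenced to the same shared object, the two trailing "A" values in states are the identical State instance.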

classmethod concatenate(char_matrices)

Creates and returns a single character matrix from multiple CharacterMatrix objects specified as a list, ‘char_matrices’. All the CharacterMatrix objects in the list must be of the same type, and share the same TaxonNamespace reference. All taxa must be present in all alignments, and all alignments must be of the same length. Component parts will be recorded as character subsets.
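The bookkeeping involved can be sketched in plain Python, with matrices modeled as dictionaries mapping labels to sequence strings. This is an illustration, not DendroPy's implementation: the function below is hypothetical, it records each component as a (start, end) column range rather than a CharacterSubset, and it reads the length requirement as "every sequence within a component has the same length".

```python
def concatenate(matrices):
    """Join {label: sequence-string} matrices column-wise.

    Returns the combined matrix and a list of (start, end) column ranges,
    one per component, with `end` exclusive.
    """
    labels = set(matrices[0])
    combined = {label: "" for label in matrices[0]}
    subsets = []
    offset = 0
    for m in matrices:
        if set(m) != labels:
            raise ValueError("all taxa must be present in all alignments")
        lengths = {len(seq) for seq in m.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within an alignment must be of equal length")
        width = lengths.pop()
        for label, seq in m.items():
            combined[label] += seq
        subsets.append((offset, offset + width))
        offset += width
    return combined, subsets

gene1 = {"s1": "TCCAA", "s2": "TGCAA"}
gene2 = {"s1": "GG", "s2": "GA"}
combined, subsets = concatenate([gene1, gene2])
```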

classmethod concatenate_from_paths(paths, schema, **kwargs)

Read a character matrix from each file path given in paths, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

classmethod concatenate_from_streams(streams, schema, **kwargs)

Read a character matrix from each file object given in streams, assuming data format/schema schema, and passing any keyword arguments down to the underlying specialized reader. Merge the character matrices and return the combined character matrix. Component parts will be recorded as character subsets.

copy_annotations_from(other, attribute_object_mapper=None)

Copies annotations from other, which must be of Annotable type.

Copies are deep-copies, in that the Annotation objects added to the annotation_set AnnotationSet collection of self are independent copies of those in the annotation_set collection of other. However, dynamic bound-attribute annotations retain references to the original objects as given in other, which may or may not be desirable. This is handled by updating the objects to which attributes are bound via mappings found in attribute_object_mapper. In dynamic bound-attribute annotations, the _value attribute of the annotations object (Annotation._value) is a tuple consisting of “(obj, attr_name)”, which instructs the Annotation object to return “getattr(obj, attr_name)” (via: “getattr(*self._value)”) when returning the value of the Annotation. “obj” is typically the object to which the AnnotationSet belongs (i.e., self). When a copy of Annotation is created, the object reference given in the first element of the _value tuple of dynamic bound-attribute annotations is unchanged, unless the id of the object reference is fo

Parameters:
  • other (Annotable) – Source of annotations to copy.

  • attribute_object_mapper (dict) – Like the memo of __deepcopy__, maps object ids to objects. The purpose of this is to update the parent or owner objects of dynamic attribute annotations. If a dynamic attribute Annotation gives object x as the parent or owner of the attribute (that is, the first element of the Annotation._value tuple is x) and id(x) is found in attribute_object_mapper, then in the copy the owner of the attribute is changed to attribute_object_mapper[id(x)]. If attribute_object_mapper is None (default), then the following mapping is automatically inserted: id(other): self. That is, any references to other in any Annotation object will be remapped to self. If no reattribution mappings are desired at all, then an empty dictionary should be passed instead.
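The remapping of dynamic bound-attribute annotations can be illustrated with a small plain-Python model. The Annotation and Node classes and the copy_annotations function below are hypothetical stand-ins written for this sketch; only the (obj, attr_name) tuple and the getattr(*self._value) lookup are taken from the description above.

```python
class Annotation:
    """Hypothetical stand-in; models only the dynamic bound-attribute case."""
    def __init__(self, name, owner, attr_name):
        self.name = name
        self._value = (owner, attr_name)

    @property
    def value(self):
        # Look the value up on the owner each time it is requested.
        return getattr(*self._value)

class Node:
    """Hypothetical annotable object."""
    def __init__(self, label):
        self.label = label
        self.annotations = []

def copy_annotations(src, dest, attribute_object_mapper=None):
    if attribute_object_mapper is None:
        # Default described above: remap references to the source onto the copy.
        attribute_object_mapper = {id(src): dest}
    for a in src.annotations:
        owner, attr_name = a._value
        owner = attribute_object_mapper.get(id(owner), owner)
        dest.annotations.append(Annotation(a.name, owner, attr_name))

a = Node("original")
a.annotations.append(Annotation("label", a, "label"))

b = Node("copy")
copy_annotations(a, b)        # default mapping: the copy follows b.label

c = Node("other")
copy_annotations(a, c, {})    # empty mapping: the copy still follows a.label
```

With the default mapping, the copied annotation reports the label of b; with an empty dictionary, it keeps tracking a, so later changes to a.label show through the annotation held by c.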

deep_copy_annotations_from(other, memo=None)

Note that all references to other in any annotation value (and sub-annotation, and sub-sub-sub-annotation, etc.) will be replaced with references to self. This may not always make sense (i.e., a reference to a particular entity may be absolute regardless of context).

description(depth=1, indent=0, itemize='', output=None)

Returns description of object, up to level depth.

discard_sequences(taxa)

Removes sequences associated with Taxon instances specified in taxa if they exist.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

export_character_indices(indices)

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the 0-based indices in indices. Note that this new matrix will still reference the same taxon set.
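Column selection by 0-based index can be sketched in plain Python, with a matrix modeled as a dictionary of label to sequence string. The helper below is hypothetical and emits columns in the order the indices are given, which is an assumption of the sketch rather than documented DendroPy behavior.

```python
def export_character_indices(matrix, indices):
    """Return a new {label: sequence} mapping holding only the given columns."""
    return {
        label: "".join(seq[i] for i in indices)
        for label, seq in matrix.items()
    }

matrix = {"s1": "TCCAA", "s2": "TGCAA", "s3": "TG-AA"}
first_and_third = export_character_indices(matrix, [0, 2])
```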

export_character_subset(character_subset)

Returns a new CharacterMatrix (of the same type) consisting only of columns given by the CharacterSubset, character_subset. Note that this new matrix will still reference the same taxon set.

extend_matrix(other_matrix)

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.

extend_sequences(other_matrix, is_add_new_sequences=False)

Extends sequences in self with characters associated with corresponding Taxon objects in other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to extend sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix that is also in self will be appended to the sequence currently associated with that Taxon reference in self.

  4. All other sequences will be ignored.
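The difference between extend_sequences and extend_matrix, as described in their summaries, can be sketched with dictionaries mapping labels to sequence strings. These two functions are hypothetical plain-Python models, not DendroPy code, and they skip the type and TaxonNamespace checks listed in the notes.

```python
def extend_sequences(self_m, other_m):
    """Append other_m's characters to taxa already in self_m; ignore the rest."""
    for label, seq in other_m.items():
        if label in self_m:
            self_m[label] = self_m[label] + seq

def extend_matrix(self_m, other_m):
    """Like extend_sequences, but also add taxa found only in other_m."""
    for label, seq in other_m.items():
        self_m[label] = self_m.get(label, "") + seq

a = {"s1": "AC", "s2": "GT"}
b = {"s2": "AA", "s3": "CC"}

m1 = dict(a)
extend_sequences(m1, b)   # "s3" is ignored

m2 = dict(a)
extend_matrix(m2, b)      # "s3" is added
```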

fill(value, size=None, append=True)

Pads out all sequences in self by adding value to each sequence until its length equals size or, if size is not specified, the length of the longest sequence.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete characters).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted to the front of each sequence.
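The padding rule can be sketched in plain Python, with sequences modeled as lists. The function below is a hypothetical model of the behavior described above, not DendroPy's implementation.

```python
def fill(matrix, value, size=None, append=True):
    """Pad every sequence (a list) in place up to `size` with `value`."""
    if size is None:
        # Default target: the length of the longest sequence.
        size = max((len(seq) for seq in matrix.values()), default=0)
    for seq in matrix.values():
        padding = [value] * (size - len(seq))  # empty if already long enough
        if append:
            seq.extend(padding)
        else:
            seq[:0] = padding

back = {"s1": list("TCCAA"), "s2": list("TG")}
fill(back, "-")

front = {"s1": list("AC"), "s2": list("A")}
fill(front, "?", size=4, append=False)
```

Here back["s2"] ends up as T, G, -, -, - and front["s2"] as ?, ?, ?, A; sequences that already meet the target length are left unchanged.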

fill_taxa()

Adds a new (empty) sequence for each Taxon instance in current taxon namespace that does not have a sequence.

folded_site_frequency_spectrum(is_pad_vector_to_unfolded_length=False)

Returns the folded or minor site/allele frequency spectrum.

Given $N$ chromosomes, the site frequency spectrum is a vector $(f_0, f_1, f_2, …, f_N)$, where the value $f_i$ is the number of sites where $i$ derived alleles are segregating in the sample: 0 alleles, 1 allele, 2 alleles, etc.

The folded site frequency spectrum is a vector $(f_0, f_1, f_2, …, f_m)$, $m = \lceil \frac{N}{2} \rceil$, where the value $f_i$ is the number of sites at which the minor allele occurs $i$ times in the sample.

Parameters:

is_pad_vector_to_unfolded_length (bool) – If False (default), then the vector length will be $\lceil \frac{N}{2} \rceil$, where $N$ is the number of taxa. If True, the length of the vector will be the number of taxa + 1, with the first element the number of monomorphic sites not contributing to the site frequency spectrum.

Returns:

v (list[int]) – A vector of integers representing the folded site frequency spectrum.
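As a worked illustration of the folded spectrum, the following plain-Python sketch counts minor alleles column by column. It is not DendroPy's implementation: the function is hypothetical, it handles only sites with one or two alleles, and it sizes the vector to hold $f_0$ through $f_m$, which may differ from the length DendroPy returns.

```python
import math
from collections import Counter

def folded_sfs(sequences):
    """Folded site frequency spectrum for equal-length sequences.

    Entry i counts the sites whose minor-allele count is i, so entry 0
    counts monomorphic sites. Only sites with one or two alleles are handled.
    """
    n = len(sequences)
    m = math.ceil(n / 2)
    spectrum = [0] * (m + 1)
    for column in zip(*sequences):
        counts = Counter(column)
        if len(counts) > 2:
            raise ValueError("this sketch handles biallelic sites only")
        minor = min(counts.values()) if len(counts) == 2 else 0
        spectrum[minor] += 1
    return spectrum

sample = ["0010", "0110", "0101", "0100"]
spectrum = folded_sfs(sample)
```

For the four sequences in sample, the first column is monomorphic, the second and fourth columns each have a minor-allele count of 1, and the third has a minor-allele count of 2.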

classmethod from_dict(source_dict, char_matrix=None, case_sensitive_taxon_labels=False, **kwargs)

Populates character matrix from dictionary (or similar mapping type), creating Taxon objects and sequences as needed.

Keys must be strings representing labels of Taxon objects or Taxon objects directly. If a key is specified as a string, then it will be dereferenced to the first existing Taxon object in the current taxon namespace with the same label. If no such Taxon object can be found, then a new Taxon object is created and added to the current namespace. If a key is specified as a Taxon object, then this is used directly. If it is not in the current taxon namespace, it will be added.

Values are the sequences (more generally, iterable of values). If values are of type CharacterDataSequence, then they are added as-is. Otherwise CharacterDataSequence instances are created for them. Values may be coerced into types compatible with particular matrices. The classmethod coerce_values() will be called for this.

Examples

The following creates a DnaCharacterMatrix instance with three sequences:

from dendropy import DnaCharacterMatrix

d = {
        "s1" : "TCCAA",
        "s2" : "TGCAA",
        "s3" : "TG-AA",
}
dna = DnaCharacterMatrix.from_dict(d)

Three Taxon objects will be created, corresponding to the labels ‘s1’, ‘s2’, ‘s3’. Each associated string sequence will be converted to a CharacterDataSequence, with each symbol (“A”, “C”, etc.) being replaced by the DNA state represented by the symbol.

Parameters:
  • source_dict (dict or other mapping type) – Keys must be strings representing labels of Taxon objects or Taxon objects directly. Values are sequences. See above for details.

  • char_matrix (CharacterMatrix) – Instance of CharacterMatrix to populate with data. If not specified, a new one will be created using keyword arguments specified by kwargs.

  • case_sensitive_taxon_labels (boolean) – If True, string labels specified as keys in source_dict will be matched to Taxon objects in the current taxon namespace with case being respected. If False, then case will be ignored.

  • **kwargs (keyword arguments, optional) – Keyword arguments to be passed to constructor of CharacterMatrix when creating new instance to populate, if no target instance is provided via char_matrix.

Returns:

char_matrix (CharacterMatrix) – CharacterMatrix populated by data from source_dict.

classmethod get(**kwargs)

Instantiate and return a new character matrix object from a data source.

Mandatory Source-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object of data opened for reading.

  • path (str) – Path to file of data.

  • url (str) – URL of data.

  • data (str) – Data given directly.

Mandatory Schema-Specification Keyword Argument:

Optional General Keyword Arguments:

  • label (str) – Name or identifier to be assigned to the new object; if not given, will be assigned the one specified in the data source, or None otherwise.

  • taxon_namespace (TaxonNamespace) – The TaxonNamespace instance to use to manage the taxon names. If not specified, a new one will be created.

  • matrix_offset (int) – 0-based index of character block or matrix in source to be parsed. If not specified then the first matrix (offset = 0) is assumed.

  • ignore_unrecognized_keyword_arguments (bool) – If True, then unsupported or unrecognized keyword arguments will not result in an error. Default is False: unsupported keyword arguments will result in an error.

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is interpreted and processed, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples:

import dendropy

dna1 = dendropy.DnaCharacterMatrix.get(
        file=open("pythonidae.fasta"),
        schema="fasta")
dna2 = dendropy.DnaCharacterMatrix.get(
        url="http://purl.org/phylo/treebase/phylows/matrix/TB2:M2610?format=nexus",
        schema="nexus")
aa1 = dendropy.ProteinCharacterMatrix.get(
        file=open("pythonidae.dat"),
        schema="phylip")
std1 = dendropy.StandardCharacterMatrix.get(
        path="python_morph.nex",
        schema="nexus")
std2 = dendropy.StandardCharacterMatrix.get(
        data=">t1\n01011\n\n>t2\n11100",
        schema="fasta")
classmethod get_from_path(src, schema, **kwargs)

Factory method to return new object of this class from file specified by string src.

Parameters:
  • src (string) – Full file path to source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_stream(src, schema, **kwargs)

Factory method to return new object of this class from file-like object src.

Parameters:
  • src (file or file-like) – Source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_string(src, schema, **kwargs)

Factory method to return new object of this class from string src.

Parameters:
  • src (string) – Data as a string.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

classmethod get_from_url(src, schema, strip_markup=False, **kwargs)

Factory method to return a new object of this class from URL given by src.

Parameters:
  • src (string) – URL of location providing source of data.

  • schema (string) – Specification of data format (e.g., “nexus”).

  • kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.

Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

items()

Returns character map key, value pairs in key-order.

keep_sequences(taxa)

Discards all sequences not associated with any of the Taxon instances.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.

property max_sequence_size

Maximum number of characters across all sequences in matrix.

Returns:

n (integer) – Maximum number of characters across all sequences in matrix.

migrate_taxon_namespace(taxon_namespace, unify_taxa_by_label=True, taxon_mapping_memo=None)

Move this object and all members to a new operational taxonomic unit concept namespace scope.

Current self.taxon_namespace value will be replaced with value given in taxon_namespace if this is not None, or a new TaxonNamespace object. Following this, reconstruct_taxon_namespace() will be called: each distinct Taxon object associated with self or members of self that is not already in taxon_namespace will be replaced with a new Taxon object that will be created with the same label and added to self.taxon_namespace. Calling this method results in the object (and all its member objects) being associated with a new, independent taxon namespace.

Label mapping case sensitivity follows the self.taxon_namespace.is_case_sensitive setting. If False and unify_taxa_by_label is also True, then the establishment of correspondence between Taxon objects in the old and new namespaces will be based on case-insensitive matching of labels. E.g., if there are four Taxon objects with labels ‘Foo’, ‘Foo’, ‘FOO’, and ‘FoO’ in the old namespace, then all objects that reference these will reference a single new Taxon object in the new namespace (with a label that is some existing casing variant of ‘foo’). If True and unify_taxa_by_label is also True, Taxon objects with labels identical except in case will be considered distinct.

Parameters:
  • taxon_namespace (TaxonNamespace) – The TaxonNamespace into the scope of which this object will be moved.

  • unify_taxa_by_label (boolean, optional) – If True, then references to distinct Taxon objects with identical labels in the current namespace will be replaced with a reference to a single Taxon object in the new namespace. If False: references to distinct Taxon objects will remain distinct, even if the labels are the same.

  • taxon_mapping_memo (dictionary) – Similar to memo of deepcopy, this is a dictionary that maps Taxon objects in the old namespace to corresponding Taxon objects in the new namespace. Mostly for internal use when migrating complex data to a new namespace. Note that any mappings here take precedence over all other options: if a Taxon object in the old namespace is found in this dictionary, the counterpart in the new namespace will be whatever value is mapped, regardless of, e.g., label values.

Examples

Use this method to move an object from one taxon namespace to another.

For example, to get a copy of an object associated with another taxon namespace and associate it with a different namespace:

# Get handle to the new TaxonNamespace
other_taxon_namespace = some_other_data.taxon_namespace

# Get a taxon-namespace scoped copy of a tree
# in another namespace
t2 = Tree(t1)

# Replace taxon namespace of copy
t2.migrate_taxon_namespace(other_taxon_namespace)

You can also use this method to get a copy of a structure and then move it to a new namespace:

t2 = Tree(t1)
t2.migrate_taxon_namespace(TaxonNamespace())

# Note: the same effect can be achieved by:
t3 = copy.deepcopy(t1)
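The interaction between unify_taxa_by_label and case sensitivity can be sketched in plain Python using labels alone. The function below is a hypothetical model, not DendroPy code; in particular, keeping the first-seen casing as the new label is an assumption of this sketch.

```python
def map_labels(old_labels, unify_taxa_by_label=True, is_case_sensitive=False):
    """Map each old taxon (by position) to an index into a list of new labels."""
    new_labels = []
    seen = {}       # comparison key -> index of the new label
    mapping = []
    for label in old_labels:
        key = label if is_case_sensitive else label.lower()
        if unify_taxa_by_label and key in seen:
            mapping.append(seen[key])
            continue
        new_labels.append(label)
        seen.setdefault(key, len(new_labels) - 1)
        mapping.append(len(new_labels) - 1)
    return new_labels, mapping

old = ["Foo", "Foo", "FOO", "FoO"]
unified = map_labels(old)
case_sensitive = map_labels(old, is_case_sensitive=True)
distinct = map_labels(old, unify_taxa_by_label=False)
```

With case-insensitive unification, all four old taxa map to one new label; with case-sensitive unification, only the two identical ‘Foo’ labels are merged; without unification, every old taxon gets its own counterpart.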

new_character_subset(label, character_indices)

Defines a set of characters (columns) that make up a character set. Raises an error if one already exists with the same label. Column indices are 0-based.

new_sequence(taxon, values=None)

Creates a new CharacterDataSequence associated with Taxon taxon, and populates it with values in values.

Parameters:
  • taxon (Taxon) – Taxon instance with which this sequence is associated.

  • values (iterable or None) – An initial set of values with which to populate the new character sequence.

Returns:

s (CharacterDataSequence) – A new CharacterDataSequence associated with Taxon taxon.

pack(value=None, size=None, append=True)

Adds missing sequences for all Taxon instances in current namespace, and then pads out all sequences in self by adding value to each sequence until its length is size long or equal to the length of the longest sequence if size is not specified. A combination of CharacterMatrix.fill_taxa and CharacterMatrix.fill.

Parameters:
  • value (object) – A valid value (e.g., a numeric value for continuous characters, or a StateIdentity for discrete characters).

  • size (integer or None) – The size (length) up to which the sequences will be padded. If None, then the maximum (longest) sequence size will be used.

  • append (boolean) – If True (default), then new values will be added to the end of each sequence. If False, then new values will be inserted to the front of each sequence.

poll_taxa(taxa=None)

Returns a set populated with all of the Taxon instances associated with self.

Parameters:

taxa (set) – Set to populate. If not specified, a new one will be created.

Returns:

taxa (set[Taxon]) – Set of taxa associated with self.

purge_taxon_namespace()

Remove all Taxon instances in self.taxon_namespace that are not associated with self or any item in self.

reconstruct_taxon_namespace(unify_taxa_by_label=True, taxon_mapping_memo=None)

See TaxonNamespaceAssociated.reconstruct_taxon_namespace.

reindex_subcomponent_taxa()

Synchronizes Taxon objects of map to taxon_namespace of self.

reindex_taxa(taxon_namespace=None, clear=False)

DEPRECATED: Use migrate_taxon_namespace() instead. Rebuilds taxon_namespace from scratch, or assigns Taxon objects from given TaxonNamespace object taxon_namespace based on label values.

remap_to_default_state_alphabet_by_symbol(purge_other_state_alphabets=True)

All entities with any reference to a state alphabet will have the reference reassigned to the default state alphabet, and all entities with any reference to a state alphabet element will have the reference reassigned to any state alphabet element in the default state alphabet that has the same symbol. Raises ValueError if no matching symbol can be found.

remap_to_state_alphabet_by_symbol(state_alphabet, purge_other_state_alphabets=True)

All entities with any reference to a state alphabet will have the reference reassigned to state alphabet state_alphabet, and all entities with any reference to a state alphabet element will have the reference reassigned to any state alphabet element in state_alphabet that has the same symbol. Raises KeyError if no matching symbol can be found.

remove_sequences(taxa)

Removes sequences associated with Taxon instances specified in taxa. A KeyError is raised if a Taxon instance is specified for which there is no associated sequence.

Parameters:

taxa (iterable[Taxon]) – List or some other iterable of Taxon instances.
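The contrast between remove_sequences, discard_sequences, and keep_sequences mirrors familiar dictionary idioms. The following plain-Python sketch models a matrix as a dictionary keyed by label; the three functions are hypothetical stand-ins for the methods of the same names.

```python
def remove_sequences(matrix, taxa):
    """Delete each taxon's sequence; a missing taxon raises KeyError."""
    for taxon in taxa:
        del matrix[taxon]

def discard_sequences(matrix, taxa):
    """Delete each taxon's sequence if present; missing taxa are skipped."""
    for taxon in taxa:
        matrix.pop(taxon, None)

def keep_sequences(matrix, taxa):
    """Delete every sequence whose taxon is not listed in taxa."""
    wanted = set(taxa)
    for taxon in list(matrix):
        if taxon not in wanted:
            del matrix[taxon]

m = {"s1": "TCCAA", "s2": "TGCAA", "s3": "TG-AA"}
discard_sequences(m, ["s3", "s9"])   # "s9" has no sequence: silently skipped
keep_sequences(m, ["s1"])            # drops "s2"
```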

replace_sequences(other_matrix)

Replaces sequences for Taxon objects shared between self and other_matrix.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to replace sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.

  4. All other sequences will be ignored.

property sequence_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

sequences()

List of all sequences in self.

Returns:

s (list of CharacterDataSequence objects in self)

taxon_namespace_scoped_copy(memo=None)

Cloning level: 1. Taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon objects: these are preserved as references.

taxon_state_sets_map(char_indices=None, gaps_as_missing=True, gap_state=None, no_data_state=None)

Returns a dictionary that maps taxon objects to lists of sets of fundamental state indices.

Parameters:
  • char_indices (iterable of ints) – An iterable of indexes of characters to include (by column). If not given or None [default], then all characters are included.

  • gaps_as_missing (boolean) – If True [default] then gap characters will be treated as missing data values. If False, then they will be treated as an additional (fundamental) state.

Returns:

d (dict) – A dictionary with Taxon objects as keys and a list of sets of fundamental state indexes as values.

E.g., Given the following matrix of DNA characters:

T1 AGN
T2 C-T
T3 GC?

Return with gaps_as_missing==True

{
    <T1> : [ set([0]), set([2]),        set([0,1,2,3]) ],
    <T2> : [ set([1]), set([0,1,2,3]),  set([3]) ],
    <T3> : [ set([2]), set([1]),        set([0,1,2,3]) ],
}

Return with gaps_as_missing==False

{
    <T1> : [ set([0]), set([2]),        set([0,1,2,3]) ],
    <T2> : [ set([1]), set([4]),        set([3]) ],
    <T3> : [ set([2]), set([1]),        set([0,1,2,3,4]) ],
}

Note that when gaps are treated as a fundamental state, not only does ‘-’ map to a distinct and unique state (4), but ‘?’ (missing data) maps to set consisting of all bases and the gap state, whereas ‘N’ maps to a set of all bases but not including the gap state.

When gaps are treated as missing, on the other hand, then ‘?’ and ‘N’ and ‘-’ all map to the same set, i.e. of all the bases.
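The mapping in the example above can be reproduced with a short plain-Python sketch. It is a model of the described behavior only: the BASES table, the GAP_INDEX constant, and the state_sets function are hypothetical, and IUPAC ambiguity codes other than ‘N’ are not handled.

```python
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
GAP_INDEX = 4  # index used for the gap when it is a fundamental state

def state_sets(sequence, gaps_as_missing=True):
    """Return one set of fundamental state indices per character."""
    all_bases = set(BASES.values())
    result = []
    for symbol in sequence:
        if symbol in BASES:
            result.append({BASES[symbol]})
        elif symbol == "N":
            result.append(set(all_bases))
        elif symbol == "-":
            result.append(set(all_bases) if gaps_as_missing else {GAP_INDEX})
        elif symbol == "?":
            result.append(set(all_bases) if gaps_as_missing
                          else all_bases | {GAP_INDEX})
        else:
            raise ValueError(f"unsupported symbol in this sketch: {symbol!r}")
    return result

matrix = {"T1": "AGN", "T2": "C-T", "T3": "GC?"}
as_missing = {t: state_sets(s) for t, s in matrix.items()}
as_state = {t: state_sets(s, gaps_as_missing=False) for t, s in matrix.items()}
```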

update_sequences(other_matrix)

Replaces sequences for Taxon objects shared between self and other_matrix and adds sequences for Taxon objects that are in other_matrix but not in self.

Parameters:

other_matrix (CharacterMatrix) – Matrix from which to update sequences.

Notes

  1. other_matrix must be of same type as self.

  2. other_matrix must have the same TaxonNamespace as self.

  3. Each sequence associated with a Taxon reference in other_matrix but not in self will be added to self.

  4. Each sequence in self associated with a Taxon that is also represented in other_matrix will be replaced with a shallow-copy of the corresponding sequence from other_matrix.
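The difference between replace_sequences and update_sequences can be sketched with dictionaries mapping labels to lists of values. Both functions are hypothetical plain-Python models that skip the type and TaxonNamespace checks; copying newly added sequences in update_sequences is an assumption of this sketch.

```python
def replace_sequences(self_m, other_m):
    """Replace sequences of shared taxa with shallow copies; ignore the rest."""
    for label in self_m:
        if label in other_m:
            self_m[label] = list(other_m[label])

def update_sequences(self_m, other_m):
    """Replace sequences of shared taxa and add taxa found only in other_m."""
    for label, seq in other_m.items():
        self_m[label] = list(seq)

a = {"s1": list("AC"), "s2": list("GT")}
b = {"s2": list("AA"), "s3": list("CC")}

m1 = {label: list(seq) for label, seq in a.items()}
replace_sequences(m1, b)   # "s3" is ignored

m2 = {label: list(seq) for label, seq in a.items()}
update_sequences(m2, b)    # "s3" is added
```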

update_taxon_namespace()

All Taxon objects in self that are not in self.taxon_namespace will be added.

values()

Iterates values (i.e. sequences) in this matrix.

property vector_size

Number of characters in first sequence in matrix.

Returns:

n (integer) – Number of characters in first sequence in matrix.

write(**kwargs)

Writes out self in schema format.

Mandatory Destination-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object opened for writing.

  • path (str) – Path to file to which to write.

Mandatory Schema-Specification Keyword Argument:

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples

# Using a file path:
d.write(path="path/to/file.dat", schema="nexus")

# Using an open file:
with open("path/to/file.dat", "w") as f:
    d.write(file=f, schema="nexus")

write_to_path(dest, schema, **kwargs)

Writes to file specified by dest.

write_to_stream(dest, schema, **kwargs)

Writes to file-like object dest.