dendropy.calculate.phylogeneticdistance
: Phylogenetic Distance Calculations and Operations¶
The PhylogeneticDistanceMatrix
Class¶
- class dendropy.calculate.phylogeneticdistance.PhylogeneticDistanceMatrix(is_store_path_edges=False)[source]¶
Calculates and maintains patristic distance information of taxa on a tree.
- assemblage_membership_definitions_from_csv(src, default_data_type=<class 'float'>, **csv_reader_kwargs)[source]¶
Convenience method to return list of community sets from a delimited file that lists taxon (labels) in columns and community presence/absences or abundances in rows.
- compile_from_tree(tree)[source]¶
Calculates the distances. Note that the path length (in number of steps) between taxa that span the root will be off by one if the tree is unrooted.
- distance(taxon1, taxon2, is_weighted_edge_distances=True, is_normalize_by_tree_size=False)[source]¶
Returns distance between taxon1 and taxon2.
- distances(is_weighted_edge_distances=True, is_normalize_by_tree_size=False)[source]¶
Returns list of patristic distances.
- distinct_taxon_pair_iter(filter_fn=None)[source]¶
Iterates over all distinct pairs of taxa in matrix.
- classmethod from_csv(src, taxon_namespace=None, is_allow_new_taxa=None, is_first_row_column_names=True, is_first_column_row_names=True, default_data_type=<class 'float'>, label_transform_fn=None, **csv_reader_kwargs)[source]¶
Instantiates a new PhylogeneticDistanceMatrix instance with data from an external source.
- Parameters:
src (file or file-like) – Source of data. This is a token delimited-file (e.g., a comma-delimited or tab-delimited file) providing a table which lists taxon labels in both rows and columns. The cells of the table are numeric (typically real) values that indicate the distance between the taxa of the current row and column. Note that only the upper right section of the table is considered. The diagonals values are typically zeroes and, in either case, ignored along with the lower diagonal. Despite being ignored by the PhylogeneticDistanceMatrix object, the values are parsed by the underlying reader and thus have to be valid numerical values.
taxon_namespace (
TaxonNamespace
instance) – The taxon namespace with which to manage taxa. If this has not already been pre-populated with the taxon names, thenis_allow_new_taxa
should be set toTrue
.is_allow_new_taxa (bool) – If
False
: we do not expect to encounter any new taxa in the data file, and it is an error if we do. IfTrue
: we do expect to encounter new taxa in the data file. The default value of this depends on the value passed totaxon_namespace
. Iftaxon_namespace
isNone
or an emptyTaxonNamespace
instance, then unless explicitly set toFalse
,is_allow_new_taxa
will default toTrue
: allowing of creation of new taxa corresponding to labels found in the data source. On the other hand, iftaxon_namespace
is not None and its value is aTaxonNamespace
instance with at least one taxon, unless explicitly set toTrue
,is_allow_new_taxa
will default toFalse
, and it will be an error if taxon labels are found in the data source that do not correspond (exactly) toTaxon
objects defined in the taxon namespace. This is to err on the side of caution, to avoid (or rather, highlight) problems due to incorrect or mismatching labels between the data source and the current taxon namespace.is_first_row_column_names (bool) – By default
True
: assumes that first row lists the taxon names. Set toFalse
if there is no header row.is_first_column_row_names (bool) – By default
True
: assumes that first column lists the taxon names. Set toFalse
if there is now row name column.label_transform_fn (function object) – If not None, this should be a function object that takes a string as an argument and returns another string. This function will be applied to row and column labels before they are matched to taxon labels in the
TaxonNamespace
instance given bytaxon_namespace
.**csv_reader_kwargs (keyword arguments) – This arguments will be passed to the underlying CSV reader. The most important one is probably ‘delimiter’.
- Returns:
pdm (A |PhylogeneticDistanceMatrix| instance)
Examples
import dendropy pdm1 = dendropy.PhylogeneticDistanceMatrix.from_csv( src=open("data.csv"), delimiter=",") pdm2 = dendropy.PhylogeneticDistanceMatrix.from_csv( src=open("data.tsv"), delimiter="\t")
- classmethod from_tree(tree, *args, **kwargs)[source]¶
Creates and returns a
PhylogeneticDistanceMatrix
based on the given tree.Note that this creates a “snapshot” of the current state of the tree. Subsequent changes to the tree will not be reflected in
PhylogeneticDistanceMatrix
instances previously created.Also note that syntactically you may prefer to use:
pdm = tree.phylogenetic_distance_matrix()
instead of:
pdm = PhylogeneticDistanceMatrix.from_tree(tree)
- Parameters:
tree (a
Tree
instance) – TheTree
from which to get the phylogenetic distances.- Returns:
pdm (A |PhylogeneticDistanceMatrix| instance)
Examples
import dendropy tree = dendropy.Tree.get(path="tree.nex", schema="nexus") pdm1 = dendropy.PhylogeneticDistanceMatrix.from_tree(tree) # following is equivalent to above and probably preferred: pdm2 = tree.phylogenetic_distance_matrix()
- mean_nearest_taxon_distance(filter_fn=None, is_weighted_edge_distances=True, is_normalize_by_tree_size=False)[source]¶
Calculates the phylogenetic ecology statistic “MNTD”[1,2] for the tree (only considering taxa for which
filter_fn
returns True when applied iffilter_fn
is specified).The mean nearest taxon distance (mntd) is given by:
\[mntd = \frac{ \sum_{i}^{n} min(\delta_{i,j}) }{n},\]where \(i \neq j\), \(\delta_{i,j}\) is the phylogenetic distance between species \(i\) and \(j\), and \(n\) is the number of species in the sample.
- Parameters:
filter_fn (function object or None) – If
None
, then all leaves will be considered. Otherwise should be a function object that takes a Taxon instance as an argument and returnsTrue
if it is to be included in the calculation orFalse
otherwise. In trees sampled from multiple communites,filter_fn
can be used to restrict the calculation to only one community based on some criteria.is_weighted_edge_distances (bool) – If
True
then the edge-weighted distance, i.e., considering edge lengths, is returned. Otherwise the the path steps or the number of edges rather then the sum of is_weighted_edge_distances edges, connecting two taxa is considered.is_normalize_by_tree_size (bool) – If
True
then the results are normalized by the total tree length or steps/edges (depending on whether edge-weighted or unweighted distances are used, respectively). Otherwise, raw distances are used.
- Returns:
mntd (float) – The Mean Nearest Taxon Distance (MNTD) statistic for the daata.
Examples
import dendropy tree = dendropy.Tree.get(path="data.nex", schema="nexus") pdm = dendropy.PhylogeneticDistanceMatrix(tree) # consider all tips mntd = pdm.mean_nearest_taxon_distance() # only tips within the same community, based on the node annotation # "community" mntds_by_community = {} for community_label in ("1", "2", "3",): filter_fn = lambda x: x.annotations["community"] == community_label mntd = pdm.mean_pairwise_distance(filter_fn=filter_fn) mntds_by_community[community_label] = mntd
References
[1] Webb, C.O. 2000. Exploring the phylogenetic structure of ecological communities: An example for rainforest trees. The American Naturalist 156: 145-155.
[2] Swenson, N.G. Functional and Phylogenetic Ecology in R.
- mean_pairwise_distance(filter_fn=None, is_weighted_edge_distances=True, is_normalize_by_tree_size=False)[source]¶
Calculates the phylogenetic ecology statistic “MPD”[1,2] for the tree (only considering taxa for which
filter_fn
returns True when applied iffilter_fn
is specified).The mean pairwise distance (mpd) is given by:
\[mpd = \frac{ \sum_{i}^{n} \sum_{j}^{n} \delta_{i,j} }{n \choose 2},\]where \(i \neq j\), \(\delta_{i,j}\) is the phylogenetic distance between species \(i\) and \(j\), and \(n\) is the number of species in the sample.
- Parameters:
filter_fn (function object or None) – If
None
, then all leaves will be considered. Otherwise should be a function object that takes a Taxon instance as an argument and returnsTrue
if it is to be included in the calculation orFalse
otherwise. In trees sampled from multiple communites,filter_fn
can be used to restrict the calculation to only one community based on some criteria.is_weighted_edge_distances (bool) – If
True
then the edge-weighted distance, i.e., considering edge lengths, is returned. Otherwise the the path steps or the number of edges rather then the sum of is_weighted_edge_distances edges, connecting two taxa is considered.is_normalize_by_tree_size (bool) – If
True
then the results are normalized by the total tree length or steps/edges (depending on whether edge-weighted or unweighted distances are used, respectively). Otherwise, raw distances are used.
- Returns:
mpd (float) – The Mean Pairwise Distance (MPD) statistic for the daata.
Examples
import dendropy tree = dendropy.Tree.get(path="data.nex", schema="nexus") pdm = dendropy.PhylogeneticDistanceMatrix(tree) # consider all tips mpd1 = pdm.mean_pairwise_distance() # only tips within the same community, based on the node annotation # "community" mpds_by_community = {} for community_label in ("1", "2", "3",): filter_fn = lambda x: x.annotations["community"] == community_label mpd = pdm.mean_pairwise_distance(filter_fn=filter_fn) mpds_by_community[community_label] = mpd
References
[1] Webb, C.O. 2000. Exploring the phylogenetic structure of ecological communities: An example for rainforest trees. The American Naturalist 156: 145-155.
[2] Swenson, N.G. Functional and Phylogenetic Ecology in R.
- nj_tree(is_weighted_edge_distances=True, tree_factory=None)[source]¶
Returns an Neighbor-Joining (NJ) tree based on the distances in the matrix.
Calculates and returns a tree under the Neighbor-Joining algorithm of Saitou and Nei (1987) for the data in the matrix.
- Parameters:
is_weighted_edge_distances (bool) – If
True
then edge lengths will be considered for distances. Otherwise, just the number of edges.- Returns:
t (|Tree|) – A
Tree
instance corresponding to the Neighbor-Joining (NJ) tree for this data.
Examples
import dendropy # Read data from a CSV file into a PhylogeneticDistanceMatrix # object with open("distance_matrix.csv") as src: pdm = dendropy.PhylogeneticDistanceMatrix.from_csv( src, is_first_row_column_names=True, is_first_column_row_names=True, is_allow_new_taxa=True, delimiter=",", ) # Calculate the tree nj_tree = pdm.nj_tree() # Print it print(nj_tree.as_string("nexus"))
References
Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4: 406-425.
- path_edge_count(taxon1, taxon2, is_normalize_by_tree_size=False)[source]¶
Returns the number of edges between two taxon objects.
- patristic_distance(taxon1, taxon2, is_normalize_by_tree_size=False)[source]¶
Returns patristic distance between two taxon objects.
- shuffle_taxa(is_shuffle_phylogenetic_distances=True, is_shuffle_phylogenetic_path_steps=True, is_shuffle_mrca=True, rng=None)[source]¶
Randomly shuffles taxa in-situ.
- standardized_effect_size_mean_nearest_taxon_distance(assemblage_memberships, num_randomization_replicates=1000, is_weighted_edge_distances=True, is_normalize_by_tree_size=False, is_skip_single_taxon_assemblages=False, null_model_type='taxa.label', rng=None)[source]¶
Returns the standardized effect size value for the MNTD statistic under a null model under various community compositions.
The S.E.S. is given by:
\[SES(statistic) = \frac{observed - mean(model_{null})}{sd(model_{null})}\]This removes any bias associated with the decrease in variance in the MPD statistic value as species richness increases to the point where communities become saturated. Equivalent to -1 times the Nearest Taxon Index when using phylogenetic distances.
In contrast to the function calculating the non-standardized effect size version of this statistic, which uses filter function to specify the subset of taxa to be considerd, here a collection of (multiple) sets of taxa constituting a community is specified. This to allow calculation of the null model statistic across all community sets for each randomization replicate.
- Parameters:
assemblage_memberships (iterable of iterable of
Taxon
objects) – A collection of collections, e.g. a list of sets, with the elements of each set beingTaxon
instances. Each set specifies the composition of a community. The standardized effect size of this statistic will be calculated for each community as specified by a set ofTaxon
instances.num_randomization_replicates (int) – Number of randomization replicates.
is_weighted_edge_distances (bool) – If
True
then edge lengths will be considered for distances. Otherwise, just the number of edges.
- Returns:
r (list of results) – A list of results, with each result corresponding to a community set given in
assemblage_memberships
. Each result consists of a named tuple with the following elements:obs : the observed value of the statistic
- null_model_meanthe mean value of the statistic under the null
model
- null_model_sdthe standard deviation of the statistic under
the null model
- zthe standardized effect value of the statistic
(= SES as defined in [1] above)
- pthe p-value of the observed value of the
statistic under the null model.
Examples
import dendropy tree = dendropy.Tree.get_from_path( src="data/community.tree.newick", schema="newick", rooting="force-rooted") pdm = dendropy.PhylogeneticDistanceMatrix.from_tree(tree) assemblage_memberships = pdm.assemblage_membership_definitions_from_csv("data/comm1.csv") results = pdm.standardized_effect_size_mean_nearest_taxon_distance(assemblage_memberships=assemblage_memberships) print(results)
- standardized_effect_size_mean_pairwise_distance(assemblage_memberships, num_randomization_replicates=1000, is_weighted_edge_distances=True, is_normalize_by_tree_size=False, is_skip_single_taxon_assemblages=False, null_model_type='taxa.label', rng=None)[source]¶
Returns the standardized effect size value for the MPD statistic under a null model under various community compositions.
The S.E.S. is given by:
\[SES(statistic) = \frac{observed - mean(model_{null})}{sd(model_{null})}\]This removes any bias associated with the decrease in variance in the MPD statistic value as species richness increases to the point where communities become saturated. Equivalent to -1 times the Nearest Relative Index (NRI) when using phylogenetic distances.
In contrast to the function calculating the non-standardized effect size version of this statistic, which uses filter function to specify the subset of taxa to be considerd, here a collection of (multiple) sets of taxa constituting a community is specified. This to allow calculation of the null model statistic across all community sets for each randomization replicate.
- Parameters:
assemblage_memberships (iterable of iterable of
Taxon
objects) – A collection of collections, e.g. a list of sets, with the elements of each set beingTaxon
instances. Each set specifies the composition of a community. The standardized effect size of this statistic will be calculated for each community as specified by a set ofTaxon
instances.num_randomization_replicates (int) – Number of randomization replicates.
is_weighted_edge_distances (bool) – If
True
then edge lengths will be considered for distances. Otherwise, just the number of edges.
- Returns:
r (list of results) – A list of results, with each result corresponding to a community set given in
assemblage_memberships
. Each result consists of a named tuple with the following elements:obs : the observed value of the statistic
- null_model_meanthe mean value of the statistic under the null
model
- null_model_sdthe standard deviation of the statistic under
the null model
- zthe standardized effect value of the statistic
(= SES as defined in [1] above)
p : the p-value of the observed value of the
Examples
import dendropy tree = dendropy.Tree.get_from_path( src="data/community.tree.newick", schema="newick", rooting="force-rooted") pdm = tree.phylogenetic_distance_matrix() assemblage_membership_definitions = pdm.assemblage_membership_definitions_from_csv("data/comm1.csv") results = pdm.standardized_effect_size_mean_pairwise_distance(assemblage_memberships=assemblage_membership_definitions.values()) print(results)
- sum_of_distances(is_weighted_edge_distances=True, is_normalize_by_tree_size=False)[source]¶
Returns sum of patristic distances on tree.
- taxon_iter(filter_fn=None)[source]¶
Iterates over taxa in matrix. Note that this could be a subset of the taxa in the associated taxon namespace.
- upgma_tree(is_weighted_edge_distances=True, tree_factory=None)[source]¶
Returns an Unweighted Pair Group Method with Arithmetic Mean (UPGMA) tree based on the distances in the matrix.
- Parameters:
is_weighted_edge_distances (bool) – If
True
then edge lengths will be considered for distances. Otherwise, just the number of edges.- Returns:
t (|Tree|) – A
Tree
instance corresponding to the UPGMA tree for this data.
Examples
import dendropy # Read data from a CSV file into a PhylogeneticDistanceMatrix # object with open("distance_matrix.csv") as src: pdm = dendropy.PhylogeneticDistanceMatrix.from_csv( src, is_first_row_column_names=True, is_first_column_row_names=True, is_allow_new_taxa=True, delimiter=",", ) # Calculate the tree upgma_tree = pdm.upgma_tree() # Print it print(upgma_tree.as_string("nexus"))