Taxonomy tree graphs

taxon_subtree materialises a rooted taxonomic subtree as an explicit directed graph — a TaxonomyTree wrapping a Graphs.jl SimpleDiGraph. Where child_taxa and parent_taxa return flat name lists, taxon_subtree preserves the full parent–child structure and enables graph algorithms, tree traversals, and subtree visualisation.

using PaleobiologyDB, PaleobiologyDB.Taxonomy
import Graphs

Building a subtree

# Full descendant subtree of Carnivora (every rank, every node)
tree = taxon_subtree("Carnivora")

Graphs.nv(tree.graph)    # → total number of taxa
Graphs.ne(tree.graph)    # → total number of parent → child edges

# The root node
root_taxon(tree).name    # → "Carnivora"
root_taxon(tree).rank    # → "order"

Truncating at a leaf rank

Pass leaf_rank to stop the traversal at a given rank. By default (strict_leaf_rank = true), only nodes at exactly leaf_rank become leaves; taxa at finer ranks that are direct children of a coarser node (a PBDB data pattern common in less well-resolved groups) are excluded. The intermediate ranks between the root and leaf_rank are retained as interior nodes.

# Carnivora subtree truncated at family level (strict default)
#   interior nodes: order, suborder, …, superfamily
#   leaf nodes: family-rank taxa only; orphaned genera/species excluded
tree = taxon_subtree("Carnivora"; leaf_rank = "family")
Graphs.nv(tree.graph)   # order + suborders + superfamilies + families
Graphs.ne(tree.graph)   # one edge per parent → child pair

# leaf_taxa returns exclusively family-rank nodes
all(n.rank == "family" for n in leaf_taxa(tree))   # → true

# Genus-level subtree of Canidae (strict default)
#   interior: family, subfamily, tribe, …
#   leaves: genus-rank taxa only
t2 = taxon_subtree("Canidae"; leaf_rank = "genus")

# The leaf nodes are exactly the genera
leaf_taxa(t2) .|> (n -> n.name)
# → ["Borophagus", "Canis", "Urocyon", "Vulpes", …]

# Non-strict: also include orphaned finer-ranked taxa as leaves
t3 = taxon_subtree("Pterosauria"; leaf_rank = "family", strict_leaf_rank = false)

# Without leaf_rank → full descendant tree to finest available rank
t4 = taxon_subtree("Canis"; leaf_rank = nothing)
leaf_taxa(t4) |> length   # number of species under Canis
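
The difference between strict and non-strict truncation can be illustrated on a toy hierarchy. The sketch below does not call the package: `truncate_names`, the `toy` dictionary, and the four-rank ordering are hypothetical stand-ins, and the traversal rule is an assumption inferred from the description of leaf_rank and strict_leaf_rank, not the package's actual implementation.

```julia
# Toy rank ordering, coarse to fine (a small subset of the ranks PBDB uses).
rank_order = ["order", "family", "genus", "species"]

# Hypothetical mini-hierarchy: name => (rank, children).
toy = Dict(
    "OrderA"  => ("order",  ["FamilyB", "GenusX"]),  # GenusX skips the family level
    "FamilyB" => ("family", ["GenusC"]),
    "GenusC"  => ("genus",  String[]),
    "GenusX"  => ("genus",  String[]),
)

# Collect the names kept when the traversal from `root` stops at `leaf_rank`.
function truncate_names(hier, ranks, root; leaf_rank, strict = true)
    limit = findfirst(==(leaf_rank), ranks)
    kept  = [root]
    stack = [root]
    while !isempty(stack)
        parent = pop!(stack)
        for child in hier[parent][2]
            idx = findfirst(==(hier[child][1]), ranks)
            if idx < limit          # coarser than leaf_rank: interior node, keep descending
                push!(kept, child)
                push!(stack, child)
            elseif idx == limit     # exactly leaf_rank: keep as a leaf and stop
                push!(kept, child)
            elseif !strict          # finer than leaf_rank: kept only in non-strict mode
                push!(kept, child)
            end
        end
    end
    return sort(kept)
end

strict_names = truncate_names(toy, rank_order, "OrderA"; leaf_rank = "family")
loose_names  = truncate_names(toy, rank_order, "OrderA"; leaf_rank = "family", strict = false)

println(strict_names)   # ["FamilyB", "OrderA"]
println(loose_names)    # ["FamilyB", "GenusX", "OrderA"]
```

In the toy data, GenusX hangs directly off the order, so it is dropped in strict mode and kept as an extra leaf in non-strict mode. GenusC is never reached because the traversal stops at FamilyB.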

Accessor functions

tree = taxon_subtree("Carnivora"; leaf_rank = "family")

# Root node
r = root_taxon(tree)
r.name      # → "Carnivora"
r.rank      # → "order"
r.pbdb_id   # → PBDB orig_no

# Leaf nodes (sorted by name)
leaves = leaf_taxa(tree)
leaves .|> (n -> n.name)   # → ["Ailuridae", "Amphicyonidae", "Canidae", …]

# All nodes at a specific rank (sorted by name)
taxa_at_rank(tree, "family")    # same as leaf_taxa when leaf_rank = "family"

# In a full (untruncated) tree, taxa_at_rank selects any rank
full_tree = taxon_subtree("Carnivora")
taxa_at_rank(full_tree, "genus")    |> length   # number of genera
taxa_at_rank(full_tree, "species")  |> length   # number of species

TaxonNode fields

Every node in the tree is a TaxonNode carrying the full set of fields from the PBDB taxa list snapshot:

node = root_taxon(tree)

node.name        # accepted taxon name (String)
node.rank        # rank string (String, e.g. "order")
node.pbdb_id     # PBDB orig_no (Int)
node.accepted_id # PBDB accepted_no (Union{Int,Missing})
                 #   == pbdb_id  for accepted (non-synonym) taxa
                 #   != pbdb_id  for synonyms — points to the valid name's orig_no
                 #   missing     when not recorded in the snapshot
node.parent_id   # parent orig_no, or missing for the subtree root (Union{Int,Missing})
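
The three cases for accepted_id can be told apart with a small helper. This is a minimal sketch under stated assumptions: `DemoNode` is a hypothetical stand-in that mirrors the documented TaxonNode fields, `name_status` is an illustrative helper that is not part of the package, and the ids are made up.

```julia
# Hypothetical stand-in mirroring the documented TaxonNode fields.
struct DemoNode
    name::String
    rank::String
    pbdb_id::Int
    accepted_id::Union{Int,Missing}
    parent_id::Union{Int,Missing}
end

# Classify a node from accepted_id. ismissing is checked first because
# `missing == x` evaluates to `missing`, which cannot be used as a condition.
function name_status(n::DemoNode)
    ismissing(n.accepted_id) && return :unknown
    return n.accepted_id == n.pbdb_id ? :accepted : :synonym
end

nodes = [
    DemoNode("Alpha", "genus", 10, 10, 1),        # accepted name
    DemoNode("Beta",  "genus", 11, 10, 1),        # synonym pointing at Alpha
    DemoNode("Gamma", "genus", 12, missing, 1),   # status not recorded
]

statuses = [name_status(n) for n in nodes]
println(statuses)   # [:accepted, :synonym, :unknown]
```

The same comparison should apply to real nodes, since it only reads the pbdb_id and accepted_id fields.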

Using the graph with Graphs.jl

The .graph field is a standard Graphs.SimpleDiGraph{Int}, so any algorithm from Graphs.jl that accepts an AbstractGraph works directly:

import Graphs

tree = taxon_subtree("Carnivora"; leaf_rank = "family")
g    = tree.graph

# Basic metrics
Graphs.nv(g)    # number of vertices
Graphs.ne(g)    # number of edges
Graphs.is_directed(g)   # → true (edges run parent → child)

# Neighbours of the root
Graphs.outneighbors(g, tree.root)   # vertex indices of root's children
Graphs.inneighbors(g, tree.root)    # → Int[] (root has no parent in subtree)

# Look up a vertex by PBDB id, then continue from it
v = tree.vertex_of[tree.taxa[1].pbdb_id]   # vertex index for the node with this pbdb_id
Graphs.outneighbors(g, v)                  # children of that vertex
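
As an illustration of running Graphs.jl algorithms on such a graph, the sketch below builds a small directed graph by hand as a stand-in for tree.graph (the five vertices and their edges are made up) and computes the depth of every vertex below the root and the set of vertices in one subtree. It assumes only that edges run parent → child, as stated above.

```julia
import Graphs

# Hand-built stand-in for tree.graph: vertex 1 is the root, edges run parent → child.
g = Graphs.SimpleDiGraph(5)
Graphs.add_edge!(g, 1, 2)
Graphs.add_edge!(g, 1, 3)
Graphs.add_edge!(g, 2, 4)
Graphs.add_edge!(g, 2, 5)

# Depth of every vertex below the root (breadth-first distance along edges).
depths = Graphs.gdistances(g, 1)

# Vertices in the subtree under vertex 2: bfs_parents marks unreachable vertices with 0.
parents   = Graphs.bfs_parents(g, 2)
under_two = findall(!=(0), parents)

println(depths)      # [0, 1, 1, 2, 2]
println(under_two)   # [2, 4, 5]
```

With a real tree, the resulting vertex indices can be mapped back to nodes through tree.taxa.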

Connecting the tree to an occurrence DataFrame

A common workflow is to build a clade tree and then map occurrences onto it to see which sub-groups are represented in a dataset:

using PaleobiologyDB, PaleobiologyDB.Taxonomy
import Graphs

# Fetch occurrences and build the family tree
df   = pbdb_occurrences(base_name = "Carnivora", interval = "Miocene", show = "full")
tree = taxon_subtree("Carnivora"; leaf_rank = "family")

# Which families appear in the occurrence data?
occ_families = Set(skipmissing(df.family))

sampled = [n for n in leaf_taxa(tree) if n.name in occ_families]
missing_ = [n for n in leaf_taxa(tree) if n.name ∉ occ_families]

println("$(length(sampled)) of $(length(leaf_taxa(tree))) families sampled in the Miocene")
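
The same split can be tried without network access on toy data. In this sketch the two vectors are hypothetical stand-ins for the family names from leaf_taxa(tree) and for the df.family column (including a missing entry); it also tallies occurrences per sampled family.

```julia
# Hypothetical stand-ins for the leaf family names and the occurrence column.
tree_families = ["Ailuridae", "Canidae", "Felidae", "Ursidae"]
occ_column    = ["Canidae", missing, "Felidae", "Canidae"]

# Families present in the occurrences, ignoring missing entries.
occ_families = Set(skipmissing(occ_column))

sampled   = filter(in(occ_families), tree_families)
unsampled = filter(!in(occ_families), tree_families)

# Number of occurrences per sampled family; isequal treats missing as an ordinary value.
counts = Dict(f => count(isequal(f), occ_column) for f in sampled)

println(sampled)     # ["Canidae", "Felidae"]
println(unsampled)   # ["Ailuridae", "Ursidae"]
```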

PaleobiologyDB.Taxonomy.TaxonNode — Type
TaxonNode

A single node in a taxonomic tree, carrying the full set of identity fields from the PBDB taxa list snapshot.

Fields

  • name::String — accepted taxon name (as in PBDB)
  • rank::String — taxonomic rank (e.g. "genus", "family")
  • pbdb_id::Int — PBDB orig_no integer identifier
  • accepted_id::Union{Int,Missing} — PBDB accepted_no; equals pbdb_id for accepted (non-synonym) taxa, points to the valid name for synonyms, missing when not recorded in the snapshot
  • parent_id::Union{Int,Missing} — orig_no of the parent node, or missing when this node is the root of the subtree

Construction

TaxonNode("Canis", "genus", 41045, 41045, 2)
#          name     rank     pbdb_id accepted_id parent_id

See also TaxonomyTree, taxon_subtree.

PaleobiologyDB.Taxonomy.TaxonomyTree — Type
TaxonomyTree

A rooted, directed tree representing a taxonomic subtree extracted from the PBDB taxa list snapshot.

Fields

  • graph::Graphs.SimpleDiGraph{Int} — directed graph; edges run parent → child. Vertices are integers in 1 .. Graphs.nv(graph).
  • taxa::Vector{TaxonNode} — taxa[v] is the TaxonNode for vertex v.
  • vertex_of::Dict{Int,Int} — maps PBDB orig_no → vertex index, allowing O(1) lookup by numeric PBDB identifier.
  • root::Int — vertex index of the root node (always 1).

Working with Graphs.jl

TaxonomyTree wraps a standard Graphs.SimpleDiGraph, so any function from Graphs.jl that accepts an AbstractGraph works directly on tree.graph:

using PaleobiologyDB, PaleobiologyDB.Taxonomy
import Graphs

t = taxon_subtree("Carnivora"; leaf_rank = "family")

Graphs.nv(t.graph)                         # number of nodes
Graphs.ne(t.graph)                         # number of edges
Graphs.outneighbors(t.graph, t.root)       # vertex indices of root's children
Graphs.is_tree(Graphs.SimpleGraph(t.graph))   # true for a valid subtree (is_tree expects an undirected graph)
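
Because edges run parent → child, the lineage of any vertex can be recovered by following in-neighbours up to the root. The sketch below uses a hand-built graph and a plain vector of names as hypothetical stand-ins for tree.graph and tree.taxa; `lineage` is an illustrative helper, not a package function.

```julia
import Graphs

# Hypothetical stand-ins for tree.graph and the names stored in tree.taxa.
g = Graphs.SimpleDiGraph(4)
Graphs.add_edge!(g, 1, 2)
Graphs.add_edge!(g, 2, 3)
Graphs.add_edge!(g, 1, 4)
taxon_names = ["OrderA", "FamilyB", "GenusC", "FamilyD"]

# Climb from vertex v to the root. In a tree every non-root vertex has
# exactly one in-neighbour, and the root has none.
function lineage(g, v)
    path = [v]
    while !isempty(Graphs.inneighbors(g, v))
        v = first(Graphs.inneighbors(g, v))
        push!(path, v)
    end
    return reverse(path)   # root first
end

println(taxon_names[lineage(g, 3)])   # ["OrderA", "FamilyB", "GenusC"]
```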

See also taxon_subtree, TaxonNode.

PaleobiologyDB.Taxonomy.taxon_subtree — Function
taxon_subtree(taxon_name; leaf_rank=nothing, strict_leaf_rank=true) -> TaxonomyTree

Build and return a TaxonomyTree rooted at taxon_name, descending through the taxonomic hierarchy down to (and including) leaf_rank.

The tree is derived from the Scratch-cached PBDB taxa list snapshot; no network requests are made. The snapshot is downloaded on first use and refreshed automatically when older than 30 days.

Arguments

  • taxon_name::AbstractString — accepted taxon name exactly as it appears in PBDB (e.g. "Carnivora", "Canidae", "Canis").

  • leaf_rank::Union{AbstractString,Nothing} (keyword, default nothing) — the rank at which to stop recursing. Must be one of:

    "subspecies" "species" "subgenus" "genus" "subtribe" "tribe"
    "subfamily" "family" "superfamily" "infraorder" "suborder" "order"
    "superorder" "infraclass" "subclass" "class" "superclass"
    "subphylum" "phylum" "kingdom"

    When nothing (default), the entire descendant subtree is collected. When given, nodes at leaf_rank become leaves of the returned tree; their children in the full PBDB tree are not included. Intermediate ranks between the root rank and leaf_rank are included as interior nodes.

  • strict_leaf_rank::Bool (keyword, default true) — controls how nodes at ranks finer than leaf_rank are handled.

    In real PBDB data, taxa at finer ranks (e.g. a genus or species) are sometimes direct children of coarse-rank nodes (e.g. an order) without an intervening family. When strict_leaf_rank = true (default), such nodes are excluded entirely from the returned tree — the leaves will all be at exactly leaf_rank (or at an unranked-clade rank if no ranked leaf was reachable). When strict_leaf_rank = false, finer-ranked nodes are included as leaf nodes, preserving all PBDB parent–child edges.

    Has no effect when leaf_rank is nothing.

Returns

A TaxonomyTree rooted at the named taxon. Returns a single-node tree (root only, no edges) when taxon_name is not found in the snapshot.

Throws ArgumentError if leaf_rank is not a valid PBDB rank string.

Examples

using PaleobiologyDB, PaleobiologyDB.Taxonomy
import Graphs

# Full subtree of Carnivora (every descendant at every rank)
t = taxon_subtree("Carnivora")
Graphs.nv(t.graph)           # thousands of nodes
root_taxon(t).rank           # "order"

# Strict default: leaves are exactly families; orphaned genera excluded
t2 = taxon_subtree("Carnivora"; leaf_rank = "family")
leaf_taxa(t2) .|> (n -> n.name)   # ["Ailuridae", "Canidae", "Felidae", …]
all(n.rank == "family" for n in leaf_taxa(t2))   # true

# Non-strict: orphaned genera/species included as leaves
t3 = taxon_subtree("Pterosauria"; leaf_rank = "family", strict_leaf_rank = false)
# taxa at genus or species rank parented directly to order appear as leaves

# Genus-level subtree of Canidae
t4 = taxon_subtree("Canidae"; leaf_rank = "genus")

# Unknown taxon → single-node tree
t5 = taxon_subtree("INVALID")
Graphs.nv(t5.graph)          # 1

See also root_taxon, leaf_taxa, taxa_at_rank, child_taxa, TaxonomyTree.

PaleobiologyDB.Taxonomy.leaf_taxa — Function
leaf_taxa(tree::TaxonomyTree) -> Vector{TaxonNode}

Return all leaf nodes of tree — vertices with no outgoing edges (no children in the subtree) — sorted by name.

When taxon_subtree was called with a leaf_rank, the leaves are typically the nodes at exactly leaf_rank. The exception is a node at a coarser rank that has no leaf_rank-level descendants: it is kept in the tree but has no children, so it also appears as a leaf. Without leaf_rank, the leaves are the most finely resolved taxa included in the tree.

Examples

t = taxon_subtree("Carnivora"; leaf_rank = "family")
leaf_taxa(t) .|> (n -> n.name)   # ["Ailuridae", "Canidae", …]
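
The "no outgoing edges" definition, and the exception for childless coarser-rank nodes, can be reproduced on a hand-built graph. This is a sketch on made-up data: the graph and the name vector are hypothetical stand-ins for tree.graph and tree.taxa, and the filter shown is an assumption about what leaf_taxa computes, based on the description above.

```julia
import Graphs

# Toy family-level tree: SuborderY has no family-rank descendants.
g = Graphs.SimpleDiGraph(4)
Graphs.add_edge!(g, 1, 2)   # OrderA → SuborderX
Graphs.add_edge!(g, 2, 3)   # SuborderX → FamilyB
Graphs.add_edge!(g, 1, 4)   # OrderA → SuborderY
taxon_names = ["OrderA", "SuborderX", "FamilyB", "SuborderY"]

# Leaves are the vertices with no outgoing edges.
leaf_vertices = [v for v in Graphs.vertices(g) if Graphs.outdegree(g, v) == 0]
leaf_names    = sort(taxon_names[leaf_vertices])

println(leaf_names)   # ["FamilyB", "SuborderY"]
```

SuborderY shows up among the leaves even though it is not at family rank, which is why checking ranks with taxa_at_rank can be a safer way to select exactly one rank.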

See also root_taxon, taxa_at_rank.

PaleobiologyDB.Taxonomy.taxa_at_rank — Function
taxa_at_rank(tree::TaxonomyTree, rank::AbstractString) -> Vector{TaxonNode}

Return all nodes in tree whose rank field equals rank, sorted by name.

Returns an empty vector when no nodes at rank are present.

Throws ArgumentError if rank is not a valid PBDB rank string.

Examples

t = taxon_subtree("Carnivora")
taxa_at_rank(t, "family") .|> (n -> n.name)  # ["Ailuridae", "Canidae", …]
taxa_at_rank(t, "genus")  |> length          # number of genera in Carnivora

See also root_taxon, leaf_taxa.