Taxonomy — Filtering

The Taxonomy submodule provides tools for validating and cleaning palaeobiological data against the PBDB taxonomic authority.

using PaleobiologyDB.Taxonomy

Combined taxonomic quality filter

drop_unqualified_taxa is the top-level function for cleaning occurrence DataFrames. It applies two independent filters in sequence: a resolution check (is the identification specific enough?) and a name-validity check (is the name actually in the PBDB taxonomy?).

using PaleobiologyDB, PaleobiologyDB.Taxonomy

df = pbdb_occurrences(base_name = "Canidae", interval = "Miocene", show = "full")

# Keep rows resolved and recognized at genus level
df_genus = drop_unqualified_taxa(df, "genus")

# Keep rows resolved and recognized at species level
# ("species" maps to the accepted_name column for the name check)
df_species = drop_unqualified_taxa(df, "species")

# In-place variant (modifies df directly)
drop_unqualified_taxa!(df, "family")

# Use live API validation instead of the local snapshot
df_clean = drop_unqualified_taxa(df, "genus"; validation_authority = :query)
PaleobiologyDB.Taxonomy.drop_unqualified_taxa (Function)
drop_unqualified_taxa(df, taxonomic_resolution) -> DataFrame

Return a filtered copy of df keeping only rows that are both resolved and recognized at the requested taxonomic level.

This is a convenience wrapper that combines two independent filters in sequence:

  1. Resolution filter (drop_unresolved_taxa): keeps rows where accepted_rank is at least as specific as taxonomic_resolution. For example, "genus" accepts rows whose accepted_rank is "genus", "subgenus", "species", or "subspecies". Rows with a missing accepted_rank are always dropped. If df has a column named after the rank (e.g. genus), that column must also be non-missing and non-empty.

  2. Name-validity filter (drop_unrecognized_taxa): keeps rows where the taxon name in the relevant column is found in the PBDB taxonomic authority. Names are checked against a locally cached snapshot of the full PBDB taxa list (downloaded on first use, refreshed every 30 days); pass validation_authority = :query to use live API calls instead.
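The two stages above can be sketched in plain Julia on a vector of named tuples. This is an illustration only, not the package's implementation: genus_or_finer, known_genera, is_resolved, and is_recognized are hypothetical names, and the two sets are toy stand-ins for the rank hierarchy and the PBDB snapshot.

```julia
# Toy stand-ins (assumptions): ranks accepted for "genus" resolution and a tiny name set.
genus_or_finer = Set(["genus", "subgenus", "species", "subspecies"])
known_genera = Set(["Epicyon", "Borophagus"])

# Stage 1: accepted_rank must be present and at least as specific as genus.
is_resolved(row) = !ismissing(row.accepted_rank) && row.accepted_rank in genus_or_finer

# Stage 2: the genus name must be non-missing, non-empty, and in the name set.
is_recognized(row) = !ismissing(row.genus) && !isempty(row.genus) && row.genus in known_genera

rows = [
    (accepted_rank = "species", genus = "Epicyon"),    # passes both stages
    (accepted_rank = "family",  genus = missing),      # too coarse
    (accepted_rank = "genus",   genus = "Notagenus"),  # name not in the set
    (accepted_rank = missing,   genus = "Borophagus"), # missing rank
]

# A row survives only if it passes both stages.
kept = filter(r -> is_resolved(r) && is_recognized(r), rows)
```

Only the first row survives here; each of the other rows fails one of the two stages.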

Column mapping

Most ranks map taxonomic_resolution directly to both the rank string and the DataFrame column of the same name:

taxonomic_resolution   rank string   taxon column
"genus"                "genus"       :genus
"family"               "family"      :family
"order"                "order"       :order

The one exception is "species", where the PBDB stores the full binomial in the accepted_name column rather than a column called species:

taxonomic_resolution   rank string   taxon column
"species"              "species"     :accepted_name
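As a minimal sketch, the mapping in the two tables could be written as a one-line helper. The name taxon_column is hypothetical and is not part of the documented API; it only restates the rule that "species" is the single exception.

```julia
# Hypothetical helper mirroring the column mapping described above:
# "species" uses :accepted_name, every other rank uses the column of the same name.
taxon_column(resolution::AbstractString) =
    resolution == "species" ? :accepted_name : Symbol(resolution)

genus_col = taxon_column("genus")
species_col = taxon_column("species")
```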

Keyword arguments

  • validation_authority — passed through to drop_unrecognized_taxa. :snapshot (default) uses the local cache; :query calls the live API.

Examples

using PaleobiologyDB, PaleobiologyDB.Taxonomy

df = pbdb_occurrences(base_name = "Canidae", interval = "Miocene", show = "full")

# Keep only rows resolved and recognized at genus level
df_genus = drop_unqualified_taxa(df, "genus")

# Keep only rows resolved and recognized at species level
# (checks accepted_rank ∈ {"species","subspecies"} AND accepted_name ∈ PBDB)
df_species = drop_unqualified_taxa(df, "species")

# Keep only rows resolved to family level, using live API validation
df_family = drop_unqualified_taxa(df, "family"; validation_authority = :query)

See also drop_unqualified_taxa! for the in-place variant, drop_unresolved_taxa, drop_unrecognized_taxa.

PaleobiologyDB.Taxonomy.drop_unqualified_taxa! (Function)
drop_unqualified_taxa!(df, taxonomic_resolution) -> DataFrame

In-place variant of drop_unqualified_taxa.

Applies both the resolution filter and the name-validity filter directly to df (no copy is made) and returns df.

The two filters applied in order are:

  1. Resolution filter — rows whose accepted_rank is coarser than taxonomic_resolution, or that have a missing accepted_rank, are removed. If a column named after the rank exists (e.g. genus), rows where that column is missing or empty are also removed.

  2. Name-validity filter — rows where the taxon name in the relevant column is not found in the PBDB taxonomic authority are removed. By default this uses a locally cached snapshot; pass validation_authority = :query to use live API calls.

Column mapping

"species" maps to the accepted_name column; all other string values map to the column of the same name (e.g. "genus" maps to :genus).

Keyword arguments

  • validation_authority: either :snapshot (default) or :query.

Examples

using PaleobiologyDB, PaleobiologyDB.Taxonomy

df = pbdb_occurrences(base_name = "Felidae", interval = "Pleistocene", show = "full")

# Filter in-place to genus-level resolved and recognized rows
drop_unqualified_taxa!(df, "genus")

# Filter in-place to species-level (uses accepted_name column for name check)
drop_unqualified_taxa!(df, "species")

See also drop_unqualified_taxa for the non-mutating variant.


Taxonomy resolution filter

These functions check that each row is identified to at least a given taxonomic rank, based on the accepted_rank column.

using PaleobiologyDB.Taxonomy

# Keep rows where accepted_rank is "genus" or finer ("subgenus", "species", "subspecies")
df_resolved = drop_unresolved_taxa(df, "genus")

# Equivalent shorthand using a column symbol
df_resolved = drop_unresolved_taxa(df, :genus)

# :accepted_name maps to "species" resolution
df_resolved = drop_unresolved_taxa(df, :accepted_name)

# In-place variant
drop_unresolved_taxa!(df, "family")
PaleobiologyDB.Taxonomy.drop_unresolved_taxa (Function)
drop_unresolved_taxa(df, taxonomic_rank) -> DataFrame

Return a filtered copy of df containing only rows that meet the minimum taxonomic resolution specified by taxonomic_rank.

Two criteria are applied:

  1. The accepted_rank column must be at taxonomic_rank or finer (more specific). For example, "genus" accepts "genus", "subgenus", "species", and "subspecies"; "family" additionally accepts "subfamily", "tribe", "subtribe". Rows with a missing accepted_rank are dropped.
  2. If df contains a column whose name matches taxonomic_rank (e.g. a genus or family column), that column must also be non-missing and non-empty.
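The first criterion amounts to a position comparison in an ordered rank list. The sketch below assumes the order matches the taxonomy_* column list documented for augment_taxonomy (truncated at "order"); rank_order and ranks_at_or_finer are hypothetical names, not package functions.

```julia
# Rank order from most specific to most general. This is an assumption: it is
# copied from the taxonomy_* column list in the augment_taxonomy documentation.
rank_order = ["subspecies", "species", "subgenus", "genus",
              "subtribe", "tribe", "subfamily", "family",
              "superfamily", "infraorder", "suborder", "order"]

# Return every rank that is at `rank` or finer (more specific).
function ranks_at_or_finer(rank, order = rank_order)
    i = findfirst(==(rank), order)
    i === nothing && throw(ArgumentError("unknown rank: $rank"))
    return order[1:i]
end

genus_ok = ranks_at_or_finer("genus")
family_ok = ranks_at_or_finer("family")
```

Under this ordering, the set for "family" is the set for "genus" plus "subtribe", "tribe", "subfamily", and "family" itself.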

Examples

# Keep only rows identified to genus level or finer
df_clean = drop_unresolved_taxa(df, "genus")

# Keep only rows identified to family level or finer
df_clean = drop_unresolved_taxa(df, "family")

# Works for any rank in the Linnaean hierarchy
df_clean = drop_unresolved_taxa(df, "order")
drop_unresolved_taxa(df, taxon_field::Symbol) -> DataFrame

Convenience form that accepts a DataFrame column name instead of a rank string.

The taxonomic rank is a data value stored in the accepted_rank column (for example, accepted_rank == "genus"), while taxon_field is the column that carries the identification at that rank (for example, genus == "Tyrannosaurus"). Passing :genus is therefore shorthand for drop_unresolved_taxa(df, "genus"): it keeps rows resolved to the taxonomic level of the data held in the given taxon_field.

The one special case is :accepted_name, which holds the full species binomial and so maps to "species" resolution.
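The translation from column symbol to rank string can be sketched as follows. rank_for_field is a hypothetical helper written for illustration; the package may implement the dispatch differently.

```julia
# Hypothetical helper: translate a taxon column name into the rank string
# used for the resolution check. :accepted_name is the only special case.
rank_for_field(field::Symbol) = field === :accepted_name ? "species" : String(field)

rank_from_genus = rank_for_field(:genus)
rank_from_name = rank_for_field(:accepted_name)
```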

Examples


# Same as: df_clean = drop_unresolved_taxa(df, "genus")
df_clean = drop_unresolved_taxa(df, :genus)

# Same as: df_clean = drop_unresolved_taxa(df, "family")
df_clean = drop_unresolved_taxa(df, :family)

# Same as: df_clean = drop_unresolved_taxa(df, "species")
df_clean = drop_unresolved_taxa(df, :accepted_name)

Taxonomy name-validity filter

These functions check taxon names against the PBDB taxonomy using either a local Scratch-managed snapshot (default, O(1) lookups after the initial download) or live API queries.

using PaleobiologyDB.Taxonomy

# Single-name check
istaxon("Pliosauridae")            # → true
istaxon("NO_FAMILY_SPECIFIED")     # → false

# Audit a DataFrame column
mask = audit_taxonomy(df, :family)
df[mask, :]

# Filter to recognized taxa only (non-mutating)
df_clean = drop_unrecognized_taxa(df, :family)

# In-place variant
drop_unrecognized_taxa!(df, :family)
PaleobiologyDB.Taxonomy.istaxon (Function)
istaxon(taxon_name; validation_authority=:snapshot)

Return true if taxon_name is a non-empty string recognised by the PBDB taxonomy.

Keyword arguments

  • validation_authority: either :snapshot (default) or :query.

    • :snapshot: looks up the name in a locally cached copy of the full PBDB taxa list (~200 MB, Scratch-managed). The snapshot is downloaded on first use and refreshed automatically when older than 30 days. After the initial load, lookups are O(1).
    • :query: calls pbdb_taxon(; name = taxon_name) directly. Results for valid names are cached by DataCaches. Slower for bulk use but always current.
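The :snapshot behaviour described above combines an age check on a cached file with a hash-based membership test. A minimal sketch of both ideas, using only Base Julia, might look like this. The names snapshot_is_stale, taxon_index, and name_is_known are hypothetical, and the three-name set stands in for the full taxa list.

```julia
# Hypothetical staleness rule: refresh when the file is missing or older than 30 days.
function snapshot_is_stale(path; max_age_days = 30)
    isfile(path) || return true
    age_seconds = time() - mtime(path)
    return age_seconds > max_age_days * 24 * 60 * 60
end

# Hypothetical in-memory index: a Set gives average constant-time membership tests.
taxon_index = Set(["Pliosauridae", "Canidae", "Felidae"])

# A name counts as known only if it is non-empty and present in the index.
name_is_known(name, index) = !isempty(name) && name in index

found = name_is_known("Pliosauridae", taxon_index)
not_found = name_is_known("NO_FAMILY_SPECIFIED", taxon_index)
```

Building the Set is the expensive step; after that, each lookup is independent of the size of the taxa list, which is why the snapshot mode suits bulk filtering.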

Examples

istaxon("Pliosauridae")                                  # → true
istaxon("NO_FAMILY_SPECIFIED")                           # → false
istaxon("Pliosauridae"; validation_authority = :query)  # live API call
PaleobiologyDB.Taxonomy.audit_taxonomy (Function)
audit_taxonomy(df, taxon_field; validation_authority=:snapshot)

Return a Vector{Bool} of length nrow(df) where true means the value in taxon_field for that row is a valid PBDB taxon name (non-missing, non-empty, and found in the database).

The result can be used directly with df[mask, :] or passed to drop_unrecognized_taxa.
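The shape of such a mask can be illustrated on a plain vector. This sketch does not call the package; family_column and known_families are toy stand-ins for a DataFrame column and the PBDB name list.

```julia
# Toy stand-ins: a column with missing, empty, and unknown values, and a small name set.
family_column = ["Canidae", missing, "", "NO_FAMILY_SPECIFIED", "Felidae"]
known_families = Set(["Canidae", "Felidae"])

# true only for non-missing, non-empty names present in the set.
mask = [!ismissing(x) && !isempty(x) && x in known_families for x in family_column]

# Boolean indexing keeps the entries where the mask is true.
kept = family_column[mask]
```

The short-circuiting && ensures that missing values never reach isempty or the membership test, so the mask contains only true and false.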

Keyword arguments

  • validation_authority — passed to istaxon.

Example

mask = audit_taxonomy(df, :family)
df[mask, :]
PaleobiologyDB.Taxonomy.drop_unrecognized_taxa (Function)
drop_unrecognized_taxa(df, taxon_field; validation_authority=:snapshot)

Return a filtered copy of df keeping only rows where taxon_field contains a PBDB-recognised taxon name (non-missing, non-empty, found in the database).

See audit_taxonomy for keyword argument semantics. See also drop_unrecognized_taxa! for the in-place variant.

Example

df_clean = drop_unrecognized_taxa(df, :family)

Taxonomy augmentation

augment_taxonomy enriches an occurrences DataFrame with the full taxonomic hierarchy for each row, resolved from the Scratch-cached PBDB taxa list.

using PaleobiologyDB, PaleobiologyDB.Taxonomy

df = pbdb_occurrences(base_name = "Carnivora", interval = "Miocene", limit = 500)

# Add taxonomy_genus, taxonomy_family, …, taxonomy_kingdom, taxonomy_clades columns
df2 = augment_taxonomy(df)

# Filter for a specific subfamily
df2[.!ismissing.(df2.taxonomy_subfamily) .&& df2.taxonomy_subfamily .== "Borophaginae", :]

# Inspect a taxonomy string
df2.taxonomy_clades[1]
# → "Animalia > Chordata > Mammalia > Carnivora > Canidae > Borophaginae > Epicyon"
PaleobiologyDB.Taxonomy.augment_taxonomy (Function)
augment_taxonomy(df; nodata=missing, fieldname_prefix="taxonomy_", taxonomy_separator=" > ") -> DataFrame

Return a copy of df with one column per taxonomic rank in PBDB_RANK_HIERARCHY and a combined taxonomy string column, all resolved from the Scratch-cached PBDB taxa list snapshot.

New columns (using default prefix "taxonomy_")

One column per rank, from most specific to most general:

taxonomy_subspecies  taxonomy_species  taxonomy_subgenus  taxonomy_genus
taxonomy_subtribe    taxonomy_tribe    taxonomy_subfamily taxonomy_family
taxonomy_superfamily taxonomy_infraorder taxonomy_suborder taxonomy_order
taxonomy_superorder  taxonomy_infraclass taxonomy_subclass taxonomy_class
taxonomy_superclass  taxonomy_subphylum  taxonomy_phylum   taxonomy_kingdom

Plus a summary column:

taxonomy_clades — non-missing/non-empty rank values joined by `taxonomy_separator`,
                  ordered from most general (kingdom) to most specific (subspecies).

Data source

Each row is resolved by looking up its accepted_name value in the hierarchy index built from the Scratch-managed PBDB taxa list snapshot (same file used by drop_unrecognized_taxa). The snapshot is downloaded on first use and refreshed automatically when older than 30 days. If accepted_name is missing or unrecognised, all new columns for that row are set to nodata.
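How a combined clades string could be assembled from per-rank values is sketched below. The function clades_string and the rank_values data are hypothetical; the sketch assumes rank values arrive ordered from most specific to most general, as in the column list above, and are reversed before joining.

```julia
# Hypothetical per-row rank values, ordered from most specific to most general.
rank_values = [
    "subspecies" => missing,
    "species"    => missing,
    "genus"      => "Epicyon",
    "subfamily"  => "Borophaginae",
    "family"     => "Canidae",
    "order"      => "Carnivora",
    "class"      => "Mammalia",
    "phylum"     => "Chordata",
    "kingdom"    => "Animalia",
]

# Drop missing or empty ranks, then join from most general to most specific.
function clades_string(pairs_by_rank; separator = " > ")
    present = [v for (_, v) in pairs_by_rank if !ismissing(v) && !isempty(v)]
    return join(reverse(present), separator)
end

clades = clades_string(rank_values)
# → "Animalia > Chordata > Mammalia > Carnivora > Canidae > Borophaginae > Epicyon"
```

Ranks without a value are skipped rather than rendered as placeholders, so the string length varies with how finely each row is resolved.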

Keyword arguments

  • nodata: value written for unknown or unresolvable ranks (default: missing)
  • fieldname_prefix: prefix applied to every new column name (default: "taxonomy_")
  • taxonomy_separator: string used to join rank values in the combined clades column, which is taxonomy_clades with the default prefix (default: " > ")

Examples

using PaleobiologyDB, PaleobiologyDB.Taxonomy

df = pbdb_occurrences(base_name = "Carnivora", interval = "Miocene", limit = 500)

df2 = augment_taxonomy(df)

# Filter for a specific subfamily
borophaginae = df2[
    .!ismissing.(df2.taxonomy_subfamily) .&& df2.taxonomy_subfamily .== "Borophaginae",
    :,
]

# Inspect a taxonomy string
df2.taxonomy_clades[1]
# → "Animalia > Chordata > Mammalia > Carnivora > Canidae > Borophaginae > Epicyon"

# Use a different fill value
df3 = augment_taxonomy(df; nodata = "")

See also PBDB_RANK_HIERARCHY.


Taxonomic rank hierarchy