Taxonomy — Filtering
The Taxonomy submodule provides tools for validating and cleaning palaeobiological data against the PBDB taxonomic authority.
using PaleobiologyDB.Taxonomy
Combined taxonomic quality filter
drop_unqualified_taxa is the top-level function for cleaning occurrence DataFrames. It applies two independent filters in sequence: a resolution check (is the identification specific enough?) and a name-validity check (is the name actually in the PBDB taxonomy?).
using PaleobiologyDB, PaleobiologyDB.Taxonomy
df = pbdb_occurrences(base_name = "Canidae", interval = "Miocene", show = "full")
# Keep rows resolved and recognized at genus level
df_genus = drop_unqualified_taxa(df, "genus")
# Keep rows resolved and recognized at species level
# ("species" maps to the accepted_name column for the name check)
df_species = drop_unqualified_taxa(df, "species")
# In-place variant (modifies df directly)
drop_unqualified_taxa!(df, "family")
# Use live API validation instead of the local snapshot
df_clean = drop_unqualified_taxa(df, "genus"; validation_authority = :query)
PaleobiologyDB.Taxonomy.drop_unqualified_taxa — Function
drop_unqualified_taxa(df, taxonomic_resolution) -> DataFrame
Return a filtered copy of df keeping only rows that are both resolved and recognized at the requested taxonomic level.
This is a convenience wrapper that combines two independent filters in sequence:
- Resolution filter (drop_unresolved_taxa): keeps rows where accepted_rank is at least as specific as taxonomic_resolution. For example, "genus" accepts rows whose accepted_rank is "genus", "subgenus", "species", or "subspecies". Rows with a missing accepted_rank are always dropped. If df has a column named after the rank (e.g. genus), that column must also be non-missing and non-empty.
- Name-validity filter (drop_unrecognized_taxa): keeps rows where the taxon name in the relevant column is found in the PBDB taxonomic authority. Names are checked against a locally cached snapshot of the full PBDB taxa list (downloaded on first use, refreshed when older than 30 days); pass validation_authority = :query to use live API calls instead.
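The two-step sequence described above can be sketched in plain Julia. This is an illustration under stated assumptions, not the package's implementation: rows are NamedTuples rather than a DataFrame, known_names is a hypothetical stand-in for the PBDB snapshot, and genus_or_finer hard-codes the "genus" case.

```julia
# Illustrative sketch only; not the PaleobiologyDB.Taxonomy implementation.
# Hypothetical stand-in for the PBDB taxa snapshot.
known_names = Set(["Epicyon", "Borophagus"])

# Ranks at least as specific as "genus".
genus_or_finer = Set(["genus", "subgenus", "species", "subspecies"])

rows = [
    (accepted_rank = "species", genus = "Epicyon"),     # passes both filters
    (accepted_rank = "family",  genus = missing),       # too coarse
    (accepted_rank = "genus",   genus = "Notagenus"),   # name not in the snapshot
    (accepted_rank = missing,   genus = "Borophagus"),  # missing rank
]

# Step 1: resolution filter (rank fine enough, rank column filled in).
is_resolved(r) = !ismissing(r.accepted_rank) && r.accepted_rank in genus_or_finer &&
                 !ismissing(r.genus) && !isempty(r.genus)

# Step 2: name-validity filter (name present in the authority).
is_recognized(r) = !ismissing(r.genus) && r.genus in known_names

kept = filter(is_recognized, filter(is_resolved, rows))
```

Only the first row survives both steps. The filters are independent, so each can also be applied on its own.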
Column mapping
Most ranks map taxonomic_resolution directly to both the rank string and the DataFrame column of the same name:
| taxonomic_resolution | rank string | taxon column |
|---|---|---|
| "genus" | "genus" | :genus |
| "family" | "family" | :family |
| "order" | "order" | :order |
| … | … | … |
The one exception is "species", where the PBDB stores the full binomial in the accepted_name column rather than a column called species:
| taxonomic_resolution | rank string | taxon column |
|---|---|---|
| "species" | "species" | :accepted_name |
Keyword arguments
- validation_authority: passed through to drop_unrecognized_taxa. :snapshot (default) uses the local cache; :query calls the live API.
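The column mapping described above reduces to a one-line rule. The helper below is a hypothetical illustration (taxon_column is not a documented function of the package); it only mirrors the tables.

```julia
# Hypothetical helper mirroring the documented column mapping.
# "species" is the single exception: the binomial lives in accepted_name.
taxon_column(resolution::AbstractString) =
    resolution == "species" ? :accepted_name : Symbol(resolution)

genus_col   = taxon_column("genus")
species_col = taxon_column("species")
```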
Examples
using PaleobiologyDB, PaleobiologyDB.Taxonomy
df = pbdb_occurrences(base_name = "Canidae", interval = "Miocene", show = "full")
# Keep only rows resolved and recognized at genus level
df_genus = drop_unqualified_taxa(df, "genus")
# Keep only rows resolved and recognized at species level
# (checks accepted_rank ∈ {"species","subspecies"} AND accepted_name ∈ PBDB)
df_species = drop_unqualified_taxa(df, "species")
# Keep only rows resolved to family level, using live API validation
df_family = drop_unqualified_taxa(df, "family"; validation_authority = :query)
See also drop_unqualified_taxa! for the in-place variant, drop_unresolved_taxa, drop_unrecognized_taxa.
PaleobiologyDB.Taxonomy.drop_unqualified_taxa! — Function
drop_unqualified_taxa!(df, taxonomic_resolution) -> DataFrame
In-place variant of drop_unqualified_taxa.
Applies both the resolution filter and the name-validity filter directly to df (no copy is made) and returns df.
The two filters applied in order are:
- Resolution filter: rows whose accepted_rank is coarser than taxonomic_resolution, or that have a missing accepted_rank, are removed. If a column named after the rank exists (e.g. genus), rows where that column is missing or empty are also removed.
- Name-validity filter: rows where the taxon name in the relevant column is not found in the PBDB taxonomic authority are removed. By default this uses a locally cached snapshot; pass validation_authority = :query to use live API calls.
Column mapping
"species" maps to the accepted_name column; all other string values map to the column of the same name (e.g. "genus" → :genus).
Keyword arguments
- validation_authority: :snapshot (default) or :query.
Examples
using PaleobiologyDB, PaleobiologyDB.Taxonomy
df = pbdb_occurrences(base_name = "Felidae", interval = "Pleistocene", show = "full")
# Filter in-place to genus-level resolved and recognized rows
drop_unqualified_taxa!(df, "genus")
# Filter in-place to species-level (uses accepted_name column for name check)
drop_unqualified_taxa!(df, "species")
See also drop_unqualified_taxa for the non-mutating variant.
Taxonomy resolution filter
These functions check that each row is identified to at least a given taxonomic rank, based on the accepted_rank column.
using PaleobiologyDB.Taxonomy
# Keep rows where accepted_rank is "genus", "subgenus", "species", or "subspecies"
df_resolved = drop_unresolved_taxa(df, "genus")
# Equivalent shorthand using a column symbol
df_resolved = drop_unresolved_taxa(df, :genus)
# :accepted_name maps to "species" resolution
df_resolved = drop_unresolved_taxa(df, :accepted_name)
# In-place variant
drop_unresolved_taxa!(df, "family")
PaleobiologyDB.Taxonomy.drop_unresolved_taxa — Function
drop_unresolved_taxa(df, taxonomic_rank) -> DataFrame
Return a filtered copy of df containing only rows that meet the minimum taxonomic resolution specified by taxonomic_rank.
Two criteria are applied:
- The accepted_rank column must be at taxonomic_rank or finer (more specific). For example, "genus" accepts "genus", "subgenus", "species", and "subspecies"; "family" additionally accepts "subfamily", "tribe", and "subtribe". Rows with a missing accepted_rank are dropped.
- If df contains a column whose name matches taxonomic_rank (e.g. a genus or family column), that column must also be non-missing and non-empty.
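The first criterion amounts to comparing positions in an ordered rank list. The sketch below is a minimal illustration, assuming the rank order given by the per-rank columns documented for augment_taxonomy. The names ranks and meets_resolution are hypothetical, and treating an unknown rank as unresolved is an assumption rather than documented behaviour.

```julia
# Illustrative sketch only; not the package's implementation.
# Local copy of the rank order, most specific first.
ranks = ["subspecies", "species", "subgenus", "genus",
         "subtribe", "tribe", "subfamily", "family",
         "superfamily", "infraorder", "suborder", "order",
         "superorder", "infraclass", "subclass", "class",
         "superclass", "subphylum", "phylum", "kingdom"]

# True when accepted_rank is at min_rank or finer.
function meets_resolution(accepted_rank, min_rank::AbstractString)
    ismissing(accepted_rank) && return false          # missing ranks are dropped
    i = findfirst(==(accepted_rank), ranks)
    j = findfirst(==(min_rank), ranks)
    (i === nothing || j === nothing) && return false  # assumption: unknown rank fails
    return i <= j                                     # smaller index = more specific
end

genus_ok  = meets_resolution("subgenus", "genus")   # finer than genus
family_ok = meets_resolution("tribe", "family")     # finer than family
too_broad = meets_resolution("family", "genus")     # coarser than genus
```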
Examples
# Keep only rows identified to genus level or finer
df_clean = drop_unresolved_taxa(df, "genus")
# Keep only rows identified to family level or finer
df_clean = drop_unresolved_taxa(df, "family")
# Works for any rank in the Linnaean hierarchy
df_clean = drop_unresolved_taxa(df, "order")
drop_unresolved_taxa(df, taxon_field::Symbol) -> DataFrame
Convenience form that accepts a DataFrame column name instead of a rank string.
The taxonomic rank is technically a data value (:accepted_rank == "genus"), while taxon_field is the column that carries the identification result for that rank (:genus == "Tyrannosaurus"). Passing :genus here is therefore a shortcut for drop_unresolved_taxa(df, "genus"): it keeps rows resolved to the taxonomic level that the given taxon_field represents.
The one special case is :accepted_name, which holds the full species binomial and so maps to "species" resolution.
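The translation from column symbol to rank string can be pictured as follows. rank_for_field is a hypothetical name used only for illustration; the package may implement the mapping differently.

```julia
# Hypothetical helper: map a taxon column symbol to its rank string.
# :accepted_name is the one special case and means "species" resolution.
rank_for_field(field::Symbol) =
    field === :accepted_name ? "species" : String(field)

genus_rank   = rank_for_field(:genus)
species_rank = rank_for_field(:accepted_name)
```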
Examples
# Same as: df_clean = drop_unresolved_taxa(df, "genus")
df_clean = drop_unresolved_taxa(df, :genus)
# Same as: df_clean = drop_unresolved_taxa(df, "family")
df_clean = drop_unresolved_taxa(df, :family)
# Same as: df_clean = drop_unresolved_taxa(df, "species")
df_clean = drop_unresolved_taxa(df, :accepted_name)
PaleobiologyDB.Taxonomy.drop_unresolved_taxa! — Function
drop_unresolved_taxa!(df, taxonomic_rank) -> DataFrame
In-place version of drop_unresolved_taxa. Modifies df directly and returns it.
Taxonomy name-validity filter
These functions check taxon names against the PBDB taxonomy using either a local Scratch-managed snapshot (default, O(1) lookups after the initial download) or live API queries.
using PaleobiologyDB.Taxonomy
# Single-name check
istaxon("Pliosauridae") # → true
istaxon("NO_FAMILY_SPECIFIED") # → false
# Audit a DataFrame column
mask = audit_taxonomy(df, :family)
df[mask, :]
# Filter to recognized taxa only (non-mutating)
df_clean = drop_unrecognized_taxa(df, :family)
# In-place variant
drop_unrecognized_taxa!(df, :family)
PaleobiologyDB.Taxonomy.istaxon — Function
istaxon(taxon_name; validation_authority=:snapshot)
Return true if taxon_name is a non-empty string recognised by the PBDB taxonomy.
Keyword arguments
- validation_authority: :snapshot (default) or :query.
  - :snapshot looks up the name in a locally cached copy of the full PBDB taxa list (~200 MB, Scratch-managed). The snapshot is downloaded on first use and refreshed automatically when older than 30 days. After the initial load, lookups are O(1).
  - :query calls pbdb_taxon(; name = taxon_name) directly. Results for valid names are cached by DataCaches. Slower for bulk use but always current.
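The :snapshot behaviour rests on two simple ideas: a hash-based membership test and an age check on the cached file. The sketch below illustrates both with hypothetical names (snapshot_names, in_snapshot, is_stale). It does not reflect the package's actual storage format or refresh code.

```julia
# Illustrative sketch only; names and data are hypothetical.
using Dates

# A Set gives constant-time (on average) membership tests once it is built.
snapshot_names = Set(["Pliosauridae", "Canidae", "Felidae"])

# A name counts as valid only if it is non-missing, non-empty, and in the set.
in_snapshot(name) = !ismissing(name) && !isempty(name) && name in snapshot_names

# The cache is considered stale once it is older than 30 days.
max_age = Day(30)
is_stale(downloaded_at::DateTime, current::DateTime) = current - downloaded_at > max_age

recognized = in_snapshot("Pliosauridae")
placeholder = in_snapshot("NO_FAMILY_SPECIFIED")
old_cache = is_stale(DateTime(2024, 1, 1), DateTime(2024, 2, 15))
fresh_cache = is_stale(DateTime(2024, 1, 1), DateTime(2024, 1, 20))
```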
Examples
istaxon("Pliosauridae") # → true
istaxon("NO_FAMILY_SPECIFIED") # → false
istaxon("Pliosauridae"; validation_authority = :query) # live API call
PaleobiologyDB.Taxonomy.audit_taxonomy — Function
audit_taxonomy(df, taxon_field; validation_authority=:snapshot)
Return a Vector{Bool} of length nrow(df) where true means the value in taxon_field for that row is a valid PBDB taxon name (non-missing, non-empty, and found in the database).
The result can be used directly with df[mask, :] or passed to drop_unrecognized_taxa.
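Because the result is a plain Boolean mask, negating it shows which values would be rejected, which is often the more useful view when auditing. A minimal sketch on an ordinary vector, with a hypothetical known set standing in for the PBDB lookup:

```julia
# Illustrative sketch only; `known` is a hypothetical stand-in for the PBDB lookup.
known = Set(["Canidae", "Felidae"])
family = ["Canidae", missing, "", "NO_FAMILY_SPECIFIED", "Felidae"]

# true = non-missing, non-empty, and found in the authority.
mask = [!ismissing(x) && !isempty(x) && x in known for x in family]

accepted = family[mask]     # values that pass the audit
rejected = family[.!mask]   # values that would be dropped
```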
Keyword arguments
- validation_authority: passed to istaxon.
Example
mask = audit_taxonomy(df, :family)
df[mask, :]
PaleobiologyDB.Taxonomy.drop_unrecognized_taxa — Function
drop_unrecognized_taxa(df, taxon_field; validation_authority=:snapshot)
Return a filtered copy of df keeping only rows where taxon_field contains a PBDB-recognised taxon name (non-missing, non-empty, found in the database).
See audit_taxonomy for keyword argument semantics. See also drop_unrecognized_taxa! for the in-place variant.
Example
df_clean = drop_unrecognized_taxa(df, :family)
PaleobiologyDB.Taxonomy.drop_unrecognized_taxa! — Function
drop_unrecognized_taxa!(df, taxon_field; validation_authority=:snapshot)
In-place variant of drop_unrecognized_taxa. Removes rows from df where taxon_field is missing, empty, or not found in the PBDB taxonomy. Returns df.
Example
drop_unrecognized_taxa!(df, :family)
Taxonomy augmentation
augment_taxonomy enriches an occurrences DataFrame with the full taxonomic hierarchy for each row, resolved from the Scratch-cached PBDB taxa list.
using PaleobiologyDB, PaleobiologyDB.Taxonomy
df = pbdb_occurrences(base_name = "Carnivora", interval = "Miocene", limit = 500)
# Add taxonomy_genus, taxonomy_family, …, taxonomy_kingdom, taxonomy_clades columns
df2 = augment_taxonomy(df)
# Filter for a specific subfamily
df2[.!ismissing.(df2.taxonomy_subfamily) .&& df2.taxonomy_subfamily .== "Borophaginae", :]
# Inspect a taxonomy string
df2.taxonomy_clades[1]
# → "Animalia > Chordata > Mammalia > Carnivora > Canidae > Borophaginae > Epicyon"
PaleobiologyDB.Taxonomy.augment_taxonomy — Function
augment_taxonomy(df; nodata=missing, fieldname_prefix="taxonomy_", taxonomy_separator=" > ") -> DataFrame
Return a copy of df with one column per taxonomic rank in PBDB_RANK_HIERARCHY and a combined taxonomy string column, all resolved from the Scratch-cached PBDB taxa list snapshot.
New columns (using default prefix "taxonomy_")
One column per rank, from most specific to most general:
taxonomy_subspecies taxonomy_species taxonomy_subgenus taxonomy_genus
taxonomy_subtribe taxonomy_tribe taxonomy_subfamily taxonomy_family
taxonomy_superfamily taxonomy_infraorder taxonomy_suborder taxonomy_order
taxonomy_superorder taxonomy_infraclass taxonomy_subclass taxonomy_class
taxonomy_superclass taxonomy_subphylum taxonomy_phylum taxonomy_kingdom
Plus a summary column:
taxonomy_clades — non-missing/non-empty rank values joined by `taxonomy_separator`,
ordered from most general (kingdom) to most specific (subspecies).
Data source
Each row is resolved by looking up its accepted_name value in the hierarchy index built from the Scratch-managed PBDB taxa list snapshot (same file used by drop_unrecognized_taxa). The snapshot is downloaded on first use and refreshed automatically when older than 30 days. If accepted_name is missing or unrecognised, all new columns for that row are set to nodata.
Keyword arguments
- nodata: value written for unknown/unresolvable ranks (default: missing)
- fieldname_prefix: prefix applied to every new column name (default: "taxonomy_")
- taxonomy_separator: string used to join rank values in the taxonomy column (default: " > ")
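How the prefix and separator shape the output can be seen in a small standalone sketch. It uses an abbreviated rank list and a hand-written lookup result. clades_string and the Dict-based row are hypothetical and stand in for whatever the package does internally.

```julia
# Illustrative sketch only; names and values here are hypothetical.
ranks = ["genus", "subfamily", "family", "order", "class", "phylum", "kingdom"]  # abbreviated, most specific first
prefix = "taxonomy_"
separator = " > "
nodata = missing

# One new column name per rank, e.g. :taxonomy_genus.
column_names = Symbol.(prefix .* ranks)

# Join the usable rank values from most general to most specific.
function clades_string(resolved::AbstractDict)
    values_general_first = [get(resolved, r, nodata) for r in reverse(ranks)]
    usable = [v for v in values_general_first if !ismissing(v) && !isempty(v)]
    return join(usable, separator)
end

full = Dict("genus" => "Epicyon", "subfamily" => "Borophaginae", "family" => "Canidae",
            "order" => "Carnivora", "class" => "Mammalia", "phylum" => "Chordata",
            "kingdom" => "Animalia")
no_subfamily = Dict(k => v for (k, v) in full if k != "subfamily")

with_subfamily    = clades_string(full)
without_subfamily = clades_string(no_subfamily)
```

Ranks with no value are skipped rather than rendered as empty segments, so the joined string never contains doubled separators.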
Examples
using PaleobiologyDB, PaleobiologyDB.Taxonomy
df = pbdb_occurrences(base_name = "Carnivora", interval = "Miocene", limit = 500)
df2 = augment_taxonomy(df)
# Filter for a specific subfamily
borophaginae = df2[
    .!ismissing.(df2.taxonomy_subfamily) .&& df2.taxonomy_subfamily .== "Borophaginae",
    :,
]
# Inspect a taxonomy string
df2.taxonomy_clades[1]
# → "Animalia > Chordata > Mammalia > Carnivora > Canidae > Borophaginae > Epicyon"
# Use a different fill value
df3 = augment_taxonomy(df; nodata = "")
See also PBDB_RANK_HIERARCHY.
Taxonomic rank hierarchy
PaleobiologyDB.Taxonomy.PBDB_RANK_HIERARCHY — Constant
PBDB_RANK_HIERARCHY
Vector of PBDB accepted_rank values ordered from most specific to most general: "subspecies", "species", "subgenus", "genus", …, "kingdom".
Used internally to resolve "at least as specific as X" queries and to define the columns added by augment_taxonomy. Use taxonomic_ranks to obtain a mutable copy.