Taxonomy — Search
Taxon occurrence search: taxon_occursin
taxon_occursin searches for taxonomic patterns across multiple columns. It comes in two forms:
- 2-arg: taxon_occursin(pattern, df) → Vector{Bool}. Searches across all taxonomy columns; use for df[mask, :] filtering.
- 1-arg: taxon_occursin(pattern) → ByRow predicate, for use directly with subset(df, :col => taxon_occursin(pattern)).
By placing the pattern first, this function works naturally with piping and functional composition.
Vector inputs (AbstractVector{<:AbstractString} or AbstractVector{<:Regex}) accept a combine keyword (all by default): combine=all requires all elements to match (AND); combine=any requires any to match (OR).
using PaleobiologyDB, PaleobiologyDB.Taxonomy
df = pbdb_occurrences(base_name = "Canidae", interval = "Miocene", show = "full")
# 2-arg: multi-column boolean mask
df[taxon_occursin("Canis", df), :]
df[taxon_occursin(r"^Canis\b", df), :]
# 2-arg: AND — every name must appear in some column (default combine=all)
df[taxon_occursin(["Canis", "Mammalia"], df), :]
# 2-arg: OR — any name matches any column
df[taxon_occursin(["Canis", "Vulpes"], df; combine=any), :]
# 2-arg: AND patterns — each regex must match at least one column
df[taxon_occursin([r"Canidae", r"Canis"], df), :]
# 1-arg: use directly with subset
df2 = augment_taxonomy(df)
subset(df2, :taxonomy_genus => taxon_occursin("Canis"))
subset(df2, :taxonomy_clades => taxon_occursin(r"Borophaginae"))
# 1-arg: regex AND on composite column (default combine=all)
# rows where taxonomy_clades contains BOTH patterns
subset(df2, :taxonomy_clades => taxon_occursin([r"Canidae", r"lupus"]))
# 1-arg: regex OR
subset(df2, :taxonomy_clades => taxon_occursin([r"^Canis\b", r"^Vulpes\b"]; combine=any))
# 1-arg: string OR
subset(df2, :taxonomy_genus => taxon_occursin(["Canis", "Vulpes"]; combine=any))
# Chain with subset (Chain.jl)
using Chain
@chain df begin
augment_taxonomy
subset(:taxonomy_family => taxon_occursin("Canidae"))
subset(:taxonomy_clades => taxon_occursin([r"Canis", r"lupus"]))
end
PaleobiologyDB.Taxonomy.taxon_occursin - Function
taxon_occursin(name, df; autoaugment=true) -> Vector{Bool}
taxon_occursin(name) -> ByRow predicate
Search for patterns in taxonomic columns. This function searches across multiple taxonomic rank columns of Paleobiology Database data to find rows matching the specified patterns.
Pattern-First Argument Order
By placing the search pattern first, this function works naturally with piping and partial application, allowing easy composition of multiple pattern searches in data processing workflows.
Two forms:
- 2-arg: taxon_occursin(pattern, df) returns a Vector{Bool} of length nrow(df), where each element indicates whether the corresponding row matches the pattern in any relevant taxonomic column of df.
- 1-arg: taxon_occursin(pattern) returns a ByRow(predicate) function for use directly with subset(): subset(df, :column => taxon_occursin(pattern)).
Method signatures
# 2-arg: multi-column mask
taxon_occursin(name::Regex, df; autoaugment=true)
taxon_occursin(name::AbstractString, df; autoaugment=true)
taxon_occursin(names::AbstractVector{<:AbstractString}, df; autoaugment=true, combine=all)
taxon_occursin(names::AbstractVector{<:Regex}, df; autoaugment=true, combine=all)
# 1-arg: ByRow predicate for subset
taxon_occursin(name::Regex)
taxon_occursin(name::AbstractString)
taxon_occursin(names::AbstractVector{<:AbstractString}; combine=all)
taxon_occursin(names::AbstractVector{<:Regex}; combine=all)
Matching semantics
- Regex: occursin(name, value).
- AbstractString: exact equality (==), case-sensitive.
- AbstractVector{<:AbstractString}: controlled by combine:
  - combine=all (default): AND, every name must appear in at least one column.
  - combine=any: OR, any name matching any column is sufficient.
- AbstractVector{<:Regex}: controlled by combine:
  - combine=all (default): AND, every pattern must match at least one column.
  - combine=any: OR, any pattern matching any column is sufficient.
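The matching rules above can be illustrated with a small Base-only sketch. It is not the package's implementation: value_matches and row_matches are hypothetical helper names, and a row is modelled as a plain vector of column values rather than a DataFrame row.

```julia
# Hypothetical helpers illustrating the documented semantics; not the package's code.

# Scalar match: a Regex uses occursin, a string uses exact, case-sensitive equality.
value_matches(pattern::Regex, value) = occursin(pattern, value)
value_matches(name::AbstractString, value) = name == value

# Single pattern: the row matches if any of its column values matches.
row_matches(pattern, row) = any(v -> value_matches(pattern, v), row)

# Vector of patterns: `combine` is `all` (AND) or `any` (OR), applied over the patterns.
row_matches(patterns::AbstractVector, row; combine=all) =
    combine(p -> row_matches(p, row), patterns)

row = ["Mammalia", "Carnivora", "Canidae", "Canis"]  # illustrative column values

row_matches("Canis", row)                            # true: exact equality in one column
row_matches("canis", row)                            # false: case-sensitive
row_matches(r"^Can", row)                            # true: regex via occursin
row_matches(["Canis", "Mammalia"], row)              # true: AND, both names found
row_matches(["Canis", "Vulpes"], row)                # false: AND, "Vulpes" is absent
row_matches(["Canis", "Vulpes"], row; combine=any)   # true: OR
```

Passing the functions all and any themselves as the keyword value is what lets one code path serve both AND and OR.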
Column selection (2-arg form)
- Augmented columns already present: if df has any taxonomy_<rank> or taxonomy_clades column (added by augment_taxonomy), those are searched.
- Auto-augmentation: if no augmented columns exist, autoaugment=true (default), and :accepted_name is present, augment_taxonomy is called on a copy of df and its columns are searched.
- Fallback: otherwise, any column whose name matches a rank in PBDB_RANK_HIERARCHY, plus :accepted_name, restricted to those present in df.
Note: :taxonomy_clades is a composite string ("Animalia > … > Canis"). Regex patterns match it; exact strings (e.g. "Canis") do not — use the per-rank column (e.g. taxonomy_genus) for exact matching.
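The column-selection order described in this section can be sketched as a small decision function over column names. This is an assumption-laden illustration, not the package's code: search_columns_sketch is a hypothetical name, RANKS_SKETCH is a made-up stand-in for PBDB_RANK_HIERARCHY (whose real contents may differ), and the auto-augmentation branch only reports that augment_taxonomy would be called.

```julia
# Hypothetical stand-in for PBDB_RANK_HIERARCHY; the real constant may differ.
const RANKS_SKETCH = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

# Decide which columns would be searched, given the column names of a table.
function search_columns_sketch(colnames::Vector{String}; autoaugment=true)
    # 1. Augmented columns already present (taxonomy_<rank> or taxonomy_clades).
    augmented = filter(name -> startswith(name, "taxonomy_"), colnames)
    isempty(augmented) || return (:augmented, augmented)

    # 2. Auto-augmentation: the real function would call augment_taxonomy on a copy here.
    if autoaugment && "accepted_name" in colnames
        return (:autoaugment, String[])
    end

    # 3. Fallback: rank-named columns plus accepted_name, restricted to those present.
    candidates = [RANKS_SKETCH; "accepted_name"]
    return (:fallback, filter(in(candidates), colnames))
end

search_columns_sketch(["accepted_name", "taxonomy_genus", "taxonomy_clades"])
search_columns_sketch(["accepted_name", "genus"])
search_columns_sketch(["accepted_name", "genus", "lat"]; autoaugment=false)
```

The three calls exercise the augmented, auto-augmentation, and fallback branches in turn.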
Examples
using PaleobiologyDB, PaleobiologyDB.Taxonomy
df = pbdb_occurrences(base_name = "Canidae", interval = "Miocene", show = "full")
# 2-arg: exact string across all taxonomy columns
df[taxon_occursin("Canis", df), :]
# 2-arg: regex
df[taxon_occursin(r"^Canis", df), :]
# 2-arg: AND, every name must appear in at least one column (default combine=all)
df[taxon_occursin(["Canis", "Mammalia"], df), :]
# 2-arg: OR, any name matches any column
df[taxon_occursin(["Canis", "Vulpes"], df; combine=any), :]
# 2-arg: AND patterns, each regex must match at least one column
df[taxon_occursin([r"Canidae", r"Canis"], df), :]
# 1-arg forms operate on a single column, so add the taxonomy_* columns first
df2 = augment_taxonomy(df)
# 1-arg: subset with exact string
subset(df2, :taxonomy_genus => taxon_occursin("Canis"))
# 1-arg: subset with regex AND (default) on composite column
subset(df2, :taxonomy_clades => taxon_occursin([r"Canidae", r"lupus"]))
# 1-arg: subset with regex OR
subset(df2, :taxonomy_clades => taxon_occursin([r"^Canis", r"^Vulpes"]; combine=any))
# 1-arg: subset with string OR
subset(df2, :taxonomy_genus => taxon_occursin(["Canis", "Vulpes"]; combine=any))
# Suppress auto-augmentation for a pre-augmented DataFrame
df2[taxon_occursin("Canidae", df2; autoaugment=false), :]
See also augment_taxonomy, child_taxa, parent_taxa, registered_taxa.
taxon_occursin(name::AbstractString, df; autoaugment=true) -> Vector{Bool}
Exact-string variant of taxon_occursin. Returns true for rows where any relevant taxonomic column equals name (case-sensitive).
taxon_occursin(names::AbstractVector{<:AbstractString}, df; autoaugment=true, combine=all) -> Vector{Bool}
Multi-name variant of taxon_occursin.
- combine=all (default): every name in names must appear in at least one relevant column (AND semantics across columns).
- combine=any: any name matching any column is sufficient (OR, i.e. set membership).
Note: for exact strings, combine=all is only useful in the multi-column (2-arg) form. In a single-column subset context one value cannot equal two different strings, so combine=all never matches when names holds more than one distinct string; use combine=any there.
taxon_occursin(names::AbstractVector{<:Regex}, df; autoaugment=true, combine=all) -> Vector{Bool}
Multi-pattern variant of taxon_occursin.
- combine=all (default): every pattern in names must match at least one relevant column value (AND semantics). Useful for narrowing a search across multiple criteria, e.g. [r"Canidae", r"Canis"] finds rows resolved to genus within that family.
- combine=any: any pattern matching any column is sufficient (OR semantics).
taxon_occursin(name) -> ByRow predicate
Single-argument form of taxon_occursin for use with subset:
subset(df, :col => taxon_occursin(pattern))
subset(df, :col => f) passes the whole column vector to f and expects a Vector{Bool} in return. The returned ByRow(predicate) applies a scalar predicate element-wise, satisfying that contract. Missing and empty values always return false.
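The column-in, Vector{Bool}-out contract described above can be mimicked without DataFrames. In this sketch, by_row_sketch is a hypothetical stand-in for ByRow and make_predicate_sketch is a hypothetical scalar predicate; neither is the package's actual code.

```julia
# Hypothetical scalar predicate: missing and empty values never match.
function make_predicate_sketch(pattern::Regex)
    return value -> !ismissing(value) && !isempty(value) && occursin(pattern, value)
end

# Stand-in for ByRow: wrap a scalar function so that it maps over a whole column.
by_row_sketch(f) = column -> map(f, column)

genus = ["Canis", missing, "", "Vulpes", "Canis"]   # illustrative column with gaps

# subset would call the wrapped function with the whole column vector.
column_predicate = by_row_sketch(make_predicate_sketch(r"^Canis$"))
mask = column_predicate(genus)   # [true, false, false, false, true]
```

Checking ismissing before calling occursin matters: occursin has no method for missing, so the short-circuit is what keeps missing entries from raising an error.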
Method signatures
taxon_occursin(name::Regex)
taxon_occursin(name::AbstractString)
taxon_occursin(names::AbstractVector{<:AbstractString}; combine=all)
taxon_occursin(names::AbstractVector{<:Regex}; combine=all)
combine keyword (vector forms only)
- combine=all (default): AND, all names/patterns must match the column value. For strings, this is rarely practical when length(names) > 1 (a single field value cannot equal two different strings). For regex, it is useful on composite columns such as taxonomy_clades (e.g. [r"Canidae", r"lupus"] narrows to species within that family).
- combine=any: OR, any matching name/pattern is sufficient.
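To see why combine=all behaves so differently for strings and regexes on a single column value, consider the following Base-only sketch. The helper names cell_matches and combined_sketch are hypothetical, and the clades string is an illustrative value, not actual package output.

```julia
# Hypothetical helpers; a "cell" is one value from one column.
cell_matches(p::Regex, value) = occursin(p, value)
cell_matches(s::AbstractString, value) = s == value

# Apply `combine` (all or any) over the patterns for a single cell value.
combined_sketch(patterns, value; combine=all) =
    combine(p -> cell_matches(p, value), patterns)

# Strings: one value cannot equal two different names, so AND fails.
combined_sketch(["Canis", "Vulpes"], "Canis")                # false
combined_sketch(["Canis", "Vulpes"], "Canis"; combine=any)   # true

# Regex on a composite value: AND narrows the match.
clades = "Carnivora > Canidae > Canis > Canis lupus"         # illustrative value
combined_sketch([r"Canidae", r"lupus"], clades)              # true
combined_sketch([r"Canidae", r"Vulpes"], clades)             # false
```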
Examples
using PaleobiologyDB, PaleobiologyDB.Taxonomy
df = pbdb_occurrences(base_name = "Carnivora", interval = "Miocene", show = "full")
df2 = augment_taxonomy(df)
# Exact string on a single column
subset(df2, :taxonomy_genus => taxon_occursin("Canis"))
# Regex on a single column
subset(df2, :taxonomy_clades => taxon_occursin(r"Borophaginae"))
# Regex AND (default): taxonomy_clades must contain both patterns
subset(df2, :taxonomy_clades => taxon_occursin([r"Canidae", r"lupus"]))
# Regex OR: either pattern matches
subset(df2, :taxonomy_clades => taxon_occursin([r"^Canis", r"^Vulpes"]; combine=any))
# String OR (combine=any): genus is Canis or Vulpes
subset(df2, :taxonomy_genus => taxon_occursin(["Canis", "Vulpes"]; combine=any))
# @chain
using Chain
@chain df begin
augment_taxonomy
subset(:taxonomy_family => taxon_occursin("Canidae"))
subset(:taxonomy_clades => taxon_occursin([r"Canis", r"lupus"]))
end
See also augment_taxonomy, child_taxa, parent_taxa, registered_taxa.
taxon_occursin(name::AbstractString) -> ByRow predicate
Exact-string 1-arg form of taxon_occursin.
taxon_occursin(names::AbstractVector{<:AbstractString}; combine=all) -> ByRow predicate
Multi-name 1-arg form of taxon_occursin. See the 1-arg taxon_occursin docstring for combine semantics.
taxon_occursin(names::AbstractVector{<:Regex}; combine=all) -> ByRow predicate
Multi-pattern 1-arg form of taxon_occursin. See the 1-arg taxon_occursin docstring for combine semantics.
Taxon occurrence search: contains_taxon
contains_taxon provides an alternative syntax to taxon_occursin with the DataFrame as the first argument. It comes in the same two forms:
- 2-arg: contains_taxon(df, pattern) → Vector{Bool}. Searches across all taxonomy columns; use for df[mask, :] filtering.
- 1-arg: contains_taxon(pattern) → ByRow predicate, for use directly with subset(df, :col => contains_taxon(pattern)).
By placing the DataFrame first, this function is more natural for statement chaining and method calls where data flows from left to right.
All matching semantics, column selection, and keywords are identical to taxon_occursin.
using PaleobiologyDB, PaleobiologyDB.Taxonomy
df = pbdb_occurrences(base_name = "Canidae", interval = "Miocene", show = "full")
# 2-arg: multi-column boolean mask (DataFrame first)
df[contains_taxon(df, "Canis"), :]
df[contains_taxon(df, r"^Canis\b"), :]
# 2-arg: AND — every name must appear in some column (default combine=all)
df[contains_taxon(df, ["Canis", "Mammalia"]), :]
# 2-arg: OR — any name matches any column
df[contains_taxon(df, ["Canis", "Vulpes"]; combine=any), :]
# 2-arg: AND patterns — each regex must match at least one column
df[contains_taxon(df, [r"Canidae", r"Canis"]), :]
# 1-arg: use directly with subset
df2 = augment_taxonomy(df)
subset(df2, :taxonomy_genus => contains_taxon("Canis"))
subset(df2, :taxonomy_clades => contains_taxon(r"Borophaginae"))
# 1-arg: regex AND on composite column (default combine=all)
# rows where taxonomy_clades contains BOTH patterns
subset(df2, :taxonomy_clades => contains_taxon([r"Canidae", r"lupus"]))
# 1-arg: regex OR
subset(df2, :taxonomy_clades => contains_taxon([r"^Canis\b", r"^Vulpes\b"]; combine=any))
# 1-arg: string OR
subset(df2, :taxonomy_genus => contains_taxon(["Canis", "Vulpes"]; combine=any))
# Chain with subset (Chain.jl)
using Chain
@chain df begin
augment_taxonomy
subset(:taxonomy_family => contains_taxon("Canidae"))
subset(:taxonomy_clades => contains_taxon([r"Canidae", r"lupus"]))
end
PaleobiologyDB.Taxonomy.contains_taxon - Function
contains_taxon(df, name; autoaugment=true) -> Vector{Bool}
contains_taxon(name) -> ByRow predicate
Search for patterns in taxonomic columns with the DataFrame as the first argument. This is an alternative to taxon_occursin with haystack-first argument order: the DataFrame being searched comes first, and the pattern (needle) to search for comes second.
DataFrame-First Argument Order
Unlike taxon_occursin, which places the search pattern first, contains_taxon puts the DataFrame first, which reads more naturally in chained and piped code where data transformations flow from left to right.
Two forms:
- 2-arg: contains_taxon(df, pattern) returns a Vector{Bool} of length nrow(df), where each element indicates whether the corresponding row matches the pattern in any relevant taxonomic column of df.
- 1-arg: contains_taxon(pattern) returns a ByRow(predicate) function for use directly with subset(): subset(df, :column => contains_taxon(pattern)).
Method signatures
# 2-arg: multi-column mask (DataFrame first)
contains_taxon(df, name::Regex; autoaugment=true)
contains_taxon(df, name::AbstractString; autoaugment=true)
contains_taxon(df, names::AbstractVector{<:AbstractString}; autoaugment=true, combine=all)
contains_taxon(df, names::AbstractVector{<:Regex}; autoaugment=true, combine=all)
# 1-arg: ByRow predicate for subset
contains_taxon(name::Regex)
contains_taxon(name::AbstractString)
contains_taxon(names::AbstractVector{<:AbstractString}; combine=all)
contains_taxon(names::AbstractVector{<:Regex}; combine=all)
Comparison with taxon_occursin
All functionality is identical; only the argument order differs:
| Task | taxon_occursin | contains_taxon |
|---|---|---|
| 2-arg exact string | taxon_occursin("Canis", df) | contains_taxon(df, "Canis") |
| 2-arg regex | taxon_occursin(r"^Canis", df) | contains_taxon(df, r"^Canis") |
| 1-arg in subset | subset(df, :col => taxon_occursin("Canis")) | subset(df, :col => contains_taxon("Canis")) |
contains_taxon(df, pattern) is semantically equivalent to taxon_occursin(pattern, df).
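The equivalence stated above is what you would get from a thin argument-swapping wrapper. The sketch below shows that pattern with Base-only stand-ins; needle_first, haystack_first, and hit are hypothetical names, a table is modelled as a vector of rows, and whether the package actually defines contains_taxon this way is an assumption.

```julia
# Hypothetical scalar matcher: regex via occursin, strings via exact equality.
hit(p::Regex, value) = occursin(p, value)
hit(s::AbstractString, value) = s == value

# Pattern-first stand-in: one Bool per row.
needle_first(pattern, rows) = [any(v -> hit(pattern, v), row) for row in rows]
needle_first(patterns::AbstractVector, rows; combine=all) =
    [combine(p -> any(v -> hit(p, v), row), patterns) for row in rows]

# Haystack-first wrapper: swap the positional arguments and forward all keywords.
haystack_first(rows, pattern; kwargs...) = needle_first(pattern, rows; kwargs...)

rows = [["Canidae", "Canis"], ["Canidae", "Vulpes"]]   # illustrative rows

haystack_first(rows, "Canis") == needle_first("Canis", rows)   # true
haystack_first(rows, ["Canis", "Vulpes"]; combine=any)         # [true, true]
haystack_first(rows, ["Canidae", "Canis"])                     # [true, false]
```

Forwarding kwargs... unchanged is what keeps the keyword behaviour of the two spellings in lockstep.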
Examples
using PaleobiologyDB, PaleobiologyDB.Taxonomy
df = pbdb_occurrences(base_name = "Canidae", interval = "Miocene", show = "full")
# 2-arg: exact string across all taxonomy columns (DataFrame first)
df[contains_taxon(df, "Canis"), :]
# 2-arg: regex
df[contains_taxon(df, r"^Canis\b"), :]
# 2-arg: AND, every name must appear in at least one column (default combine=all)
df[contains_taxon(df, ["Canis", "Mammalia"]), :]
# 2-arg: OR, any name matches any column
df[contains_taxon(df, ["Canis", "Vulpes"]; combine=any), :]
# 1-arg forms operate on a single column, so add the taxonomy_* columns first
df2 = augment_taxonomy(df)
# 1-arg: subset with exact string
subset(df2, :taxonomy_genus => contains_taxon("Canis"))
# 1-arg: subset with regex AND on composite column
subset(df2, :taxonomy_clades => contains_taxon([r"Canidae", r"lupus"]))
# 1-arg: subset with regex OR
subset(df2, :taxonomy_clades => contains_taxon([r"^Canis\b", r"^Vulpes\b"]; combine=any))
See also taxon_occursin, augment_taxonomy, child_taxa, parent_taxa, registered_taxa.
Choosing between taxon_occursin and contains_taxon
taxon_occursin and contains_taxon are functionally identical and support the same patterns, keywords, and use cases. The choice is purely stylistic:
| Preference | Function | Usage |
|---|---|---|
| Pattern-first (functional style) | taxon_occursin | df[taxon_occursin("Canis", df), :] |
| DataFrame-first (method chaining style) | contains_taxon | df[contains_taxon(df, "Canis"), :] |
Use whichever feels more natural for your workflow. Both are equally idiomatic and supported.