Caching

PBDB queries make HTTP requests that can be slow or rate-limited. PaleobiologyDB.jl provides three complementary caching mechanisms via the re-exported DataCaches.jl package.

set_autocaching! — global automatic caching

set_autocaching! enables transparent caching on every API call without requiring @filecache wrappers.

using PaleobiologyDB

# Enable for ALL autocache-enabled functions
PaleobiologyDB.set_autocaching!(true)

occs = pbdb_occurrences(base_name = "Canidae", interval = "Miocene")  # live fetch + cached
occs = pbdb_occurrences(base_name = "Canidae", interval = "Miocene")  # instant cache hit

# Disable
PaleobiologyDB.set_autocaching!(false)

Per-function control:

# Cache only occurrence queries
PaleobiologyDB.set_autocaching!(true, pbdb_occurrences)

# Cache occurrences and taxa
PaleobiologyDB.set_autocaching!(true, [pbdb_occurrences, pbdb_taxa])

# Remove a function from the autocache list
PaleobiologyDB.set_autocaching!(false, pbdb_occurrences)

Custom cache store:

using DataCaches
my_cache = DataCache(joinpath(homedir(), "Downloads", "dat"))
set_autocaching!(true; cache = my_cache)

It is safe to use @filecache explicitly while autocaching is on: autocaching is suppressed for that call, so the result is written exactly once.

@memcache — in-memory session memoization

@memcache caches results in RAM for the duration of the current Julia session. No files are written; the cache is lost when Julia exits.

occs = @memcache pbdb_occurrences(base_name = "Canidae", show = "full")
taxa = @memcache pbdb_taxa(name = "Dinosauria")

PaleobiologyDB.memcache_clear!()   # discard all in-memory cached results

Autocache-enabled functions

All of the PBDB API functions listed below and the two PhyloPic enrichment functions support set_autocaching!. Pass the function itself as the second argument to target a specific function, pass a vector of functions to target several, or omit the argument to affect all of them at once.

PBDB API functions

Each function caches by its endpoint path plus the full set of keyword arguments.

Category     Functions
Occurrences  pbdb_occurrence, pbdb_occurrences, pbdb_ref_occurrences
Collections  pbdb_collection, pbdb_collections, pbdb_collections_geo, pbdb_ref_collections
Taxa         pbdb_taxon, pbdb_taxa, pbdb_taxa_auto, pbdb_ref_taxa, pbdb_opinions_taxa
Intervals    pbdb_interval, pbdb_intervals
Scales       pbdb_scale, pbdb_scales
Strata       pbdb_strata, pbdb_strata_auto
References   pbdb_reference, pbdb_references
Specimens    pbdb_specimen, pbdb_specimens, pbdb_ref_specimens, pbdb_measurements
Opinions     pbdb_opinion, pbdb_opinions
Config       pbdb_config
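
The keying rule stated above (endpoint path plus the full set of keyword arguments) can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical: cache_key, cached_call, fake_fetch, and the "occs/list" string are placeholders, and sorting the keywords so that argument order does not matter is an assumption, not a documented property of the package.

```julia
# Hypothetical illustration only; this is not the package's actual key format.
const STORE = Dict{Any, Any}()
const N_FETCHES = Ref(0)

# Build a key from the endpoint path and the keyword arguments.
# Keywords are sorted by name so that argument order does not change the key
# (an assumption made for this sketch).
function cache_key(endpoint::AbstractString; kwargs...)
    sorted = sort!(collect(kwargs); by = first)
    return (String(endpoint), Tuple(sorted))
end

# Return the stored value for this key, calling `fetch` only on a miss.
function cached_call(fetch::Function, endpoint::AbstractString; kwargs...)
    key = cache_key(endpoint; kwargs...)
    return get!(() -> fetch(endpoint; kwargs...), STORE, key)
end

# Stand-in for an HTTP request; it only counts how often it is called.
function fake_fetch(endpoint; kwargs...)
    N_FETCHES[] += 1
    return "response $(N_FETCHES[]) for $endpoint"
end

a = cached_call(fake_fetch, "occs/list"; base_name = "Canidae", interval = "Miocene")
b = cached_call(fake_fetch, "occs/list"; interval = "Miocene", base_name = "Canidae")
c = cached_call(fake_fetch, "occs/list"; base_name = "Canidae", interval = "Eocene")
```

In this sketch a and b share one cache entry, while c differs in one keyword value and therefore triggers a second fetch.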

PhyloPic enrichment functions

acquire_phylopic and augment_phylopic (from PaleobiologyDB.Taxonomy) are also autocache-enabled. The cache operates at the per-taxon-name level rather than the whole-DataFrame level: each unique taxon name is cached independently, keyed on (taxon_name, phylopic_build).

This means two DataFrames that share taxa produce zero redundant network requests on the second call, regardless of how many rows they have or how the rows are ordered.
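
The effect of per-taxon keying can be sketched without any network access. The code below is a hypothetical model, not the package's internals: fake_lookup stands in for a PhyloPic request, "build-1" is a placeholder build identifier, and plain vectors of names stand in for the taxon column of a DataFrame.

```julia
# Hypothetical sketch; names are illustrative and not part of PaleobiologyDB.jl.
const TAXON_CACHE = Dict{Tuple{String, String}, String}()
const REQUESTS = Ref(0)

# Stand-in for a network lookup; counts how often it is called.
function fake_lookup(taxon::String, build::String)
    REQUESTS[] += 1
    return "silhouette:$(taxon)@$(build)"
end

# Resolve every name, fetching each unique (taxon, build) pair at most once.
function lookup_all(names::Vector{String}; build::String = "build-1")
    return [get!(() -> fake_lookup(n, build), TAXON_CACHE, (n, build)) for n in names]
end

first_batch  = ["Tyrannosaurus", "Albertosaurus", "Tyrannosaurus"]
second_batch = ["Albertosaurus", "Tyrannosaurus", "Albertosaurus", "Tyrannosaurus"]

pics_first = lookup_all(first_batch)
after_first = REQUESTS[]      # requests made for the first batch

pics_second = lookup_all(second_batch)
after_second = REQUESTS[]     # unchanged if every taxon was already cached
```

Because the key is the taxon name rather than the whole input, the second batch triggers no additional lookups even though it has more rows in a different order.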

Important: both acquire_phylopic and augment_phylopic are controlled through the same acquire_phylopic function reference, because the per-taxon cache is wired inside the shared internal pipeline. To enable caching for either or both, pass acquire_phylopic:

using PaleobiologyDB, PaleobiologyDB.Taxonomy

# Enable per-taxon caching for all PhyloPic lookups
PaleobiologyDB.set_autocaching!(true, acquire_phylopic)

# String variant — fetches once, cached by (taxon_name, build)
rec1 = acquire_phylopic("Tyrannosaurus")
rec2 = acquire_phylopic("Tyrannosaurus")   # instant cache hit
@assert rec1 == rec2

# DataFrame variant — unique taxa shared across DataFrames
df1 = pbdb_occurrences(base_name = "Tyrannosauridae", limit = 50)
df2 = pbdb_occurrences(base_name = "Tyrannosauridae", limit = 100)

pics1 = acquire_phylopic(df1)   # fetches each unique taxon, caches results
pics2 = acquire_phylopic(df2)   # taxa already seen in df1 are cache hits; only new taxa are fetched

# augment_phylopic benefits automatically (calls acquire_phylopic internally)
enriched = augment_phylopic(df1)   # all lookups are cache hits

PaleobiologyDB.set_autocaching!(false, acquire_phylopic)

Note: set_autocaching!(true, augment_phylopic) alone has no effect on the per-taxon cache, because the cache is keyed on acquire_phylopic. Always use acquire_phylopic as the function reference when targeting PhyloPic caching.

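Autocaching performance benefits

The REPL session below times taxonomytreeplot on the Ursidae species tree. Without PhyloPic images the plot takes about 0.02 s. With show_phylopic = true and autocaching off, it takes 58.8 s on the first call and 45.8 s on the second. After set_autocaching!(true), the first call still takes 57.4 s while the cache is being populated, and the next two calls finish in 11.9 s and 13.6 s, roughly a 3 to 4 times speed-up over the uncached repeat call.
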
julia> tt = Taxonomy.taxon_subtree("Ursidae"; leaf_rank = "species");

julia> fig, ax, plt = @time taxonomytreeplot(tt; 
           leaf_rank = "species", show_phylopic = false); fig
  0.016077 seconds (136.06 k allocations: 14.557 MiB)

julia> fig, ax, plt = @time taxonomytreeplot(tt; 
           leaf_rank = "species", show_phylopic = true); fig
 58.754672 seconds (8.65 M allocations: 607.436 MiB, 1.54% gc time, 1345 lock conflicts, 7.07% compilation time: 4% of which was recompilation)

julia> fig, ax, plt = @time taxonomytreeplot(tt; 
           leaf_rank = "species", show_phylopic = true); fig
 45.769064 seconds (1.91 M allocations: 276.219 MiB, 0.09% gc time, 1344 lock conflicts)

julia> set_autocaching!(true)

julia> fig, ax, plt = @time taxonomytreeplot(tt; 
           leaf_rank = "species", show_phylopic = true); fig
 57.396719 seconds (100.20 M allocations: 6.731 GiB, 4.57% gc time, 1180 lock conflicts, 5.82% compilation time)

julia> fig, ax, plt = @time taxonomytreeplot(tt; 
           leaf_rank = "species", show_phylopic = true); fig
 11.860014 seconds (108.21 M allocations: 5.300 GiB, 6.64% gc time, 0.14% compilation time)

julia> fig, ax, plt = @time taxonomytreeplot(tt; 
           leaf_rank = "species", show_phylopic = true); fig
 13.554290 seconds (114.35 M allocations: 5.548 GiB, 16.37% gc time)
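
The cold-versus-warm pattern in the session above can be reproduced offline with a small sketch. It is only a model under stated assumptions: slow_fetch simulates latency with sleep instead of making a request, and cached_fetch and SLOW_CACHE are hypothetical names rather than package functions.

```julia
# Hypothetical timing sketch; no PBDB or PhyloPic request is made.
const SLOW_CACHE = Dict{String, String}()

# Stand-in for a slow network call: sleeps, then returns a dummy payload.
slow_fetch(name::String) = (sleep(0.2); uppercase(name))

# Serve from the Dict when possible, otherwise pay the simulated latency once.
cached_fetch(name::String) = get!(() -> slow_fetch(name), SLOW_CACHE, name)

t_cold = @elapsed cached_fetch("ursidae")   # first call: simulated latency plus compilation
t_warm = @elapsed cached_fetch("ursidae")   # second call: served from the Dict

println("cold: ", round(t_cold; digits = 3), " s, warm: ", round(t_warm; digits = 6), " s")
```

As in the real session, the first call includes compilation time, so timing a repeat call gives a fairer picture of the steady-state cost.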
