A Complete Worked Example: Lionepha¶

(Software Prerequisites)¶

For this exercise, in addition to DELINEATE you will need the following programs installed on your execution host:

BP&P

StarBeast2

DendroPy (which provides SumTrees)

For meta-analysis and visualization, you should have the following installed on your local machine:

Tracer

FigTree

In addition, of course (as always, for anything and everything), you will want a *good* text editor.

(Data Files)¶

All data files for this exercise are available for download here:

lionepha.zip

Introduction¶

We are going to to carry out a full species delimitation analysis, or, actually, a series of species delimitation analyses. We are going to use the the Lionepha system, using a subset of the data published by David R. Maddison and John S. Sproul:

David R Maddison, John S Sproul. Species delimitation, classical taxonomy and genome skimming: a review of the ground beetle genus Lionepha (Coleoptera: Carabidae), Zoological Journal of the Linnean Society, , zlz167, https://doi.org/10.1093/zoolinnean/zlz167.

The original data is available in the following Dryad repository:

https://datadryad.org/stash/dataset/doi:10.5061/dryad.2jm63xsjq

The work by Maddison and Sproul (2020) nicely lays out the basic alpha taxonomy of this system, organizing these 143 lineages into 12 nominal species. This alpha taxonomy is based on a thorough analyses integrating morphological as well as genomic information, and thus we already have a very clear understanding of the species boundaries in this group, precluding the need for a species delimitation analysis.

However, for the purposes of this exercise we are going to pretend that this is not the case. Instead, we are going to consider that while the species identities of some of the population lineages are well known, in this sample there are number of lineages for which this is not the case.

Background¶

Study Design¶

We can imagine our study sampled individuals from an extensive range of localities, supplementing it with samples from previous collections, to develop a comprehensive data set spanning a broad taxonomic as well as geographic range. Preliminary phylogenetic analysis, which included genomic data from known, stable, species, as well as comparative morphological analyses, has confidentally established the species identities of some of these lineages. However, a number of lineages in the sample, while coming out to sister to known species, are sufficiently different in some respects (e.g., genetic distance or morphology), to make us question whether or not we are looking at different populations of the same species or perhaps new species altogether.

Alternatively, we might imagine a study to delineate species units in a particularly problematic clade. Here, in addition to samples from multiple populations from the clade, we would also include populations from a broader range of species in which the focal clade is a nested subtree. The analytical workflow for this sort of study will be identical to that discussed here, with only the distribution of population lineages of unknown identity being different — clustered into a single subtree on the larger phylogeny as opposed to scattered into multiple smaller subtrees (or single lineages).

Data¶

Here we will focus on just a subset of the data from Maddison and Sproul (2020), focusing only on lineages in the genus Lionepha itself. This data consists of 8 gene alignments for 143 samples from multiple individuals from multiple populations from multiple species:

lionepha/00-alignments/18s.nex
lionepha/00-alignments/28s.nex
lionepha/00-alignments/argk.nex
lionepha/00-alignments/cad.nex
lionepha/00-alignments/coi.nex
lionepha/00-alignments/msp.nex
lionepha/00-alignments/topo.nex
lionepha/00-alignments/wg.nex

Stage I: Identification of Population Units using the Multipopulation Coalescent Model¶

The first stage of our analysis will be delimitation of population units, i.e. assignment of population identities to individual lineages, under the “multipopulation coalescent”. This model was first published under the name “censored coalescent”, but has since become more popularly known as the “Multispecies Coalescent” or “MSC” model. It is a very powerful and elegant model that allows us to identify structure in genomic data arising from disruptions in Wright-Fisher panmixia due to restrictions in gene flow.

We will be using BP&P to identify our population units, and StarBeast2 to obtain a guide tree for the BP&P analysis. Note that these programs use the term “species” to reference the higher-level units they analyze rather than populations (as in “species delimitation” or “species tree inference”, rather than “population delimitation” and “population tree inference”). This can be correct in some contexts: e.g., where the data sampling does, either coincidentally or by design, indeed represent a 1-to-1 correspondence between populations and species, or, equivalently the species being studied do indeed not have any within-species lineage structure. However, more generally, this is, at best, misleading. These programs use the MSC as their underlying model. From a statistical perspective, it is uncontroversial that the biological units that these programs model through the “censored” coalescent/MSC that model that underlies them are (Wright-Fisher) populations. Strictly speaking, when used for species delimitation or model species trees, programs such as BP&P and StarBeast2 are used to identify population units under the MSC, and then these population units are interpreted (with or without justification) by the investigator to correspond on a 1-to-1 basis to species units. Thus, when using the MSC for “species delimitation” or “species tree inference”, the difference between “population” and “species” is purely lexical rather than statistical.

Here, however, we are going to interpret higher-level units of organization as exactly what they are as modeled by the MSC — populations, no more and no less. Unfortunately, this may result in some confusion as both BP&P and StarBeast2 refer to the higher-level units they target as “species” at least some of the time in the various program options and documentation. (In fact, BP&P, in some of the program documentation as well as n some of the various papers presenting or referencing the theory behind it acutally use the term “species” and “populations” interchangeably). This is simply the cost of doing business.

Candidate Population Units¶

A BP&P analysis requires us to identify “population” lineages as input a priori, some of which it will then collapse together to form “species” lineages. As we have noted (and as we will note again!), this terminological choice is not generally correct (it may apply in some special cases, such as single-population species systems, single-population-sample-per-species datasets, or where the data are too weak to detect population structure). We will instead consider these to be “candidate population” and “actual population” lineages respectively. That is, we will provide BP&P with the finest-grain units that could possibly be distinct populations as candidate population lineages, then use the power of the MSC model to accurately merge together our candidate populations into distinct populations (“species”, in BP&P terminology). For this analysis, we will err on the side of caution, not hestitating our over-split our candidate populations, as we can rely on the MSC to collapse them if there is insufficient gene flow restriction between them to form population boundaries. As such, we will consider every distinct geographical sample to be a distinct candidate population.

Individual	Candidate Population Assignment
L_australerasa_CA_Carson_Spur_3839 L_australerasa_CA_Carson_Spur_3840 L_australerasa_CA_Carson_Spur_3841	L_australerasa_CA_Carson_Spur
L_australerasa_CA_Homewood_Canyon_5214	L_australerasa_CA_Homewood_Canyon
L_australerasa_CA_Martin_Meadow_3838	L_australerasa_CA_Martin_Meadow
L_australerasa_CA_Mill_Creek_5212 L_australerasa_CA_Mill_Creek_5213	L_australerasa_CA_Mill_Creek
L_australerasa_CA_Nanny_Creek_3864 L_australerasa_CA_Nanny_Creek_3896	L_australerasa_CA_Nanny_Creek
L_australerasa_CA_Oyster_Lake_3844 L_australerasa_CA_Oyster_Lake_3845	L_australerasa_CA_Oyster_Lake
L_australerasa_OR_Crater_Lake_4984 L_australerasa_OR_Crater_Lake_4986	L_australerasa_OR_Crater_Lake
L_casta_AK_Ketchikan_4894	L_casta_AK_Ketchikan
L_casta_AK_Prince_of_Wales_Island_4523	L_casta_AK_Prince_of_Wales_Island
L_casta_CA_Mt_Tamalpais_3830	L_casta_CA_Mt_Tamalpais
L_casta_CA_Soda_Creek_4049	L_casta_CA_Soda_Creek
L_casta_CA_West_Branch_Mill_Creek_3703	L_casta_CA_West_Branch_Mill_Creek
L_casta_OR_Lost_Prairie_5204	L_casta_OR_Lost_Prairie
L_casta_OR_Marys_Peak_2545	L_casta_OR_Marys_Peak
L_casta_OR_School_Creek_3041	L_casta_OR_School_Creek
L_casta_WA_Taneum_Creek_1400	L_casta_WA_Taneum_Creek
L_disjuncta_BC_Summit_Creek_1896 L_disjuncta_BC_Summit_Creek_3090	L_disjuncta_BC_Summit_Creek
L_disjuncta_CA_Emerson_Creek_4122	L_disjuncta_CA_Emerson_Creek
L_disjuncta_CA_Lily_Lake_3069	L_disjuncta_CA_Lily_Lake
L_disjuncta_CA_Salmon_Creek_4133	L_disjuncta_CA_Salmon_Creek
L_disjuncta_CA_Trinity_Alps_4115	L_disjuncta_CA_Trinity_Alps
L_disjuncta_ID_Salmon_River_4780	L_disjuncta_ID_Salmon_River
L_disjuncta_MT_Mill_Creek_4716	L_disjuncta_MT_Mill_Creek
L_disjuncta_OR_Lostine_River_3848	L_disjuncta_OR_Lostine_River
L_disjuncta_OR_Mt_Hood_4143	L_disjuncta_OR_Mt_Hood
L_erasa_AK_Thompson_Pass_4059	L_erasa_AK_Thompson_Pass
L_erasa_BC_Cherryville_4002	L_erasa_BC_Cherryville
L_erasa_OR_Lost_Prairie_5197 L_erasa_OR_Lost_Prairie_5199 L_erasa_OR_Lost_Prairie_5200 L_erasa_OR_Lost_Prairie_5201	L_erasa_OR_Lost_Prairie
L_erasa_OR_Marys_Peak_2575 L_erasa_OR_Marys_Peak_2586 L_erasa_OR_Marys_Peak_2615 L_erasa_OR_Marys_Peak_2616	L_erasa_OR_Marys_Peak
L_erasa_OR_Mount_Hebo_3013 L_erasa_OR_Mount_Hebo_3016	L_erasa_OR_Mount_Hebo
L_erasa_OR_Mt_Hood_4144	L_erasa_OR_Mt_Hood
L_erasa_OR_Prairie_Peak_2580	L_erasa_OR_Prairie_Peak
L_kavanaughi_MT_Bitterroot_River_4646	L_kavanaughi_MT_Bitterroot_River
L_kavanaughi_MT_Lost_Horse_Creek_4648	L_kavanaughi_MT_Lost_Horse_Creek
L_kavanaughi_OR_Little_Philips_Creek_4998	L_kavanaughi_OR_Little_Philips_Creek
L_kavanaughi_OR_Lostine_River_4996	L_kavanaughi_OR_Lostine_River
L_kavanaughi_OR_Lostine_River_Valley_4990 L_kavanaughi_OR_Lostine_River_Valley_4992 L_kavanaughi_OR_Lostine_River_Valley_4993 L_kavanaughi_OR_Lostine_River_Valley_5000 L_kavanaughi_OR_Lostine_River_Valley_5002 L_kavanaughi_OR_Lostine_River_Valley_5006 L_kavanaughi_OR_Lostine_River_Valley_5008 L_kavanaughi_OR_Lostine_River_Valley_5010	L_kavanaughi_OR_Lostine_River_Valley
L_lindrothi_CA_Deadman_Creek_4140	L_lindrothi_CA_Deadman_Creek
L_lindrothi_CA_East_Fork_Kaweah_River_4120	L_lindrothi_CA_East_Fork_Kaweah_River
L_lindrothi_CA_Emerald_Lake_4116 L_lindrothi_CA_Emerald_Lake_4117 L_lindrothi_CA_Emerald_Lake_4118	L_lindrothi_CA_Emerald_Lake
L_lindrothi_CA_Kaiser_Pass_4121	L_lindrothi_CA_Kaiser_Pass
L_lindrothi_CA_Long_Valley_Creek_5072	L_lindrothi_CA_Long_Valley_Creek
L_lindrothi_CA_Sonora_Pass_4134	L_lindrothi_CA_Sonora_Pass
L_lindrothi_CA_South_Fork_Bishop_Creek_3568	L_lindrothi_CA_South_Fork_Bishop_Creek
L_lindrothi_CA_Tioga_Lake_4132	L_lindrothi_CA_Tioga_Lake
L_osculans_CA_Carson_Spur_3164	L_osculans_CA_Carson_Spur
L_osculans_CA_Cold_Creek_1387 L_osculans_CA_Cold_Creek_1390	L_osculans_CA_Cold_Creek
L_osculans_CA_Los_Padres_NF_3162	L_osculans_CA_Los_Padres_NF
L_osculans_CA_Nanny_Creek_3721	L_osculans_CA_Nanny_Creek
L_osculans_CA_Pike_County_Gulch_3846	L_osculans_CA_Pike_County_Gulch
L_osculans_CA_Rainbow_1401	L_osculans_CA_Rainbow
L_osculans_CA_Stanislaus_NF_3157	L_osculans_CA_Stanislaus_NF
L_osculans_CA_Strawberry_Creek_3163	L_osculans_CA_Strawberry_Creek
L_osculans_CA_Warner_Range_3161	L_osculans_CA_Warner_Range
L_osculans_OR_Eugene_4593	L_osculans_OR_Eugene
L_osculans_OR_Goodman_Creek_3158	L_osculans_OR_Goodman_Creek
L_osculans_OR_Little_Philips_Creek_5001	L_osculans_OR_Little_Philips_Creek
L_osculans_OR_Middle_Fork_Berry_Creek_3095	L_osculans_OR_Middle_Fork_Berry_Creek
L_osculans_OR_School_Creek_2638	L_osculans_OR_School_Creek
L_osculans_OR_Walton_Lake_4743	L_osculans_OR_Walton_Lake
L_probata_BC_Summit_Creek_3720	L_probata_BC_Summit_Creek
L_probata_CA_Algoma_Camp_3855	L_probata_CA_Algoma_Camp
L_probata_CA_Ellery_Lake_4138	L_probata_CA_Ellery_Lake
L_probata_CA_Middle_Martis_Creek_1161 L_probata_CA_Middle_Martis_Creek_1970	L_probata_CA_Middle_Martis_Creek
L_probata_CA_Nanny_Creek_3895	L_probata_CA_Nanny_Creek
L_probata_CA_Sherman_Pass_3730	L_probata_CA_Sherman_Pass
L_probata_CA_South_Fork_Bishop_Creek_3686	L_probata_CA_South_Fork_Bishop_Creek
L_probata_CA_Squaw_Valley_Resort_5211	L_probata_CA_Squaw_Valley_Resort
L_probata_CA_Strawberry_Creek_3832	L_probata_CA_Strawberry_Creek
L_probata_CA_Tamarack_Lake_4137	L_probata_CA_Tamarack_Lake
L_probata_CA_Warner_Range_3863	L_probata_CA_Warner_Range
L_probata_CA_White_Mountains_3833	L_probata_CA_White_Mountains
L_probata_ID_Baker_Creek_3865	L_probata_ID_Baker_Creek
L_probata_ID_Galena_Summit_3722	L_probata_ID_Galena_Summit
L_probata_ID_Park_Creek_3866	L_probata_ID_Park_Creek
L_probata_MT_Mill_Creek_4713	L_probata_MT_Mill_Creek
L_probata_MT_Prospect_Creek_4645	L_probata_MT_Prospect_Creek
L_probata_NV_Ruby_Mountains_3684	L_probata_NV_Ruby_Mountains
L_probata_OR_Little_Philips_Creek_4995	L_probata_OR_Little_Philips_Creek
L_probata_OR_Lonesome_Spring_4744	L_probata_OR_Lonesome_Spring
L_probata_OR_Lost_Prairie_3723	L_probata_OR_Lost_Prairie
L_probata_OR_Lostine_River_Valley_4991	L_probata_OR_Lostine_River_Valley
L_probata_OR_Lostine_River_Valley_5004	L_probata_OR_Lostine_River_Valley
L_probata_OR_Mt_Ashland_3165	L_probata_OR_Mt_Ashland
L_probata_OR_Odell_Creek_3867	L_probata_OR_Odell_Creek
L_probata_OR_Steens_Mountains_2724 L_probata_OR_Steens_Mountains_3717	L_probata_OR_Steens_Mountains
L_probata_UT_Shingle_Creek_4198	L_probata_UT_Shingle_Creek
L_probata_UT_Stansbury_Mtns_3601 L_probata_UT_Stansbury_Mtns_3685	L_probata_UT_Stansbury_Mtns
L_probata_UT_Tushar_Mountains_5037	L_probata_UT_Tushar_Mountains
L_probata_WA_Blue_Mountains_3854	L_probata_WA_Blue_Mountains
L_probata_WA_Taneum_Creek_1320	L_probata_WA_Taneum_Creek
L_pseudoerasa_CA_Kaiser_Pass_4139	L_pseudoerasa_CA_Kaiser_Pass
L_pseudoerasa_CA_Lily_Lake_3073	L_pseudoerasa_CA_Lily_Lake
L_pseudoerasa_CA_Sherman_Pass_3599 L_pseudoerasa_CA_Sherman_Pass_3688	L_pseudoerasa_CA_Sherman_Pass
L_pseudoerasa_CA_Strawberry_Creek_3072 L_pseudoerasa_CA_Strawberry_Creek_3083 L_pseudoerasa_CA_Strawberry_Creek_3086 L_pseudoerasa_CA_Strawberry_Creek_3087	L_pseudoerasa_CA_Strawberry_Creek
L_pseudoerasa_CA_Trinity_Alps_4114	L_pseudoerasa_CA_Trinity_Alps
L_sequoiae_CA_Bridal_Veil_Falls_3078	L_sequoiae_CA_Bridal_Veil_Falls
L_sequoiae_CA_Nanny_Creek_3702	L_sequoiae_CA_Nanny_Creek
L_sequoiae_CA_Strawberry_Creek_3075 L_sequoiae_CA_Strawberry_Creek_3085	L_sequoiae_CA_Strawberry_Creek
L_sequoiae_OR_Oakridge_2647	L_sequoiae_OR_Oakridge
L_sequoiae_OR_School_Creek_2614	L_sequoiae_OR_School_Creek
L_tuulukwa_CA_Trinity_Alps_4113	L_tuulukwa_CA_Trinity_Alps
L_tuulukwa_OR_Knowles_Creek_3700 L_tuulukwa_OR_Knowles_Creek_3701	L_tuulukwa_OR_Knowles_Creek
L_tuulukwa_OR_Marys_Peak_2581 L_tuulukwa_OR_Marys_Peak_2635 L_tuulukwa_OR_Marys_Peak_2636 L_tuulukwa_OR_Marys_Peak_2637 L_tuulukwa_OR_Marys_Peak_2642 L_tuulukwa_OR_Marys_Peak_2643 L_tuulukwa_OR_Marys_Peak_3782	L_tuulukwa_OR_Marys_Peak

Generating a Guide Tree for Population Delimitation¶

We will provide BP&P with a guide tree for its population delimitation analysis. We will use StarBeast2 to generate this guide tree.

The full StarBeast2 configuration file, generated using BEAUTi, can be found at in lionepha/01-guidetree-estimation/sb00500M.xml. The alignments listed above were imported, and the following “traits” file was used to map alignment sequences to canidate population units: lionepha/01-guidetree-estimation/traits.txt.

We used a single strict clock model across all genes, and a HKY+G model of substitution.

We ran four replicates of this analysis for 500 million generations each, sampling every 500,000 generations for a total of 1000 samples from each replicates. The first 250 samples were discarded from the 100 samples of each replicates. Convergence was diagnosed through inspection of traces for each parameter as well as the likelihood and posterior. ESS values for each parameter were established to be more than 250, and distributions of parameter values were compared to a “null” run (i.e., a run without data where just the prior was sampled) to confirm that the analysis learned from the data sufficiently to shift the posterior away from the prior.

The post-burn in samples from the posterior were summarized using SumTrees, with the following command:

$ sumtrees.py -b 250 \
            -s mcct \
            -e clear \
            -l clear \
            --force-rooted \
            --suppress-annotations \
            -r -o summary.guide.tre \
            run1/species.trees \
            run2/species.trees \
            run3/species.trees \
            run4/species.trees

This selects the Maximum Clade Credibility Tree (MCCT) tree for the summary topolgy, stripping all branch lengths and metadata annotations, to result in the following:

lionepha/01-guidetree-estimation/guidetree.nex

Delimitation of Population Units¶

Now that we have a guide tree that treats each distinct geographical lineage as a candidate distinct Wright-Fisher population, we will run BP&P in “A10” mode to delimit the true population units under the “multipopulation coalescent” (i.e., the MSC).

For Small Datasets: the Single-Analysis Approach¶

The files provided in the lionepha/02a-population-delimitation-pooled directory set up a fairly straightforward BP&P “A10” analyses using the data that we have collected and the guide tree we have estimated:

Pooled BP&P Population Delimitation Analysis¶
File	Description
bpprun.input.ctl	Control file
bpprun.input.chars.txt	Character data
bpprun.input.imap.txt	Sequence to population map
bpp00.job	Execution job

The problem is that, when executing the analysis by either running the job file:

$ bash bpp00.job

or submitting it to an execution host:

$ qsub bpp00.job

or simply running the command directly:

$ bpp --cfile bpprun.input.ctl

we might find that the data set is too large to analyze:

bpp v4.1.4_linux_x86_64, 31GB RAM, 12 cores
https://github.com/bpp
.
.
.
Total species delimitations: 23958541050464777
Unable to allocate enough memory.

For Larger Datasets: the Subtree Decomposition Approach¶

The solution is to break the data set up into subtrees and carry out a separate population delimitation analysis on each subtree. While designing a decomposition scheme, it is important to note that we should not separate out candidate population lineages that might potentially belong to the same (actual) population into separate subtrees, thereby preventing BP&P from being able to collapse them if it does not detect any gene flow restriction between them. In most studies, this should not be too difficult to identify. Even in cases where we might not know where the species boundaries are, with reference to the guide tree phylogeny we should be able to identify subtree clusters that do not disrupt population boundaries. In this case of this Lionepha study, the nominal species clades provide very nice granularity — small enough to be analyzed by BP&P, yet with no danger of breaking up an actual population.

Guide tree for Lionepha population delimitation, showing subtrees used when decomposing into separate analyses.

Figure: We cannot delimit populations for the entire data set simulatneously in BP&P due to the number of lineages. Instead, we decompose the analysis into a set of smaller analyses based on subtrees.

The set up for this set of analyses can be found at:

lionepha/02b-population-delimitation-subtrees

with each subdirectory containing a stand-alone analysis.

Collating Results of the Subtree Approach¶

Each of the subtree analysis now has the populations delimited under the MSC model. Having identified these population units across various subtrees, we now need to collate and pool them. DELINEATE helpfully provides a script for you to do this fairly robustly: delineate-bppsum. This script takes as its input two sets of files:

the “imap” files you provided to BP&P as input, which maps sequences to candidate population lineages
the output log of BP&P analyses (not the MCMC log), which provides a tree at the end with posterior probability of internal nodes indicated by labels

These files can be specified in any order, but must collectively span the entire analysis: the set of all candidate population names defined across all “imap” files must be equal to the set of all candidate population names found across all trees in all BP&P output log files.

In this example, we have all the independent subtree analyses packed away in subdirectories, “01”, “02”, etc. Assuming we are in the BP&P analysis subdirectory, lionepha/02b-population-delimitation-subtrees, we could just type in all the paths:

delineate-bppsum \
    --imap    01/bpprun.input.imap.txt \
              02/bpprun.input.imap.txt \
              .
              .
              (etc.)
    --results 01/results.out.txt \
              02/results.out.txt \
              .
              .
              (etc.)

but because of judicious naming of the files, we can use some basic shell commands to generate the list of input files:

delineate-bppsum \
    --imap $(find . -name "*imap*") \
    --results $(find . -name "*out.txt")

Executing the above command results in:

[delineate-bppsum] 11 BPP 'imap' files specified
[delineate-bppsum] - Reading mapping file   1 of 11: ./00/bpprun.input.imap.txt
[delineate-bppsum]   - (13 lineages, 7 candidate populations)
[delineate-bppsum] - Reading mapping file   2 of 11: ./01/bpprun.input.imap.txt
[delineate-bppsum]   - (9 lineages, 9 candidate populations)
[delineate-bppsum] - Reading mapping file   3 of 11: ./02/bpprun.input.imap.txt
[delineate-bppsum]   - (10 lineages, 9 candidate populations)
[delineate-bppsum] - Reading mapping file   4 of 11: ./03/bpprun.input.imap.txt
[delineate-bppsum]   - (14 lineages, 7 candidate populations)
[delineate-bppsum] - Reading mapping file   5 of 11: ./04/bpprun.input.imap.txt
[delineate-bppsum]   - (12 lineages, 5 candidate populations)
[delineate-bppsum] - Reading mapping file   6 of 11: ./05/bpprun.input.imap.txt
[delineate-bppsum]   - (10 lineages, 8 candidate populations)
[delineate-bppsum] - Reading mapping file   7 of 11: ./06/bpprun.input.imap.txt
[delineate-bppsum]   - (16 lineages, 15 candidate populations)
[delineate-bppsum] - Reading mapping file   8 of 11: ./07/bpprun.input.imap.txt
[delineate-bppsum]   - (34 lineages, 30 candidate populations)
[delineate-bppsum] - Reading mapping file   9 of 11: ./08/bpprun.input.imap.txt
[delineate-bppsum]   - (9 lineages, 5 candidate populations)
[delineate-bppsum] - Reading mapping file  10 of 11: ./09/bpprun.input.imap.txt
[delineate-bppsum]   - (6 lineages, 5 candidate populations)
[delineate-bppsum] - Reading mapping file  11 of 11: ./10/bpprun.input.imap.txt
[delineate-bppsum]   - (10 lineages, 3 candidate populations)
[delineate-bppsum] 11 BPP output files specified
[delineate-bppsum] - Reading output file   1 of 11: ./00/results.out.txt
[delineate-bppsum]   - (7 candidate populations)
[delineate-bppsum] - Reading output file   2 of 11: ./01/results.out.txt
[delineate-bppsum]   - (9 candidate populations)
[delineate-bppsum] - Reading output file   3 of 11: ./02/results.out.txt
[delineate-bppsum]   - (9 candidate populations)
[delineate-bppsum] - Reading output file   4 of 11: ./03/results.out.txt
[delineate-bppsum]   - (7 candidate populations)
[delineate-bppsum] - Reading output file   5 of 11: ./04/results.out.txt
[delineate-bppsum]   - (5 candidate populations)
[delineate-bppsum] - Reading output file   6 of 11: ./05/results.out.txt
[delineate-bppsum]   - (8 candidate populations)
[delineate-bppsum] - Reading output file   7 of 11: ./06/results.out.txt
[delineate-bppsum]   - (15 candidate populations)
[delineate-bppsum] - Reading output file   8 of 11: ./07/results.out.txt
[delineate-bppsum]   - (30 candidate populations)
[delineate-bppsum] - Reading output file   9 of 11: ./08/results.out.txt
[delineate-bppsum]   - (5 candidate populations)
[delineate-bppsum] - Reading output file  10 of 11: ./09/results.out.txt
[delineate-bppsum]   - (5 candidate populations)
[delineate-bppsum] - Reading output file  11 of 11: ./10/results.out.txt
[delineate-bppsum]   - (3 candidate populations)
[delineate-bppsum] Posterior probability threshold of 0.50: 86 populations
[delineate-bppsum] Posterior probability threshold of 0.75: 72 populations
[delineate-bppsum] Posterior probability threshold of 0.90: 65 populations
[delineate-bppsum] Posterior probability threshold of 0.95: 59 populations
[delineate-bppsum] Posterior probability threshold of 1.00: 46 populations

and produces the following files:

coalescent-pops.sb2-traits.p050.txt
coalescent-pops.sb2-traits.p075.txt
coalescent-pops.sb2-traits.p090.txt
coalescent-pops.sb2-traits.p095.txt
coalescent-pops.sb2-traits.p100.txt
coalescent-pops.summary.csv

The population boundaries and identities of the various individuals are reported at different posterior probabilities (0.50, 0.75, 0.90, 0.95, and 1.00). A comprehensive overview of all the identities under the different posterior probabilities is provided in the file: coalescent-pops.summary.csv. The other files (ending with filenames “…sb2-traits.p0xx.txt”) are StarBeast2 “traits” files at each of those posterior probability thresholds. These latter make setting up a StarBeast2 analysis to estimate an ultrametric phylogeny relating the population units delimited at each of the posterior probability thresholds very straightforward. With the latter, do not be confused by the column header “species”! Again, this is simply due to the misleading terminology adopted for the higher-level units of organization in StarBeast2 and most other programs that use the MSC. What we are working with here are population units, some of which may also correspond to species while others may not. This is abundantly clear by the fact that the 12 nominal species — established with both genomic as well as morphological data contributing toward an understanding of the system over a decade of the study — is split into 46 to 86 units (!) by the MSC analysis, depending on the posterior probability threshold we adopt for those units. Interpreting these units as “species” is categorically and unconditionally wrong, and would represent a jettisoning of a basic understanding of speciation, systematics, and biology, due to a complete misunderstanding of the MSC model and misinterpretation of its results.

The MSC delimits populations, not species.

Note that where our candidate populations assignment have indeed turned out to be distinct population lineages under a particular posterior probability threshold, the original population label is retained (e.g., “L_australerasa_CA_Martin_Meadow” or “L_probata_WA_Blue_Mountains”).

In cases where multiple candidate population lineages have been “collapsed”, because there was insufficient signal for restriction of Wright-Fisher panmixia between them detected at a particular posterior probability threshold, they have been relabeled with synthetic labels (e.g. “coalescentpop001”, “coalescentpop002”, etc.). Thus, for example, we see that, in the setup to our BP&P analysies, we allowed for the possibility that L. erasa from Mary’s Peak, Mount Hebo, Mount Hood, and Prairie Peak might each be a distinct population:

sample	BP&B Candidate Population (“Species”)
L_erasa_OR_Marys_Peak_2575	L_erasa_OR_Marys_Peak
L_erasa_OR_Marys_Peak_2586	L_erasa_OR_Marys_Peak
L_erasa_OR_Marys_Peak_2615	L_erasa_OR_Marys_Peak
L_erasa_OR_Marys_Peak_2616	L_erasa_OR_Marys_Peak
L_erasa_OR_Mount_Hebo_3013	L_erasa_OR_Mount_Hebo
L_erasa_OR_Mount_Hebo_3016	L_erasa_OR_Mount_Hebo
L_erasa_OR_Mt_Hood_4144	L_erasa_OR_Mt_Hood
L_erasa_OR_Prairie_Peak_2580	L_erasa_OR_Prairie_Peak

Following the analysis, we see that at a 0.95 posterior probability threshold (see the “p095” column in 02b-population-delimitation-subtrees/coalescent-pops.summary.csv or 02b-population-delimitation-subtrees/coalescent-pops.sb2-traits.p095.txt) there was insufficient evidence to support four distinct populations here. Population boundaries between Marys Peak, Mount Hebo, and Prairie Peak was were not strong, and all individuals here were assigned to a single merged or collapsed population, “coalescentpop007”. Only the Mount Hood individuals were separated out to be delimited as a distinct population.

sample	p095
L_erasa_OR_Marys_Peak_2575	coalescentpop007
L_erasa_OR_Marys_Peak_2586	coalescentpop007
L_erasa_OR_Marys_Peak_2615	coalescentpop007
L_erasa_OR_Marys_Peak_2616	coalescentpop007
L_erasa_OR_Mount_Hebo_3013	coalescentpop007
L_erasa_OR_Mount_Hebo_3016	coalescentpop007
L_erasa_OR_Mt_Hood_4144	L_erasa_OR_Mt_Hood
L_erasa_OR_Prairie_Peak_2580	coalescentpop007

On the other hand, however, at the 0.50 posterior probability threshold, while the Mary’s Peak and Prairie Peak candidate population lineages were still collapsed into a single population (“coalescentpop004”), the Mount Hebo and Mount Hood individuals were each estimated to form a distinct population unto themselves (see the “p050” column in 02b-population-delimitation-subtrees/coalescent-pops.summary.csv or 02b-population-delimitation-subtrees/coalescent-pops.sb2-traits.p050.txt):

sample	p050
L_erasa_OR_Marys_Peak_2575	coalescentpop004
L_erasa_OR_Marys_Peak_2586	coalescentpop004
L_erasa_OR_Marys_Peak_2615	coalescentpop004
L_erasa_OR_Marys_Peak_2616	coalescentpop004
L_erasa_OR_Mount_Hebo_3013	L_erasa_OR_Mount_Hebo
L_erasa_OR_Mount_Hebo_3016	L_erasa_OR_Mount_Hebo
L_erasa_OR_Mt_Hood_4144	L_erasa_OR_Mt_Hood
L_erasa_OR_Prairie_Peak_2580	coalescentpop004

Going forward to the next stage of analysis, we have to decide the posterior probability threshold at which we want to determine our population units. For this example, we will (somewhat arbitrarily) decide on 0.95.

Stage II. Generating the (Multipopulation Coalescent, Ultrametric) Phylogeny of Populations¶

We use StarBeast2 to estimate an ultrametric phylogeny of population lineages. We use the original set of alignments (found in lionepha/00-alignments) for as the input data for this, and a “traits” file that maps each of the sequence labels in the alignment to population identities assigned in the previous step. For this exercise, we shall use population units that were delimited with 0.95 posterior probability, and the “traits” file corresponding to this is given by coalescent-pops.sb2-traits.p095.txt. The analysis setup for the StarBeast2 run can be found in the lionepha/03-population-tree-estimation/ directory. There are six StarBeast2 XML configuration files. These all set up the same “species tree” analysis, the only difference being the name prefixes of the output files. This is so that we can run multiple independent analyses, the sin qua non of MCMC best practices.

Again, we used a single strict clock model across all genes, and a HKY+G model of substitution. Each analysis was run for 400 million generations, with a sampling frequency of 400,000 generations. Thus, we obtained 1000 samples from the posterior from each of the six replicates. We can again use SumTrees to summarize the results with the following command:

$ sumtrees.py \
    -b 250 \
    -s mcct \
    -e mean-age \
    -l clear \
    --force-rooted \
    --suppress-annotations \
    -ro lionepha-p095-hkyg.mcct-mean-age.tree \
    *.species.trees

This will discard the first 250 of the 1000 samples from each replicate as burn-in, merge the results, and select the Maximum Clade Credibility Tree (MCCT) as the summary topology, with branch lengths set such that the internal node ages are the mean of the the corresponding node ages across all post-burnin samples. The result of this can be found here: lionepha/03-population-tree-estimation/lionepha-p095-hkyg.mcct-mean-age.tree.nex. This population-level phylogeny will be one of the two primary inputs to DELINEATE.

Stage III. Assignment of Known vs. Unknown Species Identities¶

We now inspect our phylogeny and determine a priori species assignments for as many population lineages as we can. These assignments will be made with reference to our integrative understanding of the system in conjunction with the evolutionary relationships between population lineages shown in the phylogeny.

There will be some population lineages for which the species identity is uncontroversial — i.e., they can be unambiguously assigned to known species (or even new ones) based on morphological evidence of individuals in that population. In the case of the “collapsed” populations (labeled “coalescentpop001”, “coalescentpop002”, etc.), we would need to examine all individuals in those units, and if any one of them can be definitely assigned to an independent species status based on systematic evidence (i.e., distinct species from all others in the system), then the entire population lineage would get assigned to a distinct species. The reasoning behind this is that, based on previous stages of the analysis using BP&P, we have already decided that all individuals in that population constitute a single cohesive population, so the species identity of any one individual in that lineage would necessarily be shared by all other individuals in the same lineage.

There will, of course, be population units for which we will not be able to confidentally assign to a distinct species status — either a known, existing one, or to a new one. These population lineages could be distinct species unto themselves, or could be populations of other species in the system (named or otherwise). These are the actual targets of delimitation — we would to determine whether or not the boundaries between them and other lineages are species boundaries or population boundaries.

The first set of population lineages — the ones for which we can confidently assign species identities, and, hence, boundaries — constitute our “constrained” lineages. We set them as constraints when we configure the DELINEATE analysis by setting their status to “1” in the constraints table. This can be seen with, for e.g., “coalescentpop001”, “coalescentpop002”, and “L_australerasa_CA_Martin_Meadow” in the example constraints file (04-species-delimitation/lionepha.run1.tsv; see below), which all have been assigned to “australerasa” (as can be seen in the “species” column), with the “status” set to 1. This will indicate to DELINEATE that the assignment of these populations to species is fixed, and we should not consider any partition that violates this.

lineage	species	status
coalescentpop001	australerasa	1
coalescentpop002	australerasa	1
L_australerasa_CA_Martin_Meadow	australerasa	1
L_casta_CA_West_Branch_Mill_Creek	casta	1
coalescentpop005	casta	1
coalescentpop003	casta	1
coalescentpop004	casta	1
L_casta_OR_Lost_Prairie	casta	1
L_disjuncta_OR_Lostine_River	disjuncta	1
L_disjuncta_BC_Summit_Creek	disjuncta	1
L_disjuncta_CA_Emerson_Creek	disjuncta	1
L_disjuncta_MT_Mill_Creek	disjuncta	1
L_disjuncta_ID_Salmon_River	disjuncta	1
L_erasa_BC_Cherryville	erasa	1
L_erasa_OR_Lost_Prairie	erasa	1
L_erasa_OR_Mt_Hood	erasa	1
coalescentpop007	erasa	1
L_erasa_AK_Thompson_Pass	erasa	1
L_lindrothi_CA_Long_Valley_Creek	lindrothi	1
L_lindrothi_CA_Tioga_Lake	lindrothi	1
L_osculans_CA_Cold_Creek	osculans	1
coalescentpop015	osculans	1
coalescentpop014	osculans	1
coalescentpop013	osculans	1
coalescentpop012	osculans	1
L_osculans_OR_Little_Philips_Creek	osculans	1
L_osculans_CA_Strawberry_Creek	osculans	1
L_probata_MT_Mill_Creek	probata	1
coalescentpop023	probata	1
L_probata_CA_Warner_Range	probata	1
L_probata_OR_Lostine_River_Valley	probata	1
L_probata_CA_Ellery_Lake	probata	1
L_probata_OR_Mt_Ashland	probata	1
L_probata_WA_Blue_Mountains	probata	1
coalescentpop019	probata	1
coalescentpop020	probata	1
coalescentpop021	probata	1
coalescentpop022	probata	1
coalescentpop016	probata	1
coalescentpop017	probata	1
coalescentpop018	probata	1
L_pseudoerasa_CA_Kaiser_Pass	pseudoerasa	1
coalescentpop025	pseudoerasa	1
coalescentpop026	sequoiae	1
coalescentpop027	sequoiae	1

Conversely, there remain a number of population lineages that we are not sure about (we are going to pretend, for the purposes of this exercise). These populations could be populations of existing species OR they could be independent species (either as a single lineage or clusters of lineages). These are the lineages that are the (remaining) primary focus of the species delimitation analysis. In the constraints table, we set the status of these lineages to “0”, indicating that we want the DELINEATE analysis to explore all possible partitions that vary in the species assignments of these populations. Note that we still have entries in the “species” field for this “unconstrained” lineages. This is purely for our own book-keeping and will be ignored entirely by DELINEATE. We could, in principle, replace them with any text or level them blank, and they will have no difference in the program running or output.

lineage	species	status
L_disjuncta_OR_Mt_Hood	?	0
L_disjuncta_CA_Lily_Lake	?	0
coalescentpop006	?	0
coalescentpop008	?	0
coalescentpop009	?	0
L_lindrothi_CA_Emerald_Lake	?	0
coalescentpop011	?	0
L_lindrothi_CA_South_Fork_Bishop_Creek	?	0
L_probata_UT_Shingle_Creek	?	0
coalescentpop024	?	0
L_tuulukwa_CA_Trinity_Alps	?	0
L_tuulukwa_OR_Knowles_Creek	?	0
L_tuulukwa_OR_Marys_Peak	?	0

Stage IV. Delimitation of Species Units¶

We are now ready to run DELINEATE to carry out the species delimitation analysis. This will target the population lineages of unknown (“0”) status, and calculate the probabilities of all possible partitions that vary in their placement with respect to species identities. The full analytical setup can be found in the subdirectory 04-species-delimitation/. For input data, we will use the phylogeny obtained in Stage II (lionepha/03-population-tree-estimation/lionepha-p095-hkyg.mcct-mean-age.tree.nex) and the constraints table in Stage III (04-species-delimitation/lionepha.run1.tsv) as inputs to the program delineate-estimate. To execute the analysis, we run the following command:

delineate-estimate partitions \
    -P 0.99 \
    -t lionepha-p095-hkyg.mcct-mean-age.tree.nex \
    -c lionepha.run1.tsv

The “-P 0.99” flag says to only report the probabilities of all partitions that contribute collectively to over 0.99 of the probabilities. There could be thousands (or billions, depending on the number of unconstrained lineages you set) of other partitions in the 0.01 remaining tail that we choose to ignore to save disk space and time.

Note

Analysis Size and Computational Requirements

This analysis has 13 unconstrained lineages out of 56. The former (i.e., the number of unconstrained lineages) is the primary factor driving computational requirements, both in terms of time as well as memory: an analysis of 10,000 populations of which only 5 unconstrained will run faster (almost instantaneously) than one with 100 populations with 10 unconstrained. An analysis with 50 unconstrained lineages will probably never complete even with the most powerful computers thrown at it, regardless of how few or many the contrained lineages are. With this example of 13 unconstrained population lineages, we require 115 GB of computer memory and took about 35 mins to run on a fairly current (as of 2019) high-performance computing cluster. If you find that this might be too demanding for your hardware (especially the memory requirement), you can reduce the complexity of the analysis by “fixing” the status of a few more species to be constraints by changing some “0’s” to “1’s” in the status columns of the constraint file (04-species-delimitation/lionepha.run1.tsv).

The analysis will produce the following files:

lionepha.run1.delimitation-results.json
lionepha.run1.delimitation-results.trees
lionepha.run1.delimitation-results.trunc.trees

The first two are included in the example archive in compressed form: 04-species-delimitation/lionepha-delimitation-results.zip, while the last is given directly.

The JSON-format file (lionepha.run1.delimitation-results.json) lists some metadata about the analysis, followed by detailed description of all partitions considered. The partitions are listed in order of descending probability, i.e., with the partition of highest probability given first. Thus, the maximum likelihood estimate of the species delimitation is represented by the first partition listed. The second partition listed is the species delimitation of the next highest probability, and so on.

The tree file lionepha.run1.delimitation-results.trees consists of a set of trees, with one tree for each partition considered. The tree structure itself — the phylogeny, including both topology and the branch lengths — are identical across all trees. The only difference in the trees is the metadata markup of each tree, with annotations regarding species assignments of the tip (population) lineages. Each tree corresponds to the a partition listed in the JSON file, in the same order. Thus the trees illustrate the different partitions (and species assignments/boundaries of each partition), with the first tree representing the maximum likelihood estimate of the species delimitation, the second tree listed being the next most probable species delimitation and so on.

The third file lionepha.run1.delimitation-results.trunc.trees is also a tree file, but contains the just the first 10 trees from the primary results tree file (lionepha.run1.delimitation-results.trees) The results files can very large which may make viewing them challenging. To facilitate viewing/manipulating of the most probable results, we provide a smaller tree file with just the first 10 trees. As the the result entries (trees and partitions) are given in order of descending probability, the earlier results have a higher probability than the later ones, so truncating the tree file by selecting the first few like this will let you view the most probable results. Again, as with the primary tree file, the first listing in the results will have the highest probablity, and thus is MLE partition/tree.

Examination and Interpretation of the Results¶

The trees generated by DELINEATE have some special mark-up that will show up when opened in FigTree for visualization. For example, opening up the truncated trees result file (lionepha.run1.delimitation-results.trunc.trees) in FigTree and viewing the first tree, i.e. the maximum likelihood species delimitation shows the following:

Results of species delimitation analysis.

The branches painted in blue (▆) represent the constrained population lineages, i.e., lineages with a fixed species assignment. The branches painted in ochre (▆), on the other hand, represent the unconstrained population lineages, i.e., lineages with unknown species identity and whose species membership was allowed to vary. Furthermore the lineage labels are in a special format, showing the species assignment followed by the actual population label in parenthesis.

We interpret the results as follows:

Maximum Likelihood Species Delimitation¶
Lineages	Species Delimitation	Explanation
coalescentpop008 coalescentpop009	DelineatedSp002	Group together in the same NEW and unnamed species, labeled “DelineatedSp002”. That is, the boundaries between these two lineages on the one hand, and all other lineages on the other, are species boundaries, not population.
L_lindrothi_CA_Emerald_Lake coalescentpop011 L_lindrothi_CA_South_Fork_Bishop_Creek	lindrothi	Group together as populations that are part of a pre-defined species, “L. lindorothi”. That is, the boundaries between these three lineages on the one hand, and all other lindrothi lineages on the other, are (just) population boundaries, not species.
L_probata_UT_Shingle_Creek coalescentpop024	probata	Group together as populations that are part of a pre-defined species, “L. probata”. That is, the boundaries between these two lineages on the one hand, and all other probata lineages on the other, are (just) population boundaries, not species.
L_disjuncta_OR_Mt_Hood L_disjuncta_CA_Lily_Lake coalescentpop006	disjuncta	Group together as populations that are part of a pre-defined species, “L. disjuncta”. That is, the boundaries between these three lineages on the one hand, and all other disjuncta lineages on the other, are (just) population boundaries, not species.
L_tuulukwa_CA_Trinity_Alps L_tuulukwa_OR_Knowles_Creek L_tuulukwa_OR_Marys_Peak	DelineatedSp001	Group together in the same NEW and unnamed species, labeled “DelineatedSp001”. That is, the boundaries between these three lineages on the one hand, and all other lineages on the other, are species boundaries, not population.

Conferring with the reference alpha taxonomy as worked out by Maddison and Poul (2020), we see that these are perfectly concordant. The “DelineatedSp001” populations all collectively corresponded to the distinct species L. tuulukwa and “DelineatedSp002” populations all collectively corresponded to the distinct species L. kavanaughi. At the same time, the remaining unconstrained population lineages in our example indeed were referred to probata, lindrothi, and disjuncta species.