ARCHIPELAGO

Jeet Sukumaran

March 20, 2016

Contents

Contents 1

1 Introduction 3

2 Installation 4

2.1 Mandatory Pre-requisites . . . . . . . . . . . . . . . . . . . . . 4

2.2 Optional Pre-requisites . . . . . . . . . . . . . . . . . . . . . . . 4

2.3 Installation from Source . . . . . . . . . . . . . . . . . . . . . . 5

3 Usage 6

3.1 Overview of Workﬂow . . . . . . . . . . . . . . . . . . . . . . . 6

3.2 Simulation of Training Data . . . . . . . . . . . . . . . . . . . . 7

3.3 Calculation of Summary Statistics . . . . . . . . . . . . . . . . 16

3.4 Classiﬁcation of Target Data . . . . . . . . . . . . . . . . . . . 16

3.5 Proﬁling the Target Data . . . . . . . . . . . . . . . . . . . . . 16

3.6 Encoding the Target Data . . . . . . . . . . . . . . . . . . . . . 16

3.7 Encoding the Target Data Using the “archipelago-encode-trees.py”

Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2 CONTENTS

4 Walk-Through of a Basic Analysis 18

4.1 Analytical Objectives . . . . . . . . . . . . . . . . . . . . . . . 18

4.2 The Target Data . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.3 Simulating the Training Data . . . . . . . . . . . . . . . . . . . 19

4.4 Calculation of Summary Statistics . . . . . . . . . . . . . . . . 25

4.5 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

1 Introduction

“Archipelago” is the name of a generative phylogeny-based model that simulta-

neously incorporates the diversiﬁcation processes of speciation and extinction

nad the biogeographical processes of area gain (“dispersal”) and area loss (“ex-

tirpation”), with these processes being diﬀerentially regulated by ecological or

other traits that are themselves co-evolving on the phylogeny. The theory and

background to this model and its usage is described in the following paper:

This software project, archipelago, presents a suite of programs to gen-

erate and analyze data under the Archipelago model. The primary objec-

tive of the analysis is model selection: i.e., statistically identifying the model

that generated a particular set of data. The suite of programs constituting

archipelago is, thus, a computational biogeographical model selection analy-

sis package that exploits the power and ﬂexibility of the Archipelago model to

allow you to ask and answer historical biogeographical questions of a nature

and complexity that are not possible under any other approach. In partic-

ular, instead of asking questions about ancestral area patterns, you can ask

questions about processes, and how some processes (e.g., ecology) aﬀect other

processes (e.g., dispersal or speciation).

2 Installation

2.1 Mandatory Pre-requisites

• Python 2.7 or higher

• DendroPy Phylogenetic Computing Library, version 4 or above (http:

//dendropy.org).

• R (http://www.r-project.org/)

• The following R package:

– adegenet

2.2 Optional Pre-requisites

To “calibrate” the simulation biogeographical submodel, you need to estimate

the dispersal and extinction rate under the DEC or a DEC-like model, while to

calibrate the simulation trait model, you need to estimate the trait evolution

rate. archipelago provides a script that will “proﬁle” your empirical tree for

you, estimating these rates.

To take advantage of this, you need at least one of the following installed

to estimate the dispersal and extinction rate under the DEC model:

• lagrange

• BioGeoBears (R package)

However, you can also estimate these parameters yourself using BayArea or

RevBayes. In fact, I would highly recommend that you do the latter if you

have a large number of geographical areas (e.g. more than 8), or you wish to

incorporate uncertainty in the parameter estimates.

2.3. INSTALLATION FROM SOURCE 5

Similarly, the archipelago uses the R package “GEIGER” to estimate

the trait evolution rate. You can, of course, use any other method that you

prefer.

2.3 Installation from Source

Run the following in the top-level directory of the project:

python setup.py install

3 Usage

3.1 Overview of Workﬂow

Let D be the target data, consisting of an ultrametric phylogeny with each

of the tips associated with a geographic range (presence/absence over one or

more areas) as well as with set of trait states. Let M = {M

, M

, ...M

} be

a set of k models that we are interested in studying. Let S(·) be a function

that takes a set of data returns a set of summary statistics. Our analytical

objective is, for each M

, M

∈ M, estimate the probability that it generated

the target data D, relative to all the other models in M. The operational

procedure consists of the following steps:

1. Simulation of Training Data: For each M

, M

∈ M, simulate n

replicates of data, D

∗i

. We label each replicate of this data with the

name of the model that generated it (e.g., “Model i”). The set of all

data so generated constitutes the training data, D

∗

3.2. SIMULATION OF TRAINING DATA 7

2. Calculation of Summary Statistics: Calculate summary statistics

on the training data to yield, S(D

∗

), and summary statistics on the

target data to yield S(D).

3. Classiﬁcation of Target Data: Construct a Discriminant Analysis

(DA) function on principal components (PC) calculated on the train-

ing data set summary statistics, S(D

∗

); apply the discriminant analysis

function to principal components calculated on the summary statistics

of the target data, S(D).

3.2 Simulation of Training Data

We use the program, “archipelago-simulate.py”, to simulate the data un-

der each model in the set of competing models. The simulation program is

run separately for each of the models, with multiple independent replicates in

each run. Full details on this program invocation can be found by running:

archipelago-simulate.py --help

A typical invocation of the program will have the following arguments:

• “MODEL-FILE”

A JSON or Python ﬁle containing nothing but a dictionary specifying

the full model under which to simulate data. This is described in detail

in the Model Speciﬁcation section.

• “-n” or “--nreps”

The number of independent replicates to simulate under this model.

• “-o” or “--output-prefix”

The path and ﬁlename stem where the results will be saved.

For example,

archipelago-simulate.py -n 100 -o results/m1 model1.json

will run 100 replicates under the model speciﬁed in the ﬁle “model1.json”

and all output will saved to ﬁles in the “results/” subdirectory with ﬁlenames

beginning with “m1”.

The simulation program produces the following output:

8 CHAPTER 3. USAGE

• <OUTPUT-PREFIX>.focal-areas.trees

A collection of phylogenies in NEWICK format, where each phylogeny

represents the result from a single simulation replicate. The tips of

these trees represent lineages that occur in at least one of the focal

areas (in contrast to the phylogenies in the “all-areas.trees” ﬁle,

where all extant lineages are represented, regardless of where they oc-

cur). The distribution and trait states of the tip lineages are encoded

in the tip labels. This is the data that is of primary analytical inter-

est, and will be passed to the summary statistics calculation program,

archipelago-summarize.py.

• <OUTPUT-PREFIX>.all-areas.trees

A collection of phylogenies in NEWICK format, where each phylogeny

represents the result from a single simulation replicate. The tips of these

trees represent all extant lineages at the end of the simulation, regardless

of whether or they occur in the focal areas. The distribution and trait

states of the tip lineages are encoded in the tip labels.

• <OUTPUT-PREFIX>.log

A detailed log of the run.

• <OUTPUT-PREFIX>.model.json

A description of the model that the run executed. This can be used to

conﬁrm that the model used corresponded to the model that was input.

In addition, for default settings or values may have been used aspects

or parameters that were not speciﬁed in the input model ﬁle, and these

are recorded here.

3.2.1 Model Speciﬁcation

We specify the model under which we simulate data using archipelago-simulate.py

by a model ﬁle. This is a JSON ﬁle providing a single dictionary as its only

element. An example of a basic model speciﬁcation is:

{

model_id: "samplemodel1",

"areas": [

{

"label": "a1", "is_supplemental": false, "relative_diversity": 1.0,

"area_connection_weights": [0.0, 1.0, 1.0, 1.0, 1.0]

{

"label": "a2", "is_supplemental": false, "relative_diversity": 1.0,

3.2. SIMULATION OF TRAINING DATA 9

"area_connection_weights": [1.0, 0.0, 1.0, 1.0, 1.0]

{

"label": "a3", "is_supplemental": false, "relative_diversity": 1.0,

"area_connection_weights": [1.0, 1.0, 0.0, 1.0, 1.0]

{

"label": "a4", "is_supplemental": false, "relative_diversity": 1.0,

"area_connection_weights": [1.0, 1.0, 1.0, 0.0, 1.0]

{

"label": "s1", "is_supplemental": true, "relative_diversity": 1.0,

"area_connection_weights": [1.0, 1.0, 1.0, 1.0, 0.0]

}

"traits": [

{

"label": "Trait1", "nstates": 3, "transition_rate": 0.01,

"transition_weights": [

[0.0, 0.5, 0.5],

[0.5, 0.0, 0.5],

[0.5, 0.5, 0.0]

]

{

"label": "Trait2", "nstates": 2, "transition_rate": 0.01,

"transition_weights": [

[0.0, 1.0],

[1.0, 0.0]

]

}

"diversification": {

"mean_birth_rate": 0.1,

"lineage_birth_weight": {

"definition_type": "lambda_definition",

"definition": "lambda lineage: 1.0",

"description": "fixed: 1.0"

"mean_death_rate": 0.1,

"lineage_death_weight": {

"definition_type": "lambda_definition",

"definition": "lambda lineage: 1.0",

"description": "fixed: 1.0"

}

"anagenetic_range_evolution": {

"global_area_gain_rate": 0.2,

"lineage_area_gain_weight": {

10 CHAPTER 3. USAGE

"definition_type": "lambda_definition",

"definition": "lambda lineage: 1.0",

"description": "fixed: 1.0"

"mean_area_loss_rate": 0.1,

"lineage_area_loss_weight": {

"definition_type": "lambda_definition",

"definition": "lambda lineage: 1.0",

"description": "fixed: 1.0"

}

"cladogenetic_range_evolution": {

"sympatric_subset_speciation_weight": 1.0,

"single_area_vicariance_speciation_weight": 1.0,

"widespread_vicariance_speciation_weight": 1.0,

"founder_event_speciation_weight": 0.0

"termination_conditions": {

"target_focal_area_lineages": 50,

"gsa_termination_focal_area_lineages": null,

"max_time": null

}

includes/samplemodel1.json

The model dictionary consists of the following elements:

“model id”

This key takes an arbitrary string value that identiﬁes the category or

class of model being run. It will serve as the identiﬁer of all data gen-

erated under this model. Each tree generated under this model will

have the metadata item, “model id”, with this value associated with

it. Each set of summary statistics calculated on each tree will have

an additional ﬁeld, “model id“ added to it to document the generat-

ing model by the value given to this ﬁeld. The classiﬁcation programn

“archipelago-classify.py”, will expect the training data summary

statistics to have a column with this name, and it will use the values of

this column to determine the origin or source of the data.

“areas”

This key deﬁnes the geographical template. Its corresponding value is a

list of dictionaries with the following keys, with each dictionary deﬁning

a distinct area:

3.2. SIMULATION OF TRAINING DATA 11

“label”:

A unique string value giving the name or identiﬁer for the area.

“is supplemental”:

A boolean (true or false) specifying whether or not the area is

a focal area or a supplemental area. Focal areas are areas from

which the lineages are sampled, while supplemental areas are area

that interact with the focal areas ecologically, evolutionarily, and

biogeographically, but are not sampled in the target data. For

example, when analyzing a group of Western Paciﬁc islands, we

may consider the island groups to be focal areas, but Australia and

Papua New Guinea may be considered supplemental areas, because

our target data only spans lineages that have representatives in the

focal areas, though there is no doubt that the evolutionary history

of those lineages includes Australia and Papua New Guinea.

“area connection weights”:

A list of positive real values which provides the dispersal weights

from the area being deﬁned to all other areas, speciﬁed in the or-

der when they are deﬁned. Note that this includes the connection

weight to the same area as well, and this is must be set to 0.0.

If not speciﬁed, by default it is assumed that all areas are equally

connected (i.e., connection weights default to 1.0, except for the

self-connection weight, which is 0.0.).

“traits”

This key deﬁnes the traits. Its corresponding value is a list of dictionaries

with the following keys, with each dictionary deﬁning a distinct trait:

“label”:

A unique string value giving the name or identiﬁer for the trait.

“nstates”:

A non-zero positive integer specifying the number of distinct

states that the trait has.

“transition

rate”:

A positive real specifying the mean rate of change per unit time

for the trait.

“transition weights”: (optional)

A list of lists specifying the relative weights (as positive real val-

ues) for each state transition. If not given, equal rates are assumed

for all transitions.

12 CHAPTER 3. USAGE

“diversiﬁcation”

This key deﬁnes the diversiﬁcation processes (speciation and extinction).

It consists of a dictionary with the following two keys:

“mean birth rate”:

The birth rate of the branching processes under a birth-death model:

i.e., the mean per-lineage rate of speciation. This will be modiﬁed

for each lineage by the lineage birth weight function speciﬁed below,

but the simulation will maintain this mean across all lineages.

“lineage birth weight”:

A lineage-speciﬁc weight function deﬁnition (see below) specifying

the weight of the probability of speciation of this lineage relative

to the others. This will be normalized across the phylogeny to

maintain the birth rate given by mean birth rate.

“mean death rate”:

The death rate of the branching processes under a death-death

model: i.e., the mean per-lineage rate of extinction. This will be

modiﬁed for each lineage by the lineage death weight function spec-

iﬁed below, but the simulation will maintain this mean across all

lineages.

“lineage death weight”:

A lineage-speciﬁc weight function deﬁnition (see below) specifying

the weight of the probability of extinction of this lineage (i.e., the

death weight in a birth-death branching process submodel, not the

“extinction” or area loss weight of the DEC biogeographical sub-

model) relative to the others. This will be normalized across the

phylogeny to maintain the death rate given by mean death rate.

“anagenetic range evolution”

This key deﬁnes the anagenetic range evolution processes (area gain and

area loss). It consists of a dictionary with the following two keys:

“global area gain rate”:

This corresponds to the area gain rate parameter in the BayArea

model the d or “global dispersal” parameter in the DEC model.

“lineage area gain weight”:

A lineage-speciﬁc weight function deﬁnition (see below) specifying

the relative weight at which a lineage gains an area in its range rel-

ative to the other lineages. This will be normalized across the phy-

logeny to maintain the global area gain rate given by “global area gain rate”.

3.2. SIMULATION OF TRAINING DATA 13

“mean area loss rate”:

This corresponds to the area loss rate in the BayArea model the e

or “extinction” (really, extirpation) parameter in the DEC model.

“lineage area loss weight”:

A lineage-speciﬁc weight function deﬁnition (see below) specifying

the relative weight at which a lineage loses an area in its range rel-

ative to the other lineages. This will be normalized across the phy-

logeny to maintain the mean area loss rate given by “mean area loss rate”.

“cladogenetic range evolution”

This key deﬁnes the relative weights of the cladogenetic range evolution

speciation modes. It consists of a dictionary with the following keys,

with the value of each key being the weight of that mode relative to all

other modes:

“sympatric

subset speciation weight”:

Default value: 1.0.

“single area vicariance speciation weight”:

Default value: 1.0.

“widespread vicariance speciation weight”:

Default value: 1.0.

“founder event speciation weight”:

Default value: 0.0.

“termination conditions”

This key deﬁnes the conditions which determine when the simulation

should end. Its value is a dictionary consisting of one of the following

keys:

“target focal area lineages”:

A non-zero positive integer that speciﬁes the minimum number of

extant lineages that occur across all focal areas. Once this number

is reached or exceeded, the simulation terminates. Typically, if this

mode of termination is used, this will be set to the number of tips

of the target data phylogeny.

“max time”:

A positive real number that speciﬁes the maximum time the sim-

ulation should run. Typically, if this mode of termination is used,

this will be set to the root age of the target data phylogeny.

14 CHAPTER 3. USAGE

3.2.2 Lineage-Speciﬁc Weight Function

A number of process rates (speciation, extinction, area gain, area loss) are

modiﬁed by a “lineage-speciﬁc weight function”. This allows the rate of the

process to vary depending on the trait state or other attributes of the lineage.

For each process, the weight function can be speciﬁed in one of four ways:

a ﬁxed value, a (Python) lambda expression, a trait state mapping, or as a

Python function object.

3.2.2.1 Fixed-Value Lineage-Speciﬁc Weight Function

With a ﬁxed-value lineage-speciﬁc rate function, the rate of the process is the

same, ﬁxed, constant value regardless of lineage. That is, all lineages uncondi-

tionally have the same relative rate for the process throughout their lifetimes

in the simulation. A ﬁxed-value lineage-speciﬁc rate function is speciﬁed by

providing a dictionary like the following as the value for the process key:

{

"definition_type": "fixed_value",

"definition": 1.00,

}

The value assigned to “deﬁnition” does not really matter, as it will be nor-

malized across all lineages and multipled by the mean or global rate for the

process: the key point here is that all lineages get the same weight.

3.2.2.2 Trait State Mapping Lineage-Speciﬁc Weight Function

Here, you provide a list of weight values for that processes parameter, with

a one-to-one correspondence to the states of an trait deﬁned previously. For

example, if you have deﬁned a trait with the label “color” and three states:

{"label": "color", "nstates": 3, "transition_rate": 0.01}

you could deﬁne a process that takes on the is weighted by 0.0 if the “color”

trait of a lineage was in state 0, 0.5 if the trait was in state 1, and 1.0, if the

trait was in state 2 by specifying:

{

"definition_type": "trait_state_index_map:color",

"definition": [0.0, 0.5, 1.0],

}

3.2. SIMULATION OF TRAINING DATA 15

These weight represent the rate of the process for a lineage relative to others

based on the trait state. The overall rate of the process across the phylogeny

remains the same; it is just “distributed” across lineages diﬀerentially based

on the state of the “color” trait associated with the lineage.

3.2.2.3 Lambda-Expression Weight Function

The weight function is given as a Python lambda function deﬁnition. The

lambda function should take a single argument, an object of the “Lineage”

class, and return a real value representing the weight with which to modify

the process for this lineage, relative to other lineages, at this particular point

in time. For example, the following replicates the above trait state mapping:

{

"definition_type": "lambda_definition",

"definition": "lambda lineage: 0.0 if lineage.traits_vector[0] == ’0’ else

(0.5 if lineage.traits_vector[0] == ’1’ else 1.0)"

}

Note that trait types are 0-based numeric indexes, while trait states are char-

acter values, “0”, “1”, “2“, etc. archipelago internally uses character values

to specify trait states as this allows for special states such as “NA” or “null”.

Again, these weights represent the rate of the process for a lineage relative to

others based on the trait state, and the overall rate of the process across the

phylogeny remains the same.

As another example, the following inspects the distribution vector of a

lineage and returns a weight of 1.0 if the ﬁrst area is occupied and 0.5 if not:

{

"definition_type": "lambda_definition",

"definition": "lambda lineage: 1.0 if lineage.distribution[0] == 1 else

0.5"

}

In constrast to trait states, absence/presence in an (0-based indexed) area is

indicated by numeric values, 0 and 1, respectievly.

16 CHAPTER 3. USAGE

3.3 Calculation of Summary Statistics

3.4 Classiﬁcation of Target Data

3.5 Proﬁling the Target Data

3.6 Encoding the Target Data

To calculate the summary statistics or proﬁle the target data using archipelago,

we need to communicate the trait and geographical data associated with the

tips of the the target phylogeny to the program. This is achieved by encoding

the information in the labels of the phylogeny. The tip label for each lineage

consists of three components separated by a carat (“^”) character:

s68^1.2.0^0011

The ﬁrst component (“s68” in the above example) is an arbitrary species

index that uniquely identiﬁes each lineage.

The second component (“1.2.0” in the above example) is the vector of trait

states for the species. The states of each trait for the lineage is represented by

an integer index separated by a period. In this example, there are three trait

types, with the value of the ﬁrst trait type “1”, the second “2”, and the third

“0”. Note that the ﬁrst trait state is indexed by 0, not 1. That is, if there are

“n” states, then the set of valid trait state indexes is {0, 1, 2, ..., n − 1}. All

lineages in the system must have the same number of trait types (three, in

this example), and the trait state values are assumed to be given in sequential

order.

The third component (“0011” in the above example) is the geographical

presence/absence vector for the lineage, where “0” indicates the absence of a

lineage from the area, and “1” indicates the presence. In this example, the

lineage is absent from the ﬁrst and second areas, but present in the third

and fourth. All lineages must have their presences or absences speciﬁed in all

areas.

Consider a system given by the following phylogeny:

[&R] ((A:1,B:1):3,(C:3,(D:2,E:2):1):1);

where the traits are represented by the following character matrix:

A 001

B 011

C 201

3.7. ENCODING THE TARGET DATA USING THE

“ARCHIPELAGO-ENCODE-TREES.PY” UTILITY 17

D 100

E 211

and the distribution over six areas is represented by the following incidence

matrix:

A 110001

B 010011

C 101101

D 001100

E 011111

We would then encode this information in the phylogeny as follows:

[&R] ((A^0.0.1^110001:1,B^0.1.1^010011:1):3,(C^2.0.1^101101:3,(D

^1.0.0^001100:2,E^2.1.1^011111:2):1):1);

3.7 Encoding the Target Data Using the

“archipelago-encode-trees.py” Utility

Given the proper input, the program “archipelago-encode-trees.py“ will

produce appropriately encoded trees for you, removing the tedium of the en-

coding task. This program requires the following inputs:

• A phylogeny tree ﬁle, in Newick or Nexus format, speciﬁed using the

“-p” or “--phylogeny” option.

• A traits data ﬁle, in Phylip or Nexus format, speciﬁed using the “-t” or

“--traits” option.

• A geography data ﬁle, in Phylip or Nexus format, speciﬁed using the

“-g” or “--geography” option.

It writes the encoded trees to the standard output, which you can store in a

ﬁle using the “gt” redirection operator. For example:

archipelago-encode-trees.py -p phylogeny.tre -g geography.dat -t traits.

dat > encoded-phylogeny.trees

The ﬁle “encoded-phylogeny.trees“ will now contain the tree with tip labels

encoded with the geography and traits data.

4 Walk-Through of a Basic

Analysis

4.1 Analytical Objectives

Island i

Open

Habitat

Interior

Habitat

Island j

Open

Habitat

Interior

Habitat

CONSTRAINED

Island i

Open

Habitat

Interior

Habitat

Island j

Open

Habitat

Interior

Habitat

UNCONSTRAINED

As an example of a very ba-

sic analysis, let us consider

testing whether or not dis-

persal is constrained by habi-

tat association of a lineage

(Figure). We propose two

diﬀerent models, each corre-

sponding to a diﬀerent dis-

persal regime. In the “un-

constrained” model, lineages

are free to disperse between islands regardless of their habitat aﬃnities. In

the “constrained” model, ony lineages associated with the disturbed or open

habitats can disperse between islands, while lineages associated with interior

habitats are not able to disperse. The habitat aﬃnities of the lineages them-

selves are modeled as a discrete binary trait that evolves stochasitcally over

the phylogeny. That is, each lineage has a habitat trait, which can either be

in the “disturbed” or “interior” state, and the state of this trait determines

whether or not the lineage is capable of dispersing in the “constrained” model.

Thus, in the unconstrained model, lineages disperse between interior habitats

on diﬀerent islands directly. In the constrained model, on the other hand,

lineages cannot disperse between interior habitats directly: they have to move

into a disturbed habitat association in ecological space over evolutionary time,

at which phase they are capable of dispersing into new islands, and then re-

invade the interior habitats on those new islands by, again, moving into them

in ecological space over evolutionary time.

4.2. THE TARGET DATA 19

4.2 The Target Data

We will analyze the data set discussed in Sukumaran et al. 2015. This is a data

set of birds in Wallacea, originally presented in Carstensen et al. 2013. We

will be focussing on a subset of 365 species for which suﬃcient ecological data

is available that allows us to categorize them into either “open” (disturbed)

or “undisturbed” habitat species. The data, model ﬁles, etc. for this analy-

sis can be found in the “examples/wallacea bird dispersal by habitat/”

subdirectory of the project directory. The subdirectory “original-data” has

the full phylogeny of 549 species (“master tree.nexus”), from the Jetz et al.

(????) paper as well as a comma-separated value formatted ﬁle with the ge-

ographical, ecological, and other information from the Carstensen et al. 2013

paper.

4.3 Simulating the Training Data

We will simulate data under two classes of models: “unconstrained” and “con-

strained”. In this analysis, we will be using the basic DEC biogeography model

as our bigeography (sub)model. We will need to specify the following rates:

• The speciation or birth rate

In the DEC model as implemented by lagrange and BioGeoBEARS,

speciation is not an explicit process: the model is conditioned on the

speciation event timings and structure given by the input phylogeny,

which is (typically) taken as truth. However, in Archipelago, we explic-

itly model the diversiﬁcation process and sample phylogenies as part of

the training data. While Archipelago supports a full birth-death model,

we will not be modeling lineage extinction directly under this model.

Instead, following the DEC model, we will model lineage extinction as

part of the biogeographical submodel, with a lineage going extinct when

its range is reduced to the null set, i.e. when it has lost all areas from its

range. The speciation process is a nuisance process with respect to our

study question, and both “unconstrained” and “constrained” classes of

models will be set to the same ﬁxed speciation rates. We will estimate

the speciation rate under a pure-birth Yule model using the empirical

target phylogeny.

• The trait transition rate

20 CHAPTER 4. WALK-THROUGH OF A BASIC ANALYSIS

We use a simple equal-rates (ER) trait transition submodel. As with

the diversiﬁcation submodel, the trait evolution process is nuisance pro-

cess with respect to our study question, and both “unconstrained” and

“constrained” classes of models will be set to the same ﬁxed trait tran-

sition rate, as estimated from the empirical target data.

• The global rate of area gain

This corresponds to the “dispersal” or “d” parameter in the DEC model.

We will estimate this rate based on the empirical target data under

the basic DEC model, using lagrange, BioGeoBEARS, BayArea, or

RevBayes. In the “unconstrained” class of model, this rate will be used

by all lineage equally. In the “constrained” class of model, we will use

a trait-state mapping function which will return a modiﬁer weight of 0

if the lineage habitat trait is in “interior” state and a modiﬁer weight

of 1.0 the lineage trait is “disturbed” state. This means that lineages

associated with the “interior” habitat will not disperse and colonize any

new areas, while lineages associated with the “disturbed” state will. The

actual rate of the area gain process for each lineage associated with the

“disturbed” habitat state will vary, but will fulﬁll the following con-

straints:

1. All lineages with the “disturbed” habitat state will have the same

rate of area gain at any point in time

2. The phylogeny-wide area gain rate will be the “global rate of area

gain“ speciﬁed.

This means, that, at typically stationary distribution under an equal-

rates transition model, where on the average half the lineage are in the

“disturbed” state, the actual rate of area gain per “disturbed”-area lin-

eage will be approximately twice the global area gai rate to compensate

for the fact that half the lineages in the phylogeny are not dispersing.

• The global rate of area loss

This corresponds to the “extinction” (extirpation) or “e” parameter in

the DEC model. We will estimate this rate based on the empirical tar-

get data under the basic DEC model, using lagrange, BioGeoBEARS,

BayArea, or RevBayes. The rate of area loss is the same under both the

“unconstrained” and “constrained” models, and thus the estimated rate

will be used as a ﬁxed rate for this process under both these models.

4.3. SIMULATING THE TRAINING DATA 21

4.3.1 Estimating the Rate Parameters from the Empirical

Data to Calibrate the Simulations

Setting the parameters for the various processes (diversiﬁcation, area gain,

area loss, etc.) is known as calibrating the model. As noted above, this is

typically done by setting the rates to estimates based on the target data. The

archipelago package provides a program, “archipelago-profile.py” that

conveniently does this for you. It requires that the target phylogeny labels

have trait and geographical data encoded. The subdirectory “target-data”

of the Wallacean example directory has the original phylogeny pruned down

to the 365 species that are the focus of this study, with trait and geographical

data encoded in the labels. We can run the proﬁler program on this phylogeny

to yield the training data process rate parameters:

archipelago-profile.py --estimate-dec-lagrange target-data/wallacea.newick

if we have (the C++) lagrange installed, or, otherwise:

archipelago-profile.py --estimate-dec-biogeobears target-data/wallacea.newick

if we want to estimate the DEC model parameters using BioGeoBEARS.

This program will provide a “proﬁle” of this phylogeny (provide MLE’s

of speciation, trait transition, area loss, and area gain rates). The result is a

comma-separated value (CSV) formatted-ﬁle with one entry per tree passed

to it. As here we are only using a single tree, there is only one data row. The

results can be seen in the ﬁle “wallacea.profile.csv” which includes the

following information:

Field Value

num.tips 365

root.age 106.998428

pure.birth.rate 0.05482309568853121

trait.1.est.transition.rate 0.062247

area.est.transition.rate 2.396912

biogeob ears.dec.dispersal.rate 0.018156246

biogeob ears.dec.extinction.rate 0.002645152

4.3.2 Setting up the Training Data Models

We will create two models to generate the training data, “unconstrained” and

“constrained”. The ﬁles corresponding to these models can be found in the

22 CHAPTER 4. WALK-THROUGH OF A BASIC ANALYSIS

“examples/wallacea bird dispersal by habitat/training-data-models”

subdirectory of the project: “unconstrained.json” and “constrained.json”,

respectively.

The ﬁrst entry in each model ﬁle is the identiﬁer for each of the models:

unconstrained.json:

{

"model_id": "unconstrained",

}

constrained.json:

{

"model_id": "constrained",

}

These identiﬁers are crucial to allow the training data items to be as-

signed to the correct generating model, which itself is crucial for the proper

construction of the discriminant analysis functions. The model identiﬁers will

be assigned to the each tree generated by the simulations under each model,

and will be propagated through rest of the pipeline by other programs.

Both models share the same geographical template, consisting of four focal

areas and two supplemental areas, and are coded identically. The four focal

areas represent the four island “modules” in the original data: Banda Sea,

Lesser Sundas, Maluku, and Sulawesi. The two supplemental areas represent

the two continental sources: Sundaland (Asia) and the Sahul Shelf (Australia

and Papua New Guinea).

"areas": [

{ "is_supplemental": true, "label": "b1" },

{ "is_supplemental": true, "label": "b2" },

{ "is_supplemental": false, "label": "a1" },

{ "is_supplemental": false, "label": "a2" },

{ "is_supplemental": false, "label": "a3" },

{ "is_supplemental": false, "label": "a4" }

Both models also share the same trait regime: a single two-state trait, rep-

resenting the habitat associated with each lineage. We use the trait transition

rate estimated under an equal-rates model on the original data as the training

data simulation trait transition rate.

"traits": [

{ "label": habitat, "nstates": 2, "transition_rate": 0.062247}

Both models also share the same diversiﬁcation regime, and we use the birth

rate estimated from the original data as the ﬁxed, constant birth rate for both

models. We also set the death rate to a ﬁxed, constant rate of 0.0 for both

models. (The Archipelago model distinguishes between the death rate of the

4.3. SIMULATING THE TRAINING DATA 23

birth-death branching process, which is sometimes referred to as the “extinc-

tion” rate, and the rate of area loss, which is also called the “extinction” rate

in the original DEC model description and related literature. Here, following

the DEC model, we do not explicitly model extinction of lineages, but instead

model extinction implicitly through modeling of the rate of area loss.)

"diversification": {

"mean_birth_rate": 0.05486829846551938,

"mean_death_rate": 0.0

We do not specify lineage-speciﬁc weight functions, and these default to equal

and ﬁxed weights across all lineages, equivalent to:

"lineage_birth_rate": {

"definition_type": "fixed_value",

"definition": 1.0

"lineage_death_rate": {

"definition_type": "fixed_value",

"definition": 1.0

The anagenetic rate of area gain is where the models diﬀer. Both models

actually share the same global area gain rate as estimated on the original data:

"anagenetic_range_evolution": {

"global_area_gain_rate": 0.018156246,

But the overall rate is distributed across lineages diﬀerently in the two models

based on the state of the habitat trait of the lineages.

In the “unconstrained” model, all lineages share the same rate:

"anagenetic_range_evolution": {

"global_area_gain_rate": 0.018156246,

"lineage_area_gain_rate": {

"definition_type": "fixed_value",

"definition": 1.0

24 CHAPTER 4. WALK-THROUGH OF A BASIC ANALYSIS

However, in the “constrained” model, we only allow lineages that are as-

sociated with a state index of “0” (which we are arbitrarily assigning to mean

the “open” or “disturbed” state) are allowed to disperse:

"anagenetic_range_evolution": {

"global_area_gain_rate": 0.018156246,

"lineage_area_gain_rate": {

"definition_type": "trait_state_index_map:habitat",

"definition": [ 1.0, 0.0 ]

We use a trait state mapping function to specify this regime, where lineages

that are associated with a state index of “1” are prohibited from dispersing

by assigning them a rate modiﬁer weight of 0.0. Here, we identify the trait

by the label that we assigned to it in the trait deﬁnition (“habitat”) in the

“deﬁnition type” ﬁeld, and for the “deﬁnition” ﬁeld we provide a list of values,

with each element in the list specifying the rate rate for the state with the

corresponding index, so that the ﬁrst rate is associated with the ﬁrst trait

state, the second rate is associated with the second trait state, and so on.

4.3.3 Running the Simulations

We use the program “archipelago-simulate.py” to simulate the training

data. We generate 200 replicates under each of the models, saving the results

to appropriately named output locations:

archipelago-simulate.py --nreps 200 --output-prefix unconstrained_training

unconstrained.json

archipelago-simulate.py --nreps 200 --output-prefix constrained_training

unconstrained.json

The above commands will generate two sets of output ﬁles, one beginning with

the preﬁx “unconstrained training” and the other beginning with the preﬁx

“constrained training”. See the output ﬁle description section for details on

the ﬁles produced. Here, we are interested in the ﬁles with the extension

“.focal-areas.trees“: these are the phylogenies that we need for the next

stage. The above commands can be run on multiple computers simulata-

neously, or launched on the a multi-core computers multiple times in parallel,

with all the “.focal-areas.trees“ collected and passed to the next stage:

summary statistic calculation.

4.4. CALCULATION OF SUMMARY STATISTICS 25

4.4 Calculation of Summary Statistics

We calculate summary statistics on both the training data as well as the

original empirical data. We use the program “archipelago-summarize.py”

to calculate the summary statistics. This program takes the path of one or

more phylogenies as arguments and produces a comma-separated value ﬁle of

summary statistics as output, with one row of summary statistics for each

phylogeny passed as input. While not a mandatory argument, explicitly spec-

ifying the output ﬁlenames using the “-o” or “--output-filepath” option

is a good idea; otherwise, output gets written to the standard output in-

stead of being saved to a ﬁle. The input phylogenies must be in NEWICk

format, and have trait and geographical data encoded in the labels. The

simulation program “archipelago-simulate.py” does this naturally, and,

in addition, tags the trees with a model identiﬁer label in such a way that

“archipelago-summarize.py” recognizes this and adds it to the CSV ﬁle in a

way that allows the next program in the pipeline, “archipelago-classify.py”,

to use it. Note that we must calculate the summary statistics of the training

data separately from the target data, and save the results of each to separate

ﬁles.

To calculate the summary statistics for the training data we issue the fol-

lowing command (assuming that the simulations in the previous step wrote

their output to ﬁles with “unconstrained training” and “constrained training”

preﬁxes:

archipelago-summarize.py --output-filepath summaries/training.summaries.csv

unconstrained_training.focal-areas.tree constrained_training.focal-areas.

tree

This will result in a ﬁle with the name “training.summaries.csv” in the

summaries subdirectory that contains all the training data summary statistics

for the next stage.

To calculate the summary statistics for the target data we issue the fol-

lowing command:

archipelago-summarize.py --output-filepath summaries/target.summary.csv target

-data/wallacea.newick

This will result in a ﬁle with the name “target.summary.csv” in the summaries

subdirectory that contains the target data summary statistics for the next

stage.

26 CHAPTER 4. WALK-THROUGH OF A BASIC ANALYSIS

4.5 Model Selection

The ﬁnal stage of the analysis is where we attempt to classify the target data

with respect to the generating model, i.e. model selection. We use the program

“archipelago-classify.py” for this. This program takes, as a minimum, a

path to a target data summary statistics ﬁle and one or more paths to training

data summary statistics ﬁles. The ﬁrst positional argument is take to be the

target data summary statistics ﬁle while the remaining positional arguments

are taken to be the training data summary statistics ﬁles. As above, it is

a good idea to explicitly specify and output ﬁlepath using the “-o” option

to save the output to a ﬁle. In addition, it is highly recommended to use

the “--describe-trained-model” option to save details about the machine

learning model used to classify the target data. We also need to tell the

classiﬁer how many principal components to use to build the discriminant

analysis function. We will use the “--optimize-npca” to tell the classiﬁer to

use as the number of principal components that maximize the probability of

correctly (re-)classifying the training data, with penalty factor of 0 (i.e., do not

apply a penalty weight for each principal component used). So, for example,

we run our the classiﬁcation program on the data that we have generated

above using the following command:

archipelago-classify.py --optimize-npca 0 -o result.csv --describe-trained-

model dapc-details.txt summaries/target.summary.csv summaries/training.

summaries.csv

This primary classiﬁer output, written to “result.csv” in the above, is

comma-separated value ﬁle which include the following ﬁelds:

assign

The value of this ﬁeld is the identifer (name or label) of the generating

model that most probably generated the data.

posterior.<MODEL ID>

This is the relative weight in support of the model with the identiﬁer

“<MODEL ID>”.