37 Supporting data: statistical models

37.1 Introduction to statistical models for estimating animal distribution

The distribution of marine megafauna at sea has been difficult to study because few human observers are present in the marine environment. Besides using tracking devices to follow individuals, researchers have used a range of sophisticated modelling techniques to predict where marine megafauna occur at sea. Such species distribution models are widely used to predict the occurrence of plants and animals on land, and have been adapted to the marine environment for at least three decades.

Providing detailed guidance on how to implement species distribution models for marine megafauna is beyond the scope of this toolkit. Instead, we provide a high-level overview of these models as a potential alternative to identifying important sites from tracking data.

37.2 How do statistical models for estimating animal distribution work (broadly speaking)?

Species distribution models generally work by relating some observations about the presence of a species to environmental variables, and then using maps of these variables and the estimated relationships to project where a species may occur.

For seabirds at sea (and other marine megafauna), the observations are generally provided by either direct observations from a vessel (research surveys, opportunistic sightings, museum records etc.), or from tracking data. These observations are then combined with either sampled or random locations where the target species was not observed or recorded, and then related to static (depth, latitude, longitude, distance from coast, seamounts etc.) or dynamic (sea temperature, salinity, chlorophyll concentration, wind, wave height etc.) environmental variables.
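The workflow described above can be illustrated with a minimal sketch, assuming simulated data rather than real sightings. Presence points are contrasted with random background points, and a logistic regression (a simple GLM) relates the contrast to two environmental variables. The variable names, the simulated preference function and all numbers are invented for illustration, and the fitted values should be read as relative suitability, not as an absolute probability of occurrence.

```python
import math
import random

random.seed(1)  # reproducible toy data


def simulated_preference(depth_km, sst_c):
    """Invented 'true' preference, used only to simulate sightings."""
    z = 2.0 - 1.5 * depth_km - 0.4 * (sst_c - 12.0)
    return 1.0 / (1.0 + math.exp(-z))


def random_location():
    """Environment at a random point: (depth in km, sea surface temperature in deg C)."""
    return (random.uniform(0.0, 4.0), random.uniform(5.0, 20.0))


# Presence points: random locations kept in proportion to the simulated preference.
presence = []
while len(presence) < 300:
    env = random_location()
    if random.random() < simulated_preference(*env):
        presence.append(env)

# Background points: random locations with no information on occurrence.
background = [random_location() for _ in range(300)]

rows = presence + background
labels = [1] * len(presence) + [0] * len(background)


def standardise(values):
    """Centre and scale a covariate; return the scaled values, mean and SD."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values], mean, sd


depth_z, depth_mean, depth_sd = standardise([r[0] for r in rows])
sst_z, sst_mean, sst_sd = standardise([r[1] for r in rows])

# Logistic regression fitted by plain gradient descent on the mean log-loss.
b0 = b_depth = b_sst = 0.0
n = len(rows)
for _ in range(500):
    g0 = g_depth = g_sst = 0.0
    for d, t, y in zip(depth_z, sst_z, labels):
        p = 1.0 / (1.0 + math.exp(-(b0 + b_depth * d + b_sst * t)))
        g0 += p - y
        g_depth += (p - y) * d
        g_sst += (p - y) * t
    b0 -= g0 / n
    b_depth -= g_depth / n
    b_sst -= g_sst / n


def relative_suitability(depth_km, sst_c):
    """Project the fitted relationship onto new environmental conditions."""
    d = (depth_km - depth_mean) / depth_sd
    t = (sst_c - sst_mean) / sst_sd
    return 1.0 / (1.0 + math.exp(-(b0 + b_depth * d + b_sst * t)))


print(f"depth coefficient: {b_depth:.2f}, SST coefficient: {b_sst:.2f}")
print(f"shallow, cool site: {relative_suitability(0.5, 8.0):.2f}")
print(f"deep, warm site:    {relative_suitability(3.5, 18.0):.2f}")
```

In a real analysis the environmental values would be extracted from gridded data at each observation and background location, and `relative_suitability` would be applied to every cell of the environmental maps to produce the projected distribution.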

A wide range of model types is available for constructing species distribution models. They vary in complexity, data requirements, assumptions, and the way in which the relationships between observations and environmental variables are estimated.

37.3 Examples of what you can use distribution models for

Species distribution models generally return spatially explicit predictions of the habitat suitability or probability of occurrence of the target species, which can be shown on a map to visualise the projected distribution of a species. These maps can then be used to inform the identification and protection of important sites at sea in the same way as the outputs created using the track2kba protocol detailed in this toolkit.

We do not, however, recommend that users propose KBAs or IBAs based solely on predictions from species distribution models. Rather, predicted distributions can be useful for:

Strategically allocating survey effort to areas where there is a high probability of occurrence,
Delimiting site boundaries for sites that have been confirmed as important by other data sources, and
Extrapolating the number of individuals potentially using a site based on environment-abundance relationships.

37.4 Further detail to statistical models for estimating animal distribution

Predictive models are based on the principle that equations and rule-sets can be constructed to represent how a species’ distribution is related to environmental conditions (Aarts, Fieberg, and Matthiopoulos (2012); Robinson et al. (2017)). Species distribution models (also often referred to as ‘habitat suitability models’ or ‘resource selection functions’) for seabirds have been used for more than 20 years and were historically based on observations obtained from vessels. More recently, species distribution models have been based on the locations that tracked individuals of a species used at sea, which has resulted in a much broader distribution of ‘presence observations’ over a wider range of the marine realm than had been possible using vessel-based observation data (Matthiopoulos et al. (2022)). In addition, models are now frequently used to predict not only where a species occurs, but also where specific behaviours occur, e.g. where seabirds are foraging rather than just occurring in transit (Boyd et al. (2015)).

Species distribution models infer relationships between environment and occurrence by contrasting environmental conditions at locations where a species or a particular behaviour is observed with conditions at locations where it is not observed. However, for highly mobile species like seabirds, defining and interpreting what ‘not observed’ actually means can be very difficult, and can affect the quality and interpretability of species distribution models.

In contrast to sessile plants, it is virtually impossible to objectively determine whether a seabird species is truly ‘absent’ (= never occurs) at a given point in the sea. The fact that no birds of that species were seen during a research cruise at that location, or that none of the tracked individuals visited that location, does not rule out that other individuals of the same species occur there at other times. Many seabird distribution models therefore rely on ‘presence-only’ data, which are the least robust form of data because they cannot differentiate between the limit of a species’ distribution and the limit of the sampling effort.

One of the key decisions when modelling the distribution of seabirds at sea is therefore to select the right contrast and appropriate data to model the contrast (Matthiopoulos et al. (2022)). If survey effort is sufficient to use non-detection data as contrast to locations where a species was observed, then so-called detection – non-detection models can be fitted, which generally permit stronger inference about the probability that a species occurs at a certain location. For tracking data, this can be the case if only certain behaviours are modelled (where the occurrence of ‘foraging’ can be contrasted with all other behaviours), but more frequently the observations will be contrasted with a set of random points for which no information exists on whether the species did or did not occur there. These random points can be created based on virtual tracks that mimic the movements of seabirds (Žydelis et al. (2011)). Presence-only data can be combined with other data sources to inform the underlying distribution and the observation process (Matthiopoulos et al. (2022)), but they need to be interpreted with caution given that there are natural limits of inference (Hastie and Fithian (2013)).

The spatial and environmental extent and variability of the random locations against which observations are contrasted will have a great influence on the model and on the inferred relationships between the presence of a species or behaviour and the environment. The scale at which inference is sought (e.g. do we want to know where species X occurs in the world, or which bay they prefer for foraging from colony Y) is critical in guiding the choice of environmental data and the selection of background or non-detection points, to ensure that the model can yield information at the desired scale of inference.

37.5 Examples of different statistical modelling approaches for predicting distribution

Popular and widely used approaches to predict seabird distributions or behaviours at sea are:

Generalised Linear Models (GLM; Huettmann and Diamond (2001)),
Generalised Additive Models (GAM; Critchley et al. (2020); Warwick-Evans et al. (2022); Žydelis et al. (2011)),
mixed-effects implementations of GLM and GAM that allow for the serial dependence of observations through the incorporation of random effects (GLMM or GAMM; Chimienti et al. (2017); Gilman et al. (2014); Waggitt et al. (2016); Weber et al. (2021)),
Maximum Entropy (MaxEnt; Hodges, Erikstad, and Reiertsen (2022); Krüger et al. (2017); Lemos et al. (2023)),
Boosted Regression Trees (BRTs; Evans, Lea, and Hindell (2021); Humphries (2015); Torres et al. (2015)),
Random Forests (RFs; Boyd et al. (2015); Diop et al. (2018); Huettmann et al. (2011); Mikami et al. (2022)),
point process models (I. W. Renner et al. (2015); Wakefield et al. (2017)), and
ensembles of multiple models with average predictions based on the performance of each model (Fox et al. (2017); Häkkinen et al. (2021); Lavers et al. (2014); Lieske, Fifield, and Gjerdrum (2014); Oppel et al. (2012); Pereira et al. (2018); M. Renner et al. (2013); Scales et al. (2016)).

Ultimately, however, the choice of the algorithm and how the contrast is selected will have a greater effect on the resulting model predictions than the spatial and temporal resolution of the data used to train the model (Quillfeldt et al. (2017)).
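The ensemble approach listed above can be sketched as follows, assuming each model has already produced predictions for a set of test observations and for the map cells of interest. The model names, the numbers, and the choice of AUC minus 0.5 as the performance weight are illustrative assumptions only; published ensembles use a variety of performance metrics and weighting schemes.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def weighted_ensemble(map_predictions, weights):
    """Weighted average of per-cell predictions across models."""
    total = sum(weights.values())
    n_cells = len(next(iter(map_predictions.values())))
    return [
        sum(weights[m] * map_predictions[m][i] for m in map_predictions) / total
        for i in range(n_cells)
    ]


# Hypothetical test data: 1 = species detected, 0 = not detected.
test_labels = [1, 1, 1, 0, 0, 0]

# Hypothetical predictions of three models for the six test observations.
test_predictions = {
    "glm": [0.9, 0.6, 0.4, 0.5, 0.2, 0.1],
    "brt": [0.8, 0.7, 0.6, 0.3, 0.2, 0.1],
    "rf": [0.7, 0.3, 0.2, 0.6, 0.4, 0.1],
}

# Hypothetical predictions of the same models for three map cells.
map_predictions = {
    "glm": [0.2, 0.5, 0.9],
    "brt": [0.1, 0.6, 0.8],
    "rf": [0.5, 0.5, 0.5],
}

aucs = {m: auc(p, test_labels) for m, p in test_predictions.items()}

# Assumed weighting: a model no better than random (AUC <= 0.5) gets zero weight.
weights = {m: max(a - 0.5, 0.0) for m, a in aucs.items()}

ensemble = weighted_ensemble(map_predictions, weights)

for m in aucs:
    print(f"{m}: AUC = {aucs[m]:.3f}, weight = {weights[m]:.3f}")
print("ensemble prediction per cell:", [round(v, 3) for v in ensemble])
```

Weighting by performance means that a poorly performing model contributes little to the final map, but the ensemble is only as trustworthy as the test data used to derive the weights.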

If tracking data are considered as a presence-only dataset for the purpose of a distribution model, it is important to consider a species’ ecology and the spatial and temporal resolution of the tracking data to avoid the inclusion of sections of the track where the individual is not actively ‘using’ the associated environment, but is just passing through. A range of bespoke analytical techniques are available to identify different behaviours from tracking data (e.g. state-space models, hidden Markov models, expectation-maximisation binary clustering etc.), and each of these methods comes with its associated assumptions and limitations, which we do not elaborate on here. Useful overviews of the methods that can be used to identify behaviour are: Bennison et al. (2018); Browning et al. (2018); Garriga et al. (2016); McClintock and Michelot (2018); Patterson et al. (2019).

Identification of “hotspots” can also be achieved mathematically with metrics such as Getis-Ord and Maximum Curvature (Cleasby et al. (2020); Requena et al. (2020)).
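As a rough illustration of the Getis-Ord approach, the sketch below computes the Gi* statistic for each cell of a small grid of counts. The grid values are invented, and the neighbourhood definition (the focal cell plus its up to eight adjacent cells, with binary weights) is an assumption; real analyses choose the neighbourhood to match the scale of interest. The 1.96 cut-off corresponds to a nominal 5% two-sided level and does not account for multiple testing. Maximum Curvature is not shown here.

```python
import math


def getis_ord_gi_star(grid):
    """Return a grid of Gi* z-scores using a binary 3x3 neighbourhood."""
    n_rows, n_cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)

    z = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            local_sum = 0.0
            w_sum = 0  # number of cells in the neighbourhood (weights are 0 or 1)
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        local_sum += grid[rr][cc]
                        w_sum += 1
            # With binary weights, the sum of squared weights equals w_sum.
            denom = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
            z[r][c] = (local_sum - mean * w_sum) / denom
    return z


# Invented counts of tracked locations per grid cell.
counts = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 6, 8],
    [0, 0, 5, 9, 7],
    [0, 0, 0, 6, 8],
]

z_scores = getis_ord_gi_star(counts)
hotspots = [
    (r, c)
    for r in range(len(counts))
    for c in range(len(counts[0]))
    if z_scores[r][c] > 1.96
]

for row in z_scores:
    print(" ".join(f"{v:6.2f}" for v in row))
print("cells flagged as hotspots:", hotspots)
```

Cells inside the cluster of high counts receive large positive z-scores, while the isolated count of 1 in the upper left does not, because Gi* evaluates each cell together with its neighbours.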

37.6 Using statistical models to estimate distribution in unsampled locations

Species distribution models are generally evaluated with independent test data, i.e. observations that were not used to develop the model but are then used to evaluate how accurate the predictions of the model are. Predictions generally become less accurate the farther the test data are, in space and in environmental conditions, from the data that were used to build the model. Seabird species distribution models have so far not transferred very well across different regions (Diop et al. (2018); Torres et al. (2015)), indicating that explaining the distribution of a species in one region will not necessarily allow for an accurate prediction of where that species may occur in another region, even if similar environmental data are available.

We therefore do not recommend using predicted distributions that are based on models constructed without any input data from the target species population. When guided by expert knowledge, however, such predictions might be considered for strategically allocating survey effort to areas where there is a high probability of occurrence of the species.
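One possible way to keep test data spatially separated from training data is to hold out whole spatial blocks in turn, sketched below. The record format `(longitude, latitude, observed)`, the coordinates and the 5-degree block size are hypothetical, and the sketch ignores the longitude wrap at ±180°. Model fitting and scoring are left out; the point is only how the split is formed.

```python
import math
from collections import defaultdict


def block_of(lon, lat, size_deg):
    """Identify the square spatial block that contains a location."""
    return (math.floor(lon / size_deg), math.floor(lat / size_deg))


def leave_one_block_out(records, size_deg=5.0):
    """Yield (block, train_indices, test_indices), holding out one block at a time."""
    blocks = defaultdict(list)
    for i, (lon, lat, _observed) in enumerate(records):
        blocks[block_of(lon, lat, size_deg)].append(i)
    for block, test_idx in sorted(blocks.items()):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(records)) if i not in held_out]
        yield block, train_idx, test_idx


# Hypothetical observations: (longitude, latitude, 1 = detected / 0 = not detected).
records = [
    (-14.2, 7.9, 1),
    (-13.1, 8.4, 0),
    (-8.7, 7.2, 1),
    (-7.5, 9.9, 0),
    (-13.9, 2.1, 1),
    (-12.3, 3.3, 0),
]

folds = list(leave_one_block_out(records, size_deg=5.0))
for block, train_idx, test_idx in folds:
    print(f"block {block}: train on {train_idx}, test on {test_idx}")
```

A model would then be fitted on the training indices of each fold and scored on the held-out block. Larger blocks place the test data farther from the training data and therefore give a more conservative picture of how well the model transfers.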

37.7 Examples of using predicted species distribution outputs for conservation purposes

Predicted species distributions can be used in conjunction with other outputs of this toolkit, or across multiple species or seasons, to prioritise marine areas based on systematic spatial planning approaches. Popular tools such as [Zonation](https://cbig.github.io/zonator/) and [prioritizr](https://prioritizr.net/index.html) allow users to overlay predicted distribution maps for several species (or the same species in several seasons) and then determine the areas at sea that are most valuable for protecting all species (or a species in all seasons). They do so with algorithms that trade off the size of the area that needs to be protected against the minimum amount of habitat that needs to be protected for each species or season (Dias et al. (2017); Oppel et al. (2012)).
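The trade-off between protected area and per-species habitat targets can be illustrated with a simple greedy heuristic, sketched below. This is not the algorithm used by Zonation or prioritizr, which rely on more sophisticated ranking and optimisation methods; the species names, suitability values and the 60% target are invented for illustration.

```python
EPS = 1e-9  # tolerance for floating-point comparisons


def greedy_prioritisation(suitability, target_fraction):
    """Select cells until each species has target_fraction of its total
    predicted suitability inside the selection. Assumes every species has
    a positive total suitability."""
    species = list(suitability)
    n_cells = len(suitability[species[0]])
    totals = {s: sum(suitability[s]) for s in species}
    shortfall = {s: target_fraction * totals[s] for s in species}

    def gain(cell):
        # Contribution of a cell to the targets that are still unmet,
        # expressed as a share of each species' total suitability.
        return sum(
            min(suitability[s][cell], shortfall[s]) / totals[s]
            for s in species
            if shortfall[s] > EPS
        )

    selected = []
    remaining = list(range(n_cells))
    while remaining and any(v > EPS for v in shortfall.values()):
        best = max(remaining, key=gain)
        if gain(best) <= EPS:
            break  # no remaining cell helps any unmet target
        selected.append(best)
        remaining.remove(best)
        for s in species:
            shortfall[s] = max(0.0, shortfall[s] - suitability[s][best])
    return selected


# Invented predicted suitability for two species across six grid cells.
suitability = {
    "species_a": [0.9, 0.6, 0.1, 0.0, 0.3, 0.1],
    "species_b": [0.0, 0.1, 0.6, 0.8, 0.4, 0.1],
}

selected = greedy_prioritisation(suitability, target_fraction=0.6)
protected = {
    s: sum(vals[c] for c in selected) / sum(vals)
    for s, vals in suitability.items()
}

print("selected cells:", selected)
for s, frac in protected.items():
    print(f"{s}: {frac:.0%} of predicted suitability protected")
```

In this toy example three of the six cells are enough to reach the target for both species. Dedicated planning software additionally accounts for factors such as cost, connectivity and the optimality of the solution, which a greedy heuristic does not guarantee.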

37.8 Further resources

Help us populate this list

Some further resources to consider include:

BRT guide: Elith, Leathwick, and Hastie (2008)
Data mining and statistical models: Hochachka et al. (2007)
MaxEnt: Merow, Smith, and Silander (2013)
Community science data: Johnston et al. (2021)