Spatial collection bias is ubiquitous in biodiversity data and has recently been identified as one of the major challenges to using such data for modeling applications in ecology, evolution, and conservation science (Anderson et al., 2016). Natural history collections and citizen science datasets alike (e.g., eBird and iNaturalist) may be spatially aggregated because of preferential collection near field stations, because contributing institutions operate in a single region, or because of accessibility (e.g., collections along roadsides [Kadmon et al., 2004] or on public lands). Remote or otherwise difficult-to-access areas tend to be under-sampled in general. Other areas may be poorly represented in biodiversity databases because the institutions that collect data there have not contributed to those databases. The overall result is an uneven representation of the true distribution of most taxa on Earth. However, these data are samples of real patterns that are driven by biotic and abiotic factors. The objective of the method proposed here is to use available data with known biases to generate samples of hypothetical occurrence data based on patterns in environmental variation.
Biodiversity data are a good example of “variety”, the debatable fourth “V” of “Big Data” (see De Mauro et al., 2015). Collection data mobilized by GBIF come from a heterogeneous array of source institutions and are based on many types of collections (e.g., specimen-based records vs. human observations). The reasons for collection vary widely, from systematic surveys to random chance (iNaturalist observations may be a good example of the latter). Identification and spatial uncertainty are highly variable, depending on the source institution, region or country, age of the record, and who identified the sample.
In ecological modeling, correlative species distribution models (SDMs) are commonly used to generate more complete spatial distributions from primary biodiversity data (Peterson et al., 2011). However, uneven sampling can bias the environmental preferences these models estimate, so spatial biases usually need to be corrected before modeling (e.g., Kramer-Schadt et al., 2013). Current practice in distribution modeling focuses primarily on reducing the influence of sampling bias, either by sampling occurrence records evenly from within a given background (e.g., Boria et al., 2014; Hijmans and Hall, 2016) or by using a layer to mask unsampled or under-sampled areas in model inputs (Phillips et al., 2009).
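The even-sampling idea can be illustrated with a minimal grid-thinning sketch in Python. This is only one simple way to thin records and is not the implementation used in any of the works cited above; the function name `grid_thin`, the `cell_size_deg` parameter, and the example coordinates are hypothetical.

```python
import math
import random


def grid_thin(records, cell_size_deg, seed=0):
    """Keep one randomly chosen record per grid cell.

    records: list of (lon, lat) tuples in decimal degrees.
    cell_size_deg: grid cell width and height in degrees (an assumed,
    illustrative parameter; a real analysis would choose it deliberately).
    """
    rng = random.Random(seed)
    cells = {}
    for lon, lat in records:
        # Assign each record to the grid cell that contains it.
        key = (math.floor(lon / cell_size_deg), math.floor(lat / cell_size_deg))
        cells.setdefault(key, []).append((lon, lat))
    # One record per occupied cell reduces the weight of dense clusters.
    return [rng.choice(group) for group in cells.values()]


# Hypothetical records: three clustered points and one isolated point.
records = [(-105.01, 40.01), (-105.02, 40.02), (-105.03, 40.03), (-100.5, 35.5)]
thinned = grid_thin(records, cell_size_deg=1.0)
print(thinned)
```

Thinning of this kind discards records in over-sampled cells but cannot add information for cells that were never visited, which is the gap the proposed method aims to address.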
Given this highly variable sampling, the true distribution of the world’s biodiversity is largely unknown in both geographic and environmental terms. As with other highly complex distributions for which we only have a sample of the data, estimating the true distribution can be treated as an NP-complete or NP-hard problem, because the number of possible combinations of presences and absences grows exponentially with the number of candidate localities. The number of occurrence localities for any given species is unknown; that is, not all individuals have been observed. Testing models built on all combinations of potential localities (i.e., all unique points on the Earth) is not computationally feasible.
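To give a sense of scale, the short Python sketch below counts the presence/absence configurations for a discretized landscape. It assumes the landscape is divided into a finite number of candidate cells, each either occupied or not, which is a simplification introduced here for illustration; `n_configurations` is a hypothetical helper name.

```python
def n_configurations(n_cells):
    """Number of distinct presence/absence maps over n_cells candidate cells.

    Each cell is either a presence or an absence, so there are 2**n_cells maps.
    """
    return 2 ** n_cells


for n in (10, 100, 1000):
    # Report the size of the count rather than the count itself.
    digits = len(str(n_configurations(n)))
    print(f"{n} cells -> a {digits}-digit number of configurations")
```

Even a coarse grid of a few hundred cells therefore rules out exhaustive enumeration, which motivates a heuristic search.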
The solution proposed here is to develop a heuristic search algorithm that identifies a near-optimal set of simulated occurrences in geographic space that are similar, in terms of distance in environmental space, to the initial sample of known occurrence localities. The primary difference between the proposed method and existing SDM methods is that the proposed method updates the model and the occurrence sample in tandem, producing both a likelihood model and a simulated sample of occurrence localities. The goal is not necessarily to replace SDMs, but rather to provide an alternative sample of occurrence records for use in SDM applications, one that is based only on environmental suitability and that potentially provides a better representation of the “true” distribution of the species.
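As a rough illustration of what a heuristic search over simulated occurrences might look like, the Python sketch below uses generic stochastic hill climbing. It is not the algorithm implemented in vegdistmod, and it does not reproduce the tandem model update described above. The objective (mean environmental distance to the nearest known occurrence), the function names, and the example data are all assumptions made for this sketch.

```python
import math
import random


def env_distance(a, b):
    """Euclidean distance between two environmental vectors (assumed pre-scaled)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cost(selected, candidates, known_env):
    """Mean distance from each selected cell to its nearest known occurrence."""
    total = 0.0
    for idx in selected:
        total += min(env_distance(candidates[idx], k) for k in known_env)
    return total / len(selected)


def hill_climb(candidates, known_env, n_sim, n_iter=2000, seed=0):
    """Search for n_sim candidate cells that are environmentally close to known_env."""
    rng = random.Random(seed)
    # Start from a random set of simulated occurrences.
    selected = rng.sample(range(len(candidates)), n_sim)
    best = cost(selected, candidates, known_env)
    for _ in range(n_iter):
        pos = rng.randrange(n_sim)
        new_idx = rng.randrange(len(candidates))
        if new_idx in selected:
            continue  # keep the simulated occurrences unique
        trial = selected.copy()
        trial[pos] = new_idx
        trial_cost = cost(trial, candidates, known_env)
        # Accept the swap only if it lowers the cost.
        if trial_cost < best:
            selected, best = trial, trial_cost
    return selected, best


# Hypothetical landscape: 100 cells with two scaled environmental variables.
candidates = [(t / 10, p / 10) for t in range(10) for p in range(10)]
# Hypothetical environmental values at known occurrence localities.
known_env = [(0.2, 0.3), (0.25, 0.35), (0.3, 0.3)]

selected, final_cost = hill_climb(candidates, known_env, n_sim=5)
print(sorted(selected), round(final_cost, 3))
```

A search like this only ever evaluates a small fraction of the possible occurrence sets, trading a guarantee of optimality for tractability. A fuller version would also refit the suitability model as the sample changes.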
See attached materials for the description of the proposed algorithm.
See the R library "vegdistmod" and the associated GitHub repository for an implementation of the proposed algorithm, particularly the functions findlocal() and geo_findlocal().
Check out the README.R file for a demonstration of the algorithm!