GBIF's distribution data are built from the contributions of many individuals and organizations worldwide, but they might not be synchronized with the current state of expert knowledge about species distributions. Some geographical regions might be under-represented in the GBIF database (DATA GAP). Conversely, new discoveries and considerable range extensions might be recorded in GBIF sources before they are incorporated into other sources of knowledge (KNOWLEDGE GAP).
What it does
I developed a simple R script to assess spatial gaps in the species distribution data available in GBIF. I use range maps as a source of expert knowledge to validate the geographical extent of occurrence data from GBIF. I summarize the results in a triangular plot that identifies three possible outcomes: overlap, data gaps and knowledge gaps. The measurements are quantitative and allow direct comparison, while the visual representation gives a simplified qualitative overview.
This will be helpful for several key audiences of GBIF, especially for the GBIF network, data holders, biological knowledge experts and data users.
For the GBIF network and data holders, this tool could help to point out which species are under-sampled or under-represented in their databases, and to spot inconsistencies in ranges that might be related to taxonomic problems (wrong identification or changes in taxonomy). This can be crucial for prioritizing areas or taxa that require additional digitization and publication efforts, and for identifying potential partners that can help to mobilize additional data to fill the existing gaps.
For biological knowledge experts, this tool can be used as a basic indicator of the quality of GBIF distribution data for species of interest. It can also suggest areas where additional research is needed to provide more data, or where the existing data can complement and improve our current knowledge of species distribution ranges.
For data users, this tool provides a first assessment of the available evidence for evaluating species ranges and identifies areas of uncertainty that can be addressed with more sophisticated tools of spatial and statistical analysis.
How I built it
I assembled code and functions from different R-packages to build a simple tool to bring new insights to GBIF data users and managers.
My approach is based on simple comparisons of polygon overlaps. I consider two sources of distribution knowledge: expert opinion and GBIF data. Published range maps from different sources can be used as a proxy for expert opinion. I use the alpha-hull method to convert species occurrence records downloaded from GBIF into a comparable distribution range.
The resulting GBIF range (set G) is compared to existing range maps based on expert opinion (set E) for the corresponding species. Using set theory, there are three possible regions:
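As a rough illustration of turning occurrence points into a range polygon, here is a minimal Python sketch. It is not the R code used in this project, and it deliberately swaps in a convex hull for the alpha hull: the convex hull is a cruder stand-in that cannot represent concave or disjoint ranges, which is the reason an alpha hull is preferable in practice. The occurrence coordinates are made up, and the area is computed on planar coordinates rather than on the globe.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order.

    NOTE: a convex hull is used here only as a simplified stand-in
    for the alpha hull described in the text.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula on planar coordinates (not a geodesic area)."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical occurrence records as (x, y) pairs.
occurrences = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2), (3, 2)]
gbif_range = convex_hull(occurrences)
```

For these made-up points, the hull keeps only the four corner records and discards the three interior ones.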
- the intersect between both sets (E intersect G, or OVERLAP region)
- the expert range without GBIF data (E-G, or DATA GAP region)
- the range with GBIF data/hull not included in the expert's range (G-E, or KNOWLEDGE GAP region)
If both sources overlap strongly, then the area in (E intersect G) will be much larger than the areas in (E-G) and (G-E). In this optimal case, GBIF occurrence records can be used with confidence. A mismatch between sources can have multiple interpretations. A larger (E-G) usually represents a lack of sampling or digitization effort. A larger (G-E) might have positive or negative implications: on the one hand, GBIF data might be providing new distribution records that complement the expert range; on the other hand, the mismatch can be due to errors in identification, changes in taxonomy or differences in nomenclature between sources.
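The three regions can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the R code used in this project: both ranges are assumed to be already rasterized onto the same grid of equal-area cells, so each range is a set of cell indices and area is approximated by counting cells. The function name `gap_proportions` and the two example ranges are hypothetical.

```python
def gap_proportions(expert_cells, gbif_cells):
    """Share of the combined range falling in each of the three regions.

    ASSUMPTION: both ranges are sets of (col, row) indices on the same
    equal-area grid, so cell counts stand in for polygon areas.
    """
    e, g = set(expert_cells), set(gbif_cells)
    total = len(e | g)
    if total == 0:
        raise ValueError("both ranges are empty")
    return {
        "overlap": len(e & g) / total,        # E intersect G
        "data_gap": len(e - g) / total,       # E-G
        "knowledge_gap": len(g - e) / total,  # G-E
    }


# Hypothetical ranges: two rectangles that share three columns of cells.
expert = {(x, y) for x in range(0, 6) for y in range(4)}  # 24 cells
gbif = {(x, y) for x in range(3, 8) for y in range(4)}    # 20 cells
result = gap_proportions(expert, gbif)
```

Because the three regions partition the union of E and G, the three proportions always sum to one, which is what makes them suitable for a triangular plot.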
Accomplishments that I'm proud of
I use a triangular plot to summarize the proportion of area in each of these three regions. This is a simple visual aid for assessing the status of GBIF data for a species and for deciding how to improve it. It can be used to compare different sources of expert ranges, or to compare species within a genus, family or higher taxonomic rank.
The GBIF network, data users, data holders and biological knowledge experts can see at a glance whether the results for a particular species or group of species point to OVERLAP, DATA GAP or KNOWLEDGE GAP, and decide which actions to take.
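To show how three proportions become one point in a triangular plot, here is a minimal Python sketch of the standard barycentric conversion. It is not the plotting code used in this project, and the placement of the corners (DATA GAP bottom left, KNOWLEDGE GAP bottom right, OVERLAP at the top) is an assumption made for the example, as is the function name `ternary_to_xy`.

```python
import math


def ternary_to_xy(overlap, data_gap, knowledge_gap):
    """Map three non-negative values to a point in an equilateral triangle.

    ASSUMED corner layout: data_gap at (0, 0), knowledge_gap at (1, 0),
    overlap at (0.5, sqrt(3) / 2).
    """
    total = overlap + data_gap + knowledge_gap
    if total <= 0:
        raise ValueError("at least one value must be positive")
    a = overlap / total        # weight of the top corner
    c = knowledge_gap / total  # weight of the bottom-right corner
    x = c + 0.5 * a
    y = (math.sqrt(3) / 2) * a
    return x, y


# A species with mostly overlapping ranges plots near the top corner.
point = ternary_to_xy(overlap=0.8, data_gap=0.1, knowledge_gap=0.1)
```

A species whose point sits near one corner is dominated by that outcome, while a point near the centre indicates a mix of all three.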
What's next for GBIFgaps
The analysis can be extended to wider taxonomic groups and to include the temporal dimension of data availability (for example, estimating how fast we are closing our data gaps).
The visual aid of the triangular plot could be included in existing visualizations of GBIF occurrence data.