SDM Performance report

Summary

This document summarizes the performance of different sSDM and mSDM algorithms for 449 South American mammal species. Model performance is evaluated on six metrics (AUC, F1, Kappa, Accuracy, Precision, Recall) and analyzed along four potential influence factors (number of records, range size, range coverage, range coverage bias). The comparison of sSDM and mSDM approaches is of particular interest.

Code can be found on GitLab.

Modeling overview:

General decisions

Presence/Absence
- Random absence sampling
- Balanced number of presences and absences within core area of distribution (core)
- Additionally, 100 absences across South America per species (background)
Predictors
- 19 CHELSA BioClim variables at 1 km resolution
Evaluation
- Six metrics (AUC, F1, Kappa, Accuracy, Precision, Recall)
- Evaluation only on core samples (background samples excluded)
- Five-fold spatially blocked cross-validation
- Values shown are averaged across all five validation folds
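The threshold-based metrics listed above can all be derived from the confusion matrix of a validation fold. The following is a minimal sketch under stated assumptions, not the code used for this report: the function names are hypothetical, the decision threshold is assumed to have been applied already, and AUC is left out because it is computed from ranked scores rather than from confusion counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Threshold-based metrics from the confusion counts of one fold."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((tn + fn) / n) * ((tn + fp) / n)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


def mean_across_folds(fold_metrics):
    """Average each metric over a list of per-fold metric dictionaries."""
    keys = fold_metrics[0].keys()
    return {k: sum(m[k] for m in fold_metrics) / len(fold_metrics)
            for k in keys}


# Hypothetical confusion counts for two validation folds
folds = [classification_metrics(40, 10, 40, 10),
         classification_metrics(30, 20, 30, 20)]
averaged = mean_across_folds(folds)
```

Because presences and absences are balanced within the core area, chance agreement is close to 0.5, so Kappa and Accuracy carry closely related information in this setup.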

sSDM Algorithms

Four algorithms: Random Forest (RF), Gradient Boosting Machine (GBM), Generalized Additive Model (GAM), Neural Network (NN)
NN: Manual hyperparameter tuning, same settings across species
RF + GBM + GAM: Automated hyperparameter tuning (8 random combinations) per species

mSDM Algorithms

Three algorithms: Random Forest (MSDM_rf), Neural Network (MSDM_embed, MSDM_onehot)
Species identity is part of the input data. Internally, it is represented either as a one-hot vector (MSDM_rf, MSDM_onehot) or via an embedding (MSDM_embed).
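To illustrate the difference between the two representations, the sketch below builds one model input per representation for a single sample. It is a simplified illustration, not the implementation behind this report: the helper names, the embedding dimension of 8, and the random initialization are assumptions, and in a real neural network the embedding table would be learned during training.

```python
import random

N_SPECIES = 449      # species in this study
N_PREDICTORS = 19    # BioClim variables
EMBED_DIM = 8        # assumed value, not taken from the report


def one_hot(species_index, n_species):
    """Sparse representation: one input column per species."""
    vec = [0.0] * n_species
    vec[species_index] = 1.0
    return vec


def make_embedding_table(n_species, dim, seed=0):
    """Dense representation: one short vector per species.
    Randomly initialized here; a network would learn these values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)]
            for _ in range(n_species)]


def build_input(predictors, species_repr):
    """Concatenate environmental predictors with the species representation."""
    return list(predictors) + list(species_repr)


predictors = [0.0] * N_PREDICTORS          # placeholder predictor values
table = make_embedding_table(N_SPECIES, EMBED_DIM)

x_onehot = build_input(predictors, one_hot(42, N_SPECIES))
x_embed = build_input(predictors, table[42])
```

With these assumed sizes, the one-hot input has 19 + 449 = 468 columns, while the embedded input has 19 + 8 = 27. The embedding also allows similar species to end up with similar vectors, which a one-hot encoding cannot express.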

Key findings

mSDM algorithms score much higher across all performance metrics
Among mSDM algorithms, RF outperforms NNs significantly

Analysis

Quantify drivers of model performance

Number of records

Range size

Range size was calculated based on polygon layers from the IUCN Red List of Threatened Species (2016).

Range coverage

Species ranges were split into contiguous hexagonal grid cells of 1 degree diameter. Range coverage was then calculated as the number of grid cells containing at least one occurrence record divided by the total number of grid cells.

\[ RangeCoverage = \frac{N_{cells\_occ}}{N_{cells\_total}} \]
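The formula can be sketched in a few lines once every occurrence record has been assigned to a grid cell. This is a minimal illustration under assumptions: the function name and the integer cell IDs are hypothetical, and the assignment of records to hexagonal cells (a GIS step) is taken as given.

```python
def range_coverage(occurrence_cells, range_cells):
    """Share of range grid cells that contain at least one record.

    occurrence_cells: cell ID of each occurrence record (repeats allowed)
    range_cells: IDs of all grid cells that make up the species range
    """
    range_cells = set(range_cells)
    if not range_cells:
        raise ValueError("range must contain at least one grid cell")
    # Several records in the same cell count once; records that fall
    # outside the range polygon are ignored.
    occupied = set(occurrence_cells) & range_cells
    return len(occupied) / len(range_cells)


# Hypothetical species: range of 10 cells, 4 records in 3 distinct cells
coverage = range_coverage([1, 1, 2, 5], range(10))
```

In this example, three of ten cells are occupied, so the range coverage is 0.3.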

Range coverage bias

Range coverage bias was calculated as 1 minus the ratio of the actual range coverage to the hypothetical range coverage that would result if all observations were maximally spread out across the range.

\[ RangeCoverageBias = 1 - \frac{RangeCoverage}{\min\left(\frac{N_{obs\_total}}{N_{cells\_total}}, 1\right)} \]

Higher bias values indicate that occurrence records are spatially more clustered within the range of the species.
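A short worked example may help to interpret the bias values. The sketch below follows the formula above directly; the function name and the example counts are hypothetical.

```python
def range_coverage_bias(n_cells_occ, n_obs_total, n_cells_total):
    """1 minus the ratio of actual to maximally achievable range coverage."""
    coverage = n_cells_occ / n_cells_total
    # Best case: every record falls into its own cell, capped at full coverage
    max_coverage = min(n_obs_total / n_cells_total, 1.0)
    return 1.0 - coverage / max_coverage


# Hypothetical species with 20 records and a range of 100 grid cells
clustered = range_coverage_bias(n_cells_occ=5, n_obs_total=20, n_cells_total=100)
spread_out = range_coverage_bias(n_cells_occ=20, n_obs_total=20, n_cells_total=100)
```

With 20 records in 100 cells, the best achievable coverage is 0.2. If the records occupy only 5 cells, the actual coverage is 0.05 and the bias is 1 - 0.05 / 0.2 = 0.75. If each record lies in its own cell, the bias is 0.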