Georeferencing

When coordinates are absent or incorrect for a record, we georeference the location based on the most detailed locality description available. That is, we use the “site” or “location” variables if these are available, and administrative areas if they are not.
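As a minimal sketch of that rule, the base R code below records, for each record, the most detailed locality field that is not missing. The data are made up, and the ordering of the fields from most to least detailed (site, location, adm2, adm1) is an assumption for illustration.

```r
# hypothetical records with locality information at different levels of detail
d <- data.frame(
  site = c("Njoro farm", NA, NA),
  location = c("Kibaya", "Kongwa", NA),
  adm2 = c("Kiteto", "Kongwa", "Kiteto"),
  adm1 = c("Manyara", "Dodoma", "Manyara")
)

# assumed order, from most to least detailed
fields <- c("site", "location", "adm2", "adm1")

# loop from least to most detailed, so that the most detailed
# non-missing field is the last one written for each record
d$geo_level <- NA_character_
for (f in rev(fields)) {
  d$geo_level[!is.na(d[[f]])] <- f
}
d
```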

Note that location names are often not unique within a country. For example, there are more than 100 places called “San Juan” in Mexico or “Rampur” in India. When georeferencing locations, it is therefore very important to also consider the administrative subdivision that they are reported to be in. Furthermore, we should consider the context of the other locations in a data set; experimental and survey locations are typically clustered in a single region, or in a few. While it is possible, it is unlikely to have a cluster of sites in one part of the country and a single site at the other end of the country. Such outliers need to be inspected and reconsidered.
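One simple way to screen for such outliers is to compute the distance from each site to the center of all sites and flag the ones that are far away. The sketch below uses base R only. The coordinates are made up, the `haversine_km` helper is defined here for illustration, and the 500 km threshold is an arbitrary assumption that should be adapted to the data set.

```r
# great-circle (haversine) distance in km between points and a reference point
haversine_km <- function(lon, lat, lon0, lat0, r = 6371) {
  rad <- pi / 180
  dlon <- (lon - lon0) * rad
  dlat <- (lat - lat0) * rad
  a <- sin(dlat / 2)^2 + cos(lat * rad) * cos(lat0 * rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# hypothetical sites: three close together and one far to the south
sites <- data.frame(
  site = c("A", "B", "C", "D"),
  longitude = c(36.69, 36.55, 36.76, 35.75),
  latitude = c(-3.35, -3.60, -3.20, -10.70)
)

# distance of each site to the median center of all sites
sites$dist_km <- haversine_km(sites$longitude, sites$latitude,
                              median(sites$longitude), median(sites$latitude))

# flag sites beyond the (arbitrary) threshold for manual inspection
sites$check <- sites$dist_km > 500
sites[sites$check, ]
```

A flagged site is not necessarily wrong; it is a candidate for a second look at the locality description and the administrative subdivision.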

For background on georeferencing approaches in the biological sciences, see this best practices document.

In Carob, we use the “point-radius” georeferencing method. This is especially important when georeferencing administrative areas. These may be quite large, which makes us rather uncertain about the accuracy of the coordinates we assign, and we want to capture a measure of that uncertainty. Likewise, it is useful to know when the uncertainty about the coordinates’ accuracy is low.

In our slight “high probability” variation of the “point-radius” method, we first determine the geographic center (centroid) of the area described. If the centroid would otherwise fall outside the area (polygon), it is moved to the nearest location on the border of the area. We then express our uncertainty as the radius of the smallest circle that is centered on the centroid and includes all places where the observation may have been made (e.g., the entire administrative subdivision). In most cases, this circle will also include areas that do not match the locality description, and that is OK.
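To make the radius computation concrete, here is a rough base R sketch for a single, made-up area outline. It is not the carobiner implementation. As a simplifying assumption, the center is taken as the mean of the outline vertices rather than the true polygon centroid; this shortcut only makes sense for a simple, convex outline like this one, where that point lies inside the area. The radius is then the largest great-circle distance from the center to any vertex, in meters.

```r
# great-circle (haversine) distance in meters between points and a reference point
haversine_m <- function(lon, lat, lon0, lat0, r = 6371000) {
  rad <- pi / 180
  dlon <- (lon - lon0) * rad
  dlat <- (lat - lat0) * rad
  a <- sin(dlat / 2)^2 + cos(lat * rad) * cos(lat0 * rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# hypothetical area outline as longitude/latitude vertices
outline <- cbind(
  lon = c(36.0, 37.0, 37.2, 36.1),
  lat = c(-6.0, -6.1, -5.2, -5.0)
)

# assumed center: mean of the vertices (an approximation of the centroid)
center <- colMeans(outline)

# distance from the center to each vertex
d <- haversine_m(outline[, "lon"], outline[, "lat"],
                 center[["lon"]], center[["lat"]])

# the circle must include the farthest vertex; round up to whole meters
geo_uncertainty <- ceiling(max(d))
geo_uncertainty
```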

The detailed recommended protocols for georeferencing using the point-radius method are given in the Georeferencing Quick Reference Guide.

For administrative areas, you can use the carobiner::adm_pointRadius function, as illustrated below. In this example, we georeference the “Kiteto” and “Kongwa” districts (adm2) in Tanzania.

## get the coordinates and uncertainty for adm2 boundaries for Tanzania
xy <- carobiner::adm_pointRadius("Tanzania", 2)
head(xy)
##    country   adm1         adm2 longitude latitude geo_uncertainty
## 1 Tanzania Arusha       Arusha   36.6882  -3.3480           38107
## 2 Tanzania Arusha Arusha Urban   36.6754  -3.4380           13570
## 3 Tanzania Arusha       Karatu   35.4351  -3.5545           85516
## 4 Tanzania Arusha   Lake Eyasi   35.1100  -3.5854           44029
## 5 Tanzania Arusha Lake Manyara   35.8370  -3.5153           12866
## 6 Tanzania Arusha      Longido   36.4969  -2.7241          105762
##       geo_source
## 1 GADM 4.1, adm2
## 2 GADM 4.1, adm2
## 3 GADM 4.1, adm2
## 4 GADM 4.1, adm2
## 5 GADM 4.1, adm2
## 6 GADM 4.1, adm2

## subset to the names of interest. In practice, the names may not match perfectly,
## so this may take some more effort
s <- xy[xy$adm2 %in% c("Kiteto", "Kongwa"), ]
s
##     country    adm1   adm2 longitude latitude geo_uncertainty     geo_source
## 18 Tanzania  Dodoma Kongwa   36.5470  -6.0170           61298 GADM 4.1, adm2
## 78 Tanzania Manyara Kiteto   36.7624  -5.2287           90234 GADM 4.1, adm2
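When the names in a data set do not match the reference names exactly, some normalization usually helps before matching. The following is a minimal base R sketch with made-up spellings. The `clean` helper and the edit-distance threshold of 2 are assumptions for illustration, and any fuzzy match should be checked by hand.

```r
# names as they might appear in a data set (hypothetical spellings)
raw <- c("KITETO ", "Kongwa District", "kongwa")
# reference names, for example the adm2 column of the table above
ref <- c("Kiteto", "Kongwa", "Karatu")

# normalize: lower case, trim white space, drop a trailing "district"
clean <- function(x) trimws(gsub("district$", "", trimws(tolower(x))))

# exact match on the normalized names
i <- match(clean(raw), clean(ref))
data.frame(raw, matched = ref[i])

# fallback for a misspelled name: closest reference name by edit distance
dist <- adist(clean("Kitetto"), clean(ref))
j <- which.min(dist)
guess <- if (dist[j] <= 2) ref[j] else NA
guess
```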

## create code that can be copied to the script
carobiner::dfput(s, name="geo", drop="country")
## geo <- data.frame(
##     adm1 = c("Dodoma", "Manyara"),
##     adm2 = c("Kongwa", "Kiteto"),
##     longitude = c(36.547, 36.7624),
##     latitude = c(-6.017, -5.2287),
##     geo_uncertainty = c(61298, 90234),
##     geo_source = c("GADM 4.1, adm2", "GADM 4.1, adm2")
## )
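Once the `geo` data frame is in the script, it needs to be combined with the records. One possible way, sketched below in base R, is to match on the administrative names and fill in only the coordinates that are missing. The data set `d` is made up for illustration, and leaving the uncertainty and source empty for records that already had coordinates is an assumption about the desired outcome.

```r
# hypothetical data set; the second record already has coordinates
d <- data.frame(
  adm1 = c("Dodoma", "Manyara", "Manyara"),
  adm2 = c("Kongwa", "Kiteto", "Kiteto"),
  longitude = c(NA, 36.80, NA),
  latitude = c(NA, -5.30, NA)
)

geo <- data.frame(
  adm1 = c("Dodoma", "Manyara"),
  adm2 = c("Kongwa", "Kiteto"),
  longitude = c(36.547, 36.7624),
  latitude = c(-6.017, -5.2287),
  geo_uncertainty = c(61298, 90234),
  geo_source = c("GADM 4.1, adm2", "GADM 4.1, adm2")
)

# find the row in geo that matches each record
i <- match(paste(d$adm1, d$adm2), paste(geo$adm1, geo$adm2))

# only fill in records that lack coordinates
miss <- is.na(d$longitude) | is.na(d$latitude)
d$longitude[miss] <- geo$longitude[i[miss]]
d$latitude[miss] <- geo$latitude[i[miss]]

# record the uncertainty and source for the georeferenced records only
d$geo_uncertainty <- NA_real_
d$geo_uncertainty[miss] <- geo$geo_uncertainty[i[miss]]
d$geo_source <- NA_character_
d$geo_source[miss] <- geo$geo_source[i[miss]]
d
```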

Check the georeferences with terra::plet (combine the georeferences with existing ones if there are any) to look for outliers.

v <- terra::vect(s, crs="lonlat")
## Warning: [vect] guessed geom variables
terra::plet(v, cex=4, col="red")


# also see the admin boundaries 
g <- geodata::gadm("TZA", level=2)
terra::plet(v, cex=4, col="red") |> terra::lines(g, col="gray", lwd=1)