Geospatial data analysis in R

Spatial autocorrelation and review

Josh Merfeld

KDI School

November 19, 2024

What are we doing today?

  • Spatial correlations and regressions

  • Review of the semester

Spatial correlations

  • Why do we care about spatial correlations?
  • We saw it last week: it helps with interpolation!
  • For regression: correlation matters for variance estimates (standard errors)

Let’s start with temporal correlations

  • Let’s look at temporal correlations, first!
Code
villages <- vect("week9files/villages.shp")
villages
 class       : SpatVector 
 geometry    : polygons 
 dimensions  : 19930, 57  (geometries, attributes)
 extent      : 3368652, 3683711, 413647.6, 775201.1  (xmin, xmax, ymin, ymax)
 source      : villages.shp
 coord. ref. : Kalianpur 1975 / India zone I (EPSG:24378) 
 names       :          shrid2  year pca_no_hh pca_tot_p pca_tot_m pca_tot_f
 type        :           <chr> <num>     <num>     <num>     <num>     <num>
 values      : 11-06-069-0035~  1991       110       516       295       221
               11-06-069-0035~  1991        85       578       333       245
               11-06-069-0035~  1991       112       770       391       379
 pca_m_06 pca_f_06 pca_p_06 pca_m_sc (and 47 more)
    <num>    <num>    <num>    <num>              
       55       58      113       56              
       53       35       88       20              
       83       92      175       44              
Code
# let's look at pca_p_lit (literate population)
villages$literate <- villages$pca_p_lit/villages$pca_tot_p

Lagged literacy

  • We want to look at values in different years
Code
table(villages$year)

1991 2001 2011 
6609 6660 6661 
Code
villages <- villages |>
  group_by(shrid2) |>
  arrange(shrid2, year) |>
  mutate(lit_lag = lag(literate)) |>
  ungroup()

Lagged literacy

Code
ggplot() +
  geom_point(data = as_tibble(villages) |> filter(year>1991), 
    aes(x = lit_lag, y = literate, color = as.factor(year)), alpha = 0.5) +
  scale_color_brewer("Year", palette = "Set2") +
  labs(x = "Lagged literacy", y = "Literacy") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  theme_bw()

Literacy in 1991 and 2001

Checking temporal correlation

  • Let’s check the temporal correlation using cor
Code
cor(villages$literate, villages$lit_lag, use = "pairwise.complete.obs")
  • The correlation is 0.875
    • This is very high!

How might we visualize spatial correlation?

  • How can we visualize spatial correlation?
  • Let’s look at the literacy in just one year: 2001

How might we visualize spatial correlation?

Code
ggplot() +
  geom_spatvector(data = villages[villages$year==2001,], 
    aes(fill = literate), color = NA) +
  scale_fill_distiller("Literacy", palette = "Spectral")

How might we visualize spatial correlation?

And the other two years?

Spatial correlation is more complicated!

  • Temporal correlation is one-dimensional: just correlate today with yesterday

  • Spatial correlation is two-dimensional: we need to consider neighbors

Let’s start simple

  • Let’s look at neighbors
    • Note: we are going to use a smaller section for this (due to time)
    • villagessmall.shp is a subset of villages.shp
Code
villagessmall <- vect("week9files/villagessmall.shp")
# let's also just use 2001
villagessmall <- villagessmall[villagessmall$year==2001,]
mat <- adjacent(villagessmall, symmetrical = TRUE, type = "touches")
head(mat)
     from  to
[1,]    1   2
[2,]    1  57
[3,]    1 177
[4,]    1 178
[5,]    3   4
[6,]    3   5

Let’s start simple

  • Matrix of neighbors, let’s get POINTS from centroids
Code
xy <- centroids(villagessmall[villagessmall$year==2001,])
points1 <- as_tibble(geom(xy[mat[,1],])[,c("x", "y")])
points2 <- as_tibble(geom(xy[mat[,2],])[,c("x", "y")])

ggplot() +
  geom_spatvector(data = villagessmall, fill = NA, color = kdisgray) +
  geom_spatvector(data = xy, color = kdisgreen, size = 0.5) +
  geom_segment(aes(x = points1$x, y = points1$y, xend = points2$x, yend = points2$y), color = kdisgreen, lwd = 0.1)

Let’s start simple

How do we turn this into a number?

  • Okay, great, but it doesn’t really tell us anything

  • Here is what we are going to do:

    • We are going to create a matrix that says whether villages are neighbors
    • We are going to use this to calculate Moran’s I

Adjacency matrix

Code
villagessmall <- villagessmall[villagessmall$year==2001,]
mat <- adjacent(villagessmall, pairs = FALSE, type = "touches")
dim(mat)
[1] 962 962
Code
villagessmall
 class       : SpatVector 
 geometry    : polygons 
 dimensions  : 962, 57  (geometries, attributes)
 extent      : 3608626, 3683711, 413647.6, 513038.4  (xmin, xmax, ymin, ymax)
 coord. ref. : Kalianpur 1975 / India zone I (EPSG:24378) 
 names       :          shrid2  year pca_no_hh pca_tot_p pca_tot_m pca_tot_f
 type        :           <chr> <num>     <num>     <num>     <num>     <num>
 values      : 11-06-085-0041~  2001      1219      5246      3201      2045
               11-06-085-0041~  2001      1083      4641      2851      1790
               11-06-086-0041~  2001       166      1028       549       479
 pca_m_06 pca_f_06 pca_p_06 pca_m_sc (and 47 more)
    <num>    <num>    <num>    <num>              
      508      480      988      296              
      420      373      793      363              
       94       89      183      134              

Moran’s I

\[I = \frac{n}{\sum_{i=1}^n (x_i - \bar{x})^2} \frac{\sum_{i=1}^n \sum_{j=1}^n w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^n \sum_{j=1}^n w_{ij}}\]

  • It looks complicated! We need:
    • The mean of the variable (\(\bar{x}\))
    • The number of observations (\(n\))
    • The weights matrix (\(w_{ij}\)) - we already have this!

Calculations

Code
villagessmall$literacy <- villagessmall$pca_p_lit/villagessmall$pca_tot_p
n <- nrow(villagessmall)
x <- villagessmall$literacy
xbar <- mean(x)

# now we are going to create (x_i - xbar) * (x_j - xbar)
dx <- x - xbar
# expand.grid creates all possible combinations
g <- expand.grid(dx, dx)
dim(g)
[1] 925444      2
Code
length(dx)*length(dx)
[1] 925444

Calculations

Code
# now multiply! for all pairs
xixj <- g[,1] * g[,2]

# now turn it into a matrix:
pm <- matrix(xixj, ncol = n)
dim(pm) # length times length!
[1] 962 962
Code
# multiply by weights matrix; they are zeros if not adjacent!
pmw <- pm*mat
dim(pmw)
[1] 962 962

Calculations

\[I = \frac{n}{\sum_{i=1}^n (x_i - \bar{x})^2} \frac{\sum_{i=1}^n \sum_{j=1}^n w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^n \sum_{j=1}^n w_{ij}}\]

  • pmw is \(\sum_{i=1}^n \sum_{j=1}^n w_{ij} (x_i - \bar{x})(x_j - \bar{x})\) (part of the numerator!)
  • sum(pmw) is the sum of this matrix
  • sum(mat) is the sum of the weights matrix (second denominator)
  • sum(dx^2) is the first denominator (the variance of the variable)
Code
I <- (n/sum(dx^2)) * (sum(pmw)/sum(mat))
  • The value: 0.684

Calculations

Code
Iterra <- autocor(villagessmall$literacy, mat, "moran")
I
[1] 0.6839845
Code
Iterra # the same!
[1] 0.6839845

Is it significant?

  • How do we test its significance?
  • We need to simulate the distribution of Moran’s I under the null hypothesis of NO correlation!
Code
# Note: we'd usually want to do a lot more than 99!
m <- sapply(1:99, function(i) {
    autocor(sample(villagessmall$literacy), mat, "moran")
})
m
 [1] -0.0067793590 -0.0057928574  0.0042540304  0.0122109861  0.0063340623
 [6] -0.0122729174  0.0259658294  0.0092155532  0.0072904004  0.0077533542
[11] -0.0118536904 -0.0093242619 -0.0028919450 -0.0140148390  0.0258995152
[16] -0.0136357550 -0.0025004532 -0.0005632029 -0.0170937976  0.0180636186
[21] -0.0123835510 -0.0323319420  0.0346561441 -0.0198729982  0.0057040619
[26] -0.0040367366  0.0183431123  0.0175124724 -0.0024211727 -0.0067879317
[31] -0.0440667593 -0.0103402301  0.0118285206 -0.0003738438  0.0130804137
[36] -0.0132697155  0.0367135803  0.0074950715 -0.0211976682 -0.0089247227
[41] -0.0125560055 -0.0005833564 -0.0023215837 -0.0050591438  0.0047203724
[46] -0.0196946876 -0.0024158407 -0.0185335713  0.0222341661  0.0057386693
[51]  0.0064249332 -0.0199770763 -0.0117519804 -0.0062222967 -0.0218890297
[56]  0.0235421960 -0.0165644738  0.0061632073  0.0362522881 -0.0080786580
[61] -0.0136763702 -0.0077375883 -0.0180984644 -0.0204050024 -0.0178227242
[66] -0.0104542427  0.0378854685  0.0160847412  0.0213525787  0.0314204755
[71] -0.0057897110  0.0264347416  0.0157277341 -0.0102732104  0.0264408198
[76] -0.0010603512 -0.0018277514  0.0488201115  0.0220162592  0.0016303050
[81]  0.0280869830 -0.0266669740 -0.0044732368  0.0107401599  0.0042997151
[86] -0.0134477332 -0.0033694906  0.0031172776 -0.0012045563  0.0096403322
[91]  0.0104552891 -0.0403722270  0.0085382305  0.0002844825  0.0383515259
[96]  0.0377135778 -0.0053783924 -0.0055389828  0.0033978211

Is it significant?

Let’s go back to interpolation

Code
remotes::install_github("rspatial/rspat")
library(rspat)
counties <- spat_data("counties")
counties
 class       : SpatVector 
 geometry    : polygons 
 dimensions  : 68, 5  (geometries, attributes)
 extent      : -124.4096, -114.1344, 32.53416, 42.00952  (xmin, xmax, ymin, ymax)
 coord. ref. : +proj=longlat +datum=WGS84 +no_defs 
 names       : STATE COUNTY      NAME  LSAD LSAD_TRANS
 type        : <chr>  <chr>     <chr> <chr>      <chr>
 values      :    06    093  Siskiyou    06     County
                  06    015 Del Norte    06     County
                  06    049     Modoc    06     County

Let’s go back to interpolation

Code
points <- spat_data("precipitation")
head(points)
     ID                 NAME   LAT    LONG ALT  JAN FEB MAR APR MAY JUN JUL
1 ID741         DEATH VALLEY 36.47 -116.87 -59  7.4 9.5 7.5 3.4 1.7 1.0 3.7
2 ID743  THERMAL/FAA AIRPORT 33.63 -116.17 -34  9.2 6.9 7.9 1.8 1.6 0.4 1.9
3 ID744          BRAWLEY 2SW 32.96 -115.55 -31 11.3 8.3 7.6 2.0 0.8 0.1 1.9
4 ID753 IMPERIAL/FAA AIRPORT 32.83 -115.57 -18 10.6 7.0 6.1 2.5 0.2 0.0 2.4
5 ID754               NILAND 33.28 -115.51 -18  9.0 8.0 9.0 3.0 0.0 1.0 8.0
6 ID758        EL CENTRO/NAF 32.82 -115.67 -13  9.8 1.6 3.7 3.0 0.4 0.0 3.0
   AUG SEP OCT NOV DEC
1  2.8 4.3 2.2 4.7 3.9
2  3.4 5.3 2.0 6.3 5.5
3  9.2 6.5 5.0 4.8 9.7
4  2.6 8.3 5.4 7.7 7.3
5  9.0 7.0 8.0 7.0 9.0
6 10.8 0.2 0.0 3.3 1.4
Code
points <- vect(points, geom = c("LONG", "LAT"), crs = "EPSG:4326")
names(points)
 [1] "ID"   "NAME" "ALT"  "JAN"  "FEB"  "MAR"  "APR"  "MAY"  "JUN"  "JUL" 
[11] "AUG"  "SEP"  "OCT"  "NOV"  "DEC" 
Code
# total precipitation
points$ppttotal <- apply(as_tibble(points[,4:ncol(points)]), 1, sum)
# altitude in 1000s
points$ALT <- points$ALT/1000

Let’s go back to interpolation

Precipitation and altitude

  • Let’s see how altitude correlates with precipitation
Code
lm(ppttotal ~ ALT, data = as_tibble(points))

Call:
lm(formula = ppttotal ~ ALT, data = as_tibble(points))

Coefficients:
(Intercept)          ALT  
      523.6        170.0  
  • This regression is global; it fits one model everywhere!
    • But what if effects differ throughout the state?

Spatially weighted regression

Code
remotes::install_github("rsbivand/spgwr")
library(spgwr)

# select "bandwidth"
bw <- gwr.sel(ppttotal ~ ALT, 
  data = as.data.frame(points), 
  coords = geom(points)[,c("x", "y")])
Bandwidth: 5.278132 CV score: 65336646 
Bandwidth: 8.531671 CV score: 74455807 
Bandwidth: 3.267335 CV score: 54520395 
Bandwidth: 2.024593 CV score: 45090059 
Bandwidth: 1.256537 CV score: 35702192 
Bandwidth: 0.7818523 CV score: 28603589 
Bandwidth: 0.4884809 CV score: 22008348 
Bandwidth: 0.3071674 CV score: 16813321 
Bandwidth: 0.1951095 CV score: 14277138 
Bandwidth: 0.1258539 CV score: NA 
Bandwidth: 0.2379118 CV score: 15015728 
Bandwidth: 0.1686562 CV score: 14634499 
Bandwidth: 0.1970882 CV score: 14294166 
Bandwidth: 0.1905669 CV score: 14252895 
Bandwidth: 0.1821978 CV score: 14281620 
Bandwidth: 0.1889092 CV score: 14250156 
Bandwidth: 0.1883477 CV score: 14250072 
Bandwidth: 0.1885173 CV score: 14250051 
Bandwidth: 0.188558 CV score: 14250052 
Bandwidth: 0.1884766 CV score: 14250052 
Bandwidth: 0.1885173 CV score: 14250051 

Spatially weighted regression

Code
r <- rast(counties, res = 0.05)
r <- rasterize(counties, r)
newpts <- as.points(r)

g <- gwr(ppttotal ~ ALT, data = as.data.frame(points), 
  coords = geom(points)[,c("x", "y")], 
  bandwidth = bw, 
  fit.points = geom(newpts)[,c("x", "y")])
g
Call:
gwr(formula = ppttotal ~ ALT, data = as.data.frame(points), coords = geom(points)[, 
    c("x", "y")], bandwidth = bw, fit.points = geom(newpts)[, 
    c("x", "y")])
Kernel function: gwr.Gauss 
Fixed bandwidth: 0.1885173 
Fit points: 16629
Summary of GWR coefficient estimates at fit points:
                  Min.   1st Qu.    Median   3rd Qu.   Max.
X.Intercept. -1070.369    77.672   329.227   734.798 3754.5
ALT          -4776.103    30.498   207.727   438.462 5144.3

Spatially weighted regression

Code
slope <- intercept <- r
slope[!is.na(slope)] <- g$SDF$ALT
intercept[!is.na(intercept)] <- g$SDF$'(Intercept)'

g1 <- ggplot() +
  geom_spatraster(data = intercept, aes(fill = layer), color = NA) +
  scale_fill_distiller("Intercept", palette = "Spectral")
g2 <- ggplot() +
  geom_spatraster(data = slope, aes(fill = layer), color = NA) +
  scale_fill_distiller("Slope", palette = "Spectral")
plot_grid(g1, g2)

Spatially weighted regression