Geospatial data analysis in R

Spatial autocorrelation and review

Josh Merfeld

KDI School

November 19, 2024

What are we doing today?

Spatial correlations and regressions
Review of the semester

Spatial correlations

Why do we care about spatial correlations?

We saw it last week: it helps with interpolation!
For regression: correlation matters for variance estimates (standard errors)

Let’s start with temporal correlations

Let’s look at temporal correlations, first!

Code

villages <- vect("week9files/villages.shp")
villages

 class       : SpatVector 
 geometry    : polygons 
 dimensions  : 19930, 57  (geometries, attributes)
 extent      : 3368652, 3683711, 413647.6, 775201.1  (xmin, xmax, ymin, ymax)
 source      : villages.shp
 coord. ref. : Kalianpur 1975 / India zone I (EPSG:24378) 
 names       :          shrid2  year pca_no_hh pca_tot_p pca_tot_m pca_tot_f
 type        :           <chr> <num>     <num>     <num>     <num>     <num>
 values      : 11-06-069-0035~  1991       110       516       295       221
               11-06-069-0035~  1991        85       578       333       245
               11-06-069-0035~  1991       112       770       391       379
 pca_m_06 pca_f_06 pca_p_06 pca_m_sc (and 47 more)
    <num>    <num>    <num>    <num>              
       55       58      113       56              
       53       35       88       20              
       83       92      175       44

Code

# let's look at pca_p_lit (literate population)
villages$literate <- villages$pca_p_lit/villages$pca_tot_p

Lagged literacy

We want to look at values in different years

Code

table(villages$year)


1991 2001 2011 
6609 6660 6661

Code

villages <- villages |>
  group_by(shrid2) |>
  arrange(shrid2, year) |>
  mutate(lit_lag = lag(literate)) |>
  ungroup()

Lagged literacy

Code

ggplot() +
  geom_point(data = as_tibble(villages) |> filter(year>1991), 
    aes(x = lit_lag, y = literate, color = as.factor(year)), alpha = 0.5) +
  scale_color_brewer("Year", palette = "Set2") +
  labs(x = "Lagged literacy", y = "Literacy") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  theme_bw()

Literacy in 1991 and 2001

Checking temporal correlation

Let’s check the temporal correlation using cor

Code

cor(villages$literate, villages$lit_lag, use = "pairwise.complete.obs")

The correlation is 0.875
- This is very high!

How might we visualize spatial correlation?

How can we visualize spatial correlation?

Let’s look at the literacy in just one year: 2001

How might we visualize spatial correlation?

Code

ggplot() +
  geom_spatvector(data = villages[villages$year==2001,], 
    aes(fill = literate), color = NA) +
  scale_fill_distiller("Literacy", palette = "Spectral")

How might we visualize spatial correlation?

And the other two years?

Spatial correlation is more complicated!

Temporal correlation is one-dimensional: just correlate today with yesterday
Spatial correlation is two-dimensional: we need to consider neighbors

Let’s start simple

Let’s look at neighbors
- Note: we are going to use a smaller section for this (due to time)
- villagessmall.shp is a subset of villages.shp

Code

villagessmall <- vect("week9files/villagessmall.shp")
# let's also just use 2001
villagessmall <- villagessmall[villagessmall$year==2001,]
mat <- adjacent(villagessmall, symmetrical = TRUE, type = "touches")
head(mat)

     from  to
[1,]    1   2
[2,]    1  57
[3,]    1 177
[4,]    1 178
[5,]    3   4
[6,]    3   5

Let’s start simple

Matrix of neighbors, let’s get POINTS from centroids

Code

xy <- centroids(villagessmall[villagessmall$year==2001,])
points1 <- as_tibble(geom(xy[mat[,1],])[,c("x", "y")])
points2 <- as_tibble(geom(xy[mat[,2],])[,c("x", "y")])

ggplot() +
  geom_spatvector(data = villagessmall, fill = NA, color = kdisgray) +
  geom_spatvector(data = xy, color = kdisgreen, size = 0.5) +
  geom_segment(aes(x = points1$x, y = points1$y, xend = points2$x, yend = points2$y), color = kdisgreen, lwd = 0.1)

Let’s start simple

How do we turn this into a number?

Okay, great, but it doesn’t really tell us anything
Here is what we are going to do:
- We are going to create a matrix that says whether villages are neighbors
- We are going to use this to calculate Moran’s I

Adjacency matrix

Code

villagessmall <- villagessmall[villagessmall$year==2001,]
mat <- adjacent(villagessmall, pairs = FALSE, type = "touches")
dim(mat)

[1] 962 962

Code

villagessmall

 class       : SpatVector 
 geometry    : polygons 
 dimensions  : 962, 57  (geometries, attributes)
 extent      : 3608626, 3683711, 413647.6, 513038.4  (xmin, xmax, ymin, ymax)
 coord. ref. : Kalianpur 1975 / India zone I (EPSG:24378) 
 names       :          shrid2  year pca_no_hh pca_tot_p pca_tot_m pca_tot_f
 type        :           <chr> <num>     <num>     <num>     <num>     <num>
 values      : 11-06-085-0041~  2001      1219      5246      3201      2045
               11-06-085-0041~  2001      1083      4641      2851      1790
               11-06-086-0041~  2001       166      1028       549       479
 pca_m_06 pca_f_06 pca_p_06 pca_m_sc (and 47 more)
    <num>    <num>    <num>    <num>              
      508      480      988      296              
      420      373      793      363              
       94       89      183      134

Moran’s I

\[I = \frac{n}{\sum_{i=1}^n (x_i - \bar{x})^2} \frac{\sum_{i=1}^n \sum_{j=1}^n w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^n \sum_{j=1}^n w_{ij}}\]

It looks complicated! We need:
- The mean of the variable (\(\bar{x}\))
- The number of observations (\(n\))
- The weights matrix (\(w_{ij}\)) - we already have this!

Calculations

Code

villagessmall$literacy <- villagessmall$pca_p_lit/villagessmall$pca_tot_p
n <- nrow(villagessmall)
x <- villagessmall$literacy
xbar <- mean(x)

# now we are going to create (x_i - xbar) * (x_j - xbar)
dx <- x - xbar
# expand.grid creates all possible combinations
g <- expand.grid(dx, dx)
dim(g)

[1] 925444      2

Code

length(dx)*length(dx)

[1] 925444

Calculations

Code

# now multiply! for all pairs
xixj <- g[,1] * g[,2]

# now turn it into a matrix:
pm <- matrix(xixj, ncol = n)
dim(pm) # length times length!

[1] 962 962

Code

# multiply by weights matrix; they are zeros if not adjacent!
pmw <- pm*mat
dim(pmw)

[1] 962 962

Calculations

\[I = \frac{n}{\sum_{i=1}^n (x_i - \bar{x})^2} \frac{\sum_{i=1}^n \sum_{j=1}^n w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^n \sum_{j=1}^n w_{ij}}\]

pmw is \(\sum_{i=1}^n \sum_{j=1}^n w_{ij} (x_i - \bar{x})(x_j - \bar{x})\) (part of the numerator!)
sum(pmw) is the sum of this matrix
sum(mat) is the sum of the weights matrix (second denominator)
sum(dx^2) is the first denominator (the variance of the variable)

Code

I <- (n/sum(dx^2)) * (sum(pmw)/sum(mat))

The value: 0.684

Calculations

Code

Iterra <- autocor(villagessmall$literacy, mat, "moran")
I

[1] 0.6839845

Code

Iterra # the same!

[1] 0.6839845

Is it significant?

How do we test its significance?

We need to simulate the distribution of Moran’s I under the null hypothesis of NO correlation!

Code

# Note: we'd usually want to do a lot more than 99!
m <- sapply(1:99, function(i) {
    autocor(sample(villagessmall$literacy), mat, "moran")
})
m

 [1] -0.0067793590 -0.0057928574  0.0042540304  0.0122109861  0.0063340623
 [6] -0.0122729174  0.0259658294  0.0092155532  0.0072904004  0.0077533542
[11] -0.0118536904 -0.0093242619 -0.0028919450 -0.0140148390  0.0258995152
[16] -0.0136357550 -0.0025004532 -0.0005632029 -0.0170937976  0.0180636186
[21] -0.0123835510 -0.0323319420  0.0346561441 -0.0198729982  0.0057040619
[26] -0.0040367366  0.0183431123  0.0175124724 -0.0024211727 -0.0067879317
[31] -0.0440667593 -0.0103402301  0.0118285206 -0.0003738438  0.0130804137
[36] -0.0132697155  0.0367135803  0.0074950715 -0.0211976682 -0.0089247227
[41] -0.0125560055 -0.0005833564 -0.0023215837 -0.0050591438  0.0047203724
[46] -0.0196946876 -0.0024158407 -0.0185335713  0.0222341661  0.0057386693
[51]  0.0064249332 -0.0199770763 -0.0117519804 -0.0062222967 -0.0218890297
[56]  0.0235421960 -0.0165644738  0.0061632073  0.0362522881 -0.0080786580
[61] -0.0136763702 -0.0077375883 -0.0180984644 -0.0204050024 -0.0178227242
[66] -0.0104542427  0.0378854685  0.0160847412  0.0213525787  0.0314204755
[71] -0.0057897110  0.0264347416  0.0157277341 -0.0102732104  0.0264408198
[76] -0.0010603512 -0.0018277514  0.0488201115  0.0220162592  0.0016303050
[81]  0.0280869830 -0.0266669740 -0.0044732368  0.0107401599  0.0042997151
[86] -0.0134477332 -0.0033694906  0.0031172776 -0.0012045563  0.0096403322
[91]  0.0104552891 -0.0403722270  0.0085382305  0.0002844825  0.0383515259
[96]  0.0377135778 -0.0053783924 -0.0055389828  0.0033978211

Is it significant?

Let’s go back to interpolation

Code

remotes::install_github("rspatial/rspat")
library(rspat)
counties <- spat_data("counties")
counties

 class       : SpatVector 
 geometry    : polygons 
 dimensions  : 68, 5  (geometries, attributes)
 extent      : -124.4096, -114.1344, 32.53416, 42.00952  (xmin, xmax, ymin, ymax)
 coord. ref. : +proj=longlat +datum=WGS84 +no_defs 
 names       : STATE COUNTY      NAME  LSAD LSAD_TRANS
 type        : <chr>  <chr>     <chr> <chr>      <chr>
 values      :    06    093  Siskiyou    06     County
                  06    015 Del Norte    06     County
                  06    049     Modoc    06     County

Let’s go back to interpolation

Code

points <- spat_data("precipitation")
head(points)

     ID                 NAME   LAT    LONG ALT  JAN FEB MAR APR MAY JUN JUL
1 ID741         DEATH VALLEY 36.47 -116.87 -59  7.4 9.5 7.5 3.4 1.7 1.0 3.7
2 ID743  THERMAL/FAA AIRPORT 33.63 -116.17 -34  9.2 6.9 7.9 1.8 1.6 0.4 1.9
3 ID744          BRAWLEY 2SW 32.96 -115.55 -31 11.3 8.3 7.6 2.0 0.8 0.1 1.9
4 ID753 IMPERIAL/FAA AIRPORT 32.83 -115.57 -18 10.6 7.0 6.1 2.5 0.2 0.0 2.4
5 ID754               NILAND 33.28 -115.51 -18  9.0 8.0 9.0 3.0 0.0 1.0 8.0
6 ID758        EL CENTRO/NAF 32.82 -115.67 -13  9.8 1.6 3.7 3.0 0.4 0.0 3.0
   AUG SEP OCT NOV DEC
1  2.8 4.3 2.2 4.7 3.9
2  3.4 5.3 2.0 6.3 5.5
3  9.2 6.5 5.0 4.8 9.7
4  2.6 8.3 5.4 7.7 7.3
5  9.0 7.0 8.0 7.0 9.0
6 10.8 0.2 0.0 3.3 1.4

Code

points <- vect(points, geom = c("LONG", "LAT"), crs = "EPSG:4326")
names(points)

 [1] "ID"   "NAME" "ALT"  "JAN"  "FEB"  "MAR"  "APR"  "MAY"  "JUN"  "JUL" 
[11] "AUG"  "SEP"  "OCT"  "NOV"  "DEC"

Code

# total precipitation
points$ppttotal <- apply(as_tibble(points[,4:ncol(points)]), 1, sum)
# altitude in 1000s
points$ALT <- points$ALT/1000

Let’s go back to interpolation

Precipitation and altitude

Let’s see how altitude correlates with precipitation

Code

lm(ppttotal ~ ALT, data = as_tibble(points))


Call:
lm(formula = ppttotal ~ ALT, data = as_tibble(points))

Coefficients:
(Intercept)          ALT  
      523.6        170.0

This regression is global; it fits one model everywhere!
- But what if effects differ throughout the state?

Spatially weighted regression

Code

remotes::install_github("rsbivand/spgwr")
library(spgwr)

# select "bandwidth"
bw <- gwr.sel(ppttotal ~ ALT, 
  data = as.data.frame(points), 
  coords = geom(points)[,c("x", "y")])

Bandwidth: 5.278132 CV score: 65336646 
Bandwidth: 8.531671 CV score: 74455807 
Bandwidth: 3.267335 CV score: 54520395 
Bandwidth: 2.024593 CV score: 45090059 
Bandwidth: 1.256537 CV score: 35702192 
Bandwidth: 0.7818523 CV score: 28603589 
Bandwidth: 0.4884809 CV score: 22008348 
Bandwidth: 0.3071674 CV score: 16813321 
Bandwidth: 0.1951095 CV score: 14277138 
Bandwidth: 0.1258539 CV score: NA 
Bandwidth: 0.2379118 CV score: 15015728 
Bandwidth: 0.1686562 CV score: 14634499 
Bandwidth: 0.1970882 CV score: 14294166 
Bandwidth: 0.1905669 CV score: 14252895 
Bandwidth: 0.1821978 CV score: 14281620 
Bandwidth: 0.1889092 CV score: 14250156 
Bandwidth: 0.1883477 CV score: 14250072 
Bandwidth: 0.1885173 CV score: 14250051 
Bandwidth: 0.188558 CV score: 14250052 
Bandwidth: 0.1884766 CV score: 14250052 
Bandwidth: 0.1885173 CV score: 14250051

Spatially weighted regression

Code

r <- rast(counties, res = 0.05)
r <- rasterize(counties, r)
newpts <- as.points(r)

g <- gwr(ppttotal ~ ALT, data = as.data.frame(points), 
  coords = geom(points)[,c("x", "y")], 
  bandwidth = bw, 
  fit.points = geom(newpts)[,c("x", "y")])
g

Call:
gwr(formula = ppttotal ~ ALT, data = as.data.frame(points), coords = geom(points)[, 
    c("x", "y")], bandwidth = bw, fit.points = geom(newpts)[, 
    c("x", "y")])
Kernel function: gwr.Gauss 
Fixed bandwidth: 0.1885173 
Fit points: 16629
Summary of GWR coefficient estimates at fit points:
                  Min.   1st Qu.    Median   3rd Qu.   Max.
X.Intercept. -1070.369    77.672   329.227   734.798 3754.5
ALT          -4776.103    30.498   207.727   438.462 5144.3

Spatially weighted regression

Code

slope <- intercept <- r
slope[!is.na(slope)] <- g$SDF$ALT
intercept[!is.na(intercept)] <- g$SDF$'(Intercept)'

g1 <- ggplot() +
  geom_spatraster(data = intercept, aes(fill = layer), color = NA) +
  scale_fill_distiller("Intercept", palette = "Spectral")
g2 <- ggplot() +
  geom_spatraster(data = slope, aes(fill = layer), color = NA) +
  scale_fill_distiller("Slope", palette = "Spectral")
plot_grid(g1, g2)