Geospatial data analysis in R

Spatial interpolation

Josh Merfeld

KDI School

November 12, 2024

What are we doing today?

  • Getting raw data ready for analysis

  • Spatial interpolation

What is this?

Getting raw data ready for analysis

Downloading the data

Downloading the data

Let’s look at the data

Code
# packages used throughout (assumed loaded in the slides’ setup chunk)
library(tidyverse); library(terra); library(tidyterra)
pm <- read_csv("week8files/GApm.csv")
summary(pm)
     Date              Source             Site ID               POC       
 Length:12594       Length:12594       Min.   :130210007   Min.   : 1.00  
 Class :character   Class :character   1st Qu.:130590002   1st Qu.: 3.00  
 Mode  :character   Mode  :character   Median :131210055   Median : 3.00  
                                       Mean   :131293301   Mean   : 9.33  
                                       3rd Qu.:131530001   3rd Qu.:23.00  
                                       Max.   :133030001   Max.   :24.00  
                                                                          
 Daily Mean PM2.5 Concentration    Units           Daily AQI Value
 Min.   :-0.600                 Length:12594       Min.   :  0.0  
 1st Qu.: 5.300                 Class :character   1st Qu.: 29.0  
 Median : 7.200                 Mode  :character   Median : 40.0  
 Mean   : 8.108                                    Mean   : 40.8  
 3rd Qu.: 9.800                                    3rd Qu.: 52.0  
 Max.   :51.900                                    Max.   :141.0  
                                                                  
 Local Site Name    Daily Obs Count Percent Complete AQS Parameter Code
 Length:12594       Min.   :1       Min.   :100      Min.   :88101     
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
 Mode  :character   Median :1       Median :100      Median :88101     
                    Mean   :1       Mean   :100      Mean   :88168     
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88101     
                    Max.   :1       Max.   :100      Max.   :88502     
                                                                       
 AQS Parameter Description  Method Code    Method Description   CBSA Code    
 Length:12594              Min.   :145.0   Length:12594       Min.   :10500  
 Class :character          1st Qu.:236.0   Class :character   1st Qu.:12060  
 Mode  :character          Median :238.0   Mode  :character   Median :12260  
                           Mean   :467.6                      Mean   :21050  
                           3rd Qu.:736.0                      3rd Qu.:31420  
                           Max.   :802.0                      Max.   :47580  
                                                              NA's   :827    
  CBSA Name         State FIPS Code    State           County FIPS Code  
 Length:12594       Min.   :13      Length:12594       Length:12594      
 Class :character   1st Qu.:13      Class :character   Class :character  
 Mode  :character   Median :13      Mode  :character   Mode  :character  
                    Mean   :13                                           
                    3rd Qu.:13                                           
                    Max.   :13                                           
                                                                         
    County          Site Latitude   Site Longitude  
 Length:12594       Min.   :30.74   Min.   :-85.32  
 Class :character   1st Qu.:32.61   1st Qu.:-84.29  
 Mode  :character   Median :33.43   Median :-83.81  
                    Mean   :33.17   Mean   :-83.66  
                    3rd Qu.:33.92   3rd Qu.:-83.34  
                    Max.   :34.98   Max.   :-81.13  
                                                    

Some cleaning

  • Let’s clean the raw PM data a bit

  • First, let’s turn “Date” into an actual date

  • Then, we will keep just the columns we want

  • Finally, let’s turn it into a “shapefile” (really a terra SpatVector of points)

First, to date

Code
# first, to date
pm$Date[1] # note the format
[1] "01/01/2020"
Code
pm$date <- mdy(pm$Date) # month day year (from the lubridate package, part of tidyverse)
head(pm$date)
[1] "2020-01-01" "2020-01-04" "2020-01-07" "2020-01-10" "2020-01-13"
[6] "2020-01-16"
Code
class(pm$date)
[1] "Date"

An aside on dates

Code
head(year(pm$date))
[1] 2020 2020 2020 2020 2020 2020
Code
head(month(pm$date))
[1] 1 1 1 1 1 1
Code
head(day(pm$date))
[1]  1  4  7 10 13 16

An aside on dates

Code
table(month(pm$date))

   1    2    3    4    5    6    7    8    9   10   11   12 
1132 1051 1135 1074 1114 1060  977 1000  962 1005 1015 1069 
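Two more lubridate helpers that often come in handy (a quick sketch on the same column):

Code
head(wday(pm$date, label = TRUE))   # day of the week
head(floor_date(pm$date, "month"))  # round down to the first of the month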

Just the columns we want

Code
colnames(pm)
 [1] "Date"                           "Source"                        
 [3] "Site ID"                        "POC"                           
 [5] "Daily Mean PM2.5 Concentration" "Units"                         
 [7] "Daily AQI Value"                "Local Site Name"               
 [9] "Daily Obs Count"                "Percent Complete"              
[11] "AQS Parameter Code"             "AQS Parameter Description"     
[13] "Method Code"                    "Method Description"            
[15] "CBSA Code"                      "CBSA Name"                     
[17] "State FIPS Code"                "State"                         
[19] "County FIPS Code"               "County"                        
[21] "Site Latitude"                  "Site Longitude"                
[23] "date"                          

Just the columns we want

Code
# note the syntax here!
pm <- pm |>
  select(siteid = `Site ID`, date, pm25 = `Daily Mean PM2.5 Concentration`, 
    AQI = `Daily AQI Value`, lon = `Site Longitude`, lat = `Site Latitude`)
pm
# A tibble: 12,594 × 6
      siteid date        pm25   AQI   lon   lat
       <dbl> <date>     <dbl> <dbl> <dbl> <dbl>
 1 130210007 2020-01-01   9.9    52 -83.6  32.8
 2 130210007 2020-01-04   4.4    24 -83.6  32.8
 3 130210007 2020-01-07   6.8    38 -83.6  32.8
 4 130210007 2020-01-10   8.6    48 -83.6  32.8
 5 130210007 2020-01-13   6.7    37 -83.6  32.8
 6 130210007 2020-01-16   4.8    27 -83.6  32.8
 7 130210007 2020-01-19   2.9    16 -83.6  32.8
 8 130210007 2020-01-22   7      39 -83.6  32.8
 9 130210007 2020-01-25   4.3    24 -83.6  32.8
10 130210007 2020-01-28   5.8    32 -83.6  32.8
# ℹ 12,584 more rows

Finally, to shapefile

Code
pm <- vect(pm, geom = c("lon", "lat"), crs = "EPSG:4326")
pm
 class       : SpatVector 
 geometry    : points 
 dimensions  : 12594, 4  (geometries, attributes)
 extent      : -85.3232, -81.13022, 30.74072, 34.9788  (xmin, xmax, ymin, ymax)
 coord. ref. : lon/lat WGS 84 (EPSG:4326) 
 names       :    siteid       date  pm25   AQI
 type        :     <num>     <Date> <num> <num>
 values      : 1.302e+08 2020-01-01   9.9    52
               1.302e+08 2020-01-04   4.4    24
               1.302e+08 2020-01-07   6.8    38

Let’s look only at observations from January

  • Note that the shapefile contains census tracts for the entire state of Georgia
Code
ga <- vect("week8files/tl_2020_13_tract.shp")
# make sure it's the same crs
ga <- project(ga, crs(pm))
ggplot() +
  geom_spatvector(data = ga, lwd = 0.05, color = "gray", fill = NA) +
  geom_spatvector(data = pm |> filter(month(date)==1), color = "red") +
  theme_bw()

Let’s look only at observations from January

Our goal

  • What do we want to do with this data?

  • We want to look at poverty rates at the census tract level and how they vary with average pollution levels

  • But we cannot do that yet. Why?

  • We only have 20-something pollution stations in Georgia. We need to interpolate!

Interpolation

  • There are many ways to interpolate
    • Interpolation is the process of estimating values at unobserved locations from known data points
  • Perhaps the simplest: use Voronoi polygons
    • We already know how to do this!

Interpolation with Voronoi polygons

  • What are we going to do?
Code
pmunique <- pm |>
  group_by(siteid) |>
  filter(row_number() == 1) |>
  ungroup()
voronoiga <- voronoi(pmunique, bnd = ga)
ggplot() +
  geom_spatvector(data = ga, lwd = 0.05, color = "gray", fill = NA) +
  geom_spatvector(data = voronoiga, lwd = 0.05, color = "blue", fill = NA) +
  geom_spatvector(data = pm |> filter(month(date)==1), color = "red") +
  theme_bw()

Interpolation with Voronoi polygons

Interpolation with Voronoi polygons

  • With the polygons, what are our options?
  • We could find the overlap between the census tracts and Voronoi polygons and either:
    • Option 1: Assign the pollution value from the Voronoi polygon with the largest overlap
    • Option 2: Create a weighted average of pollution values based on overlap
  • Option 3: We could also just find the centroid of each census tract and assign the pollution value from the Voronoi polygon that contains it
  • Let’s try options 1 and 3! (A sketch of option 2 follows below.)
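Since we will only implement options 1 and 3, here is a minimal sketch of option 2 (the area-weighted average), assuming voronoiga from the previous slide and the station-level January means (pmmonth) built a few slides below:

Code
# overlay tracts and Voronoi polygons, then weight station values by overlap area
vorweights <- intersect(ga, voronoiga)
vorweights$area <- expanse(vorweights)
vorweights <- vorweights |>
  as_tibble() |>
  select(GEOID, siteid, area) |>
  left_join(as_tibble(pmmonth), by = "siteid") |>
  group_by(GEOID) |>
  summarize(pm25 = weighted.mean(pm25, w = area, na.rm = TRUE), .groups = "drop")
# then left_join onto ga by GEOID, just as in option 1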

Option 1: largest overlap

  • Try it!
    • Calculate mean pollution levels by station for JANUARY
    • Find the intersections between the census tracts and the voronoi polygons
    • Find the voronoi polygon with the largest overlap for each census tract
    • Find a way to add the pollution value for January to the census tract shapefile

Mean pollution by month

Code
# mean January PM2.5 by station (one row per station)
pmmonth <- pm |>
  filter(month(date)==1) |>
  group_by(siteid) |>
  summarize(pm25 = mean(pm25, na.rm = TRUE), .groups = "drop")

Intersections

Code
# Intersection
vorintersect <- intersect(ga, voronoiga)
# find area
vorintersect$area <- expanse(vorintersect)
# group by GEOID and find largest value
vorintersect <- vorintersect |>
  group_by(GEOID) |>
  filter(area==max(area)) |>
  ungroup() |>
  # now keep just the site id and geoid
  select(GEOID, siteid) |>
  as_tibble()

Add and plot

Code
vorintersect <- vorintersect |>
  left_join(as_tibble(pmmonth), by = "siteid")
head(vorintersect)
gamonth <- ga |>
  left_join(vorintersect, by = "GEOID")
ggplot() + 
  geom_spatvector(data = gamonth, aes(fill = pm25), color = NA) +
  scale_fill_distiller("PM 2.5\n(Jan. 2020)", palette = "Spectral") +
  theme_bw()

Add and plot

By centroids

Code
# create centroids
gacentroids <- centroids(ga)
#intersect
vorcentroids <- intersect(gacentroids, voronoiga)
# create new gamonth
gamonth <- ga |>
  mutate(siteid = vorcentroids$siteid)
gamonth <- gamonth |>
  left_join(as_tibble(pmmonth), by = "siteid")
ggplot() + 
  geom_spatvector(data = gamonth, aes(fill = pm25), color = NA) +
  scale_fill_distiller("PM 2.5\n(Jan. 2020)", palette = "Spectral") +
  theme_bw() + 
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb")) + 
  theme(legend.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))

By centroids

This is probably not ideal…

  • Doesn’t look ideal. Why?

Alternative interpolation methods

  • So what other alternatives do we have? Can you think of other options?
  • We’re going to open up the possibilities with a new R package: gstat
Code
install.packages("gstat") # only needed once
library(gstat)

New option 1: inverse distance weighting

  • Consider the following equation:

\[ \hat{y}_0 = \frac{\sum_{i=1}^{n} y_i\times\left(\frac{1}{d_i^\beta}\right)}{\sum_{i=1}^{n}\left(\frac{1}{d_i^\beta}\right)} \]

  • \(\hat{y}_0\) is what we want to predict at point \(0\)
  • \(y_i\) is the value at point \(i\) (the pollution monitoring stations)
  • \(d_i\) is the distance between point \(0\) and point \(i\)
  • \(\beta\) is a parameter that determines how much weight distance gets
    • If \(\beta=1\), this is traditionally called “inverse distance weighting”; note that gstat’s idw() defaults to \(\beta=2\) (see the sketch below)
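To make the formula concrete, here is a hand-rolled version for a single prediction point (a minimal sketch in base R; idw_point is a made-up helper, not part of gstat):

Code
# y: station values; d: distances from the prediction point to each station
idw_point <- function(y, d, beta = 1) {
  w <- 1 / d^beta     # inverse-distance weights
  sum(y * w) / sum(w) # weighted average, exactly the formula above
}
# toy example with three stations: larger beta puts more weight on close stations
idw_point(y = c(5, 8, 12), d = c(1, 2, 4), beta = 1)
idw_point(y = c(5, 8, 12), d = c(1, 2, 4), beta = 2)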

How do we set this up?

  1. We need a list of points where we want to predict
  2. We need a list of points where we have data (we already have this)
  • Also a small complication: gstat does not work with terra objects!
    • We are going to use sf
    • However, I am not going to load the package. Instead, I will use sf::st_as_sf() to turn terra objects into sf objects
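Here is what that conversion looks like in isolation (a minimal sketch, using the pm object from earlier):

Code
pm_sf <- sf::st_as_sf(pm)      # terra SpatVector -> sf object (what gstat expects)
class(pm_sf)                   # "sf" "data.frame"
pm_back <- terra::vect(pm_sf)  # and back to a SpatVector if needed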

How do we set this up?

  • Here’s the code:
Code
# new points
grid <- centroids(ga)
# note how I convert to sf inside the call!
# Note: idp is the "beta" from above
grid <- idw(pmmonth$pm25 ~ 1, sf::st_as_sf(pmmonth), 
  newdata = sf::st_as_sf(grid), idp = 1)
[inverse distance weighted interpolation]
Code
grid
Simple feature collection with 2796 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -85.55702 ymin: 30.5733 xmax: -80.84775 ymax: 34.97834
Geodetic CRS:  WGS 84
First 10 features:
   var1.pred var1.var                   geometry
1   6.301066       NA POINT (-85.00318 31.18634)
2   6.309804       NA POINT (-84.88777 31.32099)
3   6.293376       NA  POINT (-84.93954 31.4459)
4   6.392179       NA POINT (-84.49576 33.85575)
5   6.392407       NA POINT (-84.46748 33.87364)
6   6.317686       NA   POINT (-84.514 34.00787)
7   6.318725       NA POINT (-84.53907 33.99773)
8   6.154639       NA  POINT (-84.3168 34.34863)
9   6.233536       NA POINT (-84.48069 34.20635)
10  6.224112       NA POINT (-84.50699 34.22357)

Let’s add it back into the ga object

  • Here’s the code:
Code
gamonth$idwbeta1 <- grid$var1.pred

Comparing values of \(\beta\)

[Figure: IDW maps comparing different values of \(\beta\)]
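The code behind those panels is folded in the slides; a plausible sketch, assuming it simply reruns idw() with different values of idp (the values 0.5 and 2 and the column names are assumptions):

Code
grid <- centroids(ga)
idwlow <- idw(pmmonth$pm25 ~ 1, sf::st_as_sf(pmmonth),
  newdata = sf::st_as_sf(grid), idp = 0.5)
idwhigh <- idw(pmmonth$pm25 ~ 1, sf::st_as_sf(pmmonth),
  newdata = sf::st_as_sf(grid), idp = 2)
gamonth$idwbetalow <- idwlow$var1.pred
gamonth$idwbetahigh <- idwhigh$var1.pred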

We can control how many points to use

Code
grid <- centroids(ga)
grid <- idw(pmmonth$pm25 ~ 1, sf::st_as_sf(pmmonth), 
  newdata = sf::st_as_sf(grid), nmax = 1)
gamonth$idwbetavor <- grid$var1.pred

ggplot() + 
  geom_spatvector(data = gamonth, aes(fill = idwbetavor), color = NA, show.legend = FALSE) +
  scale_fill_distiller("PM 2.5\n(Jan. 2020)", palette = "Spectral") +
  labs(subtitle = "nmax = 1") +
  theme_bw()

What does this look like??

[Figure: IDW map with nmax = 1 (nearest station only)]
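With nmax = 1, each centroid gets the value of its single nearest station, which is essentially the Voronoi/centroid assignment from earlier. A quick check (a sketch; small discrepancies are possible because terra’s Voronoi polygons are planar while gstat uses great-circle distances on lon/lat data):

Code
# pm25 holds the centroid-based Voronoi values from earlier
all.equal(gamonth$pm25, gamonth$idwbetavor)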

Comparing nmax

[Figure: IDW maps comparing different values of nmax]

Some fun with maps!

Some fun with maps!

Code
library(rayshader)
g1 <- ggplot() + 
  geom_spatvector(data = gamonth, aes(fill = idwbeta1), color = NA) +
  scale_fill_distiller("PM 2.5", palette = "Spectral") +
  theme_bw() + 
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb")) + 
  theme(legend.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))

plot_gg(g1, multicore = TRUE, theta = 10, phi = 45)
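If you want a still image of the 3D view, rayshader’s render_snapshot() will capture it (the file path here is hypothetical):

Code
# capture the currently open 3D view to a png
render_snapshot("week8files/ga_pm25_3d.png")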

Let’s move to California (which has more data)

Code
capm <- read_csv("week8files/CApm25.csv")
capm$Date[1] 
capm$date <- mdy(capm$Date)
capm <- capm |>
  select(siteid = `Site ID`, date, pm25 = `Daily Mean PM2.5 Concentration`, 
    AQI = `Daily AQI Value`, lon = `Site Longitude`, lat = `Site Latitude`)
capm <- vect(capm, geom = c("lon", "lat"), crs = "EPSG:4326")

# just keep january, and one observation per station
camonth <- capm |>
  filter(month(date)==1) |>
  group_by(siteid) |>
  summarize(pm25 = mean(pm25, na.rm = TRUE), .groups = "drop")

# and the shapefile
ca <- vect("week8files/tl_2020_06_tract.shp")
# project
ca <- project(ca, crs(camonth))

Let’s move to California (which has more data)

[1] "01/01/2020"

First, let’s do IDW!

  • Your turn. Try it!


Code
grid <- centroids(ca)
grid <- idw(camonth$pm25 ~ 1, locations = sf::st_as_sf(camonth), newdata = sf::st_as_sf(grid))
ca$idw <- grid$var1.pred

ggplot() +
  geom_spatvector(data = ca, aes(fill = idw), color = NA) +
  scale_fill_distiller("PM 2.5\n(Jan. 2020)", palette = "Spectral") +
  theme_bw() + 
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb")) + 
  theme(legend.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))

Map

[Figure: IDW map of January PM2.5 across California census tracts]

Kriging

  • With this new data, we’re going to try something new: kriging

  • What is kriging?

    • Kriging is a method of spatial interpolation that uses the spatial correlation between points to estimate values at unobserved locations
    • It is more involved than IDW, but the spatial correlation structure is the key ingredient
  • Importantly, we can use predictors other than just distance

    • For example, we could use the x and y coordinates of the points as predictors
    • We could also use other variables that we think are related to pollution levels

Kriging steps

  1. Create a variogram
    • A variogram is a plot of the spatial correlation between points as a function of distance
  2. Fit a model to the variogram
  3. Use the fitted model to predict values at unobserved locations
  • Small note: we are going to use the log of pm25 (logs are generally better behaved)

Kriging

Code
# create the variogram
pm.vgm <- variogram(log(pm25)~1, sf::st_as_sf(camonth)) # calculates sample variogram values 
# fit the variogram
pm.fit <- fit.variogram(pm.vgm, model=vgm("Gau")) # fit model
# create the grid
grid <- centroids(ca)
# predict
pm.kriged <- krige(log(pm25) ~ 1, sf::st_as_sf(camonth), sf::st_as_sf(grid), model=pm.fit)

# put results into the ca shapefile
ca$krigingresults <- pm.kriged$var1.pred
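Before trusting the predictions, it is worth plotting the empirical variogram against the fitted model (gstat’s plot method does this directly):

Code
# empirical variogram points with the fitted Gaussian model overlaid
plot(pm.vgm, model = pm.fit)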

Kriging

[Figure: map of ordinary kriging predictions (log PM2.5) across California]


  • We can also explicitly tell it to use x and y coordinates as predictors
Code
camonth$x <- geom(camonth)[,"x"]
camonth$y <- geom(camonth)[,"y"]
grid <- centroids(ca)
grid$x <- geom(grid)[,"x"]
grid$y <- geom(grid)[,"y"]

# create the variogram
pm.vgm <- variogram(log(pm25)~x+y, sf::st_as_sf(camonth)) # calculates sample variogram values 
# fit the variogram
pm.fit <- fit.variogram(pm.vgm, model=vgm("Gau")) # fit model
# predict
pm.kriged <- krige(log(pm25) ~ x+y, sf::st_as_sf(camonth), sf::st_as_sf(grid), model=pm.fit)
[using universal kriging]
Code
# put results into the ca shapefile
ca$krigingresultslonlat <- pm.kriged$var1.pred
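One caveat: both sets of predictions are on the log scale. If you want PM2.5 levels for mapping, a simple back-transform is to exponentiate (a sketch; the comment notes the standard lognormal correction):

Code
# naive back-transform to the PM2.5 scale
ca$krigingpm25 <- exp(ca$krigingresultslonlat)
# a less biased lognormal back-transform adds half the kriging variance:
# exp(pm.kriged$var1.pred + pm.kriged$var1.var / 2)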

Comparing results

Your turn

  • Here’s your task for the rest of class:
    • Download the data for ozone for California (same website as earlier)
    • Interpolate using inverse distance weighting
    • Map it
    • Use the CApov.csv file and create some figures (not maps) that show relationships between poverty rates and pollution (ozone)

The example from Georgia

  • I have uploaded a csv file:
    • week8files/GApov.csv
Code
gapov <- read_csv("week8files/GApov.csv")
head(gapov)
# A tibble: 6 × 8
  GEO_ID    NAME  poptotal popblack pophispanic poortotal poorblack poorhispanic
  <chr>     <chr>    <dbl>    <dbl>       <dbl>     <dbl>     <dbl>        <dbl>
1 1400000U… Cens…     3246      147         258       746         0            0
2 1400000U… Cens…     2276      637         269       306       105            0
3 1400000U… Cens…     2221     1133          13       500       248            0
4 1400000U… Cens…     2883      255         937       565        23          462
5 1400000U… Cens…     1729      241           0       564        92            0
6 1400000U… Cens…     1892      361          74       337        64           26
Code
# GEO_ID is formatted like "1400000US<GEOID>"; strip the prefix so it
# matches the tract shapefile’s GEOID (adjust if your file differs)
gapov$GEO_ID <- sub("^1400000US", "", gapov$GEO_ID)
gapov$poorrate <- gapov$poortotal/gapov$poptotal
gapov$poorrateblack <- gapov$poorblack/gapov$popblack
gapov$poorratehispanic <- gapov$poorhispanic/gapov$pophispanic
gamonth <- gamonth |>
  left_join(gapov, by = c("GEOID" = "GEO_ID"))
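With poverty rates joined on, we can finally look at the relationship we set out to study. A minimal sketch (using the idwbeta1 column created earlier):

Code
# tract poverty rates against interpolated PM2.5
ggplot(as_tibble(gamonth), aes(x = idwbeta1, y = poorrate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(x = "Interpolated PM 2.5 (Jan. 2020)", y = "Poverty rate") +
  theme_bw()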