Geospatial data analysis in R

Spatial interpolation

Josh Merfeld

KDI School

November 12, 2024

What are we doing today?

  • Getting raw data ready for analysis

  • Spatial interpolation

What is this?

Getting raw data ready for analysis

Downloading the data

Downloading the data

Let’s look at the data

Code
# packages used throughout (assumed loaded in the slides’ setup chunk)
library(tidyverse); library(terra); library(tidyterra)
pm <- read_csv("week8files/GApm.csv")
summary(pm)
     Date              Source             Site ID               POC       
 Length:12594       Length:12594       Min.   :130210007   Min.   : 1.00  
 Class :character   Class :character   1st Qu.:130590002   1st Qu.: 3.00  
 Mode  :character   Mode  :character   Median :131210055   Median : 3.00  
                                       Mean   :131293301   Mean   : 9.33  
                                       3rd Qu.:131530001   3rd Qu.:23.00  
                                       Max.   :133030001   Max.   :24.00  
                                                                          
 Daily Mean PM2.5 Concentration    Units           Daily AQI Value
 Min.   :-0.600                 Length:12594       Min.   :  0.0  
 1st Qu.: 5.300                 Class :character   1st Qu.: 29.0  
 Median : 7.200                 Mode  :character   Median : 40.0  
 Mean   : 8.108                                    Mean   : 40.8  
 3rd Qu.: 9.800                                    3rd Qu.: 52.0  
 Max.   :51.900                                    Max.   :141.0  
                                                                  
 Local Site Name    Daily Obs Count Percent Complete AQS Parameter Code
 Length:12594       Min.   :1       Min.   :100      Min.   :88101     
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
 Mode  :character   Median :1       Median :100      Median :88101     
                    Mean   :1       Mean   :100      Mean   :88168     
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88101     
                    Max.   :1       Max.   :100      Max.   :88502     
                                                                       
 AQS Parameter Description  Method Code    Method Description   CBSA Code    
 Length:12594              Min.   :145.0   Length:12594       Min.   :10500  
 Class :character          1st Qu.:236.0   Class :character   1st Qu.:12060  
 Mode  :character          Median :238.0   Mode  :character   Median :12260  
                           Mean   :467.6                      Mean   :21050  
                           3rd Qu.:736.0                      3rd Qu.:31420  
                           Max.   :802.0                      Max.   :47580  
                                                              NA's   :827    
  CBSA Name         State FIPS Code    State           County FIPS Code  
 Length:12594       Min.   :13      Length:12594       Length:12594      
 Class :character   1st Qu.:13      Class :character   Class :character  
 Mode  :character   Median :13      Mode  :character   Mode  :character  
                    Mean   :13                                           
                    3rd Qu.:13                                           
                    Max.   :13                                           
                                                                         
    County          Site Latitude   Site Longitude  
 Length:12594       Min.   :30.74   Min.   :-85.32  
 Class :character   1st Qu.:32.61   1st Qu.:-84.29  
 Mode  :character   Median :33.43   Median :-83.81  
                    Mean   :33.17   Mean   :-83.66  
                    3rd Qu.:33.92   3rd Qu.:-83.34  
                    Max.   :34.98   Max.   :-81.13  
                                                    

Some cleaning

  • Let’s clean the raw PM data a bit

  • First, let’s turn “Date” into an actual date

  • Then, we will keep just the columns we want

  • Finally, let’s turn it into a “shapefile” (really a terra SpatVector of points)

First, to date

Code
# first, to date
pm$Date[1] # note the format
[1] "01/01/2020"
Code
pm$date <- mdy(pm$Date) # month day year (from the lubridate package, part of tidyverse)
head(pm$date)
[1] "2020-01-01" "2020-01-04" "2020-01-07" "2020-01-10" "2020-01-13"
[6] "2020-01-16"
Code
class(pm$date)
[1] "Date"

An aside on dates

Code
head(year(pm$date))
[1] 2020 2020 2020 2020 2020 2020
Code
head(month(pm$date))
[1] 1 1 1 1 1 1
Code
head(day(pm$date))
[1]  1  4  7 10 13 16

An aside on dates

Code
table(month(pm$date))

   1    2    3    4    5    6    7    8    9   10   11   12 
1132 1051 1135 1074 1114 1060  977 1000  962 1005 1015 1069 
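Two more lubridate helpers that often come in handy (a quick sketch on the same column):

Code
head(wday(pm$date, label = TRUE))   # day of the week
head(floor_date(pm$date, "month"))  # round down to the first of the month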

Just the columns we want

Code
colnames(pm)
 [1] "Date"                           "Source"                        
 [3] "Site ID"                        "POC"                           
 [5] "Daily Mean PM2.5 Concentration" "Units"                         
 [7] "Daily AQI Value"                "Local Site Name"               
 [9] "Daily Obs Count"                "Percent Complete"              
[11] "AQS Parameter Code"             "AQS Parameter Description"     
[13] "Method Code"                    "Method Description"            
[15] "CBSA Code"                      "CBSA Name"                     
[17] "State FIPS Code"                "State"                         
[19] "County FIPS Code"               "County"                        
[21] "Site Latitude"                  "Site Longitude"                
[23] "date"                          

Just the columns we want

Code
# note the syntax here!
pm <- pm |>
  select(siteid = `Site ID`, date, pm25 = `Daily Mean PM2.5 Concentration`, 
    AQI = `Daily AQI Value`, lon = `Site Longitude`, lat = `Site Latitude`)
pm
# A tibble: 12,594 × 6
      siteid date        pm25   AQI   lon   lat
       <dbl> <date>     <dbl> <dbl> <dbl> <dbl>
 1 130210007 2020-01-01   9.9    52 -83.6  32.8
 2 130210007 2020-01-04   4.4    24 -83.6  32.8
 3 130210007 2020-01-07   6.8    38 -83.6  32.8
 4 130210007 2020-01-10   8.6    48 -83.6  32.8
 5 130210007 2020-01-13   6.7    37 -83.6  32.8
 6 130210007 2020-01-16   4.8    27 -83.6  32.8
 7 130210007 2020-01-19   2.9    16 -83.6  32.8
 8 130210007 2020-01-22   7      39 -83.6  32.8
 9 130210007 2020-01-25   4.3    24 -83.6  32.8
10 130210007 2020-01-28   5.8    32 -83.6  32.8
# ℹ 12,584 more rows

Finally, to shapefile

Code
pm <- vect(pm, geom = c("lon", "lat"), crs = "EPSG:4326")
pm
 class       : SpatVector 
 geometry    : points 
 dimensions  : 12594, 4  (geometries, attributes)
 extent      : -85.3232, -81.13022, 30.74072, 34.9788  (xmin, xmax, ymin, ymax)
 coord. ref. : lon/lat WGS 84 (EPSG:4326) 
 names       :    siteid       date  pm25   AQI
 type        :     <num>     <Date> <num> <num>
 values      : 1.302e+08 2020-01-01   9.9    52
               1.302e+08 2020-01-04   4.4    24
               1.302e+08 2020-01-07   6.8    38

Let’s look only at observations from January

  • Note that the shapefile contains census tracts for the entire state of Georgia
Code
ga <- vect("week8files/tl_2020_13_tract.shp")
# make sure it's the same crs
ga <- project(ga, crs(pm))
ggplot() +
  geom_spatvector(data = ga, lwd = 0.05, color = "gray", fill = NA) +
  geom_spatvector(data = pm |> filter(month(date)==1), color = "red") +
  theme_bw()

Let’s look only at observations from January

Our goal

  • What do we want to do with this data?

  • We want to look at poverty rates at the census tract level and how they vary with average pollution levels

  • But we cannot do that yet. Why?

  • We only have 20-something pollution stations in Georgia. We need to interpolate!

Interpolation

  • There are many ways to interpolate
    • Interpolation is the process of estimating values at unobserved locations from known data points
  • Perhaps the simplest: use Voronoi polygons
    • We already know how to do this!

Interpolation with Voronoi polygons

  • What are we going to do?
Code
pmunique <- pm |>
  group_by(siteid) |>
  filter(row_number() == 1) |>
  ungroup()
voronoiga <- voronoi(pmunique, bnd = ga)
ggplot() +
  geom_spatvector(data = ga, lwd = 0.05, color = "gray", fill = NA) +
  geom_spatvector(data = voronoiga, lwd = 0.05, color = "blue", fill = NA) +
  geom_spatvector(data = pm |> filter(month(date)==1), color = "red") +
  theme_bw()

Interpolation with Voronoi polygons

Interpolation with Voronoi polygons

  • With the polygons, what are our options?
  • We could find the overlap between the census tracts and Voronoi polygons and either:
    • Option 1: Assign the pollution value from the Voronoi polygon with the largest overlap
    • Option 2: Create a weighted average of pollution values based on overlap
  • Option 3: We could also just find the centroid of each census tract and assign the pollution value from the Voronoi polygon that contains it
  • Let’s try options 1 and 3! (A sketch of option 2 follows below.)
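Since we will only implement options 1 and 3, here is a minimal sketch of option 2 (the area-weighted average), assuming voronoiga from the previous slide and the station-level January means (pmmonth) built a few slides below:

Code
# overlay tracts and Voronoi polygons, then weight station values by overlap area
vorweights <- intersect(ga, voronoiga)
vorweights$area <- expanse(vorweights)
vorweights <- vorweights |>
  as_tibble() |>
  select(GEOID, siteid, area) |>
  left_join(as_tibble(pmmonth), by = "siteid") |>
  group_by(GEOID) |>
  summarize(pm25 = weighted.mean(pm25, w = area, na.rm = TRUE), .groups = "drop")
# then left_join onto ga by GEOID, just as in option 1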

Option 1: largest overlap

  • Try it!
    • Calculate mean pollution levels by station for JANUARY
    • Find the intersections between the census tracts and the voronoi polygons
    • Find the voronoi polygon with the largest overlap for each census tract
    • Find a way to add the pollution value for January to the census tract shapefile

Mean pollution by month

Code
# mean January PM2.5 by station (one row per station)
pmmonth <- pm |>
  filter(month(date)==1) |>
  group_by(siteid) |>
  summarize(pm25 = mean(pm25, na.rm = TRUE), .groups = "drop")

Intersections

Code
# Intersection
vorintersect <- intersect(ga, voronoiga)
# find area
vorintersect$area <- expanse(vorintersect)
# group by GEOID and find largest value
vorintersect <- vorintersect |>
  group_by(GEOID) |>
  filter(area==max(area)) |>
  ungroup() |>
  # now keep just the site id and geoid
  select(GEOID, siteid) |>
  as_tibble()

Add and plot

Code
vorintersect <- vorintersect |>
  left_join(as_tibble(pmmonth), by = "siteid")
head(vorintersect)
gamonth <- ga |>
  left_join(vorintersect, by = "GEOID")
ggplot() + 
  geom_spatvector(data = gamonth, aes(fill = pm25), color = NA) +
  scale_fill_distiller("PM 2.5\n(Jan. 2020)", palette = "Spectral") +
  theme_bw()

Add and plot

By centroids

Code
# create centroids
gacentroids <- centroids(ga)
#intersect
vorcentroids <- intersect(gacentroids, voronoiga)
# create new gamonth
gamonth <- ga |>
  mutate(siteid = vorcentroids$siteid)
gamonth <- gamonth |>
  left_join(as_tibble(pmmonth), by = "siteid")
ggplot() + 
  geom_spatvector(data = gamonth, aes(fill = pm25), color = NA) +
  scale_fill_distiller("PM 2.5\n(Jan. 2020)", palette = "Spectral") +
  theme_bw() + 
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb")) + 
  theme(legend.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))

By centroids

This is probably not ideal…

  • Doesn’t look ideal. Why?

Alternative interpolation methods

  • So what other alternatives do we have? Can you think of other options?
  • We’re going to open up the possibilities with a new R package: gstat
Code
install.packages("gstat") # only needed once
library(gstat)

New option 1: inverse distance weighting

  • Consider the following equation:

\[ \hat{y}_0 = \frac{\sum_{i=1}^{n} y_i\times\left(\frac{1}{d_i^\beta}\right)}{\sum_{i=1}^{n}\left(\frac{1}{d_i^\beta}\right)} \]

  • \(\hat{y}_0\) is what we want to predict at point \(0\)
  • \(y_i\) is the value at point \(i\) (the pollution monitoring stations)
  • \(d_i\) is the distance between point \(0\) and point \(i\)
  • \(\beta\) is a parameter that determines how much weight distance gets
    • If \(\beta=1\), this is traditionally called “inverse distance weighting”; note that gstat’s idw() defaults to \(\beta=2\) (see the sketch below)
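To make the formula concrete, here is a hand-rolled version for a single prediction point (a minimal sketch in base R; idw_point is a made-up helper, not part of gstat):

Code
# y: station values; d: distances from the prediction point to each station
idw_point <- function(y, d, beta = 1) {
  w <- 1 / d^beta     # inverse-distance weights
  sum(y * w) / sum(w) # weighted average, exactly the formula above
}
# toy example with three stations: larger beta puts more weight on close stations
idw_point(y = c(5, 8, 12), d = c(1, 2, 4), beta = 1)
idw_point(y = c(5, 8, 12), d = c(1, 2, 4), beta = 2)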

How do we set this up?

  1. We need a list of points where we want to predict
  2. We need a list of points where we have data (we already have this)
  • Also a small complication: gstat does not work with terra objects!
    • We are going to use sf
    • However, I am not going to load the package. Instead, I will use sf::st_as_sf() to turn terra objects into sf objects
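Here is what that conversion looks like in isolation (a minimal sketch, using the pm object from earlier):

Code
pm_sf <- sf::st_as_sf(pm)      # terra SpatVector -> sf object (what gstat expects)
class(pm_sf)                   # "sf" "data.frame"
pm_back <- terra::vect(pm_sf)  # and back to a SpatVector if needed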

How do we set this up?

  • Here’s the code:
Code
# new points
grid <- centroids(ga)
# note how I convert to sf inside the call!
# Note: idp is the "beta" from above
grid <- idw(pmmonth$pm25 ~ 1, sf::st_as_sf(pmmonth), 
  newdata = sf::st_as_sf(grid), idp = 1)
[inverse distance weighted interpolation]
Code
grid
Simple feature collection with 2796 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -85.55702 ymin: 30.5733 xmax: -80.84775 ymax: 34.97834
Geodetic CRS:  WGS 84
First 10 features:
   var1.pred var1.var                   geometry
1   6.301066       NA POINT (-85.00318 31.18634)
2   6.309804       NA POINT (-84.88777 31.32099)
3   6.293376       NA  POINT (-84.93954 31.4459)
4   6.392179       NA POINT (-84.49576 33.85575)
5   6.392407       NA POINT (-84.46748 33.87364)
6   6.317686       NA   POINT (-84.514 34.00787)
7   6.318725       NA POINT (-84.53907 33.99773)
8   6.154639       NA  POINT (-84.3168 34.34863)
9   6.233536       NA POINT (-84.48069 34.20635)
10  6.224112       NA POINT (-84.50699 34.22357)

Let’s add it back into the ga object

  • Here’s the code:
Code
gamonth$idwbeta1 <- grid$var1.pred

Comparing values of \(\beta\)

[Figure: IDW maps comparing different values of \(\beta\)]
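The code behind those panels is folded in the slides; a plausible sketch, assuming it simply reruns idw() with different values of idp (the values 0.5 and 2 and the column names are assumptions):

Code
grid <- centroids(ga)
idwlow <- idw(pmmonth$pm25 ~ 1, sf::st_as_sf(pmmonth),
  newdata = sf::st_as_sf(grid), idp = 0.5)
idwhigh <- idw(pmmonth$pm25 ~ 1, sf::st_as_sf(pmmonth),
  newdata = sf::st_as_sf(grid), idp = 2)
gamonth$idwbetalow <- idwlow$var1.pred
gamonth$idwbetahigh <- idwhigh$var1.pred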

We can control how many points to use

Code
grid <- centroids(ga)
grid <- idw(pmmonth$pm25 ~ 1, sf::st_as_sf(pmmonth), 
  newdata = sf::st_as_sf(grid), nmax = 1)
gamonth$idwbetavor <- grid$var1.pred

ggplot() + 
  geom_spatvector(data = gamonth, aes(fill = idwbetavor), color = NA, show.legend = FALSE) +
  scale_fill_distiller("PM 2.5\n(Jan. 2020)", palette = "Spectral") +
  labs(subtitle = "nmax = 1") +
  theme_bw()

What does this look like??

[Figure: IDW map with nmax = 1 (nearest station only)]
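With nmax = 1, each centroid gets the value of its single nearest station, which is essentially the Voronoi/centroid assignment from earlier. A quick check (a sketch; small discrepancies are possible because terra’s Voronoi polygons are planar while gstat uses great-circle distances on lon/lat data):

Code
# pm25 holds the centroid-based Voronoi values from earlier
all.equal(gamonth$pm25, gamonth$idwbetavor)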

Comparing nmax

[Figure: IDW maps comparing different values of nmax]

Some fun with maps!

Some fun with maps!

Code
library(rayshader)
g1 <- ggplot() + 
  geom_spatvector(data = gamonth, aes(fill = idwbeta1), color = NA) +
  scale_fill_distiller("PM 2.5", palette = "Spectral") +
  theme_bw() + 
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb")) + 
  theme(legend.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))

plot_gg(g1, multicore = TRUE, theta = 10, phi = 45)
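If you want a still image of the 3D view, rayshader’s render_snapshot() will capture it (the file path here is hypothetical):

Code
# capture the currently open 3D view to a png
render_snapshot("week8files/ga_pm25_3d.png")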

Let’s move to California (which has more data)

Code
capm <- read_csv("week8files/CApm25.csv")
capm$Date[1] 
capm$date <- mdy(capm$Date)
capm <- capm |>
  select(siteid = `Site ID`, date, pm25 = `Daily Mean PM2.5 Concentration`, 
    AQI = `Daily AQI Value`, lon = `Site Longitude`, lat = `Site Latitude`)
capm <- vect(capm, geom = c("lon", "lat"), crs = "EPSG:4326")

# just keep january, and one observation per station
camonth <- capm |>
  filter(month(date)==1) |>
  group_by(siteid) |>
  summarize(pm25 = mean(pm25, na.rm = TRUE), .groups = "drop")

# and the shapefile
ca <- vect("week8files/tl_2020_06_tract.shp")
# project
ca <- project(ca, crs(camonth))

Let’s move to California (which has more data)

[1] "01/01/2020"

First, let’s do IDW!

  • Your turn. Try it!


Code
grid <- centroids(ca)
grid <- idw(camonth$pm25 ~ 1, locations = sf::st_as_sf(camonth), newdata = sf::st_as_sf(grid))
ca$idw <- grid$var1.pred

ggplot() +
  geom_spatvector(data = ca, aes(fill = idw), color = NA) +
  scale_fill_distiller("PM 2.5\n(Jan. 2020)", palette = "Spectral") +
  theme_bw() + 
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb")) + 
  theme(legend.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))

Map

[Figure: IDW map of January PM2.5 across California census tracts]

Kriging

  • With this new data, we’re going to try something new: kriging

  • What is kriging?

    • Kriging is a method of spatial interpolation that uses the spatial correlation between points to estimate values at unobserved locations
    • It is more involved than IDW, but the spatial correlation structure is the key ingredient
  • Importantly, we can use predictors other than just distance

    • For example, we could use the x and y coordinates of the points as predictors
    • We could also use other variables that we think are related to pollution levels

Kriging steps

  1. Create a variogram
    • A variogram is a plot of the spatial correlation between points as a function of distance
  2. Fit a model to the variogram
  3. Use the fitted model to predict values at unobserved locations
  • Small note: we are going to use the log of pm25 (logs are generally better behaved)

Kriging

Code
# create the variogram
pm.vgm <- variogram(log(pm25)~1, sf::st_as_sf(camonth)) # calculates sample variogram values 
# fit the variogram
pm.fit <- fit.variogram(pm.vgm, model=vgm("Gau")) # fit model
# create the grid
grid <- centroids(ca)
# predict
pm.kriged <- krige(log(pm25) ~ 1, sf::st_as_sf(camonth), sf::st_as_sf(grid), model=pm.fit)

# put results into the ca shapefile
ca$krigingresults <- pm.kriged$var1.pred
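Before trusting the predictions, it is worth plotting the empirical variogram against the fitted model (gstat’s plot method does this directly):

Code
# empirical variogram points with the fitted Gaussian model overlaid
plot(pm.vgm, model = pm.fit)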

Kriging

[Figure: map of ordinary kriging predictions (log PM2.5) across California]


  • We can also explicitly tell it to use x and y coordinates as predictors
Code
camonth$x <- geom(camonth)[,"x"]
camonth$y <- geom(camonth)[,"y"]
grid <- centroids(ca)
grid$x <- geom(grid)[,"x"]
grid$y <- geom(grid)[,"y"]

# create the variogram
pm.vgm <- variogram(log(pm25)~x+y, sf::st_as_sf(camonth)) # calculates sample variogram values 
# fit the variogram
pm.fit <- fit.variogram(pm.vgm, model=vgm("Gau")) # fit model
# predict
pm.kriged <- krige(log(pm25) ~ x+y, sf::st_as_sf(camonth), sf::st_as_sf(grid), model=pm.fit)
[using universal kriging]
Code
# put results into the ca shapefile
ca$krigingresultslonlat <- pm.kriged$var1.pred
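One caveat: both sets of predictions are on the log scale. If you want PM2.5 levels for mapping, a simple back-transform is to exponentiate (a sketch; the comment notes the standard lognormal correction):

Code
# naive back-transform to the PM2.5 scale
ca$krigingpm25 <- exp(ca$krigingresultslonlat)
# a less biased lognormal back-transform adds half the kriging variance:
# exp(pm.kriged$var1.pred + pm.kriged$var1.var / 2)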

Comparing results

Your turn

  • Here’s your task for the rest of class:
    • Download the data for ozone for California (same website as earlier)
    • Interpolate using inverse distance weighting
    • Map it
    • Use the CApov.csv file and create some figures (not maps) that show relationships between poverty rates and pollution (ozone)

The example from Georgia

  • I have uploaded a csv file:
    • week8files/GApov.csv
Code
gapov <- read_csv("week8files/GApov.csv")
head(gapov)
# A tibble: 6 × 8
  GEO_ID    NAME  poptotal popblack pophispanic poortotal poorblack poorhispanic
  <chr>     <chr>    <dbl>    <dbl>       <dbl>     <dbl>     <dbl>        <dbl>
1 1400000U… Cens…     3246      147         258       746         0            0
2 1400000U… Cens…     2276      637         269       306       105            0
3 1400000U… Cens…     2221     1133          13       500       248            0
4 1400000U… Cens…     2883      255         937       565        23          462
5 1400000U… Cens…     1729      241           0       564        92            0
6 1400000U… Cens…     1892      361          74       337        64           26
Code
# GEO_ID is formatted like "1400000US<GEOID>"; strip the prefix so it
# matches the tract shapefile’s GEOID (adjust if your file differs)
gapov$GEO_ID <- sub("^1400000US", "", gapov$GEO_ID)
gapov$poorrate <- gapov$poortotal/gapov$poptotal
gapov$poorrateblack <- gapov$poorblack/gapov$popblack
gapov$poorratehispanic <- gapov$poorhispanic/gapov$pophispanic
gamonth <- gamonth |>
  left_join(gapov, by = c("GEOID" = "GEO_ID"))
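With poverty rates joined on, we can finally look at the relationship we set out to study. A minimal sketch (using the idwbeta1 column created earlier):

Code
# tract poverty rates against interpolated PM2.5
ggplot(as_tibble(gamonth), aes(x = idwbeta1, y = poorrate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(x = "Interpolated PM 2.5 (Jan. 2020)", y = "Poverty rate") +
  theme_bw()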