Geospatial data analysis in R

Coordinate reference systems and vector data I

Josh Merfeld

KDI School

September 23, 2024

Introduction to geospatial data

  • One estimate says that 100 TB of only weather data are generated every single day
    • This means there is a lot of data to work with!
    • Note that this is also problematic, since it can be difficult to work with such large datasets
  • Geospatial data is used in a variety of fields
    • Agriculture
    • Urban planning
    • Environmental science
    • Public health
    • Transportation
    • And many more!

The amount of geospatial data is useful for many applications!

  • Geospatial data can be highly predictive of e.g. poverty
    • Urbanity
    • Land class/cover
    • Vegetation indices
    • Population counts
    • etc. etc.
  • More importantly: it’s available everywhere!

Getting started with
geospatial data

Getting started with geospatial data

  • What are we doing today?
    • Shapefiles
      • Polygons
      • Points
      • Lines
      • Mapping with the package sf
    • Coordinate reference systems
      • Latitude/longitude
      • Projections

Shapefiles

  • Shapefiles are a common format for geospatial data
    • They are a form of vector data
  • Shapefiles are made up of at least three files:
    • .shp - the shape itself
    • .shx - the index
    • .dbf - the attributes
    • .prj - the projection
      • This one is not technically necessary! But it’s common to have.
    • What these all mean isn’t important for now, just make sure they are there! Check the week2files folder on github.

Let’s look at Northern Malawi

  • Collection of features

  • One feature

Types of features

  • Polygons
    • Areas
    • Districts, countries, etc.
  • Lines
    • Lines
    • Roads, rivers, etc.
  • Points
    • Points

Let’s start with polygons

Imagine a rectangle, on a coordinate plane

Imagine a rectangle, on a coordinate plane

Imagine a rectangle, on a coordinate plane

  • We need four points.
  • But polygons in shapefiles are a little different.
    • We have to “close” the feature so it knows it’s a polygon!
  • We do this by adding a fifth point: the same as the first point!
Five points (vertices) in our feature
X value Y value
Point 1 1 1
Point 2 3 1
Point 3 3 3
Point 4 1 3
Point 5 1 1

All features are made of vertices

  • So we have all our vertices (489 of them!)
  • The question:
    • What is the coordinate system here?

All features are made of vertices

  • So we have all our vertices (489 of them!)
  • The question:
    • What is the coordinate system here?
  • We will return to this in a bit!

One more example of polygons

One more example of polygons

Lines

  • Lines are also made up of vertices
  • But they are not closed

Lines example - “Primary” roads in India (2014)

One road




Length of this line feature: 1619.93 (m)


Points

  • Points are exactly what they sound like: points!

  • What could be a point?

Points

  • Points are exactly what they sound like: points!

  • What could be a point?

    • A city
    • A weather station
    • A tree
    • A household
    • etc.

What do you think this is?

What do you think this is?

Train stations! First 15

name name_en railway geometry
지평 Jipyeong station POINT (127.6291 37.47681)
국수 Guksu station POINT (127.3993 37.51617)
양평 Yangpyeong station POINT (127.4919 37.49285)
공릉 Gongneung station POINT (127.073 37.62564)
남광주 Namgwangju station POINT (126.9228 35.13944)
공항 Airport station POINT (126.8117 35.1441)
농성 Nongseong station POINT (126.884 35.15322)
금남로5가 Geumnamro 5-ga station POINT (126.9101 35.15371)
금남로4가 Geumnamno 4(sa)-ga station POINT (126.9146 35.15071)
소태 Sotae station POINT (126.9322 35.12361)
녹동 Nokdong station POINT (126.934 35.10682)
동구청 Dong-gu Office station POINT (128.6322 35.8844)
동대구 Dongdaegu station POINT (128.6275 35.87742)
북구청 Buk-gu Office station POINT (128.5814 35.88381)
평창 Pyeongchang station POINT (128.43 37.56196)

The train stations are a collection of features

  • Just like before, the shapefile is a collection of features!
  • The only difference now is that each feature is a point

Reading and plotting shapefiles in R

Reading shapefiles in R

  • My go-to package for shapefiles in R is sf
  • Reading shapefiles is VERY easy! And you can treat them like dataframes.
Code
library(sf)
# this is the shapefile for Gambia
gambia <- read_sf("week2files/gambia.shp")
gambia
Simple feature collection with 115 features and 17 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -16.81426 ymin: 13.04347 xmax: -13.79112 ymax: 13.82274
Geodetic CRS:  WGS 84
# A tibble: 115 × 18
   Shape_Leng Shape_Area ADM3_EN       ADM3_PCODE ADM3_REF ADM3ALT1EN ADM3ALT2EN
        <dbl>      <dbl> <chr>         <chr>      <chr>    <chr>      <chr>     
 1     0.0115 0.00000688 New Town East GM010101   <NA>     <NA>       <NA>      
 2     0.0206 0.0000161  New Town West GM010102   <NA>     <NA>       <NA>      
 3     0.0200 0.0000192  Soldier Town  GM010103   <NA>     <NA>       <NA>      
 4     0.0337 0.0000302  Box Bar       GM010201   <NA>     <NA>       <NA>      
 5     0.156  0.000591   Campama       GM010202   <NA>     <NA>       <NA>      
 6     0.0179 0.0000127  Crab Island   GM010203   <NA>     <NA>       <NA>      
 7     0.0322 0.0000533  Half Die      GM010301   <NA>     <NA>       <NA>      
 8     0.0216 0.0000236  Jollof Town   GM010302   <NA>     <NA>       <NA>      
 9     0.0203 0.0000257  Portugiese T… GM010303   <NA>     <NA>       <NA>      
10     0.494  0.00937    Basse         GM020101   <NA>     <NA>       <NA>      
# ℹ 105 more rows
# ℹ 11 more variables: ADM2_EN <chr>, ADM2_PCODE <chr>, ADM1_EN <chr>,
#   LGA_EN <chr>, ADM1_PCODE <chr>, ADM0_EN <chr>, ADM0_PCODE <chr>,
#   date <date>, validOn <date>, validTo <date>, geometry <POLYGON [°]>

Plotting is also very easy

Code
ggplot() + 
  geom_sf(data = gambia)

My go-to theme

Code
ggplot() +
  geom_sf(data = gambia) +
  theme_bw()

Other changes you can make

Code
ggplot() +
  geom_sf(data = gambia, fill = "#f0f1eb", color = "#006334") +
  theme_bw() +
  labs(subtitle = "Admin3s in Gambia")

Give it a try with TAs (admin3) in Malawi (mw3.shp)

Code
library(sf)
# this is the shapefile for the northern region of Malawi, TA level
northmw <- read_sf("week2files/mw3.shp")
ggplot() +
  geom_sf(data = northmw)
Code
ggplot() +
  geom_sf(data = northmw) +
  theme_bw() +
  labs(subtitle = "TAs in Northern Malawi")

One more example - map from earlier

Code
admin2 <- read_sf("week2files/mw2.shp")

ggplot() + 
  geom_sf(data = admin2, 
    fill = "white", color = kdisgray) +
  geom_sf(data = admin2 |> filter(DIST_CODE=="107"), 
    fill = "white", color = kdisgreen) +
  theme_bw() +
  labs(subtitle = "Districts in Northern Malawi")

What if we want to plot mw2, mw3, and mw4?

What if we want to plot mw2, mw3, and mw4?

Code
admin2 <- read_sf("week2files/mw2.shp")
admin3 <- read_sf("week2files/mw3.shp")
admin4 <- read_sf("week2files/mw4.shp")

g1 <- ggplot() + 
  geom_sf(data = admin2, fill = "white", color = "black") +
  theme_bw() + labs(subtitle = "A. Admin2 (districts)")
g2 <- ggplot() + 
  geom_sf(data = admin3, fill = "white", color = "black") +
  theme_bw() + labs(subtitle = "B. Admin3 (TAs)")
g3 <- ggplot() + 
  geom_sf(data = admin4, fill = "white", color = "black") +
  theme_bw() + labs(subtitle = "C. Admin4 (EAs)")

What if we want to plot mw2, mw3, and mw4?

  • Enter cowplot!
Code
library(cowplot)

plot_grid(g1, g2, g3, ncol = 3)

A quick note about shapefile sizes

  • What do you think is the main determinant of the size of a shapefile?
  • The number of vertices!
    • Geographic size doesn’t really matter!
  • For the three Malawi shapefiles?
    • mw2.shp: 448 KB
    • mw3.shp: 4.9 MB
    • mw4.shp: 40.3 MB
  • And this is only for Northern Malawi.
    • The entire country is 123 MB
    • The 2023 shapefile from OSM for Indian roads is 236 MB
    • The shapefile of Indian villages is 614 MB

And that’s only shapefile

  • Other geospatial data can get even bigger!

  • How large do you think the folder on my computer that contains imagery for all of Malawi (at 5m resolution) is?

  • About 55 GB!

  • This is just a warning… I’ll generally always give you (relatively) small files for practice

Let’s talk about coordinates

Latitude and longitude on a globe

  • The most common coordinate reference system (CRS) is latitude/longitude
    • Latitude: North/South
    • Longitude: East/West
    • The equator is at 0° latitude
    • The prime meridian is at 0° longitude

  • But there’s a problem with using latitude/longitude
    • The Earth is a sphere (well, more or less; really an oblate spheroid)

The basic problem

  • The basic problem is that one degree of longitude changes at different latitudes!
    • At the equator, one degree of longitude is about 111 km
    • At 15N/S, one degree of longitude is about 107 km
    • At 30N/S, one degree of longitude is about 96 km
    • At 45N/S, one degree of longitude is about 79 km
    • At 60N/S, one degree of longitude is about 56 km
      • This explains Greenland!

  • It’s not an easy problem to solve, as all solutions have drawbacks!

Preserve shape, give up area

Projections

  • A projection is a way to represent the Earth’s surface on a flat plane
    • In other words, it’s a way to transform the three-dimensional surface of the earth into two dimensions
    • Think of this as a way to “flatten” the Earth
  • “Equal area” projections
    • These projections preserve the area of features
  • “Conformal” projections
    • These projections preserve the shape of features
  • It is impossible to perfectly preserve both!
    • But they can be close, especially in smaller areas
    • The earth looks quite flat from close up, after all

Robinson projection: compromise across the board

Meractor projection: preserves angular relationships

Azimuthal Equidistant projection

Mollweide Equal Area Cylindrical projection

Three families of projections

  • The QGIS website has a great explanation of the three families of projections
    • Cylindrical
    • Conical
    • Planar

Three families of projections

Coordinate reference systems

  • Coordinate reference systems (CRS) are a way to define how coordinates are represented
    • This includes the projection, but also other things
  • The most popular is probably WGS 84 (EPSG:4326), which is a geographic CRS
    • This is latitude/longitude
  • One degree of lat/lon:
    • 60 minutes
  • One minute:
    • 60 seconds
  • Geographic here means it is the location on the earth’s surface
    • This is different from a projected CRS, which is more about how to draw the earth on a flat surface

Geographic vs. projected

A common CRS: Universal Transverse Mercator (UTM)

  • Universal Transverse Mercator (UTM) is a projected CRS
    • It divides the world into 60 zones
    • Each zone is 6° wide
    • The equator is the origin of each zone
    • The equator is at 0 m

Universal Transverse Mercator (UTM)

A common CRS: Universal Transverse Mercator (UTM)

  • UTM defines X values (“longitude”) FROM THE MIDDLE of each zone

    • This middle line is called the central meridian
  • UTM defines Y values (“latitude”) from the equater

  • There are some other details:

    • UTM values are never negative, so we offset values
      • This is called a “false easting” and “false northing”
      • Details aren’t super important
  • Note: it ignores altitude. It assumes a perfect ellipsoid

“Transverse mercator” projection per zone (N and S)

Projections with sf

Code
admin2 <- read_sf("week2files/mw2.shp")

# Geographic CRS: WGS 84 - lat/lon
crs(admin2)
[1] "GEOGCRS[\"WGS 84\",\n    DATUM[\"World Geodetic System 1984\",\n        ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n            LENGTHUNIT[\"metre\",1]]],\n    PRIMEM[\"Greenwich\",0,\n        ANGLEUNIT[\"degree\",0.0174532925199433]],\n    CS[ellipsoidal,2],\n        AXIS[\"geodetic latitude (Lat)\",north,\n            ORDER[1],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n        AXIS[\"geodetic longitude (Lon)\",east,\n            ORDER[2],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n    ID[\"EPSG\",4326]]"

What is the appropriate zone for Malawi?

  • Go to Google
    • Search “UTM CRS Malawi”
  • What did you find?
  • UTM zone 36S
  • With the sf package, we want to find the “EPSG” code to project
    • 20936

What is the appropriate zone for Malawi?

Code
admin2proj <- st_transform(admin2, 20936)

# It has changed! Now it's projected
crs(admin2proj)
[1] "PROJCRS[\"Arc 1950 / UTM zone 36S\",\n    BASEGEOGCRS[\"Arc 1950\",\n        DATUM[\"Arc 1950\",\n            ELLIPSOID[\"Clarke 1880 (Arc)\",6378249.145,293.4663077,\n                LENGTHUNIT[\"metre\",1]]],\n        PRIMEM[\"Greenwich\",0,\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n        ID[\"EPSG\",4209]],\n    CONVERSION[\"UTM zone 36S\",\n        METHOD[\"Transverse Mercator\",\n            ID[\"EPSG\",9807]],\n        PARAMETER[\"Latitude of natural origin\",0,\n            ANGLEUNIT[\"degree\",0.0174532925199433],\n            ID[\"EPSG\",8801]],\n        PARAMETER[\"Longitude of natural origin\",33,\n            ANGLEUNIT[\"degree\",0.0174532925199433],\n            ID[\"EPSG\",8802]],\n        PARAMETER[\"Scale factor at natural origin\",0.9996,\n            SCALEUNIT[\"unity\",1],\n            ID[\"EPSG\",8805]],\n        PARAMETER[\"False easting\",500000,\n            LENGTHUNIT[\"metre\",1],\n            ID[\"EPSG\",8806]],\n        PARAMETER[\"False northing\",10000000,\n            LENGTHUNIT[\"metre\",1],\n            ID[\"EPSG\",8807]]],\n    CS[Cartesian,2],\n        AXIS[\"(E)\",east,\n            ORDER[1],\n            LENGTHUNIT[\"metre\",1]],\n        AXIS[\"(N)\",north,\n            ORDER[2],\n            LENGTHUNIT[\"metre\",1]],\n    USAGE[\n        SCOPE[\"Engineering survey, topographic mapping.\"],\n        AREA[\"Malawi. Zambia and Zimbabwe - east of 30°E.\"],\n        BBOX[-22.42,30,-8.19,35.93]],\n    ID[\"EPSG\",20936]]"

They look quite similar!

They look quite similar! Different coordinates

Code
g1 <- ggplot(admin2) +
  geom_sf(fill = "white", color = kdisgreen) +
  theme_bw() +
  labs(subtitle = "A. Geographic CRS") +
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))
g2 <- ggplot(admin2proj) +
  geom_sf(fill = "white", color = kdisgreen) +
  theme_bw() +
  labs(subtitle = "B. Projected CRS") +
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb")) +
  coord_sf(crs = st_crs(20936))

Malawi is small. What about something larger?

Now it’s your turn!

  • Use the Indian roads shapefile (indiaprimaryroads.shp)

  • Do the following:

    • Find the CRS of the shapefile
    • Transform the shapefile to UTM zone 44N (EPSG: 24344)
    • Graph both side by side using cowplot

The solution

Code
roads <- read_sf("week2files/indiaprimaryroads.shp")
crs(roads)
[1] "GEOGCRS[\"WGS 84\",\n    DATUM[\"World Geodetic System 1984\",\n        ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n            LENGTHUNIT[\"metre\",1]]],\n    PRIMEM[\"Greenwich\",0,\n        ANGLEUNIT[\"degree\",0.0174532925199433]],\n    CS[ellipsoidal,2],\n        AXIS[\"geodetic latitude (Lat)\",north,\n            ORDER[1],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n        AXIS[\"geodetic longitude (Lon)\",east,\n            ORDER[2],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n    ID[\"EPSG\",4326]]"
Code
roadsproj <- st_transform(roads, 24344)
crs(roadsproj)
[1] "PROJCRS[\"Kalianpur 1975 / UTM zone 44N\",\n    BASEGEOGCRS[\"Kalianpur 1975\",\n        DATUM[\"Kalianpur 1975\",\n            ELLIPSOID[\"Everest 1830 (1975 Definition)\",6377299.151,300.8017255,\n                LENGTHUNIT[\"metre\",1]]],\n        PRIMEM[\"Greenwich\",0,\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n        ID[\"EPSG\",4146]],\n    CONVERSION[\"UTM zone 44N\",\n        METHOD[\"Transverse Mercator\",\n            ID[\"EPSG\",9807]],\n        PARAMETER[\"Latitude of natural origin\",0,\n            ANGLEUNIT[\"degree\",0.0174532925199433],\n            ID[\"EPSG\",8801]],\n        PARAMETER[\"Longitude of natural origin\",81,\n            ANGLEUNIT[\"degree\",0.0174532925199433],\n            ID[\"EPSG\",8802]],\n        PARAMETER[\"Scale factor at natural origin\",0.9996,\n            SCALEUNIT[\"unity\",1],\n            ID[\"EPSG\",8805]],\n        PARAMETER[\"False easting\",500000,\n            LENGTHUNIT[\"metre\",1],\n            ID[\"EPSG\",8806]],\n        PARAMETER[\"False northing\",0,\n            LENGTHUNIT[\"metre\",1],\n            ID[\"EPSG\",8807]]],\n    CS[Cartesian,2],\n        AXIS[\"(E)\",east,\n            ORDER[1],\n            LENGTHUNIT[\"metre\",1]],\n        AXIS[\"(N)\",north,\n            ORDER[2],\n            LENGTHUNIT[\"metre\",1]],\n    USAGE[\n        SCOPE[\"Engineering survey, topographic mapping.\"],\n        AREA[\"India - onshore between 78°E and 84°E.\"],\n        BBOX[8.29,78,35.5,84.01]],\n    ID[\"EPSG\",24344]]"

The solution

Code
g1 <- ggplot(roads) +
  geom_sf(fill = "white", color = kdisgreen) +
  theme_bw() +
  labs(subtitle = "A. Geographic CRS")
g2 <- ggplot(roadsproj) +
  geom_sf(fill = "white", color = kdisgreen) +
  theme_bw() +
  labs(subtitle = "B. Projected CRS") +
  coord_sf(crs = st_crs(24344))

plot_grid(g1, g2)

The solution

More practice

  • I’d like you to find a shapefile for a country of your choice
    • Ths isn’t always easy
    • I always use Google to find one
    • One place you can often find something on humdata.org (almost always shows up in the search!)
  • Do the following:
    • Find the CRS of the shapefile
    • Transform the shapefile to the appropriate UTM zone for the country
      • Note that many countries have multiple UTM zones! You can just choose one
    • Plot them side by side using cowplot

Returning to the files

  • Shapefiles are made up of at least four files:
    • .shp - the shape itself
      • The geometry
    • .shx - the index
      • The “index”. You can actual recover the .shx file from the .shp file
    • .dbf - the attributes
      • The attributes for each features. Could be names, population values, etc.
    • .prj - the projection
      • The projection information, which we just discussed
  • There are often more files, but these are the main ones
    • You must have the first three. The fourth is optional, but common.