KDI School
September 10, 2024
Let’s start with a little introduction
Name, year, program, research interests, etc.
Geospatial data analysis in R
Major themes:
Today will just be a short introduction
For all future classes, please come with R and RStudio installed on your computer
Course website: https://github.com/JoshMerfeld/geospatialdataR
This is a brand new class, so I will likely be making changes as we go
Please check the course website regularly for updates
R package sf
In-class lab (week 5)
Rasters (week 6)
R package exactextractr
The goal of TA sections is to help you with R and R Markdown
For help with the actual material, please come to my office hours
I will be in Bangkok for a UN workshop the week of the 21st of October
I have not decided what to do, but there are two possibilities:
The goal for today is to give you a brief introduction to R and R Markdown
We will be using two small datasets to get you familiar with the program
A note: if you are completely new to R, the first few weeks will be a slog
Much of the material covered today comes from two (free!) sources:
You can run a line of code by clicking the “Run” button
You can run multiple lines of code by highlighting them and clicking the “Run” button (or the shortcut)
We will practice these later
The tidyverse package (more below)
You can create a vector with the c() function and assign it to an object with <-. You can also use =.
You can check what kind of object you have with the class() function:
class(vec)
is.vector(vec)
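Putting those pieces together, a minimal example (vec is just an example name):

```r
# create a vector with c() and assign it with <- (= also works)
vec <- c(1, 5, 9)

# check what kind of object we have
class(vec)      # "numeric"
is.vector(vec)  # TRUE
```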
The working directory should be where you opened the file from. Check it like this:
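For example:

```r
# print the current working directory
getwd()
# if it is not where you expect, you can change it:
# setwd("path/to/your/folder")
```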
The one exception to always using a script? I install packages in the CONSOLE. You can install packages like this:
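For example, to install the tidyverse (the package these slides use below):

```r
# run this in the CONSOLE, not in your script
install.packages("tidyverse")
```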
We need to load any R packages we want to use at the very top of the script. You should have a comment on line one, so on line two write:
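A sketch of the top of such a script (the comment text is just an example):

```r
# Week 1: introduction to R
library(tidyverse)
```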
This will load the tidyverse package.
Use the read_csv() function to load the data and assign it to an object with <- (you can actually use = here, though). The data frame should show up in the upper right-hand corner of RStudio.
Click on the arrow and it will show more information.
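The loading step might look like this (the file path is taken from the setup chunk shown later in these slides):

```r
# load the data and assign it to an object called "data"
data <- read_csv("week1data/data.csv")
```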
[1] "res_id" "ability" "age"
[4] "educyears" "isfarmer" "yearsfarming"
[7] "yearsmanagingfarm" "outsidewage" "worriedaboutcropprices"
[10] "worriedaboutcropyields"
Rows: 7,209
Columns: 10
$ res_id <dbl> 501, 502, 503, 504, 505, 506, 507, 508, 509, 51…
$ ability <dbl> 74, 42, 67, 54, 57, 72, 51, 65, 54, 24, 24, 49,…
$ age <dbl> 83, 27, 49, 50, 70, 45, 58, 41, 45, 70, 24, 45,…
$ educyears <chr> "16", "7", "7", "7", "4", "7", "7", "7", "7", N…
$ isfarmer <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"…
$ yearsfarming <dbl> 60, 17, 20, 15, 40, 15, 30, 20, 20, 60, 12, 30,…
$ yearsmanagingfarm <dbl> 46, 17, 5, 10, 26, 15, 25, 15, 10, 50, 7, 25, 5…
$ outsidewage <dbl> 3.000e+06, 1.000e+10, 6.000e+06, 1.000e+10, 1.0…
$ worriedaboutcropprices <chr> "Sometimes", "Not at all", "Sometimes", "Not at…
$ worriedaboutcropyields <chr> "Sometimes", "Sometimes", "Not at all", "Someti…
spc_tbl_ [7,209 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ res_id : num [1:7209] 501 502 503 504 505 506 507 508 509 510 ...
$ ability : num [1:7209] 74 42 67 54 57 72 51 65 54 24 ...
$ age : num [1:7209] 83 27 49 50 70 45 58 41 45 70 ...
$ educyears : chr [1:7209] "16" "7" "7" "7" ...
$ isfarmer : chr [1:7209] "Yes" "Yes" "Yes" "Yes" ...
$ yearsfarming : num [1:7209] 60 17 20 15 40 15 30 20 20 60 ...
$ yearsmanagingfarm : num [1:7209] 46 17 5 10 26 15 25 15 10 50 ...
$ outsidewage : num [1:7209] 3e+06 1e+10 6e+06 1e+10 1e+10 ...
$ worriedaboutcropprices: chr [1:7209] "Sometimes" "Not at all" "Sometimes" "Not at all" ...
$ worriedaboutcropyields: chr [1:7209] "Sometimes" "Sometimes" "Not at all" "Sometimes" ...
- attr(*, "spec")=
.. cols(
.. res_id = col_double(),
.. ability = col_double(),
.. age = col_double(),
.. educyears = col_character(),
.. isfarmer = col_character(),
.. yearsfarming = col_double(),
.. yearsmanagingfarm = col_double(),
.. outsidewage = col_double(),
.. worriedaboutcropprices = col_character(),
.. worriedaboutcropyields = col_character()
.. )
- attr(*, "problems")=<externalptr>
You can access a single column with the $ operator: data$age
res_id ability age educyears
Min. : 501 Min. : 10.00 Min. :18.00 Length:7209
1st Qu.:2783 1st Qu.: 51.00 1st Qu.:34.00 Class :character
Median :4714 Median : 59.00 Median :42.00 Mode :character
Mean :4775 Mean : 58.66 Mean :43.54
3rd Qu.:6764 3rd Qu.: 67.00 3rd Qu.:52.00
Max. :8955 Max. :100.00 Max. :87.00
isfarmer yearsfarming yearsmanagingfarm outsidewage
Length:7209 Min. :-9.00 Min. :-9.00 Min. :2.000e+03
Class :character 1st Qu.:18.00 1st Qu.: 6.00 1st Qu.:3.500e+06
Mode :character Median :26.00 Median :14.00 Median :1.000e+10
Mean :28.02 Mean :15.94 Mean :7.156e+09
3rd Qu.:38.00 3rd Qu.:22.00 3rd Qu.:1.000e+10
Max. :70.00 Max. :70.00 Max. :1.000e+10
NA's :219 NA's :219 NA's :216
worriedaboutcropprices worriedaboutcropyields
Length:7209 Length:7209
Class :character Class :character
Mode :character Mode :character
You can also index with square brackets, as [row, column]:
data[1,2]
data[,2]
# A tibble: 10 × 9
ability age educyears isfarmer yearsfarming yearsmanagingfarm outsidewage
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 74 83 16 Yes 60 46 3000000
2 42 27 7 Yes 17 17 9999999999
3 67 49 7 Yes 20 5 6000000
4 54 50 7 Yes 15 10 9999999999
5 57 70 4 Yes 40 26 9999999999
6 72 45 7 Yes 15 15 800000
7 51 58 7 Yes 30 25 2000000
8 65 41 7 Yes 20 15 9999999999
9 54 45 7 Yes 20 10 300000
10 24 70 <NA> Yes 60 50 9999999999
# ℹ 2 more variables: worriedaboutcropprices <chr>,
# worriedaboutcropyields <chr>
dbl is short for double, which is a numeric variable (the "type" of numeric variable is about how much memory is needed to store it). chr is short for character, which is a string of characters (text).
Note that educyears was read as a character string even though it seems to be a number. We can use the equality operator (==) to check whether a value is "Not Mentioned".
Let's look at the values of educyears using the unique() function, which outputs a vector:
[1] "16" "7" "4" NA "11" "6" "13" "5" "8" "10" "12" "9" "2" "3" "15"
[16] "14" "20" "18" "17" "1" "19"
[1] "numeric"
We can chain commands together with the pipe operator (|>).
Here is an example of how we can use pipes with the mutate() function in tidyverse. We need ifelse() to make this work:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
1.000 7.000 7.000 6.735 7.000 20.000 3113
Note that we could wrap as.numeric() around the ifelse() command to do it on one line!
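A sketch of that cleaning step with pipes (using the "Not Mentioned" code discussed above):

```r
# recode "Not Mentioned" to NA, then convert to numeric, all in one line
data <- data |>
  mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears)))
summary(data$educyears)
```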
In Stata, by default, functions ignore missing values
[1] NA
If there are any missing values, the function will evaluate to missing!
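A quick illustration with base R:

```r
mean(c(1, 2, NA))                # NA: the missing value propagates
mean(c(1, 2, NA), na.rm = TRUE)  # 1.5: missing values are dropped first
```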
The mean() function in the previous slide outputs a single value - That means we could store that value as an object:
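For example (meaneduc and sdeduc are hypothetical names):

```r
# store the mean and standard deviation as objects
meaneduc <- mean(data$educyears, na.rm = TRUE)
sdeduc <- sd(data$educyears, na.rm = TRUE)
meaneduc
sdeduc
```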
[1] 6.735107
[1] 2.404086
How is this helpful? We can use these values later in our script!
We can combine the mean() and sd() functions within mutate to create a new, standardized variable:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
NA NA NA NaN NA NA 7209
Oh no! What happened? Without na.rm = TRUE, mean() and sd() evaluate to NA, so every value of the new variable is missing.
We can combine the mean() and sd() functions within mutate (this time with na.rm = TRUE) to create a new, standardized variable:
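A sketch of that mutate() call (educstd is a hypothetical name for the new variable):

```r
# standardize educyears: subtract the mean and divide by the standard deviation
data <- data |>
  mutate(educstd = (educyears - mean(educyears, na.rm = TRUE)) / sd(educyears, na.rm = TRUE))
summary(data$educstd)
```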
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-2.3856 0.1102 0.1102 0.0000 0.1102 5.5176 3113
Note that we can shorten TRUE to T (or FALSE to F).
First install a new package that has a dataset we will use (you can do this in the console):
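The data appear to come from the nycflights13 package (an assumption based on the flights data shown below):

```r
# install in the console:
install.packages("nycflights13")
# then, in your script:
library(nycflights13)
glimpse(flights)
```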
Now let’s see:
Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
There’s a nice package called lubridate that makes working with dates much easier.
Departure time/arrival time is in the format HHMM (e.g., 1530 is 3:30pm). We can add this to the date.
[1] "5H 17M 0S" "5H 33M 0S" "5H 42M 0S" "5H 44M 0S" "5H 54M 0S" "5H 54M 0S"
[7] "5H 55M 0S" "5H 57M 0S" "5H 57M 0S" "5H 58M 0S" "5H 58M 0S" "5H 58M 0S"
[13] "5H 58M 0S" "5H 58M 0S" "5H 59M 0S" "5H 59M 0S" "5H 59M 0S" "6H 0M 0S"
[19] "6H 0M 0S" "6H 1M 0S"
Lubridate also lets us work with “periods”
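One way to build periods like those above (a sketch; it splits the HHMM departure time into hours and minutes):

```r
library(lubridate)
# dep_time is stored as HHMM, e.g. 517 means 5:17
head(hours(flights$dep_time %/% 100) + minutes(flights$dep_time %% 100), 20)
```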
Let’s get the average departure delay by NYC airport:
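A sketch of the command (assuming tidyverse and the flights data are loaded):

```r
# group rows by origin airport, then average the departure delay within each group
flights |>
  group_by(origin) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE))
```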
# A tibble: 3 × 2
origin avg_dep_delay
<chr> <dbl>
1 EWR 15.1
2 JFK 12.1
3 LGA 10.3
Note that this does not create a single value. Instead it creates a tibble (a data frame) summarizing the data by our grouping variable.
What if we want to save that tibble instead?
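We can simply assign the result to an object (delays is a hypothetical name):

```r
delays <- flights |>
  group_by(origin) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE))
delays
```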
# A tibble: 3 × 2
origin avg_dep_delay
<chr> <dbl>
1 EWR 15.1
2 JFK 12.1
3 LGA 10.3
I could then output this to a table if I wanted to (using Markdown, more on this later):
origin | avg_dep_delay |
---|---|
EWR | 15.10795 |
JFK | 12.11216 |
LGA | 10.34688 |
How does departure delay vary by time of day?
We can color code by origin, too!
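A sketch of such a plot (averaging delay by hour and coloring by origin; variable names are from the flights data):

```r
flights |>
  group_by(origin, hour) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = hour, y = avg_dep_delay, color = origin)) +
  geom_line()
```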
R Markdown is a way to combine text and code
These slides were all created in R Markdown
My papers are written in R Markdown (well, some of them are, anyway)
Yihui Xie, J. J. Allaire, and Garrett Grolemund have an awesome – free! – resource on R Markdown,
You’ll need to install LaTeX and R Markdown. You can do this in the console:
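One common way to do this (using tinytex for the LaTeX installation is my assumption; any LaTeX distribution works):

```r
install.packages("rmarkdown")
install.packages("tinytex")
tinytex::install_tinytex()  # installs a small LaTeX distribution
```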
Note the --- AT THE TOP AND BOTTOM OF THE YAML HEADER!
You can insert today's date using Sys.Date() within R inline code (more in a second).
Just below the YAML header you'll see a "code chunk" called "setup" (r setup, include = FALSE).
Note how it has three backticks (```) at the top and bottom. This differentiates the "code chunk" from the rest of the document.
Use the setup code chunk to load any packages or data that you want to use in the rest of the document.
This is an example of what the setup chunk looks like.
```{r setup, include=FALSE}
# universal chunk options.
# echo = TRUE will show the code in the document.
# echo = FALSE will not.
knitr::opts_chunk$set(echo = TRUE)
# load any packages you want to use throughout the document.
library(tidyverse)
# load any data you want to use throughout the document.
data <- read_csv("week1data/data.csv")
```
Here is an example of a regular code chunk.
Here is the output of that chunk:
Oh no! It looks bad! Changes:
How did I do that?
```{r chunkexample, include = TRUE, warning = FALSE, out.width = "55%", fig.align = "center"}
# note that I named the chunk.
# all chunks must have a UNIQUE name.
# you will get an error if they don't
# I already loaded my data above
ggplot(flights) +
geom_histogram(aes(x = air_time), binwidth = 10)
```
There are lots of code chunk options.
You can find the code chunk options here (https://rpubs.com/Lingling912/870659)
This will get easier to use as you get more and more practice.
Inline math: \(y = x + \varepsilon\)
Display math: \[y = x + \varepsilon\]
LaTeX is particularly helpful for rendering math
You can find a handy reference guide here (https://icl.utk.edu/~mgates3/docs/latex.pdf)
We can create tables with the kable() function in the knitr package.
Even more is possible with the kableExtra package. You need to download this and load it if you want to use it.
summat <- flights |>
# this groups ROWS based on their origin value
group_by(origin) |>
# create means by group
summarize(avg_dep_delay = mean(dep_delay, na.rm = T),
avg_arr_delay = mean(arr_delay, na.rm = T),
avg_air_time = mean(air_time, na.rm = T),
flights = n())
# output
kable(summat, align = "cccc")
origin | avg_dep_delay | avg_arr_delay | avg_air_time | flights |
---|---|---|---|---|
EWR | 15.10795 | 9.107055 | 153.3000 | 120835 |
JFK | 12.11216 | 5.551481 | 178.3490 | 111279 |
LGA | 10.34688 | 5.783488 | 117.8258 | 104662 |
summat <- flights |>
# this groups ROWS based on their origin value
group_by(origin) |>
# create means by group, ROUNDING to two decimal places
summarize(avg_dep_delay = round(mean(dep_delay, na.rm = T), 2),
avg_arr_delay = round(mean(arr_delay, na.rm = T), 2),
avg_air_time = round(mean(air_time, na.rm = T), 2),
flights = n())
# rename columns
colnames(summat) <- c("Origin", "Departure Delay", "Arrival Delay", "Flight Time", "Flights")
# output
kable(summat, caption = "Averages by origin (minutes)",
align = "ccc", linesep = "", digits = 2, format.args = list(big.mark = ","),
booktabs = TRUE) |> # this is from kableExtra. You don't have to use it, but I like it.
# all of the below are also from kableExtra
column_spec(1, width = "2cm") |>
column_spec(2:5, width = "3.5cm") |>
kable_styling(full_width = F) |>
kable_classic_2()
Origin | Departure Delay | Arrival Delay | Flight Time | Flights |
---|---|---|---|---|
EWR | 15.11 | 9.11 | 153.30 | 120,835 |
JFK | 12.11 | 5.55 | 178.35 | 111,279 |
LGA | 10.35 | 5.78 | 117.83 | 104,662 |
That’s enough on tables for now
As you can see, there are lots of ways to customize tables
Where this becomes really powerful is when you combine it with R code to create tables dynamically
When I write a paper in Markdown, I generally do not do all of my analysis in the Markdown document
Instead, I do the analysis in another script and then save the resulting tables
I then load these tables in the setup chunk of my Markdown document and use them in the document