Geospatial data analysis in R
Week 1 - Introduction

Josh Merfeld

KDI School

September 10, 2024

Introduction

Introductions

  • Let’s start with a little introduction

  • Name, year, program, research interests, etc.

    • Why are you taking this class?

Course Overview

  • Geospatial data analysis in R

  • Major themes:

    • Geospatial data types
      • Shapefiles (vector files), rasters, etc.
    • Data visualization
      • ggplot, tidyterra
    • Getting comfortable with R and R Markdown

Course Overview

  • Today will just be a short introduction

  • For all future classes, please come with R and R Studio installed on your computer

    • You can find instructions on the syllabus
    • You must bring a laptop to class. If you cannot do this, please speak with me.
  • Course website: https://github.com/JoshMerfeld/geospatialdataR

    • You can find slides, assignments, and other materials here
    • It will be updated as we go throughout the semester
      • (I am still making slides!)

Course Overview

  • This is a brand new class, so I will likely be making changes as we go

  • Please check the course website regularly for updates

Detailed outline (tentative)

  • Coordinate reference systems and vector data I (week 2)
    • Shapefiles
    • Coordinate reference systems
    • R package sf
  • Coordinate reference systems and vector data II (week 3)
    • Continued from week 2
  • Data extraction I (week 4)
    • Combining shapefiles
    • Overlap analysis
    • Distance analysis

Detailed outline (tentative)

  • In-class lab (week 5)

  • Rasters (week 6)

    • Introduction to rasters
    • File types
      • .tif
      • .netcdf
    • The R package terra
    • Mapping rasters in R

Detailed outline (tentative)

  • Data extraction II (week 7)
    • Combining shapefiles and rasters; spatial joins
    • The R package exactextractr
  • Accessing geospatial data (week 8)
    • Where does the data come from?
    • Direct downloads
    • APIs
    • Google Earth Engine (less focus due to the use of Javascript)

Detailed outline (tentative)

  • Spatial interpolation (week 9)
    • Inverse distance weighting (IDW)
    • Kriging
  • In-class lab (week 10)

Grading

  • Homework - coding tasks (30%)
    • Two-to-three coding tasks throughout the semester
  • In-class labs (20%)
    • These will be done in groups
    • Two throughout the semester
  • Final exam (40%)
    • This will be a take-home exam focused on coding
  • Participation (10%)
    • I expect everyone to participate in class. That means asking questions, answering questions, and participating in discussions.

TA sections

  • The goal of TA sections is to help you with R and R Markdown

  • For help with the actual material, please come to my office hours

Class on October 21st

  • I will be in Bangkok for a UN workshop the week of the 21st of October

  • I have not decided what to do, but there are two possibilities:

    • Cancel class and make it up during reading week
    • Have an online class

Questions?

  • Any questions about the course?

Next up: R and RStudio!

  • We need to have R and RStudio installed for what’s next
    • Another code editor is also acceptable: VS Code, for example
  • Course website: https://github.com/JoshMerfeld/geospatialdataR
    • This link is also on e-KDIS
    • I strongly recommend you have this website open during class
      • We will sometimes use data from the website

Goal for the rest of class

  • The goal for today is to give you a brief introduction to R and R Markdown

  • We will be using two small datasets to get you familiar with the program

  • A note: if you are completely new to R, the first few weeks will be a slog

    • It will get better, I promise
  • Much of the material covered today comes from two (free!) sources:

What are R and RStudio?

  • R is a commonly used statistical program (and language)
    • It is free and open source, which means you can use this after graduation, without paying for it
    • R is CaSe SeNsItIvE
  • To work with R, we want to use an accompaniment called RStudio
    • RStudio is what is referred to as an integrated development environment (IDE)
    • It is not the only option (VS Code, for example), but it is the most common
    • It makes working with R much easier
  • Whenever you start R, you want to start RStudio
    • RStudio will start R for you

Some important considerations

  • One of our goals is to make reproducible research
    • This means that we want to be able to share our code and have others be able to replicate our results
    • To do this, we will use “scripts” that contain our code
  • A script should be self contained
    • This means that it should contain all of the code necessary to run the analysis
    • A well-written script should allow me to do everything without any additional information
    • Note that more complicated projects can have many scripts! For this class: one script per assignment
  • We will also use R Markdown to create documents
    • R Markdown is a way to combine text and code
    • This allows us to create documents that are reproducible
    • We will use R Markdown to create our homework assignments

The RStudio interface

The RStudio interface

The RStudio interface

The RStudio interface

But we’re missing something… what is it?

The script

Some notes

  • You can add comments to your script using a hashtag (#)
    • At the top of ALL my scripts, I have a comment that says what the script does.
    • At the top of your script, write a comment. It should say “# Week 1 - Introduction to R”
    • I put LOTS of comments in my scripts. This is good practice.
Code
# cleaning the gps data and creating some maps
# to run this, just set your working directory to the folder this script is located in
# Author: Josh Merfeld
# Initial date: September 5th, 2024
Code
# lasso --------------------------------------------
# we have ~60 features. This isn't that many, actually. We didn't create a lot of different possible combinations of the predictors.
# We also don't have any fixed effects. This is just to fix ideas. Nonetheless, let's try lasso!
# we use the glmnet package to implement lasso. It also allows ridge, but we want to make sure to use lasso.
# how do we do this? we want to allocate grid cells across different "folds".

Some notes

  • You can run a line of code by clicking the “Run” button

    • There are also shortcuts. On Mac it is command + enter. On windows it is control + enter. You can change these if you want.
  • You can run multiple lines of code by highlighting them and clicking the “Run” button (or the shortcut)

  • We will practice these later

R Basics

Object types

  • R has a few different types of objects
    • The most common are vectors, matrices, and data frames
      • A “tibble” is a type of data frame used by the tidyverse package (more below)
    • We will use data frames almost exclusively since we are working with datasets, but vectors are common, too
  • You can create a vector using the c() function:
    • Note how we create a new object using the assignment operator, <-. You can also use =.1
Code
vec <- c(1, 2, 3, 4)
vec
[1] 1 2 3 4

Object types

  • You can check what type of object something is by using the class() function
    • For example, if I want to check what type of object vec is, I would write class(vec)
    • Note that the output is “numeric”
    • This is because vec is a vector of numbers
Code
vec <- c(1, 2, 3, 4)
class(vec)
[1] "numeric"
  • If I want to check whether it is a vector, I can write is.vector(vec)
    • Note that the output is TRUE
Code
is.vector(vec)
[1] TRUE

First things first: the working directory

  • The working directory is the folder that R is currently working in
    • This is where R will look for files
    • This is where R will save files
    • This is where R will create files
  • You can always write out an entire file path, but this is tedious
    • More importantly, it makes your code less reproducible since the path is specific to YOUR computer
  • One nice thing about R is that the working directory will automatically be where you open the script from
    • Let’s try this. Save your script to a folder on your computer, then open the script from that folder.

First things first: the working directory

The working directory should be where you opened the file from. Check it like this:

Code
getwd()
[1] "/Users/Josh/Dropbox/KDIS/Classes/geospatialdataR"

R packages

  • R is a language that is built on packages
    • Packages are collections of functions that do specific things
    • R comes with a set of “base” packages that are installed automatically
  • We are going to use one package consistently, called the “tidyverse”
    • This consists of a set of packages that are designed to work together, with data cleaning in mind

R packages

The one exception to always using a script? I install packages in the CONSOLE. You can install packages like this:

Code
install.packages("tidyverse")
  • Note you MUST use quotes around the package name

Loading R packages in your script

We need to load any R packages we want to use at the very top of the script. You should have a comment on line one, so on line two write:

Code
library(tidyverse)

This will load the tidyverse package.

  • Note you do NOT need to use quotes around the package name

Loading data

  • Go to the class website and download the data for today.
    • Put it in your WORKING DIRECTORY (where the script is)
  • We will use the read_csv() function to load the data
    • This function is part of the tidyverse package
    • It will create a data frame
    • We need to NAME the object (data frame). As before, note the assignment operator (<-). You can actually use = though.
Code
library(tidyverse)

# read in the data
data <- read_csv("week1data/data.csv")

Objects in memory

The data frame should show up in the upper right hand corner of RStudio.

Objects in memory

Click on the arrow and it will show more information.

Objects in memory

  • The data frame is a matrix
    • Each row is an observation and each column is a variables
    • Think of what this would look like if you opened it in Excel or Stata. It’s the same.
  • We can also see the names of the columns like this:
Code
colnames(data)
 [1] "res_id"                 "ability"                "age"                   
 [4] "educyears"              "isfarmer"               "yearsfarming"          
 [7] "yearsmanagingfarm"      "outsidewage"            "worriedaboutcropprices"
[10] "worriedaboutcropyields"
  • This is the kind of thing I might do in the console since it’s not really required for the script.

Objects in memory

  • Here’s another handy quick-look functions
Code
glimpse(data)
Rows: 7,209
Columns: 10
$ res_id                 <dbl> 501, 502, 503, 504, 505, 506, 507, 508, 509, 51…
$ ability                <dbl> 74, 42, 67, 54, 57, 72, 51, 65, 54, 24, 24, 49,…
$ age                    <dbl> 83, 27, 49, 50, 70, 45, 58, 41, 45, 70, 24, 45,…
$ educyears              <chr> "16", "7", "7", "7", "4", "7", "7", "7", "7", N…
$ isfarmer               <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"…
$ yearsfarming           <dbl> 60, 17, 20, 15, 40, 15, 30, 20, 20, 60, 12, 30,…
$ yearsmanagingfarm      <dbl> 46, 17, 5, 10, 26, 15, 25, 15, 10, 50, 7, 25, 5…
$ outsidewage            <dbl> 3.000e+06, 1.000e+10, 6.000e+06, 1.000e+10, 1.0…
$ worriedaboutcropprices <chr> "Sometimes", "Not at all", "Sometimes", "Not at…
$ worriedaboutcropyields <chr> "Sometimes", "Sometimes", "Not at all", "Someti…

Objects in memory

  • And one more (“structure”)
Code
str(data)
spc_tbl_ [7,209 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ res_id                : num [1:7209] 501 502 503 504 505 506 507 508 509 510 ...
 $ ability               : num [1:7209] 74 42 67 54 57 72 51 65 54 24 ...
 $ age                   : num [1:7209] 83 27 49 50 70 45 58 41 45 70 ...
 $ educyears             : chr [1:7209] "16" "7" "7" "7" ...
 $ isfarmer              : chr [1:7209] "Yes" "Yes" "Yes" "Yes" ...
 $ yearsfarming          : num [1:7209] 60 17 20 15 40 15 30 20 20 60 ...
 $ yearsmanagingfarm     : num [1:7209] 46 17 5 10 26 15 25 15 10 50 ...
 $ outsidewage           : num [1:7209] 3e+06 1e+10 6e+06 1e+10 1e+10 ...
 $ worriedaboutcropprices: chr [1:7209] "Sometimes" "Not at all" "Sometimes" "Not at all" ...
 $ worriedaboutcropyields: chr [1:7209] "Sometimes" "Sometimes" "Not at all" "Sometimes" ...
 - attr(*, "spec")=
  .. cols(
  ..   res_id = col_double(),
  ..   ability = col_double(),
  ..   age = col_double(),
  ..   educyears = col_character(),
  ..   isfarmer = col_character(),
  ..   yearsfarming = col_double(),
  ..   yearsmanagingfarm = col_double(),
  ..   outsidewage = col_double(),
  ..   worriedaboutcropprices = col_character(),
  ..   worriedaboutcropyields = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Calling variables in R

  • Some of you might be used to Stata
  • One big difference between the two is that Stata generally only has one data frame in memory at a time
    • This means that you can call a variable without referencing the data frame
  • In R, if you want to look at a variable, you have to tell R which data frame it is in
    • This is done with the $ operator
    • For example, if I want to look at the variable “age” in the data frame “data”, I would write data$age
    • Let’s look at summary statistics for age:
Code
summary(data$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   34.00   42.00   43.54   52.00   87.00 

Summary statistics for the entire data frame

  • You can also use summary on the data frame instead of a single column
    • It helps to think of a data frame as rows and columns. For variables, you want to call specific columns.
  • Look at the difference here:
Code
summary(data)
     res_id        ability            age         educyears        
 Min.   : 501   Min.   : 10.00   Min.   :18.00   Length:7209       
 1st Qu.:2783   1st Qu.: 51.00   1st Qu.:34.00   Class :character  
 Median :4714   Median : 59.00   Median :42.00   Mode  :character  
 Mean   :4775   Mean   : 58.66   Mean   :43.54                     
 3rd Qu.:6764   3rd Qu.: 67.00   3rd Qu.:52.00                     
 Max.   :8955   Max.   :100.00   Max.   :87.00                     
                                                                   
   isfarmer          yearsfarming   yearsmanagingfarm  outsidewage       
 Length:7209        Min.   :-9.00   Min.   :-9.00     Min.   :2.000e+03  
 Class :character   1st Qu.:18.00   1st Qu.: 6.00     1st Qu.:3.500e+06  
 Mode  :character   Median :26.00   Median :14.00     Median :1.000e+10  
                    Mean   :28.02   Mean   :15.94     Mean   :7.156e+09  
                    3rd Qu.:38.00   3rd Qu.:22.00     3rd Qu.:1.000e+10  
                    Max.   :70.00   Max.   :70.00     Max.   :1.000e+10  
                    NA's   :219     NA's   :219       NA's   :216        
 worriedaboutcropprices worriedaboutcropyields
 Length:7209            Length:7209           
 Class :character       Class :character      
 Mode  :character       Mode  :character      
                                              
                                              
                                              
                                              

Calling rows/columns of a data frame (matrix)

  • Think about how we refer to rows and columns in a matrix.
    • We use the row and column number, in that order.
    • For example, if I want the first row and second column of a matrix \(X\), mathematically I could write \(X_{1,2}\)
  • We do the same thing in R
  • If I want the first row and second column of the data frame “data”, I would write data[1,2]
    • Note that we use square brackets instead of parentheses
    • Note that we use a comma to separate the row and column
Code
data[1,2]
# A tibble: 1 × 1
  ability
    <dbl>
1      74

Calling columns of a data frame (matrix)

  • We can call entire columns of a data frame by leaving the row blank
    • For example, if I want the second column of the data frame “data”, I would write data[,2]
    • Note that the second column is the ability variable
Code
colnames(data)
 [1] "res_id"                 "ability"                "age"                   
 [4] "educyears"              "isfarmer"               "yearsfarming"          
 [7] "yearsmanagingfarm"      "outsidewage"            "worriedaboutcropprices"
[10] "worriedaboutcropyields"
Code
data[,2]
# A tibble: 7,209 × 1
   ability
     <dbl>
 1      74
 2      42
 3      67
 4      54
 5      57
 6      72
 7      51
 8      65
 9      54
10      24
# ℹ 7,199 more rows

Missing variables R

  • Missing variables are denoted by NA
    • This is different from Stata, which uses a period (.)
  • Note that this is only how the PROGRAM stores missing variables. Sometimes the data itself has different missing values.
  • For example, take a look at the first ten rows of the data frame (also note how I call the first ten rows and leave out the first column!):
Code
data[1:10,-1]
# A tibble: 10 × 9
   ability   age educyears isfarmer yearsfarming yearsmanagingfarm outsidewage
     <dbl> <dbl> <chr>     <chr>           <dbl>             <dbl>       <dbl>
 1      74    83 16        Yes                60                46     3000000
 2      42    27 7         Yes                17                17  9999999999
 3      67    49 7         Yes                20                 5     6000000
 4      54    50 7         Yes                15                10  9999999999
 5      57    70 4         Yes                40                26  9999999999
 6      72    45 7         Yes                15                15      800000
 7      51    58 7         Yes                30                25     2000000
 8      65    41 7         Yes                20                15  9999999999
 9      54    45 7         Yes                20                10      300000
10      24    70 <NA>      Yes                60                50  9999999999
# ℹ 2 more variables: worriedaboutcropprices <chr>,
#   worriedaboutcropyields <chr>

Variable types

  • R also has a few different types of variables
    • The most common are numeric, character, and logical
  • Look at the previous code again:
Code
data[1:10,-1]
# A tibble: 10 × 9
   ability   age educyears isfarmer yearsfarming yearsmanagingfarm outsidewage
     <dbl> <dbl> <chr>     <chr>           <dbl>             <dbl>       <dbl>
 1      74    83 16        Yes                60                46     3000000
 2      42    27 7         Yes                17                17  9999999999
 3      67    49 7         Yes                20                 5     6000000
 4      54    50 7         Yes                15                10  9999999999
 5      57    70 4         Yes                40                26  9999999999
 6      72    45 7         Yes                15                15      800000
 7      51    58 7         Yes                30                25     2000000
 8      65    41 7         Yes                20                15  9999999999
 9      54    45 7         Yes                20                10      300000
10      24    70 <NA>      Yes                60                50  9999999999
# ℹ 2 more variables: worriedaboutcropprices <chr>,
#   worriedaboutcropyields <chr>

Variable types

  • dbl is short for double, which is a numeric variable (the “type” of numeric variable is about how much memory is needed to store it)
  • chr is short for character, which is a string of characters (text)
    • Surprisingly, in our previous example, educyears was a character string even though it seemed to be a number
    • Let’s look at the possible values of educyears using the unique() function, which outputs a vector:
Code
unique(data$educyears)
 [1] "16"            "7"             "4"             NA             
 [5] "11"            "6"             "13"            "5"            
 [9] "8"             "10"            "12"            "9"            
[13] "2"             "3"             "15"            "14"           
[17] "20"            "18"            "17"            "1"            
[21] "Not Mentioned" "19"           

Variable types

  • Interesting! It seems that there is a “Not Mentioned” value.
    • What if we want to replace those with missing, instead?
  • Let’s talk through the following code
    • First note how it refers to a specific column and then a specific row
    • Also note how it uses two equal signs (==) to check whether the value is “Not Mentioned”
      • This is similar to Stata!
Code
# replace "Not Mentioned" with NA
data$educyears[data$educyears=="Not Mentioned"] <- NA  
# check that it worked by looking at the unique values
unique(data$educyears)              
 [1] "16" "7"  "4"  NA   "11" "6"  "13" "5"  "8"  "10" "12" "9"  "2"  "3"  "15"
[16] "14" "20" "18" "17" "1"  "19"
Code
# turn into numeric
data$educyears <- as.numeric(data$educyears)
class(data$educyears)
[1] "numeric"

Pipes

  • One of the most useful things in R is the pipe operator (|>)
    • This is part of the tidyverse package
    • It allows you to chain commands together
    • It makes your code much easier to read
    • It makes your code much easier to write
    • It makes your code much easier to debug
    • It makes your code much easier to share
    • It makes your code much easier to reproduce
  • It’s easy to use but it will take some time for you to get used to the names of the functions we can use with it
    • This also goes for other tasks in R, not just with the pipe operator

Pipes example

Here is an example of how we can use pipes with the mutate() function in tidyverse

  • We are also going to use ifelse() to make this work
Code
data <- data |>
          mutate(educyears = ifelse(educyears == "Not Mentioned", NA, educyears), # if educyears=="Not Mentioned", replace
                educyears = as.numeric(educyears))    # replace educyears as numeric (instead of character)
summary(data$educyears)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   7.000   7.000   6.735   7.000  20.000    3113 

Note that we could wrap as.numeric() around the ifelse() command to do it on one line!

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears))) # wrapped into one line
summary(data$educyears)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   7.000   7.000   6.735   7.000  20.000    3113 

Missings and functions in R

In Stata, by default, functions ignore missing values

  • R does not do this by default. Look at this:
Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears))) # wrapped into one line
mean(data$educyears)
[1] NA

If there are any missing values, the function will evalute to missing!

  • But we can also do this:
Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears))) # wrapped into one line
mean(data$educyears, na.rm = TRUE) # BE CAREFUL WITH THIS! Make sure it is indeed what you want to do.
[1] 6.735107

Functions and storing values

The mean() function in the previous slide outputs a single value - That means we could store that value as an object:

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears))) # wrapped into one line
meaneduc <- mean(data$educyears, na.rm = TRUE)
sdeduc <- sd(data$educyears, na.rm = TRUE)
meaneduc
[1] 6.735107
Code
sdeduc
[1] 2.404086

How is this helpful? We can use these values later in our script!

Functions and mutate()

We can combine the mean() and sd() functions within mutate to create a new, standardized variable:

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears)), # wrapped into one line
                 educyears_std = (educyears - mean(educyears))/sd(educyears))
summary(data$educyears_std)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     NA      NA      NA     NaN      NA      NA    7209 

Oh no! what happened?

Functions and mutate()

We can combine the mean() and sd() functions within mutate to create a new, standardized variable:

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears)), # wrapped into one line
                 educyears_std = (educyears - mean(educyears, na.rm = T))/sd(educyears, na.rm = T))
summary(data$educyears_std)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-2.3856  0.1102  0.1102  0.0000  0.1102  5.5176    3113 

Note that we can shorten TRUE to T (or FALSE to F).

Visualizations with ggplot2

  • ggplot2 is a flexible way to create visualizations in R
  • The basic idea is that you create a plot object and then add layers to it
  • Let’s create a histogram of educyears

Visualizations with ggplot2

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears)))
# we call ggplot() and NOT ggplot2()
ggplot() +   # note how we use + here, NOT the pipe operator
  geom_histogram(data = data, aes(x = educyears)) # the histogram with geom_histogram
Code
# data = data tells R to use the data frame "data", and the aes() is the aesthetic
# only an x value here since a histogram uses just a SINGLE value

Visualizations with ggplot2

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears)))
# we can save the plot as an object
g1 <- ggplot() +
        geom_histogram(data = data, aes(x = educyears))
g1

Visualizations with ggplot2

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears)))
# lots of ways to change the plot
g1 <- ggplot() +
        geom_histogram(data = data, aes(x = educyears)) +
        labs(title = "Histogram of educyears",
             x = "Years of education",
             y = "Count")
g1

One more example

Code
data <- data |>
          mutate(educyears = as.numeric(ifelse(educyears == "Not Mentioned", NA, educyears)))
g1 <- ggplot() +
        geom_histogram(data = data, aes(x = educyears)) +
        labs(title = "Histogram of educyears",
             x = "Years of education",
             y = "Count") +
        theme_bw()
g1

Let’s try this with a NEW dataset

First install a new package that has a dataset we will use (you can do this in the console):

Code
install.packages("nycflights13")

Now let’s see:

Code
library(nycflights13)
glimpse(flights)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Dates with lubridate

There’s a nice package called lubridate that makes working with dates much easier.

Code
library(lubridate)
# create a date variable
flights$date <- as_date(paste0(flights$year, "-", flights$month, "-", flights$day))
head(flights$date)
[1] "2013-01-01" "2013-01-01" "2013-01-01" "2013-01-01" "2013-01-01"
[6] "2013-01-01"

Dates with lubridate

Departure time/arrival time is in the format HHMM (e.g., 1530 is 3:30pm). We can add this to the date

Code
flights$dep_time_new <- hm(paste0(flights$dep_time %/% 100, ":", flights$dep_time %% 100))
head(flights$dep_time_new, n = 20)
 [1] "5H 17M 0S" "5H 33M 0S" "5H 42M 0S" "5H 44M 0S" "5H 54M 0S" "5H 54M 0S"
 [7] "5H 55M 0S" "5H 57M 0S" "5H 57M 0S" "5H 58M 0S" "5H 58M 0S" "5H 58M 0S"
[13] "5H 58M 0S" "5H 58M 0S" "5H 59M 0S" "5H 59M 0S" "5H 59M 0S" "6H 0M 0S" 
[19] "6H 0M 0S"  "6H 1M 0S" 

One more example

Lubridate also lets us work with “periods”

Code
flights$dep_delay_new <- as.period(flights$dep_delay, unit = "minute")
# NOTE: You have to be very careful with taking means/medians, etc.
head(flights$dep_delay_new)
[1] "2M 0S"  "4M 0S"  "2M 0S"  "-1M 0S" "-6M 0S" "-4M 0S"

Let’s look at some new tidyverse functions

Let’s get the average departure delay by NYC airport:

Code
# Remember I said be careful with means of periods/durations! Using the original value here.
flights |> 
    group_by(origin) |> # this groups ROWS based on their origin value
    summarize(avg_dep_delay = mean(dep_delay, na.rm = T)) # this summarizes the data, creating means absed on the grouping!
# A tibble: 3 × 2
  origin avg_dep_delay
  <chr>          <dbl>
1 EWR             15.1
2 JFK             12.1
3 LGA             10.3

Note that this does not create a single value. Instead it creates a tibble (a data frame) summarizing the data by our grouping variable.

Let’s look at some new tidyverse functions

What if we want to save that tibble instead?

Code
summat <- flights |> 
            group_by(origin) |> # this groups ROWS based on their origin value
            summarize(avg_dep_delay = mean(dep_delay, na.rm = T)) # this summarizes the data, creating means based on groups!
summat # print the 3x2 matrix in the console
# A tibble: 3 × 2
  origin avg_dep_delay
  <chr>          <dbl>
1 EWR             15.1
2 JFK             12.1
3 LGA             10.3

I could then output this to a table if I wanted to (using Markdown, more on this later):

origin avg_dep_delay
EWR 15.10795
JFK 12.11216
LGA 10.34688

Let’s look at a new plot

How does departure delay vary by time of day?

Code
ggplot() + 
  geom_smooth(data = flights, aes(x = sched_dep_time, y = dep_delay))

Let’s look at a new plot

We can color code by origin, too!

Code
ggplot() + 
  geom_smooth(data = flights, aes(x = sched_dep_time, y = dep_delay, color = origin))

Make it prettier

Code
ggplot() + 
  geom_smooth(data = flights, aes(x = sched_dep_time, y = dep_delay, color = origin), se = FALSE) +
  labs(x = "Scheduled departure time",
       y = "Departure delay (minutes)") +
  theme_bw() + guides(color = guide_legend(title = "Departure airport"))

R Markdown

What is R Markdown?

  • R Markdown is a way to combine text and code

    • This allows us to create documents that are reproducible
    • We will use R Markdown to create our homework assignments
  • These slides were all created in R Markdown

  • My papers are written in R Markdown (well, some of them are, anyway)

    • Here is an example
  • Yihui Xie, J. J. Allaire, and Garrett Grolemund have an awesome – free! – resource on R Markdown,

Installing R Markdown

You’ll need to install LaTeX and R Markdown. You can do this in the console:

Code
# first do this
install.packages('tinytex')
# then do this
tinytex::install_tinytex()  # install TinyTeX
# finally do this
install.packages("rmarkdown")

Creating an R Markdown document in RStudio

Creating an R Markdown document in RStudio

Creating an R Markdown document in RStudio

Go ahead and save this document

  • Go ahead and save this document in your working directory.
    • One think about Markdown files is that it will ALWAYS set the working directory to where the file is saved whenever you “knit” the document.
  • What is “knitting”?
    • Knitting is the process of turning your R Markdown document into a pdf, html, or word document.
    • We will just focus on pdfs for now.
    • Note: I am using Quarto in an HTML format for these slides!

Knit it!

Check out the document you just created

  • Go to your working directory and open the pdf to see what it looks like.
    • It will always create the pdf in the same folder as the .Rmd file.

YAML header

  • At the very top of the document is some information about the document
    • This is called the YAML header
    • It tells R Markdown what kind of document to create
    • It also allows you to set some options
    • DO NOT DELETE THE --- AT THE TOP AND BOTTOM OF THE YAML HEADER!
  • You can change the title and date as you please
    • For today’s date, you can use Sys.Date() within R inline code (more in a second):
Code
date: "`r Sys.Date()`"

The setup chunk

  • Just below the YAML header you’ll see a “code chunk” called “setup” (r setup, include = FALSE)

  • Note how it has \(```\) and \(```\) at the top and bottom. This differentiates the “code chunk” from the rest of the document.

    • Whenever you want to add a code chunk, you must have the \(```\) at the top and bottom of it, at the beginning of the line.
  • Use the setup code chunk to load any packages or data that you want to use in the rest of the document.

    • Later code chunks are “local”: they will be able to access things from the setup chunk but not from other code chunks.

The setup chunk

This is an example of what the setup chunk looks like.


```{r setup, include=FALSE}
# universal chunk options.
# echo = TRUE will show the code in the document.
# echo = FALSE will not.
knitr::opts_chunk$set(echo = TRUE)  

# load any packages you want to use throughout the document.
library(tidyverse) 

# load any data you want to use throughout the document.
data <- read_csv("week1data/data.csv") 
```

Code chunks

Here is an example of a regular code chunk.

```{r chunkexample, include = TRUE}
# note that I named the chunk.
# all chunks must have a UNIQUE name.
# you will get an error if they don't

# I already loaded by data above
ggplot(flights) + 
  geom_histogram(aes(x = air_time), binwidth = 10)
```

Code chunks

Here is the output of that chunk:

Code
# note that I named the chunk.
# all chunks must have a UNIQUE name.
# you will get an error if they don't

# I already loaded by data above
ggplot(flights) + 
  geom_histogram(aes(x = air_time), binwidth = 10)

Code chunks

Oh no! It looks bad! Changes:

Code
# note that I named the chunk.
# all chunks must have a UNIQUE name.
# you will get an error if they don't

# I already loaded by data above
ggplot(flights) + 
  geom_histogram(aes(x = air_time), binwidth = 10)

Code chunks

How did I do that?

```{r chunkexample, include = TRUE, warning = FALSE, out.width = "55%", fig.align = "center"}
# note that I named the chunk.
# all chunks must have a UNIQUE name.
# you will get an error if they don't

# I already loaded by data above
ggplot(flights) + 
  geom_histogram(aes(x = air_time), binwidth = 10)
```
  • NOTE: The start of the chunk must be on ONE line. It is wrapped here just for presentation.

Code chunks options

  • There are lots of code chunk options.

  • You can find the code chunk options here (https://rpubs.com/Lingling912/870659)

  • This will get easier to use as you get more and more practice.

Starting new sections/subsections

# This will create a new section

## This will create a new sub-section

### This will create a new sub-sub-section
Don't do this.

You can add R inline code

Code
- You can add R inline code using the `r' operator.
  - If I want to add the date, I can write `r Sys.Date()`
  - There are `r ncol(flights)` columns in the flights data
  - There are `r nrow(flights)` rows in the flights data
  • You can add R inline code using the \(`r'\) operator.
    • If I want to add the date, I can write 2024-09-10
    • There are 22 columns in the flights data.
    • There are 336776 rows in the flights data.

Enumerated lists/bullets

Code
- I like lists.
  - With indentations.
    - Even two!
  • I like lists.
    • With indentations.
      • Even two!

Enumerated lists/bullets

Code
1. I don't use many numbered lists.
2. Do you?
    - Note this has two tabs!
  1. I don’t use many numbered lists.
  2. Do you?
    • Note this has two tabs!

latex

  • R Markdown uses LaTeX to create pdfs. This allows you to do some cool things.
- For example, it is easy to add equations with latex, using $:

$y = x + \varepsilon$
  • For example, it is easy to add equations with latex, using $:

\(y = x + \varepsilon\)

LaTeX

- I can center it, too:

$$y = x + \varepsilon$$ 
  • I can center it, too:

\[y = x + \varepsilon\]

LaTeX

  • LaTeX is particularly helpful for rendering math

  • You can find a handy reference guide here (https://icl.utk.edu/~mgates3/docs/latex.pdf)

Creating tables

  • There are lots of ways to create tables in R Markdown.
    • I will show you one way using the kable() function in the knitr package.
    • You do not need to download this package, it is already installed with R Markdown.
      • There is extra functionality in the kableExtra package. You need to download this and load it if you want to use it.

Creating tables

Code
summat <- flights |> 
  # this groups ROWS based on their origin value
  group_by(origin) |> 
  # create means by group
  summarize(avg_dep_delay = mean(dep_delay, na.rm = T),
            avg_arr_delay = mean(arr_delay, na.rm = T),
            avg_air_time = mean(air_time, na.rm = T),
            flights = n())
# output 
kable(summat, align = "cccc")
origin avg_dep_delay avg_arr_delay avg_air_time flights
EWR 15.10795 9.107055 153.3000 120835
JFK 12.11216 5.551481 178.3490 111279
LGA 10.34688 5.783488 117.8258 104662

I don’t like that at all! Let’s make it pretty.

Code
summat <- flights |> 
            # this groups ROWS based on their origin value
            group_by(origin) |> 
            # create means by group, ROUNDING to two decimal places
            summarize(avg_dep_delay = round(mean(dep_delay, na.rm = T), 2),
                      avg_arr_delay = round(mean(arr_delay, na.rm = T), 2),
                      avg_air_time = round(mean(air_time, na.rm = T), 2),
                      flights = n())
# rename columns
colnames(summat) <- c("Origin", "Departure Delay", "Arrival Delay", "Flight Time", "Flights")
# output 
kable(summat, caption = "Averages by origin (minutes)",
      align = "ccc", linesep = "", digits = 2, format.args = list(big.mark = ","),
      booktabs = TRUE) |> # this is from kablextra. You don't have to use it, but I like it.
  # all of the below are also from kablextra
  column_spec(1, width = "2cm") |>
  column_spec(2:5, width = "3.5cm") |>
  kable_styling(full_width = F) |>
  kable_classic_2() 
Averages by origin (minutes)
Origin Departure Delay Arrival Delay Flight Time Flights
EWR 15.11 9.11 153.30 120,835
JFK 12.11 5.55 178.35 111,279
LGA 10.35 5.78 117.83 104,662

Enough for now

  • That’s enough on tables for now

  • As you can see, there are lots of ways to customize tables

  • Where this becomes really powerful is when you combine it with R code to create tables dynamically

    • I will teach you to use a package called fixest that helps automate some of this
    • If you change your specification, your tables will update AUTOMATICALLY!
    • Ever tried to manually change a table in Word? Never again.

Some tips

  • When I write a paper in Markdown, I generally do not do all of my analysis in the Markdown document

  • Instead, I do the analysis in another script and then save the resulting tables

  • I then load these tables in the setup chunk of my Markdown document and use them in the document

    • For figures, it depends. For a simple summary figure, I might load the data in the Markdown document and create the figure there.

Try it!

  • Try to knit an R Markdown document
    • Include one table and one figure
  • I’ll walk around and help out in a few minutes

First assignment

  • Assignment for next week (due two weeks from today):
  • Before next class, you will turn in on e-KDIS:
    • R script (if there is one)
    • R Markdown script
    • pdf of the R Markdown document