Data Science for Social Impact

Intro to R

Raymond Wieser

RStudio

The RStudio Interface

  • The goal of this lab is to introduce you to R and RStudio,

  • which we’ll be using throughout the course both

    • to learn the statistical concepts discussed in the course
    • and to analyze real data and come to informed conclusions.
  • To clarify which is which:

    • R is the name of the programming language itself
    • and RStudio is an integrated development environment (IDE).

The RStudio Interface

As the labs progress,

  • you are encouraged to explore beyond what the labs dictate;
    • a willingness to experiment will make you a much better programmer.

Before we get to that stage, however,

  • you need to build some basic fluency in R.

The RStudio Interface

Today we begin with the fundamental building blocks of R and RStudio:

  • the interface,
  • reading in data,
  • and basic commands.

The RStudio Interface Cont.

Go ahead and launch RStudio.

You should see a window that looks like the image shown below.

RStudio Panels

The panel on the lower left is where the action happens.

  • It’s called the console
    • On startup it will display information about the version of R you are running
    • We can directly type code into this panel
    • It can also be used as a calculator!
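
As a small illustration of the calculator idea, lines like the following could be typed at the console prompt. The numbers and the variable name `total` are arbitrary choices for this sketch.

```r
# Arithmetic typed at the console is evaluated immediately
2 + 3          # addition
7 / 2          # division returns a decimal: 3.5
2^10           # exponentiation: 1024
sqrt(16)       # built-in math functions work too: 4

# Store a result in a variable (hypothetical name) and reuse it
total <- 2 + 3
total * 4      # 20
```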

RStudio Panels

The panel in the upper right

  • contains your environment
  • as well as a history of the commands
    • that you’ve previously entered.

Any plots that you generate will show up

  • in the panel in the lower right corner.
  • This is also where you can
  • browse your files,
  • access help,
  • manage packages, etc.

A Simple Overview of R

Intro to some R: Data Types

  • Primitives (numeric, integer, character, logical) and factors (categorical data stored as integer codes)
  • Data Frames
  • Lists
  • Tables
  • Arrays
  • Environments
  • Others (functions, closures, promises, ...)

Simple Types, and their class

  x <- 1
  class(x)
## [1] "numeric"
   
  y <- "Hello World"
  class(y)
## [1] "character"
   
  z <- TRUE
  class(z)
## [1] "logical"
   
  as.integer(z)
## [1] 1
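
The integer and factor types from the list of data types do not appear in the examples above. A minimal sketch of how they behave, using made-up values (`n` and `f` are illustrative names):

```r
# Numbers are stored as doubles ("numeric") unless marked with an L suffix
n <- 1L
class(n)          # "integer"
is.integer(1)     # FALSE: a plain 1 is a double

# A factor stores categories as integer codes plus text labels
f <- factor(c("low", "high", "low"))
class(f)          # "factor"
levels(f)         # "high" "low" (levels are sorted alphabetically)
as.integer(f)     # 2 1 2 (the underlying integer codes)
```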

Simple Types - Vectors

  • The basic data type in R is a vector
  x <- c(1,2,3)
  x
## [1] 1 2 3
  x <- 1:3
  x[1]
## [1] 1
  x[0]
## integer(0)
  x[-1]
## [1] 2 3
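
The same bracket notation also accepts a vector of positions. A short sketch with a hypothetical vector `v`, showing that indexing starts at 1 and that asking for a position past the end returns `NA` rather than an error:

```r
v <- c(10, 20, 30, 40, 50)   # hypothetical example vector

v[c(1, 3)]     # pick positions 1 and 3: 10 30
v[2:4]         # a range of positions: 20 30 40
v[-c(1, 2)]    # drop positions 1 and 2: 30 40 50
v[7]           # past the end: NA
length(v)      # number of elements: 5
```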

Generating Vectors

  • R provides lots of convenience functions for data generation:
  rep(0, 5)
## [1] 0 0 0 0 0
  seq(1,10)
##  [1]  1  2  3  4  5  6  7  8  9 10
  seq(1,2,.1)
##  [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
  seq(1,2,length.out = 6)
## [1] 1.0 1.2 1.4 1.6 1.8 2.0

Indexing, c is concatenate

  • to see the help on c()
    • type help(c)
  x <- c(1, 3, 4, 10, 15, 20, 50, 1, 6)
  x > 10
## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
  which(x > 10)
## [1] 5 6 7
  x[x > 10]
## [1] 15 20 50
  x[!(x > 10)]
## [1]  1  3  4 10  1  6
  x[x <= 10]
## [1]  1  3  4 10  1  6
  x[x > 10 & x < 30]
## [1] 15 20
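
Logical conditions can be combined in a few more ways. The sketch below reuses the same `x` and adds the "or" operator `|`, the membership operator `%in%`, and counting with `sum()` and `mean()`, which treat TRUE as 1 and FALSE as 0:

```r
x <- c(1, 3, 4, 10, 15, 20, 50, 1, 6)

x[x < 3 | x > 30]         # either condition holds: 1 50 1
x[x %in% c(1, 10, 100)]   # elements found in a given set: 1 10 1
sum(x > 10)               # how many elements exceed 10: 3
mean(x > 10)              # proportion exceeding 10: 3 out of 9
```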

Functions

Functions

  • A common pattern: take code that works in a script and wrap it in a function
  square <- function(x) x^2
  square(2)
## [1] 4
   
  pow <- function(x, p=2) x^p
  pow(10)
## [1] 100
  pow(10,3)
## [1] 1000
  pow(p = 3,10)
## [1] 1000

Functions Inputs

  • Functions can be passed as data:
  g <- function(x, f) f(x)
  g(10, square)
## [1] 100
   
  h <- function(x,f,...) f(x,...)
  h(10, pow, 3)
## [1] 1000
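
Passing functions as arguments is how base R's apply family works. A minimal sketch using `sapply()`, which calls a function on each element of a vector; the anonymous function in the last line is just an illustration:

```r
square <- function(x) x^2

# Apply a named function to each element
sapply(1:5, square)              # 1 4 9 16 25

# Built-in functions can be passed the same way
sapply(c(1, 4, 9), sqrt)         # 1 2 3

# An anonymous function can be written inline
sapply(1:5, function(v) v + 1)   # 2 3 4 5 6
```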

R is Vectorized

  • Example - multiplying two vectors:
  mult <- function(x, y) {
    z <- numeric(length(x))
    # seq_along(x) is safe even when x is empty, unlike 1:length(x)
    for (i in seq_along(x)) {
      z[i] <- x[i] * y[i]
    }
    z
  }
   
  mult(1:10,1:10)
 [1]   1   4   9  16  25  36  49  64  81 100

R is Vectorized

  • Multiplying two vectors ‘the R way’:
  1:10 * 1:10
 [1]   1   4   9  16  25  36  49  64  81 100
  • NOTE: R recycles vectors of unequal length:
  1:10 * 1:2
 [1]  1  4  3  8  5 12  7 16  9 20
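
Recycling is silent when the longer length is a multiple of the shorter one, and produces a warning otherwise. The sketch below captures that warning with `tryCatch()` so its message can be inspected; `res` is an illustrative name:

```r
# Lengths 6 and 2: the shorter vector is reused three times, no warning
1:6 * 1:2      # 1 4 3 8 5 12

# Lengths 6 and 4: R still recycles, but warns that 6 is not a multiple of 4.
# tryCatch() intercepts the warning and returns its message instead.
res <- tryCatch(1:6 * 1:4, warning = function(w) conditionMessage(w))
res
```

Unintended recycling is a common source of bugs, so it is worth checking vector lengths when a result looks odd.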

Random Number Generation

  • R contains a large number of
    • built-in random number generators
    • for many probability distributions
      • for example rnorm (normal), runif (uniform) and rbinom (binomial)
# Normal variates, mean=0, sd=1
  rnorm(10)
 [1]  0.2289510 -0.4378631  0.4570428  0.1053231  2.0986755 -0.2465272
 [7] -0.6211719  0.7882495  0.2044265  1.2244146
  rnorm(10, mean = 100, sd = 5)
 [1]  93.62562  94.52929 104.89869 101.94875 100.53409  97.19305  90.50357
 [8] 109.88282 104.47033 102.89821
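
Because the draws are random, the numbers above will differ on every run. A short sketch of how `set.seed()` makes a sequence of draws repeatable; the seed value 42 is arbitrary:

```r
# Fixing the seed makes the "random" draws repeatable
set.seed(42)
a <- rnorm(5)

set.seed(42)
b <- rnorm(5)

identical(a, b)   # TRUE: same seed, same draws

# Other generators follow the same naming pattern
runif(3)                          # uniform on [0, 1]
rbinom(3, size = 10, prob = 0.5)  # number of heads in 10 fair tosses
```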

Dataframes

Data Frames

Data Frames are fundamental

  • Data frames are the fundamental structure
    • used in data analysis
    • Similar to a database table in spirit
    • (named columns, distinct types)
  d <- data.frame(x = 1:6, y = "AUDUSD", z = c("one","two"))
  d
  x      y   z
1 1 AUDUSD one
2 2 AUDUSD two
3 3 AUDUSD one
4 4 AUDUSD two
5 5 AUDUSD one
6 6 AUDUSD two

Data Frames can be indexed

  • Data frames can be indexed like a vector or matrix:
  # First row
  d[1,]
##   x      y   z
## 1 1 AUDUSD one
   
  # First column
  d[,1]
## [1] 1 2 3 4 5 6
   
  # First and third cols, first two rows
  d[1:2,c(1,3)]
##   x   z
## 1 1 one
## 2 2 two
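
Columns can also be selected by name, and rows by a logical condition, in the same way vectors were filtered earlier. A sketch using the same `d`:

```r
d <- data.frame(x = 1:6, y = "AUDUSD", z = c("one", "two"))

# A single column by name, returned as a vector
d$x
d[["z"]]

# Rows where a condition holds (note the trailing comma: all columns)
d[d$x > 4, ]

# Rows matching a condition, restricted to named columns
d[d$z == "one", c("x", "z")]
```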

Generate a Data Frame

  • Let’s generate some dummy data:
    • Using data.frame
  generateData <- function(N) data.frame(time = Sys.time() + 1:N, 
    sym = "AUDUSD", 
    bid = rep(1.2345,N) + runif(min = -.0010,max = .0010,N),
    ask = rep(1.2356,N) + runif(min = -.0010,max = .0010,N),
    exch = sample(c("EBS","RTM","CNX"),N, replace = TRUE)) 
   
  prices <- generateData(50)
  head(prices, 5)
                 time    sym      bid      ask exch
1 2023-09-07 16:03:45 AUDUSD 1.233667 1.235822  EBS
2 2023-09-07 16:03:46 AUDUSD 1.234914 1.236369  CNX
3 2023-09-07 16:03:47 AUDUSD 1.233837 1.236454  CNX
4 2023-09-07 16:03:48 AUDUSD 1.234259 1.234837  EBS
5 2023-09-07 16:03:49 AUDUSD 1.234900 1.234914  CNX

Data Frames

  • We can add/remove columns on the fly:
  prices$spread <- prices$ask - prices$bid
  prices$mid <- (prices$bid + prices$ask) * 0.5
  head(prices)
                 time    sym      bid      ask exch       spread      mid
1 2023-09-07 16:03:45 AUDUSD 1.233667 1.235822  EBS 2.155507e-03 1.234744
2 2023-09-07 16:03:46 AUDUSD 1.234914 1.236369  CNX 1.454971e-03 1.235641
3 2023-09-07 16:03:47 AUDUSD 1.233837 1.236454  CNX 2.616981e-03 1.235146
4 2023-09-07 16:03:48 AUDUSD 1.234259 1.234837  EBS 5.787885e-04 1.234548
5 2023-09-07 16:03:49 AUDUSD 1.234900 1.234914  CNX 1.417177e-05 1.234907
6 2023-09-07 16:03:50 AUDUSD 1.234711 1.235277  EBS 5.656128e-04 1.234994
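
The example above only adds columns. A minimal sketch of removing them, using a small stand-in data frame (`toy` and its values are made up for illustration):

```r
# A small stand-in for the prices data
toy <- data.frame(bid = c(1.2340, 1.2350), ask = c(1.2355, 1.2360))
toy$spread <- toy$ask - toy$bid

# Assigning NULL removes a column
toy$spread <- NULL
names(toy)     # "bid" "ask"

# Alternatively, keep every column except the ones named
toy <- toy[, setdiff(names(toy), "ask"), drop = FALSE]
names(toy)     # "bid"
```

The `drop = FALSE` argument keeps the result a data frame even when only one column remains.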

Operations on Data Frames

  • Some basic operations on data frames:
  names(prices)
## [1] "time"   "sym"    "bid"    "ask"    "exch"   "spread" "mid"
   
  table(prices$exch)
## 
## CNX EBS RTM 
##  20  21   9
   
  summary(prices$mid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.234   1.235   1.235   1.235   1.235   1.236
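
A common next step is summarising one column within groups defined by another. The sketch below uses base R's `tapply()` and `aggregate()` on a small made-up data frame (`toy`) rather than the randomly generated prices:

```r
# Made-up data: an exchange label and a mid price for six quotes
toy <- data.frame(
  exch = c("EBS", "CNX", "EBS", "RTM", "CNX", "EBS"),
  mid  = c(1.2340, 1.2350, 1.2344, 1.2348, 1.2352, 1.2342)
)

# Mean mid price per exchange, returned as a named vector
by_exch <- tapply(toy$mid, toy$exch, mean)
by_exch

# The same summary returned as a data frame
aggregate(mid ~ exch, data = toy, FUN = mean)
```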

Operators Are Functions

  `(`
.Primitive("(")
  (1 + 2)
[1] 3
  `(` <- function(x) 42
  (1 + 2)
[1] 42
  rm("(")

Examples in R

Example: Median Absolute Deviation

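\[\mathrm{MAD}(x) = \mathrm{median}\left(\left|x_i - \mathrm{median}(x)\right|\right)\]

R's built-in mad() implements this, multiplying the result by constant = 1.4826 by default. Printing mad at the console shows its source: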

function (x, center = median(x), constant = 1.4826, na.rm = FALSE, 
    low = FALSE, high = FALSE) 
{
    if (na.rm) 
        x <- x[!is.na(x)]
    n <- length(x)
    constant * if ((low || high) && n%%2 == 0) {
        if (low && high) 
            stop("'low' and 'high' cannot be both TRUE")
        n2 <- n%/%2 + as.integer(high)
        sort(abs(x - center), partial = n2)[n2]
    }
    else median(abs(x - center))
}
<bytecode: 0x55ab66b63b70>
<environment: namespace:stats>
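
To connect the formula with the function, the sketch below computes the median absolute deviation by hand and compares it with `mad()`. The data vector is made up and contains one deliberate outlier:

```r
x <- c(1, 2, 3, 4, 100)   # made-up data with one outlier

# By hand: median of absolute deviations from the median
raw_mad <- median(abs(x - median(x)))
raw_mad                   # 1

# mad() with constant = 1 gives the same unscaled value
mad(x, constant = 1)      # 1

# The default scales by 1.4826
mad(x)                    # 1.4826

# The standard deviation is pulled up far more by the outlier
sd(x)
```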

Example: Simulating Coin Tosses

  • What is the probability of
    • exactly 3 heads in 10 coin tosses
    • for a fair coin?

Using the binomial probability formula:

\[\binom{n}{k}p^{k}(1-p)^{n-k} = \binom{10}{3}\left(\frac{1}{2}\right)^{3}\left(\frac{1}{2}\right)^{7}\]
  
  choose(10,3)*(.5)^3*(.5)^7
[1] 0.1171875

Using binomial distribution density function:

 dbinom(prob = 0.5, size = 10, x = 3)
[1] 0.1171875

Using simulation (100,000 repetitions of 10 tosses each):

  sum(replicate(100000,sum(rbinom(prob = 1/2, size = 10, 1)) == 3))/100000
[1] 0.11733
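
The simulation can also be written in vectorized form, since `rbinom()` can generate all repetitions in one call. This is a sketch of an equivalent approach, not the code used above; the seed and variable names are arbitrary:

```r
set.seed(123)             # arbitrary seed so the run is repeatable
n_experiments <- 100000

# Each element is the number of heads in one experiment of 10 fair tosses
heads <- rbinom(n_experiments, size = 10, prob = 0.5)

# Proportion of experiments with exactly 3 heads
est <- mean(heads == 3)

# Compare with the exact probability
exact <- dbinom(3, size = 10, prob = 0.5)
c(estimate = est, exact = exact)
```

The estimate should land close to the exact value of 0.1171875, with the gap shrinking as the number of experiments grows.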

Example: Random Walk

  • Generate 10,000 up-down movements based on a fair coin toss and plot:
  x <- (cumsum(ifelse(rbinom(prob = 0.5, size = 1, 10000) == 0,-1,1)))
  plot(x, type = 'l', main = 'Random Walk')

Example: Generating Random Data

  randomWalk <- function(N)(cumsum(ifelse(rbinom(prob = 0.5, size = 1, N) == 0,-1,1)))
  AUDUSD <- 1.2345 + randomWalk(1000)*.0001
  
  plot(AUDUSD, type = 'l')
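
An alternative way to write the same random walk is to sample the steps directly with `sample()` instead of converting binomial draws. This is a sketch of an equivalent approach; the seed and the name `walk` are arbitrary:

```r
set.seed(1)   # arbitrary seed so the run is repeatable

# Draw 1000 steps, each -1 or +1 with equal probability
steps <- sample(c(-1, 1), size = 1000, replace = TRUE)

# The walk is the running total of the steps
walk <- cumsum(steps)

# Every move is exactly one unit up or down
all(abs(diff(walk)) == 1)   # TRUE

plot(walk, type = "l", main = "Random Walk")
```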

Software Libraries

R Packages

  • R is an open-source programming language,

  • meaning that users can contribute packages that make our lives easier,

  • and we can use them for free.

For this lab, and many others in the future,

  • we will use the following R packages:
    • The suite of tidyverse packages: for data wrangling and data visualization
    • openintro: for data and custom functions with the OpenIntro resources

Installing Packages

If these packages were not already available in your R environment,

  • then you would install them by typing the following two lines of code (without the leading #)
  • into the console of your RStudio session,
  • pressing the enter/return key after each one.
  • Note that you can check to see
  • which packages (and which versions) are installed
  • by inspecting the Packages tab
  • in the lower right panel of RStudio.
# install.packages("tidyverse")
# install.packages("openintro")

Loading Packages

You may be asked to select a server from which to download;

  • any of them will work.

Next, you need to load these packages

  • in your working “R environment”.
  • We do this with the library function.

Run the following two lines in your console.

library(tidyverse)
library(openintro)

You only need to install packages once,

  • but you need to load them each time you relaunch RStudio.
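
One way to check from code whether the packages are already installed is `requireNamespace()`, which returns TRUE or FALSE instead of stopping with an error. A minimal sketch, with `pkgs` and `installed` as illustrative names:

```r
pkgs <- c("tidyverse", "openintro")

# TRUE for each package that is installed, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Install only what is missing (left commented out so nothing is downloaded here)
# install.packages(pkgs[!installed])
```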

The Tidyverse

The Tidyverse packages

  • share common philosophies
  • and are designed to work together.

You can find more about the packages in the tidyverse

Reporting

Creating a reproducible lab report

RMarkdown

Going forward you should refrain

  • from typing your code directly in the console,
  • and instead type any code
  • (final correct answer, or anything you’re just trying out)
    • in the R Markdown file
    • and run the chunk using either
      • the Run button on the chunk (green sideways triangle)
      • or by highlighting the code and clicking Run
        • on the top right corner of the R Markdown editor.

If at any point you need to start over,

  • you can Run All Chunks above the chunk you’re working in
  • by clicking on the down arrow in the code chunk.

Additional Resources

Resources for learning R and working in RStudio

  • That was a short introduction to R and RStudio,

    • but we will provide you with more functions
    • and a more complete sense of the language as the course progresses.

In this course we will be using the suite of R packages from the tidyverse.

The book R for Data Science by Grolemund and Wickham

  • is a fantastic resource for data analysis in R with the tidyverse.

Online Help

If you are googling for R code,

  • make sure to also include these package names in your search query.
  • For example, instead of googling “scatterplot in R”,
    • google “scatterplot in R with the tidyverse”.

These cheatsheets may come in handy throughout the semester:

Glossary

RStudio

  • Console: interface for R language
  • Prompt: Area to type individual lines of code
  • Environment: Collection of all of the objects that have been loaded into R

Introduction to R

  • Primitives: the basic types of data in R, including numeric, integer, character and logical; factors are built on top of integers
  • Vectors: the basic data structure in R; an ordered sequence of elements of the same type
  • Lists: an ordered collection of elements that may be of different types, including vectors and other lists
  • Dataframes: A way of organizing measurements into a coherent structure, organized by columns and rows
  • Functions: a reusable set of operations applied to the inputs it is given
  • Vectorized: describes operations that act on whole vectors element by element, without an explicit loop

Software Libraries

  • Package: a collection of functions used for a specific purpose
  • Open Source: an approach to programming that allows for users to read and change the underlying software
  • Library: the location where installed packages are stored; the library() function loads a package from it for use in the current session

Reporting

  • Reproducible: describes an analysis whose code and data are shared so that others can rerun it and check its conclusions
  • Markdown: a lightweight markup language for formatting text; R Markdown combines it with R code to produce reports