Data Science for Social Impact

Introduction to Python

Thomas Ciardi

Python Basics

  • Python is a high-level, general-purpose programming language [1]

  • It is popular due to its easy syntax, large community, and extensive set of libraries

  • It is used in data science, machine learning, web development, application development, data visualization, and more

Basic Data Types

  • Python's built-in data types include integers, booleans, floating-point numbers, and strings.
  • A summary of the data types is shown in the table below:

  Category   Data Type        Example
  Number     Integer          x = 4
             Long integer     x = 15L (Python 2 only; Python 3 has a single int type)
             Floating point   x = 3.142
             Boolean          x = True
  Text       Character        x = 'c' (a one-character string; Python has no separate character type)
             String           x = "hello" or x = 'hello'
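
As a quick check of the table, the sketch below uses the built-in type() function to inspect each example. It assumes Python 3, where the 15L long-integer literal is no longer valid and a single int type handles arbitrarily large values. The dictionary and its labels are illustrative names only.

```python
# Inspect the type of each example value (assumes Python 3).
examples = {
    "integer": 4,
    "large integer": 2 ** 100,   # Python 3 ints grow as needed; no L suffix
    "floating point": 3.142,
    "boolean": True,
    "character": "c",            # a one-character string, not a separate type
    "string": "hello",
}

# Map each label to the name of its Python type.
type_names = {label: type(value).__name__ for label, value in examples.items()}

for label, name in type_names.items():
    print(f"{label}: {name}")
```

Both the small and the very large integer report int, and the single character reports str.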

Basic Operations

We can set a variable to a value and check the data type:

# set value of variable x
x = 4

# check variable type of x
type(x)
<class 'int'>

Variables of the same type can have operators applied:

# set value of variable y
y = 4

# add x and y
x + y
8
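
What an operator does depends on the types of its operands. The following sketch (variable names are illustrative) shows that an integer and a float can be added, that + concatenates two strings, and that adding a string to an integer raises a TypeError unless the integer is converted first.

```python
x = 4

# int + float is allowed; the result is a float
mixed_sum = x + 2.5

# + on two strings concatenates them
greeting = "hello" + " " + "world"

# str + int is not defined, so Python raises TypeError
try:
    result = "total: " + x
except TypeError:
    # convert the integer to a string explicitly instead
    result = "total: " + str(x)

print(mixed_sum)   # 6.5
print(greeting)    # hello world
print(result)      # total: 4
```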

Control Flow Statements

We can also use standard conditional statements (if/else).

# set value of x
x = 10

# check if x is divisible by 2; print "x is even" if true, otherwise "x is odd"
if x % 2 == 0:
    print("x is even")
else:
    print("x is odd")
x is even
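
The same if/else pattern extends to more than two branches with elif, and a for loop applies it to several values in turn. A minimal sketch, with the input values and the labels list chosen here purely for illustration:

```python
# classify several values, one per loop iteration
labels = []
for x in [0, 7, 10]:
    if x == 0:
        labels.append("zero")
    elif x % 2 == 0:
        labels.append("even")
    else:
        labels.append("odd")

print(labels)  # ['zero', 'odd', 'even']
```

The branches are tested in order, so 0 is labeled "zero" even though it is also divisible by 2.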

User Defined Functions

We can combine various operations and logic in callable functions.

# create a function that accepts a variable and returns it multiplied by 4
def func(x):
    return x * 4

# call the function, passing in 2
func(2)
8
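
Functions can also take more than one parameter and give some of them default values. The sketch below generalizes the function above; the name scale and the parameter factor are hypothetical and introduced only for this example.

```python
def scale(x, factor=4):
    """Return x multiplied by factor (4 unless another factor is given)."""
    return x * factor

print(scale(2))             # 8, same result as func(2)
print(scale(2, factor=10))  # 20
```

Calling scale(2) without a second argument falls back to the default factor of 4.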

Python vs R

               Python                                                                      R
Overview       High-level, general-purpose programming language                            Language for statistical computing and graphics
Advantages     - Production ready (e.g., deploying a model into a website or application)  - Superior time series analysis and statistical libraries
               - Superior natural language processing libraries                            - CRAN has a more comprehensive screening process
Disadvantages  - Less intuitive/clean visualizations                                       - Less readable
               - Fewer alternative packages                                                - Lacks robust image analysis libraries

Variable Assignment

There are functional and syntax differences between R and Python.

For example, assigning a variable in Python uses =, while R conventionally uses <-

# set value of py_apples in Python
py_apples = 5

# show value of py_apples
py_apples
5

# set value of r_apples in R
r_apples <- 5

# show value of r_apples
r_apples
[1] 5

Lists and Indexing

Python indexes from 0 while R indexes from 1

# create list in Python
py_list = [1, 'two', 3.0]

# show index 1 of list
py_list[1]
'two'

# create list in R
r_list <- list(1, 'two', 3.0)

# show index 1 of list
r_list[1]
[[1]]
[1] 1
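
Because Python counts from 0, the last valid index of a three-element list is 2. The sketch below reuses py_list to show the first element, negative indexing from the end, slicing, and the error raised by an out-of-range index. All variable names other than py_list are illustrative.

```python
py_list = [1, 'two', 3.0]

first = py_list[0]     # the first element sits at index 0
last = py_list[-1]     # negative indices count back from the end
tail = py_list[1:]     # a slice from index 1 to the end of the list

# index 3 does not exist in a three-element list
try:
    py_list[3]
    out_of_range = False
except IndexError:
    out_of_range = True

print(first)         # 1
print(last)          # 3.0
print(tail)          # ['two', 3.0]
print(out_of_range)  # True
```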

Vectors

R has a native vector type, created with c(). Python has no native vector type, so it relies on the NumPy library's arrays or on its built-in list data type.

import numpy as np

# create array in Python
py_vector = np.array([0, 1, 2])

# return index 1 of array
py_vector[1]
1

# create vector in R
r_vector <- c(0, 1, 2)

# return index 1 of vector
r_vector[1]
[1] 0
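
The practical difference between a Python list and a NumPy array shows up in arithmetic. In this sketch, multiplying a list repeats it, while multiplying an array works element by element, much as arithmetic on an R vector does. It assumes NumPy is installed; .tolist() is used only so that plain Python numbers are printed.

```python
import numpy as np

py_list = [0, 1, 2]
py_vector = np.array(py_list)

# multiplying a list repeats its contents
repeated = py_list * 2

# multiplying or adding to an array applies to every element
doubled = (py_vector * 2).tolist()
shifted = (py_vector + 10).tolist()

print(repeated)  # [0, 1, 2, 0, 1, 2]
print(doubled)   # [0, 2, 4]
print(shifted)   # [10, 11, 12]
```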

DataFrames

Python uses the pandas library for data frames, while R has a native data frame type.

# import pandas library
import pandas as pd

# create a dataframe from the py_vector array
pd.DataFrame(py_vector)
   0
0  0
1  1
2  2

# create a dataframe from the r_vector vector
data.frame(r_vector)
  r_vector
1        0
2        1
3        2
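
In the pandas output, the column is labeled 0 because no name was given, whereas R names the column after the variable. A sketch of one way to get a named column, assuming pandas and NumPy are installed; the column name is chosen here only to mirror the R output.

```python
import numpy as np
import pandas as pd

py_vector = np.array([0, 1, 2])

# pass a dict so the key becomes the column name instead of the default 0
df = pd.DataFrame({"py_vector": py_vector})

print(df)
print(df.shape)          # (3, 1)
print(list(df.columns))  # ['py_vector']
print(list(df.index))    # [0, 1, 2]
```

The row labels still start at 0 in pandas, while the R data frame numbers its rows from 1.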

Reticulate Package

“The reticulate package provides a comprehensive set of tools for interoperability between Python and R” [3]

Core functions include:

  • Calling Python from R in a variety of ways: R Markdown, sourcing Python scripts, importing Python modules, and using Python interactively

  • Translating between R and Python objects (between R and Pandas data frames, or between R matrices and NumPy arrays)

Let’s load the reticulate package first:

# import reticulate
library(reticulate)

# set console messages off
options(reticulate.repl.quiet = TRUE)

Reticulate Basics

Once reticulate is loaded, running Python is as easy as setting the chunk language to {python}.

# set value of a
a = "Hello" + " World"
print(a)
Hello World

Note: the variables created in your Python environment will not be contained in your R environment.

# check if a exists
exists('a')
[1] FALSE

To get around this, we can access Python variables from R through the py object:

# return value of a by calling py environment
py$a
[1] "Hello World"

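Likewise, R variables can be accessed from a Python chunk through the r object:

# set value of b (R chunk)
b <- 5

# return value of b by calling the R environment (Python chunk)
r.b
5.0

R stores 5 as a double by default, which is why it arrives in Python as the float 5.0.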

Reticulate Libraries

Reticulate is not limited to basic variable manipulation. Python libraries can also be imported, which makes more advanced Python available.

# import os library

import os

# get current working directory
os.getcwd()
'/mnt/rstor/CSE_MSE_RXF131/cradle-members/casf/eib14/git/23-bootcamp-data-science-social-impact/topics'

  • Reticulate is not limited to .Rmd files, where each chunk can specify its language
  • It can be run in R scripts as well. Here is a sample of how a Python library would be called and used in an R script:

# import os library
os <- import("os")

# get current working directory
os$getcwd()
[1] "/mnt/rstor/CSE_MSE_RXF131/cradle-members/casf/eib14/git/23-bootcamp-data-science-social-impact/topics"

These libraries can be leveraged to do classic Python manipulations.

# import numpy (specify no automatic Python to R conversion)
np <- import("numpy", convert = FALSE)

# create numpy array of 1-4
a <- np$array(c(1:4))

# apply cumulative sum to array (named cum_sum to avoid masking R's built-in sum())
cum_sum <- a$cumsum()

# convert object to R
py_to_r(cum_sum)
[1]  1  3  6 10
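
For comparison, here is a sketch of the same computation written directly in Python, as it might appear in a {python} chunk. It assumes NumPy is installed; the variable name cumulative is illustrative.

```python
import numpy as np

# create numpy array of 1-4
a = np.array([1, 2, 3, 4])

# apply cumulative sum to array
cumulative = a.cumsum()

# convert to a plain Python list for display
print(cumulative.tolist())  # [1, 3, 6, 10]
```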

Reticulate Functions

One can design and run a function in Python as well.

pyFunction <- "def print_message():
                print('Hello world!')"

py_run_string(pyFunction)

py$print_message()
Hello world!

One can even write a Python script in a .py file and then run it using reticulate:

py_run_file("reticulate-script.py")
Hello world!
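
The contents of reticulate-script.py are not shown here. A minimal sketch of what such a file could contain, assuming it simply defines and calls a function that prints the message; the function name and the returned value are assumptions made for this example.

```python
# reticulate-script.py (hypothetical contents; the original file is not shown)

def print_message():
    """Print the greeting and return it so it can also be inspected."""
    message = "Hello world!"
    print(message)
    return message

# runs when the file is executed, for example through py_run_file()
result = print_message()
```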

Glossary

Python: A high-level programming language known for its readability, versatility, and extensive library support, making it popular for web development, data science, scientific computing, and many other applications.

integer: A whole number that can be either positive, negative, or zero (e.g., -3, 0, 42). It does not have any fractional parts.

boolean: A data type that has only two possible values: True or False. It represents the logical concepts of true and false.

floating point: A number that has both an integer and a fractional part, separated by a decimal point (e.g., 3.14, -0.001). Floating point numbers are used in Python and R to represent real numbers.

string: A sequence of characters enclosed in either single (' ') or double (" ") quotes. Strings can include letters, numbers, symbols, and whitespace (e.g., "Hello, World!").

operator: A symbol or keyword that performs a specific operation on one or more operands. Examples include + (addition), - (subtraction), * (multiplication), and == (equality comparison).

function: A block of reusable code designed to perform a specific task. Functions can take in parameters (input values) and return a result.

index: The position of an item in a sequence, such as a string, list, or array. In Python, indexing starts at 0 for the first element, while in R it starts at 1.

environment: In the context of programming, it refers to a space where variables, functions, and other objects reside and can be accessed or modified. In R, environments are especially important and are used to manage the scope of variables and functions.

script: A file containing a sequence of instructions written in a programming language. When run, the instructions are executed in order. In Python, scripts typically have the .py extension, while in R they typically have the .R extension.

library: A collection of pre-written code, functions, or routines that can be used to perform specific tasks or operations in a program. Libraries help in reducing the amount of code a developer needs to write by providing standardized solutions. In Python, libraries are often referred to as “modules” and are imported using the import statement. In R, libraries are packages of functions and data sets, and they can be loaded into the session using the library() function.

Further Resources

[1] What is Python? Executive Summary. Python.org.

[2] R vs Python: R or Python? - Reasons behind this Cloud War. Analytics Vidhya.

[3] R Interface to Python. reticulate package documentation.