Image credit: [Unsplash](https://unsplash.com/photos/gcgves5H_Ac)
Photo by Markus Spiske on Unsplash
install.packages(c("tidyverse", "here", "snakecase"))
You can learn more about R projects at https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects.
"Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data"
Wikipedia
"Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data"
Wikipedia
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data
Dasu and Johnson, 2003
"Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data"
Wikipedia
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data
Dasu and Johnson, 2003
Garbage in, garbage out.
Attribution depends on who you ask; mid-20th century?
Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning (Vol. 479). John Wiley & Sons.
Wu, S. (2013). "A review on coarse warranty data and analysis". Reliability Engineering & System Safety, 114, 1–11. doi:10.1016/j.ress.2012.12.021
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
www.tidyverse.org
Further, I recommend the book R for Data Science, which is freely available online at https://r4ds.had.co.nz/.
The packages within the tidyverse are categorized into two parts: 8 core packages and 15 non-core packages.
There is a higher level management package called tidyverse which helps maintain the whole collection of packages. All 23 packages can be installed with a single function.
install.packages("tidyverse")
The core packages are likely to be used each time you sit down to write code, and loading JUST the core 8 packages can be done with a single function.
library(tidyverse)
## -- Attaching packages ------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   1.0.2
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts --------------------------------- tidyverse_conflicts() --
## x dplyr::filter()     masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag()        masks stats::lag()
A nice thing to note here is that once the tidyverse package is loaded, it explicitly tells you that the core 8 packages are attached. It also supplies a message about which functions are now masked by the newly loaded package functions.
Tip: At any time, you can run the function tidyverse_conflicts() to see the message again.
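If you ever need one of the masked base functions, you can still reach it through its namespace. A quick sketch (presidents is a time series built into R):

# dplyr::filter() masks stats::filter(), but the base version remains reachable
stats::filter(presidents, rep(1/3, 3))  # 3-term moving average, not a row filter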
Dplyr provides a simple set of "verb" functions that make basic data manipulation easier (a short sketch follows the list):

filter()
select()
mutate()
group_by()
summarise()
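A minimal sketch of these verbs on the built-in mtcars dataset (all column names come from mtcars itself):

library(dplyr)

filter(mtcars, mpg > 30)            # keep only rows where mpg exceeds 30
select(mtcars, mpg, cyl, wt)        # keep only these three columns
mutate(mtcars, wt_kg = wt * 453.6)  # add a column (wt is in 1000s of lbs)
summarise(group_by(mtcars, cyl),    # mean mpg within each cylinder count
          mean_mpg = mean(mpg))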
This link will download the cheat sheet:
https://www.rstudio.org/links/data_transformation_cheat_sheet
This link has all of the help documentation and a listing of available functions:
https://www.rdocumentation.org/packages/dplyr/versions/0.7.8
The pipe operator %>% is a very useful way of writing human-readable code while avoiding intermediate objects within R to hold results.

This operator can be read as the phrase "and then...". It takes the result from the left side of the operator and inserts it as the FIRST argument of the right side of the operator.

# original function call
f(x, y)
# becomes
x %>% f(y)

# original function call
filter(mtcars, cyl == 4)
# becomes
mtcars %>% filter(cyl == 4)
Tab completion works for variable names.

You can chain as many functions together as you would like, though if a chain grows beyond roughly 10 steps, it is recommended to save the result as an intermediate object and begin a new chain.

If you need to pipe the left side of the operator into an argument other than the first, use the . placeholder.
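A minimal sketch of the . placeholder, piping a data frame into lm()'s data argument (which is not the first argument):

library(dplyr)

# . stands in for the piped-in data frame
mtcars %>% lm(mpg ~ wt, data = .)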
Data Quality can be broken down into a few categories:
A helpful blog post which provides a great summary and description of data cleaning: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
The Wikipedia entry the blog post was based upon: https://en.wikipedia.org/wiki/Data_cleansing
Data-profiling is a technique for getting to know data at a deeper level. It consists of a series of questions and checks we perform as we explore the data to see whether certain constraints are met.
We can do this using descriptive statistics and/or visualizations.
We will use both!
Data cleaning and profiling is an iterative process:

Identify problems/issues present within a dataset
Fix each issue (e.g., recode a missing-value code as NA) so that R treats it appropriately
Each time we fix something we will update our "cleaning_script.R" file
We are going to use a clean-as-we-go strategy
A problem that often occurs at this point is understanding what to look for within our data.
What constitutes a problem that needs to be addressed?
How do we know that we found them all?
Our goal is to have our data be as error-free and internally consistent as possible.
We have to start somewhere, so generally I look at numerical data first and categorical data after (we might also want to look at both types simultaneously).
Different types of data require us to ask different questions.
The dataset we will use in this section of the workshop is a publicly available dataset.
I believe it serves a few purposes for us:
data_superhero <- read_csv(file = here("data", "heroes_information.csv"))

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   name = col_character(),
##   Gender = col_character(),
##   `Eye color` = col_character(),
##   Race = col_character(),
##   `Hair color` = col_character(),
##   Height = col_double(),
##   Publisher = col_character(),
##   `Skin color` = col_character(),
##   Alignment = col_character(),
##   Weight = col_double()
## )

spec(data_superhero)
I prefer the readr package when importing "flat" data. It displays a progress bar for long import times and provides instant feedback on the type constraints imposed on each column.
Note that many tidyverse functions replace base R functions by substituting a '_' in place of a "."
Notice that we get a read-out of how R parsed the data it found within the csv file.
We also get warnings of some other problems.
I prefer snake_case because it is human-readable and machine-readable too!
Let's take a peek at the mysterious variable X1.

glimpse(data_superhero)

## Rows: 734
## Columns: 11
## $ X1           <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ name         <chr> "A-Bomb", "Abe Sapien", "Abin Sur", "Abomination", "Ab...
## $ Gender       <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male"...
## $ `Eye color`  <chr> "yellow", "blue", "blue", "green", "blue", "blue", "bl...
## $ Race         <chr> "Human", "Icthyo Sapien", "Ungaran", "Human / Radiatio...
## $ `Hair color` <chr> "No Hair", "No Hair", "No Hair", "No Hair", "Black", "...
## $ Height       <dbl> 203, 191, 185, 203, -99, 193, -99, 185, 173, 178, 191,...
## $ Publisher    <chr> "Marvel Comics", "Dark Horse Comics", "DC Comics", "Ma...
## $ `Skin color` <chr> "-", "blue", "red", "-", "-", "-", "-", "-", "-", "-",...
## $ Alignment    <chr> "good", "good", "good", "bad", "bad", "bad", "good", "...
## $ Weight       <dbl> 441, 65, 90, 441, -99, 122, -99, 88, 61, 81, 104, 108,...
n_distinct(data_superhero$X1)
## [1] 734
A unique constraint is a restriction imposed on a variable to allow for a particular observation unit to be identifiable and distinct from all other observational units.
In other words, each value in this variable MUST be associated with one (and only one) observation.
One of the main uses of this type of constraint is unique identifiers for each subject of analysis to prevent spill-over or contamination.
For instance, each subject should have their own ID such that all data are properly attributed to that subject.
The same goes for structurally nested data, such as education data where students are studied within classrooms, within schools. Each individual subject should have a unique identifier, as should each classroom and each school.
Some functions within R might require such a variable in order to run an analysis (such as repeated-measures, or longitudinal analyses).
They can also be immensely useful for merging data sources together (more on this later).
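A quick sketch of verifying a unique constraint; any rows returned indicate duplicated values:

# count occurrences of each X1 value and keep any that appear more than once
data_superhero %>%
  count(X1) %>%
  filter(n > 1)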
If the data set does NOT have a unique identifier for each row (or these identifiers are stored as row names), then I highly recommend you create one.
data_superhero %>% rownames_to_column(var = "id")
# or
data_superhero %>% add_column(id = 1:nrow(data_superhero), .before = 1)

I prefer the first because row names in R are already constrained to be unique.
Whenever you are creating a name for something in R, BE CLEAR AND DESCRIPTIVE.
I highly recommend following a style guide.
https://google.github.io/styleguide/Rguide.html
I use snake_case because I find it easier to read, and it has the added benefit of being machine-readable.
# which do you prefer?
variable1
variable_1
anxiety_scale_item_1  # snake_case
anxietyscaleitem1
anxietyScaleItem1     # camelCase
# Import data using a relative file path
data_superhero <- read_csv(file = here("data", "heroes_information.csv")) %>%
  # rename the first column as "id"
  rename("id" = X1) %>%
  # rename variables to make them easier to use
  rename_with(.fn = snakecase::to_snake_case)

# check that names were fixed properly
colnames(data_superhero)

##  [1] "id"         "name"       "gender"     "eye_color"  "race"
##  [6] "hair_color" "height"     "publisher"  "skin_color" "alignment"
## [11] "weight"
Data-type constraints are the next thing I check once I load data into R.
The readr package uses the first 1000 rows in a dataset to guess the most appropriate atomic type for each column in your dataset (they can be changed later):
A data.frame/tibble can be thought of as a way of storing your data which retains important structural information. Each row is a single observational unit, and each column tracks some attribute (quantitative or qualitative) about that observational unit. For instance, each row could be a subject within a study, and each column would be a measured/observed quality pertaining to that subject (like first name, student id, etc.).
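If readr guesses wrong, you can override the guess at import time with the col_types argument (a sketch; Height and Weight are column names from this csv):

# explicitly declare types instead of relying on readr's guesses;
# unspecified columns are still guessed as usual
data_superhero <- read_csv(
  file = here("data", "heroes_information.csv"),
  col_types = cols(Height = col_double(),
                   Weight = col_double())
)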
problems(), but Parsing Ain't One

If read_csv() encountered parsing problems, it prints a warning to the console. You can also extract and View() the problems using the following code.

# examine parsing problems in R's data viewer
problems(data_superhero) %>% View()

## [1] row      col      expected actual
## <0 rows> (or 0-length row.names)
Now that the variable names have been fixed, we can start profiling our data.
Data cleaning is often an iterative process: you might uncover a problem that needs to be addressed before you finish fixing the problem you started to clean. Let's start by taking a quick peek at our data.
glimpse(data_superhero)
## Rows: 734
## Columns: 11
## $ id         <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ name       <chr> "A-Bomb", "Abe Sapien", "Abin Sur", "Abomination", "Abra...
## $ gender     <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", ...
## $ eye_color  <chr> "yellow", "blue", "blue", "green", "blue", "blue", "blue...
## $ race       <chr> "Human", "Icthyo Sapien", "Ungaran", "Human / Radiation"...
## $ hair_color <chr> "No Hair", "No Hair", "No Hair", "No Hair", "Black", "No...
## $ height     <dbl> 203, 191, 185, 203, -99, 193, -99, 185, 173, 178, 191, 1...
## $ publisher  <chr> "Marvel Comics", "Dark Horse Comics", "DC Comics", "Marv...
## $ skin_color <chr> "-", "blue", "red", "-", "-", "-", "-", "-", "-", "-", "...
## $ alignment  <chr> "good", "good", "good", "bad", "bad", "bad", "good", "go...
## $ weight     <dbl> 441, 65, 90, 441, -99, 122, -99, 88, 61, 81, 104, 108, 9...
Range constraints require a little more thought: they depend on knowing your data, and cleaning your data is essentially getting to know it in enough detail to make informed decisions about problems.
For instance, if you have data which represents counts of things (e.g., number of symptoms, cigarettes smoked within a certain time frame), these can be checked by you.
Likewise, some variables cannot take on negative values because their scale of measurement renders negative values nonsensical (e.g., heights and weights). Even a value of zero might not be a valid value to observe.
Sometimes, variables might have a valid lower boundary, upper boundary, or both.
Check for "out-of-range" heights and weights.
# This checks for any row with EITHER weight <= 0 OR height <= 0
data_superhero %>%
  filter(weight <= 0 | height <= 0) %>%
  View()

# This checks for any row with BOTH weight <= 0 AND height <= 0
data_superhero %>%
  filter(weight <= 0, height <= 0) %>%
  View()
Remember, ALWAYS check that R is answering the question you think you asked
Dplyr has a between() function which can be helpful when used in conjunction with the negation operator !

# What if we were to assume that any height larger than 300cm is abnormally large?
data_superhero %>%
  filter(!between(x = height, left = 0, right = 300))

## # A tibble: 225 x 11
##       id name  gender eye_color race  hair_color height publisher skin_color
##    <dbl> <chr> <chr>  <chr>     <chr> <chr>       <dbl> <chr>     <chr>
##  1     4 Abra~ Male   blue      Cosm~ Black         -99 Marvel C~ -
##  2     6 Adam~ Male   blue      -     Blond         -99 NBC - He~ -
##  3    14 Alex~ Male   -         Human -             -99 Wildstorm -
##  4    15 Alex~ Male   -         -     -             -99 NBC - He~ -
##  5    18 Alla~ Male   -         -     -             -99 Wildstorm -
##  6    21 Ando~ Male   -         -     -             -99 NBC - He~ -
##  7    23 Angel Male   -         Vamp~ -             -99 Dark Hor~ -
##  8    26 Ange~ Female -         -     -             -99 Image Co~ -
##  9    32 Anti~ Male   -         -     -             -99 Image Co~ -
## 10    35 Aqua~ Male   blue      -     Blond         -99 DC Comics -
## # ... with 215 more rows, and 2 more variables: alignment <chr>, weight <dbl>
We could insert code similar to this into our chain, but I recommend not subsetting out whole rows of data at this point. This is akin to listwise deletion. Many modeling functions in R will perform listwise deletion automatically (unless you tell them otherwise). Depending on how you want to handle missing, outlying, or influential cases, you might need to retain as much of your data as possible.
p <- data_superhero %>%
  ggplot(aes(x = height)) +
  geom_histogram(bins = 20) +
  geom_vline(aes(xintercept = 0), color = "red")

plotly::ggplotly(p)

p <- data_superhero %>%
  ggplot(aes(x = weight, y = height)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  geom_vline(xintercept = 0, color = "red")

plotly::ggplotly(p)
We know that we have cases with negative height and weight, so what should we do about them?
# examine negative or zero height values
data_superhero %>%
  filter(height <= 0) %>%
  count(height)

## # A tibble: 1 x 2
##   height     n
##    <dbl> <int>
## 1    -99   217

# examine negative or zero weight values
data_superhero %>%
  filter(weight <= 0) %>%
  count(weight)

## # A tibble: 1 x 2
##   weight     n
##    <dbl> <int>
## 1    -99   237
That suspicious value of -99 was probably a way of encoding missing data, so we should update our cleaning script with two things.
Insert the mutate() we just created into our data pipeline (even if the negatives will be gone soon)
Recode the -99 values as NA so that R treats them appropriately

data_superhero %>%
  mutate(height = na_if(height, y = -99),
         weight = na_if(weight, y = -99)) %>%
  View()

## # A tibble: 734 x 11
##       id name  gender eye_color race  hair_color height publisher skin_color
##    <dbl> <chr> <chr>  <chr>     <chr> <chr>       <dbl> <chr>     <chr>
##  1     0 A-Bo~ Male   yellow    Human No Hair       203 Marvel C~ -
##  2     1 Abe ~ Male   blue      Icth~ No Hair       191 Dark Hor~ blue
##  3     2 Abin~ Male   blue      Unga~ No Hair       185 DC Comics red
##  4     3 Abom~ Male   green     Huma~ No Hair       203 Marvel C~ -
##  5     4 Abra~ Male   blue      Cosm~ Black          NA Marvel C~ -
##  6     5 Abso~ Male   blue      Human No Hair       193 Marvel C~ -
##  7     6 Adam~ Male   blue      -     Blond          NA NBC - He~ -
##  8     7 Adam~ Male   blue      Human Blond         185 DC Comics -
##  9     8 Agen~ Female blue      -     Blond         173 Marvel C~ -
## 10     9 Agen~ Male   brown     Human Brown         178 Marvel C~ -
## # ... with 724 more rows, and 2 more variables: alignment <chr>, weight <dbl>
Mutate if a column meets a particular criterion:

data_superhero %>%
  # replace -99 with NA for all double variables
  mutate_if(.predicate = is_double, .funs = na_if, y = -99)

Mutate at specific columns:

data_superhero %>%
  # replace -99 with NA for weight and height
  mutate_at(.vars = vars(weight, height), .funs = na_if, y = -99)
# Start by importing the data
# I'm using the here::here() function to make the file path relative to the project directory
data_superhero <- read_csv(file = here("data", "heroes_information.csv")) %>%
  # rename the first column as "id"
  rename("id" = X1) %>%
  # rename variables to make them easier to use
  rename_with(.fn = to_snake_case) %>%
  # replace -99 with NA
  mutate(height = na_if(height, y = -99),
         weight = na_if(weight, y = -99))
Rename the X1 variable to id, then ...
Recode -99 as NA for height and weight

What are the possible values/levels?
Are there more groups than you expected?
Should some "groups" be recast as a single group?
Categorical data in R is called a factor, and each factor should have a known and fixed set of levels.
If you know how many levels you have, then you can explicitly coerce one of your character vectors into a factor.
For example, if you designed an experiment with low, medium, and high stress conditions, then you should have a factor with only three levels.
In essence, we are saying that each subject belongs to a particular category.
For some categorical data we want to remove accidental groups in data that could be the result of typographical or data-entry errors.
Consider the following chunk of code:
# create example data
undergrad_year <- c("First", "Second", "Third", "second")

# convert to a factor without explicitly setting the levels
undergrad_year_fct <- parse_factor(undergrad_year)
undergrad_year_fct

## [1] First  Second Third  second
## Levels: First Second Third second
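One way to repair such an accidental level is to collapse it into the intended one. A sketch using forcats (a core tidyverse package):

# collapse the typo level "second" into "Second"
fct_collapse(undergrad_year_fct, Second = c("Second", "second"))

## [1] First  Second Third  Second
## Levels: First Second Third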
We want to be sure that when we clean up certain variables, we are not losing potentially important aspects of our data.
So, let's take a peek at our data to determine which variables should be cleaned to prevent case-sensitivity related errors.
Our data set has 8 categorical variables. Some fall easily into having a known and fixed set of levels (alignment), while others could have quite a few unforeseen levels (gender, eye_color, race, hair_color, publisher, and skin_color).

I also opted not to alter the name variable (we might lose proper spelling).
glimpse(data_superhero)
## Rows: 734
## Columns: 11
## $ id         <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ name       <chr> "A-Bomb", "Abe Sapien", "Abin Sur", "Abomination", "Abra...
## $ gender     <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", ...
## $ eye_color  <chr> "yellow", "blue", "blue", "green", "blue", "blue", "blue...
## $ race       <chr> "Human", "Icthyo Sapien", "Ungaran", "Human / Radiation"...
## $ hair_color <chr> "No Hair", "No Hair", "No Hair", "No Hair", "Black", "No...
## $ height     <dbl> 203, 191, 185, 203, -99, 193, -99, 185, 173, 178, 191, 1...
## $ publisher  <chr> "Marvel Comics", "Dark Horse Comics", "DC Comics", "Marv...
## $ skin_color <chr> "-", "blue", "red", "-", "-", "-", "-", "-", "-", "-", "...
## $ alignment  <chr> "good", "good", "good", "bad", "bad", "bad", "good", "go...
## $ weight     <dbl> 441, 65, 90, 441, -99, 122, -99, 88, 61, 81, 104, 108, 9...
I think about this selection process like this: would I want to group my data later on by this variable? Grouping the data by everyone with the name "A-Bomb" probably won't help us, but grouping by race, gender, or even skin_color could be helpful for graphs and analyses later on.
Cleaning up leading and trailing whitespace
# whitespace changes the data
" hi" == "hi"

## [1] FALSE

# case sensitivity matters
"Hi" == "hi"

## [1] FALSE
One nice thing about the readr package is that it automatically trims the leading and trailing whitespace from your data.
We can clean up our character data using the stringr package.
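A couple of sketches of stringr's whitespace helpers, handy when text arrives from sources readr did not clean on import:

library(stringr)

str_trim("  hi  ")     # "hi"  - removes leading/trailing whitespace
str_squish(" a   b ")  # "a b" - also collapses interior runs of whitespace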
Before we clean up character data, we need to carefully consider whether the cleaning action we take could result in a loss of information.

data_superhero %>%
  mutate_at(.vars = c("gender", "eye_color", "race", "hair_color",
                      "skin_color", "alignment", "publisher"),
            .funs = str_to_lower) %>%
  glimpse()

## Rows: 734
## Columns: 11
## $ id         <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ name       <chr> "A-Bomb", "Abe Sapien", "Abin Sur", "Abomination", "Abra...
## $ gender     <chr> "male", "male", "male", "male", "male", "male", "male", ...
## $ eye_color  <chr> "yellow", "blue", "blue", "green", "blue", "blue", "blue...
## $ race       <chr> "human", "icthyo sapien", "ungaran", "human / radiation"...
## $ hair_color <chr> "no hair", "no hair", "no hair", "no hair", "black", "no...
## $ height     <dbl> 203, 191, 185, 203, -99, 193, -99, 185, 173, 178, 191, 1...
## $ publisher  <chr> "marvel comics", "dark horse comics", "dc comics", "marv...
## $ skin_color <chr> "-", "blue", "red", "-", "-", "-", "-", "-", "-", "-", "...
## $ alignment  <chr> "good", "good", "good", "bad", "bad", "bad", "good", "go...
## $ weight     <dbl> 441, 65, 90, 441, -99, 122, -99, 88, 61, 81, 104, 108, 9...
So far we have cleaned/checked id and opted to leave name alone. Let's continue by taking a peek at gender and alignment.
fct_count(data_superhero$gender)

## # A tibble: 3 x 2
##   f          n
##   <fct>  <int>
## 1 -         29
## 2 female   200
## 3 male     505

fct_count(data_superhero$alignment)

## # A tibble: 4 x 2
##   f           n
##   <fct>   <int>
## 1 -           7
## 2 bad       207
## 3 good      496
## 4 neutral    24
There is a bizarre category with a value of "-"; this probably indicates missing data for the character variables within our dataset. So let's go back to our cleaning script and slightly alter the read_csv() function.
The read_csv() function has many additional arguments which can alter how the data is imported into R.

We can automatically insert NA values when it encounters specific character values via the "na" argument.

data_superhero <- read_csv(file = here("data", "heroes_information.csv"),
                           na = c("NA", "", "-", "."))
If you want more control over the coercion of your data into other atomic types or objects, then I recommend building a chain of functions, testing each as you go along.

I recommend the readr::parse_*() functions for coercion.

data_superhero <- data_superhero %>%
  mutate(gender_fct = parse_factor(gender, levels = c("female", "male")),
         alignment_fct = parse_factor(alignment, levels = c("good", "neutral", "bad")))
# example of values with no level for them to belong to
parse_factor(c("male", "m", "female"), levels = c("male", "female"))

## Warning: 1 parsing failure.
## row col           expected actual
##   2  -- value in level set      m
As a side note, we will cover what to do if you don't know all of the levels in advance and you want R to figure it out for you.
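As a quick preview, a sketch relying on parse_factor()'s documented behavior of inferring the level set when levels = NULL:

# with levels = NULL, the levels are built from the data itself
parse_factor(c("male", "m", "female"), levels = NULL)

## [1] male   m      female
## Levels: male m female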
problems()

Keep in mind that problems() comes from the readr package, so it only tracks problems from import functions of the form readr::read_*() or coercion functions of the form readr::parse_*().

# check the whole dataset for parsing problems arising from import
problems(data_superhero)

# check a single variable after using a parse_*() function for explicit coercion
problems(data_superhero$gender_fct)
eye_color

Let's keep going to see what else we can discover through profiling our data.

fct_count(data_superhero$eye_color) %>%
  arrange(f) %>%
  View()

## # A tibble: 23 x 2
##    f                n
##    <fct>        <int>
##  1 amber            2
##  2 black           23
##  3 blue           225
##  4 blue / white     1
##  5 bown             1
##  6 brown          126
##  7 gold             3
##  8 green           73
##  9 green / blue     1
## 10 grey             6
## # ... with 13 more rows
eye_color
# replace all instances of "bown" eye_color with "brown"data_superhero %>% mutate(eye_color = if_else(condition = eye_color == "bown", true = "brown", false = eye_color)) %>% # Check that there are no cases of "bown" remaining filter(eye_color == "bown")
## # A tibble: 0 x 13## # ... with 13 variables: id <dbl>, name <chr>, gender <chr>, eye_color <chr>,## # race <chr>, hair_color <chr>, height <dbl>, publisher <chr>,## # skin_color <chr>, alignment <chr>, weight <dbl>, gender_fct <fct>,## # alignment_fct <fct>
Insert this mutate() of the eye_color variable into our cleaning script, then re-check eye_color for any remaining typos.
I have found that there are frequently many more levels to a factor than I anticipated, so rather than examining a table, I prefer to visually assess the levels and their frequencies within my dataset. But the missing data still obscure the lower counts.
skin_color
p <- fct_count(data_superhero$skin_color) %>%
  arrange(f) %>%
  tidyr::drop_na(f) %>%
  ggplot(aes(x = f, y = n)) +
  geom_col() +
  coord_flip() +
  theme(text = element_text(size = 18))

# generate interactive plot
plotly::ggplotly(p)

Let's try again, but insert another function from the tidyr package to drop rows which have a missing level. Note that the missing data have not been removed from our cleaned data set; they were only omitted for this plot.
skin_color
# replace all instances of "gray" skin_color with "grey"data_superhero %>% mutate(skin_color = if_else(condition = skin_color == "gray", true = "grey", false = skin_color)) %>%# Check that there are no cases of "gray" remaining filter(skin_color == "gray")
## # A tibble: 0 x 13## # ... with 13 variables: id <dbl>, name <chr>, gender <chr>, eye_color <chr>,## # race <chr>, hair_color <chr>, height <dbl>, publisher <chr>,## # skin_color <chr>, alignment <chr>, weight <dbl>, gender_fct <fct>,## # alignment_fct <fct>
These are values that should be missing from your data set, depending on its structure.

For instance, imagine a questionnaire which asks for the marital status of a subject, and the subsequent question asks: if you answered "married", how long have you been married?

If you didn't answer "married", then you must have a missing answer here.
# find records that violate the constraint:
# non-married respondents should not have an answer recorded
survey_data %>%
  filter(married == FALSE, !is.na(joint_income)) %>%
  View()
In a nutshell, this pertains to variables in your data which MUST NOT be missing. For instance, subject/participant numbers, and key variables for modelling should all be present.
Constraints can be univariate (like an ID variable) or multivariate (like an ID variable, and multiple time-points)
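A minimal sketch of checking a mandatory (non-missing) constraint; here I am assuming id and name are required fields in our superhero data:

# any rows returned violate the mandatory-value constraint
data_superhero %>%
  filter(is.na(id) | is.na(name))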