Better Data Cleaning using R and the Tidyverse

1 / 27

“Happy families are all alike; every unhappy family is unhappy in its own way.”

Leo Tolstoy

2 / 27

“Happy families are all alike; every unhappy family is unhappy in its own way.”

Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.”

Hadley Wickham

2 / 27

Tidy Data

There are three aspects of tidy data

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

If we can get data into this form, then analyses and plots are much easier to generate.

3 / 27

The Usual Suspects

Often, it's the first guideline that I find is the usual culprit.

Multiple variables are encoded together within a single column.

e.g., Participant ID
- 100m25
- This is the 100th subject who is a 25 year-old male

4 / 27

The Usual Suspects

Often, it's the first guideline that I find is the usual culprit.

Multiple variables are encoded together within a single column.

e.g., Participant ID
- 100m25
- This is the 100th subject who is a 25 year-old male

Another culprit is to have information captured in the variable names.

This is often done regarding study design features
- e.g., Condition or time value was recorded

But sometimes, it can be hard to see "problems". So let's try an example.

4 / 27

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23.

Example 1: `mtcars`

Let's take a look at the mtcars data set. It has 32 rows and 11 columns. Only 8 columns are shown here.

Show entries

Search:

	mpg	cyl	disp	hp	drat	wt	qsec	vs
Mazda RX4	21	6	160	110	3.9	2.62	16.46	0
Mazda RX4 Wag	21	6	160	110	3.9	2.875	17.02	0
Datsun 710	22.8	4	108	93	3.85	2.32	18.61	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1
Hornet Sportabout	18.7	8	360	175	3.15	3.44	17.02	0
Valiant	18.1	6	225	105	2.76	3.46	20.22	1
Duster 360	14.3	8	360	245	3.21	3.57	15.84	0

Showing 1 to 7 of 32 entries

Previous1 2 3 4 5Next

5 / 27

But it also has information stored as rownames, these should be a vector in the data.frame. Furthermore, each row name contains the make and model of each car. Let's break these apart.

Example 1: `mtcars`

The first step is to extract the row names and make them a variable within our data.frame/tibble.

Remember - always give good, descriptive names

mtcars %>% 
  rownames_to_column(var = "make_model") %>%
  View()

6 / 27

Example 1: `mtcars`

Let's check the results. Again, only a subset of columns are displayed here.

Show entries

Search:

	make_model	mpg	cyl	disp	hp	drat	wt
1	Mazda RX4	21	6	160	110	3.9	2.62
2	Mazda RX4 Wag	21	6	160	110	3.9	2.875
3	Datsun 710	22.8	4	108	93	3.85	2.32
4	Hornet 4 Drive	21.4	6	258	110	3.08	3.215
5	Hornet Sportabout	18.7	8	360	175	3.15	3.44
6	Valiant	18.1	6	225	105	2.76	3.46
7	Duster 360	14.3	8	360	245	3.21	3.57

Showing 1 to 7 of 32 entries

Previous1 2 3 4 5Next

7 / 27

Example 1: `mtcars`

For the next step we want to break apart or separate() the make_model column into two pieces.

8 / 27

Example 1: `mtcars`

For the next step we want to break apart or separate() the make_model column into two pieces.

The easy part is finding a function which can do this.

8 / 27

Example 1: `mtcars`

For the next step we want to break apart or separate() the make_model column into two pieces.

The easy part is finding a function which can do this.

The hard part is telling the function exactly where to perform the splitting operation so that we have the model in one column and the make of the car in the other.

mtcars %>% 
  rownames_to_column(var = "make_model") %>% 
  separate(col = ,
           into = ,
           sep = )

8 / 27

Example 1: `mtcars`

For the next step we want to break apart or separate() the make_model column into two pieces.

The easy part is finding a function which can do this.

The hard part is telling the function exactly where to perform the splitting operation so that we have the model in one column and the make of the car in the other.

mtcars %>% 
  rownames_to_column(var = "make_model") %>% 
  separate(col = ,
           into = ,
           sep = )

col specifies the column we want to break apart
into specifies the names of the columns we want to break it into
sep is the break point

8 / 27

Example 1: `mtcars`

mtcars %>% 
  rownames_to_column(var = "make_model") %>% 
  separate(col = "make_model",
           into = c("make","model"),
           sep = " ")

## Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [2, 4, 29].

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [6].

9 / 27

Example 1: `mtcars`

mtcars %>% 
  rownames_to_column(var = "make_model") %>% 
  separate(col = "make_model",
           into = c("make","model"),
           sep = " ")

## Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [2, 4, 29].

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [6].

The first warning tells us that using a space character as the break point of the make_model column resulted in three pieces when we expected only 2 pieces (though only for 3 rows in our dataset).

9 / 27

Example 1: `mtcars`

mtcars %>% 
  rownames_to_column(var = "make_model") %>% 
  separate(col = "make_model",
           into = c("make","model"),
           sep = " ")

## Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [2, 4, 29].

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [6].

The first warning tells us that using a space character as the break point of the make_model column resulted in three pieces when we expected only 2 pieces (though only for 3 rows in our dataset).

The second warning tells us that 1 row in our data set did not have any spaces at all.

9 / 27

Example 1: `mtcars`

mtcars %>% 
  rownames_to_column(var = "make_model") %>% 
  separate(col = "make_model",
           into = c("make","model"),
           sep = " ")

## Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [2, 4, 29].

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [6].

The first warning tells us that using a space character as the break point of the make_model column resulted in three pieces when we expected only 2 pieces (though only for 3 rows in our dataset).

The second warning tells us that 1 row in our data set did not have any spaces at all.

This means that we need to tell our separate() function how we want to handle this potential loss of information.

9 / 27

Example 1: `mtcars`

mtcars %>% 
  rownames_to_column(var = "make_model") %>% 
  separate(col = "make_model",
           into = c("make","model"),
           sep = " ",
           extra = "merge",
           fill = "right")

10 / 27

Example 1: `mtcars`

mtcars %>% 
  rownames_to_column(var = "make_model") %>% 
  separate(col = "make_model",
           into = c("make","model"),
           sep = " ",
           extra = "merge",
           fill = "right")

10 / 27

Example 1: `mtcars`

mtcars %>% 
  rownames_to_column(var = "make_model") %>% 
  separate(col = "make_model",
           into = c("make","model"),
           sep = " ",
           extra = "merge",
           fill = "right")

extra tells the function what to do with the extra pieces of information. In this case, I opted to merge the remaining pieces with the second column

10 / 27

Example 1: `mtcars`

mtcars %>% 
  rownames_to_column(var = "make_model") %>% 
  separate(col = "make_model",
           into = c("make","model"),
           sep = " ",
           extra = "merge",
           fill = "right")

extra tells the function what to do with the extra pieces of information. In this case, I opted to merge the remaining pieces with the second column
fill tells the function that if there are not enough pieces to fill our two new columns then fill the columns with NA starting from the right.

10 / 27

Example 1: `mtcars`

Show entries

Search:

	make	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am
1	Mazda	RX4	21	6	160	110	3.9	2.62	16.46	0	1
2	Mazda	RX4 Wag	21	6	160	110	3.9	2.875	17.02	0	1
3	Datsun	710	22.8	4	108	93	3.85	2.32	18.61	1	1
4	Hornet	4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0
5	Hornet	Sportabout	18.7	8	360	175	3.15	3.44	17.02	0	0
6	Valiant		18.1	6	225	105	2.76	3.46	20.22	1	0
7	Duster	360	14.3	8	360	245	3.21	3.57	15.84	0	0
8	Merc	240D	24.4	4	146.7	62	3.69	3.19	20	1	0
9	Merc	230	22.8	4	140.8	95	3.92	3.15	22.9	1	0

Showing 1 to 9 of 32 entries

Previous1 2 3 4Next

11 / 27

Long vs Wide Format

12 / 27

Long vs Wide Format

Often our data is easier to enter in wide format. Each row contains all attributes pertaining to a particular subject. Often, some R functions require our data to be in long format.

12 / 27

Long vs Wide Format

Often our data is easier to enter in wide format. Each row contains all attributes pertaining to a particular subject. Often, some R functions require our data to be in long format.

I generated some fictitious data to illustrate this.

In this example, I "administered" a math test to 5 subjects. Each subject wrote the same test at three time points, an initial baseline, a high-stress condition, and a low-stress condition.

## # A tibble: 5 x 5
##      id   age baseline high_stress low_stress
##   <int> <dbl>    <dbl>       <dbl>      <dbl>
## 1     1    24     39.0        40.3       65.4
## 2     2    18     57.9        36.9       67.6
## 3     3    21     53.9        30.1       58.9
## 4     4    24     53.5        48.8       66.9
## 5     5    24     48.2        36.6       53.2

12 / 27

What might not be immediately evident is that there is a variable buried in the names of our columns, a factor called condition with three levels.

Long to Wide Format

We can convert our data into wide-format using functions from the tidyr package.

pivot_longer() will work.

there is another function, gather() which is commonly used

We need to supply it with a few arguments:

the data we want to transform
the columns we want to "gather" together into a new column/variable
- the new column is referred to as the "key"
the column we want the values from the cells to be placed into
- the new column is referred to as the "value" column

13 / 27

`pivot_longer()`

data_wide %>% 
  pivot_longer(cols = c(baseline,high_stress,low_stress),
               names_to = "condition",
               values_to = "math_score")

## # A tibble: 15 x 4
##       id   age condition   math_score
##    <int> <dbl> <chr>            <dbl>
##  1     1    24 baseline          39.0
##  2     1    24 high_stress       40.3
##  3     1    24 low_stress        65.4
##  4     2    18 baseline          57.9
##  5     2    18 high_stress       36.9
##  6     2    18 low_stress        67.6
##  7     3    21 baseline          53.9
##  8     3    21 high_stress       30.1
##  9     3    21 low_stress        58.9
## 10     4    24 baseline          53.5
## 11     4    24 high_stress       48.8
## 12     4    24 low_stress        66.9
## 13     5    24 baseline          48.2
## 14     5    24 high_stress       36.6
## 15     5    24 low_stress        53.2

14 / 27

`gather()`

data_wide %>% 
  gather(key = "condition",
         value = "math_score",
         baseline, high_stress, low_stress)

## # A tibble: 15 x 4
##       id   age condition   math_score
##    <int> <dbl> <chr>            <dbl>
##  1     1    24 baseline          39.0
##  2     2    18 baseline          57.9
##  3     3    21 baseline          53.9
##  4     4    24 baseline          53.5
##  5     5    24 baseline          48.2
##  6     1    24 high_stress       40.3
##  7     2    18 high_stress       36.9
##  8     3    21 high_stress       30.1
##  9     4    24 high_stress       48.8
## 10     5    24 high_stress       36.6
## 11     1    24 low_stress        65.4
## 12     2    18 low_stress        67.6
## 13     3    21 low_stress        58.9
## 14     4    24 low_stress        66.9
## 15     5    24 low_stress        53.2

15 / 27

Long to Wide Format

data_long %>% 
    pivot_wider(names_from = "condition",
                values_from = "math_score")

## # A tibble: 5 x 5
##      id   age baseline high_stress low_stress
##   <int> <dbl>    <dbl>       <dbl>      <dbl>
## 1     1    24     39.0        40.3       65.4
## 2     2    18     57.9        36.9       67.6
## 3     3    21     53.9        30.1       58.9
## 4     4    24     53.5        48.8       66.9
## 5     5    24     48.2        36.6       53.2

data_long %>% 
  spread(key = "condition", value = "math_score")

## # A tibble: 5 x 5
##      id   age baseline high_stress low_stress
##   <int> <dbl>    <dbl>       <dbl>      <dbl>
## 1     1    24     39.0        40.3       65.4
## 2     2    18     57.9        36.9       67.6
## 3     3    21     53.9        30.1       58.9
## 4     4    24     53.5        48.8       66.9
## 5     5    24     48.2        36.6       53.2

16 / 27

Regular Expressions

17 / 27

Regular Expressions

Sometimes when you are cleaning up character data you might need ensure that data conforms to a particular pattern.

One common example would be to perform operations on a set of variables that follow a unique naming convention.

e.g., BDI1 is the first item from the Beck Depression Inventory

Regular expression can be used to check that data follow a particular form, we can also use regular expression to split our variables into multiple columns.

17 / 27

Regular Expressions: Crash Course

Regular expressions are written as strings (surrounded by quotation marks or double quotation marks).

The pattern is matched a single character at time, unless told otherwise.

So a pattern of "hello" would find exact matches for those characters.

There are pre-defined character class to make it easier to create more complex patterns.

There are many shorhand ways to write out these pre-defined character classes:

\d - a digit
\s - a white space character (tab, space)
\w - any word character

18 / 27

More Complex Patterns

Quantifiers

? zero or one matches
* zero or more matches
+ one or more matches
{n} exactly n matches
{n,} n or more matches

Look-Arounds

a(?=c)
- a is followed by c
a(?!c)
- a is not followed by c
(?<=b)a
- a is preceded by b
(?<!b)a
- a is not preceded by b

Alternates

a|b
- means a or b
[abc]
- defines your own character class, one of the set
[a-c]
- defines a logical sequence of characters

Anchors

^
- Start of a string
$
- End of a string

19 / 27

Regular Expression: Matching A Date

Say we want to ensure that dates to conform to ISO 8601: yyyy/mm/dd

To create a regular expression, we can go character-by-character and replace the string with character classed.

Year:
- \d\d\d\d
Month:
- \d\d
Day:
- \d\d
The forward slashes "/"
- /
- We want to exactly match the forward slash, so it remains as is.

20 / 27

Regular Expression: Matching A Date

Say we want to ensure that dates to conform to ISO 8601: yyyy/mm/dd

To create a regular expression, we can go character-by-character and replace the string with character classed.

Year:
- \d\d\d\d
Month:
- \d\d
Day:
- \d\d
The forward slashes "/"
- /
- We want to exactly match the forward slash, so it remains as is.

Together, our regular expression looks like this: "\d\d\d\d/\d\d/\d\d"

We can shorten it to: "\d{4}/\d{2}/\d{2}"

20 / 27

So, this pattern can be read as: 4 digits, a forward slash, 2 digits, forward slash, followed by 2 digits.

stringr

This link will download the cheat sheet:

https://github.com/rstudio/cheatsheets/raw/master/strings.pdf

This link has all of the help documentation and listing of available funcions:

https://www.rdocumentation.org/packages/stringr/versions/1.4.0

21 / 27

Regular Expression Testing

# Check a column of dates for ISO 8601 format
str_detect(string = c("2020/07/19",
                      "20/07/19"),
           pattern = "\\d{2,4}/\\d{2}/\\d{2}")

## [1] TRUE TRUE

# Examine where (if at all) our pattern can be found in a string
str_view_all("Hello world!", pattern = "(?<=d)\\!")

# stringr also has extra character classes (see the cheat sheet)
str_view_all("Hello world! Good-bye.", pattern = "[:punct:]")

22 / 27

More complicated example

I found an openly available dataset examining longitudinal change in sleep and daytime sleepiness in postpartum women (Filtness, MacKenzie, & Armstrong, 2014).

Self report measures were taken for 7 consecutive days for postpartum weeks 6, 12, and 18
TSS - is the Total Sleep Time
KSS - is the Karolinski Sleepiness Scale
ESS - Epworth Sleepiness Scale

This work is licensed under a Creative Commons Attribution 4.0 International License.

23 / 27

Examine the Column Names to Find a Pattern

Let's take a look at the data

Show entries

Search:

	.
1	Participant number
2	Age
3	Recruitment time (1 = pre birth, 2 = post birth)
4	Marital status (1 = Married / De facto / Partner)
5	Other child (1 = no, 2 = yes)
6	Education level (1= Junior certificate, 2 = senior certificate, 3 = associate diploma, 4 = diploma, 5 = degree, 6 = honours degree, 7 = masters or Ph.D.)
7	Returned to full time work by week 18 (0 = no, 1 = yes)

Showing 1 to 7 of 43 entries

Previous1 2 3 4 5 6 7Next

24 / 27

Complicated Example Solution25 / 27

Complicated Example Solution

data_sleep_depriv %>% 
  pivot_longer(cols = starts_with("Week"),
               names_to = c("Week","Measure"),
               names_prefix = "Week",
               values_to = c("Score"),
               names_sep = "(?<=\\d) (?!=)")

25 / 27

Take Home MessageHaving data in a "tidy" format gives us a lot of power and flexibility.
Converting between long and wide formats can be necessary for particular analyses in R, so I suggest practising.
Anytime you need to match patterns, regular expression can be indispensable
26 / 27

Resources:

Filtness, A.J.; MacKenzie, J.; Armstrong, K. (2014): Sleep and daytime sleepiness in postpartum women at 6, 12 and 18 weeks. Queensland University of Technology. (Dataset) https://doi.org/10.4225/09/586b2c39dbec8

Regular Expressions:

Online exercises - https://regexone.com/

Tidyr: https://tidyr.tidyverse.org/articles/pivot.html

27 / 27

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help

Tidy Data

The Usual Suspects

The Usual Suspects

Example 1: mtcars

Example 1: mtcars

Example 1: mtcars

Example 1: mtcars

Example 1: mtcars

Example 1: mtcars

Example 1: mtcars

Example 1: mtcars

Example 1: mtcars

Example 1: mtcars

Example 1: mtcars

Example 1: mtcars

Example 1: mtcars

Example 1: mtcars

Example 1: mtcars

Example 1: mtcars

Long vs Wide Format

Long vs Wide Format

Long vs Wide Format

Long to Wide Format

pivot_longer()

gather()

Long to Wide Format

Regular Expressions

Regular Expressions

Regular Expressions: Crash Course

More Complex Patterns

Quantifiers

Look-Arounds

Alternates

Anchors

Regular Expression: Matching A Date

Regular Expression: Matching A Date

stringr

Regular Expression Testing

More complicated example

Examine the Column Names to Find a Pattern

Complicated Example Solution

Complicated Example Solution

Take Home Message

Resources:

Help

Example 1: `mtcars`

Example 1: `mtcars`

Example 1: `mtcars`

Example 1: `mtcars`

Example 1: `mtcars`

Example 1: `mtcars`

Example 1: `mtcars`

Example 1: `mtcars`

Example 1: `mtcars`

Example 1: `mtcars`

Example 1: `mtcars`

Example 1: `mtcars`

Example 1: `mtcars`

Example 1: `mtcars`

Example 1: `mtcars`

Example 1: `mtcars`

`pivot_longer()`

`gather()`