“Happy families are all alike; every unhappy family is unhappy in its own way.”
“Happy families are all alike; every unhappy family is unhappy in its own way.”
“Tidy datasets are all alike, but every messy dataset is messy in its own way.”
Hadley Wickham
There are three aspects of tidy data
If we can get data into this form, then analyses and plots are much easier to generate.
Often, it's the first guideline that I find is the usual culprit.
Multiple variables are encoded together within a single column.
Often, it's the first guideline that I find is the usual culprit.
Multiple variables are encoded together within a single column.
Another culprit is to have information captured in the variable names.
But sometimes, it can be hard to see "problems". So let's try an example.
Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23.
mtcars
Let's take a look at the mtcars
data set. It has 32 rows and 11 columns. Only 8 columns are shown here.
mpg | cyl | disp | hp | drat | wt | qsec | vs | |
---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 |
Mazda RX4 Wag | 21 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.46 | 20.22 | 1 |
Duster 360 | 14.3 | 8 | 360 | 245 | 3.21 | 3.57 | 15.84 | 0 |
But it also has information stored as rownames, these should be a vector in the data.frame. Furthermore, each row name contains the make and model of each car. Let's break these apart.
mtcars
The first step is to extract the row names and make them a variable within our data.frame/tibble.
Remember - always give good, descriptive names
mtcars %>% rownames_to_column(var = "make_model") %>% View()
mtcars
Let's check the results. Again, only a subset of columns are displayed here.
mtcars
For the next step we want to break apart or separate()
the make_model column into two pieces.
mtcars
For the next step we want to break apart or separate()
the make_model column into two pieces.
The easy part is finding a function which can do this.
mtcars
For the next step we want to break apart or separate()
the make_model column into two pieces.
The easy part is finding a function which can do this.
The hard part is telling the function exactly where to perform the splitting operation so that we have the model in one column and the make of the car in the other.
mtcars %>% rownames_to_column(var = "make_model") %>% separate(col = , into = , sep = )
mtcars
For the next step we want to break apart or separate()
the make_model column into two pieces.
The easy part is finding a function which can do this.
The hard part is telling the function exactly where to perform the splitting operation so that we have the model in one column and the make of the car in the other.
mtcars %>% rownames_to_column(var = "make_model") %>% separate(col = , into = , sep = )
col
specifies the column we want to break apartinto
specifies the names of the columns we want to break it intosep
is the break pointmtcars
mtcars %>% rownames_to_column(var = "make_model") %>% separate(col = "make_model", into = c("make","model"), sep = " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [2, 4, 29].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [6].
mtcars
mtcars %>% rownames_to_column(var = "make_model") %>% separate(col = "make_model", into = c("make","model"), sep = " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [2, 4, 29].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [6].
The first warning tells us that using a space character as the break point of the make_model
column resulted in three pieces when we expected only 2 pieces (though only for 3 rows in our dataset).
mtcars
mtcars %>% rownames_to_column(var = "make_model") %>% separate(col = "make_model", into = c("make","model"), sep = " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [2, 4, 29].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [6].
The first warning tells us that using a space character as the break point of the make_model
column resulted in three pieces when we expected only 2 pieces (though only for 3 rows in our dataset).
The second warning tells us that 1 row in our data set did not have any spaces at all.
mtcars
mtcars %>% rownames_to_column(var = "make_model") %>% separate(col = "make_model", into = c("make","model"), sep = " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [2, 4, 29].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [6].
The first warning tells us that using a space character as the break point of the make_model
column resulted in three pieces when we expected only 2 pieces (though only for 3 rows in our dataset).
The second warning tells us that 1 row in our data set did not have any spaces at all.
This means that we need to tell our separate()
function how we want to handle this potential loss of information.
mtcars
mtcars %>% rownames_to_column(var = "make_model") %>% separate(col = "make_model", into = c("make","model"), sep = " ", extra = "merge", fill = "right")
mtcars
mtcars %>% rownames_to_column(var = "make_model") %>% separate(col = "make_model", into = c("make","model"), sep = " ", extra = "merge", fill = "right")
mtcars
mtcars %>% rownames_to_column(var = "make_model") %>% separate(col = "make_model", into = c("make","model"), sep = " ", extra = "merge", fill = "right")
extra
tells the function what to do with the extra pieces of information. In this case, I opted to merge the remaining pieces with the second columnmtcars
mtcars %>% rownames_to_column(var = "make_model") %>% separate(col = "make_model", into = c("make","model"), sep = " ", extra = "merge", fill = "right")
extra
tells the function what to do with the extra pieces of information. In this case, I opted to merge the remaining pieces with the second column
fill
tells the function that if there are not enough pieces to fill our two new columns then fill the columns with NA
starting from the right.
mtcars
make | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Mazda | RX4 | 21 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 |
2 | Mazda | RX4 Wag | 21 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 |
3 | Datsun | 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 |
4 | Hornet | 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 |
5 | Hornet | Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 |
6 | Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | |
7 | Duster | 360 | 14.3 | 8 | 360 | 245 | 3.21 | 3.57 | 15.84 | 0 | 0 |
8 | Merc | 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.19 | 20 | 1 | 0 |
9 | Merc | 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.15 | 22.9 | 1 | 0 |
Often our data is easier to enter in wide format. Each row contains all attributes pertaining to a particular subject. Often, some R
functions require our data to be in long format.
Often our data is easier to enter in wide format. Each row contains all attributes pertaining to a particular subject. Often, some R
functions require our data to be in long format.
I generated some fictitious data to illustrate this.
In this example, I "administered" a math test to 5 subjects. Each subject wrote the same test at three time points, an initial baseline, a high-stress condition, and a low-stress condition.
## # A tibble: 5 x 5## id age baseline high_stress low_stress## <int> <dbl> <dbl> <dbl> <dbl>## 1 1 24 39.0 40.3 65.4## 2 2 18 57.9 36.9 67.6## 3 3 21 53.9 30.1 58.9## 4 4 24 53.5 48.8 66.9## 5 5 24 48.2 36.6 53.2
What might not be immediately evident is that there is a variable buried in the names of our columns, a factor called condition with three levels.
We can convert our data into wide-format using functions from the tidyr
package.
pivot_longer()
will work.
gather()
which is commonly usedWe need to supply it with a few arguments:
pivot_longer()
data_wide %>% pivot_longer(cols = c(baseline,high_stress,low_stress), names_to = "condition", values_to = "math_score")
## # A tibble: 15 x 4## id age condition math_score## <int> <dbl> <chr> <dbl>## 1 1 24 baseline 39.0## 2 1 24 high_stress 40.3## 3 1 24 low_stress 65.4## 4 2 18 baseline 57.9## 5 2 18 high_stress 36.9## 6 2 18 low_stress 67.6## 7 3 21 baseline 53.9## 8 3 21 high_stress 30.1## 9 3 21 low_stress 58.9## 10 4 24 baseline 53.5## 11 4 24 high_stress 48.8## 12 4 24 low_stress 66.9## 13 5 24 baseline 48.2## 14 5 24 high_stress 36.6## 15 5 24 low_stress 53.2
gather()
data_wide %>% gather(key = "condition", value = "math_score", baseline, high_stress, low_stress)
## # A tibble: 15 x 4## id age condition math_score## <int> <dbl> <chr> <dbl>## 1 1 24 baseline 39.0## 2 2 18 baseline 57.9## 3 3 21 baseline 53.9## 4 4 24 baseline 53.5## 5 5 24 baseline 48.2## 6 1 24 high_stress 40.3## 7 2 18 high_stress 36.9## 8 3 21 high_stress 30.1## 9 4 24 high_stress 48.8## 10 5 24 high_stress 36.6## 11 1 24 low_stress 65.4## 12 2 18 low_stress 67.6## 13 3 21 low_stress 58.9## 14 4 24 low_stress 66.9## 15 5 24 low_stress 53.2
data_long %>% pivot_wider(names_from = "condition", values_from = "math_score")
## # A tibble: 5 x 5## id age baseline high_stress low_stress## <int> <dbl> <dbl> <dbl> <dbl>## 1 1 24 39.0 40.3 65.4## 2 2 18 57.9 36.9 67.6## 3 3 21 53.9 30.1 58.9## 4 4 24 53.5 48.8 66.9## 5 5 24 48.2 36.6 53.2
data_long %>% spread(key = "condition", value = "math_score")
## # A tibble: 5 x 5## id age baseline high_stress low_stress## <int> <dbl> <dbl> <dbl> <dbl>## 1 1 24 39.0 40.3 65.4## 2 2 18 57.9 36.9 67.6## 3 3 21 53.9 30.1 58.9## 4 4 24 53.5 48.8 66.9## 5 5 24 48.2 36.6 53.2
Sometimes when you are cleaning up character data you might need ensure that data conforms to a particular pattern.
One common example would be to perform operations on a set of variables that follow a unique naming convention.
Regular expression can be used to check that data follow a particular form, we can also use regular expression to split our variables into multiple columns.
Regular expressions are written as strings (surrounded by quotation marks or double quotation marks).
The pattern is matched a single character at time, unless told otherwise.
So a pattern of "hello" would find exact matches for those characters.
There are pre-defined character class to make it easier to create more complex patterns.
There are many shorhand ways to write out these pre-defined character classes:
?
zero or one matches*
zero or more matches+
one or more matches{n}
exactly n matches{n,}
n or more matchesSay we want to ensure that dates to conform to ISO 8601: yyyy/mm/dd
To create a regular expression, we can go character-by-character and replace the string with character classed.
Say we want to ensure that dates to conform to ISO 8601: yyyy/mm/dd
To create a regular expression, we can go character-by-character and replace the string with character classed.
Together, our regular expression looks like this: "\d\d\d\d/\d\d/\d\d"
We can shorten it to: "\d{4}/\d{2}/\d{2}"
So, this pattern can be read as: 4 digits, a forward slash, 2 digits, forward slash, followed by 2 digits.
This link will download the cheat sheet:
https://github.com/rstudio/cheatsheets/raw/master/strings.pdf
This link has all of the help documentation and listing of available funcions:
https://www.rdocumentation.org/packages/stringr/versions/1.4.0
# Check a column of dates for ISO 8601 formatstr_detect(string = c("2020/07/19", "20/07/19"), pattern = "\\d{2,4}/\\d{2}/\\d{2}")
## [1] TRUE TRUE
# Examine where (if at all) our pattern can be found in a stringstr_view_all("Hello world!", pattern = "(?<=d)\\!")
# stringr also has extra character classes (see the cheat sheet)str_view_all("Hello world! Good-bye.", pattern = "[:punct:]")
I found an openly available dataset examining longitudinal change in sleep and daytime sleepiness in postpartum women (Filtness, MacKenzie, & Armstrong, 2014).
This work is licensed under a Creative Commons Attribution 4.0 International License.
Let's take a look at the data
. | |
---|---|
1 | Participant number |
2 | Age |
3 | Recruitment time (1 = pre birth, 2 = post birth) |
4 | Marital status (1 = Married / De facto / Partner) |
5 | Other child (1 = no, 2 = yes) |
6 | Education level (1= Junior certificate, 2 = senior certificate, 3 = associate diploma, 4 = diploma, 5 = degree, 6 = honours degree, 7 = masters or Ph.D.) |
7 | Returned to full time work by week 18 (0 = no, 1 = yes) |
data_sleep_depriv %>% pivot_longer(cols = starts_with("Week"), names_to = c("Week","Measure"), names_prefix = "Week", values_to = c("Score"), names_sep = "(?<=\\d) (?!=)")
Regular Expressions:
Tidyr: https://tidyr.tidyverse.org/articles/pivot.html
“Happy families are all alike; every unhappy family is unhappy in its own way.”
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |