Transforming Data in R
Class #7 (Tue., Jan 28)
PDF version
Reading:
Required Reading (everyone):
- R for Data Science, Ch. 2–3.
Reading Notes:
Overview of the reading
- See the homework assignment for the exercises you whould do.
Chapter 2
This very short chapter goes over the basics of coding in R. It discusses creating variables to store data or other information, and calling functions to transform data.
Chapter 3
This chapter is longer and gets into how we organize data in R. There are many ways to organize and work with data, and for this book and this course, we will follow Hadley Wickham’s principles of tidy data. The principles of tidy data are explained in section 5.2, which we will read for Thu., Jan. 30.
This chapter focuses on the dplyr package, which is part of the
larger tidyverse package.
The dplyr package (pronounced “Dee-Plier”, like “pliers”) defines a
data structure called a tibble (which stands for “Tidy Table”), and
a suite of functions for transforming tibbles.
A tibble is a variation on an R data.frame, and you can mostly
consider the two to be the same.
In a tibble, each row represents a different measurement or observation,
and it contains columns, which represent the different variables you
record for each observation:
tibbleThe dplyr package defines many functions to manipulate and transform
tibbles. The most important for this chapter are:
* filter to create a new tibble by choosing certain rows that match
conditions given in filter.
* arrange to rearrange a tibble by putting the rows in order of
different columns, from smallest to largest or largest to smallest.
* select to create a new tibble by selecting a subset of the
columns, and optionally renaming and re-ordering them.
* rename and rerlocate can change the names of columns or the order
of columns in the table.
* mutate to change the values of certain columns or create new columns.
You can use other columns to set the value of the new column to things
like the sum, difference, product, ratio, or other functions of other
columns.
* The pipe |> lets you easily connect multiple functions to create
a new tibble from a combination of these functions.
* group_by and ungroup let you perform operations on groups of
rows in a tibble. For instance, if a tibble contains measurements
of rocks retrieved from five different sites, you can use group_by
to report the average silica content of rocks at each site.
* summarize lets you generate summary statistics, such as mean,
median, standard deviation, maximum, or minimum, for selected
columns.
* slice functions let you select certain rows from a tibble, such
as the 5 rows with the greatest values for a certain column.
There is a lot more to dplyr, but this will get us started.