Summarizing your Data

My file organization: Directory: “R Course”, Project: “Organizing Data”, File: “Summarizing Data”

For this course you will need to install and load the package “dplyr”.

Don’t forget to run the code! If you don’t want to use CTRL+Enter everytime, Rmarkdown has a way to “run the chunk”:

The “dplyr” package allows us to do multiple functions at once by using this sign: %>%. Think of it similar to a plus [+] sign.

A summary of common calculations to summarize your data:

  • count() = find the sample size for each in the specified measurement
  • top_n() =find the top value(s)
  • summarize() = use calculations to summarize parts of your data
  • mutate() = calculations of a column

A summary of the common functions to organize your data. You will need to use these functions if your want calculations for specific variables/columns/etc:

  • select() = selects columns
  • starts_with() = pick columns that start with specified
  • ends_with() = pick columns that end with specified character
  • rename() = renaming column, can be done similarly with select() within the code
  • group_by()= grouping by certain variables
  • ungroup() = use after group_by() to ungroup the variables

Ok let’s do some data summary!

Count() – How many plants had sepal length of the specified measurement. So for example 10 plants had a Sepal Length of 5.0!

top_n() – what is the longest Sepal Length.

Here we see that the longest Sepal Length is 7.9, belonging to the virginica species and the other characteristics for that row.

Now we can show the smallest Sepal Length, by changing the “1” to “-1”:

The smallest Sepal Length is 4.3, and the corresponding characteristics in that row.

summarize() = Let’s find the mean and standard deviation of Sepal Length and Width:

You can use a multitude of different calculations for summarize(), which you can check out here

mutate() = let’s transform our data with more calculations and create a new column with these calculations. You can mutate by numbers or by columns.

I’ve created two new columns to my data set with mutate. My first column “Sepal.Length.meters” I transformed my data from centimeters to meters by multiplying the Length column by 0.01. I also found the correlation/ratio value for Petal Length over Petal Width. Notice How I was able to add both columns with two functions of mutate with %>%

But what if I wanted to do all of these functions to specific variables, instead of the data overall?

Well this is where our organization functions can come into play.

What if I want the top values only for Sepal Length and not the other information that came along with it? Well I’d select() just the Sepal Length column:

You can select multiple columns, or select columns with similar characters using starts_with() or ends_with():

Here I selected columns that start with “Sepal”, but I am ignoring a case match, so R can match all columns regardless of capital or non-capital letters

Speaking of letter cases, how about we rename the columns so we do not have to capitalize the letters when typing the into R!

Have you noticed how the first two columns have changed letter cases? Oh also I accidentally misspelt “length” and made a column with that mistake!

Last but not least lets try group_by(). Lets say you don’t care for the average of all of Sepal.Length in total, but instead want to know the average Sepal.Length for each species.

First group by the categorical variable you want the calculations for, and then do your calculation!

This is just the beginning to dplyr, but I hope this helps you start on the basics of data summarization. But check out this page on dplyr to find out more.