Overview and aims
The goal for this module is to help you learn the basics of using R so you can go away and start to tackle your own data. R is a programming language whose first stable version (1.0.0) was released back in 2000. There are many reasons to use R:
- it is easy to use and has excellent graphic capabilities
- it makes lots of new statistical methods available in a straightforward manner
- it’s open source and free.
- because of ⬆️, it is supported by a large user network.
- because of ⬆️⬆️, it has lots and lots of new analysis packages being released all of the time
- it is also old, but because of ⬆️⬆️⬆️, you can use the more intuitive new functions, which do things much faster.
- it can be run on Windows, Linux and Mac
When most researchers think of data analysis, they think of turning data, perhaps curated in a spreadsheet such as Excel, into some shiny figures that are presentable.
There is now the exciting research discipline of Data Science that allows you to “turn raw data into understanding, insight and knowledge”. Before we start, we need to cite the materials that inspired this course, which we have adapted to make use of helminth-related data:
- R for Data Science, written by Hadley Wickham and Garrett Grolemund (referred to from now on as R4DS; https://r4ds.had.co.nz/ )
- A ModernDive into R and the Tidyverse, written by Chester Ismay and Albert Y. Kim (referred to from now on as ModernDive; https://moderndive.com/index.html )
These books have been designed for absolute beginners to learn all the basics of R. They are free online and we encourage you to read further in your own time.
The goal of data analysis is to explore, gain insights, and interpret your data. An analysis cycle usually looks something like this:
Part 1: Introduction to R
In this module, we are introducing you to the programming language R via RStudio. So what is the difference between the two? You should think of RStudio as a car’s shiny dashboard and R as the engine. “More precisely, R is a programming language that runs computations, while RStudio is an integrated development environment (IDE) that provides an interface by adding many convenient features and tools. So just as the way of having access to a speedometer, rearview mirrors, and a navigation system makes driving much easier, using RStudio’s interface makes using R much easier as well.” (ModernDive).
R can certainly be run independently of RStudio: you can type “R” on the command line (provided it is installed, as it is on your VM) and the R command-line prompt will appear. Sometimes it may in fact be necessary to run R from the command line, especially if you have a large computational task that requires sending jobs to a high-performance computing cluster. However, for the most part, you will find that RStudio is sufficient for what you need. If you want to use RStudio later on your own personal computer, note that R is a separate download and needs to be installed as well.
Let’s get started by opening RStudio on your VM by double-clicking on the RStudio icon on the VM desktop.
After RStudio opens, you should see something similar to Figure 3.
PS. If you see a window saying Updates Available, this means a new version of RStudio is available. For the purpose of this exercise you can just click Ignore Update.
R as a calculator
To code in R, you can start by clicking on the Console panel. It is typically the bottom left panel. Type something, and press Enter/Return.
# Some housekeeping - if you see text after the hash '#' symbol throughout the code, it is a
# comment or instruction and not actual code (just like this!). It means that you do not have
# to type it in to the command line. Making comments in your code is actually really important
# - it allows you to remember what you have done, describe a result or process, and makes it
# much easier for someone else to read, understand, and reuse your code.
# It is a great habit to have!
# R in its basic form is a fancy calculator. Let's try this out. Once you have typed the equation,
# press enter to complete the expression
# First, a simple one. We are going to do 2 plus 3.
2 + 3
# We can do multiplications
2 * 3
# and division
2 / 3
# By simply pressing return/enter, you get a complete expression
1
# An incomplete expression will result in a continuation prompt, indicated by the "+" symbol.
3+
1111+
1000
Simple 😎 , isn’t it?
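As with any calculator, the order of operations matters. The short sketch below uses arbitrary numbers chosen only for illustration. It shows that R does multiplication before addition unless you add round brackets, and that the "^" symbol raises a number to a power.

```r
# Multiplication is done before addition, so this is 2 + 12
2 + 3 * 4

# Round brackets change the order, so this is 5 * 4
(2 + 3) * 4

# "^" raises a number to a power: 2 * 2 * 2
2 ^ 3
```

If a result surprises you, adding brackets is usually the quickest way to make the intended order explicit.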
Good habit of writing in an R script
It’s usually good practice to save your code in a script, so you can organise your work in a file that can be reused later! In RStudio, you can do this via File -> New File -> R Script.
Now RStudio should show an empty R script where you can type your commands. Note that you can run just the line of code where your cursor “|” is in the script window by pressing left Ctrl + Enter. Alternatively, you can click the Run button at the top of the script window. Try it!
For the rest of the exercise you should try to copy and paste each code chunk into the R script window and go through it line by line. Then run the code and execute it whenever it’s required.
Assignment operator
The calculations alone will be meaningless if we don’t store them for later use. We can save the output of our command using the assignment operator <- to create a new variable.
# "<-" is the assignment operator. Let's assign the number 5 to the variable "x".
x <- 5
# Next, assign 10 to y
y <- 10
# To check if the assignments are successful, you can always simply type x or y and press return
x
y
# We can perform calculations on data stored in a variable
x + y
x * y
# Like many programming languages, R is case sensitive. For example, the lower-case variable
# "x" is not the same as the upper-case variable "X"
X <- 20
# Test this by comparing the two variables
x
X
# We can overwrite a variable by simply reassigning it. Previously, we assigned 5 to x.
# However, once 10 is assigned to x, x is no longer 5.
x <- 10
x
# In the same way that we can perform calculations using data stored in variables,
# we can also assign the output of that calculation to a new variable.
z <- x + y + X
z
# A reminder: have you copied and pasted this code chunk into an R script and
# gone through it line by line?
Exercises
1. Calculate the number of minutes in a single day by first assigning, and then multiplying, an hour variable and a minutes variable.
# how many hours are in a day? I (or everyone) wish there were 48 but there are only 24 hours..
hour <- 24 # please continue
2. Congratulations! You have won £100 (only for the exercise, doh!), but you have to split it with your classmates. How much money will each of you get? First you will have to find out how many participants are in this course and assign that number to a variable. Will you share with the trainers? Please?
R data structure
In order to start appreciating the power of R, the data needs to be organised or “tidied” in such a way so that calculations can be performed on them. There are many data types and structures in R. For simplicity we will introduce only two data structures: vector and data frame, and two data types: numeric (numbers) and character (text based).
A vector is the simplest data structure in R. It essentially contains a series of values of the same type. Data frames are a collection of vectors arranged in columns. Think of a data frame like an Excel spreadsheet, where each row represents a sample and each measured variable sits in its own column. To reiterate:
Rows represent samples, e.g., sample A in Row 1, sample B in Row 2, and so on.
And all the values of the same variable must go in the same column. We can have multiple columns with multiple, different data types, for example, age, sex, RPKM, numbers, whatever.
We note here that there are newer and more intuitive ways of manipulating data frame-type data, and we will use the tibble from the tidyverse later in the module.
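To make the two data types and two data structures concrete, here is a minimal sketch. The variable names (ages, samples, mixed, toy_df) and all values are made up for illustration and are not part of the course data. It uses c(), which combines values into a vector, together with the base R functions class(), data.frame() and dim().

```r
# A numeric vector and a character vector (made-up values, for illustration only)
ages <- c(12, 35, 8)
samples <- c("sample_A", "sample_B", "sample_C")

# class() reports the data type stored in a vector
class(ages)
class(samples)

# A vector holds one type only: mixing numbers and text turns everything into text
mixed <- c(1, "two", 3)
class(mixed)

# A data frame combines vectors of equal length as columns; each row is one sample
toy_df <- data.frame(sample = samples, age = ages)
toy_df

# dim() gives the number of rows, then the number of columns
dim(toy_df)
```

Here class(ages) reports "numeric", while both class(samples) and class(mixed) report "character", which is why values of the same variable should always share one type.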
# Let's explore what a vector looks like. Assign a vector of 1 to 10 to the variable x.
x <- c(1,2,3,4,5,6,7,8,9,10)
# The "c" outside of the brackets means to "combine" or "concatenate" the values within
# the brackets
# Now that we have a vector, we can do calculations on all the values within a vector at
# the same time! Try the following:
x * 2
# and
x / 10 + 1
# you can even perform calculations on data stored in multiple vectors.
y <- c(1,2,3,4,5,6,7,8,9,10)
x * y
# using vectors becomes a very efficient way of storing and manipulating large data.
Functions
Note that in the previous code block, we actually introduced the function c. A function provides a simplified way to perform a more complex operation on data. We can write our own functions (which is perhaps beyond the scope of this tutorial); however, R conveniently contains many built-in functions that are waiting to be discovered! A function is applied using round brackets, and can take arguments and options.
function (arg1, arg2, arg3… , option1=,option2=...)
# Previously, we defined a vector of numbers from 1 to 10 using the following command
x <- c(1,2,3,4,5,6,7,8,9,10)
# However, we can use the function "seq" to achieve the same outcome, much more efficiently.
x <- seq(1,10)
# As you can see, using a function can save time and help with reproducibility. You are more
# likely to make a mistake typing all of the numbers in the first command. Similarly, you
# wouldn't want to type out all numbers if you have 1000s or more. Moreover, if you are doing
# repetitive tasks, defining a function can save a lot of time and typing.
# Try adding the option "by = 2" in the seq function. What does it do?
x <- seq(1,10, by = 2)
x
# A useful hint to better understand an in-built function: add "?" in front of the function and
# press enter to view what options a particular function has available.
?seq
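The difference between arguments and options is easier to see with a small example. The sketch below is for illustration only and its variable names (by_position, by_name, reordered) are made up. It calls seq() three ways using the argument names listed on its help page (from, to, by and length.out).

```r
# Arguments can be matched by their position ...
by_position <- seq(1, 10, 2)

# ... or by their name, which is easier to read and lets you change the order
by_name <- seq(from = 1, to = 10, by = 2)
reordered <- seq(by = 2, to = 10, from = 1)

# All three calls give the same vector: 1 3 5 7 9
by_position
identical(by_position, by_name)
identical(by_name, reordered)

# "length.out" is another option of seq(): ask for 4 evenly spaced values
# instead of giving a step size
seq(1, 10, length.out = 4)
```

Naming the options takes a little more typing, but it makes the code easier to understand when you come back to it later.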
Exercise
Try out all of these functions: length, mean, median, max, on x to see what they produce!
Hint: For example, mean(x)
We will end this first section with a plot function, based on what you have been taught today.
# Let's set two vectors of 30 numbers, in two different ways.
x <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
# note, we can do the same using the "rep" function:
x <- rep(1:3, each=10)
# for our second vector, let's generate some random numbers sampled from a normal
# distribution using the "rnorm" function
y <- rnorm(30)
# Let's generate a simple plot using base R graph functions to visualise the relationship
# between x and y.
# Note that each set of numbers will be paired according to their order in the vector
plot(x, y)
# Let's add some colour using the "col" option
plot(x, y, col="red")
# We can also colour using our data. In our current data, the x variable is categorical, so this
# might be good to set as a colour to distinguish each group.
plot(x, y, col=x)
# We can also change the size and shape of the points using the cex and pch options, respectively
plot(x, y, col=x, pch=20, cex=2)
# Try changing the "pch" value from between 1 and 20 and see what you come up with!
# many functions have additional parameters. Let's try a boxplot.
# note we have used a "~" rather than a ",". This indicates we want to show "y" data for
# each "x" category as one might expect in a box plot.
boxplot(y ~ x)
# I like hotpink
boxplot(y ~ x, col = c("hotpink", "yellow", "red") )
boxplot(y ~ x, col = c("hotpink", "yellow", "red"), main="My first plot" )
# you can use the ? sign to find out more about a function
?boxplot
Exercise
1. Actually I am kidding about hotpink. Can you try to regenerate the boxplot using different colours?
2. In the code above, we used the rnorm function to generate some random numbers sampled from a normal distribution. Let’s check visually that this is in fact a normal distribution. Can you set a new variable containing 1000 randomly sampled numbers, and visualise this using the hist function?
Part 2: Data Wrangling
A large part of any analysis is data wrangling, a term used in data science to describe the process of getting your raw data into a format that can be used in R for further analysis and visualisation. Raw data is often very messy, will likely require some sort of manipulation to get it into a suitable format, and may take up a significant amount of the overall analysis time! In the old days, we typically copied and pasted between different Excel spreadsheets. The problem with this approach is that if new data arrives, the whole manual process has to be repeated, which is time consuming and may lead to human error. In general, to enable reproducibility we want to minimise the amount of manual manipulation and instead use commands that allow us to automate these processes.
In this section of the module, you will see that a lot of actions can be performed using existing in-built functions in R.
Packages
R consists of a core set of in-built functions which can be supplemented with additional packages. Packages are collections of R functions, data, and compiled code, with a well-defined format that ensures easy installation and a basic standard set of documentation, enabling portability and reliability.
Here, we will use the tidyverse package, which is highly recommended for manipulating data structures.
# Normally, you would have to install a package the first time you want to use it. There is
# a function for this, called "install.packages()", for example:
# install.packages("tidyverse")
# We have already installed tidyverse on the VM for you, so you do not need to do this (hence,
# it is commented with a #).
# Once a package has been installed, to access it, you will need to use the "library()" function
# to load the package each time you start R and want to use it:
library(tidyverse)
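On your own computer you may not remember whether a package has already been installed. The sketch below is one possible way to check before loading, not the only one. It uses the base R function requireNamespace(), which returns TRUE or FALSE instead of stopping with an error; the variable name "installed" is made up for this example.

```r
# Check whether the package is installed, without stopping with an error if it is not
installed <- requireNamespace("tidyverse", quietly = TRUE)

if (installed) {
  # The package is available, so load it as usual
  library(tidyverse)
} else {
  # On your own computer you would run this once (remove the "#" first):
  # install.packages("tidyverse")
  message("tidyverse is not installed yet; install it with install.packages(\"tidyverse\")")
}
```

Remember that installing is done once per computer, whereas library() is needed every time you start a new R session.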
Data import
In this exercise, we will be using helminth prevalence data downloaded from the Global Atlas of Helminth Infection (GAHI; http://www.thiswormyworld.org ). There are three files, which are available to download from the GitHub page.
Ascaris_prevalence_data.txt
Hookworms prevalence_data.txt
Schisoma_mansoni_prevalence_data.txt
They are text files in which the columns of data are separated by tabs, i.e., they use a tab delimiter. The delimiter is how R knows where one column ends and the next begins. Note that there are different types of delimiters (commas are another common one), and that these often cause problems when getting data into R if the delimiter is not set correctly or is used inconsistently in the data. It is worth paying attention to this in your own data and, for ease, sticking to one type.
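To see why the delimiter matters, here is a small sketch using a made-up file. The file contents and the names toy_file, toy_tab and toy_comma are hypothetical and exist only for this illustration. It writes three tab-separated lines to a temporary file and reads them back with read_delim() from the readr package (part of the tidyverse), once with the right delimiter and once with the wrong one.

```r
library(readr)

# Write a tiny, made-up tab-delimited file to a temporary location
toy_file <- tempfile(fileext = ".txt")
writeLines(c("Country\tYear\tPrevalence",
             "Ghana\t2008\t0.02",
             "Malawi\t2004\t0.01"),
           toy_file)

# Correct delimiter: the three columns are recognised
toy_tab <- read_delim(file = toy_file, delim = "\t")
ncol(toy_tab)

# Wrong delimiter: R finds no commas, so each line is read as one single column
toy_comma <- read_delim(file = toy_file, delim = ",")
ncol(toy_comma)
```

If a table you have just imported has only one column, a wrong delimiter is one of the first things to check.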
Let’s get started. First we are going to download and read the Ascaris_prevalence_data.txt file into R from GitHub using the read_delim function. We need to tell R that it is tab delimited. Once it has been read successfully, the data will be stored as a tibble, which allows a lot of nice functions to act on it.
ascaris_data_url <- 'https://github.com/WGCAdvancedCourses/Helminths_2021/raw/main/manuals/module_5_R/Ascaris_prevalence_data.txt'
ascaris_data <- read_delim(file = ascaris_data_url, delim = "\t")
# Success! You should see some messages!
# Now ascaris_data is a tibble
ascaris_data
## # A tibble: 989 × 13
## Region Country ISO_code ADM1 Latit…¹ Longi…² Year_…³ Year_…⁴ Age_s…⁵ Age_end
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa Angola AO Bengo -8.59 13.6 2010 2010 0 80
## 2 Africa Angola AO Bengo -8.63 13.7 2010 2010 0 80
## 3 Africa Angola AO Bengo -8.61 13.6 2010 2010 0 80
## 4 Africa Angola AO Bengo -8.62 14.2 2010 2010 0 80
## 5 Africa Angola AO Bengo -8.53 13.7 2010 2010 0 80
## 6 Africa Angola AO Bengo -8.60 13.6 2010 2010 0 80
## 7 Africa Angola AO Bengo -8.63 14.0 2010 2010 0 80
## 8 Africa Angola AO Bengo -8.62 13.8 2010 2010 0 80
## 9 Africa Angola AO Bengo -8.64 14.0 2010 2010 0 80
## 10 Africa Angola AO Bengo -8.60 13.6 2010 2010 0 80
## # … with 979 more rows, 3 more variables: Individuals_surveyed <dbl>,
## # Number_Positives <dbl>, Prevalence <dbl>, and abbreviated variable names
## # ¹Latitude, ²Longitude, ³Year_start, ⁴Year_end, ⁵Age_start
# What columns does the table have?
colnames(ascaris_data)
## [1] "Region" "Country" "ISO_code"
## [4] "ADM1" "Latitude" "Longitude"
## [7] "Year_start" "Year_end" "Age_start"
## [10] "Age_end" "Individuals_surveyed" "Number_Positives"
## [13] "Prevalence"
# A quick tip is you can select one column using the $ symbol
# The result will be a vector containing all the values from the selected column
# In the example below we are selecting the Country column from ascaris_data
ascaris_data$Country
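Because the $ symbol returns an ordinary vector, you can pass the result straight to the functions you met in Part 1. The sketch below uses a small made-up data frame (the name "toy" and its values are hypothetical) rather than the real data, so that the results are easy to check by eye. unique() is a base R function that lists each distinct value once.

```r
# A small made-up data frame (values are for illustration only)
toy <- data.frame(Country = c("Ghana", "Ghana", "Malawi"),
                  Prevalence = c(0.25, 0.75, 0.5),
                  stringsAsFactors = FALSE)

# "$" pulls one column out as a vector ...
toy$Prevalence

# ... so any function that works on a vector also works on a column
mean(toy$Prevalence)
max(toy$Prevalence)

# unique() lists each distinct value once, which is handy for text columns
unique(toy$Country)
```

The same pattern should work on the real data, for example to list the countries present in the Country column.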
Data transformation
In this section, you are going to learn five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:
- Pick observations by their values (filter()).
- Reorder the rows (arrange()).
- Pick variables by their names (select()).
- Create new variables with functions of existing variables (mutate()).
- Collapse many values down to a single summary (summarise()).
These can all be used in conjunction with a sixth function - group_by() - which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. Overall, these functions provide the verbs for a language of data manipulation. (R4DS)
Filtering
filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the tibble.
# Let's filter our data based on the region "Africa". Here, we use the logical operator "==" to
# keep only the rows where the "Region" column is equal to "Africa"
filter(ascaris_data, Region == "Africa")
## # A tibble: 856 × 13
## Region Country ISO_code ADM1 Latit…¹ Longi…² Year_…³ Year_…⁴ Age_s…⁵ Age_end
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa Angola AO Bengo -8.59 13.6 2010 2010 0 80
## 2 Africa Angola AO Bengo -8.63 13.7 2010 2010 0 80
## 3 Africa Angola AO Bengo -8.61 13.6 2010 2010 0 80
## 4 Africa Angola AO Bengo -8.62 14.2 2010 2010 0 80
## 5 Africa Angola AO Bengo -8.53 13.7 2010 2010 0 80
## 6 Africa Angola AO Bengo -8.60 13.6 2010 2010 0 80
## 7 Africa Angola AO Bengo -8.63 14.0 2010 2010 0 80
## 8 Africa Angola AO Bengo -8.62 13.8 2010 2010 0 80
## 9 Africa Angola AO Bengo -8.64 14.0 2010 2010 0 80
## 10 Africa Angola AO Bengo -8.60 13.6 2010 2010 0 80
## # … with 846 more rows, 3 more variables: Individuals_surveyed <dbl>,
## # Number_Positives <dbl>, Prevalence <dbl>, and abbreviated variable names
## # ¹Latitude, ²Longitude, ³Year_start, ⁴Year_end, ⁵Age_start
# Note that there are different logical operators that can be used to filter your data in many
# different ways. Also note the previous command only displays the output, which is normal when
# you are trying to explore different functions and options.
# In order to save the output, you need to assign it to a new tibble.
ascaris.africa <- filter(ascaris_data, Region == "Africa")
ascaris.africa
## # A tibble: 856 × 13
## Region Country ISO_code ADM1 Latit…¹ Longi…² Year_…³ Year_…⁴ Age_s…⁵ Age_end
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa Angola AO Bengo -8.59 13.6 2010 2010 0 80
## 2 Africa Angola AO Bengo -8.63 13.7 2010 2010 0 80
## 3 Africa Angola AO Bengo -8.61 13.6 2010 2010 0 80
## 4 Africa Angola AO Bengo -8.62 14.2 2010 2010 0 80
## 5 Africa Angola AO Bengo -8.53 13.7 2010 2010 0 80
## 6 Africa Angola AO Bengo -8.60 13.6 2010 2010 0 80
## 7 Africa Angola AO Bengo -8.63 14.0 2010 2010 0 80
## 8 Africa Angola AO Bengo -8.62 13.8 2010 2010 0 80
## 9 Africa Angola AO Bengo -8.64 14.0 2010 2010 0 80
## 10 Africa Angola AO Bengo -8.60 13.6 2010 2010 0 80
## # … with 846 more rows, 3 more variables: Individuals_surveyed <dbl>,
## # Number_Positives <dbl>, Prevalence <dbl>, and abbreviated variable names
## # ¹Latitude, ²Longitude, ³Year_start, ⁴Year_end, ⁵Age_start
# We can also match against several values at once by using "%in%". In this case, we
# want to select samples from either Ghana or Malawi. Note we are using the "c" function as we have
# done previously.
ascaris.GhanaAndMalawi <- filter(ascaris_data, Country %in% c("Ghana", "Malawi"))
ascaris.GhanaAndMalawi
## # A tibble: 110 × 13
## Region Country ISO_code ADM1 Latit…¹ Longi…² Year_…³ Year_…⁴ Age_s…⁵ Age_end
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa Ghana GH Asha… 7.30 -1.69 2008 2008 9 18
## 2 Africa Ghana GH Cent… 5.13 -1.20 2008 2008 8 16
## 3 Africa Ghana GH Cent… 5.52 -1.33 2008 2008 8 15
## 4 Africa Ghana GH Cent… 5.58 -1.47 2008 2008 8 17
## 5 Africa Ghana GH Nort… 9.08 -1.83 2008 2008 10 18
## 6 Africa Ghana GH Asha… 6.59 -1.86 2008 2008 10 15
## 7 Africa Ghana GH Uppe… 11.0 -0.484 2008 2008 11 19
## 8 Africa Ghana GH Asha… 6.58 -1.12 2008 2008 9 18
## 9 Africa Ghana GH Nort… 8.47 -2.17 2008 2008 8 18
## 10 Africa Ghana GH Nort… 9.68 0.173 2008 2008 7 18
## # … with 100 more rows, 3 more variables: Individuals_surveyed <dbl>,
## # Number_Positives <dbl>, Prevalence <dbl>, and abbreviated variable names
## # ¹Latitude, ²Longitude, ³Year_start, ⁴Year_end, ⁵Age_start
# Sometimes, we want to exclude certain data. Adding a "!" in front of the condition means we want
# all the rows EXCEPT those where the Country is Angola.
ascaris.NotAngola <- filter(ascaris_data, !(Country == "Angola"))
ascaris.NotAngola
## # A tibble: 951 × 13
## Region Country ISO_code ADM1 Latit…¹ Longi…² Year_…³ Year_…⁴ Age_s…⁵ Age_end
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa Burundi BI Buru… -3.86 29.5 2007 2007 10 16
## 2 Africa Burundi BI Cibi… -2.84 29.3 2007 2007 10 16
## 3 Africa Burundi BI Ngozi -2.84 29.7 2007 2007 10 16
## 4 Africa Burundi BI Buba… -3.16 29.4 2007 2007 10 16
## 5 Africa Burundi BI Buju… -3.42 29.4 2007 2007 10 16
## 6 Africa Burundi BI Mwaro -3.53 29.7 2007 2007 10 16
## 7 Africa Burundi BI Maka… -4.20 29.7 2007 2007 10 16
## 8 Africa Burundi BI Muyi… -2.90 30.4 2007 2007 10 16
## 9 Africa Burundi BI Kiru… -2.53 30.4 2007 2007 10 16
## 10 Africa Burundi BI Cank… -3.23 30.6 2007 2007 10 16
## # … with 941 more rows, 3 more variables: Individuals_surveyed <dbl>,
## # Number_Positives <dbl>, Prevalence <dbl>, and abbreviated variable names
## # ¹Latitude, ²Longitude, ³Year_start, ⁴Year_end, ⁵Age_start
# Similarly, instead of the "!" in front of Country, we could have also used "!=" which indicates
# "does not equal".
ascaris.NotAngola <- filter(ascaris_data, Country != "Angola")
ascaris.NotAngola
## # A tibble: 951 × 13
## Region Country ISO_code ADM1 Latit…¹ Longi…² Year_…³ Year_…⁴ Age_s…⁵ Age_end
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa Burundi BI Buru… -3.86 29.5 2007 2007 10 16
## 2 Africa Burundi BI Cibi… -2.84 29.3 2007 2007 10 16
## 3 Africa Burundi BI Ngozi -2.84 29.7 2007 2007 10 16
## 4 Africa Burundi BI Buba… -3.16 29.4 2007 2007 10 16
## 5 Africa Burundi BI Buju… -3.42 29.4 2007 2007 10 16
## 6 Africa Burundi BI Mwaro -3.53 29.7 2007 2007 10 16
## 7 Africa Burundi BI Maka… -4.20 29.7 2007 2007 10 16
## 8 Africa Burundi BI Muyi… -2.90 30.4 2007 2007 10 16
## 9 Africa Burundi BI Kiru… -2.53 30.4 2007 2007 10 16
## 10 Africa Burundi BI Cank… -3.23 30.6 2007 2007 10 16
## # … with 941 more rows, 3 more variables: Individuals_surveyed <dbl>,
## # Number_Positives <dbl>, Prevalence <dbl>, and abbreviated variable names
## # ¹Latitude, ²Longitude, ³Year_start, ⁴Year_end, ⁵Age_start
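The comments above mention that other logical operators can be used inside filter(). The sketch below shows three common ones (">", "&" and "|") on a tiny made-up tibble. The table "toy", its columns and its values are hypothetical and are not taken from the GAHI data.

```r
library(dplyr)

# Made-up survey sites, for illustration only
toy <- tibble(Country = c("Ghana", "Ghana", "Malawi", "Uganda"),
              Year = c(2008, 2010, 2004, 2006),
              Surveyed = c(50, 120, 80, 200))

# ">" keeps rows where the value is greater than a threshold
over_100 <- filter(toy, Surveyed > 100)
over_100

# "&" means AND: both conditions must be true
ghana_recent <- filter(toy, Country == "Ghana" & Year >= 2010)
ghana_recent

# "|" means OR: at least one of the conditions must be true
malawi_or_large <- filter(toy, Country == "Malawi" | Surveyed > 150)
malawi_or_large
```

Separating conditions with a comma inside filter() behaves like "&", so filter(toy, Country == "Ghana", Year >= 2010) gives the same rows as the second example.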
Exercises
1. Can you create a tibble that contains all the rows except those from Senegal and Philippines?
2. Can you create a tibble with all sites with a prevalence greater than 0.5?
Select
It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
# Let's subset the data to extract Country and Prevalence
select(ascaris_data, Country, Prevalence)
## # A tibble: 989 × 2
## Country Prevalence
## <chr> <dbl>
## 1 Angola 0.143
## 2 Angola 0.167
## 3 Angola 0.0964
## 4 Angola 0.381
## 5 Angola 0.0952
## 6 Angola 0.05
## 7 Angola 0.170
## 8 Angola 0
## 9 Angola 0.0635
## 10 Angola 0.677
## # … with 979 more rows
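Typing every column name becomes tedious in wide tables, so select() also accepts a few shortcuts. The sketch below illustrates three of them on a made-up tibble (the name "toy" and its values are hypothetical): the helper starts_with(), a minus sign to drop a column, and a colon to pick a run of neighbouring columns.

```r
library(dplyr)

# Made-up table with a few of the same column names as the real data
toy <- tibble(Country = c("Ghana", "Malawi"),
              Year_start = c(2008, 2004),
              Year_end = c(2008, 2005),
              Prevalence = c(0.02, 0.01))

# Keep every column whose name starts with "Year"
years_only <- select(toy, starts_with("Year"))
years_only

# A minus sign drops a column and keeps the rest
no_prevalence <- select(toy, -Prevalence)
no_prevalence

# A colon selects a run of neighbouring columns
first_three <- select(toy, Country:Year_end)
first_three
```

These shortcuts only change which columns are shown; the rows are left untouched.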
Mutate
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().
mutate() always adds new columns at the end of your dataset, so in a wide table the new variable can be hard to spot in the printed output; below we use select() afterwards to look at just the columns of interest. Remember that when you’re in RStudio, the easiest way to see all the columns is View().
# The prevalence is determined by dividing the number of positive infection cases by the total
# number of individuals tested.
ascaris_data_check <- mutate(ascaris_data, Prevalence_2 = Number_Positives / Individuals_surveyed)
ascaris_data_check
## # A tibble: 989 × 14
## Region Country ISO_code ADM1 Latit…¹ Longi…² Year_…³ Year_…⁴ Age_s…⁵ Age_end
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa Angola AO Bengo -8.59 13.6 2010 2010 0 80
## 2 Africa Angola AO Bengo -8.63 13.7 2010 2010 0 80
## 3 Africa Angola AO Bengo -8.61 13.6 2010 2010 0 80
## 4 Africa Angola AO Bengo -8.62 14.2 2010 2010 0 80
## 5 Africa Angola AO Bengo -8.53 13.7 2010 2010 0 80
## 6 Africa Angola AO Bengo -8.60 13.6 2010 2010 0 80
## 7 Africa Angola AO Bengo -8.63 14.0 2010 2010 0 80
## 8 Africa Angola AO Bengo -8.62 13.8 2010 2010 0 80
## 9 Africa Angola AO Bengo -8.64 14.0 2010 2010 0 80
## 10 Africa Angola AO Bengo -8.60 13.6 2010 2010 0 80
## # … with 979 more rows, 4 more variables: Individuals_surveyed <dbl>,
## # Number_Positives <dbl>, Prevalence <dbl>, Prevalence_2 <dbl>, and
## # abbreviated variable names ¹Latitude, ²Longitude, ³Year_start, ⁴Year_end,
## # ⁵Age_start
# There are a lot of columns, so it's difficult to see them all at once.
# Let's select only the two columns Prevalence and Prevalence_2 to check if the calculations are consistent.
select(ascaris_data_check, Prevalence, Prevalence_2)
## # A tibble: 989 × 2
## Prevalence Prevalence_2
## <dbl> <dbl>
## 1 0.143 0.143
## 2 0.167 0.167
## 3 0.0964 0.0964
## 4 0.381 0.381
## 5 0.0952 0.0952
## 6 0.05 0.05
## 7 0.170 0.170
## 8 0 0
## 9 0.0635 0.0635
## 10 0.677 0.677
## # … with 979 more rows
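Comparing two columns by eye only works for the first few rows. One possible way to check every row is sketched below on a made-up tibble (the table "toy", the column "agrees" and all values are hypothetical). It uses near() from dplyr, which compares numbers with a small tolerance and is therefore safer than "==" for decimals.

```r
library(dplyr)

# Made-up counts; the second recorded prevalence has been rounded on purpose
toy <- tibble(Individuals_surveyed = c(10, 30, 7),
              Number_Positives = c(1, 10, 0),
              Prevalence = c(0.1, 0.333, 0))

# Recalculate the prevalence, then flag the rows where the two values agree.
# A column created in mutate() can be used straight away in the same call.
toy_check <- mutate(toy,
                    Prevalence_2 = Number_Positives / Individuals_surveyed,
                    agrees = near(Prevalence, Prevalence_2))
toy_check

# How many rows disagree?
sum(!toy_check$agrees)
```

In this toy table only the second row disagrees, because 0.333 is a rounded version of 10 / 30.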
Exercise
Can you make a new column called study_time, and in it, determine the number of years that sampling took place?
Hint: you could calculate the difference between Year_end and Year_start. Try to save the result in a new tibble, as it is often good practice to keep the raw data (table) unchanged and save the processed data into new variables / tibbles.
Arrange
arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
# First, let's arrange the data by parasite prevalence in ascending order
arrange(ascaris_data, Prevalence)
## # A tibble: 989 × 13
## Region Country ISO_code ADM1 Latit…¹ Longi…² Year_…³ Year_…⁴ Age_s…⁵ Age_end
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa Angola AO Bengo -8.62 13.8 2010 2010 0 80
## 2 Africa Angola AO Bengo -8.39 13.8 2010 2010 0 80
## 3 Africa Angola AO Bengo -8.53 13.7 2010 2010 0 80
## 4 Africa Angola AO Bengo -8.54 13.9 2010 2010 0 80
## 5 Africa Angola AO Bengo -8.64 13.8 2010 2010 0 80
## 6 Africa Angola AO Bengo -8.63 14.0 2010 2010 0 80
## 7 Africa Angola AO Bengo -8.44 13.8 2010 2010 0 80
## 8 Africa Burundi BI Buru… -3.86 29.5 2007 2007 10 16
## 9 Africa Burundi BI Cibi… -2.78 29.1 2007 2007 10 16
## 10 Africa Burundi BI Maka… -4.30 29.6 2007 2007 10 16
## # … with 979 more rows, 3 more variables: Individuals_surveyed <dbl>,
## # Number_Positives <dbl>, Prevalence <dbl>, and abbreviated variable names
## # ¹Latitude, ²Longitude, ³Year_start, ⁴Year_end, ⁵Age_start
# Use desc() to re-order by a column in descending order:
arrange(ascaris_data, desc(Prevalence))
## # A tibble: 989 × 13
## Region Country ISO_c…¹ ADM1 Latit…² Longi…³ Year_…⁴ Year_…⁵ Age_s…⁶ Age_end
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 East a… Philip… PH "Reg… 16.0 120. 2000 2000 5 7
## 2 East a… China CN "Yun… 21.8 100. 2006 2006 4 99
## 3 Africa Uganda UG "Kis… -1.34 29.8 2005 2005 5 16
## 4 Africa Nigeria NG "Ogu… 6.44 4.41 2003 2005 5 16
## 5 Africa Nigeria NG "Ogu… 7.23 3.53 2003 2005 5 16
## 6 Africa Uganda UG "Kis… -1.19 29.7 2005 2005 5 16
## 7 East a… Philip… PH " " 14.4 121. 2000 2000 5 7
## 8 East a… Philip… PH "Reg… 16.0 120. 2000 2000 5 7
## 9 East a… Philip… PH "Reg… 15.3 121. 2000 2000 5 7
## 10 Africa Uganda UG "Kab… -1.23 29.9 2006 2006 1 50
## # … with 979 more rows, 3 more variables: Individuals_surveyed <dbl>,
## # Number_Positives <dbl>, Prevalence <dbl>, and abbreviated variable names
## # ¹ISO_code, ²Latitude, ³Longitude, ⁴Year_start, ⁵Year_end, ⁶Age_start
# We can also sort on two columns. The order in which they are sorted is based on the order in which
# they are placed in the command.
arrange(ascaris_data, Country, Prevalence)
## # A tibble: 989 × 13
## Region Country ISO_code ADM1 Latit…¹ Longi…² Year_…³ Year_…⁴ Age_s…⁵ Age_end
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa Angola AO Bengo -8.62 13.8 2010 2010 0 80
## 2 Africa Angola AO Bengo -8.39 13.8 2010 2010 0 80
## 3 Africa Angola AO Bengo -8.53 13.7 2010 2010 0 80
## 4 Africa Angola AO Bengo -8.54 13.9 2010 2010 0 80
## 5 Africa Angola AO Bengo -8.64 13.8 2010 2010 0 80
## 6 Africa Angola AO Bengo -8.63 14.0 2010 2010 0 80
## 7 Africa Angola AO Bengo -8.44 13.8 2010 2010 0 80
## 8 Africa Angola AO Bengo -8.52 13.7 2010 2010 0 80
## 9 Africa Angola AO Bengo -8.57 13.5 2010 2010 0 80
## 10 Africa Angola AO Bengo -8.39 13.7 2010 2010 0 80
## # … with 979 more rows, 3 more variables: Individuals_surveyed <dbl>,
## # Number_Positives <dbl>, Prevalence <dbl>, and abbreviated variable names
## # ¹Latitude, ²Longitude, ³Year_start, ⁴Year_end, ⁵Age_start
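Tie-breaking is easier to follow on a table small enough to read in full. The sketch below uses a made-up tibble (the name "toy" and its values are hypothetical) to show how the order of the columns given to arrange() changes the result, and that desc() can be applied to just one of them.

```r
library(dplyr)

# Made-up table, for illustration only
toy <- tibble(Country = c("Uganda", "Ghana", "Uganda", "Ghana"),
              Prevalence = c(0.4, 0.1, 0.2, 0.3))

# Sort by Country first; Prevalence only decides the order within each country
by_country_first <- arrange(toy, Country, Prevalence)
by_country_first

# Swapping the columns changes the result: Prevalence now decides the overall order
by_prevalence_first <- arrange(toy, Prevalence, Country)
by_prevalence_first

# desc() can be applied to one column only: countries A to Z, highest prevalence first
highest_within_country <- arrange(toy, Country, desc(Prevalence))
highest_within_country
```

In the first result the Prevalence column reads 0.1, 0.3, 0.2, 0.4, which is sorted within Ghana and within Uganda but not overall.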
Group_by and Summarise
The last key verb is summarise(). It collapses a data frame to a single row.
Together, group_by() and summarise() will allow you to produce grouped summaries, one of the more useful features of dplyr and one you will likely use a lot.
# let's group the studies by Country, and save it into another tibble by_country
by_country <- group_by(ascaris_data, Country)
# examine the dataset to see what the difference is
by_country
summarise() collapses a data frame to a single row, using the summary function you give it (such as mean()). It is most useful when paired with group_by(): when you use the dplyr verbs on a grouped data frame, they are automatically applied “by group”. For example, if we apply exactly the same code to a data frame grouped by country, we get the average Prevalence by country.
# average prevalence of all studies in the dataset
# you get a number but this may not be as informative
summarise(ascaris_data, average.Prevalence = mean(Prevalence) )
## # A tibble: 1 × 1
## average.Prevalence
## <dbl>
## 1 0.102
This one number is the mean of Prevalence across all studies, which may not be very informative if we are looking at regional differences. Hence we do the following:
# now we want to get the mean prevalence across the studies in each country
summarise(by_country, average.Prevalence = mean(Prevalence) )
## # A tibble: 13 × 2
## Country average.Prevalence
## <chr> <dbl>
## 1 Angola 0.123
## 2 Burundi 0.155
## 3 Cameroon 0.592
## 4 China 0.926
## 5 Cote D'Ivoire 0.051
## 6 Eritrea 0.00188
## 7 Ghana 0.0226
## 8 Malawi 0.0127
## 9 Nigeria 0.443
## 10 Philippines 0.283
## 11 Senegal 0.0942
## 12 Sierra Leone 0.0710
## 13 Uganda 0.0633
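To check your understanding of what grouping changes, it can help to repeat the two steps on a table small enough to work out by hand. The sketch below uses a made-up tibble; the names toy, toy_by_country and toy_summary, the column highest.Prevalence and all values are hypothetical.

```r
library(dplyr)

# Made-up table: two sites in Ghana, three in Malawi
toy <- tibble(Country = c("Ghana", "Ghana", "Malawi", "Malawi", "Malawi"),
              Prevalence = c(0.25, 0.75, 0.25, 0.25, 0.25))

# Without grouping: one row for the whole table
toy_overall <- summarise(toy, average.Prevalence = mean(Prevalence))
toy_overall

# With grouping: one row per country, and more than one summary at a time
toy_by_country <- group_by(toy, Country)
toy_summary <- summarise(toy_by_country,
                         average.Prevalence = mean(Prevalence),
                         highest.Prevalence = max(Prevalence))
toy_summary
```

Here the ungrouped average is 0.35, whereas the grouped summary gives 0.5 for Ghana and 0.25 for Malawi, a difference that the single overall number hides.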
Grouped summaries, generated using group_by() and summarise(), provide a really useful tool for exploring data when working with dplyr. But before we go any further with this, we need to introduce a powerful new idea: the pipe.
Pipe
Imagine that we want to find the countries with the highest average prevalence, leaving out countries that were surveyed at only one site. Using what you know about dplyr, you might write code like this.
# Let's first group the data by country
by_country <- group_by(ascaris_data, Country)
# now we want to get the mean prevalence per country
by_country_summary <- summarise(by_country, average.Prevalence = mean(Prevalence), Num.sites = n() )
# note: here we introduce another function n() which tallies the number of rows in a group.
# Let's take a look:
by_country_summary
## # A tibble: 13 × 3
## Country average.Prevalence Num.sites
## <chr> <dbl> <int>
## 1 Angola 0.123 38
## 2 Burundi 0.155 22
## 3 Cameroon 0.592 1
## 4 China 0.926 1
## 5 Cote D'Ivoire 0.051 1
## 6 Eritrea 0.00188 40
## 7 Ghana 0.0226 77
## 8 Malawi 0.0127 33
## 9 Nigeria 0.443 20
## 10 Philippines 0.283 132
## 11 Senegal 0.0942 106
## 12 Sierra Leone 0.0710 52
## 13 Uganda 0.0633 466
# Let's imagine there are a lot of countries (even though there are only 13 of them),
# and we want to sort by prevalence in order to pick out the country with the highest Prevalence.
# Let's arrange the tibble by prevalence.
by_country_summary_sortHighest <- arrange(by_country_summary, desc(average.Prevalence) )
# take a look
by_country_summary_sortHighest
## # A tibble: 13 × 3
## Country average.Prevalence Num.sites
## <chr> <dbl> <int>
## 1 China 0.926 1
## 2 Cameroon 0.592 1
## 3 Nigeria 0.443 20
## 4 Philippines 0.283 132
## 5 Burundi 0.155 22
## 6 Angola 0.123 38
## 7 Senegal 0.0942 106
## 8 Sierra Leone 0.0710 52
## 9 Uganda 0.0633 466
## 10 Cote D'Ivoire 0.051 1
## 11 Ghana 0.0226 77
## 12 Malawi 0.0127 33
## 13 Eritrea 0.00188 40
# We notice that some countries have only a single site, so they may not be representative enough
# and we would like to remove them. We can remove these using a filter:
by_country_summary_sortHighest_removeSomeCountry <- filter(by_country_summary_sortHighest, Num.sites > 1 )
# take a look
by_country_summary_sortHighest_removeSomeCountry
## # A tibble: 10 × 3
## Country average.Prevalence Num.sites
## <chr> <dbl> <int>
## 1 Nigeria 0.443 20
## 2 Philippines 0.283 132
## 3 Burundi 0.155 22
## 4 Angola 0.123 38
## 5 Senegal 0.0942 106
## 6 Sierra Leone 0.0710 52
## 7 Uganda 0.0633 466
## 8 Ghana 0.0226 77
## 9 Malawi 0.0127 33
## 10 Eritrea 0.00188 40
Notice that we have used four steps to prepare the tibble by_country_summary_sortHighest_removeSomeCountry:
- Group by Country to produce a tibble by_country
- Summarise to compute the average prevalence average.Prevalence and the number of sites Num.sites, producing a tibble by_country_summary
- Arrange the tibble so that the highest average prevalence is displayed at the top, producing the tibble by_country_summary_sortHighest
- Filter out countries where only one site was investigated to produce the final tibble by_country_summary_sortHighest_removeSomeCountry
This code is a little frustrating to write because we have to give each intermediate data frame a name and store it in a variable, even though we don’t care about it. Naming things can be annoying and time consuming, so this slows down our analysis.
There’s another way to tackle the same problem with the pipe, %>%. The same code above can be rewritten as follows:
# now you don't need to give each intermediate tibble a new name because it's "piped" to the
# next command. Importantly, only one tibble is saved.
aveage.prevalence.by.country <-
  group_by(ascaris_data, Country) %>%
  summarise(average.Prevalence = mean(Prevalence), Num.sites = n() ) %>%
  arrange( desc(average.Prevalence) ) %>%
  filter( Num.sites > 1 )
# let's have a look if it's the same as the tedious
# by_country_summary_sortHighest_removeSomeCountry tibble
aveage.prevalence.by.country
## # A tibble: 10 × 3
## Country average.Prevalence Num.sites
## <chr> <dbl> <int>
## 1 Nigeria 0.443 20
## 2 Philippines 0.283 132
## 3 Burundi 0.155 22
## 4 Angola 0.123 38
## 5 Senegal 0.0942 106
## 6 Sierra Leone 0.0710 52
## 7 Uganda 0.0633 466
## 8 Ghana 0.0226 77
## 9 Malawi 0.0127 33
## 10 Eritrea 0.00188 40
This focuses on the transformations, not what’s being transformed, which makes the code easier to read. You can read it as a series of imperative statements, one per line of code: group_by(), then summarise(), then arrange(), then filter(). As suggested by this reading, a good way to pronounce %>% when reading code is “then”.
Note also that in the code we have indented the second and subsequent lines to show that this is a single block of code.
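Under the hood, x %>% f(y) simply means f(x, y): the pipe passes whatever is on its left as the first argument of the function on its right. The sketch below writes the same steps once as nested calls and once as a pipeline so you can compare them. The `toy` tibble is made up for illustration and is not part of the course data.

```r
library(dplyr)

# Hypothetical toy data: four survey sites in three made-up countries
toy <- tibble(
  Country    = c("A", "A", "B", "C"),
  Prevalence = c(0.10, 0.30, 0.50, 0.05)
)

# Nested calls: you have to read from the inside out
nested <- arrange(
  summarise(group_by(toy, Country), average.Prevalence = mean(Prevalence)),
  desc(average.Prevalence)
)

# Piped: you read from top to bottom, saying "then" at each %>%
piped <- toy %>%
  group_by(Country) %>%
  summarise(average.Prevalence = mean(Prevalence)) %>%
  arrange(desc(average.Prevalence))

# Compare the two results
all.equal(nested, piped)
```

Both versions do the same work; the piped form just puts the steps in the order in which they happen.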
The point of the pipe is to help you write code in a way that is easier to read and understand. Let’s try one more example.
# We want to examine the variation in prevalence between sites. Since we are exploring
# the data, we are only interested in a subset of columns
# -> this means select
# and we want to see the sites with the highest prevalence
# -> this means arrange from highest to lowest prevalence
ascaris_data %>%
  select(Country, Prevalence) %>%
  arrange(desc(Prevalence))
## # A tibble: 989 × 2
## Country Prevalence
## <chr> <dbl>
## 1 Philippines 0.952
## 2 China 0.926
## 3 Uganda 0.893
## 4 Nigeria 0.871
## 5 Nigeria 0.847
## 6 Uganda 0.825
## 7 Philippines 0.811
## 8 Philippines 0.771
## 9 Philippines 0.767
## 10 Uganda 0.762
## # … with 979 more rows
# From the output it seems the prevalence at some sites in the Philippines is quite high?
Working with the pipe is one of the key criteria for belonging to the tidyverse.
Combine table
It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.
Relations are always defined between a pair of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair. Sometimes both elements of a pair can be the same table! This is needed if, for example, you have a table of people, and each person has a reference to their parents.
To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data: filtering joins, mutating joins and set operations. Here we are going to demonstrate the bind_rows() function, which stacks the rows of several tables and is grouped here with the set operations, as shown in Figure 13 below.
# first load the schisto data followed by the hookworm data.
schisto_data_url <- 'https://github.com/WGCAdvancedCourses/Helminths_2021/raw/main/manuals/module_5_R/Schistosoma_mansoni_prevalence_data.txt'
schisto_data <- read_delim(file = schisto_data_url, delim ="\t")
hookworm_data_url <- 'https://github.com/WGCAdvancedCourses/Helminths_2021/raw/main/manuals/module_5_R/Hookworms_prevalence_data.txt'
hookworm_data <- read_delim(file = hookworm_data_url, delim ="\t")
# the .id argument takes a string that names a new column ("disease" here); each row is
# labelled in that column with the name given to the tibble it came from.
# In this example, rows from the ascaris tibble will be assigned "ascaris" in the "disease"
# column of the new combined.tb tibble
combined.tb <- bind_rows("ascaris" = ascaris_data,"schisto" = schisto_data, .id = "disease")
combined.tb
# you can merge multiple files with the same column format
combined.tb2 <- bind_rows("ascaris" = ascaris_data,"schisto" = schisto_data, "hookworm" = hookworm_data, .id = "disease")
combined.tb2
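If you would like to check what bind_rows() and .id do without downloading anything, here is a minimal sketch on two tiny tables. The tibbles `toy_ascaris` and `toy_schisto` and their values are made up for illustration; only the bind_rows() call mirrors the one used above.

```r
library(dplyr)

# Hypothetical stand-ins for two prevalence tables with the same columns
toy_ascaris <- tibble(Country = c("A", "B"), Prevalence = c(0.10, 0.20))
toy_schisto <- tibble(Country = c("B", "C"), Prevalence = c(0.30, 0.40))

# Stack the rows; the names given to each tibble fill the new "disease" column
toy_combined <- bind_rows("ascaris" = toy_ascaris,
                          "schisto" = toy_schisto,
                          .id = "disease")
toy_combined

# Count how many rows came from each table
count(toy_combined, disease)
```

The combined tibble keeps every row of both inputs, so a country that appears in both tables (here "B") appears twice, once per disease.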
Part 3: Visualisation using ggplot2
Data visualisation ggplot2
“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
We will now use ggplot2 to visualise the data that we’ve been wrangling in the previous sections. ggplot2 is part of the tidyverse package which you’ve already loaded this morning.
“R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.” R4DS
With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = ascaris_data) creates an empty graph, but it’s not very interesting so we are not going to show it here.
You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot. We will show a few more examples later.
Let’s look at the ascaris_data tibble in more detail, using ggplot to explore. Among the variables we have the number of individuals surveyed (Individuals_surveyed) and the number of positive infection cases (Number_Positives).
Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes() (the aesthetics), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case, ascaris_data.
To begin to explore the ascaris_data tibble, run this code to put Individuals_surveyed on the x-axis and Number_Positives on the y-axis:
ggplot(ascaris_data) +
geom_point( aes(x = Individuals_surveyed, y = Number_Positives))
The plot shows a positive relationship between Individuals_surveyed and Number_Positives. In other words, more individuals surveyed also means more cases. Does this confirm or refute our hypothesis about number of individuals and positives?
Let’s turn this code into a reusable template for making graphs with ggplot2. To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings.
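As a sketch of what such a template can look like, the comment at the top of the block below gives the general shape, with the parts to replace written in capitals. We are assuming here that it follows the template given in R4DS. The rest of the block fills the template in for a small made-up data frame called `toy`, which is not part of the course data.

```r
library(ggplot2)

# General shape (placeholders in capitals, not runnable as written):
#   ggplot(data = DATA) +
#     GEOM_FUNCTION(mapping = aes(MAPPINGS))

# Hypothetical toy data standing in for a real survey table
toy <- data.frame(
  Individuals_surveyed = c(50, 120, 200),
  Number_Positives     = c(5, 30, 44)
)

# The template filled in: DATA = toy, GEOM_FUNCTION = geom_point,
# MAPPINGS = the x and y variables
p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = Individuals_surveyed, y = Number_Positives))

# Printing the object draws the plot
p
```

Swapping geom_point for another geom function, or changing the variables inside aes(), gives a different plot from the same three ingredients.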
Aesthetic mappings
You can add a third variable, like Country, to a two-dimensional scatterplot by mapping it to an aesthetic.
An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing the values of its aesthetic properties.
You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. For example, you can map the colors of your points to the Country variable to reveal the country of each site.
# Note that the spellings color and colour are both accepted
ggplot(ascaris_data) +
geom_point( aes(x = Individuals_surveyed, y = Number_Positives, colour=Country))
Facets
One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, which are subplots that each display one subset of the data.
To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.
ggplot(ascaris_data) +
geom_point( aes(x = Individuals_surveyed, y = Number_Positives, colour=Country)) +
facet_wrap(~ Country, nrow = 4)
# Note that the axes of each facet are set to be the same - they are "fixed". This can be
# really useful to compare between facets. However, this is not always so useful...
# In this case, fixed axes make the data difficult to interpret because the number of
# individuals sampled differs vastly between countries. We can improve this by adding another
# option to make the axis scales "free", or independent for each facet. Try the following:
ggplot(ascaris_data) +
geom_point( aes(x = Individuals_surveyed, y = Number_Positives, colour=Country)) +
facet_wrap(~ Country, nrow = 4, scales="free")
Boxplot
From the quick scatterplot it seems that there are variations in patterns of parasite prevalence in different countries. Another useful plot would be a boxplot, which can be called by the geom_boxplot() function. Here we list out the commands that improve the plot bit by bit to generate a polished figure.
# Let's make a default boxplot and separated by Country
ggplot(ascaris_data) +
geom_boxplot(aes(x = Country, y = Prevalence))
# Colour or "fill" the boxes with a colour representing the "Region" of the world the
# samples come from.
# Note: you could alternatively fill the boxes with colours based on Country, but such
# information is redundant (other than being pretty) because you already have it on the x-axis
ggplot(ascaris_data) +
geom_boxplot(aes(x = Country, y = Prevalence, fill = Region))
# Reorder the country based on Prevalence
ggplot(ascaris_data) +
geom_boxplot(aes(x = reorder(Country, Prevalence), y = Prevalence, fill = Region))
# We are usually interested in which country has the highest prevalence, so let's reorder the
# boxes based on descending order of prevalence
ggplot(ascaris_data) +
geom_boxplot(aes(x = reorder(Country, -Prevalence), y = Prevalence, fill = Region))
# The previous plot shows that the country labels on the x-axis overlap.
# Let's tilt them and adjust accordingly
ggplot(ascaris_data) +
geom_boxplot(aes(x = reorder(Country, -Prevalence), y = Prevalence, fill = Region)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
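It may not be obvious what reorder() does inside aes(). It turns the first variable into a factor whose levels are sorted by a summary of the second variable within each level (the mean, by default), and ggplot2 draws the boxes in the order of those levels. Here is a minimal sketch on a made-up data frame `toy` (not part of the course data) where the means are easy to work out by hand:

```r
# Hypothetical toy data: two sites in each of three made-up countries
# Mean prevalence is 0.15 for A, 0.70 for B and 0.40 for C
toy <- data.frame(
  Country    = c("A", "A", "B", "B", "C", "C"),
  Prevalence = c(0.10, 0.20, 0.60, 0.80, 0.30, 0.50)
)

# Default: levels sorted by increasing mean Prevalence
increasing <- levels(reorder(toy$Country, toy$Prevalence))
increasing

# Negating the values reverses the order, so the highest mean comes first
decreasing <- levels(reorder(toy$Country, -toy$Prevalence))
decreasing
```

Note that a boxplot draws the median while reorder() sorts by the mean unless told otherwise, so the boxes may not look perfectly sorted; passing FUN = median to reorder() would sort by the median instead.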
Exercises
Can you polish the figure further with the following commands?
1. Try writing a different x label. Use the ?xlab command to look at the details
2. or alternatively type ?labs
3. Try a different colour palette for the boxplot. Hint: ?scale_fill_brewer
More data wrangling + ggplot
Using what we’ve just learnt, now let’s go through a case study where we do some data wrangling and produce a final plot.
# Let's show you another shortcut (bioinformaticians love a short cut) - using a pipe,
# we can direct data from our dataset/tibble directly to ggplot
ascaris_data %>%
  ggplot() +
  geom_boxplot(aes(x = reorder(Country, -Prevalence), y = Prevalence, fill = Region)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
# Similarly, we can perform our "data wrangling" and directly pipe that to ggplot, thus producing
# a cleaner code block.
# From the 'by_country_summary_sortHighest' tibble,
# we know there are ten countries with more than one site,
by_country_More_than_one_site <- by_country_summary_sortHighest %>%
  filter(Num.sites > 1) %>%
  select(Country)
# maybe we'll just include these countries for now?
ascaris_data %>%
  filter(Country %in% by_country_More_than_one_site$Country ) %>%
  ggplot() +
  geom_boxplot(aes(x = reorder(Country,-Prevalence), y = Prevalence, fill = Region)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
# Another shortcut is to do the wrangling inside the filter itself, without saving a tibble first.
# The two code chunks above can then be rewritten as follows and give the same result
ascaris_data %>%
  filter(Country %in% select( filter(by_country_summary_sortHighest, Num.sites > 1) , Country)$Country ) %>%
  ggplot() +
  geom_boxplot(aes(x = reorder(Country,-Prevalence), y = Prevalence, fill = Region)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
# We can now reuse the previous code with the combined tables
# ascaris + schisto
combined.tb %>%
  ggplot() +
  geom_boxplot(aes(x = reorder(Country,-Prevalence), y = Prevalence, fill=disease)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
# ascaris + schisto + hookworms
combined.tb2 %>%
  ggplot() +
  geom_boxplot(aes(x = reorder(Country,-Prevalence), y = Prevalence, fill=disease)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
# this figure is a bit messy. Can we improve that later?
# Ok. Now we are interested in the prevalence of diseases by country, so we may want to
# group by disease, then by country
prevalence.by.disease.and.country <- combined.tb2 %>%
  group_by(disease, Country) %>%
  summarise( sites = n(),
             total.ind = sum(Individuals_surveyed),
             total.positives = sum(Number_Positives),
             average.Prevalence = mean(Prevalence) )
prevalence.by.disease.and.country
## # A tibble: 35 × 6
## # Groups: disease [3]
## disease Country sites total.ind total.positives average.Prevalence
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 ascaris Angola 38 1647 263 0.123
## 2 ascaris Burundi 22 1314 205 0.155
## 3 ascaris Cameroon 1 76 45 0.592
## 4 ascaris China 1 215 199 0.926
## 5 ascaris Cote D'Ivoire 1 118 6 0.051
## 6 ascaris Eritrea 40 1607 3 0.00188
## 7 ascaris Ghana 77 4518 104 0.0226
## 8 ascaris Malawi 33 1965 18 0.0127
## 9 ascaris Nigeria 20 1013 480 0.443
## 10 ascaris Philippines 132 12847 4005 0.283
## # … with 25 more rows
# In fact, we may only be interested in countries with data from all three diseases, so we
# need to find out which countries they are
country.with.data.of.all.diseases <- prevalence.by.disease.and.country %>%
  group_by(Country) %>%
  summarise( num.diseases = n() ) %>%
  filter(num.diseases == 3)
country.with.data.of.all.diseases
## # A tibble: 2 × 2
## Country num.diseases
## <chr> <int>
## 1 Sierra Leone 3
## 2 Uganda 3
# Repeat the previous graph, but only include the countries from the last command
combined.tb2 %>%
  filter(Country %in% country.with.data.of.all.diseases$Country) %>%
  ggplot() +
  geom_boxplot(aes(x = reorder(Country,-Prevalence), y = Prevalence, fill=disease)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1))
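The last two steps (find the countries that have data for every disease, then keep only their rows with %in%) can be tried on a tiny example first. This is only a sketch: the `toy` tibble is made up, and n_distinct() is used here as an alternative way to count diseases per country directly from the site-level rows rather than from a summary table.

```r
library(dplyr)

# Hypothetical toy data: country "A" has all three diseases, "B" only two
toy <- tibble(
  disease = c("ascaris", "ascaris", "schisto", "schisto", "hookworm"),
  Country = c("A", "B", "A", "B", "A")
)

# Count the distinct diseases recorded in each country and keep the complete ones
complete_countries <- toy %>%
  group_by(Country) %>%
  summarise(num.diseases = n_distinct(disease)) %>%
  filter(num.diseases == 3)

# %in% keeps the rows whose Country appears in the vector on the right
toy_complete <- toy %>%
  filter(Country %in% complete_countries$Country)
toy_complete
```

Only the three rows of country "A" survive the filter, because "B" has no hookworm record.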
Summary
In this module, we have shown you:
- Basics of R
- importing / exporting data
- manipulating / tidying data
- visualising data
But there’s so much more. In addition to the R4DS and ModernDive that were mentioned above, there are extensive resources online that will enable you to tackle more challenging data. Here we list a few more below that we find useful and hope you can go home and start using R more extensively in your daily research tasks.
- RStudio Education: a website containing resources tailored to different levels of R expertise (https://education.rstudio.com/ )
- datacamp - Introduction to R (https://www.datacamp.com/courses/free-introduction-to-r)
- ggplot2 cheatsheet (https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)