Genetic Variation

Steve Doyle, 2023

Overview and Aims
Quality control of raw sequencing data
Preparing your reference sequence prior to mapping
Short read mapping
Calling SNPs in our mapped sample
Visualising mapped reads and variants using Artemis
Mapping reads from multiple samples
Calling SNPs in multiple samples at the same time
Visualing SNP data using WormBase ParaSite and Artemis
Analysis of genetic variation using R
Principal component analysis of genetic diversity
Exploring genetic data using phylogenetic trees
Integrating genetic and geographic data: maps

1. Overview and Aims

Genetic variation can tell us a lot about an organisms evolutionary past, broad and fine-scale relationships within and between species, and the mechanisms by which organisms adapt to new environments or selection pressures such as drug treatmennt. The use of genomics to understand genetic variation offers insight into these processes at a range of resolutions, from single nucleotide polymorphisms genome-wide to chromosomal rearrangements.

Modern genomic approaches such as high-throughput sequencing has enabled not only the rapid increase in reference genome assemblies, but also the ability to resequence genomes from many individuals within a species. The re-sequencing of a genome typically aims to align or map new sequence data to a reference genome (please note that we will use the terms “aligning” and “mapping” interchangeably), and identify differences between the newly sequenced sample and the reference genome. These differences can vary in number and size, ranging from single-base substitutions called single nucleotide polymorphisms (SNPs), INsertions and DELetions (INDELs) that can range from one to many base pairs, to copy number variants (CNVs) of, for example, one or more genes. By comparing these different types of genetic variations between closely related populations or individual organisms, in is possible to learn how genetic differences that may cause phenotypic differences, such as drug resistance or increased virulence in pathogens, or changed susceptibility to disease in humans.

One important prerequisite for the mapping of sequence data to work is that the reference and the re-sequenced subject have the same genome architecture, ie, the genome sequence can not be too dissimilar to the resequenced data, else, the reads will not map to the genome accurately.

In this exercise, you will be analyzing genetic variation in the gastrointestinal helminth Haemonchus contortus. H. contortus is an important pathogen of wild and domesticated ruminants worldwide, and has a major impact on the health and economic viability of sheep and goat farming in particular. It is also a genetically tractable model used for drug discovery, vaccine development, and anthelmintic resistance research. A chromosome-scale reference genome assembly and manually curated genome annotation are both available to download and explore at WormBase Parasite. The sequencing data you will be using in this module is from two published studies - Salle et al 2019 Nature Communications and Doyle et al. 2020 Communications Biology - which describe the global and genome-wide genetic diversity of H. contortus, all of which was generated at the Wellcome Sanger Institute. Analysis of global diversity allows you to understand aspects of the species biology, such as how different populations are connected (which may be important to understand the spread of a pathogen or ongoing transmission), whether populations are growing or declining (perhaps in response to drug treatment), or the impact of selection on regions or specific genes throughout the genome.

Although whole-genome sequencing data was generated for these samples, we have extracted only the mitochondrial DNA-derived reads for you to work with. The main reason for this is that at this scale, the data should be able to be analysed efficiently on your computer without the need for high performance computing infrastructure and/or capacity.

To analyse these data, we will be working in both the unix and R command line environments. This is because we typically manipulate high throughput sequencing data such as those you will be using in unix, i.e., read mapping and SNP calling, whereas the population genetic analyses are commonly written using R tools. Although some graphical user interface (GUI) tools such as CLC Genomics and Genious are available (at a cost) to do similar tasks, using the command line gives you much greater flexibility in the analyses that you can do and the scale that you can do it, and it is freely available. There will be a considerable amount of coding in this module - this may be daunting at first, however, with some practice, it will become much easier.

We will also be using Artemis, a computer program designed to view and edit genomic sequences and visualise annotations in a highly interactive graphical format. In this exercise, we will use Artemis to visualise our genomic sequencing data mapped to our reference sequence, and identify the variants that differ between our samples and the reference.

The aims of this module are to familiarize you with tools and concepts that will allow you to:

map high-throughput sequencing reads to a genome;
bioinformatically identify and filter single nucleotide polymorphisms in your samples;
visualize sequencing reads and genetic variants in your samples;
analyse patterns of genetic diversity in your data, and link these patterns to metadata to uncover biological insights in your species.

Genetic Variation

Table of Contents

1. Overview and Aims

2. Quality control of raw sequencing data

2.1. Running and interpreting FastQC

2.2. Questions:

3. Preparing your reference sequence prior to mapping

3.1. Questions:

4. Short read mapping

4.1. Mapping reads from a single sample

4.2. Mapping QC

5. Calling SNPs in our mapped sample

5.1. Question:

5.2. Using mpileup and bcftools to identify variants

6. Visualising mapped reads and variants using Artemis

7. Mapping reads from multiple samples

8. Calling SNPs in multiple samples at the same time

8.1. Questions:

9. Visualising SNP data using WormBase ParaSite and Artemis

9.1. Analysing your SNPs in WormBase ParaSite

9.2. Questions:

9.3. Visualising SNPs in Artemis

10. Analysis of genetic variation using R

10.1. Setting up R and loading R libraries

10.2. Import and prepare your data for analysis

11. Principal component analysis of genetic diversity

11.1. Questions:

11.2. Questions:

12. Exploring genetic data using phylogenetic trees

12.1 Making trees using ggtree

12.1. Questions:

13. Integrating genetic and geographic data: maps

13.1 Calculating allele frequencies per country

13.2 Making maps using data

14. Summary

License