Here we are going to use the software packages SAMtools and Qualimap to explore how the genome sequence reads are distributed across the viral genome. In other words, we are going to generate a coverage plot. In addition to the coverage plot, the Qualimap software also provides several other reports and metrics that can be used for quality control (QC). You can read about Qualimap in this paper:
The SAMtools package contains many really useful tools for analysing and manipulating BAM files. You can read more about SAMtools here:
Qualimap and SAMtools are already installed on the virtual machine. However, it you need to install the Qualimap software, you can obtain it from: http://qualimap.conesalab.org/ and SAMtools from http://samtools.sourceforge.net/.
We can run the Qualimap software either:
During this workshop activity, we will try both modes of running the software.
We will use SAMtools only on the command line; there is not graphical interface for this software package.
Qualimap examines coverage of the reference genome sequence by aligned sequence reads. Therefore, it requires that the reads have already been aligned against the reference genome sequence and that the alignment is presented in BAM format.
We already discussed BAM format in the session “NGS file formats and data and QC”. You can find a formal description of the BAM file format here: https://samtools.github.io/hts-specs/SAMv1.pdf.
The specific viral sequence datasets that we will examine are:
These datasets are already downloaded and provided to you in the virtual machine. Now, navigate to the appropriate directory with this command:
cd /home/manager/ViralBioinfAsia2022/course_data/Coverage_Plots_Stats
Now, when you list the contents of the current directory with ls -lh
, you should see something like this:
total 16M
-rw-rw-r-- 1 manager manager 2.9K Jul 25 16:20 ARTIC_amplicons.bed
-rw-rw-r-- 1 manager manager 8.0M Jul 25 16:20 ERR8261957.bam
-rw-rw-r-- 1 manager manager 152 Jul 25 16:20 ERR8261957.bam.bai
-rw-rw-r-- 1 manager manager 7.6M Jul 25 16:20 ERR8261968.bam
-rw-rw-r-- 1 manager manager 152 Jul 25 16:20 ERR8261968.bam.bai
drwxrwxr-x 2 manager manager 4.0K Jul 25 16:20 images
-rw-rw-r-- 1 manager manager 1.2K Jul 25 16:20 readME.md
So, you can see we have the two BAM files, along with their accompanying index files. These BAM files consist of genomic sequence reads aligned against Alignments of sequence reads against the Wuhan-Hu-1 reference genome sequence.
Let’s gather some information about these alignments. First, let’s use the rather basic flagstat tool from the SAMools package:
samtools flagstat ERR8261957.bam
samtools flagstat ERR8261968.bam
Examine the output from these two commands. Can you figure out:
Now, let’s use Qualimap to gather some information about these same two alignments. Execute the following command lines:
qualimap bamqc -bam ERR8261957.bam
qualimap bamqc -bam ERR8261968.bam
Now, if you list the directory contents with ls -lh
, you will notice new directories called ERR8261957_stats
and ERR8261968_stats
. In each of these directories, you will find an HTML file that contains a detailed report on the BAM alignment.
Execute the command ls -lh *_stats
to see the contents of these directories. Now, let’s open the HTML files in the Firefox web browser:
firefox *_stats/*.html
You should now see something like this:
Note that the web browser contains two tabs; one contains the report for ERR8261957.bam and the other contains report for ERR8261968.bam.
Take a look at the reports to see the range of information provided. Then close the Firefox web browser.
As an alternative to the command-line, we can launch a graphical interface for Qualimap.
Start the Qualimap graphical interface by executing the command qualimap &
in the Terminal. Don’t worry about the warning message concerning missing R packages; just click the Ok button. You should now see something like this:
Notice the File menu item at the top left corner of the Qualimap window. Select File -> New analysis -> BAM QC and then select one of the BAM files. When the analysis is complete you will see the results in the Qualimap window and you can click on the various sections of the results report in the menu on the left:
Note that under the File menu, we can export the results report as an HTML or PDF file.
You can find some good information about interpreting the Qualimap results here: http://qualimap.conesalab.org/doc_html/analysis.html#output.
For our datasets, try to answer these questions:
If you have any spare time then optionally, let’s visually inspect the alignment data using the Integrative Genome Viewer (IGV). This will provide us with an overview of the data. (Alternatively, you can use Tablet instead of IGV, if you pr).
To do this, we need the BAM files, obviously; we also need files containing the reference genome sequence. Lets download those now. First, make sure that we are in the correct directory:
cd /home/manager/ViralBioinfAsia2022/course_data/Coverage_Plots_Stats
Then use wget to download the reference-genome files:
wget https://github.com/WCSCourses/ViralBioinfAsia2022/blob/main/course_data/Coverage_Plots_Stats/reference_genome.fasta https://github.com/WCSCourses/ViralBioinfAsia2022/blob/main/course_data/Coverage_Plots_Stats/reference_genome.fasta.fai https://github.com/WCSCourses/ViralBioinfAsia2022/raw/main/course_data/Coverage_Plots_Stats/reference_genome.gff3
Now, when you execute ls -lh
, you will notice we now have the following files: reference_genome.fasta
, reference_genome.fasta.fai
and reference_genome.gff3
. We are ready to run the IGV genome browser software by executing the command igv
.
Click on Genomes -> Load genome from file and select reference_genome.fasta
.
Click on File -> Load from file and select reference_genome.gff3
.
Click on File -> Load from file and select ERR8261957.bam
and ERR8261968.bam
.
You can now explore the two alignments in the IGV browser, which should look something like this:
Alignments of sequence reads against the Wuhan-Hu-1 reference genome, visualised using IGV.
What do you notice about the patterns of coverage of the reference genome? Are there any differences between the Illumina sequencing data versus the Oxford Nanopore sequencing data?