sudo apt install fastqc
wget http://opengene.org/fastp/fastp
chmod a+x ./fastp
sudo apt install bowtie2
sudo apt install samtools
sudo apt install bedtools
sudo apt install spades
sudo apt install ncbi-blast+
Download the simulated tutorial data. The data was simulated using InSilicoSeq and includes reads from Influenza virus, a SARS-CoV-2 isolate (B.1.1.519) and some human genome reads.
wget https://raw.githubusercontent.com/WCSCourses/ViralBioinfAsia2022/main/course_data/Pathogen_sequence_detection_using_metagenomics/tutorial_R1.fastq.gz
wget https://raw.githubusercontent.com/WCSCourses/ViralBioinfAsia2022/main/course_data/Pathogen_sequence_detection_using_metagenomics/tutorial_R2.fastq.gz
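A quick sanity check after downloading is to confirm that the forward and reverse files hold the same number of reads. The following is a minimal sketch, assuming well-formed FASTQ with exactly four lines per record; the helper name `count_reads` and the tiny demo files are made up for illustration and are not part of the course material.

```shell
# Count reads in a gzipped FASTQ file (assumes 4 lines per record).
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Demo on two tiny, made-up FASTQ files written to a temporary directory.
tmpdir=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n' | gzip -c > "$tmpdir/demo_R1.fastq.gz"
printf '@r1/2\nTTGA\n+\nIIII\n@r2/2\nCCAA\n+\nIIII\n' | gzip -c > "$tmpdir/demo_R2.fastq.gz"

n_r1=$(count_reads "$tmpdir/demo_R1.fastq.gz")
n_r2=$(count_reads "$tmpdir/demo_R2.fastq.gz")

# Paired-end files should contain the same number of reads.
if [ "$n_r1" -eq "$n_r2" ]; then
    pair_status="paired files agree: $n_r1 reads each"
else
    pair_status="read counts differ: R1=$n_r1 R2=$n_r2"
fi
echo "$pair_status"
rm -rf "$tmpdir"
```

On the tutorial data the same helper could be called as `count_reads tutorial_R1.fastq.gz` and `count_reads tutorial_R2.fastq.gz`.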
The goal of this approach is to generate genome assemblies of all viruses present in the metagenomic samples. Briefly, the steps are: filtering, decontamination or reference-based selection of reads, genome assembly, identification, and genome annotation.
QC and Filtering: The raw metagenomic reads are filtered and trimmed on the basis of the quality scores that the sequencer assigns to each base.
# run on both forward and reverse reads
fastqc tutorial_R1.fastq.gz
fastqc tutorial_R2.fastq.gz
# fastp was downloaded into the current directory above, so it is called as ./fastp
./fastp -i tutorial_R1.fastq.gz -I tutorial_R2.fastq.gz -o tutorial_filtered_R1.fastq.gz -O tutorial_filtered_R2.fastq.gz -z 9 -q 15 -n 10 -e 20 -l 250
-i: read1 input file name
-I: read2 input file name
-o: read1 output file name
-O: read2 output file name
-z: compression level for the gzip output (1 is fastest, 9 is smallest)
-q: the minimum phred quality for a base to count as qualified (15 means bases with quality >=Q15 are qualified)
-n: if the number of N bases in a read is >n_base_limit, the read/pair is discarded
-e: if the average quality score of a read is <avg_qual, the read/pair is discarded
-l: reads shorter than length_required are discarded
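To make the effect of `-n`, `-e` and `-l` concrete, the three rules can be sketched as a small awk filter. This is only an illustration under stated assumptions: it handles single-end reads with Phred+33 quality encoding, ignores `-q` and pair handling, and is not how fastp is implemented. The function name `filter_fastq` and the demo reads are hypothetical.

```shell
# Simplified, single-end sketch of three fastp read filters.
# Thresholds are passed as arguments so the demo can use small values.
filter_fastq() {
    # $1 = max N bases allowed (-n), $2 = min mean quality (-e), $3 = min length (-l)
    awk -v n_limit="$1" -v avg_qual="$2" -v min_len="$3" '
    BEGIN {
        # Lookup table from printable ASCII character to its code point.
        for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 1 { name = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 0 {
        qual = $0
        len = length(seq)
        tmp = seq
        n_count = gsub(/N/, "", tmp)       # number of N bases in the read
        total = 0
        for (i = 1; i <= len; i++)         # Phred+33 decoding
            total += ord[substr(qual, i, 1)] - 33
        mean_q = (len > 0) ? total / len : 0
        if (n_count > n_limit) next        # too many N bases
        if (mean_q < avg_qual) next        # mean quality too low
        if (len < min_len) next            # read too short
        printf "%s\n%s\n+\n%s\n", name, seq, qual
    }'
}

# Four made-up reads: one good, one N-rich, one low quality, one short.
demo_fastq='@keep
ACGTACGT
+
IIIIIIII
@many_N
ANNNACGT
+
IIIIIIII
@low_quality
ACGTACGT
+
########
@short
ACG
+
III'

# Apply the filters with demo thresholds (-n 2, -e 20, -l 5) and list surviving read names.
kept=$(printf '%s\n' "$demo_fastq" | filter_fastq 2 20 5 | awk 'NR % 4 == 1')
echo "$kept"
```

With these demo thresholds only the first read passes all three checks; the tutorial command applies the same kind of rules with `-n 10 -e 20 -l 250`.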
Decontamination/reference selection: Metagenomes from any environment can contain contamination, for example DNA/RNA fragments from other microbial organisms or from the host. These reads must be removed in a decontamination step. Our simulated data contains viral reads from Influenza virus and a SARS-CoV-2 isolate (B.1.1.519), plus some human genome reads. Because this analysis focuses on viral genomes only, we will download the genomes of the Orthomyxoviridae and Coronaviridae families and keep only the reads that map to them. If the analysis is not restricted to one or a few virus families, the whole virus genome database can be downloaded and used instead: https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=10239&sort=taxonomy
cat Orthomyxoviridae_genomes.fasta Coronaviridae_genomes.fasta > reference_genomes.fasta
bowtie2-build reference_genomes.fasta reference
bowtie2 -1 tutorial_filtered_R1.fastq.gz -2 tutorial_filtered_R2.fastq.gz -x reference -S reference_mapped.sam
samtools view -b -F 4 -o reference_mapped.bam reference_mapped.sam
bamToFastq -i reference_mapped.bam -fq reference_mapped_R1.fastq -fq2 reference_mapped_R2.fastq
spades.py -1 reference_mapped_R1.fastq -2 reference_mapped_R2.fastq -o genome_assembly --meta
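The `-F 4` option in the samtools step above drops every alignment record whose FLAG field has bit 0x4 ("segment unmapped") set. The sketch below mimics that test on a few truncated, made-up SAM records; the helper names `is_unmapped` and `keep_mapped` are hypothetical, and the awk expression is only a stand-in for what samtools does internally.

```shell
# Succeed (exit status 0) if the SAM FLAG has the "segment unmapped" bit (0x4) set.
is_unmapped() {
    [ $(( $1 & 4 )) -ne 0 ]
}

# Keep header lines and records whose FLAG lacks bit 0x4.
# int(flag / 4) % 2 extracts that bit without needing bitwise operators in awk.
keep_mapped() {
    awk -F '\t' '/^@/ { print; next } int($2 / 4) % 2 == 0'
}

# Truncated demo records (name, FLAG, reference): read1 is a mapped pair
# (flags 99 and 147), read2 is an unmapped pair (flags 77 and 141).
mapped_names=$(printf 'read1\t99\trefA\nread1\t147\trefA\nread2\t77\t*\nread2\t141\t*\n' \
    | keep_mapped | cut -f 1 | sort -u)
echo "$mapped_names"
```

Note that the filter works per read, not per pair, so a pair in which only one mate maps contributes a single record to the BAM file.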
Binning: This step is required when the metagenome is complex (it includes multiple viral genomes or organisms). If the quality of the assembled genomes is good, this step can be skipped.
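One rough way to judge an assembly before deciding on binning is to look at scaffold lengths and the N50. The following is a minimal sketch, assuming a plain FASTA file; contiguity is only one proxy for quality, and the helper names `fasta_lengths` and `n50` and the demo sequences are made up for illustration.

```shell
# Print one sequence length per FASTA record (handles multi-line sequences).
fasta_lengths() {
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$@"
}

# N50: the length L such that sequences of length >= L cover at least half of the assembly.
n50() {
    sort -rn | awk '{ lens[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running * 2 >= total) { print lens[i]; exit }
            }
        }'
}

# Made-up demo assembly with sequence lengths 8, 6, 4 and 2.
demo_fasta='>c1
ACGT
ACGT
>c2
ACGTAC
>c3
ACGT
>c4
AC'

assembly_n50=$(printf '%s\n' "$demo_fasta" | fasta_lengths | n50)
echo "$assembly_n50"
```

For the tutorial assembly the same helpers could be run as `fasta_lengths genome_assembly/scaffolds.fasta | n50`.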
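Identification & annotation: To identify the assembled viral genomes, the scaffolds can be searched with BLAST against the reference genomes used in the previous step.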
# make blastdb
makeblastdb -in reference_genomes.fasta -dbtype nucl -title refgenomes -out blastdb_ref
# run blast
blastn -query genome_assembly/scaffolds.fasta -db blastdb_ref -out blast_out.txt -outfmt "6 qseqid sseqid stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore" -max_target_seqs 1
Read more about blast output format 6
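With the column list used in the blastn command above, each row of blast_out.txt has 13 tab-separated fields, and a scaffold can still appear on several rows when it aligns to the same reference in more than one segment. The sketch below keeps the highest-bitscore row per scaffold and prints a short summary; the function name `best_hits` and the demo rows are hypothetical and do not come from a real BLAST run.

```shell
# Columns: 1 qseqid, 2 sseqid, 3 stitle, 4 pident, 5 length, 6 mismatch, 7 gapopen,
#          8 qstart, 9 qend, 10 sstart, 11 send, 12 evalue, 13 bitscore
# Keep the best-scoring row per query and print: qseqid, stitle, pident, length, bitscore.
best_hits() {
    awk -F '\t' '
        !($1 in best) || $13 + 0 > best[$1] {
            best[$1] = $13 + 0
            line[$1] = $1 "\t" $3 "\t" $4 "\t" $5 "\t" $13
        }
        END { for (q in line) print line[q] }' "$@" | sort
}

# Made-up rows in the same 13-column layout (not real BLAST output).
demo_blast_table() {
    printf 'NODE_1\trefA\tHypothetical virus A segment 1\t99.5\t2300\t12\t0\t1\t2300\t1\t2300\t0.0\t4100\n'
    printf 'NODE_1\trefA\tHypothetical virus A segment 1\t95.0\t200\t10\t0\t1\t200\t500\t699\t1e-50\t300\n'
    printf 'NODE_2\trefB\tHypothetical virus B\t98.0\t29000\t580\t0\t1\t29000\t1\t29000\t0.0\t52000\n'
}

top_hits=$(demo_blast_table | best_hits)
echo "$top_hits"
```

On the tutorial output this could be run as `best_hits blast_out.txt`. The field separator is set to a tab because `stitle` usually contains spaces.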
Annotation/identification server: Coronavirus Antiviral & Resistance Database
The aim of this approach is to estimate the abundance of each virus in the metagenomic sample. Several tools are available for this; one popular tool is MetaPhlAn. The filtering and decontamination steps are the same as above, after which the tutorial on the MetaPhlAn webpage can be followed.
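As a rough illustration of what "abundance" can mean here, the sketch below turns per-reference mapped read counts into length-normalised percentages. It is a simple mapping-based estimate and not MetaPhlAn's method; it assumes a four-column, tab-separated table in the layout printed by `samtools idxstats` (reference name, length, mapped reads, unmapped reads). The function name `relative_abundance` and the demo numbers are made up.

```shell
# Input columns: reference name, reference length, mapped reads, unmapped reads.
# Output: reference name and its share (%) of the length-normalised read counts.
relative_abundance() {
    awk -F '\t' '
        $1 != "*" && $2 > 0 {
            rpk[$1] = $3 / ($2 / 1000)    # reads per kilobase of reference
            total += rpk[$1]
        }
        END {
            if (total == 0) exit
            for (r in rpk) printf "%s\t%.2f\n", r, 100 * rpk[r] / total
        }' "$@" | sort
}

# Made-up counts: both references get 3000 reads, but virusA is three times shorter.
abundance=$(printf 'virusA\t10000\t3000\t0\nvirusB\t30000\t3000\t0\n*\t0\t0\t500\n' \
    | relative_abundance)
echo "$abundance"
```

Normalising by reference length matters because a longer genome collects more reads at the same copy number. In practice the input table would come from running `samtools idxstats` on a coordinate-sorted, indexed BAM file.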
Another approach, described in Molina-Mora, Jose Arturo, et al. “Metagenomic pipeline for identifying co-infections among distinct SARS-CoV-2 variants of concern: Study cases from Alpha to Omicron.” Scientific Reports 12.1 (2022): 1-10, aims to identify co-infections with distinct SARS-CoV-2 variants.