#Module 6: Pathogen Variant Calling ##Day 1, practical session 1: Alignment to reference
In this module we will be analyzing the case of four Mycobacterium tuberculosis isolates (3 multi-drug resistant and 1 susceptible) from TB patients to investigate following aspects from whole genome sequence data:
To answer each of these questions we will first perform the analysis on one isolate and the steps would be repeated for others as part of the assignment.
Today we will analyze the data to answer the first two questions. Before we jump into the steps for analysis, we need to make sure we have all the required tools/softwares installed and a dataset in one place. Follow the following steps:
pwd
cd course_data/variant_calling/pathogen
ls dataset
mkdir practical
mkdir practical/MD001
cp dataset/reference.fa practical/MD001/
cp dataset/reference.gbk practical/MD001/
cp dataset/MD001_R1.gz practical/MD001/
cp dataset/MD001_R2.gz practical/MD001/
or
cp dataset/MD001_*.gz practical/MD001/
bwa
samtools
picard -h
bcftools
snpEff -h
igv
cd practical/MD001
bwa index reference.fa
bwa mem reference.fa MD001_R1.fastq.gz MD001_R2.fastq.gz >MD001_aln.sam
samtools sort -T temp -O bam -o MD001_aln_sorted.bam MD001_aln.bam
samtools index MD001_aln_sorted.bam
picard MarkDuplicates I=MD001_aln_sorted.bam O=MD001_aln_markdup.bam M=metrics.txt
grep -A 2 “^##METRICS” metrics.txt
-A
option allows us to print number of lines (2 in above case) after the line with the matching string (“##METRICS)
Q1.1: How many read pairs were assigned as duplicates?
Q1.2: What proportion of the mapped sequence was marked as duplicates?
samtools index MD001_aln_markdup.bam
samtools stats MD001_aln_markdup.bam >MD001_bamstats.txt
grep “^SN” MD001_bamstats.txt > MD001stats.txt
Q1.3: What is the total number of mapped reads?
Q1.4: What is the total number of unmapped reads?
Q1.5: What is the total number of mapped and properly paired reads?
Q1.6: What is the average insert size?
Q1.7: What is the percentage of reads properly paired?
samtools mpileup -t DP -Bug -m 4 -f reference.fa MD001_aln_markdup.bam >MD001_variants.bcf
bcftools call -mv -O v -o MD001_variants.vcf MD001_variants.bcf
head -n100 MD001_variants.vcf
Q2.1: At what position is the first variant in the unfiltered vcf file for MD001?
bcftools filter -i 'type="snp"' -g10 -G10 MD001_variants.vcf -o MD001_SNPs.vcf
bcftools filter -i 'type="snp" && QUAL>=50 && FORMAT/DP>5 && MQ>=30' -g10 -G10 MD001_variants.vcf -o MD001_SNPs_filtered_try1.vcf
bcftools filter -i 'type="snp" && QUAL>=50 && FORMAT/DP>5 && MQ>=30 && DP4[2]/(DP4[2]+DP4[0])>=0.80 && DP4[3]/(DP4[3]+DP4[1])>=0.80' -g10 -G10 MD001_variants.vcf -o MD001_SNPs_filtered.vcf
Q2.2: What does the DP4 value represent?
snpEff databases | grep h37rv
sed ‘s|NC_000962.3|Chromosome|’ MD001_SNPs_filtered.vcf > MD001_SNPs_filtered_renamed.vcf
MD001_SNPs_filtered_snpEff.ann.vcf ```
cat MD001_SNPs_filtered_snpEff.ann.vcf | grep MODERATE\|rpoB
Q2.5: How many HIGH effect variants were there for sample MD001?
Q2.6: What was the TS TV ratio? (Look in the MD001_snpEff.csv summary file)
Q2.7: Are there any other mutations in resistance related genes?