#Module 6: Pathogen Variant Calling ##Day 1, practical session 1: Alignment to reference
In this module we will be analyzing the case of four Mycobacterium tuberculosis isolates (3 multi-drug resistant and 1 susceptible) from TB patients to investigate following aspects from whole genome sequence data:
To answer each of these questions we will first perform the analysis on one isolate and the steps would be repeated for others as part of the assignment.
Today we will analyze the data to answer the first two questions. Before we jump into the steps for analysis, we need to make sure we have all the required tools/softwares installed and a dataset in one place. Follow the following steps:
pwdcd course_data/variant_calling/pathogenls datasetmkdir practicalmkdir practical/MD001cp dataset/reference.fa practical/MD001/cp dataset/reference.gbk practical/MD001/cp dataset/MD001_R1.gz practical/MD001/cp dataset/MD001_R2.gz practical/MD001/or
cp dataset/MD001_*.gz practical/MD001/bwasamtoolspicard -hbcftoolssnpEff -higvcd practical/MD001bwa index reference.fabwa mem reference.fa MD001_R1.fastq.gz MD001_R2.fastq.gz >MD001_aln.samsamtools sort -T temp -O bam -o MD001_aln_sorted.bam MD001_aln.bamsamtools index MD001_aln_sorted.bampicard MarkDuplicates I=MD001_aln_sorted.bam O=MD001_aln_markdup.bam M=metrics.txtgrep -A 2 “^##METRICS” metrics.txt-A option allows us to print number of lines (2 in above case) after the line with the matching string (“##METRICS)Q1.1: How many read pairs were assigned as duplicates?
Q1.2: What proportion of the mapped sequence was marked as duplicates?
samtools index MD001_aln_markdup.bamsamtools stats MD001_aln_markdup.bam >MD001_bamstats.txtgrep “^SN” MD001_bamstats.txt > MD001stats.txtQ1.3: What is the total number of mapped reads?
Q1.4: What is the total number of unmapped reads?
Q1.5: What is the total number of mapped and properly paired reads?
Q1.6: What is the average insert size?
Q1.7: What is the percentage of reads properly paired?
samtools mpileup -t DP -Bug -m 4 -f reference.fa MD001_aln_markdup.bam >MD001_variants.bcfbcftools call -mv -O v -o MD001_variants.vcf MD001_variants.bcfhead -n100 MD001_variants.vcfQ2.1: At what position is the first variant in the unfiltered vcf file for MD001?
bcftools filter -i 'type="snp"' -g10 -G10 MD001_variants.vcf -o MD001_SNPs.vcfbcftools filter -i 'type="snp" && QUAL>=50 && FORMAT/DP>5 && MQ>=30' -g10 -G10 MD001_variants.vcf -o MD001_SNPs_filtered_try1.vcfbcftools filter -i 'type="snp" && QUAL>=50 && FORMAT/DP>5 && MQ>=30 && DP4[2]/(DP4[2]+DP4[0])>=0.80 && DP4[3]/(DP4[3]+DP4[1])>=0.80' -g10 -G10 MD001_variants.vcf -o MD001_SNPs_filtered.vcfQ2.2: What does the DP4 value represent?
snpEff databases | grep h37rvsed ‘s|NC_000962.3|Chromosome|’ MD001_SNPs_filtered.vcf > MD001_SNPs_filtered_renamed.vcfMD001_SNPs_filtered_snpEff.ann.vcf ```
cat MD001_SNPs_filtered_snpEff.ann.vcf | grep MODERATE\|rpoBQ2.5: How many HIGH effect variants were there for sample MD001?
Q2.6: What was the TS TV ratio? (Look in the MD001_snpEff.csv summary file)
Q2.7: Are there any other mutations in resistance related genes?