WWPG_2022

Working with Pathogen Genomes 2022


Linux Scripting

Table of Contents

  1. Introduction and aims
  2. Getting started on the command line - Basic unix
  3. Files and directories
  4. Looking inside files
  5. Searching the content of files using grep
  6. Processing columns with awk
  7. Loops
  8. Bash scripts
  9. UNIX quick reference guide

Introduction and Aims

Introducing Linux

Unix is the standard operating system on most large computer systems in scientific research, in the same way that Microsoft Windows is the dominant operating system on desktop PCs.

Unix and MS Windows both perform the important job of managing the computer’s hardware (screen, keyboard, mouse, hard disks, network connections, etc…) on your behalf. They also provide you with tools to manage your files and to run application software. They both offer a graphical user interface (desktop). These desktop interfaces look different between the operating systems, use different names for things (e.g. directory versus folder) and have different images but they mostly offer the same functionality.

Unix is a powerful, secure, robust and stable operating system which allows dozens of people to run programs on the same computer at the same time. This is why it is the preferred operating system for large-scale scientific computing. It runs on all kinds of machines, from mobile phones (Android), desktop PCs… to supercomputers.

Why Linux

Increasingly, the output of biological research exists as in silico data, usually in the form of large text files. Unix is particularly suitable for working with such files and has several powerful and flexible commands that can be used to process and analyse this data. One advantage of learning Unix is that many of the commands can be combined in an almost unlimited fashion. So if you can learn just six Unix commands, you will be able to do a lot more than just six things.

Unix contains hundreds of commands, but to conduct your analysis you will probably only need 10 or so to achieve most of what you want to do. In this course we will introduce you to some basic Unix commands followed by some more advanced commands and provide examples of how they can be used in bioinformatics analyses.
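
As a small taste of what combining commands looks like (a sketch, assuming a GFF file called genome_1.gff such as the one used later in this module), both of the following count the lines that mention "gene"; the second chains two commands together with a pipe:

$ grep -c "gene" genome_1.gff
$ grep "gene" genome_1.gff | wc -l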

General points to consider

Some useful Linux commands

Command What it does
ls Lists the contents of the current directory
mkdir Makes a new directory
mv Moves or renames a file
cp Copies a file
rm Removes a file
cat Concatenates two or more files
less Displays the contents of a file one page at a time
head Displays the first ten lines of a file
tail Displays the last ten lines of a file
cd Changes the current working directory
pwd Prints the working directory
find Finds files matching an expression
grep Searches for a pattern within a file
wc Counts the lines, words, characters or bytes in a file
kill Stops a process
jobs Lists the processes that are running

Tips to get you started


Back to top

Getting started on the command line - Basic unix

Introduction to the terminal

# your first command – move to the working directory to get started!

cd /home/manager/Module_2_Linux_Scripting


Command line arguments

# List the contents of a directory
$ ls

# List the contents of a directory with extra information about the files
$ ls -l

# List all contents including hidden files & directories
$ ls -al

# List the contents of the directory called basic with extra information
$ ls -l basic

# Suggested usage – this will be the most frequent command used as a bioinformatician!
$ ls -ltr

# where:
	-l gives the long format,
	-t sorts the output by time,
	-r reverse sorts the output.

# this will therefore provide a detailed list, with the most recent files at the bottom. This is really useful if you have a lot of files in the same directory


Permissions

Permission What it does
Read (r) permission to read from a file/directory
Write (w) permission to modify a file/directory
Execute (x) permission to run a file as a program (rather than open it as text), or to enter a directory
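
# A minimal sketch of reading permissions from ls -l output (my_file.txt is a hypothetical example file)
$ ls -l my_file.txt
# the output starts with something like -rw-r--r--: the owner can read and write (rw-),
# while the group and everyone else can only read (r--)

# chmod changes permissions, for example adding execute permission for everyone
$ chmod +x my_file.txt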

Back to top

Files and Directories


pwd - find where you are

# To find out where you are, type this into your terminal.
$ pwd

$ cd basic

$ pwd


cd - change current working directory

# Move into the genome_1 directory using the cd command
$ cd genome_1/

# Use the pwd command to check you are in the right place
$ pwd

# it is often useful to list the contents of your new location after moving
$ ls -lrt

Command What it means
. Current directory (one full stop)
.. Directory above (two full stops)
~ Home directory (tilde)
/ Root of the file system (like C: in Windows)
# List contents of current directory
$ ls .

# List the contents of directory above your current location
$ ls ..

# list the contents of the home directory
$ ls ~


Tab completion - “make tab-it and hab-it”
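
# Tab completion saves typing and avoids spelling mistakes: type the first few characters
# of a file or directory name and press the Tab key. A small sketch (assuming the genome_1
# directory from this module is in your current location):
$ cd gen 	# pressing Tab here completes the name to genome_1/ if it is the only match
# pressing Tab twice lists all the possible completions when there is more than one match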


cp - copy a file

# To copy the file genome_1.gff to a new file called genome_1.withseq use:
$ cp genome_1.gff genome_1.withseq.gff

# Use ls to check the contents of the current directory for the copied file:
$ ls -lrt

mv - move a file

# To move the file genome_1.withseq.gff from the current directory to the directory above use:
$ mv genome_1.withseq.gff ..

# Use the ls command to check the contents of the current directory and the directory above
# to see that genome_1.withseq.gff has been moved.
$ ls -lrt
$ ls -lrt ../

# you could also change directory to check the file moved
$ cd ../
$ ls -lrt

rm - delete a file

# To remove the copy of the genome_1 gff file, called genome_1.withseq.gff use:
$ rm genome_1.withseq.gff

# Use ls to check that the copied file has been removed:
$ ls -lrt


Exercises

  1. Use the ls command to show the contents of the basic directory.
  2. How many files are there in the genome_2 directory?
  3. What is the largest file in the genome_2 directory?
  4. Move into the genome_2 directory.
  5. How many files are there in the fasta directory?
  6. Copy the file genome_2.bed in the genome_2 directory into the annotation subdirectory.
  7. Move all the fasta files in the directory genome_2 to the fasta subdirectory.
  8. How many files are there in the fasta directory?

Back to top

Looking inside files

less

# Use the less command to open a gff
$ less genome_1.gff

head and tail

# To look at the beginning of the file genome_1.gff file use:
$ head genome_1.gff

# To look at the end of genome_1.gff use:
$ tail genome_1.gff

# To look at the last 25 lines of genome_1.gff use:
$ tail -n 25 genome_1.gff

Saving time


Getting help: man, -h, --help

# I’m stuck – help!
$ man tail
# or
$ tail -h
# or
$ tail --help

Writing to files

# Extract the first line of genome_1.gff and output to a new file
$ head -1 genome_1.gff > first_genome_1_line.txt
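
# Note that > overwrites any existing file with the same name. A quick sketch of appending
# with >> instead (first_genome_1_lines.txt is just an illustrative file name):
$ head -1 genome_1.gff > first_genome_1_lines.txt
$ tail -1 genome_1.gff >> first_genome_1_lines.txt
# the file now contains two lines; remove it as we don't need it
$ rm first_genome_1_lines.txt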


cat

# Read your new file using the cat command
$ cat first_genome_1_line.txt

# we don’t actually need this file, so let’s remove it
$ rm first_genome_1_line.txt

# Join the two files using the cat command
$ cat genome_1.noseq.gff genome_1.fa > genome_1.concatenated.gff

# let’s check that the new file has been generated
$ ls -lrt


wc - counting

# use the wc command on the file directly
$ wc -l genome_1.gff

# use cat to open the file, and “pipe” the result to the wc command
$ cat genome_1.gff | wc -l

# For example to count the number of files that are listed by ls use:
$ ls | wc -l

# You can connect as many commands as you want. For example:
$ ls | grep ".gff" | wc -l


sort - sorting values

# For example, to sort the contents of a BED file use:
$ sort genome_2.bed | head

# look at the other end of the file using tail
$ sort genome_2.bed | tail

# To sort the contents of a BED file on position, type the following command.
$ sort -k 2 -n genome_2.bed
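
# BED files are usually sorted by chromosome and then by start position. A sketch of doing
# both at once: -k1,1 sorts on the first column, then -k2,2n sorts numerically on the second.
$ sort -k1,1 -k2,2n genome_2.bed | head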

uniq - finding unique values

# To get the list of chromosomes in the genome_2 bed file use:
$ awk '{ print $1 }' genome_2.bed | sort | uniq

# Let’s see what happens when we build a command using pipes
$ awk '{ print $1 }' genome_2.bed | less
$ awk '{ print $1 }' genome_2.bed | sort | less
$ awk '{ print $1 }' genome_2.bed | sort | uniq | less
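
# uniq can also count how many times each value occurs with its -c option. A quick sketch
# counting the number of features on each chromosome in genome_2.bed:
$ awk '{ print $1 }' genome_2.bed | sort | uniq -c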

Exercises


Back to top

Searching the content of files using grep

# First we need to go to the correct directory
$ cd /home/manager/Module_2_Linux_Scripting/grep

Simple pattern matching

# Use cat to view the file contents
$ cat gene_expression.bed

# We are interested in chromosome 2 so wish to find all lines involving it using grep.
$ grep chr2 gene_expression.bed

# We can search the output of the grep search using a pipe
$ grep chr2 gene_expression.bed | grep +

# We will modify our original search slightly to find all data on chromosome 1
$ grep chr1 gene_expression.bed

# Look at another bed file we have provided
$ cat gene_expression_sneaky.bed

# See what happens when we grep for genes on chromosome 1, on the negative strand. (Note, we put the minus sign in quotes to stop Unix interpreting it as an option to grep.)
$ grep chr1 gene_expression_sneaky.bed | grep "-"


Regular expressions

# Repeat the first part of our search but including ^. Note, to be safe, we will put the search term in quotes.
$ grep '^chr1' gene_expression_sneaky.bed

# This can be done by searching for a tab character following the chromosome name. Tab is represented by ‘\t’. For reasons beyond the scope of this course, we must start the search term with a dollar symbol for the tab to be recognised.

$ grep $'^chr1\t' gene_expression_sneaky.bed

# Searching for a string at the end of the line is done using a $ symbol at the end of the search term. In this case, we will backslash the - symbol for safety.
$ grep $'^chr1\t' gene_expression_sneaky.bed | grep '\-$'


Useful grep command line options

# We will repeat a previous search but include the -c option to count matches rather than just returning them.
$ grep -c $'^chr1\t' gene_expression_sneaky.bed

# Consider the fasta file sequences.fasta.
$ cat sequences.fasta
# A simple search for ACGT will not hit all relevant sequences.
$ grep ACGT sequences.fasta

# The -i option makes the search case-insensitive
$ grep -i ACGT sequences.fasta

# The -v option inverts the search, returning only the lines that do not match
$ grep -v $'^chr1\t' gene_expression_sneaky.bed


Replacing matches to regular expressions

# As an example, we wish to replace each incidence of the characters ‘chr’ at the beginning of the line in gene_expression.bed with ‘chromosome’.
$ sed 's/^chr/chromosome/' gene_expression.bed

# For example, to write the output to a new file:
$ sed 's/^chr/chromosome/' gene_expression.bed > gene_expression_new.bed
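
# By default sed only replaces the first match on each line. Adding the g flag replaces
# every match; a quick sketch (dropping the ^ anchor so every occurrence of chr is changed):
$ sed 's/chr/chromosome/g' gene_expression.bed | head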


Back to top

Processing columns with awk

$ cd ~/Module_2_Linux_Scripting/awk/

# First we will view the GFF file to look at its structure.
$ cat genes.gff

# We can ask awk just to give us the first column of a file. awk calls the columns $1, $2 etc. with $0 representing the full line.
$ awk -F"\t" '{print $1}' genes.gff


Filtering input files

# The filtering criteria can be added before the braces. For example, this will extract just chromosome 1 data from the file.
$ awk -F"\t" '$1=="chr1" { print $0 }' genes.gff

# In this example we will search for just the genes from chromosome 1.
$ awk -F"\t" '$1=="chr1" && $3=="gene"' genes.gff

# In this example we will search for features which are on chromosome 1 or are repeats
$ awk -F"\t" '$1=="chr1" || $3=="repeat"' genes.gff

# In this example we will search for genes on chromosome 1 which start before base position 1100
$ awk -F"\t" '$1=="chr1" && $3=="gene" && $4 < 1100' genes.gff

# Note that -F"\t" can be omitted here. As we're searching the whole line, the column delimiter is not relevant.
$ awk '/repeat/' genes.gff

# Here we simply look for the inverse of the previous search.
$ awk '!/repeat/' genes.gff

Sanity checking files

# One thing we may want to do is check that each gene has been assigned a strand. To do this, we need to check whether column 7 contains either a + or - symbol.
$ awk -F"\t" '$3=="gene" && !($7 == "+" || $7 == "-")' genes.gff

# To do this, we simply need to check that the end coordinate of the feature is not less than the start coordinate.
$ awk -F"\t" '$5 < $4' genes.gff

# We do this using a special variable in awk, “NF”, which is the number of columns in a line. Remember to distinguish this from “$NF”, which refers specifically to the final column. This search will give no output if all features pass.
$ awk -F"\t" 'NF<8 || NF>9' genes.gff


Changing the output

# As a simple example, we will change the value in the source column (column 2) to a new value for each line.
$ awk -F"\t" '{$2="new_source"; print $0}' genes.gff

# This is achieved by adding "BEGIN{OFS="\t"}" to the code, as below. Before awk reads any lines of the file, it reads the BEGIN block of code, in this case, changing OFS to a tab character.
$ awk -F"\t" 'BEGIN{OFS="\t"} {$2="new_source"; print $0}' genes.gff


Exercises

  1. Looking at the file grep/exercises.fasta, write a grep command to only output the sequence names.
    • How many sequences does this file contain?
    • How many sequences contain unknown bases (denoted by “n” or “N”)?
    • Do any sequences have the same name? You don’t need to find the repeated names, just how many names are repeated.
      • Hint: You may need to look back at some earlier Unix commands.
  2. Looking at the files awk/exercises.bed, find the names of the contigs in the file.
    • How many contigs are there?
    • How many features are on the positive strand?
    • And, how many on the negative strand?
    • How many genes are there?
    • How many genes have no strand assigned to them? (i.e. no final column)
    • How many genes have repeated names? You don’t need to find the names.

Back to top

Loops

# We will use a for loop to run wc on the files in the directory loop_files/
$ for filename in loop_files/*; do wc ${filename}; done

# Next we will use a while loop to read a file line-by-line, and only print lines for chromosome 1 on the sense strand
$ while read -r chr start end name strand; do \
		if [[ $chr == "01" && $strand == "1" ]]; then \
		echo $chr $start $end $name $strand; \
		fi; \
		done < loop_files/file.1
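
# As another sketch of a for loop, we can print each file name as a header before showing
# its first two lines (the file names are simply whatever is in loop_files/):
$ for filename in loop_files/*; do echo "== ${filename} =="; head -2 ${filename}; done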

Back to top

BASH scripts

Your first script

# In a terminal window, navigate to your home directory and create a directory called scripts
$ cd
$ mkdir scripts
$ cd scripts

# Open a text editor to create your script. Do not use a word processor. An example is gedit. If you don’t already have a favourite text editor, run this:
$ gedit &

# In the editor window type ‘echo “Hello world!”’ and save the file with the name hello.sh.
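
# For reference, hello.sh only needs to contain that single line:
echo "Hello world!"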

# First check to see whether the file in place then run it.
$ ls hello.sh
$ bash hello.sh


Setting up a generic directory for scripts

# Add the following “shebang” line as the very first line of your script
#!/usr/bin/env bash

# chmod changes the permissions of the file
$ chmod +x hello.sh

# First we want to check what our PATH currently is.
$ echo $PATH

# We can modify the PATH environment variable in the current terminal
$ export PATH=$PATH:~/scripts

# To check the change has worked, open a new terminal and run your script with no location set.
$ hello.sh
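
# The export command above only changes the PATH for the current terminal. A sketch of
# making the change permanent (assuming you use bash and its ~/.bashrc start-up file):
$ echo 'export PATH=$PATH:~/scripts' >> ~/.bashrc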

$ cd ~/scripts
$ touch myscript.sh
$ chmod +x myscript.sh


Getting command line options and adding output text

# We can view this example using the cat command we’ve seen earlier
$ cd ~/Module_2_Linux_Scripting/bash_scripts/scripts
$ cat options_example.sh

$ ./options_example.sh test_file 2

# We have provided a second version of the script which is more readable
$ cat options_example2.sh
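
# The real options_example.sh is provided in the course files. As a rough sketch of the
# idea only (these variable names are illustrative, not taken from the course script), a
# script that takes a file name and a number of lines as its two arguments might look like:

#!/usr/bin/env bash
# first argument: a file name; second argument: how many lines to show
input_file=$1
num_lines=$2
echo "Showing the first ${num_lines} lines of ${input_file}:"
head -n ${num_lines} ${input_file}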


Exercises

  1. Write a script which takes a file name from the user. If the file exists, print a human-readable message telling the user how many lines the file has.
  2. Navigate to the base Module_2_Linux_Scripting directory. Use a loop to run the script written in exercise 1 on the files in the loop_files subdirectory.
  3. Write a script that takes a GFF filename as input. Make the script produce a summary of various properties of the file.
    • An example input file is provided called bash_scripts/exercise_3.gff.
    • Use your imagination as to what you want to summarise.
    • You may want to look back at the awk section of the manual for inspiration.

Back to top

UNIX quick reference guide

1. Looking at files and moving them around

command what is it doing
pwd Tell me which directory I’m in
ls What else is in this directory
ls .. What is in the directory above me
ls foo/bar/ What is inside the bar directory which is inside the foo/ directory
ls -lah foo/ Give the details (-l) of all files and folders (-a) using human readable file sizes (-h)
cd ../.. Move up two directories
cd ../foo/bar Move up one directory and down into the foo/bar/ subdirectories
cp -r foo/ baz/ Copy the foo/ directory into the baz/ directory
mv baz/foo .. Move the foo directory into the parent directory
rm -r ../foo remove the directory called foo/ from the parent directory
find foo/ -name "*.gff" find all the files with a gff extension in the directory foo/

2. Looking in files

command what is it doing
less bar.bed scroll through bar.bed
grep chrom bar.bed | less -S Only look at lines in bar.bed which have ‘chrom’ and don’t wrap lines (-S)
head -20 bar.bed show me the first 20 lines of bar.bed
tail -20 bar.bed show me the last 20 lines
cat bar.bed show me all of the lines (bad for big files)
wc -l bar.bed how many lines are there
sort -k 2 -n bar.bed sort by the second column in numerical order
awk '{print $1}' bar.bed | sort | uniq show the unique entries in the first column

3. Grep

command what is it doing
grep foo bar.bed show me the lines in bar.bed with ‘foo’ in them
grep foo baz/* show me all examples of foo in the files immediately within baz/
grep -r foo baz/ show me all examples of foo in baz/ and every subdirectory within it
grep '^foo' bar.bed show me all of the lines beginning with foo
grep 'foo$' bar.bed show me all of the lines ending in foo
grep -i '^[acgt]*$' bar.bed show me all of the lines which only have the characters a, c, g and t (ignoring their case)
grep -v foo bar.bed don't show me any lines with foo in them

4. awk

command what is it doing
awk '{print $1}' bar.bed just the first column
awk '$4 ~ /^foo/' bar.bed just rows where the 4th column starts with foo
awk '$4 == "foo" {print $1}' bar.bed the first column of rows where the 4th column is foo
awk -F"\t" '{print $NF}' bar.bed split columns on tabs (not spaces) and print the last column
awk -F"\t" '{print $(NF-1)}' bar.bed print the penultimate column
awk '{sum+=$2} END {print sum}' bar.bed print the sum of the second column
awk '/^foo/ {sum+=$2; count+=1} END {print sum/count}' bar.bed print the average of the second value of lines starting with foo

5. Piping, redirection and more advanced queries


grep -hv '^#' bar/*.gff | awk -F"\t" '{print $1}' | sort -u
# grep => -h: don't print file names
# -v: invert the match (don't show the lines that match)
# '^#': the pattern matches header rows starting with #, so -v removes them
# 'bar/*.gff': only look in the gff files in bar/
# awk => print the first column
# sort => -u: give me unique values


awk 'NR%10 == 0' bar.bed | head -20
# awk => NR: is the row number
# NR%10: is the modulo (remainder) of dividing by 10
# awk is therefore giving you every 10th line
# head => only show the first 20


awk '{l=($3-$2+1)}; (l<300 && $2>200000 && $3<250000)' exercises.bed
# Gives:
# contig-2 201156 201359 gene-67 24.7 -
# contig-4 245705 245932 gene-163 24.8 +
# Finds all of the lines with features less than 300 bases long which start
# after base 200,000 and end before base 250,000
# Note that this appears to have the action before the pattern. This is
# because we need to calculate the length of each feature before we use it
# for filtering. If they were the other way around, you'd get the line
# immediately after the one you want:


awk '(l<300 && $2>200000 && $3<250000) {l=($3-$2+1); print $0}' exercises.bed
# Gives:
# contig-2 201156 201359 gene-67 24.7 -
# contig-2 242625 243449 gene-68 46.5 +


6. A script

#!/usr/bin/env bash
set -e # stop running the script if there are errors
set -u # stop running the script if it uses an unknown variable
set -x # print every line before you run it (useful for debugging but annoying)

if [ $# -ne 2 ]
then
    echo "You must provide two files"
    exit 1 # exit the programme (any number > 0 reports that this is a failure)
fi

file_one=$1
file_two=$2

if [ ! -f $file_one ]
then
    echo "The first file couldn't be found"
    exit 2
fi

if [ ! -f $file_two ]
then
    echo "The second file couldn't be found"
    exit 2
fi

# Get the lines which aren't headers,
# take the first column and return the unique values
number_of_contigs_in_one=$(awk '$1 !~ /^#/ {print $1}' $file_one | sort -u | wc -l)
number_of_contigs_in_two=$(awk '/^[^#]/ {print $1}' $file_two | sort -u | wc -l)

if [ $number_of_contigs_in_one -gt $number_of_contigs_in_two ]
then
    echo "The first file had more unique contigs than the second"
    exit
elif [ $number_of_contigs_in_one -lt $number_of_contigs_in_two ]
then
    echo "The second file had more unique contigs"
    exit
else
    echo "The two files had the same number of contigs"
    exit
fi

7. Pro tips


8. Build your commands slowly

# check which column is which and if there are any headers
head -20 bar.bed

# have a look at the scores
head -20 bar.bed | awk '{print $5}'

# check the contigs don't look weird
awk '{print $1}' bar.bed | sort -u | less

# check the genes don't look weird
awk '{print $4}' bar.bed | sort -u | less

# check that I can spot genes
awk '$4 ~ /gene-/' bar.bed | head -20

# check I can find genes on contig-1
awk '($1 == "contig-1" && $4 ~ /gene-/)' bar.bed | head -20

# check my algorithm works on a subset of the data
head -20 bar.bed | awk '($1 == "contig-1" && $4 ~ /gene-/) {sum+=$5}; END {print sum}'

# apply the algorithm to all of the data
awk '($1 == "contig-1" && $4 ~ /gene-/) {sum+=$5}; END {print sum}' bar.bed


9. Which tool should I use?


Back to top


License

This work is licensed under a Creative Commons Attribution 4.0 International License.