Genome Academy

The manual and programme for Wellcome Connecting Science's Genome Academy

View the Project on GitHub WCSCourses/genomeacademy

Welcome to Bioinformatics - Part 2

Introduction to Multiple Sequence Alignments and Phylogeny

This tutorial has been modified from a tutorial delivered at Scifest Africa by the Student Council of the South African Society for Bioinformatics - SASBI-sc

You will explore taste receptor genes across different species!

Task 1: Retrieve sequences:

Obtain the protein sequences for TAS1R3, TAS1R2 and TAS1R1 for the organisms:

Procedure:

  1. Go to https://www.ensembl.org/biomart/martview
  2. Select Dataset on the left menu.
  3. Select Ensembl Genes 111 under the -CHOOSE DATABASE- dropdown menu.
  4. Select Human Genes (GRCh38.p14) under the -CHOOSE DATASET- dropdown menu.
  5. On the left menu, select Filters to restrict the results to the genes you are interested in.
  6. Expand GENES and tick ID list limit [Max 500 advised].
  7. Enter the list of genes in the textbox provided:
    TAS1R1
    TAS1R2
    TAS1R3
    Then select WiKi-Gene Name(s) from the dropdown menu above the textbox.

  8. On the left menu, select Attributes to get the features of your gene set.
  9. Tick Sequences, then expand the Header Information section.
  10. Untick all the ticked boxes.
  11. Under the Gene Information section tick:
    • Gene Stable ID
    • Gene Name
  12. Under the Exon Information section, tick CDS Length.
  13. Click on Results towards the top of the page - you will get a list of protein sequences for the gene list you provided.
  14. Export results to a file by selecting File then FASTA from the dropdown menus in the Export all results to section.
  15. Click on Go
  16. A “mart_export.txt” file will download. You can rename it after the species you started with (for example, human_mart_export.txt).

Repeat steps 4 to 16, changing the species under the -CHOOSE DATASET- dropdown menu each time, to get the required protein sequences for the three species listed below.

Then for these species you can right click and retrieve as follows

Copy and paste all the sequences into a single file and call it all_sequences.fasta.

Alternatively, if you kept the mart_export.txt suffix when renaming the files, you can combine them with the command:

cat *mart_export.txt > all_sequences.fasta
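
After combining the files, it can help to confirm that every sequence made it into all_sequences.fasta. The sketch below counts FASTA records by counting header lines (lines starting with ">"). The function name count_fasta_records is a hypothetical helper, not part of any tool used in this tutorial.

```shell
# count_fasta_records FILE
# Print the number of FASTA records in FILE, i.e. the number of
# lines that start with ">" (hypothetical helper for illustration).
count_fasta_records() {
  grep -c '^>' "$1"
}

# Example use, assuming the combined file from the step above exists:
#   count_fasta_records all_sequences.fasta
```

The count should equal the sum of the counts for the individual species files. A gene may return more than one protein sequence, so the total is not necessarily three per species.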

Task 2: Data Cleaning

The Ensembl gene identifier prefix can be used to work out which organism each gene came from:

Do last: ENSG0 Homo sapiens (Human)

Open your all_sequences.fasta file with gedit.

Press Ctrl+H to open the Find and Replace menu, or press Ctrl+F and then select Replace.

In the search field, enter an identifier prefix such as “ENSTRU”, and in the replace field, enter the matching species name, here Takifugu rubripes (Fugu). Repeat this for each prefix and its species.

Finally, the web aligner does not like spaces in FASTA header names, so use the replace tool to replace “ ” (a single space) with _
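
The same find-and-replace steps can be sketched on the command line with sed, for anyone who prefers it to gedit. This is a minimal sketch under a few assumptions: the function name relabel_headers and the output file names are hypothetical, the prefix is a plain alphanumeric string, and the species name contains no "/", "&" or backslash characters. Unlike a global replace in an editor, it only changes header lines (those starting with ">"), so the sequences themselves are never touched.

```shell
# relabel_headers PREFIX SPECIES < input.fasta > output.fasta
# On header lines only: replace PREFIX with SPECIES, then turn
# every space in the header into an underscore.
relabel_headers() {
  sed -e "/^>/ s/$1/$2/g" -e '/^>/ s/ /_/g'
}

# Example use (file names are illustrative; human is done last,
# as in the note above):
#   relabel_headers ENSTRU "Takifugu rubripes (Fugu)" < all_sequences.fasta > step1.fasta
#   relabel_headers ENSG0 "Homo sapiens (Human)" < step1.fasta > all_sequences_clean.fasta
```

Writing to a new file each time, rather than editing in place, keeps the original download available if a replacement goes wrong.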

Sequence Alignment

Task 3: Perform a multiple sequence alignment using the sequences you retrieved.

Procedure:

  1. Go to https://www.ebi.ac.uk/Tools/msa/clustalo/
  2. Upload the file with all the sequences you have downloaded (all_sequences.fasta) or copy all the contents of the file and paste it into the box. To Upload: Click on Choose File, navigate to the file location on the computer, then click Upload.
  3. Wait for the job to complete
  4. Look at the alignment of all the sequences. How does this compare to a pairwise alignment of two sequences?
  5. Look at the tree of these sequences. How do they compare?
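
When comparing sequences in the alignment, one simple measure is percent identity: the share of aligned columns in which two sequences carry the same residue. The sketch below assumes you have copied two aligned rows of equal length from the alignment, with "-" marking gaps. The function name percent_identity is hypothetical, and the convention used here (columns that are gaps in both rows are skipped, columns with a gap in one row count as mismatches) is only one of several in use, so the values may differ from those reported by other tools.

```shell
# percent_identity ALIGNED_ROW_A ALIGNED_ROW_B
# Print the percentage of compared columns where both rows have
# the same residue. Columns that are gaps in both rows are skipped.
percent_identity() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = length(a)
    if (length(b) != n) {
      print "error: aligned rows differ in length" > "/dev/stderr"
      exit 1
    }
    for (i = 1; i <= n; i++) {
      x = substr(a, i, 1)
      y = substr(b, i, 1)
      if (x == "-" && y == "-") continue   # gap in both rows: ignore
      cols++
      if (x == y) same++                   # identical residues
    }
    if (cols == 0) { print "0.0"; exit }
    printf "%.1f\n", 100 * same / cols
  }'
}

# Example use with two short made-up aligned rows:
#   percent_identity "MKT-AL" "MRTQAL"
```

Sequences that sit close together in the tree would be expected to show higher percent identity to each other than to sequences on distant branches.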

Extra task: use Ensembl BioMart to choose your favourite organism from its selection of species and retrieve its taste receptor genes. How do these compare to the other species we have seen today?
