Topic outline

  • General


    Tomaž Curk ( and Blaž Zupan (

    Teaching Assistant

    Rok Gomišček (

    Course Language

    English (including the homework and, by consent of students, the exam).


    The course starts on Wednesday, Oct 9, 2019. Lectures are each Wednesday from 11:15 (sharp!) to 14:00 at P22.


    The course grade is composed from grade on the homework and grade from the final exam. The grade for the homework will be composed from the grade from Rosalind challenges (50%) and grade from mini projects (50%). To grant approach to the final exam, students have to score higher than 60% on rosalind and higher than 60% with the projects. The course is completed successfully if both homework and exam scores are above 60%. Homeworks may include bonuses; we take these into account in the final grade of homework but do not consider them when grading the final exam. That is, you have to score better than 60% in the final exam to pass the course. The final course score in points is computed as max[min(HW+15, EX), min(HW, EX+15)], where HW is a score from homework and EX is a score from the exam, both expressed in the scale from 0 to 100. Example: student's score is 85% for homework, 65% at the exam, the final score in points is 80%. One more example: exam score was 90%, homework 65%, final grade in points is 80%. Score is rounded to an integer and then translated to a course grade by the following rule: below or equal to 60 -> 5, from 61 to 68 -> 6, from 69 to 76 -> 7, from 77 to 84 -> 8, from 85 to 92 -> 9, from 93 -> 10. Grade 5 is negative (fail) and grades 6 and above are positive (pass). There's no oral exam, except for students who would like to change their final grades (note: this could go both directions, students who will not be prepared at the oral exam will fail the course).


    Homework will include exercises on rosalind and three smaller projects, where solving rosalind will carry 40% and each project 20% of the homework points.

    Enroll to the course on rosalind with your name and surname. Submit your solutions to the tasks until the deadline. Tasks have different scores.

    Projects have a strict deadline. For each day after the grade is multiplied by 0.9.

    Project report has to be prepared in LaTeX (template; also available in Slovene: template). Students will have to submit both the report and the associated code. Please see individual homework for a description for submissions. 


    • Nello Cristianini, Matthew W. Hahn (2007) Introduction to computational genomics: A case study approach, Cambridge University Press, Cambridge, UK.
    • Durbin et al. (1998) Biological sequence analysis, Cambridge University Press (selected chapters)
    • Koza (1992), John R. Genetic programming: on the programming of computers by means of natural selection. Vol. 1. MIT press (Supplementary material on genetic algorithms).

    Fun Stuff to Read

    • Larry Gonick & Mark Wheelis (2005) The Cartoon Guide to Genetics, Harper, New York. (FRI library, Amazon).
    • James D. Watson, Andrew Berry (2004) DNA: The Secret of Life, Arrow Books, UK. (also in Slovene: DNK, skrivnost življenja, Modrijan, Ljubljana, 2007).

    Past Exams

    Exams in years 2012 to 2016. Notice that we have been updating the course every year, so that the exam topics can change.

    Python-Related Resources

    Exam dates

    Exam dates will be announced later in the semester.

    Other resources

  • Introduction to Molecular Biology

    A cell is a structural and functional unit of all living organisms. It is also the smallest unit of life that we consider to be alive. Cells participate in various functions that are all related to the synthesis and decay of proteins. Proteins are biochemical compounds with a chain structure of at least 100 bonded amino acids. They constitute 20% of the cell, the other 70% is water, and the remaining 10% are small molecules, like sugars and fatty acids. Proteins are building blocks of the cell, serve as enzymes to catalyze chemical reactions, or take a role as transmembrane elements. As enzymes, they are involved in metabolic processes. An example of such a metabolic process is glycolysis, a series of chemical reactions that breaks glucose into pyruvate and from which the cell obtains some energy stored in ATP. Proteins drive metabolic processes. But they are, typically within a few hours, degraded and decomposed to amino acids. There has to be machinery that builds proteins from its constituents, and that can read a recipe that determines the sequence of amino acids. Such recipes are stored in a genome. In brief and simplified: DNA in the genome gets transcribed (copied) to messenger RNA, which carries the information out of the cell's nucleus. There, the protein factories, ribosomes, translate the mRNA recipe into a protein. This process is often referred to as a central dogma of molecular biology.

    Lecture Notes
    Additional Reading
    • The First Look at the Genome

      In the past decade, molecular biologists have sequenced the genomes of many different organisms. This includes human, mice, pig, different plants, and all important model organisms, such as yeast and fly. The sequences are now available in databases such as those at NCBI or EBI. We can represent these sequences with models to reveal their statistical properties, structure, and patterns. The simplest model is a multinomial model defined through probability distribution over the alphabet. This model assumes that the nucleotides are independent and identically distributed across the sequence; the probability of a particular nucleotide in this model does not depend on the position. We can infer the model from the sequence simply by counting the occurrences of the nucleotides. Similar can be done with subsequences, like dinucleotides or k-mers. Estimating probabilities of k-mers and estimating the odds that they occur in the genome or in the parts of the genome provides us a first insight into the structure of the DNA.

      Lecture Notes
      Additional Reading
      • Cristianini & Hahn (2007), Introduction to Computational Genomics, Chapter 1.
      • Where are the Genes?

        Three nucleotides form a codon and code for one aminoacid. Special codons, called start and stop codons, mark the area that will be translated to the protein. A sequence of codons is defined by the initial nucleotide, from where on the codons are nonoverlaping and continuous sequence of triplets of nucleotides. Every sequence can therefore be read in three reading frames, each producing three different sequences of aminoacids. Taking into consideration a complementary sequence, each DNA sequence gives rise to six different reading frames, three in forawrd (sense) direction and three in reverse (antisense) direction. The region between start and end codons is called open reading frame (ORF).

        The presence of ORF in a sequence does not guarantee that this part of the DNA will be transcribed and translated to a protein. ORF needs to be "embedded" within a gene, which on its own has structural features other than ORF that include promotor, TATA box, untranslated regions, and other. A set of ORF's in a genome provides for a set of candidates for coding sequences. To limit this set, we can (partially) rely on ORF's lenght. We observed that a randomized sequence would not yield long ORFs, and most of the ORFs would be shorter than 60 codons. Inspectio of gene models of various organisms tells us that average coding sequence length could be much longer. So filtering out ORFs of short length would increase the quality of prediction of coding sequences and hence genes.

        A threshold of a lenght of ORF can be derived theoretically from (uniform) multinomial model, or by examining the distribution of ORF length in a randomized sequence. In both cases, a parameter of the approach is the error that we are willing to make (alpha, often set to 0.01).

        Lecture Notes
        Additional Reading
        • Cristianini & Hahn (2007), Chapter 2.
        • Sequence Alignment

          Random changes of genetic material and subsequent fitness-based selection of the organisms are drivers of evolution. Genes in different species may originate from genes of common ancestors, but their sequence, due to evolution, is different. Such genes are called orthologous genes. Similarly, genes in the same species may have evolved from the same gene by gene duplications and then subsequent mutations of the gene sequence. These genes are refered to as paralogous genes. Both are the types of the homologous genes, that is, genes that have evolved from the same parental sequence.

          Knowledge of homology may help us understand evolution, but also assist us in tasks such as function prediction, gene finding, sequence assembly and assignment of gene function. Homologous genes may have similar sequences, and we can discover them through sequence comparison. To estimate the sequence similarity, we need to align the sequences and score the alignment. In the most trivial case the alignment score is derived from alignments of each of the nucleotide, that is, from a scoring function that assesses each particular element of alignment. Sequence similarity is then the sum of individual scoring across all alignment positions. We would like to find an alignment with maximal alignment score. An algorithm that effectively performs such search is called Needleman-Wunsch algorithm. It contributes its speed and elegance to the approach called dynamic programming.

          Lecture Notes
          Textbook Chapters
          • Durbin et al. (1998) Biological sequence analysis, Chapter 2.
          • Cristianini & Hahn (2007), Chapter 3.

          Additional Material
          • Sequence Alignment: Few More Tricks

            There are several types of "complications" we can think about when considering sequence alignment. The last time we have considered only global alignment of two sequences. How can we find optimal local alignments, where we only care about alignments of subsequences? Can we extend the alignment procedures to simultaneous alignment of a set of sequences? And can we score alignment gaps more gently by penalizing an opening of the gap but reducing the penalty that stems from the gap lenght? And finally, how do we score penalties on mismatched aminoacids? Oh, so many questions! Yet, solutions of these are not difficult, and many (again) rely on the beauty of dynamic programming.

            Lecture Notes
            Textbook Chapters
            • Durbin et al. (1998) Biological sequence analysis, Chapter 2.
            • Cristianini & Hahn (2007), Chapter 3.

            • Hidden Markov Models

              Introduction to Markov Chains, Hidden Markov Models and Decoding

              Genome consists of structural elements like CpG islands, coding sequences, promoters, the region around splicing sites, introns and exons. Structural elements have their own, characteristic sequence properties. One of this is also the distribution of nucleotides. We would like to devise models that incorporate these distributions but change them according to which structural elements they model. To make the model aware of the structural element, we assign it a state. When modeling CpGs island, for instance, the model would have two states: inside the CpG island, and outside the CpG island. In each state the model would emit a DNA sequence from the distribution that will be characteristic to the state. Notice that we are now dealing with two sequences, a DNA sequence (emission) and a sequence of states (a path). In this lecture, we think about representation of such models, computation of joint probabilities, and decoding of a path from a given emission sequence.

              Lecture Notes
              Textbook Chapters
              • Cristianini & Hahn (2007), Chapter 4 (motivation, examples, biref explanation of Viterbi algorithm).
              • Durbin et al. (1998) Chapter 3 (an excellent introduction on Hidden Markov Models and nice description of related algorithms)

              • Estimation of genetic distances

                We discuss on evolution, point mutations (so-called single nucleotide polymorphisms, or SNPs) and their utility in estimation of genetic distances. Mutations can be lethal, useful (drivers of evolution) or neutral, that is, those that have no observable effect. Measurement of genetic distances rely on neutral mutations. A good model for analysis of variation is mitochondrial DNA and it's D-loop (cca 900bp), where point mutations are considered neutral and can accumulate. Since over time, mutations can cancel each other (e.g., mutation of T to C and then back to T), this should be taken into account when computing distances. We discuss how this is done in Jukes-Cantor model and use it in a simple simulation to show its workings.

                Lecture Notes
                Textbook Chapters
                • Cristianini & Hahn (2007), Chapter 5.

              • Inference of Phylogenetic Trees

                The methods we described so far (sequence alignment, genetic distance, Jukes-Cantor correction) allow us to measure the distance between DNA sequences. These sequences may belong to different organisms within or between different species (e.g., paralogous, homologous genes). Our goal is to use genetic distances to infer the relations between the different organisms or species. Phylogeny deals with such questions on evolutionary relationships. A phylogenetic tree linking related organisms (taxa) is usually drawn to present the inferred evolutionary relations. The length of the edge indicates the evolutionary distance between two taxa.

                Before the advent of molecular biology and associated genome sequencing technologies, phylogenetics used morphological features to estimate genetic distances. Nowadays, comparison of DNA sequences is used for inference of phylogenetic trees.

                Two basic methods to build a phylogenetic tree are: unweighted pair group method with arithmetic mean (UPGMA) and neighbour-joining (NJ). UPGMA assumes a constant rate of evolution, which affects the whole organisms as well as its' parts of the genome equally. It also assumes that all species with a common ancestor are evolutionary equally apart. NJ addresses the last issue and infers trees that more realistically reflect the measured genetic distances. More advanced methods consider the minimum evolution (maximum parsimony), where one goal is to minimize the length of the paths in a given tree; they apply the Occam razor when inferring evolutionary trees.

                Lecture Notes
                Textbook Chapters
                • Cristianini & Hahn (2007), Chapter 7 (includes an interesting example on the evolution of SARS).
                • Durbin et al. (1998) Biological sequence analysis, Chapter 7 (systematic description and derivation of algorithms).
                • Examples (from Zvelebil and Baum: Understanding bioinformatics, Garland Science, 2008): UPGMA and NJ.
                • Carroll et al. Nature 2015 (Ebola example).