Topic outline

  • General


    Tomaž Curk ( and Blaž Zupan (

    Teaching Assistant

    Rok Gomišček (

    Course Language

    English (including the homework and, by consent of students, the exam).


    The course starts on Wednesday, Oct 9, 2019. Lectures are each Wednesday from 11:15 (sharp!) to 14:00 at P22.


    The course grade is composed from grade on the homework and grade from the final exam. The grade for the homework will be composed from the grade from Rosalind challenges (50%) and grade from mini projects (50%). To grant approach to the final exam, students have to score higher than 60% on rosalind and higher than 60% with the projects. The course is completed successfully if both homework and exam scores are above 60%. Homeworks may include bonuses; we take these into account in the final grade of homework but do not consider them when grading the final exam. That is, you have to score better than 60% in the final exam to pass the course. The final course score in points is computed as max[min(HW+15, EX), min(HW, EX+15)], where HW is a score from homework and EX is a score from the exam, both expressed in the scale from 0 to 100. Example: student's score is 85% for homework, 65% at the exam, the final score in points is 80%. One more example: exam score was 90%, homework 65%, final grade in points is 80%. Score is rounded to an integer and then translated to a course grade by the following rule: below or equal to 60 -> 5, from 61 to 68 -> 6, from 69 to 76 -> 7, from 77 to 84 -> 8, from 85 to 92 -> 9, from 93 -> 10. Grade 5 is negative (fail) and grades 6 and above are positive (pass). There's no oral exam, except for students who would like to change their final grades (note: this could go both directions, students who will not be prepared at the oral exam will fail the course).


    Homework will include exercises on rosalind and three smaller projects, where solving rosalind will carry 40% and each project 20% of the homework points.

    Enroll to the course on rosalind with your name and surname. Submit your solutions to the tasks until the deadline. Tasks have different scores.

    Projects have a strict deadline. For each day after the grade is multiplied by 0.9.

    Project report has to be prepared in LaTeX (template; also available in Slovene: template). Students will have to submit both the report and the associated code. Please see individual homework for a description for submissions. 


    • Nello Cristianini, Matthew W. Hahn (2007) Introduction to computational genomics: A case study approach, Cambridge University Press, Cambridge, UK.
    • Durbin et al. (1998) Biological sequence analysis, Cambridge University Press (selected chapters)
    • Koza (1992), John R. Genetic programming: on the programming of computers by means of natural selection. Vol. 1. MIT press (Supplementary material on genetic algorithms).

    Fun Stuff to Read

    • Larry Gonick & Mark Wheelis (2005) The Cartoon Guide to Genetics, Harper, New York. (FRI library, Amazon).
    • James D. Watson, Andrew Berry (2004) DNA: The Secret of Life, Arrow Books, UK. (also in Slovene: DNK, skrivnost življenja, Modrijan, Ljubljana, 2007).

    Past Exams

    Exams in years 2012 to 2016. Notice that we have been updating the course every year, so that the exam topics can change.

    Python-Related Resources

    Exam dates

    January 23rd at 15:00 in P01

    February 5th at 14:00 in P22

    Other resources

  • Introduction to Molecular Biology

    A cell is a structural and functional unit of all living organisms. It is also the smallest unit of life that we consider to be alive. Cells participate in various functions that are all related to the synthesis and decay of proteins. Proteins are biochemical compounds with a chain structure of at least 100 bonded amino acids. They constitute 20% of the cell, the other 70% is water, and the remaining 10% are small molecules, like sugars and fatty acids. Proteins are building blocks of the cell, serve as enzymes to catalyze chemical reactions, or take a role as transmembrane elements. As enzymes, they are involved in metabolic processes. An example of such a metabolic process is glycolysis, a series of chemical reactions that breaks glucose into pyruvate and from which the cell obtains some energy stored in ATP. Proteins drive metabolic processes. But they are, typically within a few hours, degraded and decomposed to amino acids. There has to be machinery that builds proteins from its constituents, and that can read a recipe that determines the sequence of amino acids. Such recipes are stored in a genome. In brief and simplified: DNA in the genome gets transcribed (copied) to messenger RNA, which carries the information out of the cell's nucleus. There, the protein factories, ribosomes, translate the mRNA recipe into a protein. This process is often referred to as a central dogma of molecular biology.

    Lecture Notes
    Additional Reading
    • The First Look at the Genome

      In the past decade, molecular biologists have sequenced the genomes of many different organisms. This includes human, mice, pig, different plants, and all important model organisms, such as yeast and fly. The sequences are now available in databases such as those at NCBI or EBI. We can represent these sequences with models to reveal their statistical properties, structure, and patterns. The simplest model is a multinomial model defined through probability distribution over the alphabet. This model assumes that the nucleotides are independent and identically distributed across the sequence; the probability of a particular nucleotide in this model does not depend on the position. We can infer the model from the sequence simply by counting the occurrences of the nucleotides. Similar can be done with subsequences, like dinucleotides or k-mers. Estimating probabilities of k-mers and estimating the odds that they occur in the genome or in the parts of the genome provides us a first insight into the structure of the DNA.

      Lecture Notes
      Additional Reading
      • Cristianini & Hahn (2007), Introduction to Computational Genomics, Chapter 1.
      • Where are the Genes?

        Three nucleotides form a codon and code for one aminoacid. Special codons, called start and stop codons, mark the area that will be translated to the protein. A sequence of codons is defined by the initial nucleotide, from where on the codons are nonoverlaping and continuous sequence of triplets of nucleotides. Every sequence can therefore be read in three reading frames, each producing three different sequences of aminoacids. Taking into consideration a complementary sequence, each DNA sequence gives rise to six different reading frames, three in forawrd (sense) direction and three in reverse (antisense) direction. The region between start and end codons is called open reading frame (ORF).

        The presence of ORF in a sequence does not guarantee that this part of the DNA will be transcribed and translated to a protein. ORF needs to be "embedded" within a gene, which on its own has structural features other than ORF that include promotor, TATA box, untranslated regions, and other. A set of ORF's in a genome provides for a set of candidates for coding sequences. To limit this set, we can (partially) rely on ORF's lenght. We observed that a randomized sequence would not yield long ORFs, and most of the ORFs would be shorter than 60 codons. Inspectio of gene models of various organisms tells us that average coding sequence length could be much longer. So filtering out ORFs of short length would increase the quality of prediction of coding sequences and hence genes.

        A threshold of a lenght of ORF can be derived theoretically from (uniform) multinomial model, or by examining the distribution of ORF length in a randomized sequence. In both cases, a parameter of the approach is the error that we are willing to make (alpha, often set to 0.01).

        Lecture Notes
        Additional Reading
        • Cristianini & Hahn (2007), Chapter 2.
        • Sequence Alignment

          Random changes of genetic material and subsequent fitness-based selection of the organisms are drivers of evolution. Genes in different species may originate from genes of common ancestors, but their sequence, due to evolution, is different. Such genes are called orthologous genes. Similarly, genes in the same species may have evolved from the same gene by gene duplications and then subsequent mutations of the gene sequence. These genes are refered to as paralogous genes. Both are the types of the homologous genes, that is, genes that have evolved from the same parental sequence.

          Knowledge of homology may help us understand evolution, but also assist us in tasks such as function prediction, gene finding, sequence assembly and assignment of gene function. Homologous genes may have similar sequences, and we can discover them through sequence comparison. To estimate the sequence similarity, we need to align the sequences and score the alignment. In the most trivial case the alignment score is derived from alignments of each of the nucleotide, that is, from a scoring function that assesses each particular element of alignment. Sequence similarity is then the sum of individual scoring across all alignment positions. We would like to find an alignment with maximal alignment score. An algorithm that effectively performs such search is called Needleman-Wunsch algorithm. It contributes its speed and elegance to the approach called dynamic programming.

          Lecture Notes
          Textbook Chapters
          • Durbin et al. (1998) Biological sequence analysis, Chapter 2.
          • Cristianini & Hahn (2007), Chapter 3.

          Additional Material
          • Sequence Alignment: Few More Tricks

            There are several types of "complications" we can think about when considering sequence alignment. The last time we have considered only global alignment of two sequences. How can we find optimal local alignments, where we only care about alignments of subsequences? Can we extend the alignment procedures to simultaneous alignment of a set of sequences? And can we score alignment gaps more gently by penalizing an opening of the gap but reducing the penalty that stems from the gap lenght? And finally, how do we score penalties on mismatched aminoacids? Oh, so many questions! Yet, solutions of these are not difficult, and many (again) rely on the beauty of dynamic programming.

            Lecture Notes
            Textbook Chapters
            • Durbin et al. (1998) Biological sequence analysis, Chapter 2.
            • Cristianini & Hahn (2007), Chapter 3.

            • Hidden Markov Models

              Introduction to Markov Chains, Hidden Markov Models and Decoding

              Genome consists of structural elements like CpG islands, coding sequences, promoters, the region around splicing sites, introns and exons. Structural elements have their own, characteristic sequence properties. One of this is also the distribution of nucleotides. We would like to devise models that incorporate these distributions but change them according to which structural elements they model. To make the model aware of the structural element, we assign it a state. When modeling CpGs island, for instance, the model would have two states: inside the CpG island, and outside the CpG island. In each state the model would emit a DNA sequence from the distribution that will be characteristic to the state. Notice that we are now dealing with two sequences, a DNA sequence (emission) and a sequence of states (a path). In this lecture, we think about representation of such models, computation of joint probabilities, and decoding of a path from a given emission sequence.

              Lecture Notes
              Textbook Chapters
              • Cristianini & Hahn (2007), Chapter 4 (motivation, examples, biref explanation of Viterbi algorithm).
              • Durbin et al. (1998) Chapter 3 (an excellent introduction on Hidden Markov Models and nice description of related algorithms)

              • Estimation of genetic distances

                We discuss on evolution, point mutations (so-called single nucleotide polymorphisms, or SNPs) and their utility in estimation of genetic distances. Mutations can be lethal, useful (drivers of evolution) or neutral, that is, those that have no observable effect. Measurement of genetic distances rely on neutral mutations. A good model for analysis of variation is mitochondrial DNA and it's D-loop (cca 900bp), where point mutations are considered neutral and can accumulate. Since over time, mutations can cancel each other (e.g., mutation of T to C and then back to T), this should be taken into account when computing distances. We discuss how this is done in Jukes-Cantor model and use it in a simple simulation to show its workings.

                Lecture Notes
                Textbook Chapters
                • Cristianini & Hahn (2007), Chapter 5.

              • Inference of Phylogenetic Trees

                The methods we described so far (sequence alignment, genetic distance, Jukes-Cantor correction) allow us to measure the distance between DNA sequences. These sequences may belong to different organisms within or between different species (e.g., paralogous, homologous genes). Our goal is to use genetic distances to infer the relations between the different organisms or species. Phylogeny deals with such questions on evolutionary relationships. A phylogenetic tree linking related organisms (taxa) is usually drawn to present the inferred evolutionary relations. The length of the edge indicates the evolutionary distance between two taxa.

                Before the advent of molecular biology and associated genome sequencing technologies, phylogenetics used morphological features to estimate genetic distances. Nowadays, comparison of DNA sequences is used for inference of phylogenetic trees.

                Two basic methods to build a phylogenetic tree are: unweighted pair group method with arithmetic mean (UPGMA) and neighbour-joining (NJ). UPGMA assumes a constant rate of evolution, which affects the whole organisms as well as its' parts of the genome equally. It also assumes that all species with a common ancestor are evolutionary equally apart. NJ addresses the last issue and infers trees that more realistically reflect the measured genetic distances. More advanced methods consider the minimum evolution (maximum parsimony), where one goal is to minimize the length of the paths in a given tree; they apply the Occam razor when inferring evolutionary trees.

                Lecture Notes
                Textbook Chapters
                • Cristianini & Hahn (2007), Chapter 7 (includes an interesting example on the evolution of SARS).
                • Durbin et al. (1998) Biological sequence analysis, Chapter 7 (systematic description and derivation of algorithms).
                • Examples (from Zvelebil and Baum: Understanding bioinformatics, Garland Science, 2008): UPGMA and NJ.
                • Carroll et al. Nature 2015 (Ebola example).

              • Genome assembly

                Assembling (reconstructing) genomes from short reads (k-mers) is a computationally challenging problem. We learn about two ways this problem is addressed. Both approaches rely on graphs to reconstruct the genome. Historically, k-mers were placed into nodes and edges connected overlapping k-mers. There, the goal was to find a path through the graph, which would visit every node exactly once. Solving this problem (Hamiltonian cycle) is a NP-complete problem, and impractical for graphs with thousands of nodes (an average genome assembly problem includes millions of k-mers). However, if k-mers are placed on edges, and nodes are (k-1)-mers, we then get a de Bruijn graph. There, the goal is to find a path through the graph, where each edge is traversed exactly once. This problem (Eulerian cycle or tour) can be solved in linear time, which allows the assembly of large genomes.

                Lecture Notes
                Textbook Chapters
                Additional Material
              • Gene expression and profiling

                Genes, quantity of mRNA, protein concentrations, contentrations of metabolites and characteristics of phenotypes are all subject to measurements. Biotechnology has in the past couple of decades created tools when we can measure these elements in parallel. Prominent examples of such techology are microarrays that can simultaneously assess the expression of the entire set of genes in the genome. We can now profile different tissues (say, maling or benign) with vectors of gene expressions, or can characterize genes by measuring their expression unde different environmental conditions. In either way, we obtain profiles, like tissue profiles, or gene expression profiles.

                Profile is thus a vector of measurements, with wich we have characters objects of our interest. We can characterize genes with their expression at different environmental conditions. Or characterize tissues with their gene expression. Or disseases with a vector of their characteristic genes. We can now compare profiles based on how similar they are to each other. Eucliedian distance, Pearson correlatin or any similar measure can be used to compare the profiles. If the profiles are just sets of items, we may use Jaccard distance.

                We would now like to know what is the structure of the object space. How different are their profiles. Do they form clusters or groups. We already know how to perform hiearchical clustering, and can use this method to reason about similarity of objects. In a different technique, called multi-dimensional scaling, we would like to present all the objects on the Euclidean plane. We would like to construct this projections so that, to a maximal degree, we retain the distances between the objects (that is, the distances between the profiles). If the two objects are similar, they should appear close to each other in the projection. If they are different, their distance in projection plane should be large. During the lectures, we saw that we can construct this projection by computing a gradient of optimization function, placing the objects randomly on the projection and then gradually correcting their position toward a more optimal one w.r.t. criteria function using gradient descent.

                Reading material

                • Gene set enrichment

                  Gene profiling and estimation of distances between profiles can lead to discovery of gene groups. Each gene within a group would have a similar profile to the other genes from the same group. We can now be interested if the genes from the same group have something else in common. Are their engaged in the same functions? Are they expressed in the same tissues, or at the same locations in the cell? Are they characteristic to the same dissease? To answer these questions, we require extra information. These can be obtained from various public data sources. For example, gene ontology classifies bilogical functions and processes, and for each term used in the classification lists associated genes. Genes of which proteins form protein complexes are available in String. Common dissease are associated to genes in OMIM. From all these and similar data sources we can obtain named gene sets, that is, a group of genes associated to some concept (like dissease, protein complex, or biological function). Given the profile-based gene group, it is now are tasks to go thrugh all of the gene sets and report on those with substantial overlap. That is, we would expect that concepts that are interesting and related to specific gene profile include many genes from profile-based gene group.

                  To exclude random associations, that is, associations that emerge by chance, we can employ a procedure called gene set enrichment analysis. In enrichment analysis, we estimate what is the probability a given or bigger overlap between profile-based group and a gene set. The assessment is perfomed considering hypergeometric distribution. Given a total set of N genes (e.g., genes in the genome), size n of the gene set, size K of the profile-based group, we would like to estimate what is the probability that k or more genes from profile-based group belong also to the gene set. Here, k is the observed number of genes that are both in profile-based group and a gene set. Our target is finding the gene sets where this probability is very low. We say that such sets are enriched with genes from our profile-based group.

                  Analiza obogatenosti skupin iz GO

                  On the figure: Gene Ontology gene set enrichment (molecular function). "Reference" gives a number of the genes with this function in the entire genome, and "Cluster" refers to a number of genes in a user-provided cluster that perform an associated function.

                  Notice that above we have described only a gene set enrichment analysis. Similar technique can be used for any analysis of clusters for which we have additional information (annotation) on the elements of clusters. Say, which twitter hashtags are characteristic for my facebook friends?

                • Gene expression regulation and inference of gene networks

                  Gene is expressed when its information is used in the synthesis of a protein. Gene expression is regulated. The cell does not need all the proteins at once, and protein sythesis takes energy and material, so it's better be used when needed. One of the main (but not the only!) means of gene expression regulation is through transcription factors. These are proteins that are required for RNA polymerase to transcribe the gene. Transcription factors bind to the promotor region of the gene (before ORF and just in front of UTR region). The presence of transcription factors can be required for transcription to take place, but some may also prohibit the transcription. We that talk about excitation and inhibition of transcription, respectively. As transcription factors are proteins, their sythesis is also regulated. The entire gene regulation system can thus be very complex and is often represented as directed network. Nodes in the network are genes, and directed links (arrows for excitation or blocked pointers for inhibition) tell us about their influence to transcription of the genes they regulate. Such networks are (for some strange reason) in biology referred to as pathways.

                  Regulacijska mreža za agregacija socialne amebe D. discoideum

                  Figure above show an example of regulation network. This network regulates the aggregation of social amoeba D. discoideum; the phenotype is represented as the final node in the network. The green links in the network denote a positive effect (excitation), and red ones a negative effect (inhibition) of one gene to another. We can observe that gene pkaC positively influences aggregation. The more the pkaC is expressed, the faster the social amoebas aggregate. Genes pufA and pkaR inhibit the expression of pkaC, and thus through it inhibit the aggregation.

                  This network is very abstract. It tells nothing about the mechanisms of regulation and it ignores time. Yet, it could be very informative to biologists and may raise several questions (hypothesis) that need to be tested. Biologists can namely "fix" the expression of a gene and check what is happening with the system. One way of "fixing" the expression is by knocking out a gene from the genome. In this way, the gene will never be expressed. We call such an organism a knock-out mutant. A different, but more complicated way of mutation is overexpression of the gene, forcing the gene to be expressed, often more than in normal conditions. We can then compare the phenotypes of mutated organisms to the organism withouth the mutations, which is commonly referred to as a wild-type organism.

                  For example, what happens if we knock-out gene pkaC? We denote such mutants as pkaC- and observe that these do not aggregate. Of course, a protein from pkaC is necessary for aggregation! How about overexpressing pufA instead? In pufA+ mutant the expression of pkaC will be reducedl; there will be no aggregation or these will be reduced. For the third example, consider pufA-. The inhibition of pkaC will be lowered (compared to the wild-type), the production of related protein will go up, and the aggregation will be enhanced.

                  If we fix a gene in the network, the upstream genes (those that regulate its expression) will have no effect on the phenotype. The fixed gene will block any influence of those genes. In pkaC- mutatnt and additional change of other genes in the network these later will have no effect and the organisms will never aggregate. Another example: consider pkaR- mutant. here, the aggregation is enhanced due to decreased inhibition of pkaC. Let's add another mutation, regA-. Any change? No. Expression of pkaR was already fixed (the gene was knocked-out), so changes in regA do not change the phenotype.

                  Logical reasoning about the phenotype of a double mutant, an organism where two genes are mutated, leads to a method called epistasis analysis. An upstream gene (the one that is in the pathway closer to the phenotype) can block downstream genes. Gene pkaC can block acaA, gene pkaR can block regA. We refer to the blocking (upstream gene) as epistatic gene. Gene pkaC is epistatic to acaA, pkaR is epistatic to regA. Let us denote three different aggregation phenotype as "reduced", "normal" and "raised". Consider the following three mutants and their related phenotype:

                  pufA- ==> raised
                  yakA- pufA- ==> raised
                  yakA- ==> reduced

                  The phenotype of yakA- pufA- mutant is the same as the phenotype of pufA-, and this happens despitea phenotype of a mutant yakA- which is completely different from that of pufA-;  mutation of yakA has no effect in the pufA- background (double mutant). This means that the path from gene yakA to phenotype has to pass through pufA: gene pufA is thus epistatic to  yakA. Knocking out pufA has a positive effect on the phenotype, hence expression of pufA is not required for aggregation and the corresponding regulation is negative. This is different with yakA-, where upon knock-out the phenotype is reduced. The influence of yakA to phenotype is thus positive. yakA and pufA have an opposite effect, hence we can assert that  the former inhibits the later: yakA -| pufA.

                  For another example, consider the following triplet of experiments:

                  pufA+ ==> reduced
                  pkaR- ==> raised
                  pufA+ pkaR- ==> normal

                  Phenotype of a double mutant is different from the phenotype of the two single mutants, hence neither pufA nor pkaR block each other. These two genes have to have parallel pathways to the phenotype.

                  How about the following set of experiments?

                  regA+ ==> reduced
                  pkaR- ==> raised
                  pkaR- regaA+ ==> raised

                  The phenotype of the double mutant is here the same as the phenotype of one single mutant (pkaR-), hence gene pkaR is epistatic. We can also infer that the influence between the two genes is positive, since their influence to the phenotype is of the same sign. We can thus hypothesize that regA -> pkaR.

                  All of the examples above are also consistent with D. discoideum network from the Figure. In real life, this is also how such networks are inffered. Researchers use double and single mutants to reconstruct the hidden network. Reconstruction is based on hypothesis of epistatic relations and then construction of network that is consistent with partial relations discovered from the experiments. Transitivity should hold as well. If we find that regA->pkaR and pkaR-|pkaC than we can hypothesize that the network that can well represent both relations is regA->pkaR-|pkaC.

                  For the exercise, use the following set of experiments and propose a network for sporulation? Assume that you already know that pkaR-|pkaC and takC->dhkA. You can also find this and other networks in the epistasis analysis program GenePath.

                  Dicty sporulation