Topic outline

  • General


    Tomaž Curk ( and Blaž Zupan (

    Teaching Assistant

    Rok Gomišček (

    Course Language

    English (including the homework and, by consent of students, the exam).


    The course starts on Wednesday, Oct 9, 2019. Lectures are each Wednesday from 11:15 (sharp!) to 14:00 at P22.


    The grade is composed from grade on the homework and grade from the final exam. To grant approach to the final exam, students have to score higher than 60% with homework. The course is completed successfully if both homework and exam scores are above 60%. Homeworks may include bonuses; we take these into account in the final grade of homework but do not consider them when grading the final exam. That is, you have to score better than 60% in the final exam to pass the course. The final course score in points is computed as max[min(HW+15, EX), min(HW, EX+15)], where HW is a score from homework and EX is a score from the exam, both expressed in the scale from 0 to 100. Example: student's score is 85% for homework, 65% at the exam, the final score in points is 80%. One more example: exam score was 90%, homework 65%, final grade in points is 80%. Score is rounded to an integer and then translated to a course grade by the following rule: below or equal to 60 -> 5, from 61 to 68 -> 6, from 69 to 76 -> 7, from 77 to 84 -> 8, from 85 to 92 -> 9, from 93 -> 10. Grade 5 is negative (fail) and grades 6 and above are positive (pass). There's no oral exam, except for students who would like to change their final grades (note: this could go both directions, students who will not be prepared at the oral exam will fail the course).


    Homework will include exercises on rosalind and three smaller projects, where solving rosalind will carry 40% and each project 20% of the homework points.

    Enroll to the course on rosalind with your name and surname. Submit your solutions to the tasks until the deadline. Tasks have different scores.

    Projects have a strict deadline. For each day after the grade is multiplied by 0.9.


    • Nello Cristianini, Matthew W. Hahn (2007) Introduction to computational genomics: A case study approach, Cambridge University Press, Cambridge, UK.
    • Durbin et al. (1998) Biological sequence analysis, Cambridge University Press (selected chapters)
    • Koza (1992), John R. Genetic programming: on the programming of computers by means of natural selection. Vol. 1. MIT press (Supplementary material on genetic algorithms).

    Fun Stuff to Read

    • Larry Gonick & Mark Wheelis (2005) The Cartoon Guide to Genetics, Harper, New York. (FRI library, Amazon).
    • James D. Watson, Andrew Berry (2004) DNA: The Secret of Life, Arrow Books, UK. (also in Slovene: DNK, skrivnost življenja, Modrijan, Ljubljana, 2007).

    Past Exams

    Exams in years 2012 to 2016. Notice that we have been updating the course every year, so that the exam topics can change.

    Python-Related Resources

    Exam dates

    Exam dates will be announced later in the semester.

    Other resources

  • Introduction to Molecular Biology

    A cell is a structural and functional unit of all living organisms. It is also the smallest unit of life that we consider to be alive. Cells participate in various functions that are all related to the synthesis and decay of proteins. Proteins are biochemical compounds with a chain structure of at least 100 bonded amino acids. They constitute 20% of the cell, the other 70% is water, and the remaining 10% are small molecules, like sugars and fatty acids. Proteins are building blocks of the cell, serve as enzymes to catalyze chemical reactions, or take a role as transmembrane elements. As enzymes, they are involved in metabolic processes. An example of such a metabolic process is glycolysis, a series of chemical reactions that breaks glucose into pyruvate and from which the cell obtains some energy stored in ATP. Proteins drive metabolic processes. But they are, typically within a few hours, degraded and decomposed to amino acids. There has to be machinery that builds proteins from its constituents, and that can read a recipe that determines the sequence of amino acids. Such recipes are stored in a genome. In brief and simplified: DNA in the genome gets transcribed (copied) to messenger RNA, which carries the information out of the cell's nucleus. There, the protein factories, ribosomes, translate the mRNA recipe into a protein. This process is often referred to as a central dogma of molecular biology.

    Lecture Notes
    Additional Reading
    • The First Look at the Genome

      In the past decade, molecular biologists have sequenced the genomes of many different organisms. This includes human, mice, pig, different plants, and all important model organisms, such as yeast and fly. The sequences are now available in databases such as those at NCBI or EBI. We can represent these sequences with models to reveal their statistical properties, structure, and patterns. The simplest model is a multinomial model defined through probability distribution over the alphabet. This model assumes that the nucleotides are independent and identically distributed across the sequence; the probability of a particular nucleotide in this model does not depend on the position. We can infer the model from the sequence simply by counting the occurrences of the nucleotides. Similar can be done with subsequences, like dinucleotides or k-mers. Estimating probabilities of k-mers and estimating the odds that they occur in the genome or in the parts of the genome provides us a first insight into the structure of the DNA.

      Lecture Notes
      Additional Reading
      • Cristianini & Hahn (2007), Introduction to Computational Genomics, Chapter 1.
      • Where are the Genes?

        Three nucleotides form a codon and code for one aminoacid. Special codons, called start and stop codons, mark the area that will be translated to the protein. A sequence of codons is defined by the initial nucleotide, from where on the codons are nonoverlaping and continuous sequence of triplets of nucleotides. Every sequence can therefore be read in three reading frames, each producing three different sequences of aminoacids. Taking into consideration a complementary sequence, each DNA sequence gives rise to six different reading frames, three in forawrd (sense) direction and three in reverse (antisense) direction. The region between start and end codons is called open reading frame (ORF).

        The presence of ORF in a sequence does not guarantee that this part of the DNA will be transcribed and translated to a protein. ORF needs to be "embedded" within a gene, which on its own has structural features other than ORF that include promotor, TATA box, untranslated regions, and other. A set of ORF's in a genome provides for a set of candidates for coding sequences. To limit this set, we can (partially) rely on ORF's lenght. We observed that a randomized sequence would not yield long ORFs, and most of the ORFs would be shorter than 60 codons. Inspectio of gene models of various organisms tells us that average coding sequence length could be much longer. So filtering out ORFs of short length would increase the quality of prediction of coding sequences and hence genes.

        A threshold of a lenght of ORF can be derived theoretically from (uniform) multinomial model, or by examining the distribution of ORF length in a randomized sequence. In both cases, a parameter of the approach is the error that we are willing to make (alpha, often set to 0.01).

        Lecture Notes
        Additional Reading
        • Cristianini & Hahn (2007), Chapter 2.