Topic outline

  • General

    Lecturers

    Tomaž Curk (tomaz.curk@fri.uni-lj.si) and Blaž Zupan (blaz.zupan@fri.uni-lj.si)

    Teaching Assistant

    Rok Gomišček (rok.gomiscek@fri.uni-lj.si)

    Course Language

    English (including the homework and, by consent of students, the exam).

    Dates

    The course starts on Wednesday, Oct 3, 2018. Lectures are each Wednesday from 8:15 (sharp!) to 11:00 at P19.

    Grading

    The grade is composed from grade on the homework and grade from the final exam. To grant approach to the final exam, students have to score higher than 60% with homework. The course is completed successfully if both homework and exam scores are above 60%. Homeworks may include bonuses; we take these into account in the final grade of homework but do not consider them when grading the final exam. That is, you have to score better than 60% in the final exam to pass the course. The final course score in points is computed as max[min(HW+15, EX), min(HW, EX+15)], where HW is a score from homework and EX is a score from the exam, both expressed in the scale from 0 to 100. Example: student's score is 85% for homework, 65% at the exam, the final score in points is 80%. One more example: exam score was 90%, homework 65%, final grade in points is 80%. Score is rounded to an integer and then translated to a course grade by the following rule: below or equal to 60 -> 5, from 61 to 68 -> 6, from 69 to 76 -> 7, from 77 to 84 -> 8, from 85 to 92 -> 9, from 93 -> 10. Grade 5 is negative (fail) and grades 6 and above are positive (pass). There's no oral exam, except for students who would like to change their final grades (note: this could go both directions, students who will not be prepared at the oral exam will fail the course).

    Homework

    Homework will include solving exercises on rosalind and working on a project as a bonus (20%).

    Homework submission is through rosalind, as indicated in the description of the homework. Submission deadline is strict, homework cannot be submitted after the deadline. Homeworks have a strict deadline. For each day after the grade is multiplied by 0.9.

    Project are submitted as Github repositories with all the code and documentation. The project final report, with summaries and figures, should be written in the README.md file in the repository.

    Textbook

    • Nello Cristianini, Matthew W. Hahn (2007) Introduction to computational genomics: A case study approach, Cambridge University Press, Cambridge, UK. (FRI library, Amazon)
    • Durbin et al. (1998) Biological sequence analysis, Cambridge University Press (selected chapters)
    • Koza (1992), John R. Genetic programming: on the programming of computers by means of natural selection. Vol. 1. MIT press (Supplementary material on genetic algorithms).

    Fun Stuff to Read

    • Larry Gonick & Mark Wheelis (2005) The Cartoon Guide to Genetics, Harper, New York. (FRI library, Amazon).
    • James D. Watson, Andrew Berry (2004) DNA: The Secret of Life, Arrow Books, UK. (also in Slovene: DNK, skrivnost življenja, Modrijan, Ljubljana, 2007).

    Past Exams

    Exams in years 2012 to 2016. Notice that we have been updating the course every year, so that the exam topics can change.

    Other Stuff

  • Introduction to Molecular Biology

    A cell is a structural and functional unit of all living organisms. It is also the smallest unit of life that we consider to be alive. Cells participate in various functions that are all related to synthesis and decay of proteins. Proteins are biochemical compounds with chain structure of at least 100 bonded amino acids. They constitute 20% of the cell, the other 70% is water, and the remaining 10% are small molecules, like sugars and fatty acids. Proteins are building blocks of the cell, serve as enzymes to catalyze chemical reactions, or take a role as transmembrane elements. As enzymes, they are involved in metabolic processes. An example of such a metabolic process is glycolysis, a series of chemical reactions that breaks glucose into pyruvate and from which the cell obtains some energy stored in ATP. Proteins drive metabolic processes. But they are, typically within a few hours, degraded and decomposed to amino acids. So there has to be machinery that builds proteins from its constituents, and that can read a recipe that determines the sequence of amino acids. Such recipes are stored in a genome. In brief and simplified: DNA in the genome gets transcribed (copied) to messenger RNA, which carries the information out of the cell's nucleus. There, the protein factories, ribosomes, translate the mRNA recipe into a protein. This process is often referred to as a central dogma of molecular biology.

    Lecture Notes
    Videos
    Additional Reading
    • Cristianini & Hahn (2007), Introduction to Computational Genomics, Chapter 1.
    • Jacob F (1997) Evolution and tinkering, Science 196(4295):1161-1166.
    • The First Look at the Genome

      In the past decade, molecular biologists have sequenced the genomes of many different organisms. This includes human, mice, pig, different plants, and all important model organisms, such as yeast and fly. The sequences are now available in databases such as those at NCBI or EBI. We can represent these sequences with models to reveal their statistical properties, structure, and patterns. The simplest model is a multinomial model defined through probability distribution over the alphabet. This model assumes that the nucleotides are independent and identically distributed across the sequence; the probability of a particular nucleotide in this model does not depend on the position. We can infer the model from the sequence simply by counting the occurrences of the nucleotides. Similar can be done with subsequences, like dinucleotides or k-mers. Estimating probabilities of k-mers and estimating the odds that they occur in the genome or in the parts of the genome provides us a first insight into the structure of the DNA.

      Lecture Notes
      Additional Reading
      • Cristianini & Hahn (2007), Introduction to Computational Genomics, Chapter 1.
      • Where are the Genes?

        Three nucleotides form a codon and code for one aminoacid. Special codons, called start and stop codons, mark the area that will be translated to the protein. A sequence of codons is defined by the initial nucleotide, from where on the codons are nonoverlaping and continuous sequence of triplets of nucleotides. Every sequence can therefore be read in three reading frames, each producing three different sequences of aminoacids. Taking into consideration a complementary sequence, each DNA sequence gives rise to six different reading frames, three in forawrd (sense) direction and three in reverse (antisense) direction. The region between start and end codons is called open reading frame (ORF).

        The presence of ORF in a sequence does not guarantee that this part of the DNA will be transcribed and translated to a protein. ORF needs to be "embedded" within a gene, which on its own has structural features other than ORF that include promotor, TATA box, untranslated regions, and other. A set of ORF's in a genome provides for a set of candidates for coding sequences. To limit this set, we can (partially) rely on ORF's lenght. We observed that a randomized sequence would not yield long ORFs, and most of the ORFs would be shorter than 60 codons. Inspectio of gene models of various organisms tells us that average coding sequence length could be much longer. So filtering out ORFs of short length would increase the quality of prediction of coding sequences and hence genes.

        A threshold of a lenght of ORF can be derived theoretically from (uniform) multinomial model, or by examining the distribution of ORF length in a randomized sequence. In both cases, a parameter of the approach is the error that we are willing to make (alpha, often set to 0.01).

        Lecture Notes
        Additional Reading
        • Cristianini & Hahn (2007), Chapter 2.
        • Sequence Alignment

          Random changes of genetic material and subsequent fitness-based selection of the organisms are drivers of evolution. Genes in different species may originate from genes of common ancestors, but their sequence, due to evolution, is different. Such genes are called orthologous genes. Similarly, genes in the same species may have evolved from the same gene by gene duplications and then subsequent mutations of the gene sequence. These genes are refered to as paralogous genes. Both are the types of the homologous genes, that is, genes that have evolved from the same parental sequence.

          Knowledge of homology may help us understand evolution, but also assist us in tasks such as function prediction, gene finding, sequence assembly and assignment of gene function. Homologous genes may have similar sequences, and we can discover them through sequence comparison. To estimate the sequence similarity, we need to align the sequences and score the alignment. In the most trivial case the alignment score is derived from alignments of each of the nucleotide, that is, from a scoring function that assesses each particular element of alignment. Sequence similarity is then the sum of individual scoring across all alignment positions. We would like to find an alignment with maximal alignment score. An algorithm that effectively performs such search is called Needleman-Wunsch algorithm. It contributes its speed and elegance to the approach called dynamic programming.

          Lecture Notes
          Textbook Chapters
          • Durbin et al. (1998) Biological sequence analysis, Chapter 2.
          • Cristianini & Hahn (2007), Chapter 3.

          Additional Material
          • Sequence Alignment: Few More Tricks

            There are several types of "complications" we can think about when considering sequence alignment. The last time we have considered only global alignment of two sequences. How can we find optimal local alignments, where we only care about alignments of subsequences? Can we extend the alignment procedures to simultaneous alignment of a set of sequences? And can we score alignment gaps more gently by penalizing an opening of the gap but reducing the penalty that stems from the gap lenght? And finally, how do we score penalties on mismatched aminoacids? Oh, so many questions! Yet, solutions of these are not difficult, and many (again) rely on the beauty of dynamic programming.

            Lecture Notes
            Textbook Chapters
            • Durbin et al. (1998) Biological sequence analysis, Chapter 2.
            • Cristianini & Hahn (2007), Chapter 3.

            • Hidden Markov Models

              Introduction to Markov Chains, Hidden Markov Models and Decoding

              Genome consists of structural elements like CpG islands, coding sequences, promoters, the region around splicing sites, introns and exons. Structural elements have their own, characteristic sequence properties. One of this is also the distribution of nucleotides. We would like to devise models that incorporate these distributions but change them according to which structural elements they model. To make the model aware of the structural element, we assign it a state. When modeling CpGs island, for instance, the model would have two states: inside the CpG island, and outside the CpG island. In each state the model would emit a DNA sequence from the distribution that will be characteristic to the state. Notice that we are now dealing with two sequences, a DNA sequence (emission) and a sequence of states (a path). In this lecture, we think about representation of such models, computation of joint probabilities, and decoding of a path from a given emission sequence.

              Lecture Notes
              Textbook Chapters
              • Cristianini & Hahn (2007), Chapter 4 (motivation, examples, biref explanation of Viterbi algorithm).
              • Durbin et al. (1998) Chapter 3 (an excellent introduction on Hidden Markov Models and nice description of related algorithms)

              • Estimation of genetic distances

                We discuss on evolution, point mutations (so-called single nucleotide polymorphisms, or SNPs) and their utility in estimation of genetic distances. Mutations can be lethal, useful (drivers of evolution) or neutral, that is, those that have no observable effect. Measurement of genetic distances rely on neutral mutations. A good model for analysis of variation is mitochondrial DNA and it's D-loop (cca 900bp), where point mutations are considered neutral and can accumulate. Since over time, mutations can cancel each other (e.g., mutation of T to C and then back to T), this should be taken into account when computing distances. We discuss how this is done in Jukes-Cantor model and use it in a simple simulation to show its workings.

                Lecture Notes
                Textbook Chapters
                • Cristianini & Hahn (2007), Chapter 5.

              • Inference of Phylogenetic Trees

                The methods we described so far (sequence alignment, genetic distance, Jukes-Cantor correction) allow us to measure the distance between DNA sequences. These sequences may belong to different organisms within or between different species (e.g., paralogous, homologous genes). Our goal is to use genetic distances to infer the relations between the different organisms or species. Phylogeny deals with such questions on evolutionary relationships. A phylogenetic tree linking related organisms (taxa) is usually drawn to present the inferred evolutionary relations. The length of the edge indicates the evolutionary distance between two taxa.

                Before the advent of molecular biology and associated genome sequencing technologies, phylogenetics used morphological features to estimate genetic distances. Nowadays, comparison of DNA sequences is used for inference of phylogenetic trees.

                Two basic methods to build a phylogenetic tree are: unweighted pair group method with arithmetic mean (UPGMA) and neighbour-joining (NJ). UPGMA assumes a constant rate of evolution, which affects the whole organisms as well as its' parts of the genome equally. It also assumes that all species with a common ancestor are evolutionary equally apart. NJ addresses the last issue and infers trees that more realistically reflect the measured genetic distances. More advanced methods consider the minimum evolution (maximum parsimony), where one goal is to minimize the length of the paths in a given tree; they apply the Occam razor when inferring evolutionary trees.

                Lecture Notes
                Textbook Chapters
                • Cristianini & Hahn (2007), Chapter 7 (includes an interesting example on the evolution of SARS).
                • Durbin et al. (1998) Biological sequence analysis, Chapter 7 (systematic description and derivation of algorithms).
                • Examples (from Zvelebil and Baum: Understanding bioinformatics, Garland Science, 2008): UPGMA and NJ.
                • Carroll et al. Nature 2015 (Ebola example).

              • Genome assembly

                Assembling (reconstructing) genomes from short reads (k-mers) is a computationally challenging problem. We learn about two ways this problem is addressed. Both approaches rely on graphs to reconstruct the genome. Historically, k-mers were placed into nodes and edges connected overlapping k-mers. There, the goal was to find a path through the graph, which would visit every node exactly once. Solving this problem (Hamiltonian cycle) is a NP-complete problem, and impractical for graphs with thousands of nodes (an average genome assembly problem includes millions of k-mers). However, if k-mers are placed on edges, and nodes are (k-1)-mers, we then get a de Bruijn graph. There, the goal is to find a path through the graph, where each edge is traversed exactly once. This problem (Eulerian cycle or tour) can be solved in linear time, which allows the assembly of large genomes.

                Lecture Notes
                Textbook Chapters
                Additional Material
              • Gene expression and profiling

                Genes, quantity of mRNA, protein concentrations, contentrations of metabolites and characteristics of phenotypes are all subject to measurements. Biotechnology has in the past couple of decades created tools when we can measure these elements in parallel. Prominent examples of such techology are microarrays that can simultaneously assess the expression of the entire set of genes in the genome. We can now profile different tissues (say, maling or benign) with vectors of gene expressions, or can characterize genes by measuring their expression unde different environmental conditions. In either way, we obtain profiles, like tissue profiles, or gene expression profiles.

                Profile is thus a vector of measurements, with wich we have characters objects of our interest. We can characterize genes with their expression at different environmental conditions. Or characterize tissues with their gene expression. Or disseases with a vector of their characteristic genes. We can now compare profiles based on how similar they are to each other. Eucliedian distance, Pearson correlatin or any similar measure can be used to compare the profiles. If the profiles are just sets of items, we may use Jaccard distance.

                We would now like to know what is the structure of the object space. How different are their profiles. Do they form clusters or groups. We already know how to perform hiearchical clustering, and can use this method to reason about similarity of objects. In a different technique, called multi-dimensional scaling, we would like to present all the objects on the Euclidean plane. We would like to construct this projections so that, to a maximal degree, we retain the distances between the objects (that is, the distances between the profiles). If the two objects are similar, they should appear close to each other in the projection. If they are different, their distance in projection plane should be large. During the lectures, we saw that we can construct this projection by computing a gradient of optimization function, placing the objects randomly on the projection and then gradually correcting their position toward a more optimal one w.r.t. criteria function using gradient descent.

                Reading material

                • Gene set enrichment

                  Gene profiling and estimation of distances between profiles can lead to discovery of gene groups. Each gene within a group would have a similar profile to the other genes from the same group. We can now be interested if the genes from the same group have something else in common. Are their engaged in the same functions? Are they expressed in the same tissues, or at the same locations in the cell? Are they characteristic to the same dissease? To answer these questions, we require extra information. These can be obtained from various public data sources. For example, gene ontology classifies bilogical functions and processes, and for each term used in the classification lists associated genes. Genes of which proteins form protein complexes are available in String. Common dissease are associated to genes in OMIM. From all these and similar data sources we can obtain named gene sets, that is, a group of genes associated to some concept (like dissease, protein complex, or biological function). Given the profile-based gene group, it is now are tasks to go thrugh all of the gene sets and report on those with substantial overlap. That is, we would expect that concepts that are interesting and related to specific gene profile include many genes from profile-based gene group.

                  To exclude random associations, that is, associations that emerge by chance, we can employ a procedure called gene set enrichment analysis. In enrichment analysis, we estimate what is the probability a given or bigger overlap between profile-based group and a gene set. The assessment is perfomed considering hypergeometric distribution. Given a total set of N genes (e.g., genes in the genome), size n of the gene set, size K of the profile-based group, we would like to estimate what is the probability that k or more genes from profile-based group belong also to the gene set. Here, k is the observed number of genes that are both in profile-based group and a gene set. Our target is finding the gene sets where this probability is very low. We say that such sets are enriched with genes from our profile-based group.

                  Analiza obogatenosti skupin iz GO

                  On the figure: Gene Ontology gene set enrichment (molecular function). "Reference" gives a number of the genes with this function in the entire genome, and "Cluster" refers to a number of genes in a user-provided cluster that perform an associated function.

                  Notice that above we have described only a gene set enrichment analysis. Similar technique can be used for any analysis of clusters for which we have additional information (annotation) on the elements of clusters. Say, which twitter hashtags are characteristic for my facebook friends?