BIOL4200
Review for Final Exam: Introduction to Bioinformatics


The exam is CLOSED book and IN-CLASS, although you may bring to class up to 4-pages of handwritten or typed notes (i.e. a crib sheet of 4 pages, written on both sides), along with a CALCULATOR.

Although the exam is cumulative, it will focus mostly on the material covered since the midterm -- i.e. simple applications of probability theory and probability distributions (binomial, geometric, hypergeometric, normal) and statistical tests (T-test, χ2-test), linkage disequilibrium,bacterial origin of replication (ORI) -- GC-skew, biological reasons for skew, glycan arrays, Fisher exact test and its use for the infectivity analysis of the influenze virus, phylogenetic tree construction. The exam will reflect coverage of material in the course, so please review the course slides and homeworks.

Concepts to review include the following.

Some example questions

    General

  1. Using the BLOSUM62 (log odds) matrix, answer the following questions.

    1. Does isoleucine substitute for leucine more often than by chance?
    2. Does glutamine substitute for glutamic acid more often than by chance?
    3. Which amino acid is found to substitute for tryptophan most often?

    Solution:

    BLOSUM2 values, as explained in class (and as you computed in the midterm exam), are given by α · log( p(x,y)/(q(x)q(y)) ) rounded to the nearest integer, where α is a scaling factor, p(x,y) is the joint probabity (relative frequency that x and y both appear in the same column of the multiple alignment), and q(x) [resp. q(y)] is the relative frequency of occurrence of residues x [resp. y] in the multiple alignment. If the BLOSUM value is positive, then substitution happens more often by chance, while if the BLOSUM value is negative , then substitution happens less often than by chance. From this, the answers to a,b,c are clear when you check values in the BLOSUM62 table.


  2. What do the databases GenBank, PDB, Rfam, Pfam, UniProt, PubMed, OMIM contain? Please be precise. If the database contains 3D structures determined by X-ray crystallography or NMR, then please state this; if the database contains RNA sequences and secondary structures, the please state this, etc.

    Solution:

    Check on-line. These were explicitly discussed in class, though I didn't write all the details in the class slides.


  3. Describe one difference between the format/contents of a GenBank genomic nucleotide sequence file in comparison with an EMBL (EBI) genomic nucleotide sequence file.

    Solution:

    Same as above.


  4. Describe how one uses PSI-BLAST to determine the PSSM of remote homologs of a user-specified protein.

    Solution:

    Same as above. What I said in class is that PSI-BLAST ("position specific iterated BLAST) involves a user-specified number of iterations (often 3 or 4). After a first application of BLAST, the user can either manually select, or what is more often the case, simply select ALL the hits returned by BLAST whose E-value is less than the default value on the web server. The hits are unified into a profile or PSSM, which is then BLASTED against the databases again, etc.


  5. Please select one of the options in the following multiple choice selection. What does the database PDB contain:

    1. Multiple alignments of DNA and proteins.
    2. Metabolic pathways.
    3. Explanations of side effects of pharmaceutical drugs.
    4. All of the above.
    5. None of the above

    Solution:

    None of the above


  6. In E. coli, which has compositional frequency of 0.25 for each nucleotide, what is the expected length of an ORF computed from basic probability?

    Solution:

    Rephrasing the question, suppose that you have a coin with heads probability of p. How many times do you need to flip the coin (on average) to get the first heads? (We did this a number of times in class.) For this question, you need to determine the "heads probability" (probability of a STOP codon) for the E. coli genome. Once you determine this, you then determine the expected number of coin flips before getting a "heads" (i.e. a STOP codon).


    Linkage disequilibrium

  7. Consider the following statistics from a sample population of people genotyped for having green eyes (locus X) and having a superpower (locus Y); i.e. using a gross oversimplification, the "green eye gene" X has the allele 1 or 0, and the "superpower gene" (being able to fly through the air, walk on water, etc.) Y has the allele 1 or 0.

    Compute the linkage disequilibrium value D(X,Y) and its normalized value D'.

    Solution:

    You should work through the formula to make sure you understand the answer.

    In hypothesis testing, whether the genes are associated (linked), the null hypothesis H0 is "locus X and locus Y are not linked", while the alternative hypothesis H1 is "locus X and locus Y are linked". The χ2 test statistic can be computed directly, but in our class notes, we saw that the test statistic can also be expressed in terms of the linkage disequilibrium D (not the normalized value D'):

    χ2 = n D2/(P(x=0)P(x=1)P(y=0)P(y=1))
    One then determines the area of the tail region, to the right of the χ2 test statistic in the χ2 distribution with 1 degree of freedom. If the area, called the "p-value", is less than 0.05, then with 95% statistical certainty we can REJECT the null hypothesis, and conclude that the loci are linked.


    Probability, Combinatorics, Entropy

  8. Suppose that an urn contains 4 red balls and 8 black balls.
    1. If you randomly select 4 balls, replacing the selected ball in the urn each time after you select it, then what is the probability of selecting exactly 2 red ball and 2 black balls? In other words, what is the probability of selecting 2 red balls and 2 black balls with replacement?
    2. If you randomly select 4 balls, where the selected ball is NOT placed back in the urn each time after you select it, then what is the probability of selecting exactly 2 red ball and 2 black balls? In other words, what is the probability of selecting 2 red balls and 2 black balls without replacement?

    Solution:

    1. Use the binomial distribution:
      "4 choose 2" · (1/3)2 · (1-1/3)2 
      = 0.296296
      
    2. Use the hypergeometric distribution:
      ("4 choose 2" · "8 choose 2" )/"12 choose 4" =
      6 · 28/ ( (8 · 7 · 6 · 5)/24 ) = 2.4
      


  9. Suppose that on average, there are 200 occurrences of the motif TATAAA in a genome of size one megabase. In a random genome of the same size, what is the probability that there are no occurrences of the motif TATAAA? What is the probability that there are exactly two occurrences of the motif TATAAA? Hint: can you model this by the Poisson distribution?

    Solution:

    If the process of acquiring a motif is uniform (think of randomly choosing the exact location within (say) one megabase for a motif to occur), then the NUMBER of occurrences of the motif within a window of (say) one megabase should be approximately given by the Poisson probability:

    Prob[X = k] =  e · (μ)k/k! 
    
    where Prob[X=k] is the probability of exactly k motif occurrences. If the number of motif occurrences is really representative, then we can assume that the EXPECTED NUMBER of motif occurrences is 200, where the words expectation, expected value, mean, and average all mean the same thing -- it follows that μ=200. The probability of NO occurrence is
    Prob[X = 0] =  e-200 · (μ)0/0!  = e-200 = 1.3838965267367376e-87
    
    Prob[X = 2] =  e-200 · 2002/2!  = 2.767793053473475e-83
    


  10. You are told that the average distance between successive TATAAA motifs in a genome of ten megabases is 10,000 bases. What is the probability of occurrence of the motif TATAAA? Is the genome pyrimidine rich or purine rich? Careful: this question is quite different from the previous question.

    Solution:

    This is a tricky problem, although we have mentioned in class all the material needed to answer the question.

    The probability density function for the exponential distribution is given by

    p(x) = λ · e-λ x 
    
    In class, we said that the exponential distribution is a good mathematical model for the genomic DISTANCE between two successive motifs. In particular, the probability that the distance between two SUCCESSIVE motifs is AT LEAST A and AT MOST B is is the integral from A to B of the probability density function p(x). The mean (average value) for the exponential distribution is μ=1/λ . The cumulative density function, or integral from 0 to x of the density function p(x) is given by
    CDF(x) = 1 - e-λ · x
    
    From the problem assumptions, we have μ=104, so λ=10-4.

    Moreover, in class, we also said that the exponential distribution is the CONTINUOUS analogue of the GEOMETRIC distribution, which latter is described as follows:

    probability of (k-1) tails followed by a heads = P(k) = (1-p)k-1 · p
    
    where p is the probability of obtaining a heads when flipping a biased coin. The mean of the geometric distribution is 1/p, just as the mean of the exponential distribution with parameter λ is 1/λ. Thus if we set p=λ, then a good estimate for the probability of occurrence of TATAAA at any given occurrence is p=0.0001. Now, if we assume that the genomic compositional frequency of A,C,G,T is 1/4, as in the case of E. coli K-12, then we have heads probability q=(0.25)6 = 0.000244140625, which is about twice the value of p. Let's assume that the genomic frequency of A equals that of T, and that the genomic frequency of C equals that of G -- the so-called "second" Chargaff law (which is not a "law" at all and can easily be seen to be violated given certain genomes). The probability of TATAAA is then r6, where r is the (unknown) probability of T, or equivalently of A. If r6 = 0.0001, then by taking the sixth-root of both sides, we have that r=0.2154. Thus we have (by our assumption of Chargaff's second "law") that P(A) = 0.2154 = P(T), P(C) = 0.3923 = P(G). Now A and G are purines, while C and T are pyrimidines. By our assumption of Chargaff's second "law", there are an equal number of purines and pyrimidines.

    A final remark: it is usual to model the "waiting time" or equivalently the "genomic distance" between two motif occurrences by the exponential distribution, provided the number of motif occurrences is modeled by the Poisson distribution. However, the exponential distribution is the continuous analogue of the geometric distribution, and requires a bit more math (not much, just a small integral), so we'll settle for the above approximation using the geometric distribution).

    If we don't assume Chargaff's second "law", then we have the equation r2 · s4 = 0.0001, where r [resp. s] denotes the probability of T [resp. A]. To solve 2 unknowns, we need 2 equations; without an additional equation, we can trace a curve of s as a function of r (similar to a curve of y as a function of x), which allows you to determine the probability of T if you know the probability of A.


  11. Suppose that a glycan array contains 100 glycans, 40 of which contain a particular treelet T, and 60 of which do not contain the treelet T. What is the joint probability that hemagglutin binds 39 of the 40 glycans that contain T and that hemagglutin binds 30 of the 60 glycans that do not contain T?

    Solution:

    First, write out the contingency table: a = number of glycans that contain T and bind H, b = number of glycans that do NOT contain T yet do bind H, c = number of glycans that contain T and do not bind H, d = number of glycans that do NOT contain T and do NOT bind H. The numerator is the multinomial factor "n choose a,b,c,d", where n=a+b+c+d. The denominator is the product of the binomial factor "n choose (a+b)" where a+b is the sum of first row, times "n choose (a+c)" where a+c is the sum of first column. Then crunch numbers with your calculator.


  12. Suppose that you estimate that the probability of an occurrence of an AUG codon occurring in any given genomic position is 1/64. What is the Poisson probability that there are exactly 5 occurrences of AUG in a genomic region of size 100 codons (i.e. 300 nt)?

    Solution:

    The expected number, or average, number of occurrences is (approximately)
            μ = (1/64)*100 =  1.5625
    By definition, P[X=k] = exp(-μ)· μk/k!.
    Now 
       (1) exp(-μ) = exp(-1.5625) = 0.2096113871510978, 
       (2) μ5 = 9.313225746154785,
       (3) μ5/(5*4*3*2*1) = 0.07761021455128987, 
    so 
       P[X=5] = exp(-μ)*  μ5/(5*4*3*2*1) = 0.016267984729190187
    


  13. Compute the base 2 entropy of the first position of the following multiple alignment, AND determine the sum of the heights of the stacked letters in the Web Logo output.
    ATTA
    CGG-
    ATCA
    GTTT
    TTGT
    AAAT
    

    Solution:

    In the first position, there are 3 A's, 1 C, 1 G, and 1 T and a total of 6 nucleotides, so by rounding to 3 decimal places we have: p(A)=0.5, p(C)=0.167, p(G)=0.167, p(T)=0.167.

    H = - [ 0.5 * log(0.5)/log(2) + 3*( 0.167 * log(0.167)) = 
    1.3966704947492563
    
    which is 1.397 to 3 decimal places. Sequence logo height is maximum entropy minus the entropy, so in this case maximum entropy is log2(4), since there are 4 nucleotides, and the sum of the heights of the stacked letters is 2 - 1.397 = 0.603.


  14. Compute the PSSM for the following multiple alignment, both without and with a pseudocount of 1.
    AT
    CG
    AT
    GT
    TT
    AA
    

    Solution:

    1. [Without pseudocounts] In the first position, there are 3 A's, 1 C, 1 G, and 1 T and a total of 6 nucleotides, so by rounding to 3 decimal places we have: p(A)=0.5, p(C)=0.167, p(G)=0.167, p(T)=0.167, for probabilities in the first column. Similarly, the probability distribution for column 2 is: p(A)=1/6 p(C)=0 p(G)=1/6 p(T)=2/3. You should display this as follows:
      	1	2	
         ----------------------
      A|	1/2	1/6
      C|	1/6	0
      G|	1/6	1/6
      T|	1/6	2/3
      
    2. [With a pseudocount of 1] In the first position, with 1 pseudocount for each of A,C,G,T, there are 4 A's, 2 C's, 2 G's, and 2 T's and a total of 6+4=10 nucleotides (due to a pseudocount of 1 for each nucleotide), so we have: p(A)=0.4, p(C)=0.2, p(G)=0.2, p(T)=0.2. I'll leave the remaining compution for position 2 to you. You should produce a display as follows:
      	1	2	
         ----------------------
      A|	4/10	2/10
      C|	2/10	1/10
      G|	2/10	2/10
      T|	2/10	5/10
      


    Sequence Alignment

  15. Consider the following alignment:
    ACGG-TT-GACCTTACGACT-
    A-GTACGA--TCACTCTA-TT
    
    1. Calculate the score of alignment shown above, using the following scoring rules: +1 for a match, -1 for a mismatch, -1 for a gap (linear gap penalty).
    2. Calculate the score of alignment using the following scoring rules: +1 for a match, -1 for a mismatch, -2 for gap initiation and -1 for gap extension (affine gap penalty).


  16. Calculate the semiglobal alignment score, where a match counts +1, a mismatch -1, and gap initiation counts -2, and gap extension -1. In a semiglobal alignment, gaps at the left and right extremity are not counted; however internal gaps do indeed have a gap penalty.
    --ACGG-TT-GACCTTACGACT-
    TTA-GTACGA--TCACTCTA-TT
    


  17. Give the path matrix for the global alignment (Needleman-Wunsch) of GGACC with CGAA, where a match counts +1, a mismatch -2 and a gap -1 (i.e. linear gap penalty). After completing the path matrix, use a traceback to give the optimal alignment.

    Solution:

    s is :TGAGC 
    t is :AGCGT
    
    Path matrix
    0	-1	-2	-3	-4	-5	
    -1	-2	-3	-4	-5	-3	
    -2	-3	-1	-2	-3	-4	
    -3	-1	-2	-3	-4	-5	
    -4	-2	0	-1	-2	-3	
    -5	-3	-1	1	0	-1	
    
    Path arrow matrix
    
    
    The optimal alignment is
    
    TGAGC--
    --AGCGT
    


    Sequence assembly using de Bruijn graphs

  18. Previous homework problem: Consider the following set of eight 3-mers that have been obtained from a "genome" sequence: {AGT, AAA, ACT, AAC, CTT, GTA, TTT, TAA}.
    1. Draw a de Bruijn graph for this same set of eight fragments. Recall that the nodes will be 2-mers, and an edge is drawn between pairs of nodes if an only if the 1-letter suffix of a 2-mer matches the one-letter prefix of the other 2-mer. Remember, a pair of nodes (and the edge between them) represents a 3-mer from the data. No 2-mer can be present more than once.
    2. Describe a path through this graph that visits each edge exactly once. Here, a path "description" is of the form a->b->c->d, where a,b,c,d are nodes in the graph -- i.e. 2-mers.
    3. Write the superstring corresponding to this path.

      Hint: You are allowed to draw an edge from one 2-mer back to itself, forming a loop that involves no other nodes. [you are also allowed to form more complicated loops, etc.]

    Solution:

    1. [Sequence assembly by search for a (directed) Eulerian path] Define the new graph G=(V,E), where the vertices consist of all 2-mers from the collection of (experimentally given) 3-mers. Here, we write the list of 2-mers, with their multiplicities (number of occurrences). If there is only one occurrence, then we do not write the multiplicity:
      1: AA 	3
      2: AC
      3: AG
      4: CT   2
      5: GT   2
      6: TA   2
      7: TT   3
      
      The collection of directed, labeled edges is given by a list of ordered pairs, together with the 3-mer label that justifies the edge:
      3,5	AGT
      3,3	AAA
      2,4	ACT
      1,2	AAC
      4,7	CTT
      5,6	GTA
      7,7	TTT
      6,1	TAA
      
      Start by identifying the sink, which is vertex 7, since it is the only node which does not occur as the left member of an ordered pair where the right member is distinct from the left pair (i.e. 7 is the only value that does not occur as the head of an arrow leading to a distinct node). Now backtrack, starting with 7 -- in other words, follow the tail to the head of arrows: 3 -> 5 -> 6 -> 1 -> 2 -> 4 -> 7. Read from left to right the edge labels, each time appending the growing sequence by appending the newest symbol. This produces AGTAACTTT. Since we also have the ordered pair (3,3), arguably the sequence should be instead AAGTAACTTT.


    Origin of replication determination

  19. Suppose that the genome of a bacterium consists of one circular chromosome, and it has a unique origin of replication (oriC). Suppose that the genome is available as a FASTA file in NCBI GenBank database, and that the oriC is indicated at the right end of the plus-strand. Draw both the skew plot and the cumulative skew plot (hint - should be similar to B. subtilis).

    Solution:

    See the GC-skew plot of B. subtilis in class slides.


  20. Now consider the combined effect of transcription and replication on the cumulative GC-skew.

    1. Assume that a viral genome consists of one linear chromosome, having an origin of replication (oriC) at EACH END. Assuming that mRNA is transcribed from the plus and minus strands at approximately the same frequency, draw the cumulative skew plot for this genome.
    2. Now assume that a viral genome consists of one circular chromosome, having a single origin of replication (oriC), but that ALL genes are physically transcribed from the minus strand so that the mRNA transcripts are identical with the genomic DNA of the plus strand, as indicated in GenBank files. Taking into account the effect of transcription (which causes the duplex DNA to be separated for the action of RNA polymerase), draw the cumulative skew plot in this case.
    3. Make the same assumptions as in (b), except that ALL genes are physically transcribed from the plus strand, which is indicated in GenBank files by the word "complement" appearing after CDS, as in
        CDS             complement(3300..4037)
      
      Taking into account the effect of transcription, draw the cumulative skew plot in this case.

    Solution:

    1. This situation is exactly the case of human adenovirus for which the book give a parabolic cumulative skew plot shown in Figure 6.5.
    2. When transcription occurs from the minus strand, the plus strand remains single-stranded more than the minus strand and so will have a larger GC-skew. Since replication of the minus strand will leave the plus strand single-stranded more than the minus strand, there will be an additional cause for the plus strand to have a larger GC-skew. This happens up to the terminus, located at the midpoint of the circular genome, after which the minus strand will have a larger GC-skew due to replication. Thus after the terminus, there will be a cancellation and GC-skew will remain constant. This is the case of human papilloma virus HPV shown in Figure 6.6 of the book.
    3. When transcription occurs from the plus trand, the minus strand remains single-stranded more than the plus strand and so will have a larger GC-skew. Since replication of the minus strand will leave the plus strand single-stranded more than the minus strand, there will be an effective cancellation so the first half of the cumulative GC-skew diagram will be flat. So the curve will look like a flat line followed by a line with slope -1.


    Dijkstra's single source shortest path algorithm

  21. Please give the list of (weighted) edges representation of the directed graph below, and then step through Dijkstra's algorithm, when taking node 1 as source. Please output the optimal path from 1 to 6. Also, you should be able to apply Dijkstra's algorithm to determine the optimal gene, given various potential donor and acceptor sites and scores of likelihood that a certain region is an intron or an exon.

    Solution: WARNING: In my solution below, for convenience, I assume the nodes a,b,c,d,e,f are the integers 1,2,3,4,5,6 in that order. Thus the first line (1 2 2) of the following list corresponds to the directed edge from a to b with weight 2.

    List of weighted edges representation:
    1 2 2
    1 3 4
    2 3 1
    2 5 2
    3 5 3
    2 4 4
    5 4 3
    4 6 2
    5 6 2
    -----------------------------------
    Initialization step (time 0):
    
    vertex	D	Pred
    1:      0	N/A
    2:	inf	N/A
    3:	inf	N/A
    4:	inf	N/A
    5:	inf	N/A
    6:	inf	N/A
    
    Q = [1, 2, 3, 4, 5, 6] 
    
    -----------------------------------
    Time step 1: select node from Q with smallest distance to source, where
    the source is 1 (i.e. find index k such that D[x] is smallest value possible).
    
    Set vertex x=1, since D[1]=0, remove 1 from Q so Q = [2, 3, 4, 5, 6] 
    For all nodes y such that x->y is edge, 
    if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
    Update matrices D (distance to source) and Pred (predecessor) accordingly.
    
    vertex	D	Pred
    1:      0	N/A
    2:	2	1
    3:	4	1
    4:	inf	N/A
    5:	inf	N/A
    6:	inf	N/A
    
    Q = [2, 3, 4, 5, 6] 
    
    -----------------------------------
    Time step 2: select node x from Q with smallest distance to source.
    
    Set vertex x=2, since D[2]=2, remove 2 from Q so Q = [3, 4, 5, 6] 
    For all nodes y such that x->y is edge, 
    if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
    Update matrices D (distance to source) and Pred (predecessor) accordingly.
    
    vertex	D	Pred
    1:      0	N/A
    2:	2	1
    3:	3 	2	
    4:	6	2
    5:	4	2
    6:	inf	N/A
    
    Q = [3, 4, 5, 6] 
    
    -----------------------------------
    Time step 3: select node x from Q with smallest distance to source.
    
    Set vertex x=3, since D[3]=3, remove 3 from Q so Q = [4, 5, 6] 
    For all nodes y such that x->y is edge, 
    if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
    Update matrices D (distance to source) and Pred (predecessor) accordingly.
    
    In this case, D[3]+weighte(3->5) is greater than D[3], so no updates are
    performed to D and Pred.
    
    vertex	D	Pred
    1:      0	N/A
    2:	2	1
    3:	3 	2	
    4:	6	2
    5:	4	2
    6:	inf	N/A
    
    Q = [4, 5, 6] 
    -----------------------------------
    Time step 4: select node x from Q with smallest distance to source.
    
    Set vertex x=5, since D[5]=4, remove 5 from Q so Q = [4, 6] 
    For all nodes y such that x->y is edge, 
    if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
    Update matrices D (distance to source) and Pred (predecessor) accordingly.
    
    vertex	D	Pred
    1:      0	N/A
    2:	2	1
    3:	3 	2	
    4:	6	2
    5:	4	2
    6:	6	5
    
    Q = [4, 6] 
    -----------------------------------
    Time step 5: select node x from Q with smallest distance to source.
    
    Set vertex x=4, since D[4]=6, and although D[6]=6, 4 comes before 6, so
    we choose 4. Remove 4 from Q so Q = [6] 
    For all nodes y such that x->y is edge, 
    if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
    Update matrices D (distance to source) and Pred (predecessor) accordingly.
    
    In this case, D[4]+weighte(4->6) is greater than D[6], so no updates are
    performed to D and Pred.
    
    
    vertex	D	Pred
    1:      0	N/A
    2:	2	1
    3:	3 	2	
    4:	6	2
    5:	4	2
    6:	6	5
    
    Q = [6] 
    -----------------------------------
    Time step 6: select node x from Q with smallest distance to source.
    
    Set vertex x=6, since D[6]=6, and there are no other nodes in Q.
    Remove 6 from Q so Q = [ ] i.e. empty
    
    For all nodes y such that x->y is edge, 
    if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
    Update matrices D (distance to source) and Pred (predecessor) accordingly.
    
    vertex	D	Pred
    1:      0	N/A
    2:	2	1
    3:	3 	2	
    4:	6	2
    5:	4	2
    6:	6	5
    
    Using tracebacks with Pred array, we determine optimal path from source
    vertex 1 to target vertex 6: 1->2->5->6
    


    Phylogenetic tree construction

  22. Using the following distance matrix D, follow the steps of the WPGMA algorithm and construct the corresponding (ultrametric) tree. (Remember that ultrametric means that the distance from any node to all of its leaves is the same.)
    0	5	4	7	6	8	
    5	0	7	10	9	11	
    4	7	0	7	6	8	
    7	10	7	0	5	9	
    6	9	6	5	0	8	
    8	11	8	9	8	0	
    

    Solution:

    Original distance matrix where columns and rows correspond respectively
    to 0,1,2,3,4,5 (i.e. the leaves are labeled 0,1,2,3,4,5 instead of A,B,C,D,E,F).
    	A	B	C	D	E	F
    A	0	5	4	7	6	8	
    B	5	0	7	10	9	11	
    C	4	7	0	7	6	8	
    D	7	10	7	0	5	9	
    E	6	9	6	5	0	8	
    F	8	11	8	9	8	0	
    
    ---------------------------------------------
    Step 1:  Combine A,C to form {A,C}.
    	A,C	B	D	E	F	
    A,C	0	6	7	6	8	
    B	6	0	10	9	11	
    D	7	10	0	5	9	
    E	6	9	5	0	8	
    F	8	11	9	8	0	
    
    ---------------------------------------------
    Step 2:  Combine D,E into {D,E}.
    	A,C	B	D,E	F
    A,C	0	6	6.5	8	
    B	6	0	9.5	11	
    D,E	6.5	9.5	0	8.5	
    F	8	11	8.5	0	
    
    ---------------------------------------------
    Step 3:  Combine {A,C} and B into {A,B,C}
    	A,B,C	D,E	F	
    A,B,C	0	8	9.5	
    D,E	8	0	8.5	
    F	9.5	8.5	0	
    
    ---------------------------------------------
    Step 4:  Combine {A,B,C} and {D,E} into {A,B,C,D,E} which we abbreviate by A-E
    	A-E	F
    A-E	0	9	
    F	9	0	
    
    ---------------------------------------------
    Step 5:  Combine {A-F} and G into {A,B,C,D,E,F} to produce the following
    tree:
    
    	--------5
    	|
    --------4.5
    	|
    			--------4
    			|
    		--------2.5
    			|
    			--------3
    		|
    	--------4
    		|
    			--------1
    			|
    		--------3
    			|
    				--------2
    				|
    			--------2
    				|
    				--------0
    
    
    Following the class notes, you should be able to give the edge weights for all edges of the tree (not shown here).


  23. The following four trees are drawn with branch lengths to scale. Which statement is correct?

    1. All four trees are non-equivalent.
    2. All four trees are equivalent.
    3. Only trees (i) and (iii) are equivalent.
    4. Only trees (iii) and (iv) are equivalent.

    Solution:

    D is correct: trees (iii) and (iv) are equivalent because C branches first, D second and A and B last.


    Genomic rearrangement events (reversals)

  24. Consider the following unsigned sequence of genes:

    5 1 2 4 3
    
    1. What are the adjacencies and the break points? Please indicate by break points by placing a vertical bar between the positions where a break point occurs. How many break points are there?
    2. Please give the pseudocode for the greedy (unsigned) reversal sort algorithm.
    3. Using the following pseudocode for the greedy (unsigned) reversal algorithm, illustrate each step of the greedy algorithm on the previous data, starting from 5 1 2 4 3 and terminating with the sorted order 1 2 3 4 5.

      Let F be the permutation which represents the gene order 5 1 2 4 3; in other words, F is a function, whose domain and range is the set {1,2,3,4,5}, and whose values are defined by F(1)=5, F(2)=1, F(3)=2, F(4)=4, F(5)=3.

      def greedyInversionDistance(F):
        distance = 0
        for i = 1 to n
          j = F_inverse(i)
          //j = position of element i in permutation F 
          //i.e. F(j)=i or equivalently F_inverse(i)=j
          if i != j
            F = F r(i,j) //F composed with the reversal of segment between i and j
            print F
            distance = distance + 1
          if F = identity permutation:
            return distance
      

    Solution

    1. There are 2 break points in the original gene set; however if one adds the left telomer 0 and the right telomere 6, then there are 4 break points:
      0 | 5 | 1 2 | 4 3 | 6
      
    2.  
      Simulation of greedy algorithn on initial data, where the data 
      or permutation F is given by
          5 1 2 4 3
      and is represented by the following table. Here we think of a
      permutation F as a 1-to-1 function from domain {1,2,...,5} into
      range {1,2,...,5}, so F(1)=5, F(2)=1, etc.
      
      Step 0: Original data F
      x|y
      ---
      1 5
      2 1
      3 2
      4 4
      5 3
      
      
      Step 1: Inversion r(1,2) applied to reverse segment (1 5) to (5 1)
      x|y
      ---
      1 1
      2 5
      3 2
      4 4
      5 3
      
      Step 2: Inversion r(2,3) applied to reverse segment (5 2) to (2 5)
      x|y
      ---
      1 1
      2 2
      3 5
      4 4
      5 3
      
      
      Step 3: Inversion r(3,5) applied to reverse segment (5 4 3) to (3 4 5)
      x|y
      ---
      1 1
      2 2
      3 3
      4 4
      5 5
      


  25. Since we did the previous example in class, you should try to solve the same problem for the following gene order in a chromosome: Consider the following unsigned sequence of genes:

    6 4 5 3 1 2