BIOL4200
Review for Midterm: Introduction to Bioinformatics


The midterm is a TAKE HOME exam, available on CANVAS from Friday, Nov 12 at 5:00 pm until Sunday, Nov 14 at 11:59 pm. The test is TIMED, and you have exactly 2 hours, including the time necessary to create a SINGLE PDF file of your answers, and upload it to the Canvas web site in the QUIZZES section.

For anyone who has kept up in class and understood the material we've covered, the midterm should not take more than 60-90 minutes, which is ample time for you to do the test, create a single pdf file and upload it. PLEASE practice making a single pdf file BEFORE you start the test, since the CANVAS site will not allow you to upload anything after 2 hours have passed since you have begun the test.

During the midterm exam, you may use class notes from our web site, your own notes, a calculator, Excel and Mathematica; however, you may NOT use Internet or discuss the test with any person. Here, I am solemnly asking each of you to abide by these directions completely. Remember, if you cheat in this instance, then you will cheat and take short cuts as a physician, business person or in whatever profession you choose. Certainly none of you would want to be treated in an emergency room for a life threatening health problem by a physician who cheats and takes short cuts. There is ZERO tolerance for cheating, and I have confidence that you all will abide by the ethical code of Boston College.

The test will reflect coverage of material in the course, so please review the course slides, homeworks and quizzes. With the exception of a few questions on general notions (see below), the test questions will be similar to quiz and homework problems that require understanding of the material.

On the midterm, there will be NO questions concerning the Poisson or geometricprobability distributions. There WILL BE questions on the binomial and hypergeometric distributions. You will have to use Mathematica to solve a system of differential equations, so please review class notes and homework for Week 2, when we covered how to model the spread of an epidemic and how to solve a system of differential equations using Mathematica (see also the related problems on logistic model for bacterial growth, etc.).

Concepts to review include the following.

Example questions

    General

  1. Crick's proposed Central Dogma states that the "information flow" in a cell is from DNA to RNA to protein; i.e. DNA -> RNA -> protein, where each of the arrows cannot be reversed. Describe a violation of the Central Dogma.

    Solution:

    The CRISPR/Cas system in prokaryotes involves (1) bacterial proteins that extract viral DNA, which is incorporated into the CRISPR array region of the bacterial genome, and (2) transcription of the polycistronic CRISPR RNAs with subsequent cleavage and incorporation into proteins that recognize and cleave viral DNA which hybridizes with the CRISPR RNA. Thus proteins can change DNA, which violates the Central Dogma. Another example is the yet unclear manner of how long noncoding RNAs direct the transformation of genomic regions (DNA) into heterochromatin.


  2. This problem concerns IUPAC nucleotide and amino acid codes.

    1. What do the IUPAC codes R and Y stand for?
    2. What is the IUPAC code for the sequence read "either G or C" -- i.e. in the case of uncertain data where it is clear that the nucleotide at a particular position is either G or C, but there is insufficient signal to determine which.
    3. Which amino acid does the IUPAC code G stand for, and what is the side chain of this amino acid?

    Solution:

    1. R=purine, Y=pyrimidine.
    2. S = strong (G or C)
    3. G = Gly = glycine, with side chain H.


  3. Restriction enzymes provide a rudimentary "immune system" for bacteria, in that they recognize certain specific nucleotide sequences and then cut the sequence in a specific position in the recognition sequence. Generally, restriction enzyme recognition sequences are biological palindromes, such as

    BamH1   G^GATCC
    EcoR1   G^AATTC
    EcoRV   GAT^ATC
    

    where the cleavage site is indicated with a '^' (carat) character. In a random "genome" of 4.6 million base pairs (i.e. double-stranded DNA) where P(G)=P(C)= 0.3 and P(A)=P(T)=0.2, what is the expected number of occurrences of the restriction enzyme EcoR1?

    Solution:

    p = (0.3)*(0.3)*(0.2)**4
    p = 0.000144
    exp num = p*2*(4600000)
    exp num = 1324.8
    


    Probability, Combinatorics, Entropy

  4. Suppose that you have a coin whose "heads" probability is 1/8. What is the expected number of times that you must flip the coin before obtaining a heads?

    Solution:

    1/(1/8) = 8.


  5. If you flip a coin with heads probability of 1/4 five times, then what is the probability that you have exactly 2 heads?

    Solution:

    "5 choose 2" * (1/4)2 * (3/4)3 = 10 * 1/16 * 27/64 = 0.263671875


  6. Suppose that you estimate that the probability of an occurrence of an AUG codon occurring in any given genomic position is 1/64. What is the expected number of occurrences of AUG in the sense strand of the E. coli genome?

    Solution:

    E. coli K-12 has genome size of 4.5 x 106 base pairs, with relative frequency of 1/4 for each nucleotide. Thus 4,500,000/64 = 70312.5 is the expected number of occurrences of AUG, ASSUMING the DNA sequence is random. Since E. coli genome is certainly not random, the number of AUG may deviate from this expected number.


  7. Compute the base 2 entropy of the first position of the following multiple alignment, AND determine the sum of the heights of the stacked letters in the Web Logo output.
    ATTA
    CGG-
    ATCA
    GTTT
    TTGT
    AAAT
    

    Solution:

    In the first position, there are 3 A's, 1 C, 1 G, and 1 T and a total of 6 nucleotides, so by rounding to 3 decimal places we have: p(A)=0.5, p(C)=0.167, p(G)=0.167, p(T)=0.167.

    H = - [ 0.5 * log(0.5)/log(2) + 3*( 0.167 * log(0.167)/log(2)) = 
    1.792481250360578
    
    which is 1.397 to 3 decimal places. Sequence logo height is maximum entropy minus the entropy, so in this case maximum entropy is log2(4), since there are 4 nucleotides, and the sum of the heights of the stacked letters is 2 - 1.397 = 0.603.


    Linkage Disequilibrium

  8. Given the following genotyping data, please compute the following value (and show your work by giving appropriate formulas).

    Solution:

    P(x=1) = P(diploid X locus is "11") + 1/2 P(diploid X locus is "01")
    P(y=1) = P(diploid Y locus is "11") + 1/2 P(diploid Y locus is "01")					
    P(x=1) = 0.445454545				
    P(y=1) = 0.681818182				
    					
    D	-0.012809917				
    Since D < 0, D' is |D|/min(P(x=1)*P(y=1), P(x=0)*P(y=0))
    D'	0.072599532			
    Since D' is close to zero, there is NO LINKAGE between genes.
    
    					
    In hypothesis testing, where the null hypothesis H0 is
       H0: D' = 0  (i.e. there is no linkage)
    and the alternative hypothesis is
       H1: D' different from 0 (i.e. there is linkage)
    we determine that
    rho	-0.055335532				
    n rho^2	0.336822317				
    p-val	0.561669209				
    
    If the p-value were less than 0.05, then we have sufficient statistical evidence to REJECT the null hypothesis and conclude that the genes are linked. However, this is not the case, and we can not reject the null hypothesis and so find no evidence that the genes are linked.


    Sequence Alignment

  9. Consider the following alignment:
    ACGG-TT-GACCTTACGACT-
    A-GTACGA--TCACTCTA-TT
    
    1. Calculate the score of alignment shown above, using the following scoring rules: +1 for a match, -1 for a mismatch, -1 for a gap (linear gap penalty).
    2. Calculate the score of alignment using the following scoring rules: +1 for a match, -1 for a mismatch, -2 for gap initiation and -1 for gap extension (affine gap penalty).

    Solution:

    1. There are 7 gaps, 6 matches, and 8 mismatches, so score is -7+6-8=-9.
    2. There are 6 gap blocks (1 of size 2, and 5 of size 1), 6 matches, and 8 mismatches, so score is -2(6) -1 + 6 - 8 = -15, where I hope I haven't miscounted.
    The main question is whether
    T-GAC
    GA--T
    
    is treated as 1 gap or 2 gaps -- the answer is that there are 2 gap blocks, one of size 1 and one of size 2.


  10. Calculate the semiglobal alignment score, where a match counts +1, a mismatch -1, and gap initiation counts -2, and gap extension -1. In a semiglobal alignment, gaps at the left and right extremity are not counted; however internal gaps do indeed have a gap penalty.
    --ACGG-TT-GACCTTACGACT-
    TTA-GTACGA--TCACTCTA-TT
    

    Solution:

    Semiglobal alignment score of the given aligment is the same as the score of the truncated alignment

    ACGG-TT-GACCTTACGACT
    A-GTACGA--TCACTCTA-T
    
    There are 5 gap blocks (1 of size 2, and 4 of size 1), 6 matches, and 8 mismatches, so score is -2(5) -1 + 6 - 8 = -13, where I hope I haven't miscounted. The main question is whether
    T-GAC
    GA--T
    
    is treated as 1 gap or 2 gaps -- the answer is that there are 2 gap blocks, one of size 1 and one of size 2.


  11. Give the path matrix for the global alignment (Needleman-Wunsch) of GGACC with CGAA, where a match counts +1, a mismatch -2 and a gap -1 (i.e. linear gap penalty). After completing the path matrix, use a traceback to give the optimal alignment.

    Solution:

    s is :GGACC
    t is :CGAA
    		C	G	A	A	
    	0	-1	-2	-3	-4	
    G	-1	-2	0	-1	-2	
    G	-2	-3	-1	-2	-3	
    A	-3	-4	-2	0	-1	
    C	-4	-2	-3	-1	-2	
    C	-5	-3	-4	-2	-3	
    
    The path matrix with backarrows and one (of many) tracebacks is here:


  12. Give the path matrix for the LOCAL alignment (Smith-Waterman) of GGACC with CGAA, where a match counts +1, a mismatch -2 and a gap -1 (i.e. linear gap penalty). After completing the path matrix, use a traceback to give the optimal alignment.

    Solution:

    s is :GGACC
    t is :CGAA
    		C	G	A	A	
    	0	0	0	0	0	
    G	0	0	1	0	0	
    G	0	0	1	0	0	
    A	0	0	0	2	1	
    C	0	1	0	1	0	
    C	0	1	0	0	0	
    
    First, recall that we start numbering rows and columns with zero (0), not 1. The maximum value in the path matrix is 2, and since this value is unique, there is only one optimal local aligment. The traceback arrow from 2 in row 3, col 3 is to the value 1 located in row 2, col 2. The traceback arrow from 1 in row 2, col 2 is to the value 0 located in row 1, col 1. We stop, since we've encountered the value 0, so the optimal local alignment is the following:
    2-GA-3
    2-GA-3
    
    meaning that the starting nucleotide in the first sequence is 2, then ending nucleotide in the first sequence is 3, and similarly for the second sequence. Recall that BLAST output gives you the starting/stopping position for both query and target sequences, and that BLAST is a LOCAL alignment alignment.

    Sequence assembly using de Bruijn graphs

  13. Previous homework problem: Consider the following set of eight 3-mers that have been obtained from a "genome" sequence: {AGT, AAA, ACT, AAC, CTT, GTA, TTT, TAA}.
    1. Draw a de Bruijn graph for this same set of eight fragments. Recall that the nodes will be 2-mers, and an edge is drawn between pairs of nodes if an only if the 1-letter suffix of a 2-mer matches the one-letter prefix of the other 2-mer. Remember, a pair of nodes (and the edge between them) represents a 3-mer from the data. No 2-mer can be present more than once.
    2. Describe a path through this graph that visits each edge exactly once. Here, a path "description" is of the form a->b->c->d, where a,b,c,d are nodes in the graph -- i.e. 2-mers.
    3. Write the superstring corresponding to this path.

      Hint: You are allowed to draw an edge from one 2-mer back to itself, forming a loop that involves no other nodes. [you are also allowed to form more complicated loops, etc.]

    Solution:

    1. [Sequence assembly by search for a (directed) Eulerian path] Define the new graph G=(V,E), where the vertices consist of all 2-mers from the collection of (experimentally given) 3-mers. Here, we write the list of 2-mers, with their multiplicities (number of occurrences).
      1: AA 	4
      2: AC	2
      3: AG	1
      4: CT   2
      5: GT   2
      6: TA   2
      7: TT   3
      
      The collection of directed, labeled edges is given by a list of ordered pairs, together with the 3-mer label that justifies the edge:
      3,5	AGT
      1,1	AAA
      2,4	ACT
      1,2	AAC
      4,7	CTT
      5,6	GTA
      7,7	TTT
      6,1	TAA
      
      Start by identifying the sink, which is vertex 7, since it is the only node which does not occur as the left member of an ordered pair where the right member is distinct from the left pair (i.e. 7 is the only value that does not occur as the head of an arrow leading to a distinct node). Now backtrack, starting with 7 -- in other words, follow the tail to the head of arrows: 3 -> 5 -> 6 -> 1 -> 2 -> 4 -> 7. Read from left to right the edge labels, each time appending the growing sequence by appending the newest symbol. This produces AGTAACTTT. Since we also have the ordered pair (3,3), arguably the sequence should be instead AAGTAACTTT.