Although the exam is cumulative, it will focus mostly on the material
covered since the midterm -- i.e. simple applications of probability
theory and probability distributions (binomial, geometric, hypergeometric,
normal) and statistical tests (T-test, χ2-test),
linkage disequilibrium,bacterial origin of replication (ORI) -- GC-skew,
biological reasons for skew, glycan arrays, Fisher exact test and its use
for the infectivity analysis of the influenze virus, phylogenetic tree
construction.
The exam will reflect coverage of material
in the course, so please review the course slides and homeworks.
Concepts to review include the following.
I would encourage you to focus on the following notions:
Using the BLOSUM62 (log odds) matrix, answer the following questions.
Solution:
BLOSUM2 values, as explained in class (and as you computed in the midterm
exam), are given by α · log( p(x,y)/(q(x)q(y)) ) rounded to
the nearest integer, where α is a scaling factor, p(x,y) is the joint
probabity (relative frequency that x and y both appear in the same column
of the multiple alignment), and q(x) [resp. q(y)] is the relative frequency
of occurrence of residues x [resp. y] in the multiple alignment.
If the BLOSUM value is positive, then substitution happens more often by
chance, while
if the BLOSUM value is negative , then substitution happens less often than
by chance. From this, the answers to a,b,c are clear when you check values
in the BLOSUM62 table.
What do the databases GenBank, PDB, Rfam, Pfam, UniProt, PubMed, OMIM
contain? Please be precise. If the database contains 3D structures
determined by X-ray crystallography or NMR, then please state this; if
the database contains RNA sequences and secondary structures, the please
state this, etc.
Solution:
Check on-line. These were explicitly discussed in class, though I didn't
write all the details in the class slides.
Describe one difference between the format/contents of a GenBank
genomic nucleotide sequence file in comparison with an EMBL (EBI)
genomic nucleotide sequence file.
Solution:
Same as above.
Describe how one uses PSI-BLAST to determine the PSSM of remote homologs
of a user-specified protein.
Solution:
Same as above. What I said in class is that PSI-BLAST ("position specific
iterated BLAST) involves a user-specified number of iterations (often 3 or 4).
After a first application of BLAST, the user can either manually select, or
what is more often the case, simply select ALL the hits returned by BLAST
whose E-value is less than the default value on the web server.
The hits are unified into a profile or PSSM, which is then BLASTED against
the databases again, etc.
Please select one of the options in the following multiple choice selection.
What does the database PDB contain:
Solution:
None of the above
In E. coli, which has compositional frequency of 0.25 for each
nucleotide, what is the expected length of an ORF computed from basic
probability?
Solution:
Rephrasing the question, suppose that you have a coin with
heads probability of p. How many times do you need to flip the coin
(on average) to get the first heads? (We did this a number of times in
class.) For this question, you need to determine the "heads probability"
(probability of a STOP codon) for the E. coli genome. Once you determine
this, you then determine the expected number of coin flips before getting
a "heads" (i.e. a STOP codon).
Compute the linkage disequilibrium value D(X,Y) and its normalized value D'.
Solution:
You should work through the formula to make sure you understand the answer.
In hypothesis testing, whether the genes are associated (linked), the null
hypothesis H0 is "locus X and locus Y are not linked", while the alternative
hypothesis H1 is "locus X and locus Y are linked". The χ2
test statistic can be computed directly, but in our class notes, we saw that
the test statistic can also be expressed in terms of the linkage disequilibrium
D (not the normalized value D'):
Solution:
Solution:
If the process of acquiring a motif is uniform (think of randomly choosing the
exact location within (say) one megabase for a motif to occur), then
the NUMBER of occurrences of the motif within a window of (say) one megabase
should be approximately given by the Poisson probability:
Solution:
This is a tricky problem, although we have mentioned in class all the material
needed to answer the question.
The probability density function for the exponential distribution is
given by
Moreover, in class, we also said that the exponential distribution is
the CONTINUOUS analogue of the GEOMETRIC distribution, which latter is
described as follows:
A final remark: it is usual to model the "waiting time" or equivalently the
"genomic distance" between two motif occurrences by the exponential
distribution, provided the number of motif occurrences is modeled by
the Poisson distribution. However, the exponential distribution is the
continuous analogue of the geometric distribution, and requires a bit more
math (not much, just a small integral), so we'll settle for the above
approximation using the geometric distribution).
If we don't assume Chargaff's second "law", then we have the equation
r2 · s4 = 0.0001, where r [resp. s] denotes the
probability of T [resp. A]. To solve 2 unknowns, we need 2 equations;
without an additional equation, we can trace a curve of s as a function
of r (similar to a curve of y as a function of x), which allows you to
determine the probability of T if you know the probability of A.
Solution:
First, write out the contingency table:
a = number of glycans that contain T and bind H,
b = number of glycans that do NOT contain T yet do bind H,
c = number of glycans that contain T and do not bind H,
d = number of glycans that do NOT contain T and do NOT bind H.
The numerator is the multinomial factor "n choose a,b,c,d", where n=a+b+c+d.
The denominator is the product of the binomial factor
"n choose (a+b)" where a+b is the sum of first row, times
"n choose (a+c)" where a+c is the sum of first column.
Then crunch numbers with your calculator.
Solution:
Solution:
In the first position, there are 3 A's, 1 C, 1 G, and 1 T and a total
of 6 nucleotides, so by rounding to 3 decimal places we have:
p(A)=0.5,
p(C)=0.167,
p(G)=0.167,
p(T)=0.167.
Solution:
Solution:
Hint:
You are allowed to draw an edge from one
2-mer back to itself, forming a loop that involves no other nodes. [you are
also allowed to form more complicated loops, etc.]
Solution:
Solution:
See the GC-skew plot of B. subtilis in class slides.
Now consider the combined effect of transcription and replication on the
cumulative GC-skew.
Solution:
Please give the list of (weighted) edges representation of the directed
graph below, and then
step through Dijkstra's algorithm, when
taking node 1 as source. Please output the optimal path from 1 to 6.
Also, you should be able to apply Dijkstra's algorithm to determine the
optimal gene, given various potential donor and acceptor sites and
scores of likelihood that a certain region is an intron or an exon.
Solution:
WARNING: In my solution below, for convenience, I assume the nodes
a,b,c,d,e,f are the integers 1,2,3,4,5,6 in that order. Thus the first
line (1 2 2) of the following list corresponds to the directed edge
from a to b with weight 2.
Solution:
The following four trees are drawn with branch lengths to scale.
Which statement is correct?
Solution:
D is correct: trees (iii) and (iv) are equivalent because C branches first,
D second and A and B last.
Consider the following unsigned sequence of genes:
Let F be the permutation which represents the gene order 5 1 2 4 3;
in other words, F is a function, whose domain and range is the set {1,2,3,4,5},
and whose values are defined by F(1)=5, F(2)=1, F(3)=2, F(4)=4, F(5)=3.
Solution
Since we did the previous example in class, you should try to solve the
same problem for the following gene order in a chromosome:
Consider the following unsigned sequence of genes:
Some example questions
General
Linkage disequilibrium
χ2 = n D2/(P(x=0)P(x=1)P(y=0)P(y=1))
One then determines the area of the tail region, to the right of the
χ2 test statistic in the χ2 distribution
with 1 degree of freedom. If the area, called the "p-value", is less than
0.05, then with 95% statistical certainty we can REJECT the null hypothesis,
and conclude that the loci are linked.
Probability, Combinatorics, Entropy
"4 choose 2" · (1/3)2 · (1-1/3)2
= 0.296296
("4 choose 2" · "8 choose 2" )/"12 choose 4" =
6 · 28/ ( (8 · 7 · 6 · 5)/24 ) = 2.4
Prob[X = k] = e-μ · (μ)k/k!
where Prob[X=k] is the probability of exactly k motif occurrences.
If the number of motif occurrences is really representative, then we
can assume that the EXPECTED NUMBER of motif occurrences is 200, where
the words expectation, expected value, mean, and average all mean the same
thing -- it follows that μ=200. The probability of NO occurrence is
Prob[X = 0] = e-200 · (μ)0/0! = e-200 = 1.3838965267367376e-87
Prob[X = 2] = e-200 · 2002/2! = 2.767793053473475e-83
p(x) = λ · e-λ x
In class, we said that the exponential distribution is a good mathematical
model for the genomic DISTANCE between two successive motifs.
In particular, the probability that the distance between two
SUCCESSIVE motifs is AT LEAST A and AT MOST B is
is the integral from A to B of the probability density function p(x).
The mean (average value) for the exponential distribution is
μ=1/λ . The cumulative density function,
or integral from 0 to x of the density function p(x) is given by
CDF(x) = 1 - e-λ · x
From the problem assumptions, we
have μ=104, so λ=10-4.
probability of (k-1) tails followed by a heads = P(k) = (1-p)k-1 · p
where p is the probability of obtaining a heads when flipping a biased coin.
The mean of the geometric distribution is 1/p, just as the mean of the
exponential distribution with parameter λ is 1/λ. Thus if
we set p=λ, then a good estimate for the probability of occurrence
of TATAAA at any given occurrence is p=0.0001. Now, if we assume that
the genomic compositional frequency of A,C,G,T is 1/4, as in the case of
E. coli K-12, then we have heads probability q=(0.25)6 =
0.000244140625, which is about twice the value of p. Let's assume that
the genomic frequency of A equals that of T, and that
the genomic frequency of C equals that of G -- the so-called "second"
Chargaff law (which is not a "law" at all and can easily be seen to be
violated given certain genomes). The probability of TATAAA is then
r6, where r is the (unknown) probability of T, or equivalently
of A. If r6 = 0.0001, then by taking the sixth-root of both
sides, we have that r=0.2154. Thus we have (by our assumption of
Chargaff's second "law") that P(A) = 0.2154 = P(T), P(C) = 0.3923 = P(G).
Now A and G are purines, while C and T are pyrimidines. By our assumption
of Chargaff's second "law", there are an equal number of purines and
pyrimidines.
The expected number, or average, number of occurrences is (approximately)
μ = (1/64)*100 = 1.5625
By definition, P[X=k] = exp(-μ)· μk/k!.
Now
(1) exp(-μ) = exp(-1.5625) = 0.2096113871510978,
(2) μ5 = 9.313225746154785,
(3) μ5/(5*4*3*2*1) = 0.07761021455128987,
so
P[X=5] = exp(-μ)* μ5/(5*4*3*2*1) = 0.016267984729190187
ATTA
CGG-
ATCA
GTTT
TTGT
AAAT
H = - [ 0.5 * log(0.5)/log(2) + 3*( 0.167 * log(0.167)) =
1.3966704947492563
which is 1.397 to 3 decimal places. Sequence logo height is maximum
entropy minus the entropy, so in this case maximum entropy is
log2(4), since there are 4 nucleotides, and the sum of the
heights of the stacked letters is 2 - 1.397 = 0.603.
AT
CG
AT
GT
TT
AA
1 2
----------------------
A| 1/2 1/6
C| 1/6 0
G| 1/6 1/6
T| 1/6 2/3
1 2
----------------------
A| 4/10 2/10
C| 2/10 1/10
G| 2/10 2/10
T| 2/10 5/10
Sequence Alignment
ACGG-TT-GACCTTACGACT-
A-GTACGA--TCACTCTA-TT
--ACGG-TT-GACCTTACGACT-
TTA-GTACGA--TCACTCTA-TT
s is :TGAGC
t is :AGCGT
Path matrix
0 -1 -2 -3 -4 -5
-1 -2 -3 -4 -5 -3
-2 -3 -1 -2 -3 -4
-3 -1 -2 -3 -4 -5
-4 -2 0 -1 -2 -3
-5 -3 -1 1 0 -1
Path arrow matrix
The optimal alignment is
TGAGC--
--AGCGT
Sequence assembly using de Bruijn graphs
The collection of directed, labeled edges is given by a list
of ordered pairs, together with the 3-mer label that justifies the edge:
1: AA 3
2: AC
3: AG
4: CT 2
5: GT 2
6: TA 2
7: TT 3
Start by identifying the sink, which is vertex 7, since
it is the only node which does not occur as the left member of an
ordered pair where the right member is distinct from the left pair
(i.e. 7 is the only value that does not occur as the head of an arrow
leading to a distinct node). Now backtrack, starting with 7 -- in other
words, follow the tail to the head of arrows:
3 -> 5 -> 6 -> 1 -> 2 -> 4 -> 7. Read from left to right the edge labels,
each time appending the growing sequence by appending the newest symbol.
This produces AGTAACTTT. Since we also have the ordered pair (3,3),
arguably the sequence should be instead
AAGTAACTTT.
3,5 AGT
3,3 AAA
2,4 ACT
1,2 AAC
4,7 CTT
5,6 GTA
7,7 TTT
6,1 TAA
Origin of replication determination
CDS complement(3300..4037)
Taking into account the effect of transcription,
draw the cumulative skew plot in this case.
Dijkstra's single source shortest path algorithm
List of weighted edges representation:
1 2 2
1 3 4
2 3 1
2 5 2
3 5 3
2 4 4
5 4 3
4 6 2
5 6 2
-----------------------------------
Initialization step (time 0):
vertex D Pred
1: 0 N/A
2: inf N/A
3: inf N/A
4: inf N/A
5: inf N/A
6: inf N/A
Q = [1, 2, 3, 4, 5, 6]
-----------------------------------
Time step 1: select node from Q with smallest distance to source, where
the source is 1 (i.e. find index k such that D[x] is smallest value possible).
Set vertex x=1, since D[1]=0, remove 1 from Q so Q = [2, 3, 4, 5, 6]
For all nodes y such that x->y is edge,
if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
Update matrices D (distance to source) and Pred (predecessor) accordingly.
vertex D Pred
1: 0 N/A
2: 2 1
3: 4 1
4: inf N/A
5: inf N/A
6: inf N/A
Q = [2, 3, 4, 5, 6]
-----------------------------------
Time step 2: select node x from Q with smallest distance to source.
Set vertex x=2, since D[2]=2, remove 2 from Q so Q = [3, 4, 5, 6]
For all nodes y such that x->y is edge,
if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
Update matrices D (distance to source) and Pred (predecessor) accordingly.
vertex D Pred
1: 0 N/A
2: 2 1
3: 3 2
4: 6 2
5: 4 2
6: inf N/A
Q = [3, 4, 5, 6]
-----------------------------------
Time step 3: select node x from Q with smallest distance to source.
Set vertex x=3, since D[3]=3, remove 3 from Q so Q = [4, 5, 6]
For all nodes y such that x->y is edge,
if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
Update matrices D (distance to source) and Pred (predecessor) accordingly.
In this case, D[3]+weighte(3->5) is greater than D[3], so no updates are
performed to D and Pred.
vertex D Pred
1: 0 N/A
2: 2 1
3: 3 2
4: 6 2
5: 4 2
6: inf N/A
Q = [4, 5, 6]
-----------------------------------
Time step 4: select node x from Q with smallest distance to source.
Set vertex x=5, since D[5]=4, remove 5 from Q so Q = [4, 6]
For all nodes y such that x->y is edge,
if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
Update matrices D (distance to source) and Pred (predecessor) accordingly.
vertex D Pred
1: 0 N/A
2: 2 1
3: 3 2
4: 6 2
5: 4 2
6: 6 5
Q = [4, 6]
-----------------------------------
Time step 5: select node x from Q with smallest distance to source.
Set vertex x=4, since D[4]=6, and although D[6]=6, 4 comes before 6, so
we choose 4. Remove 4 from Q so Q = [6]
For all nodes y such that x->y is edge,
if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
Update matrices D (distance to source) and Pred (predecessor) accordingly.
In this case, D[4]+weighte(4->6) is greater than D[6], so no updates are
performed to D and Pred.
vertex D Pred
1: 0 N/A
2: 2 1
3: 3 2
4: 6 2
5: 4 2
6: 6 5
Q = [6]
-----------------------------------
Time step 6: select node x from Q with smallest distance to source.
Set vertex x=6, since D[6]=6, and there are no other nodes in Q.
Remove 6 from Q so Q = [ ] i.e. empty
For all nodes y such that x->y is edge,
if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
Update matrices D (distance to source) and Pred (predecessor) accordingly.
vertex D Pred
1: 0 N/A
2: 2 1
3: 3 2
4: 6 2
5: 4 2
6: 6 5
Using tracebacks with Pred array, we determine optimal path from source
vertex 1 to target vertex 6: 1->2->5->6
Phylogenetic tree construction
0 5 4 7 6 8
5 0 7 10 9 11
4 7 0 7 6 8
7 10 7 0 5 9
6 9 6 5 0 8
8 11 8 9 8 0
Original distance matrix where columns and rows correspond respectively
to 0,1,2,3,4,5 (i.e. the leaves are labeled 0,1,2,3,4,5 instead of A,B,C,D,E,F).
A B C D E F
A 0 5 4 7 6 8
B 5 0 7 10 9 11
C 4 7 0 7 6 8
D 7 10 7 0 5 9
E 6 9 6 5 0 8
F 8 11 8 9 8 0
---------------------------------------------
Step 1: Combine A,C to form {A,C}.
A,C B D E F
A,C 0 6 7 6 8
B 6 0 10 9 11
D 7 10 0 5 9
E 6 9 5 0 8
F 8 11 9 8 0
---------------------------------------------
Step 2: Combine D,E into {D,E}.
A,C B D,E F
A,C 0 6 6.5 8
B 6 0 9.5 11
D,E 6.5 9.5 0 8.5
F 8 11 8.5 0
---------------------------------------------
Step 3: Combine {A,C} and B into {A,B,C}
A,B,C D,E F
A,B,C 0 8 9.5
D,E 8 0 8.5
F 9.5 8.5 0
---------------------------------------------
Step 4: Combine {A,B,C} and {D,E} into {A,B,C,D,E} which we abbreviate by A-E
A-E F
A-E 0 9
F 9 0
---------------------------------------------
Step 5: Combine {A-F} and G into {A,B,C,D,E,F} to produce the following
tree:
--------5
|
--------4.5
|
--------4
|
--------2.5
|
--------3
|
--------4
|
--------1
|
--------3
|
--------2
|
--------2
|
--------0
Following the class notes, you should be able to give the edge weights
for all edges of the tree (not shown here).

Genomic rearrangement events (reversals)
5 1 2 4 3
def greedyInversionDistance(F):
distance = 0
for i = 1 to n
j = F_inverse(i)
//j = position of element i in permutation F
//i.e. F(j)=i or equivalently F_inverse(i)=j
if i != j
F = F r(i,j) //F composed with the reversal of segment between i and j
print F
distance = distance + 1
if F = identity permutation:
return distance
0 | 5 | 1 2 | 4 3 | 6
Simulation of greedy algorithn on initial data, where the data
or permutation F is given by
5 1 2 4 3
and is represented by the following table. Here we think of a
permutation F as a 1-to-1 function from domain {1,2,...,5} into
range {1,2,...,5}, so F(1)=5, F(2)=1, etc.
Step 0: Original data F
x|y
---
1 5
2 1
3 2
4 4
5 3
Step 1: Inversion r(1,2) applied to reverse segment (1 5) to (5 1)
x|y
---
1 1
2 5
3 2
4 4
5 3
Step 2: Inversion r(2,3) applied to reverse segment (5 2) to (2 5)
x|y
---
1 1
2 2
3 5
4 4
5 3
Step 3: Inversion r(3,5) applied to reverse segment (5 4 3) to (3 4 5)
x|y
---
1 1
2 2
3 3
4 4
5 5
6 4 5 3 1 2