BIOL4200
Review for Midterm: Introduction to Bioinformatics
The midterm is a TAKE HOME exam, available on CANVAS from Friday, Nov 12
at 5:00 pm until Sunday, Nov 14 at 11:59 pm. The test is TIMED, and you have
exactly 2 hours, including the time necessary to create a SINGLE PDF file
of your answers, and upload it to the Canvas web site in the QUIZZES section.
For anyone who has kept up in class and understood the material we've covered,
the midterm should not take more than 60-90 minutes, which is ample
time for you to do the test, create a single
pdf file and upload it. PLEASE practice making a single pdf file BEFORE you start
the test, since the CANVAS site will not allow you to upload anything after
2 hours have passed since you have begun the test.
During the midterm exam, you may use class notes from our web site, your own notes,
a calculator, Excel and Mathematica; however,
you may NOT use Internet or discuss the test with any person. Here, I am
solemnly asking each of you to abide by these directions completely. Remember,
if you cheat in this instance, then you will cheat and take short cuts as a
physician, business person or in whatever profession you choose. Certainly
none of you would want to be treated in an emergency room for a life threatening
health problem by a physician who cheats and takes short cuts. There is
ZERO tolerance for cheating, and I have confidence that you all will abide by
the ethical code of Boston College.
The test will reflect coverage of material
in the course, so please review the course slides, homeworks and quizzes.
With the exception of a few questions on general notions (see below), the
test questions will be similar to quiz and homework problems that require
understanding of the material.
On the midterm, there will be NO questions concerning the
Poisson or geometricprobability distributions. There WILL BE
questions on the binomial and hypergeometric distributions. You will have to
use Mathematica to solve a system of differential equations, so please review
class notes and homework for Week 2, when we covered how to model the spread
of an epidemic and how to solve a system of differential equations using
Mathematica (see also the related problems on logistic model for bacterial growth,
etc.).
Concepts to review include the following.
-
GENERAL NOTIONS: Concepts
from molecular biology and biochemistry covered in the notes
chap0_intro.pdf and in notes from other chapters:
Crick's central dogma, examples of violations to the central dogma,
CRISPR-Cas system, year of publication of Watson and Crick's paper on DNA
structure (1953), year of completion of sequencing of human genome (2003),
name of first bacterium sequenced (Haemophilus influenzae 1995),
name of first eukaryote sequenced (S. cerevisiae 1996),
genome size of HIV-1, genome size of Haemophilus influenzae,
genome size of Saccharomyces cerevisiae,
genome size of Homo sapiens, description of how Sanger sequencing works,
what dideoxynucleotides are and how they are used in Sanger sequencing,
what shotgun sequencing is, what is the start codon or codon of methionine
(AUG), be able to name at least one stop codon (UAG,UAA, UGA),
name 2 hydrophobic amino acids,
name 2 hydrophilic (polar) amino acids,
name 2 negatively charged amino acids,
name 2 positively charged amino acids,
what is the smallest amino acid and what is its sidechain (glycine with
sidechain of hydrogen atom).
what is the sidechain of alanine (methyl group CH3),
know all IUPAC nucleotide codes and important amino acid codes,
know difference an an example of orthologous and of paralogous genes,
know that the GC-content is the compositional frequency of G plus that of C,
know what lower and higher numbers mean for both PAM and BLOSUM family
matrices, be able to construct a "toy" BLOSUM matrix as done in the notes
by computing the log odds score log( p(xy)/q(x)q(y) ).
-
Elementary Combinatorics and Probability: Combinations and permutations
("n choose k"), binomial distribution, hypergeometric distribution,
Poisson distribution, the expected number of times that you need to flip
a coin having heads probability 1/8 in order to get the first heads,
expected number of occurrences of a motif
(such as TATAAA) in a genome of size N having compositional frequency
p(A), p(C), p(G), and p(T).
-
Differential equation models with Mathematica: modeling
epidemics using the SIR-model and its extensions, how to solve differential
equations or systems of differential equations that model biological
phenomena (prey-predator, number of bacteria, number of tumor cells, etc.)
using Mathematica.
-
Linkage disequilibrium (LD):
How to compute value D of linkage disequilibrium
for genotype data of diploid organisms (e.g. human in a sample as we've done
in class and a quiz), normalized linkage disequilibrium D', test statistic
n ρ2 to test whether linkage disequilibrium D is
statistically significant, by using χ2 distribution with one
degree of freedom, test statistic
for T-test of equality of two sample means.
-
Sequence Alignment: Global pairwise alignment (Needleman-Wunsch).
Local pairwise alignment (Smith-Waterman). How to construct a path matrix
for local and global alignment with linear gap penalty. Know what the affine
gap penalty is and how to score an alignment with the affine gap penalty.
Know how to score a semi-global alignment -- i.e. where end gaps are not
counted but interior gaps are counted.
-
Graph Theory:
Definitions of directed and undirected graphs, of simple graphs, of connected
graphs, of Hamiltonian cycles and paths, of Eulerian cycles and paths.
Euler's theorem on the existence of Eulerian paths and cycles. How to
construct an Eulerian cycle or path.
Sequence assembly using DeBruijn graphs with application of Euler's
construction. Sequence assembly by constructing a Hamiltonian path or
cycle from k-mer data.
-
Origin of replication (oriC):
G-C skew, cumulative skew, how to predict
G-C raw skew and cumulative skew from knowledge of location of the oriC
and on which strand most genes are transcribed.
-
PSSMs, weight matrices:
Position-specific scoring matrices, position-specific entropy,
maximum entropy, information (maximum entropy minus entropy),
relative entropy, accuracy, sensitivity, specificity,
positive predictive value, etc.
Example questions
General
Crick's proposed Central Dogma states that the "information flow"
in a cell is from DNA to RNA to protein; i.e. DNA -> RNA -> protein, where
each of the arrows cannot be reversed. Describe a violation of the
Central Dogma.
Solution:
The CRISPR/Cas system in prokaryotes involves (1) bacterial proteins
that extract viral DNA, which is incorporated into the CRISPR array
region of the bacterial genome, and (2) transcription of the polycistronic
CRISPR RNAs with subsequent cleavage and incorporation into proteins that
recognize and cleave viral DNA which hybridizes with the CRISPR RNA.
Thus proteins can change DNA, which violates the Central Dogma. Another
example is the yet unclear manner of how long noncoding RNAs direct the
transformation of genomic regions (DNA) into heterochromatin.
This problem concerns IUPAC nucleotide and amino acid codes.
-
What do the IUPAC codes R and Y stand for?
-
What is the IUPAC code for the sequence read "either G or C" -- i.e.
in the case of uncertain data where it is clear that the nucleotide at
a particular position is either G or C, but there is insufficient signal
to determine which.
-
Which amino acid does the IUPAC code G stand for, and what is the
side chain of this amino acid?
Solution:
-
R=purine, Y=pyrimidine.
-
S = strong (G or C)
-
G = Gly = glycine, with side chain H.
-
Restriction enzymes provide a rudimentary
"immune system" for bacteria, in that they recognize
certain specific nucleotide sequences and then cut the sequence in
a specific position in the recognition sequence.
Generally, restriction enzyme recognition sequences are
biological palindromes, such as
BamH1 G^GATCC
EcoR1 G^AATTC
EcoRV GAT^ATC
where the cleavage site is indicated with a '^' (carat) character.
In a random "genome" of
4.6 million base pairs (i.e. double-stranded DNA)
where P(G)=P(C)= 0.3 and P(A)=P(T)=0.2,
what is the expected number of occurrences of the restriction enzyme EcoR1?
Solution:
p = (0.3)*(0.3)*(0.2)**4
p = 0.000144
exp num = p*2*(4600000)
exp num = 1324.8
Probability, Combinatorics, Entropy
-
Suppose that you have a coin whose "heads" probability is 1/8. What is
the expected number of times that you must flip the coin before obtaining
a heads?
Solution:
1/(1/8) = 8.
-
If you flip a coin with heads probability of 1/4 five times, then
what is the probability that you have exactly 2 heads?
Solution:
"5 choose 2" * (1/4)2 * (3/4)3 =
10 * 1/16 * 27/64 = 0.263671875
-
Suppose that you estimate that the probability of an occurrence of an
AUG codon occurring in any given genomic position is 1/64. What is the
expected number of occurrences of AUG in the sense strand of the
E. coli genome?
Solution:
E. coli K-12 has genome size of 4.5 x 106 base pairs, with
relative frequency of 1/4 for each nucleotide. Thus 4,500,000/64 = 70312.5
is the expected number of occurrences of AUG, ASSUMING the DNA sequence is
random. Since E. coli genome is certainly not random, the number of AUG
may deviate from this expected number.
-
Compute the base 2 entropy of the first position of the following
multiple alignment, AND determine the sum of the heights of the stacked
letters in the Web Logo output.
ATTA
CGG-
ATCA
GTTT
TTGT
AAAT
Solution:
In the first position, there are 3 A's, 1 C, 1 G, and 1 T and a total
of 6 nucleotides, so by rounding to 3 decimal places we have:
p(A)=0.5,
p(C)=0.167,
p(G)=0.167,
p(T)=0.167.
H = - [ 0.5 * log(0.5)/log(2) + 3*( 0.167 * log(0.167)/log(2)) =
1.792481250360578
which is 1.397 to 3 decimal places. Sequence logo height is maximum
entropy minus the entropy, so in this case maximum entropy is
log2(4), since there are 4 nucleotides, and the sum of the
heights of the stacked letters is 2 - 1.397 = 0.603.
Linkage Disequilibrium
-
Given the following genotyping data, please compute the following value
(and show your work by giving appropriate formulas).
-
P(x=1)
-
P(x=0)
-
P(y=1)
-
P(y=0)
-
P(x=1,y=1)
-
Linkage disequilibrium D
-
Normalized linkage disequilibrium D
-
Value of rho (ρ)
-
Value of n · ρ2
-
p-value using χ2 distribution with 1 degree of freedom
Solution:
P(x=1) = P(diploid X locus is "11") + 1/2 P(diploid X locus is "01")
P(y=1) = P(diploid Y locus is "11") + 1/2 P(diploid Y locus is "01")
P(x=1) = 0.445454545
P(y=1) = 0.681818182
D -0.012809917
Since D < 0, D' is |D|/min(P(x=1)*P(y=1), P(x=0)*P(y=0))
D' 0.072599532
Since D' is close to zero, there is NO LINKAGE between genes.
In hypothesis testing, where the null hypothesis H0 is
H0: D' = 0 (i.e. there is no linkage)
and the alternative hypothesis is
H1: D' different from 0 (i.e. there is linkage)
we determine that
rho -0.055335532
n rho^2 0.336822317
p-val 0.561669209
If the p-value were less than 0.05, then we have sufficient statistical
evidence to REJECT the null hypothesis and conclude that the genes are linked.
However, this is not the case, and we can not reject the null hypothesis and
so find no evidence that the genes are linked.
Sequence Alignment
-
Consider the following alignment:
ACGG-TT-GACCTTACGACT-
A-GTACGA--TCACTCTA-TT
-
Calculate the score of alignment shown above,
using the following scoring rules: +1 for a match, -1 for a mismatch,
-1 for a gap (linear gap penalty).
-
Calculate the score of alignment
using the following scoring rules: +1 for a match, -1 for a mismatch,
-2 for gap initiation and -1 for gap extension (affine
gap penalty).
Solution:
-
There are 7 gaps, 6 matches, and 8 mismatches, so score is
-7+6-8=-9.
-
There are 6 gap blocks (1 of size 2, and 5 of size 1), 6 matches,
and 8 mismatches, so score is
-2(6) -1 + 6 - 8 = -15, where I hope I haven't miscounted.
The main question is whether
T-GAC
GA--T
is treated as 1 gap or 2 gaps -- the answer is that there are 2 gap blocks,
one of size 1 and one of size 2.
-
Calculate the semiglobal alignment score, where
a match counts +1, a mismatch -1,
and gap initiation counts -2, and gap extension -1. In a semiglobal
alignment, gaps at the left and right extremity are not counted;
however internal gaps do indeed have a gap penalty.
--ACGG-TT-GACCTTACGACT-
TTA-GTACGA--TCACTCTA-TT
Solution:
Semiglobal alignment score of the given aligment is the same as the
score of the truncated alignment
ACGG-TT-GACCTTACGACT
A-GTACGA--TCACTCTA-T
There are 5 gap blocks (1 of size 2, and 4 of size 1), 6 matches,
and 8 mismatches, so score is
-2(5) -1 + 6 - 8 = -13, where I hope I haven't miscounted.
The main question is whether
T-GAC
GA--T
is treated as 1 gap or 2 gaps -- the answer is that there are 2 gap blocks,
one of size 1 and one of size 2.
-
Give the path matrix for the global alignment (Needleman-Wunsch)
of GGACC with CGAA,
where a match counts +1, a mismatch -2 and a gap -1 (i.e. linear gap penalty).
After completing the path matrix,
use a traceback to give the optimal alignment.
Solution:
s is :GGACC
t is :CGAA
C G A A
0 -1 -2 -3 -4
G -1 -2 0 -1 -2
G -2 -3 -1 -2 -3
A -3 -4 -2 0 -1
C -4 -2 -3 -1 -2
C -5 -3 -4 -2 -3
The path matrix with backarrows and one (of many) tracebacks is here:
-
Give the path matrix for the LOCAL alignment (Smith-Waterman)
of GGACC with CGAA,
where a match counts +1, a mismatch -2 and a gap -1 (i.e. linear gap penalty).
After completing the path matrix,
use a traceback to give the optimal alignment.
Solution:
s is :GGACC
t is :CGAA
C G A A
0 0 0 0 0
G 0 0 1 0 0
G 0 0 1 0 0
A 0 0 0 2 1
C 0 1 0 1 0
C 0 1 0 0 0
First, recall that we start numbering rows and columns with zero (0), not 1.
The maximum value in the path matrix is 2, and since this value is unique,
there is only one optimal local aligment. The traceback arrow from 2 in
row 3, col 3 is to the value 1 located in row 2, col 2. The traceback arrow
from 1 in row 2, col 2 is to the value 0 located in row 1, col 1. We stop,
since we've encountered the value 0, so the optimal local alignment is
the following:
2-GA-3
2-GA-3
meaning that the starting nucleotide in the first sequence is 2, then
ending nucleotide in the first sequence is 3, and similarly for the second
sequence. Recall that BLAST output gives you the starting/stopping position
for both query and target sequences, and that BLAST is a LOCAL alignment
alignment.
Sequence assembly using de Bruijn graphs
-
Previous homework problem:
Consider the following set of eight 3-mers that have been obtained from a
"genome" sequence: {AGT, AAA, ACT, AAC, CTT, GTA, TTT, TAA}.
-
Draw a de Bruijn graph for this same set of eight fragments. Recall that the
nodes will be 2-mers, and an edge is drawn between pairs of nodes if an
only if the 1-letter suffix of a 2-mer matches the one-letter prefix of the
other 2-mer. Remember, a pair of nodes (and the edge between them)
represents a 3-mer from the data. No 2-mer can be present more than once.
-
Describe a path through this graph that visits each edge exactly once.
Here, a path "description" is of the form a->b->c->d, where a,b,c,d are nodes
in the graph -- i.e. 2-mers.
-
Write the superstring corresponding to this path.
Hint:
You are allowed to draw an edge from one
2-mer back to itself, forming a loop that involves no other nodes. [you are
also allowed to form more complicated loops, etc.]
Solution:
-
[Sequence assembly by search for a (directed) Eulerian path]
Define the new graph G=(V,E), where the vertices consist of all 2-mers
from the collection of (experimentally given) 3-mers. Here, we write
the list of 2-mers, with their multiplicities (number of occurrences).
1: AA 4
2: AC 2
3: AG 1
4: CT 2
5: GT 2
6: TA 2
7: TT 3
The collection of directed, labeled edges is given by a list
of ordered pairs, together with the 3-mer label that justifies the edge:
3,5 AGT
1,1 AAA
2,4 ACT
1,2 AAC
4,7 CTT
5,6 GTA
7,7 TTT
6,1 TAA
Start by identifying the sink, which is vertex 7, since
it is the only node which does not occur as the left member of an
ordered pair where the right member is distinct from the left pair
(i.e. 7 is the only value that does not occur as the head of an arrow
leading to a distinct node). Now backtrack, starting with 7 -- in other
words, follow the tail to the head of arrows:
3 -> 5 -> 6 -> 1 -> 2 -> 4 -> 7. Read from left to right the edge labels,
each time appending the growing sequence by appending the newest symbol.
This produces AGTAACTTT. Since we also have the ordered pair (3,3),
arguably the sequence should be instead
AAGTAACTTT.