BIOL4200
Review for Midterm: Introduction to Bioinformatics

The midterm is a TAKE HOME exam, available on CANVAS from Friday, Nov 12 at 5:00 pm until Sunday, Nov 14 at 11:59 pm. The test is TIMED, and you have exactly 2 hours, including the time necessary to create a SINGLE PDF file of your answers, and upload it to the Canvas web site in the QUIZZES section.

For anyone who has kept up in class and understood the material we've covered, the midterm should not take more than 60-90 minutes, which is ample time for you to do the test, create a single pdf file and upload it. PLEASE practice making a single pdf file BEFORE you start the test, since the CANVAS site will not allow you to upload anything after 2 hours have passed since you have begun the test.

During the midterm exam, you may use class notes from our web site, your own notes, a calculator, Excel and Mathematica; however, you may NOT use Internet or discuss the test with any person. Here, I am solemnly asking each of you to abide by these directions completely. Remember, if you cheat in this instance, then you will cheat and take short cuts as a physician, business person or in whatever profession you choose. Certainly none of you would want to be treated in an emergency room for a life threatening health problem by a physician who cheats and takes short cuts. There is ZERO tolerance for cheating, and I have confidence that you all will abide by the ethical code of Boston College.

The test will reflect coverage of material in the course, so please review the course slides, homeworks and quizzes. With the exception of a few questions on general notions (see below), the test questions will be similar to quiz and homework problems that require understanding of the material.

On the midterm, there will be NO questions concerning the Poisson or geometricprobability distributions. There WILL BE questions on the binomial and hypergeometric distributions. You will have to use Mathematica to solve a system of differential equations, so please review class notes and homework for Week 2, when we covered how to model the spread of an epidemic and how to solve a system of differential equations using Mathematica (see also the related problems on logistic model for bacterial growth, etc.).

Concepts to review include the following.

GENERAL NOTIONS: Concepts from molecular biology and biochemistry covered in the notes chap0_intro.pdf and in notes from other chapters: Crick's central dogma, examples of violations to the central dogma, CRISPR-Cas system, year of publication of Watson and Crick's paper on DNA structure (1953), year of completion of sequencing of human genome (2003), name of first bacterium sequenced (Haemophilus influenzae 1995), name of first eukaryote sequenced (S. cerevisiae 1996), genome size of HIV-1, genome size of Haemophilus influenzae, genome size of Saccharomyces cerevisiae, genome size of Homo sapiens, description of how Sanger sequencing works, what dideoxynucleotides are and how they are used in Sanger sequencing, what shotgun sequencing is, what is the start codon or codon of methionine (AUG), be able to name at least one stop codon (UAG,UAA, UGA), name 2 hydrophobic amino acids, name 2 hydrophilic (polar) amino acids, name 2 negatively charged amino acids, name 2 positively charged amino acids, what is the smallest amino acid and what is its sidechain (glycine with sidechain of hydrogen atom). what is the sidechain of alanine (methyl group CH3), know all IUPAC nucleotide codes and important amino acid codes, know difference an an example of orthologous and of paralogous genes, know that the GC-content is the compositional frequency of G plus that of C, know what lower and higher numbers mean for both PAM and BLOSUM family matrices, be able to construct a "toy" BLOSUM matrix as done in the notes by computing the log odds score log( p(xy)/q(x)q(y) ).
Elementary Combinatorics and Probability: Combinations and permutations ("n choose k"), binomial distribution, hypergeometric distribution, Poisson distribution, the expected number of times that you need to flip a coin having heads probability 1/8 in order to get the first heads, expected number of occurrences of a motif (such as TATAAA) in a genome of size N having compositional frequency p(A), p(C), p(G), and p(T).
Differential equation models with Mathematica: modeling epidemics using the SIR-model and its extensions, how to solve differential equations or systems of differential equations that model biological phenomena (prey-predator, number of bacteria, number of tumor cells, etc.) using Mathematica.
Linkage disequilibrium (LD): How to compute value D of linkage disequilibrium for genotype data of diploid organisms (e.g. human in a sample as we've done in class and a quiz), normalized linkage disequilibrium D', test statistic n ρ² to test whether linkage disequilibrium D is statistically significant, by using χ² distribution with one degree of freedom, test statistic for T-test of equality of two sample means.
Sequence Alignment: Global pairwise alignment (Needleman-Wunsch). Local pairwise alignment (Smith-Waterman). How to construct a path matrix for local and global alignment with linear gap penalty. Know what the affine gap penalty is and how to score an alignment with the affine gap penalty. Know how to score a semi-global alignment -- i.e. where end gaps are not counted but interior gaps are counted.
Graph Theory: Definitions of directed and undirected graphs, of simple graphs, of connected graphs, of Hamiltonian cycles and paths, of Eulerian cycles and paths. Euler's theorem on the existence of Eulerian paths and cycles. How to construct an Eulerian cycle or path. Sequence assembly using DeBruijn graphs with application of Euler's construction. Sequence assembly by constructing a Hamiltonian path or cycle from k-mer data.
Origin of replication (oriC): G-C skew, cumulative skew, how to predict G-C raw skew and cumulative skew from knowledge of location of the oriC and on which strand most genes are transcribed.
PSSMs, weight matrices: Position-specific scoring matrices, position-specific entropy, maximum entropy, information (maximum entropy minus entropy), relative entropy, accuracy, sensitivity, specificity, positive predictive value, etc.

Example questions

General

Crick's proposed Central Dogma states that the "information flow" in a cell is from DNA to RNA to protein; i.e. DNA -> RNA -> protein, where each of the arrows cannot be reversed. Describe a violation of the Central Dogma.
Solution:
The CRISPR/Cas system in prokaryotes involves (1) bacterial proteins that extract viral DNA, which is incorporated into the CRISPR array region of the bacterial genome, and (2) transcription of the polycistronic CRISPR RNAs with subsequent cleavage and incorporation into proteins that recognize and cleave viral DNA which hybridizes with the CRISPR RNA. Thus proteins can change DNA, which violates the Central Dogma. Another example is the yet unclear manner of how long noncoding RNAs direct the transformation of genomic regions (DNA) into heterochromatin.
This problem concerns IUPAC nucleotide and amino acid codes.
1. What do the IUPAC codes R and Y stand for?
2. What is the IUPAC code for the sequence read "either G or C" -- i.e. in the case of uncertain data where it is clear that the nucleotide at a particular position is either G or C, but there is insufficient signal to determine which.
3. Which amino acid does the IUPAC code G stand for, and what is the side chain of this amino acid?
Solution:
1. R=purine, Y=pyrimidine.
2. S = strong (G or C)
3. G = Gly = glycine, with side chain H.
Restriction enzymes provide a rudimentary "immune system" for bacteria, in that they recognize certain specific nucleotide sequences and then cut the sequence in a specific position in the recognition sequence. Generally, restriction enzyme recognition sequences are biological palindromes, such as
```
BamH1   G^GATCC
EcoR1   G^AATTC
EcoRV   GAT^ATC
```
where the cleavage site is indicated with a '^' (carat) character. In a random "genome" of 4.6 million base pairs (i.e. double-stranded DNA) where P(G)=P(C)= 0.3 and P(A)=P(T)=0.2, what is the expected number of occurrences of the restriction enzyme EcoR1?

Solution:
```
p = (0.3)*(0.3)*(0.2)**4
p = 0.000144
exp num = p*2*(4600000)
exp num = 1324.8
```
Probability, Combinatorics, Entropy
Suppose that you have a coin whose "heads" probability is 1/8. What is the expected number of times that you must flip the coin before obtaining a heads?
Solution:
1/(1/8) = 8.
If you flip a coin with heads probability of 1/4 five times, then what is the probability that you have exactly 2 heads?
Solution:
"5 choose 2" * (1/4)² * (3/4)³ = 10 * 1/16 * 27/64 = 0.263671875
Suppose that you estimate that the probability of an occurrence of an AUG codon occurring in any given genomic position is 1/64. What is the expected number of occurrences of AUG in the sense strand of the E. coli genome?
Solution:
E. coli K-12 has genome size of 4.5 x 10⁶ base pairs, with relative frequency of 1/4 for each nucleotide. Thus 4,500,000/64 = 70312.5 is the expected number of occurrences of AUG, ASSUMING the DNA sequence is random. Since E. coli genome is certainly not random, the number of AUG may deviate from this expected number.
Compute the base 2 entropy of the first position of the following multiple alignment, AND determine the sum of the heights of the stacked letters in the Web Logo output.
```
ATTA
CGG-
ATCA
GTTT
TTGT
AAAT
```
Solution:
In the first position, there are 3 A's, 1 C, 1 G, and 1 T and a total of 6 nucleotides, so by rounding to 3 decimal places we have: p(A)=0.5, p(C)=0.167, p(G)=0.167, p(T)=0.167.
```
H = - [ 0.5 * log(0.5)/log(2) + 3*( 0.167 * log(0.167)/log(2)) = 
1.792481250360578
```
which is 1.397 to 3 decimal places. Sequence logo height is maximum entropy minus the entropy, so in this case maximum entropy is log₂(4), since there are 4 nucleotides, and the sum of the heights of the stacked letters is 2 - 1.397 = 0.603.

Linkage Disequilibrium

Given the following genotyping data, please compute the following value (and show your work by giving appropriate formulas).

P(x=1)
P(x=0)
P(y=1)
P(y=0)
P(x=1,y=1)
Linkage disequilibrium D
Normalized linkage disequilibrium D
Value of rho (ρ)
Value of n · ρ²
p-value using χ² distribution with 1 degree of freedom

Solution:

P(x=1) = P(diploid X locus is "11") + 1/2 P(diploid X locus is "01")
P(y=1) = P(diploid Y locus is "11") + 1/2 P(diploid Y locus is "01")					
P(x=1) = 0.445454545				
P(y=1) = 0.681818182				
					
D	-0.012809917				
Since D < 0, D' is |D|/min(P(x=1)*P(y=1), P(x=0)*P(y=0))
D'	0.072599532			
Since D' is close to zero, there is NO LINKAGE between genes.

					
In hypothesis testing, where the null hypothesis H0 is
   H0: D' = 0  (i.e. there is no linkage)
and the alternative hypothesis is
   H1: D' different from 0 (i.e. there is linkage)
we determine that
rho	-0.055335532				
n rho^2	0.336822317				
p-val	0.561669209

If the p-value were less than 0.05, then we have sufficient statistical evidence to REJECT the null hypothesis and conclude that the genes are linked. However, this is not the case, and we can not reject the null hypothesis and so find no evidence that the genes are linked.

Sequence Alignment

Consider the following alignment:
```
ACGG-TT-GACCTTACGACT-
A-GTACGA--TCACTCTA-TT
```
1. Calculate the score of alignment shown above, using the following scoring rules: +1 for a match, -1 for a mismatch, -1 for a gap (linear gap penalty).
2. Calculate the score of alignment using the following scoring rules: +1 for a match, -1 for a mismatch, -2 for gap initiation and -1 for gap extension (affine gap penalty).
Solution:
1. There are 7 gaps, 6 matches, and 8 mismatches, so score is -7+6-8=-9.
2. There are 6 gap blocks (1 of size 2, and 5 of size 1), 6 matches, and 8 mismatches, so score is -2(6) -1 + 6 - 8 = -15, where I hope I haven't miscounted.
The main question is whether
```
T-GAC
GA--T
```
is treated as 1 gap or 2 gaps -- the answer is that there are 2 gap blocks, one of size 1 and one of size 2.
Calculate the semiglobal alignment score, where a match counts +1, a mismatch -1, and gap initiation counts -2, and gap extension -1. In a semiglobal alignment, gaps at the left and right extremity are not counted; however internal gaps do indeed have a gap penalty.
```
--ACGG-TT-GACCTTACGACT-
TTA-GTACGA--TCACTCTA-TT
```
Solution:
Semiglobal alignment score of the given aligment is the same as the score of the truncated alignment
```
ACGG-TT-GACCTTACGACT
A-GTACGA--TCACTCTA-T
```
There are 5 gap blocks (1 of size 2, and 4 of size 1), 6 matches, and 8 mismatches, so score is -2(5) -1 + 6 - 8 = -13, where I hope I haven't miscounted. The main question is whether
```
T-GAC
GA--T
```
is treated as 1 gap or 2 gaps -- the answer is that there are 2 gap blocks, one of size 1 and one of size 2.
Give the path matrix for the global alignment (Needleman-Wunsch) of GGACC with CGAA, where a match counts +1, a mismatch -2 and a gap -1 (i.e. linear gap penalty). After completing the path matrix, use a traceback to give the optimal alignment.
Solution:
```
s is :GGACC
t is :CGAA
		C	G	A	A	
	0	-1	-2	-3	-4	
G	-1	-2	0	-1	-2	
G	-2	-3	-1	-2	-3	
A	-3	-4	-2	0	-1	
C	-4	-2	-3	-1	-2	
C	-5	-3	-4	-2	-3	
```
The path matrix with backarrows and one (of many) tracebacks is here:
Give the path matrix for the LOCAL alignment (Smith-Waterman) of GGACC with CGAA, where a match counts +1, a mismatch -2 and a gap -1 (i.e. linear gap penalty). After completing the path matrix, use a traceback to give the optimal alignment.
Solution:
```
s is :GGACC
t is :CGAA
		C	G	A	A	
	0	0	0	0	0	
G	0	0	1	0	0	
G	0	0	1	0	0	
A	0	0	0	2	1	
C	0	1	0	1	0	
C	0	1	0	0	0	
```
First, recall that we start numbering rows and columns with zero (0), not 1. The maximum value in the path matrix is 2, and since this value is unique, there is only one optimal local aligment. The traceback arrow from 2 in row 3, col 3 is to the value 1 located in row 2, col 2. The traceback arrow from 1 in row 2, col 2 is to the value 0 located in row 1, col 1. We stop, since we've encountered the value 0, so the optimal local alignment is the following:
```
2-GA-3
2-GA-3
```
meaning that the starting nucleotide in the first sequence is 2, then ending nucleotide in the first sequence is 3, and similarly for the second sequence. Recall that BLAST output gives you the starting/stopping position for both query and target sequences, and that BLAST is a LOCAL alignment alignment.
Sequence assembly using de Bruijn graphs
Previous homework problem: Consider the following set of eight 3-mers that have been obtained from a "genome" sequence: {AGT, AAA, ACT, AAC, CTT, GTA, TTT, TAA}.
1. Draw a de Bruijn graph for this same set of eight fragments. Recall that the nodes will be 2-mers, and an edge is drawn between pairs of nodes if an only if the 1-letter suffix of a 2-mer matches the one-letter prefix of the other 2-mer. Remember, a pair of nodes (and the edge between them) represents a 3-mer from the data. No 2-mer can be present more than once.
2. Describe a path through this graph that visits each edge exactly once. Here, a path "description" is of the form a->b->c->d, where a,b,c,d are nodes in the graph -- i.e. 2-mers.
3. Write the superstring corresponding to this path.
  Hint: You are allowed to draw an edge from one 2-mer back to itself, forming a loop that involves no other nodes. [you are also allowed to form more complicated loops, etc.]
Solution:
1. [Sequence assembly by search for a (directed) Eulerian path] Define the new graph G=(V,E), where the vertices consist of all 2-mers from the collection of (experimentally given) 3-mers. Here, we write the list of 2-mers, with their multiplicities (number of occurrences).
```
1: AA 	4
2: AC	2
3: AG	1
4: CT   2
5: GT   2
6: TA   2
7: TT   3
```
  The collection of directed, labeled edges is given by a list of ordered pairs, together with the 3-mer label that justifies the edge:
```
3,5	AGT
1,1	AAA
2,4	ACT
1,2	AAC
4,7	CTT
5,6	GTA
7,7	TTT
6,1	TAA
```
  Start by identifying the sink, which is vertex 7, since it is the only node which does not occur as the left member of an ordered pair where the right member is distinct from the left pair (i.e. 7 is the only value that does not occur as the head of an arrow leading to a distinct node). Now backtrack, starting with 7 -- in other words, follow the tail to the head of arrows: 3 -> 5 -> 6 -> 1 -> 2 -> 4 -> 7. Read from left to right the edge labels, each time appending the growing sequence by appending the newest symbol. This produces AGTAACTTT. Since we also have the ordered pair (3,3), arguably the sequence should be instead AAGTAACTTT.

BIOL4200 Review for Midterm: Introduction to Bioinformatics

Example questions

General

Probability, Combinatorics, Entropy

Linkage Disequilibrium

Sequence Alignment

Sequence assembly using de Bruijn graphs

BIOL4200
Review for Midterm: Introduction to Bioinformatics