BIOL4200
Review for Final Exam: Introduction to Bioinformatics

The exam is CLOSED book and IN-CLASS, although you may bring to class up to 4-pages of handwritten or typed notes (i.e. a crib sheet of 4 pages, written on both sides), along with a CALCULATOR.

Although the exam is cumulative, it will focus mostly on the material covered since the midterm -- i.e. simple applications of probability theory and probability distributions (binomial, geometric, hypergeometric, normal) and statistical tests (T-test, χ^{2-test),
linkage disequilibrium,bacterial origin of replication (ORI) -- GC-skew,
biological reasons for skew, glycan arrays, Fisher exact test and its use
for the infectivity analysis of the influenze virus, phylogenetic tree
construction.
The exam will reflect coverage of material
in the course, so please review the course slides and homeworks.}

Concepts to review include the following.

General notions: Concepts from molecular biology and biochemistry covered in the notes: central dogma and its violations, CRISPR-Cas system, year of publication of Watson and Crick's paper on DNA structure (1953), year of completion of sequencing of human genome (2003), name of first bacterium sequenced, name of first eukaryote sequenced, genome size of HIV-1, genome size of Haemophilus influenzae, genome size of Saccharomyces cerevisiae, genome size of Homo sapiens, description of how Sanger sequencing works, what dideoxynucleotides are and how they are used in Sanger sequencing, what shotgun sequencing is, what is the start codon or codon of methionine (AUG), be able to name at least one stop codon (UAG,UAA, UGA), be able to recognize and name the most important hydrophobic, hydrophilic (polar), negatively charged, and positively charged amino acids, know all IUPAC nucleotide codes and important amino acid codes, know difference an an example of orthologous and of paralogous genes, know that the GC-content is the compositional frequency of G plus that of C, know what lower and higher numbers mean for both PAM and BLOSUM family matrices, be able to construct a "toy" BLOSUM matrix as done in the notes by computing the log odds score log( p(xy)/q(x)q(y) ), what a BLOSUM log odds means -- e.g. a log odds score of zero means that x and y are found aligned together just as often as expected by chance (no more, no less)..
Linkage disequilibrium": computation of D in haploid and diploid species, normalized value D', test stastic n ρ², determination of p-value, interpretation whether genes are associated
Elementary Combinatorics and Probability: Combinations and permutations ("n choose k"), binomial distribution, hypergeometric distribution, Poisson distribution, the expected number of times that you need to flip a coin having heads probability 1/8 in order to get the first heads, expected number of occurrences of a motif (such as TATAAA) in a genome of size N having compositional frequency p(A), p(C), p(G), and p(T), Z-scores, p-values.
I would encourage you to focus on the following notions:
- General notions: PAM, BLOSUM, IUPAC codes, etc.
- Construction of BLOSUM matrices
- Probability theory, binomial, hypergeometric, Poisson, T-test, χ² test
- Sequence alignment
- Origin of replication
- Weight matrices, PSSM, pseudocounts, entropy, information, LOGO plots
- Fisher exact test, accuracy measures (sensitivity, specificity, PPV, etc.)
- Signed and unsigned inversion distance, greedy algorithm for unsigned inversion distance, DCJ algorithm for signed inversion distance
- UPGMA and WPGMA
Sequence Alignment: Global pairwise alignment (Needleman-Wunsch). Local pairwise alignment (Smith-Waterman). How to construct a path matrix for local and global alignment with linear gap penalty. Know what the affine gap penalty is and how to score an alignment with the affine gap penalty. Know how to score a semi-global alignment -- i.e. where end gaps are not counted but interior gaps are counted. Multiple alignment, Clustal, guide tree -- the guide tree is produced by UPGMA after computing the distance all each pair of two sequences by using a (distance-version) of sequence alignment algorithm (the guide tree is used then to progressively combine pairs of sequences or pseudo-sequences together).
Phylogenetic trees: WPGMA, UPGMA, be able to create a phylogenetic tree that UPGMA or WPGMA would produce for a given distance matrix, list of quartets in a tree, horizontal transfer. You should be able to follow the steps of WPGMA and construct a tree using that algorithm.
Graph Theory: Definitions of directed and undirected graphs, of simple graphs, of connected graphs, of Hamiltonian cycles and paths, of Eulerian cycles and paths. Euler's theorem on the existence of Eulerian paths and cycles. How to construct an Eulerian cycle or path. Sequence assembly using DeBruijn graphs with application of Euler's construction. Sequence assembly by constructing a Hamiltonian path or cycle from k-mer data.
Origin of replication: raw GC-skew, cumulative GC-skew, how to determine the origin of replication from skew diagrams, how to draw the likely skew diagram if you know the origin(s) of replication and whether genes are predominantly transcribed from one strand or not.
Weight matrices: how to construct a PSSM (position specific scoring matrix) or profile from an alignment, pseudocounts, computation of entropy and sequence logo from a multiple alignment, machine learning, training set, true positive, true negative, accuracy, sensitivity, specificity, positive predictive value.
Influenza: binomial and multinomial coefficients, hypergeometric distribution, Fisher's exact test, determination of over-represented treelets in glycan that bind hemagglutinin.
Chromosomal rearrangements: inversion, translocation, greedy algorithm for inversion distance. The test will NOT cover the DCJ (double-cut and join) algorithm.

Some example questions

General

Using the BLOSUM62 (log odds) matrix, answer the following questions.
1. Does isoleucine substitute for leucine more often than by chance?
2. Does glutamine substitute for glutamic acid more often than by chance?
3. Which amino acid is found to substitute for tryptophan most often?
Solution:
BLOSUM2 values, as explained in class (and as you computed in the midterm exam), are given by α · log( p(x,y)/(q(x)q(y)) ) rounded to the nearest integer, where α is a scaling factor, p(x,y) is the joint probabity (relative frequency that x and y both appear in the same column of the multiple alignment), and q(x) [resp. q(y)] is the relative frequency of occurrence of residues x [resp. y] in the multiple alignment. If the BLOSUM value is positive, then substitution happens more often by chance, while if the BLOSUM value is negative , then substitution happens less often than by chance. From this, the answers to a,b,c are clear when you check values in the BLOSUM62 table.
What do the databases GenBank, PDB, Rfam, Pfam, UniProt, PubMed, OMIM contain? Please be precise. If the database contains 3D structures determined by X-ray crystallography or NMR, then please state this; if the database contains RNA sequences and secondary structures, the please state this, etc.
Solution:
Check on-line. These were explicitly discussed in class, though I didn't write all the details in the class slides.
Describe one difference between the format/contents of a GenBank genomic nucleotide sequence file in comparison with an EMBL (EBI) genomic nucleotide sequence file.
Solution:
Same as above.
Describe how one uses PSI-BLAST to determine the PSSM of remote homologs of a user-specified protein.
Solution:
Same as above. What I said in class is that PSI-BLAST ("position specific iterated BLAST) involves a user-specified number of iterations (often 3 or 4). After a first application of BLAST, the user can either manually select, or what is more often the case, simply select ALL the hits returned by BLAST whose E-value is less than the default value on the web server. The hits are unified into a profile or PSSM, which is then BLASTED against the databases again, etc.
Please select one of the options in the following multiple choice selection. What does the database PDB contain:
1. Multiple alignments of DNA and proteins.
2. Metabolic pathways.
3. Explanations of side effects of pharmaceutical drugs.
4. All of the above.
5. None of the above
Solution:
None of the above
In E. coli, which has compositional frequency of 0.25 for each nucleotide, what is the expected length of an ORF computed from basic probability?
Solution:
Rephrasing the question, suppose that you have a coin with heads probability of p. How many times do you need to flip the coin (on average) to get the first heads? (We did this a number of times in class.) For this question, you need to determine the "heads probability" (probability of a STOP codon) for the E. coli genome. Once you determine this, you then determine the expected number of coin flips before getting a "heads" (i.e. a STOP codon).

Linkage disequilibrium
Consider the following statistics from a sample population of people genotyped for having green eyes (locus X) and having a superpower (locus Y); i.e. using a gross oversimplification, the "green eye gene" X has the allele 1 or 0, and the "superpower gene" (being able to fly through the air, walk on water, etc.) Y has the allele 1 or 0.

Compute the linkage disequilibrium value D(X,Y) and its normalized value D'.
Solution:
You should work through the formula to make sure you understand the answer.

In hypothesis testing, whether the genes are associated (linked), the null hypothesis H0 is "locus X and locus Y are not linked", while the alternative hypothesis H1 is "locus X and locus Y are linked". The χ² test statistic can be computed directly, but in our class notes, we saw that the test statistic can also be expressed in terms of the linkage disequilibrium D (not the normalized value D'):
χ² = n D²/(P(x=0)P(x=1)P(y=0)P(y=1))
One then determines the area of the tail region, to the right of the χ² test statistic in the χ² distribution with 1 degree of freedom. If the area, called the "p-value", is less than 0.05, then with 95% statistical certainty we can REJECT the null hypothesis, and conclude that the loci are linked.

Probability, Combinatorics, Entropy
Suppose that an urn contains 4 red balls and 8 black balls.
1. If you randomly select 4 balls, replacing the selected ball in the urn each time after you select it, then what is the probability of selecting exactly 2 red ball and 2 black balls? In other words, what is the probability of selecting 2 red balls and 2 black balls with replacement?
2. If you randomly select 4 balls, where the selected ball is NOT placed back in the urn each time after you select it, then what is the probability of selecting exactly 2 red ball and 2 black balls? In other words, what is the probability of selecting 2 red balls and 2 black balls without replacement?
Solution:
1. Use the binomial distribution:
```
"4 choose 2" · (1/3)² · (1-1/3)² 
= 0.296296
```
2. Use the hypergeometric distribution:
```
("4 choose 2" · "8 choose 2" )/"12 choose 4" =
6 · 28/ ( (8 · 7 · 6 · 5)/24 ) = 2.4
```
Suppose that on average, there are 200 occurrences of the motif TATAAA in a genome of size one megabase. In a random genome of the same size, what is the probability that there are no occurrences of the motif TATAAA? What is the probability that there are exactly two occurrences of the motif TATAAA? Hint: can you model this by the Poisson distribution?
Solution:
If the process of acquiring a motif is uniform (think of randomly choosing the exact location within (say) one megabase for a motif to occur), then the NUMBER of occurrences of the motif within a window of (say) one megabase should be approximately given by the Poisson probability:
```
Prob[X = k] =  e^-μ · (μ)^k/k! 
```
where Prob[X=k] is the probability of exactly k motif occurrences. If the number of motif occurrences is really representative, then we can assume that the EXPECTED NUMBER of motif occurrences is 200, where the words expectation, expected value, mean, and average all mean the same thing -- it follows that μ=200. The probability of NO occurrence is
```
Prob[X = 0] =  e^-200 · (μ)⁰/0!  = e^-200 = 1.3838965267367376e-87
```
```
Prob[X = 2] =  e^-200 · 200²/2!  = 2.767793053473475e-83
```
You are told that the average distance between successive TATAAA motifs in a genome of ten megabases is 10,000 bases. What is the probability of occurrence of the motif TATAAA? Is the genome pyrimidine rich or purine rich? Careful: this question is quite different from the previous question.
Solution:
This is a tricky problem, although we have mentioned in class all the material needed to answer the question.
The probability density function for the exponential distribution is given by
```
p(x) = λ · e^{-λ x}
```
In class, we said that the exponential distribution is a good mathematical model for the genomic DISTANCE between two successive motifs. In particular, the probability that the distance between two SUCCESSIVE motifs is AT LEAST A and AT MOST B is is the integral from A to B of the probability density function p(x). The mean (average value) for the exponential distribution is μ=1/λ . The cumulative density function, or integral from 0 to x of the density function p(x) is given by
```
CDF(x) = 1 - e^{-λ · x}
```
From the problem assumptions, we have μ=10⁴, so λ=10^-4.
Moreover, in class, we also said that the exponential distribution is the CONTINUOUS analogue of the GEOMETRIC distribution, which latter is described as follows:
```
probability of (k-1) tails followed by a heads = P(k) = (1-p)^k-1 · p
```
where p is the probability of obtaining a heads when flipping a biased coin. The mean of the geometric distribution is 1/p, just as the mean of the exponential distribution with parameter λ is 1/λ. Thus if we set p=λ, then a good estimate for the probability of occurrence of TATAAA at any given occurrence is p=0.0001. Now, if we assume that the genomic compositional frequency of A,C,G,T is 1/4, as in the case of E. coli K-12, then we have heads probability q=(0.25)⁶ = 0.000244140625, which is about twice the value of p. Let's assume that the genomic frequency of A equals that of T, and that the genomic frequency of C equals that of G -- the so-called "second" Chargaff law (which is not a "law" at all and can easily be seen to be violated given certain genomes). The probability of TATAAA is then r⁶, where r is the (unknown) probability of T, or equivalently of A. If r⁶ = 0.0001, then by taking the sixth-root of both sides, we have that r=0.2154. Thus we have (by our assumption of Chargaff's second "law") that P(A) = 0.2154 = P(T), P(C) = 0.3923 = P(G). Now A and G are purines, while C and T are pyrimidines. By our assumption of Chargaff's second "law", there are an equal number of purines and pyrimidines.
A final remark: it is usual to model the "waiting time" or equivalently the "genomic distance" between two motif occurrences by the exponential distribution, provided the number of motif occurrences is modeled by the Poisson distribution. However, the exponential distribution is the continuous analogue of the geometric distribution, and requires a bit more math (not much, just a small integral), so we'll settle for the above approximation using the geometric distribution).
If we don't assume Chargaff's second "law", then we have the equation r² · s⁴ = 0.0001, where r [resp. s] denotes the probability of T [resp. A]. To solve 2 unknowns, we need 2 equations; without an additional equation, we can trace a curve of s as a function of r (similar to a curve of y as a function of x), which allows you to determine the probability of T if you know the probability of A.
Suppose that a glycan array contains 100 glycans, 40 of which contain a particular treelet T, and 60 of which do not contain the treelet T. What is the joint probability that hemagglutin binds 39 of the 40 glycans that contain T and that hemagglutin binds 30 of the 60 glycans that do not contain T?
Solution:
First, write out the contingency table: a = number of glycans that contain T and bind H, b = number of glycans that do NOT contain T yet do bind H, c = number of glycans that contain T and do not bind H, d = number of glycans that do NOT contain T and do NOT bind H. The numerator is the multinomial factor "n choose a,b,c,d", where n=a+b+c+d. The denominator is the product of the binomial factor "n choose (a+b)" where a+b is the sum of first row, times "n choose (a+c)" where a+c is the sum of first column. Then crunch numbers with your calculator.

Suppose that you estimate that the probability of an occurrence of an AUG codon occurring in any given genomic position is 1/64. What is the Poisson probability that there are exactly 5 occurrences of AUG in a genomic region of size 100 codons (i.e. 300 nt)?

Solution:

The expected number, or average, number of occurrences is (approximately)
        μ = (1/64)*100 =  1.5625
By definition, P[X=k] = exp(-μ)· μ^k/k!.
Now 
   (1) exp(-μ) = exp(-1.5625) = 0.2096113871510978, 
   (2) μ⁵ = 9.313225746154785,
   (3) μ⁵/(5*4*3*2*1) = 0.07761021455128987, 
so 
   P[X=5] = exp(-μ)*  μ⁵/(5*4*3*2*1) = 0.016267984729190187

Compute the base 2 entropy of the first position of the following multiple alignment, AND determine the sum of the heights of the stacked letters in the Web Logo output.
```
ATTA
CGG-
ATCA
GTTT
TTGT
AAAT
```
Solution:
In the first position, there are 3 A's, 1 C, 1 G, and 1 T and a total of 6 nucleotides, so by rounding to 3 decimal places we have: p(A)=0.5, p(C)=0.167, p(G)=0.167, p(T)=0.167.
```
H = - [ 0.5 * log(0.5)/log(2) + 3*( 0.167 * log(0.167)) = 
1.3966704947492563
```
which is 1.397 to 3 decimal places. Sequence logo height is maximum entropy minus the entropy, so in this case maximum entropy is log₂(4), since there are 4 nucleotides, and the sum of the heights of the stacked letters is 2 - 1.397 = 0.603.
Compute the PSSM for the following multiple alignment, both without and with a pseudocount of 1.
```
AT
CG
AT
GT
TT
AA
```
Solution:
1. [Without pseudocounts] In the first position, there are 3 A's, 1 C, 1 G, and 1 T and a total of 6 nucleotides, so by rounding to 3 decimal places we have: p(A)=0.5, p(C)=0.167, p(G)=0.167, p(T)=0.167, for probabilities in the first column. Similarly, the probability distribution for column 2 is: p(A)=1/6 p(C)=0 p(G)=1/6 p(T)=2/3. You should display this as follows:
```
	1	2	
   ----------------------
A|	1/2	1/6
C|	1/6	0
G|	1/6	1/6
T|	1/6	2/3
```
2. [With a pseudocount of 1] In the first position, with 1 pseudocount for each of A,C,G,T, there are 4 A's, 2 C's, 2 G's, and 2 T's and a total of 6+4=10 nucleotides (due to a pseudocount of 1 for each nucleotide), so we have: p(A)=0.4, p(C)=0.2, p(G)=0.2, p(T)=0.2. I'll leave the remaining compution for position 2 to you. You should produce a display as follows:
```
	1	2	
   ----------------------
A|	4/10	2/10
C|	2/10	1/10
G|	2/10	2/10
T|	2/10	5/10
```
Sequence Alignment
Consider the following alignment:
```
ACGG-TT-GACCTTACGACT-
A-GTACGA--TCACTCTA-TT
```
1. Calculate the score of alignment shown above, using the following scoring rules: +1 for a match, -1 for a mismatch, -1 for a gap (linear gap penalty).
2. Calculate the score of alignment using the following scoring rules: +1 for a match, -1 for a mismatch, -2 for gap initiation and -1 for gap extension (affine gap penalty).
Calculate the semiglobal alignment score, where a match counts +1, a mismatch -1, and gap initiation counts -2, and gap extension -1. In a semiglobal alignment, gaps at the left and right extremity are not counted; however internal gaps do indeed have a gap penalty.
```
--ACGG-TT-GACCTTACGACT-
TTA-GTACGA--TCACTCTA-TT
```
Give the path matrix for the global alignment (Needleman-Wunsch) of GGACC with CGAA, where a match counts +1, a mismatch -2 and a gap -1 (i.e. linear gap penalty). After completing the path matrix, use a traceback to give the optimal alignment.
Solution:
```
s is :TGAGC 
t is :AGCGT

Path matrix
0	-1	-2	-3	-4	-5	
-1	-2	-3	-4	-5	-3	
-2	-3	-1	-2	-3	-4	
-3	-1	-2	-3	-4	-5	
-4	-2	0	-1	-2	-3	
-5	-3	-1	1	0	-1	

Path arrow matrix
```
The optimal alignment is
```
TGAGC--
--AGCGT
```
Sequence assembly using de Bruijn graphs
Previous homework problem: Consider the following set of eight 3-mers that have been obtained from a "genome" sequence: {AGT, AAA, ACT, AAC, CTT, GTA, TTT, TAA}.
1. Draw a de Bruijn graph for this same set of eight fragments. Recall that the nodes will be 2-mers, and an edge is drawn between pairs of nodes if an only if the 1-letter suffix of a 2-mer matches the one-letter prefix of the other 2-mer. Remember, a pair of nodes (and the edge between them) represents a 3-mer from the data. No 2-mer can be present more than once.
2. Describe a path through this graph that visits each edge exactly once. Here, a path "description" is of the form a->b->c->d, where a,b,c,d are nodes in the graph -- i.e. 2-mers.
3. Write the superstring corresponding to this path.
  Hint: You are allowed to draw an edge from one 2-mer back to itself, forming a loop that involves no other nodes. [you are also allowed to form more complicated loops, etc.]
Solution:
1. [Sequence assembly by search for a (directed) Eulerian path] Define the new graph G=(V,E), where the vertices consist of all 2-mers from the collection of (experimentally given) 3-mers. Here, we write the list of 2-mers, with their multiplicities (number of occurrences). If there is only one occurrence, then we do not write the multiplicity:
```
1: AA 	3
2: AC
3: AG
4: CT   2
5: GT   2
6: TA   2
7: TT   3
```
  The collection of directed, labeled edges is given by a list of ordered pairs, together with the 3-mer label that justifies the edge:
```
3,5	AGT
3,3	AAA
2,4	ACT
1,2	AAC
4,7	CTT
5,6	GTA
7,7	TTT
6,1	TAA
```
  Start by identifying the sink, which is vertex 7, since it is the only node which does not occur as the left member of an ordered pair where the right member is distinct from the left pair (i.e. 7 is the only value that does not occur as the head of an arrow leading to a distinct node). Now backtrack, starting with 7 -- in other words, follow the tail to the head of arrows: 3 -> 5 -> 6 -> 1 -> 2 -> 4 -> 7. Read from left to right the edge labels, each time appending the growing sequence by appending the newest symbol. This produces AGTAACTTT. Since we also have the ordered pair (3,3), arguably the sequence should be instead AAGTAACTTT.
Origin of replication determination
Suppose that the genome of a bacterium consists of one circular chromosome, and it has a unique origin of replication (oriC). Suppose that the genome is available as a FASTA file in NCBI GenBank database, and that the oriC is indicated at the right end of the plus-strand. Draw both the skew plot and the cumulative skew plot (hint - should be similar to B. subtilis).
Solution:
See the GC-skew plot of B. subtilis in class slides.
Now consider the combined effect of transcription and replication on the cumulative GC-skew.
1. Assume that a viral genome consists of one linear chromosome, having an origin of replication (oriC) at EACH END. Assuming that mRNA is transcribed from the plus and minus strands at approximately the same frequency, draw the cumulative skew plot for this genome.
2. Now assume that a viral genome consists of one circular chromosome, having a single origin of replication (oriC), but that ALL genes are physically transcribed from the minus strand so that the mRNA transcripts are identical with the genomic DNA of the plus strand, as indicated in GenBank files. Taking into account the effect of transcription (which causes the duplex DNA to be separated for the action of RNA polymerase), draw the cumulative skew plot in this case.
3. Make the same assumptions as in (b), except that ALL genes are physically transcribed from the plus strand, which is indicated in GenBank files by the word "complement" appearing after CDS, as in
```
  CDS             complement(3300..4037)
```
  Taking into account the effect of transcription, draw the cumulative skew plot in this case.
Solution:
1. This situation is exactly the case of human adenovirus for which the book give a parabolic cumulative skew plot shown in Figure 6.5.
2. When transcription occurs from the minus strand, the plus strand remains single-stranded more than the minus strand and so will have a larger GC-skew. Since replication of the minus strand will leave the plus strand single-stranded more than the minus strand, there will be an additional cause for the plus strand to have a larger GC-skew. This happens up to the terminus, located at the midpoint of the circular genome, after which the minus strand will have a larger GC-skew due to replication. Thus after the terminus, there will be a cancellation and GC-skew will remain constant. This is the case of human papilloma virus HPV shown in Figure 6.6 of the book.
3. When transcription occurs from the plus trand, the minus strand remains single-stranded more than the plus strand and so will have a larger GC-skew. Since replication of the minus strand will leave the plus strand single-stranded more than the minus strand, there will be an effective cancellation so the first half of the cumulative GC-skew diagram will be flat. So the curve will look like a flat line followed by a line with slope -1.
Dijkstra's single source shortest path algorithm

Please give the list of (weighted) edges representation of the directed graph below, and then step through Dijkstra's algorithm, when taking node 1 as source. Please output the optimal path from 1 to 6. Also, you should be able to apply Dijkstra's algorithm to determine the optimal gene, given various potential donor and acceptor sites and scores of likelihood that a certain region is an intron or an exon.

Solution: WARNING: In my solution below, for convenience, I assume the nodes a,b,c,d,e,f are the integers 1,2,3,4,5,6 in that order. Thus the first line (1 2 2) of the following list corresponds to the directed edge from a to b with weight 2.

List of weighted edges representation:
1 2 2
1 3 4
2 3 1
2 5 2
3 5 3
2 4 4
5 4 3
4 6 2
5 6 2
-----------------------------------
Initialization step (time 0):

vertex	D	Pred
1:      0	N/A
2:	inf	N/A
3:	inf	N/A
4:	inf	N/A
5:	inf	N/A
6:	inf	N/A

Q = [1, 2, 3, 4, 5, 6] 

-----------------------------------
Time step 1: select node from Q with smallest distance to source, where
the source is 1 (i.e. find index k such that D[x] is smallest value possible).

Set vertex x=1, since D[1]=0, remove 1 from Q so Q = [2, 3, 4, 5, 6] 
For all nodes y such that x->y is edge, 
if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
Update matrices D (distance to source) and Pred (predecessor) accordingly.

vertex	D	Pred
1:      0	N/A
2:	2	1
3:	4	1
4:	inf	N/A
5:	inf	N/A
6:	inf	N/A

Q = [2, 3, 4, 5, 6] 

-----------------------------------
Time step 2: select node x from Q with smallest distance to source.

Set vertex x=2, since D[2]=2, remove 2 from Q so Q = [3, 4, 5, 6] 
For all nodes y such that x->y is edge, 
if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
Update matrices D (distance to source) and Pred (predecessor) accordingly.

vertex	D	Pred
1:      0	N/A
2:	2	1
3:	3 	2	
4:	6	2
5:	4	2
6:	inf	N/A

Q = [3, 4, 5, 6] 

-----------------------------------
Time step 3: select node x from Q with smallest distance to source.

Set vertex x=3, since D[3]=3, remove 3 from Q so Q = [4, 5, 6] 
For all nodes y such that x->y is edge, 
if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
Update matrices D (distance to source) and Pred (predecessor) accordingly.

In this case, D[3]+weighte(3->5) is greater than D[3], so no updates are
performed to D and Pred.

vertex	D	Pred
1:      0	N/A
2:	2	1
3:	3 	2	
4:	6	2
5:	4	2
6:	inf	N/A

Q = [4, 5, 6] 
-----------------------------------
Time step 4: select node x from Q with smallest distance to source.

Set vertex x=5, since D[5]=4, remove 5 from Q so Q = [4, 6] 
For all nodes y such that x->y is edge, 
if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
Update matrices D (distance to source) and Pred (predecessor) accordingly.

vertex	D	Pred
1:      0	N/A
2:	2	1
3:	3 	2	
4:	6	2
5:	4	2
6:	6	5

Q = [4, 6] 
-----------------------------------
Time step 5: select node x from Q with smallest distance to source.

Set vertex x=4, since D[4]=6, and although D[6]=6, 4 comes before 6, so
we choose 4. Remove 4 from Q so Q = [6] 
For all nodes y such that x->y is edge, 
if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
Update matrices D (distance to source) and Pred (predecessor) accordingly.

In this case, D[4]+weighte(4->6) is greater than D[6], so no updates are
performed to D and Pred.


vertex	D	Pred
1:      0	N/A
2:	2	1
3:	3 	2	
4:	6	2
5:	4	2
6:	6	5

Q = [6] 
-----------------------------------
Time step 6: select node x from Q with smallest distance to source.

Set vertex x=6, since D[6]=6, and there are no other nodes in Q.
Remove 6 from Q so Q = [ ] i.e. empty

For all nodes y such that x->y is edge, 
if D[x] + weight(x->y) is less than D[y], then set D[y] = D[x] + weight(x->y)
Update matrices D (distance to source) and Pred (predecessor) accordingly.

vertex	D	Pred
1:      0	N/A
2:	2	1
3:	3 	2	
4:	6	2
5:	4	2
6:	6	5

Using tracebacks with Pred array, we determine optimal path from source
vertex 1 to target vertex 6: 1->2->5->6

Phylogenetic tree construction

Using the following distance matrix D, follow the steps of the WPGMA algorithm and construct the corresponding (ultrametric) tree. (Remember that ultrametric means that the distance from any node to all of its leaves is the same.)

0	5	4	7	6	8	
5	0	7	10	9	11	
4	7	0	7	6	8	
7	10	7	0	5	9	
6	9	6	5	0	8	
8	11	8	9	8	0

Solution:

Original distance matrix where columns and rows correspond respectively
to 0,1,2,3,4,5 (i.e. the leaves are labeled 0,1,2,3,4,5 instead of A,B,C,D,E,F).
	A	B	C	D	E	F
A	0	5	4	7	6	8	
B	5	0	7	10	9	11	
C	4	7	0	7	6	8	
D	7	10	7	0	5	9	
E	6	9	6	5	0	8	
F	8	11	8	9	8	0	

---------------------------------------------
Step 1:  Combine A,C to form {A,C}.
	A,C	B	D	E	F	
A,C	0	6	7	6	8	
B	6	0	10	9	11	
D	7	10	0	5	9	
E	6	9	5	0	8	
F	8	11	9	8	0	

---------------------------------------------
Step 2:  Combine D,E into {D,E}.
	A,C	B	D,E	F
A,C	0	6	6.5	8	
B	6	0	9.5	11	
D,E	6.5	9.5	0	8.5	
F	8	11	8.5	0	

---------------------------------------------
Step 3:  Combine {A,C} and B into {A,B,C}
	A,B,C	D,E	F	
A,B,C	0	8	9.5	
D,E	8	0	8.5	
F	9.5	8.5	0	

---------------------------------------------
Step 4:  Combine {A,B,C} and {D,E} into {A,B,C,D,E} which we abbreviate by A-E
	A-E	F
A-E	0	9	
F	9	0	

---------------------------------------------
Step 5:  Combine {A-F} and G into {A,B,C,D,E,F} to produce the following
tree:

	--------5
	|
--------4.5
	|
			--------4
			|
		--------2.5
			|
			--------3
		|
	--------4
		|
			--------1
			|
		--------3
			|
				--------2
				|
			--------2
				|
				--------0

Following the class notes, you should be able to give the edge weights for all edges of the tree (not shown here).

The following four trees are drawn with branch lengths to scale. Which statement is correct?
1. All four trees are non-equivalent.
2. All four trees are equivalent.
3. Only trees (i) and (iii) are equivalent.
4. Only trees (iii) and (iv) are equivalent.
Solution:
D is correct: trees (iii) and (iv) are equivalent because C branches first, D second and A and B last.

Genomic rearrangement events (reversals)

Consider the following unsigned sequence of genes:

5 1 2 4 3

What are the adjacencies and the break points? Please indicate by break points by placing a vertical bar between the positions where a break point occurs. How many break points are there?
Please give the pseudocode for the greedy (unsigned) reversal sort algorithm.
Using the following pseudocode for the greedy (unsigned) reversal algorithm, illustrate each step of the greedy algorithm on the previous data, starting from 5 1 2 4 3 and terminating with the sorted order 1 2 3 4 5.
Let F be the permutation which represents the gene order 5 1 2 4 3; in other words, F is a function, whose domain and range is the set {1,2,3,4,5}, and whose values are defined by F(1)=5, F(2)=1, F(3)=2, F(4)=4, F(5)=3.
```
def greedyInversionDistance(F):
  distance = 0
  for i = 1 to n
    j = F_inverse(i)
    //j = position of element i in permutation F 
    //i.e. F(j)=i or equivalently F_inverse(i)=j
    if i != j
      F = F r(i,j) //F composed with the reversal of segment between i and j
      print F
      distance = distance + 1
    if F = identity permutation:
      return distance
```

Solution

There are 2 break points in the original gene set; however if one adds the left telomer 0 and the right telomere 6, then there are 4 break points:
```
0 | 5 | 1 2 | 4 3 | 6
```

 
Simulation of greedy algorithn on initial data, where the data 
or permutation F is given by
    5 1 2 4 3
and is represented by the following table. Here we think of a
permutation F as a 1-to-1 function from domain {1,2,...,5} into
range {1,2,...,5}, so F(1)=5, F(2)=1, etc.

Step 0: Original data F
x|y
---
1 5
2 1
3 2
4 4
5 3


Step 1: Inversion r(1,2) applied to reverse segment (1 5) to (5 1)
x|y
---
1 1
2 5
3 2
4 4
5 3

Step 2: Inversion r(2,3) applied to reverse segment (5 2) to (2 5)
x|y
---
1 1
2 2
3 5
4 4
5 3


Step 3: Inversion r(3,5) applied to reverse segment (5 4 3) to (3 4 5)
x|y
---
1 1
2 2
3 3
4 4
5 5

Since we did the previous example in class, you should try to solve the same problem for the following gene order in a chromosome: Consider the following unsigned sequence of genes:
```
6 4 5 3 1 2
```

BIOL4200 Review for Final Exam: Introduction to Bioinformatics

Some example questions

General

Linkage disequilibrium

Probability, Combinatorics, Entropy

Sequence Alignment

Sequence assembly using de Bruijn graphs

Origin of replication determination

Dijkstra's single source shortest path algorithm

Phylogenetic tree construction

Genomic rearrangement events (reversals)

BIOL4200
Review for Final Exam: Introduction to Bioinformatics