BIOL4200, Solution to Homework 7

Written or Computer Assignment

  1. The CRISPR/Cas system allows bacteria to acquire and incorporate a small portion of viral DNA into the bacterial DNA, in a region of the bacterial genome known as the CRISPR array. Subsequently, transcribed bacterial CRISPR RNA can recognize viral DNA by hybridizing to the DNA, leading to the destruction of the invading phage.

    Streptococcus mutans is the causative agent for tooth decay and gingivitis in humans, and M102 is a bacteriophage which targets S. mutans. S. mutans uses CRISPR/Cas to "acquire immunity" from M102, which leads to a selective advantage for M102 mutants having mutations in the genomic region that S. mutans cleaves and integrates into its CRISPR array. All of this is happening right now in all of our mouths! See paper by Jan R. van der Ploeg of the University of Zurich. Also, see the papers of Emmanuelle Charpentier and Jennifer Doudna, recipients of the 2020 Nobel Prize in Chemistry for the application of CRISPR to gene editing:

    1. What words does the acronym CRISPR abbreviate?
    2. Why doesn't the bacterium recognize and destroy its own genome? Please be explicit -- you may need to search for information on the Internet, or look at an article on PubMed.
    3. Download the genome of Streptococcus mutans in GenBank format, gi|347750429|ref|NC_004350.2| Streptococcus mutans UA159 chromosome, complete genome. What is the size of the genome? Is the CRISPR array annotated in the GenBank file?
    4. The Java program CRT (CRISPR recognition tool) is described in the article here, which you can optionally read if you are interested. Please download CRT by going to the following site, http://www.room220.com/crt/, and clicking on the link called CRT1.1_exe.zip. Create a new folder (or directory), place the CRT1.1_exe.zip into the directory and unzip it. You should obtain the file CRT1.1_exe.jar, which you can then run, provided that a Java runtime (included in the Java development kit, and analogous to the Python runtime system) is installed on your computer. Double-click this file, or type the following in a terminal (command-line window): java -jar CRT1.1_exe.jar. You should see an image such as the following:

      In the upper left corner, use the browse tool to input the FASTA format file containing the S. mutans genome. Now run CRT and print the output. Since the location of the repeats and the spacers are indicated (spacers are integrated viral genomic portions), you can then look at the GenBank format file of the S. mutans genome, to see if there is any appropriate annotation.

      If Java is not installed on your computer, you can find an appropriate version to download by typing "download java development kit" or "download jdk" into the Google search engine.

      Run CRT on the FASTA sequence of S. mutans to determine (likely) CRISPR repeats and spacers. Print out the results. Consult the GenBank file of S. mutans, and give the annotation for the region overlapping the predicted CRISPR region.

    5. Now download the FASTA file for the bacteriophage M102 NC_012884.1. What is the size of M102? Run blast2seq with M102 genome as query sequence and S. mutans genome as second sequence. What is the size of the (local) alignment? What is its location in the S. mutans genome?

    Solution:

    1. CRISPR = clustered regularly interspaced short palindromic repeats
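    2. Experimental proof for the mechanism of discrimination between self and other (i.e. between bacterial spacers and viral proto-spacers) is given in CRISPR interference: RNA-directed adaptive immunity in bacteria and archaea, by LA Marraffini and EJ Sontheimer. In brief, a spacer match alone is not sufficient for attack. In many CRISPR/Cas systems the target (proto-spacer) must be flanked by a short proto-spacer adjacent motif (PAM), which is present in the invading DNA but absent from the repeats that flank each spacer in the CRISPR array. In other systems, base pairing between the CRISPR RNA and the repeat sequence next to the spacer marks the DNA as self and blocks interference.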
    3. The GenBank file is here. The genome consists of a single circular chromosome of length around 2 megabases (2,032,925 bp exactly). The only occurrence of the word "CRISPR" is in the following annotation:
      complement(1663207..1663759)
                           /note="potential frameshift: common BLAST hit:
                           gi|337281888|ref|YP_004621359.1| CRISPR-associated protein
                           cas1"
      
      which indicates the location of the gene for the cas1 protein; however, no CRISPR array location is given for incorporation of viral-derived spacers separated by repeats.
    4. The output is as follows:

      Since the annotation of the same region is as follows,

       gene            132644..134389
                           /gene="adhD"
                           /locus_tag="SMU_130"
                           /old_locus_tag="SMU.130"
                           /db_xref="GeneID:1029705"
      
      there is a disagreement between the GenBank annotation and the output of CRT. This indicates that (1) databases sometimes contain incorrect or incomplete annotations, (2) the output of prediction software can be incorrect, or (3) both. Determining which is most likely correct requires an investment of much more time, knowledge of the adhD gene annotation (i.e. how the annotation was made, via BLAST or other software, the E-value measuring statistical significance if BLAST was used, whether related bacteria have similar annotations, the function of the adhD gene, etc.), as well as knowledge of how the CRISPR prediction was made and its statistical significance. A final possibility is that both annotation and prediction are correct, and that the CRISPR array is found in the 5' untranslated region (5'-UTR) of the adhD gene.
    5. Streptococcus phage M102 is a double-stranded DNA virus with 31,147 bp. The alignment size is 28729-28494+1 = 236 nt, and it is given below:

      Note that the GenBank annotation is as follows:

        CDS             176037..178448
                           /locus_tag="SMU_180"
                           /old_locus_tag="SMU.180"
                           /note="Best Blastp Hit: pdb|1D4C|A Chain A, Crystal
                           Structure Of The Uncomplexed Form Of The Flavocytochrome C
                           Fumarate Reductase Of Shewanella Putrefaciens Strain Mr-1
                           >gi|6573310|pdb|1D4C|D Chain D, Crystal Structure Of The
                           Uncomplexed Form Of The Flavocytochrome C Fumarate
                           Reductase Of Shewanella Putrefaciens Strain Mr-1
                           >gi|6573308|pdb|1D4C|B Chain B, Crystal Structure Of The
                           Uncomplexed Form Of The Flavocytochrome C Fumarate
                           Reductase Of Shewanella Putrefaciens Strain Mr-1
                           >gi|6573309|pdb|1D4C|C Chain C, Crystal Structure Of The
                           Uncomplexed Form Of The Flavocytochrome C Fumarate
                           Reductase Of Shewanella Putrefaciens Strain Mr-1"
                           /codon_start=1
                           /transl_table=11
                           /product="oxidoreductase"
                           /protein_id="NP_720649.1"
                           /db_xref="GI:24378694"
                           /db_xref="GeneID:1029753"
      
      Does this mean that M102 also has the same gene for oxidoreductase? No, probably not, since the aligned region is too small. Again an investment of time is necessary to gain knowledge to better understand the biological reasons behind this almost identical segment found in both the virus and bacterium.
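
    Two of the checks in the solution above can be scripted. The following is a minimal sketch in Python, not the procedure used to produce the answers: find_keyword_features scans the feature table of a GenBank flat file for a keyword, and span_length computes the inclusive length of a coordinate range. The function names and the SAMPLE_GENBANK excerpt are illustrative assumptions; in particular, the feature key "misc_feature" in the excerpt is a placeholder, since the excerpt quoted in part 3 does not show the key.

```python
def find_keyword_features(genbank_text, keyword="CRISPR"):
    """Return (feature line, matching line) pairs from the FEATURES table."""
    hits = []
    in_features = False
    current_feature = None
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if line.startswith("ORIGIN"):
            break  # sequence data follows; no annotations beyond this point
        if not in_features:
            continue
        # A feature key starts in column 6; qualifier lines are indented further.
        if len(line) > 5 and line[:5] == "     " and not line[5].isspace():
            current_feature = " ".join(line.split())
        if keyword.lower() in line.lower():
            hits.append((current_feature, line.strip()))
    return hits


def span_length(start, end):
    """Inclusive length of a coordinate range, in either orientation."""
    return abs(end - start) + 1


# Illustrative excerpt only; "misc_feature" is a placeholder feature key.
SAMPLE_GENBANK = """\
FEATURES             Location/Qualifiers
     gene            132644..134389
                     /gene="adhD"
                     /locus_tag="SMU_130"
     misc_feature    complement(1663207..1663759)
                     /note="potential frameshift: common BLAST hit:
                     gi|337281888|ref|YP_004621359.1| CRISPR-associated protein
                     cas1"
ORIGIN
        1 atgc
//
"""

for feature, text in find_keyword_features(SAMPLE_GENBANK):
    print(feature, "|", text)

alignment_len = span_length(28494, 28729)   # coordinates quoted in part 5
cds_len = span_length(176037, 178448)       # the annotated oxidoreductase CDS
print(alignment_len, cds_len, round(alignment_len / cds_len, 3))
```

    Assuming the whole hit lies inside the annotated CDS, it covers less than a tenth of that gene, which is consistent with the conclusion that M102 probably does not carry the complete oxidoreductase gene. Note that the length of a gapped alignment can differ slightly from the coordinate span on either sequence.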

  2. The expected distribution of scores from the BLAST algorithm is an Extreme Value distribution because of the following:
    1. It is not possible to define a dynamic programming algorithm to align more than two sequences.
    2. It is derived as an approximation to a dynamic programming algorithm.
    3. It returns the highest scoring match from a database.
    4. It uses a probabilistic alignment model.
    Which of the answers are true? Note that none may be true, or more than one answer may be true.

    Solution:

    1. Answer (a) is false, since we've previously seen that the DP algorithm can be generalized.
    2. Answer (b) is false, since the extreme value distribution has nothing to do with the DP algorithm, but rather with the fact that the EXTREME VALUE distribution is the limiting distribution when taking the MAX of n independent random variables, or "experiments", whereas the NORMAL distribution is the limiting distribution when taking the SUM of n independent random variables, or "experiments".
    3. Answer (c) is true, since we are taking the MAX of the hits returned. One could also argue that (c) is false, since the justification for the extreme value distribution is that the database contains "random" proteins or nucleotide sequences, rather than biologically relevant proteins or nucleotide sequences -- but in practice, this doesn't really matter, and answering that (c) is true is closer to the sense of the question.
    4. Answer (d) is true, since we assume that the database contains "random" sequences, as just explained.
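
    The MAX versus SUM contrast in answer (b) can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: the "scores" are independent exponential random variables with rate 1 (not real BLAST scores), and the function names, sample sizes and seed are arbitrary choices. For such variables the maximum of n draws is approximately an extreme value (Gumbel) variable with mean near ln(n) + 0.5772 and a pronounced right skew, while the sum is approximately normal and nearly symmetric.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def skewness(values):
    """Sample skewness: about 0 for a normal, clearly positive for a Gumbel."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate(n_scores=100, n_trials=2000, seed=0):
    """Return the maxima and the sums of n_scores exponential draws per trial."""
    rng = random.Random(seed)
    maxima, sums = [], []
    for _ in range(n_trials):
        scores = [rng.expovariate(1.0) for _ in range(n_scores)]
        maxima.append(max(scores))
        sums.append(sum(scores))
    return maxima, sums


def expected_gumbel_mean(n_scores):
    """Approximate mean of the maximum of n_scores rate-1 exponentials."""
    return math.log(n_scores) + EULER_GAMMA


if __name__ == "__main__":
    maxima, sums = simulate()
    print("mean of maxima:", statistics.fmean(maxima))
    print("predicted     :", expected_gumbel_mean(100))
    print("skewness of maxima:", skewness(maxima))
    print("skewness of sums  :", skewness(sums))
```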

  3. Please answer the following questions by using the Mathematica demo:

    1. Determine the right-tailed p-value that corresponds to a Z-score of 2.
    2. Determine the right-tailed p-value for a score of 6 with respect to the standard normal distribution (standard normal means mean is 0 and standard deviation is 1).
    3. Unlike the normal distribution, which is completely determined by its mean and standard deviation, the extreme value distribution is completely determined by a location parameter (usually called α) and a scale parameter (usually called β). Determine the right-tailed p-value for a score of 6 with respect to the extreme value distribution with locationParameterAlpha = 2.5 and scaleParameterBeta = 1.05.
    4. If you assume that BLAST scores in a database search are normally distributed, then is the E-value larger or smaller than the real E-value?

    Solution:

    1. If the Z-score is +2, then the right-tailed p-value is the integral of the standard normal density function from 2 to Infinity. In Mathematica, type
      pValue = N[1/Sqrt[2 Pi] Integrate[ Exp[-x^2/2], {x, 2, Infinity}]]
      
      to get the answer 0.0227501.
    2. The right-tailed p-value for a Z-score of 6 is
      pValue = N[1/Sqrt[2 Pi] Integrate[ Exp[-x^2/2], {x, 6, Infinity}]]
      
      which equals 9.86588e-10.
    3. The right-tailed p-value for a value of 6 with respect to the extreme value distribution with locationParameterAlpha = 2.5 and scaleParameterBeta = 1.05 is
      pValue = N[Integrate[PDF[ExtremeValueDistribution[2.5, 1.05], x], {x, 6, Infinity}]]
      
      which equals 0.0350452.
    4. If you assume that BLAST scores in a database search are normally distributed, then the E-value is SMALLER than the real E-value, because the real E-value is (approximately) equal to the p-value times the database size, and the right-tailed p-value for the extreme value distribution having the same mean and standard deviation as a normal distribution will be LARGER than that for the normal distribution. (The E-value is not literally the p-value times the database size, but you can think of it intuitively as such.)
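
    The same p-values can be cross-checked without Mathematica. The sketch below uses closed forms from the Python standard library: the normal right tail is erfc(z/sqrt(2))/2, and the extreme value right tail is 1 - exp(-exp(-(x - α)/β)), which is the distribution function Mathematica uses for ExtremeValueDistribution[α, β]. The function names are illustrative. The helper gumbel_matching_normal relies on the standard facts that this distribution has mean α + γβ (γ is Euler's constant) and standard deviation βπ/sqrt(6).

```python
import math

EULER_GAMMA = 0.5772156649015329


def normal_right_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def gumbel_right_tail(x, alpha, beta):
    """P(X >= x) for the extreme value distribution with CDF exp(-exp(-(x-alpha)/beta))."""
    # expm1 avoids loss of precision when the tail probability is tiny.
    return -math.expm1(-math.exp(-(x - alpha) / beta))


def gumbel_matching_normal(mean=0.0, sd=1.0):
    """Location and scale of the extreme value distribution with the given mean and sd."""
    beta = sd * math.sqrt(6.0) / math.pi
    alpha = mean - EULER_GAMMA * beta
    return alpha, beta


print(normal_right_tail(2))               # about 0.0227501
print(normal_right_tail(6))               # about 9.86588e-10
print(gumbel_right_tail(6, 2.5, 1.05))    # about 0.0350452

# Part 4: same mean (0) and standard deviation (1), very different right tails.
alpha, beta = gumbel_matching_normal(0.0, 1.0)
print(gumbel_right_tail(6, alpha, beta) / normal_right_tail(6))
```

    The final ratio is far greater than 1, so for a score of 6 the normal assumption would understate the p-value, and hence the E-value, by several orders of magnitude.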


  4. By hand, please (1) compute the PATH MATRIX, (2) determine the optimal local alignment score, and (3) determine ALL optimal local sequence alignments of the sequence CGAA with GGACC. As in class on Thursday, please use the following parameters: gap (-1), mismatch (-2), match (+1).

    Solution:

    1. The path matrix is given by
      s is :CGAA
      t is :GGACC
      		G	G	A	C	C	
      	0	0	0	0	0	0	
      C	0	0	0	0	1	1	
      G	0	1	1	0	0	0	
      A	0	0	0	2	1	0	
      A	0	0	0	1	0	0
      
      where again, the arrows aren't listed since it's difficult to indicate by typing; however, when we go over this in class, I will indicate the arrows, and in any case your solution SHOULD have the arrows.
    2. Optimal local alignment score is +2
    3. There is only one optimal alignment, given by
      - G A -
      - G A - -
      
      where the dashes are NOT part of the local alignment, but only indicated to allow you to see that the first sequence was CGAA of length 4 and the second sequence was GGACC of length 5.
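
    The hand computation can be checked with a short program. The following is a minimal sketch of local alignment by dynamic programming (Smith-Waterman with a linear gap penalty), using the parameters of this problem as defaults; the function names are illustrative. Rows correspond to CGAA and columns to GGACC, as in the path matrix above. Instead of storing arrows, the traceback re-derives them by checking which neighbouring cells could have produced each score, which is how ties (and hence ALL optimal alignments) are enumerated.

```python
def score_matrix(s, t, match=1, mismatch=-2, gap=-1):
    """Fill the local alignment matrix; return it with the best score."""
    H = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment
                          H[i - 1][j - 1] + sub,  # diagonal: align s[i-1] with t[j-1]
                          H[i - 1][j] + gap,      # up: gap in t
                          H[i][j - 1] + gap)      # left: gap in s
    return H, max(max(row) for row in H)


def tracebacks(s, t, H, i, j, match, mismatch, gap):
    """Yield every aligned pair of strings ending at cell (i, j)."""
    if H[i][j] == 0:
        yield "", ""
        return
    sub = match if s[i - 1] == t[j - 1] else mismatch
    if H[i][j] == H[i - 1][j - 1] + sub:
        for a, b in tracebacks(s, t, H, i - 1, j - 1, match, mismatch, gap):
            yield a + s[i - 1], b + t[j - 1]
    if H[i][j] == H[i - 1][j] + gap:
        for a, b in tracebacks(s, t, H, i - 1, j, match, mismatch, gap):
            yield a + s[i - 1], b + "-"
    if H[i][j] == H[i][j - 1] + gap:
        for a, b in tracebacks(s, t, H, i, j - 1, match, mismatch, gap):
            yield a + "-", b + t[j - 1]


def optimal_local_alignments(s, t, match=1, mismatch=-2, gap=-1):
    """Return all optimal local alignments as (aligned s, aligned t) pairs."""
    H, best = score_matrix(s, t, match, mismatch, gap)
    if best == 0:
        return []
    results = []
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if H[i][j] == best:
                results.extend(tracebacks(s, t, H, i, j, match, mismatch, gap))
    return results


H, best = score_matrix("CGAA", "GGACC")
for row in H:
    print(row)
print(best)                                       # 2
print(optimal_local_alignments("CGAA", "GGACC"))  # [('GA', 'GA')]
```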


    Please staple all homework sheets together, and write your name, homework assignment, email address, and date.