BIOL4200, Solution to Homework 8

Reading Assignment

Chapter 6 by Grigoriev.
Chapter 7 by Hannenhalli.

Written or Computer Assignment

This problem concerns the origin of replication of bacteria.
1. Download the complete genome of Bacillus subtilis in FASTA form. Since there are many hits returned on NCBI when searching in the nucleotide database, you may have to think of obtaining more information about the organism, who first sequenced it, etc. in order to find the correct genome. What is the genome size?
2. Use my Python program to determine the RAW SKEW. You may have to experiment with window size. My program computes a histogram, which you'll have to plot with Excel or some graphics software of your choice. My program is here.
3. Use my Python program to determine the CUMULATIVE SKEW. Again, my program computes a histogram, which you'll have to plot with Excel or some graphics software of your choice. Be sure to change a constant in my code from False to True in order to compute the cumulative skew. My program is here.
4. At what genomic position (i.e. what is the starting position of the window) where the MAXIMUM cumulative skew occurs?
Solution:
1. The FASTA file with GenBank accession code AL009126.3 is here. To find this, I searched on Google for "Bacillis subtilis complete genome" and found the Nature article with first author Kunst, where the data and an initial analysis of the genome is described. Then I added "Kunst" in my search on NCBI to find the genome.
2. Raw skew is here:
3. Cumulative skew is here: here.
4. I took windows of size 1000, and the window starting at position 1930001 has the maximum cumulative skew of 16.047691.
This problem concerns the origin of replication of bacteria, archaea and viruses.
1. Download the FASTA file of the complete genome of Streptococcus mutans, the bacterial pathogen causing human tooth decay. Determine the origin of replication (oriC) using Ori-Finder. Please print out a screen shot of the skew plot (Figure 1) from the web server output.
2. Download the FASTA file of the complete genome of the Streptococcus phage (virus) M102, also found in the human mouth. (As explained in class, there is an "arms race" between S. mutans and M102, where S. mutans incorporates M102 DNA into its CRISPR array.) Does Ori-Finder. predict an origin of replication for this virus? If so, what is the predicted ori?
  In the past, the SIBZ server was functional, and this portion answers the (now defunct) question of using the SIBZ web server of Craig Benham. Since this server has an upper bound of 10,000 nt, you should input about 1/3 of the genome (I took the first 150 lines of the FASTA file). For this, you should use the BDZtrans (default) algorithm, which analyzes the competition between B,Z transitions of DNA and denaturation (bubble formation).
3. Download the FASTA file of the complete genome of Methanococcus janaschii, the archaeon found near white smokers at deep sea thermal vents at a depth of 2.5 km. This terminology is old, and Methanococcus jannaschii has been renamed Methanocaldococcus jannaschii. Sequenced by Craig Venter's TIGR, this was the first archaeon genome that was sequences -- see Bult et al., Science 1996. Woese's "theory" of three kingdoms of life -- prokaryotes, archaea, eukaryotes -- was first established from the genome analysis of M. jannaschii.
  Determine the origin of replication (oriC) using Ori-Finder 2. Before running the server, be sure to select the radio button "Methanococcaceae". Does this web server predict an origin of replication for this archaeon? If so, what is the predicted ori? Please print out a screen shot of the skew plot (Figure 1) from the web server output.
Solution:
1. The ori is indicated in printout in the left panel.
  
  The gene list (not required) is given below -- this is obtained by a simple ORF-finder -- recall that ORF is an acronym for "open reading frame".
```
       
```
2. For M102 genome, Ori-Finder cannot determine the origin of replication, in part because it is using more than the simple G+C skew plot discussed in class. Here's a screen dump of the ouput.
3. Output from Ori-Finder 2 on unannotated genome of M. jannaschii.
  MjannaschiiSkewGraphUnannotatedGenome.png
  The following site, Convert.htm, contains links to many useful tools (web servers) to translate GenBank files (.gb) to NCBI Protein Table (*.ptt) format, to FASTA format, etc. The tool to convert GenBank to PTT format is gbk2ptt. PTT files contain a list of the annotated proteins and can be additionally input to Ori-Finder and Ori-Finder 2, leading to better prediction of origins of replication.
Read the description about WebLogo, which produces LOGO plots given a multiple sequence alignment.
1. In this part, you design your own problem. From the NCBI server, download EITHER some amino acid motifs OR some nucleotide motifs. Amino acid motifs could be peptide sequences consistent with concensus sequences of the binding region of a nucleotide binding protein. Nucleic acid motifs could be TATA-boxes for a particular organism or organisms. Many examples are given on the Web Logo site. Create and print out a web logo of your sequences.
  Warning: To have a nice web logo, the sequence length must be identical for all examples, and length should not be too small. You could create a LOGO of different myoglobins, but this will be difficult to view. Experiment some.
  If you use different length sequences, then use Clustal Omega on the EBI server to produce a single multiple sequence alignment of your FASTA sequences.
  Note that you can produce an encapsulated postscript (eps) file, a pdf file, png or gif file -- using scripts from public domain, you can transform an eps file to pdf and vice-versa, or into a gif file.
2. Look at the Web LOGO example of E. coli promoters (transcription start signals) -- i.e. the so-called TATA-box for E. coli. Click "Edit Logo", and copy and past all sequences into a text file which you save. The goal of this part is to manually determine the profile, or PSSM, for 10 TATA-boxes (with a pseudocount), and then to use EXCEL to compute the values - log₂( f_x,i) used by the weight matrix method. Here's the outline of what to do:
  1. Manually extract the UPPERCAST Tata-boxes from 10 E. coli promoters. The entire sequence is known as the "promoter", but only the uppercase hexamer is the TATA-box.
  2. Using EXCEL, create a region with 4 rows, labeled respectively by A,C,G,T, and 6 columns, labeled respectively by 1,2,3,4,5,6. In the row with label A, for each i=1,...,6 enter the fraction (x+1)/(n+4) where x is the number of extracted TATA-boxes that have an A in the i-th position, and n is the number of extracted TATA-boxes (i.e. 10). The raw frequency would be x/n, but as explained in class, you should include pseudocounts because the "training" data you have prepared may not fully represent the collection of TATA-boxes found in nature. Do the same for each other row C,G,T. This creates the position specific frequencies of each nucleotide in your collection of 10 TATA-boxes.
  3. Create a second region, with similar layout, except that in row A and column i, you compute: - log( (x+1)/(n+4) )/log(2) . This takes only a couple of seconds if you use "relative references" as explained in class. Note that by dividing by log(2), you are using the "change of base" formula so that the base of the logarithm is base 2.
  4. Compute the weight matrix score of the putative TATA-box GATAAG.
  5. Compute the entropy (base 2) of the probability distribution at position 1 (i.e. the frequency of A,C,G,T at position 1). Compute the information, defined as maximum possible entropy minus the real entropy. Compute the height of the letter 'A', as the proportion of As in position 1 times the information.
Solution:
- TATAproblem.pdf: PDF file of spreadsheet
- TATAproblem.xlsx: Excel spreadsheet
- computePositionSpecificCounts.py: Program to compute the counts
- tataboxFirst10.txt: Manually extracted first 10 TATA boxes.
Doc and Suzy, who live in Steinbeck's delightful and uplifting book, Sweet Thursday, depart from the Palace Flophouse in a car loaded with equipment to capture octopus specimens. As a marine biologist, Doc is studying to what extent cephalopod anger is similar to human anger. Like insects, octopi use hemacyanin, rather than hemoglobin, as oxygen transport molecule. Doc has just extracted and sequenced the following DNA sequence from Molly, one of the captured octopi.
1. Using the gene prediction software EasyGene (for prokaryotes) and HMMgene (for eukaryotes), determine the number of exons each program predicts (none is a possible answer). What is [resp. are] the predicted protein product [resp. products]? What is the length (in amino acids) of the translation?
2. Now answer the same question using the web server GenScan.
3. Search for the GenBank annotated file for this sequence, and compare the number of exons and their location with the predictions of both software. Discuss differences.
Solution:
1. The prokaryotic gene finder EasyGene produces the output here with 9 exons. The eukaryotic gene finder HMMgene produces the output here with 5 exons on the plus strand and 2 exons on the minus strand. Obviously you should never use a prokaryotic gene finder for a eukaryote, such as octapus -- in particular, bacteria have more than one start codon (here I used S. mutans as organism when running EasyGene. In order to determine the protein product, one needs essentially to write a small program that extracts DNA from the genome and applies the genetic code to translate the codons.
2. GenScan produced the output
  
  where we see that there are 10 (predicted) internal exons, and one polyA tail at the locations given. Moreover, we have the initial portions of the peptides coded by the internal exons. If we take the first, XNLIRKDVDALSEDEVLNLQVALRAMQDDETPTGYQAIAAYHGEPADCKAPDGSTVVCCL, and use BLASTp on the NCBI server, we obtain the following output:
  
  Please look at this genscanOutputExplanation.html, taken from the web site here at Univ Lyon I.
3. The GenBank annotation in AF338426 indicates that there are 13 exons -- compare the annotation with the predictions of EasyGene, HMMgene and GenScan with the annotated GenBank file:
  .
  GenScan, developed by Chris Burge (MIT), was one of the first gene finders to use a "generalized hidden Markov model" (generalized HMM), which requires sampling exon length from a training set of known exons. This strategy was soon adopted by all eukaryotic gene finders.
  By the way, Wikipedia lists various gene finders here.
Mycobacterium tuberculosis is a pathogen causing tuberculosis. Due to the waxy coating of the bacterial surface, it is neither Gramm-positive nor Gramm-negative. This pathogen exports various proteins that act as antigens; in particular The 10-kDa (kilo Dalton) culture filtrate protein (CFP-10) is an antigen that contributes to contributes to the virulence of Mycobacterium tuberculosis. CFP-10 forms a heterodimeric complex with 6-kDa early secreted antigen target (ESAT-6), which this pathogen uses to deliver virulence factors into host macrophages and monocytes (types of white blood cells) during tuberculosis infection.
1. Search for "CFP" within the full GenBank file for the Mycobacterium bovis subsp. bovis AF2122/97 complete genome (GenBank: BX248335.1) with link here . Extract the both the nucleotide and corresponding protein for this CDS, an print it.
2. Enter the amino acid sequence you extracted in the DAS Transmembrane Predictiona Server to determine the likelihood that it is a transmembrane (non-exported) protein. What is your conclusion?
3. Please enter the FASTA format amino acid sequence for this CFP protein in the SignalP webserver. Using the SignalP web server, determine the cleavage site, the signal peptide and the mature protein. Please print out the output by SignalP. Note: Please select "LONG" for the output format, in order to obtain numerical information about the SignalP score for each amino acid position.
4. Using the same genome, determine the predicted cleavage site of the protein corresponding to gene Rv0011c. This protein has been experimentally shown not to be exported.
  1. What is the size of the signal peptide for Rv0011c?
  2. Use the PSIPRED web server to predict the secondary structure for the signal peptide of Rv2668. (b) the entire protein for Rv0011c. Please be sure to download and print both the PSIPRED raw scores in plain text format, and the PDF version of the PSIPRED diagram.
5. Do the same for Rv2668 (Genbank:BX248333).
Solution:
1. The sequence is
```
         a tgtcgcaaat catgtacaac taccccgcga tgttgggtca cgccggggat
atggccggat atgccggcac gctgcagagc ttgggtgccg agatcgccgt ggagcaggcc
gcgttgcaga gtgcgtggca gggcgatacc gggatcacgt atcaggcgtg gcaggcacag
tggaaccagg ccatggaaga tttggtgcgg gcctatcatg cgatgtccag cacccatgaa
gccaacacca tggcgatgat ggcccgcgac acggccgaag ccgccaaatg gggcggctag
```
  which codes
```
MSQIMYNYPAMLGHAGDMAGYAGTLQSLGAEIAVEQAALQSAWQ
GDTGITYQAWQAQWNQAMEDLVRAYHAMSSTHEANTMAMMARDTAEAAKWGG
```
2. DAS predicts that the CFP protein is NOT transmembrane. The image produced by DAS is as follows:
3. The images produced by SignalP web server for CFP are:
4. The images produced by SignalP web server for Rv2668 are:
5. PsiPred prediction of the SIGNAL peptide (residues 1-20) of Rv2668:
```
# PSIPRED VFORMAT (PSIPRED V3.3)

   1 M C   0.999  0.001  0.000
   2 S C   0.448  0.229  0.075
   3 Q H   0.303  0.409  0.269
   4 I H   0.193  0.520  0.241
   5 M H   0.242  0.542  0.156
   6 Y H   0.303  0.520  0.153
   7 N H   0.268  0.556  0.115
   8 Y H   0.166  0.756  0.019
   9 P H   0.116  0.786  0.008
  10 A H   0.070  0.851  0.005
  11 M H   0.061  0.879  0.003
  12 L H   0.121  0.769  0.008
  13 G H   0.144  0.723  0.014
  14 H H   0.266  0.534  0.012
  15 A H   0.298  0.528  0.005
  16 G C   0.428  0.358  0.006
  17 D C   0.518  0.329  0.010
  18 M C   0.636  0.267  0.013
  19 A C   0.741  0.151  0.017
  20 G C   0.959  0.000  0.000  
```
6. PsiPred image for Rv2668 entire protein:
  
  PsiPred raw score output for Rv2668 entire protein: here.
Use the NCBI tool OrfFinder with default settings (e.g. minimum ORF length of 75 nt) to find all ORFs within the first 50 kb of human chromosome 21q with GenBank accession code BA000005.3. Recall that the long arm of a chromosome is designated by "q", while the short arm is designated by "p" (for "petit" which means small in French). Note that the web server is limited to 50 kb portions of the q-portion of chromosome 21, whose entire length is 33,543,332 bp. There is a link to download the NCBI ORF finder program, but it requires a computer running the 64-bit Linux operating system; in contrast, you can use my Python program orfFinder1.py , which must be run with an auxilliary program aminoAcidAndGeneticCodes.py, and which has no such length restriction. These two programs are also available in the zip file orfFinder1.zip.
You can copy/paste the numerical output from the NCBI ORF Finder into a TEXT FILE, where the Orf Finder data (in the bottom right corner of your screen) is formatted in columns as follows:
1. How many ORFs are found, whose starting and ending positions appear in the first 50,000 nt? Warning: Two ORFs have an ending position indicated by ">", meaning that the ORF continues after position 50,000, and one ORF has a putative starting position to the left of position 1, so these three spurious data items must be removed. All questions using the ORF Finder relate to the data after removing these spurious items. How many of these ORFs are found on the + strand? How many of these ORFs are found on the - strand? What is the average nucleotide length of the ORFS (in entire set of both + and - strands)?
  Solution: 295 ORFs are found, of which 294 have both start and stop positions within nucleotides 1-50,000. Of the 294, 106 ORFs appear on the - strand, while the remaining 187 appear on the + strand. Average length is 141.49.
2. How many ORFs are found, whose starting and ending positions appear in the first 50,000 nt, which appear in reading frame 1? Same question for reading frames 2 and 3. (This will require for you to import the data into Excel, and to sort the imported data. If you know the syntax of the countif() function, then it will simplify your work.)
  Solution: 102 ORFs appear in frame 1, 80 ORFs appear in frame 2, 111 ORFs appear in frame 3.
  Using Excel, create a histogram and a graph of the histogram for ORF lengths that appear on the - strand. Please use bins (or class delimiters) of 20,40,60,...,300 which accounts for almost all the ORF lengths. This requires you to use Excel->Tools->Data Analysis->Histogram. If the Data Analysis option does not appear in the Tools menu of Excel, then you will have to select "Excel Add Ins" in the Tools menu, and add in the Data Analysis tool pack. This takes only a couple of minutes.
  Recall that as a member of the BC community, you can freely download Excel here. Note that Microsoft Office 2019 (current) and Office 2016 both contain the Data Analysis toolpack, while Microsoft Office 2011 does not.
  Solution:
3. Now, having seen the limitations of the NCBI web server, use my Python orfFinder1.py which can be found in the zip file orfFinder1.zip, along with the necessary supplementary code aminoAcidAndGeneticCodes.py. Please capture (by output redirection) the ORF length output for the - strand of the entire data in FASTA formatted file BA000005.3. Recall that the command you need to type in the terminal window is
  python orfFinder1.py BA000005.3.faa -1 > outputORFfinderMinusStrand.txt
  assuming that you have downloaded from the NCBI server the FASTA file for human chromosome 21q, which you have named "BA000005.3.faa". Here the "-1" is a flag to have the program compute ORFs on the - strand. Then you can import the output text file into Excel. Please determine the average length of nucleotides in the ORFs of this file. WARNING: You will need to multiply ORF length by 3, since my program computes ORF length as the number of amino acids in the putative coding region. Compute the histogram and graph the histogram of the nucleotide lengths in the ORFs. Solution: Average length is 148.80, and the histogram is here:
  
  One of the goals of this exercise is to (rather painfully) understand how essential it is to be able to write computer programs on a computer running linux, if you want to really do much beyond routine use of a web server.
A famous theorem of G. Polya states that a random walk on a lattice of one dimension (i.e. a drunkard can only go one step left or right) or of 2 dimensions (i.e. a drunkard can only go up or down or left or right one step) will with probability ALWAYS return to the origin infinitely often. Test whether the DNA in the genome of P. abyssi is "random" by writing a Python program to do the following. Start from the origin (0,0).
- Every time you read a "A", add (0,1).
- Every time you read a "C", add (1,0).
- Every time you read a "G", add (-1,0).
- Every time you read a "T", add (0,-1).
Output your sequence of x,y coordinates of the points of the two-dimensional lattice traversed in reading the genome of P. abyssi.
Now using Excel, create a scatter plot of the data. (Since the file consists of over 1.7 million points, it is likely that Excel cannot handle such a large amount of data. If this is the case, truncate your file to an amount that Excel can handle.
If you have access to an alternative graphing package with more flexibility, try to print out an image of the "walk" -- i.e. the series of line segments, going from the origin, to the first position, then from the first position to the second position, etc. I was able to do this for the output of my program on the genome of M. jannaschii, which is here.
Solution: