BIOL4200, Homework 8

Reading Assignment

Chapter 5 by Grigoriev.
Chapter 7 by Hannenhalli.

Written or Computer Assignment

This problem concerns the origin of replication of bacteria.
1. Download the complete genome of Bacillus subtilis in FASTA format. Since a search of the NCBI nucleotide database returns many hits, you may need to find out more about the organism (who first sequenced it, etc.) in order to identify the correct genome. What is the genome size?
2. Use my Python program to determine the RAW SKEW. You may have to experiment with window size. My program computes a histogram, which you'll have to plot with Excel or some graphics software of your choice. My program is here.
3. Use my Python program to determine the CUMULATIVE SKEW. Again, my program computes a histogram, which you'll have to plot with Excel or some graphics software of your choice. Be sure to change a constant in my code from False to True in order to compute the cumulative skew. My program is here.
4. At what genomic position (i.e. at what window starting position) does the MAXIMUM cumulative skew occur?
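For orientation, the raw and cumulative skew computations described above can be sketched as follows. This is only an illustration and not the course program: it assumes the common definition skew = (G - C)/(G + C) over non-overlapping windows, and the names gc_skew and position_of_max, as well as the toy sequence, are hypothetical.

```python
def gc_skew(seq, window=1000, cumulative=False):
    """Return (window_start, value) pairs for non-overlapping windows.

    ASSUMPTION: skew is defined as (G - C) / (G + C); the course program
    may use a different definition or overlapping windows.
    A trailing partial window is dropped.
    """
    seq = seq.upper()
    results = []
    running = 0.0
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skew = (g - c) / (g + c) if (g + c) > 0 else 0.0
        running += skew
        # Report 1-based window starting positions.
        results.append((start + 1, running if cumulative else skew))
    return results


def position_of_max(results):
    """Return the 1-based window start at which the value is largest."""
    return max(results, key=lambda pair: pair[1])[0]


# Hypothetical 40-nt toy sequence, NOT real genome data.
toy = "GGGC" * 5 + "CCCG" * 5
raw = gc_skew(toy, window=4)
cum = gc_skew(toy, window=4, cumulative=True)
print(position_of_max(cum))
```

In the toy sequence the skew is positive in the first half and negative in the second, so the cumulative curve rises and then falls; the window where it peaks marks the switch between the two halves.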
This problem concerns the origin of replication of bacteria, archaea and viruses. Since the web server is not currently functional, despite the recent bioRxiv preprint here, please download the fasta files and try to do what you can with my gcSkew program from the previous problem. My guess is that the authors have submitted their paper to a journal and are awaiting acceptance before making the web server available.
1. Download the FASTA file of the complete genome of Streptococcus mutans, the bacterial pathogen causing human tooth decay. Determine the origin of replication (oriC) using Ori-Finder2022. Please print out a screen shot of the skew plot (Figure 1) from the web server output.
2. Download the FASTA file of the complete genome of the Streptococcus phage (virus) M102, also found in the human mouth. (As explained in class, there is an "arms race" between S. mutans and M102, where S. mutans incorporates M102 DNA into its CRISPR array.) Does Ori-Finder2022 predict an origin of replication for this virus? If so, what is the predicted ori?
3. Download the FASTA file of the complete genome of Methanococcus jannaschii, the archaeon found near white smokers at deep sea thermal vents at a depth of 2.5 km. This terminology is old, and Methanococcus jannaschii has been renamed Methanocaldococcus jannaschii. Sequenced by Craig Venter's TIGR, this was the first archaeon genome that was sequenced -- see Bult et al., Science 1996. Woese's "theory" of three kingdoms of life -- bacteria, archaea, eukaryotes -- was first established from the genome analysis of M. jannaschii.
  Determine the origin of replication (oriC) using Ori-Finder 2. Warning! The authors of Ori-Finder originally created a distinct web server for archaebacteria, such as M. jannaschii. That web server has been removed, so this question asks you to run the *bacterial* Ori-Finder on an archaeon, immediately making the prediction possibly suspect. Does Ori-Finder return any answer? Please run my gc-skew plot program. Can you make a prediction?
Read the description about WebLogo, which produces LOGO plots given a multiple sequence alignment.
1. Please create a web logo for a class of proteins, DNA or RNA of your choice. Some suggestions: alpha chain hemoglobins, beta chain hemoglobins, myoglobins, G-protein coupled receptors (GPCR), transfer RNAs, -1 ribosomal frameshift stimulating signal, purine riboswitches, etc. You can download such sequences from NCBI, EBI, or you can download families of such sequences from Pfam or SwissProt (proteins) and Rfam (RNA). The Berkeley web logo web site presents many examples of such protein and nucleotide sequences -- however, you should NOT simply use the example web logos from the Berkeley web logo server.
  WARNING: Your sequences must have identical length (possibly including dashes), so it is best to use Clustal Omega on the EBI server to produce a single multiple sequence alignment of your FASTA sequences.
  For example, if you choose to create a web logo of the HIV ribosomal frameshift signal sequences discussed in class, then search for "HIV ribosomal frameshift" on the Rfam server, select the top hit, choose to view alignments (button "Alignments" on left panel), and download the "seed alignment" family in "FASTA (gapped)" format. You can then create the web logo.
  Note that you can produce an encapsulated postscript (eps) file, a pdf file, png or gif file -- using scripts from public domain, you can transform an eps file to pdf and vice-versa, or into a gif file.
2. Look at the Web LOGO example of E. coli promoters (transcription start signals) -- i.e. the so-called TATA-box for E. coli. Click "Edit Logo", and copy and paste all sequences into a text file which you save. The goal of this part is to manually determine the profile, or PSSM, for 10 TATA-boxes (with a pseudocount), and then to use EXCEL to compute the values -log₂(f_x,i) used by the weight matrix method. Here's the outline of what to do:
  1. Manually extract the UPPERCASE TATA-boxes from 10 E. coli promoters. The entire sequence is known as the "promoter", but only the uppercase hexamer is the TATA-box.
  2. Using EXCEL, create a region with 4 rows, labeled respectively by A,C,G,T, and 6 columns, labeled respectively by 1,2,3,4,5,6. In the row with label A, for each i=1,...,6 enter the fraction (x+1)/(n+4) where x is the number of extracted TATA-boxes that have an A in the i-th position, and n is the number of extracted TATA-boxes (i.e. 10). The raw frequency would be x/n, but as explained in class, you should include pseudocounts because the "training" data you have prepared may not fully represent the collection of TATA-boxes found in nature. Do the same for each other row C,G,T. This creates the position specific frequencies of each nucleotide in your collection of 10 TATA-boxes.
  3. Create a second region, with similar layout, except that in row A and column i, you compute: - log( (x+1)/(n+4) )/log(2) . This takes only a couple of seconds if you use "relative references" as explained in class. Note that by dividing by log(2), you are using the "change of base" formula so that the base of the logarithm is base 2.
  4. Compute the weight matrix score of the putative TATA-box GATAAG.
  5. Compute the entropy (base 2) of the probability distribution at position 1 (i.e. the frequency of A,C,G,T at position 1). Compute the information, defined as maximum possible entropy minus the real entropy. Compute the height of the letter 'A', as the proportion of As in position 1 times the information.
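The spreadsheet steps in this outline can be cross-checked with a few lines of Python. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the four hexamers are made up for illustration (they are NOT the E. coli promoter data you are asked to extract).

```python
import math

ALPHABET = "ACGT"


def frequency_matrix(boxes):
    """Pseudocount frequencies (x + 1) / (n + 4) per nucleotide and position."""
    n = len(boxes)
    length = len(boxes[0])
    return {
        nt: [(sum(1 for b in boxes if b[i] == nt) + 1) / (n + 4)
             for i in range(length)]
        for nt in ALPHABET
    }


def weight_matrix(freqs):
    """Entries -log2(f), written with the change-of-base formula."""
    return {nt: [-math.log(f) / math.log(2) for f in row]
            for nt, row in freqs.items()}


def score(weights, word):
    """Sum of the weight-matrix entries selected by each letter of word."""
    return sum(weights[nt][i] for i, nt in enumerate(word))


def entropy(freqs, i):
    """Shannon entropy (base 2) of the distribution at 0-based position i."""
    return -sum(freqs[nt][i] * math.log2(freqs[nt][i]) for nt in ALPHABET)


# Hypothetical hexamers for illustration only.
boxes = ["TATAAT", "TATAAT", "TACAAT", "GATAAT"]
freqs = frequency_matrix(boxes)
weights = weight_matrix(freqs)
print(score(weights, "GATAAG"))

h1 = entropy(freqs, 0)            # entropy at position 1 (index 0)
info1 = math.log2(4) - h1         # maximum entropy for 4 letters is 2 bits
height_A = freqs["A"][0] * info1  # letter height = frequency * information
print(h1, info1, height_A)
```

Because each entry is -log₂ of a frequency, a smaller total score means that the word matches the profile more closely. Each column of the frequency matrix sums to 1, which is a quick sanity check for your Excel region as well.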
Doc and Suzy, who live in Steinbeck's delightful and uplifting book, Sweet Thursday, depart from the Palace Flophouse in a car loaded with equipment to capture octopus specimens. As a marine biologist, Doc is studying to what extent cephalopod anger is similar to human anger. Like insects, octopi use hemocyanin, rather than hemoglobin, as their oxygen transport molecule. Doc has just extracted and sequenced the following DNA sequence from Molly, one of the captured octopi.
1. Using the gene prediction software EasyGene (for prokaryotes) and HMMgene (for eukaryotes), determine the number of exons each program predicts (none is a possible answer). What is [resp. are] the predicted protein product [resp. products]? What is the length (in amino acids) of the translation?
2. Now answer the same question using the web server GenScan.
3. Search for the GenBank annotated file for this sequence, and compare the number of exons and their locations with the predictions of the programs used above. Discuss differences.
Mycobacterium tuberculosis is the pathogen that causes tuberculosis. Due to the waxy coating of the bacterial surface, it is neither Gram-positive nor Gram-negative. This pathogen exports various proteins that act as antigens; in particular, the 10-kDa (kilodalton) culture filtrate protein (CFP-10) is an antigen that contributes to the virulence of Mycobacterium tuberculosis. CFP-10 forms a heterodimeric complex with 6-kDa early secreted antigen target (ESAT-6), which this pathogen uses to deliver virulence factors into host macrophages and monocytes (types of white blood cells) during tuberculosis infection.
1. Search for "CFP" within the full GenBank file for the Mycobacterium bovis subsp. bovis AF2122/97 complete genome (GenBank: LT708304.1) with link here. Extract both the nucleotide sequence and the corresponding protein sequence for this CDS, and print them.
2. Enter the amino acid sequence you extracted in the DAS Transmembrane Prediction Server to determine the likelihood that it is a transmembrane (non-exported) protein. What is your conclusion?
3. Please enter the FASTA format amino acid sequence for this CFP protein in the SignalP webserver. Using the SignalP web server, determine the cleavage site, the signal peptide and the mature protein. Please print out the output by SignalP. Note: Please select "LONG", which is the default, for the output format, in order to obtain numerical information about the SignalP score for each amino acid position.
4. Using the same genome, determine the predicted cleavage site of the protein corresponding to gene Rv0011c. This protein has been experimentally shown not to be exported.
  1. What is the size of the signal peptide for Rv0011c?
  2. Use the PSIPRED web server to predict the secondary structure for (a) the signal peptide of Rv2668 and (b) the entire protein for Rv0011c. Please be sure to download and print both the PSIPRED raw scores in plain text format, and the PDF version of the PSIPRED diagram.
5. Do the same for Rv2668 (Genbank:BX248333).
Use the NCBI tool OrfFinder with default settings (e.g. minimum ORF length of 75 nt) to find all ORFs within the first 50 kb of human chromosome 21q with GenBank accession code BA000005.3. Recall that the long arm of a chromosome is designated by "q", while the short arm is designated by "p" (for "petit" which means small in French). Note that the web server is limited to 50 kb portions of the q-portion of chromosome 21, whose entire length is 33,543,332 bp. There is a link to download the NCBI ORF finder program, but it requires a computer running the 64-bit Linux operating system; in contrast, you can use my Python program orfFinder1.py, which must be run with an auxiliary program aminoAcidAndGeneticCodes.py, and which has no such length restriction. These two programs are also available in the zip file orfFinder1.zip.
You can copy/paste the numerical output from the NCBI ORF Finder into a TEXT FILE, where the Orf Finder data (in the bottom right corner of your screen) is formatted in columns as follows:
1. How many ORFs are found, whose starting and ending positions appear in the first 50,000 nt? Warning: Two ORFs have an ending position indicated by ">", meaning that the ORF continues after position 50,000, and one ORF has a putative starting position to the left of position 1, so these three spurious data items must be removed. All questions using the ORF Finder relate to the data after removing these spurious items. How many of these ORFs are found on the + strand? How many of these ORFs are found on the - strand? What is the average nucleotide length of the ORFs (in the entire set of both + and - strands)?
2. How many ORFs are found, whose starting and ending positions appear in the first 50,000 nt, which appear in reading frame 1? Same question for reading frames 2 and 3. (This will require you to import the data into Excel, and to sort the imported data. If you know the syntax of the countif() function, then it will simplify your work.)
  
  Using Excel, create a histogram and a graph of the histogram for ORF lengths that appear on the - strand. Please use bins (or class delimiters) of 20,40,60,...,300 which accounts for almost all the ORF lengths. This requires you to use Excel->Tools->Data Analysis->Histogram. If the Data Analysis option does not appear in the Tools menu of Excel, then you will have to select "Excel Add Ins" in the Tools menu, and add in the Data Analysis tool pack. This takes only a couple of minutes.
  Recall that as a member of the BC community, you can freely download Excel here. Note that Microsoft Office 2019 (current) and Office 2016 both contain the Data Analysis toolpack, while Microsoft Office 2011 does not.
3. Now, having seen the limitations of the NCBI web server, use my Python program orfFinder1.py, which can be found in the zip file orfFinder1.zip, along with the necessary supplementary code aminoAcidAndGeneticCodes.py. Please capture (by output redirection) the ORF length output for the - strand of the entire data in the FASTA formatted file BA000005.3. Recall that the command you need to type in the terminal window is
  python orfFinder1.py BA000005.3.faa -1 > outputORFfinderMinusStrand.txt
  assuming that you have downloaded from the NCBI server the FASTA file for human chromosome 21q, which you have named "BA000005.3.faa". Here the "-1" is a flag to have the program compute ORFs on the - strand. Then you can import the output text file into Excel. Please determine the average length of nucleotides in the ORFs of this file. WARNING: You will need to multiply ORF length by 3, since my program computes ORF length as the number of amino acids in the putative coding region. Compute the histogram and graph the histogram of the nucleotide lengths in the ORFs.
  One of the goals of this exercise is to (rather painfully) understand how essential it is to be able to write computer programs on a computer running linux, if you want to really do much beyond routine use of a web server.
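If you prefer to check your Excel results in Python, the averaging and binning steps above might look like the sketch below. It ASSUMES that the redirected output contains one ORF length (in amino acids) per line, which may not match the actual output format of orfFinder1.py; the function names and the sample values are hypothetical.

```python
from bisect import bisect_left


def nucleotide_lengths(lines):
    """Convert amino-acid ORF lengths to nucleotide lengths (multiply by 3).

    ASSUMPTION: each non-blank line holds a single integer length.
    """
    return [3 * int(line) for line in lines if line.strip()]


def histogram(values, upper_bounds):
    """Count values per bin; a value v goes in the first bin with v <= bound.

    The final extra slot counts values above the last bound, like the
    "More" row of the Excel histogram tool.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        counts[bisect_left(upper_bounds, v)] += 1
    return counts


# Hypothetical program output (amino acids), NOT real chromosome 21 data.
sample = ["25\n", "30\n", "110\n", "\n"]
lengths = nucleotide_lengths(sample)
bins = list(range(20, 301, 20))   # 20, 40, ..., 300
counts = histogram(lengths, bins)
mean = sum(lengths) / len(lengths)
print(mean, counts)
```

To run this on the real data, replace sample by an open file handle, e.g. open("outputORFfinderMinusStrand.txt"), and choose bins wide enough for nucleotide lengths.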
(Optional problem): The following problem, which is optional and requires you to write a simple program, is not really a bioinformatics problem, but rather an early attempt to gain some idea of what "junk DNA" might encode. Before 2000 there were a number of papers that attempted to analyze genomic DNA by determining statistical insights (trinucleotide frequency, etc.). Of course some of these insights, such as trinucleotide frequency analysis, ultimately led to useful tools, such as gene finders. The point of this problem is to recreate the "mind set" of researchers trying to see some pattern in large genomic data sets.
A famous theorem of G. Polya states that a random walk on a lattice of one dimension (i.e. a drunkard can only go one step left or right) or of 2 dimensions (i.e. a drunkard can only go up or down or left or right one step) will, with probability 1, return to the origin infinitely often. Test whether the DNA in the genome of P. abyssi is "random" by writing a Python program to do the following. Start from the origin (0,0).
- Every time you read a "A", add (0,1).
- Every time you read a "C", add (1,0).
- Every time you read a "G", add (-1,0).
- Every time you read a "T", add (0,-1).
Output your sequence of x,y coordinates of the points of the two-dimensional lattice traversed in reading the genome of P. abyssi.
Now using Excel, create a scatter plot of the data. (Since the file consists of over 1.7 million points, it is likely that Excel cannot handle such a large amount of data. If this is the case, truncate your file to an amount that Excel can handle.)
If you have access to an alternative graphing package with more flexibility, try to print out an image of the "walk" -- i.e. the series of line segments, going from the origin, to the first position, then from the first position to the second position, etc. I was able to do this for the output of my program on the genome of M. jannaschii, which is here.
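As a starting point, the walk defined by the four rules above can be sketched as follows. This is one possible outline rather than a reference solution: the function names are hypothetical, and skipping characters other than A, C, G, T (such as N) is an assumption that the problem statement does not address.

```python
# Step taken for each nucleotide, as specified in the problem.
STEPS = {"A": (0, 1), "C": (1, 0), "G": (-1, 0), "T": (0, -1)}


def dna_walk(seq):
    """Yield the lattice points visited, starting at the origin (0, 0)."""
    x, y = 0, 0
    yield (x, y)
    for nt in seq.upper():
        if nt not in STEPS:   # ASSUMPTION: ignore N and other ambiguity codes
            continue
        dx, dy = STEPS[nt]
        x, y = x + dx, y + dy
        yield (x, y)


def read_fasta_sequence(lines):
    """Concatenate the sequence lines of a FASTA file, ignoring headers."""
    return "".join(line.strip() for line in lines if not line.startswith(">"))


# Tiny hypothetical example, NOT the P. abyssi genome.
points = list(dna_walk(read_fasta_sequence([">toy\n", "ACG\n", "TTA\n"])))
for x, y in points:
    print(x, y)
```

Printing one "x y" pair per line gives a file that Excel or any plotting package can import. Note that read_fasta_sequence joins all records of a multi-record FASTA file into a single sequence.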

Please staple all homework sheets together, and write your name, homework assignment, email address, and date.