This problem concerns the origin of replication of bacteria, archaea and viruses. Since the web server is not currently functional, despite the recent bioRxiv preprint here, please download the fasta files and try to do what you can with my gcSkew program from the previous problem. My guess is that the authors have submitted their paper to a journal and are awaiting acceptance before making the web server available.
Determine the origin of replication (oriC) using Ori-Finder 2. Warning! The authors of Ori-Finder originally created a distinct web server for archaebacteria, such as M. jannaschii. That web server has been removed, so this question asks you to run the *bacterial* Ori-Finder on an archaeon, immediately making the prediction possibly suspect. Does Ori-Finder return any answer? Please run my gc-skew plot program. Can you make a prediction?
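As a rough illustration of what a GC-skew calculation involves, here is a minimal sketch. It is not the gcSkew program referred to above; the function names and the window size are assumptions chosen for illustration. It computes the skew (G - C)/(G + C) in consecutive windows and reports where the running sum of the skew is smallest, a position often taken as a candidate origin in bacterial genomes. For an archaeon the result should be treated with the same caution as the Ori-Finder prediction.

```python
def gc_skew(seq, window=1000):
    """Return a list of (window_start, skew), with skew = (G - C) / (G + C)."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # A window with no G or C has undefined skew; report it as 0.0.
        skew = (g - c) / (g + c) if (g + c) > 0 else 0.0
        skews.append((start, skew))
    return skews


def cumulative_skew_minimum(skews):
    """Return the window start at which the running sum of skew is smallest."""
    total = 0.0
    best_start, best_total = None, float("inf")
    for start, skew in skews:
        total += skew
        if total < best_total:
            best_start, best_total = start, total
    return best_start
```

Plotting the running sum against the window start gives the usual cumulative skew curve; a clear single minimum supports a prediction, while a flat or multi-valley curve suggests the method is not decisive for that genome.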
WARNING: Your sequences must have identical length (possibly including dashes), so it is best to use Clustal Omega on the EBI server to produce a single multiple sequence alignment of your FASTA sequences.
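Before building a logo, the identical-length requirement can be checked with a few lines of Python. This is a minimal sketch; the function names are hypothetical and the parser handles only plain FASTA (header lines starting with ">", sequence possibly split over several lines).

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) pairs."""
    records = []
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def all_same_length(records):
    """True if every sequence (dashes included) has the same length."""
    return len({len(seq) for _, seq in records}) <= 1
```

For a file on disk, something like `all_same_length(read_fasta(open("alignment.fasta")))` would do, where "alignment.fasta" is a placeholder file name. If the check fails, the sequences still need to be aligned.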
For example, if you choose to create a web logo of the HIV ribosomal frameshift signal sequences discussed in class, then search for "HIV ribosomal frameshift" on the Rfam server, select the top hit, choose to view alignments (button "Alignments" on left panel), and download the "seed alignment" family in "FASTA (gapped)" format. You can then create the web logo.
Note that you can produce an encapsulated postscript (eps) file, a pdf file, png or gif file -- using public-domain scripts, you can transform an eps file to pdf and vice-versa, or into a gif file.
Using Excel, create a histogram and a graph of the histogram for ORF lengths that appear on the - strand. Please use bins (or class delimiters) of 20,40,60,...,300 which accounts for almost all the ORF lengths. This requires you to use Excel->Tools->Data Analysis->Histogram. If the Data Analysis option does not appear in the Tools menu of Excel, then you will have to select "Excel Add Ins" in the Tools menu, and add in the Data Analysis tool pack. This takes only a couple of minutes.
Recall that as a member of the BC community, you can freely download Excel here. Note that Microsoft Office 2019 (current) and Office 2016 both contain the Data Analysis toolpack, while Microsoft Office 2011 does not.
python orfFinder1.py BA000005.3.faa -1 > outputORFfinderMinusStrand.txt
assuming that you have downloaded from the NCBI server the FASTA file for human chromosome 21q, which you have named "BA000005.3.faa". Here the "-1" is a flag to have the program compute ORFs on the - strand. Then you can import the output text file into Excel. Please determine the average length, in nucleotides, of the ORFs in this file. WARNING: You will need to multiply each ORF length by 3, since my program computes ORF length as the number of amino acids in the putative coding region. Compute the histogram and graph the histogram of the nucleotide lengths of the ORFs.
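The arithmetic can also be cross-checked outside Excel. The sketch below is only an illustration: the output format of orfFinder1.py is not described here, so the functions take a plain list of ORF lengths in amino acids, and all function names are hypothetical. The histogram treats each bin value as an inclusive upper limit and adds a final overflow bin, which is the convention Excel's Histogram tool appears to follow.

```python
from bisect import bisect_left


def nucleotide_lengths(aa_lengths):
    """Convert ORF lengths in amino acids to lengths in nucleotides."""
    return [3 * n for n in aa_lengths]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def histogram(values, edges):
    """Count values per bin; edges are sorted, inclusive upper limits.

    The returned list has len(edges) + 1 entries; the last entry counts
    values larger than the final edge (the overflow bin).
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        # bisect_left gives the first edge that is >= v.
        counts[bisect_left(edges, v)] += 1
    return counts
```

With the bins suggested earlier, `histogram(aa_lengths, list(range(20, 301, 20)))` bins the amino acid lengths. For nucleotide lengths the edges would presumably need to be three times wider, for example `range(60, 901, 60)`, but that choice is an assumption and not part of the assignment.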
One of the goals of this exercise is to understand (rather painfully) how essential it is to be able to write computer programs on a computer running Linux if you want to do much beyond routine use of a web server.
A famous theorem of G. Polya states that a random walk on a lattice of one dimension (i.e. a drunkard can only go one step left or right) or of 2 dimensions (i.e. a drunkard can only go one step up, down, left or right) will, with probability 1, return to the origin infinitely often. Test whether the DNA in the genome of P. abyssi is "random" by writing a Python program to do the following. Start from the origin (0,0).
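The general shape of such a program can be sketched as follows. The rule that maps each nucleotide to a direction is NOT taken from this assignment; the mapping in `STEPS` is a hypothetical placeholder and should be replaced by the rule the problem actually specifies. The function names are likewise illustrative.

```python
# Hypothetical step rule (an assumption, not the assignment's own mapping):
# A = left, T = right, C = down, G = up.
STEPS = {"A": (-1, 0), "T": (1, 0), "C": (0, -1), "G": (0, 1)}


def dna_walk(seq, steps=STEPS):
    """Return the list of lattice points visited, starting at (0, 0)."""
    x, y = 0, 0
    path = [(x, y)]
    for base in seq.upper():
        if base not in steps:
            continue  # skip N and other ambiguity codes
        dx, dy = steps[base]
        x += dx
        y += dy
        path.append((x, y))
    return path


def count_returns(path):
    """Number of times the walk comes back to the origin after the start."""
    return sum(1 for point in path[1:] if point == (0, 0))
```

Writing each point of `path` as a tab-separated line gives a text file that a spreadsheet or plotting package can read. A walk that drifts steadily away from the origin and seldom returns would be one sign that the sequence does not behave like a symmetric random walk.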
Now using Excel, create a scatter plot of the data. (Since the file consists of over 1.7 million points, it is likely that Excel cannot handle such a large amount of data. If this is the case, truncate your file to an amount that Excel can handle.)
If you have access to an alternative graphing package with more flexibility, try to print out an image of the "walk" -- i.e. the series of line segments, going from the origin, to the first position, then from the first position to the second position, etc. I was able to do this for the output of my program on the genome of M. jannaschii, which is here.
Please staple all homework sheets together, and write your name, homework assignment, email address, and date.