Solution:
here.
The SIBZ web server of Craig Benham used to be functional, and this portion answers the (now defunct) question that relied on it. Since this server has an upper bound of 10,000 nt, you should input about 1/3 of the genome (I took the first 150 lines of the FASTA file). For this, you should use the BDZtrans (default) algorithm, which analyzes the competition between B-Z transitions of DNA and denaturation (bubble formation).
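If you prefer to cut the input by nucleotide count rather than by line count, the truncation can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the 10,000 nt limit is the one quoted above, and the sketch assumes a FASTA file with a single record.

```python
def read_fasta_sequence(lines):
    """Concatenate the sequence lines of a single-record FASTA file (header lines skipped)."""
    return "".join(
        line.strip() for line in lines if line.strip() and not line.startswith(">")
    )


def truncate(seq, max_nt=10_000):
    """Keep only the first max_nt nucleotides."""
    return seq[:max_nt]


def to_fasta(header, seq, width=70):
    """Format a sequence as FASTA text with a fixed line width."""
    rows = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(rows) + "\n"


if __name__ == "__main__":
    demo = [">demo record", "ACGTACGTAC", "GTACGTACGT"]
    seq = truncate(read_fasta_sequence(demo), max_nt=12)
    print(to_fasta("demo record, first 12 nt", seq, width=10))
```

For a real genome you would pass an open file handle as `lines` and keep the default `max_nt`.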
Determine the origin of replication (oriC) using Ori-Finder 2. Before running the server, be sure to select the radio button "Methanococcaceae". Does this web server predict an origin of replication for this archaeon? If so, what is the predicted ori? Please print out a screen shot of the skew plot (Figure 1) from the web server output.
Solution:
The gene list (not required) is given below -- this is obtained by a simple ORF-finder -- recall that ORF is an acronym for "open reading frame".
MjannaschiiSkewGraphUnannotatedGenome.png
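For intuition about what a skew plot displays, here is a minimal sketch of cumulative GC skew: a running sum that goes up by one at each G and down by one at each C. It is not the method used internally by Ori-Finder 2, and the function names are hypothetical. In many bacterial genomes the minimum of this curve lies near the origin of replication; for archaea the signal is often much less clear.

```python
def cumulative_gc_skew(seq):
    """Return the running sum of +1 for each G and -1 for each C."""
    total = 0
    walk = []
    for base in seq.upper():
        if base == "G":
            total += 1
        elif base == "C":
            total -= 1
        walk.append(total)
    return walk


def skew_extremes(walk):
    """Return the 0-based positions of the first minimum and first maximum."""
    lowest = min(range(len(walk)), key=walk.__getitem__)
    highest = max(range(len(walk)), key=walk.__getitem__)
    return lowest, highest


if __name__ == "__main__":
    walk = cumulative_gc_skew("CCGGGA")
    print(walk)                 # [-1, -2, -1, 0, 1, 1]
    print(skew_extremes(walk))  # (1, 4)
```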
The following site, Convert.htm, contains links to many useful tools (web servers) to translate GenBank files (.gb) to NCBI Protein Table (*.ptt) format, to FASTA format, etc. The tool to convert GenBank to PTT format is gbk2ptt. PTT files contain a list of the annotated proteins and can be additionally input to Ori-Finder and Ori-Finder 2, leading to better prediction of origins of replication.
Warning: To have a nice web logo, the sequence length must be identical for all examples, and length should not be too small. You could create a LOGO of different myoglobins, but this will be difficult to view. Experiment some.
If your sequences have different lengths, first use Clustal Omega on the EBI server to produce a single multiple sequence alignment of your FASTA sequences.
Note that you can produce an encapsulated PostScript (eps) file, a pdf file, or a png or gif file -- using public-domain scripts, you can convert an eps file to pdf and vice versa, or into a gif file.
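To see why equal-length (aligned) sequences are needed, it helps to look at what a logo plots: for each column, the height of the stack is the information content of that column. The sketch below computes it as log2(alphabet size) minus the Shannon entropy of the column. It is a simplified illustration with a hypothetical function name; it omits the small-sample correction that logo programs such as WebLogo apply, and it treats a gap character as just another symbol.

```python
import math
from collections import Counter


def column_information(seqs, alphabet_size=4):
    """Information content (bits) of each column of equal-length sequences."""
    lengths = {len(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length; align them first")
    max_bits = math.log2(alphabet_size)
    info = []
    for column in zip(*seqs):
        counts = Counter(column)
        n = len(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(max_bits - entropy)
    return info


if __name__ == "__main__":
    for i, bits in enumerate(column_information(["ACGT", "ACGA", "ACCA", "AGCA"]), 1):
        print(i, round(bits, 3))
```

For protein sequences, pass `alphabet_size=20`.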
Solution:
Solution:
where we see that there are 10 (predicted) internal exons, and one polyA tail at the locations given. Moreover, we have the initial portions of the peptides coded by the internal exons. If we take the first, XNLIRKDVDALSEDEVLNLQVALRAMQDDETPTGYQAIAAYHGEPADCKAPDGSTVVCCL, and use BLASTp on the NCBI server, we obtain the following output:
Please look at this genscanOutputExplanation.html, taken from the web site here at Univ Lyon I.
GenScan, developed by Chris Burge (MIT), was one of the first gene finders to use a "generalized hidden Markov model" (generalized HMM), which requires sampling exon length from a training set of known exons. This strategy was soon adopted by all eukaryotic gene finders.
By the way, Wikipedia lists various gene finders here.
Solution:
a tgtcgcaaat catgtacaac taccccgcga tgttgggtca cgccggggat
atggccggat atgccggcac gctgcagagc ttgggtgccg agatcgccgt ggagcaggcc
gcgttgcaga gtgcgtggca gggcgatacc gggatcacgt atcaggcgtg gcaggcacag
tggaaccagg ccatggaaga tttggtgcgg gcctatcatg cgatgtccag cacccatgaa
gccaacacca tggcgatgat ggcccgcgac acggccgaag ccgccaaatg gggcggctag
which codes for
MSQIMYNYPAMLGHAGDMAGYAGTLQSLGAEIAVEQAALQSAWQGDTGITYQAWQAQWNQAMEDLVRAYHAMSSTHEANTMAMMARDTAEAAKWGG
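The translation can be checked with a short script. The sketch below builds the standard genetic code from the usual compact 64-letter string (codons ordered with T, C, A, G at each position) and translates from the first nucleotide up to the first stop codon. The function name is hypothetical, and the code ignores spaces and digits so that a GenBank-style sequence block can be pasted in directly.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: TTT -> F, TTC -> F, ..., GGG -> G.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA in frame 1, stopping at the first stop codon."""
    seq = "".join(ch for ch in dna.upper() if ch in "ACGT")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("a tgtcgcaaat c"))  # MSQI
```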
# PSIPRED VFORMAT (PSIPRED V3.3)
   1 M C   0.999  0.001  0.000
   2 S C   0.448  0.229  0.075
   3 Q H   0.303  0.409  0.269
   4 I H   0.193  0.520  0.241
   5 M H   0.242  0.542  0.156
   6 Y H   0.303  0.520  0.153
   7 N H   0.268  0.556  0.115
   8 Y H   0.166  0.756  0.019
   9 P H   0.116  0.786  0.008
  10 A H   0.070  0.851  0.005
  11 M H   0.061  0.879  0.003
  12 L H   0.121  0.769  0.008
  13 G H   0.144  0.723  0.014
  14 H H   0.266  0.534  0.012
  15 A H   0.298  0.528  0.005
  16 G C   0.428  0.358  0.006
  17 D C   0.518  0.329  0.010
  18 M C   0.636  0.267  0.013
  19 A C   0.741  0.151  0.017
  20 G C   0.959  0.000  0.000
> PsiPred raw score output for Rv2668 entire protein: here.
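If you want to work with the raw scores rather than read them by eye, PSIPRED's vertical format is easy to parse: each data line holds the residue number, the amino acid, the predicted state, and three scores. The sketch below assumes one residue per line and assumes the three scores are in the order coil, helix, strand, which is consistent with the values shown for this protein. The function names are hypothetical.

```python
def parse_psipred_vformat(text):
    """Return a list of (position, residue, state, coil, helix, strand) tuples."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the comment header, blank lines, and anything that is not a data row.
        if len(fields) != 6 or line.lstrip().startswith("#"):
            continue
        position, residue, state = int(fields[0]), fields[1], fields[2]
        coil, helix, strand = (float(x) for x in fields[3:])
        records.append((position, residue, state, coil, helix, strand))
    return records


def state_string(records):
    """Concatenate the predicted states into one string, e.g. 'CCHH...'."""
    return "".join(record[2] for record in records)


if __name__ == "__main__":
    example = """# PSIPRED VFORMAT (PSIPRED V3.3)

   1 M C   0.999  0.001  0.000
   2 S C   0.448  0.229  0.075
   3 Q H   0.303  0.409  0.269
   4 I H   0.193  0.520  0.241
"""
    print(state_string(parse_psipred_vformat(example)))  # CCHH
```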
Solution: 295 ORFs are found, of which 294 have both start and stop positions within nucleotides 1-50,000. Of the 294, 106 ORFs appear on the - strand, while the remaining 187 appear on the + strand. Average length is 141.49.
Solution: 102 ORFs appear in frame 1, 80 ORFs appear in frame 2, 111 ORFs appear in frame 3.
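Counts by strand and frame like these come from a simple scan of each reading frame. The sketch below is not the program used for the solution; it is a minimal illustration that assumes an ORF runs from the first ATG in a frame to the next in-frame stop codon, and all names in it are hypothetical. ORFs on the - strand are found by running the same scan on the reverse complement.

```python
from collections import Counter

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=1):
    """Yield (frame, start, end, n_codons) for ATG...stop ORFs on one strand.

    frame is 1, 2 or 3; start and end are 0-based with end exclusive
    (the stop codon is included); n_codons excludes the stop codon.
    """
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    yield (frame + 1, start, i + 3, n_codons)
                start = None


if __name__ == "__main__":
    plus = "ATGAAATAGCATGCCCTGA"
    orfs = list(find_orfs(plus))
    print(orfs)                                # [(1, 0, 9, 2), (2, 10, 19, 2)]
    print(Counter(frame for frame, *_ in orfs))
    print(list(find_orfs(reverse_complement(plus))))
```

Different ORF definitions (for example, allowing alternative start codons or requiring a minimum length) will give different counts, so do not expect this sketch to reproduce the numbers above exactly.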
Using Excel, create a histogram and a graph of the histogram for ORF lengths that appear on the - strand. Please use bins (or class delimiters) of 20,40,60,...,300, which account for almost all the ORF lengths. This requires you to use Excel->Tools->Data Analysis->Histogram. If the Data Analysis option does not appear in the Tools menu of Excel, then you will have to select "Excel Add-ins" in the Tools menu, and add in the Analysis ToolPak. This takes only a couple of minutes.
Recall that as a member of the BC community, you can freely download Excel here. Note that Microsoft Office 2019 (current) and Office 2016 both contain the Analysis ToolPak, while Microsoft Office 2011 does not.
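If Excel is not available, the same binning can be sketched in Python. The code below follows the convention I understand Excel's Histogram tool to use: a value is counted in the first bin whose upper limit is greater than or equal to it, and anything above the last limit goes into a final "More" bin. The function name and the sample lengths are made up for illustration.

```python
from bisect import bisect_left


def histogram(values, bin_edges):
    """Count values per bin; the last count is the overflow ('More') bin."""
    counts = [0] * (len(bin_edges) + 1)
    for value in values:
        # Index of the first edge that is >= value.
        counts[bisect_left(bin_edges, value)] += 1
    return counts


if __name__ == "__main__":
    edges = list(range(20, 301, 20))            # 20, 40, ..., 300
    lengths = [5, 20, 21, 40, 299, 300, 301]    # made-up ORF lengths
    counts = histogram(lengths, edges)
    for label, count in zip(edges + ["More"], counts):
        print(f"{label:>4} {'#' * count}")
```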
Solution:
python orfFinder1.py BA000005.3.faa -1 > outputORFfinderMinusStrand.txt
assuming that you have downloaded from the NCBI server the FASTA file for human chromosome 21q, which you have named "BA000005.3.faa". Here the "-1" is a flag to have the program compute ORFs on the - strand. Then you can import the output text file into Excel. Please determine the average length of nucleotides in the ORFs of this file. WARNING: You will need to multiply ORF length by 3, since my program computes ORF length as the number of amino acids in the putative coding region. Compute the histogram and graph the histogram of the nucleotide lengths in the ORFs.
Solution: Average length is 148.80, and the histogram is here:
One of the goals of this exercise is to (rather painfully) understand how essential it is to be able to write computer programs on a computer running Linux, if you want to do much beyond routine use of a web server.
Now using Excel, create a scatter plot of the data. (Since the file consists of over 1.7 million points, it is likely that Excel cannot handle such a large amount of data. If this is the case, truncate your file to an amount that Excel can handle.)
If you have access to an alternative graphing package with more flexibility, try to print out an image of the "walk" -- i.e. the series of line segments, going from the origin, to the first position, then from the first position to the second position, etc. I was able to do this for the output of my program on the genome of M. jannaschii, which is here.
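As an illustration of how such a walk can be produced and reduced to a size that a spreadsheet can plot, here is a minimal sketch. The step assigned to each nucleotide (A up, T down, G right, C left) is an assumption made for this example and may differ from the rule used by my program; the function names are also hypothetical. Thinning keeps evenly spaced points rather than only the beginning of the file, so the overall shape of the walk is preserved.

```python
# Assumed step for each nucleotide; other conventions are possible.
STEPS = {"A": (0, 1), "T": (0, -1), "G": (1, 0), "C": (-1, 0)}


def dna_walk(seq):
    """Return the list of (x, y) positions visited, starting at the origin."""
    x = y = 0
    points = [(0, 0)]
    for base in seq.upper():
        dx, dy = STEPS.get(base, (0, 0))  # unknown symbols such as N do not move
        x += dx
        y += dy
        points.append((x, y))
    return points


def thin(points, max_points):
    """Keep at most max_points evenly spaced points, always including the first."""
    step = max(1, -(-len(points) // max_points))  # ceiling division
    return points[::step]


if __name__ == "__main__":
    walk = dna_walk("AGTCAAGG")
    for x, y in thin(walk, 5):
        print(f"{x}\t{y}")  # tab-separated, ready to paste into a spreadsheet
```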
Solution: