Cytoplasmic polyadenylation elements

Biologists, biochemists and chemists often want to search for a particular string in a nucleotide or amino acid sequence. Gene finding is a sophisticated task that typically relies on hidden Markov models, but often one simply wants to find the locations within the genome of a certain signal (e.g. the TATA box). Here is a problem suggested by Laura Hake, in her own words (personal communication); for more information, see Stebbins-Boaz, Hake and Richter, 1996, EMBO J. 15:2582-2592.

"In the processing of an mRNA in frog eggs, the mRNA is first transcribed from the DNA, spliced to remove introns, clipped at the 3' end, and a polyadenylate tail is added. The mRNA is then transported to the cytoplasm. Now, many of the mRNAs in Xenopus oocytes are stockpiled and stored for use later in development, so they sit in the cytoplasm, not being translated into proteins. In fact, as soon as these "stored" mRNAs are transported into the cytoplasm, their polyadenylate (poly A) tails are shortened to about 15-40 nucleotides. mRNAs are then activated for translation into proteins at particular times and in particular regions of the oocyte and developing embryo. The CPE (cytoplasmic polyadenylation element) sequence comes into play at one of these times, when the oocyte is reinitiating its development in preparation for fertilization. Basically, the CPE recruits the cytoplasmic polyadenylation element binding protein (CPEB). CPEB then recruits a protein complex, called CPSF, to the formerly quiescent mRNA. CPSF physically interacts with both CPEB and another sequence in the mRNA, to the 3' side of the CPE, called the nuclear polyadenylation hexanucleotide. This particular sequence (AAUAAA) is required for the polyadenylation of mRNAs that occurs en masse in the nucleus. It also turns out that AAUAAA is ALSO necessary for cytoplasmic polyadenylation (as is apparent by the requirement for the CPSF complex). CPSF, once bound to the mRNA, recruits the enzyme poly A polymerase to the mRNA. This enzyme then goes about the business of elongating the poly (A) tail of the mRNA. The elongated poly (A) tail then binds up another set of proteins, which interact with proteins bound to the 5' end of the mRNA. These 5' end bound proteins are necessary for translation initiation. Thus, this interaction between the 3' and 5' ends of the mRNA enhances the translation of the mRNA."

Examples of CPE:

UUUUUAU
UUUUAAU
UUUUUAUAAAG

CPEs are generally located on the 5' side of the nuclear polyadenylation hexanucleotide (AAUAAA). The table below gives the CPEs (lower case), followed by the number of nucleotides separating each CPE from the downstream hexanucleotide sequence, followed by the hexanucleotide sequence (upper case). Data from Stebbins-Boaz, Hake and Richter, 1996, EMBO J. 15:2582-2592.

Origin       CPE - separation - hexanucleotide
B4 RNA       uuuuuaau-13nt-AAUAAA
G10          uuuuuuauaaag-7nt-AAUAAA
c-mos        uuuuauAAUAAA (CPE/hex overlap)
cdk2         uuuuau-15nt-uuuuuaauuuuau-57nt-AAUAAA
cyclin A1    uuuuuAAUAAA (CPE/hex overlap)
cyclin B1    uuuuuaau-10nt-uuuuAAUAAA
cyclin B2    uuuuuauu-45nt-AAUAAA-8nt-uuuuuuauuu
wee1         uuuuuau-12nt-uuuuaAAUAAA-2nt-uuuuaau

Note: In the examples of c-mos and cyclin A1, the CPE overlaps the hexanucleotide. It appears that the CPE and the hexanucleotide can be up to 100 nt apart.
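As a minimal sketch of how such CPE/hexanucleotide pairs could be located automatically, the following Python function scans a sequence for a CPE followed by AAUAAA within a maximum separation. The regular expression U{4,6}A{1,2}U is only an assumed approximation of the CPE consensus, chosen to cover the short examples listed above; the function name, the max_gap parameter and the tuple layout are likewise illustrative choices, not part of the original problem statement. A negative separation means that the CPE and the hexanucleotide overlap.

```python
import re

# ASSUMPTION: a rough CPE consensus (4-6 U, then 1-2 A, then U).
CPE_RE = re.compile(r"U{4,6}A{1,2}U")
# Lookahead so that overlapping hexanucleotide occurrences are all reported.
HEX_RE = re.compile(r"(?=AAUAAA)")

# 3' UTR of the Xenopus cyclin B1 cDNA, as given in this section.
CYCLIN_B1 = (
    "aggactacgtggcattccaattgtgtattgttggcaccatgtgcttctgtaaatagtgtattg"
    "tgtttttaatgttttactggttttAATAAAgctcattttaacatg"
)


def find_cpe_hex_pairs(seq, max_gap=100):
    """Return (cpe_start, cpe, separation, hex_start) tuples, 0-based.

    DNA input is accepted: T is converted to U before scanning.
    """
    rna = seq.upper().replace("T", "U")
    hex_starts = [m.start() for m in HEX_RE.finditer(rna)]
    pairs = []
    for m in CPE_RE.finditer(rna):
        for h in hex_starts:
            separation = h - m.end()  # negative when CPE and hex overlap
            if h >= m.start() and separation <= max_gap:
                pairs.append((m.start(), m.group(), separation, h))
    return pairs


if __name__ == "__main__":
    for hit in find_cpe_hex_pairs(CYCLIN_B1):
        print(hit)
```

On the cyclin B1 sequence this reports the CPE UUUUUAAU 14 nt upstream of AAUAAA (the 10 separating nucleotides plus uuuu in the table) and a second candidate, UUUUAAU, that overlaps the hexanucleotide by 3 nt.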

This problem motivates the following questions, which build up to a program for recognizing the motif consisting of a CPE sequence, followed by separating nucleotides, followed by AAUAAA.

What is the probability of finding the AAUAAA hexanucleotide in a random nucleotide sequence of length n (i.e. uniform distribution)?
What about when the random nucleotide sequence has a given nucleotide usage or compositional frequency (i.e. 0-th order Markov chain)?
What about when the random nucleotide sequence has a given dinucleotide frequency (i.e. first order Markov chain), or trinucleotide frequency (i.e. second order Markov chain)?
What is the distribution of CPE elements (with AAUAAA downstream within 100 bases)? Are the CPE elements clustered in certain locations on certain chromosomes? Does the graph of the histogram of separating distances between successive CPE elements (analogue of "interarrival time") look like a bell curve, or what?
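For the first two questions, one possible approach (a sketch, not the only way to do it) is dynamic programming over a small pattern-matching automaton: state k records that the last k nucleotides read equal the first k nucleotides of the motif, and the probability mass that never reaches the final state is the probability that the motif does not occur. The function name and the freqs parameter below are illustrative; the model assumes independent positions, so it covers the uniform and 0-th order cases but not dinucleotide or trinucleotide dependence.

```python
def occurrence_probability(motif, n, freqs=None):
    """Probability that `motif` occurs at least once in a random
    sequence of length n with independent positions.

    freqs maps each nucleotide to its probability; uniform if omitted.
    """
    alphabet = "ACGU"
    if freqs is None:
        freqs = {b: 0.25 for b in alphabet}
    m = len(motif)

    def next_state(k, b):
        # Longest suffix of motif[:k] + b that is also a prefix of motif.
        s = motif[:k] + b
        while s and not motif.startswith(s):
            s = s[1:]
        return len(s)

    trans = [{b: next_state(k, b) for b in alphabet} for k in range(m)]

    # dist[k]: probability of being in state k with no occurrence so far.
    dist = [1.0] + [0.0] * (m - 1)
    for _ in range(n):
        new = [0.0] * m
        for k, p in enumerate(dist):
            for b in alphabet:
                j = trans[k][b]
                if j < m:  # paths that complete the motif are dropped
                    new[j] += p * freqs[b]
        dist = new
    return 1.0 - sum(dist)


if __name__ == "__main__":
    for n in (6, 100, 1000, 10000):
        print(n, occurrence_probability("AAUAAA", n))
```

For n = 6 under the uniform model the result is 4^-6 = 1/4096, as expected for a single window. For comparison, the expected number of occurrences under the uniform model is (n - 5)/4096.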

In part this problem cannot be answered, because currently there is no fully sequenced frog genome; moreover, the frog is tetraploid, unlike the diploid fly, human, etc. Nevertheless, the programs could be developed to run on partial genomic sequences.
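As a hedged illustration of what such a program might look like for the last question, the sketch below collects the distances between successive CPE matches in a sequence and bins them into a simple histogram. The CPE pattern U{4,6}A{1,2}U is an assumed approximation of the consensus, and the function names and the bin_width parameter are hypothetical; for brevity the sketch does not check for a downstream AAUAAA.

```python
import re
from collections import Counter

# ASSUMPTION: rough CPE consensus, not taken from the cited paper.
DEFAULT_CPE = r"U{4,6}A{1,2}U"


def successive_distances(seq, pattern=DEFAULT_CPE):
    """Distances between start positions of successive pattern matches.

    Matches are non-overlapping; DNA input is converted to RNA first.
    """
    rna = seq.upper().replace("T", "U")
    starts = [m.start() for m in re.finditer(pattern, rna)]
    return [b - a for a, b in zip(starts, starts[1:])]


def histogram(distances, bin_width=10):
    """Count distances per bin; keys are the lower edges of the bins."""
    return Counter((d // bin_width) * bin_width for d in distances)


if __name__ == "__main__":
    toy = "UUUUAUGGGGUUUUUAUCCUUUUAAU"
    d = successive_distances(toy)
    print(d)
    print(sorted(histogram(d, bin_width=5).items()))
```

The histogram returned here could be plotted or compared against the spacing expected under one of the random models in the questions above.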

RNA sequence

3' UTR sequence of the Xenopus cyclin B1 cDNA

agg actacgtggc attccaattg tgtattgttg gcaccatgtg cttctgtaaa tagtgtattg 
    tgtttttaat gttttactgg ttttAATAAA gctcatttta acatg

We rewrite the previous 3' UTR EST, which is cDNA (not RNA), without intervening spaces.

aggactacgtggcattccaattgtgtattgttggcaccatgtgcttctgtaaatagtgtattgtgtttttaatgttttactggttttAATAAAgctcattttaacatg

We write and apply a Python program to compute the reverse complement of the previous sequence, and then apply RNAfold from the Vienna RNA Package to the result.

catgttaaaatgagcTTTATTaaaaccagtaaaacattaaaaacacaatacactatttacagaagcacatggtgccaacaatacacaattggaatgccacgtagtcct
..((((..((((...((((((......)))))).))))...))))................((...((.((((.((((..........))))...)))).))..))..
 minimum free energy =  -9.90 kcal/mol
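The reverse-complement step described above can be sketched in a few lines of Python. This is not the original program, only a minimal version under the assumption that the input uses the DNA alphabet (a, c, g, t in either case); the function name is illustrative. Case is preserved so that the upper-case hexanucleotide stays visible in the output.

```python
# Complement table for the DNA alphabet; letter case is preserved.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

# 3' UTR of the Xenopus cyclin B1 cDNA, as given in this section.
CYCLIN_B1 = (
    "aggactacgtggcattccaattgtgtattgttggcaccatgtgcttctgtaaatagtgtattg"
    "tgtttttaatgttttactggttttAATAAAgctcattttaacatg"
)


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence.

    Whitespace is removed first, so a sequence written in blocks of
    ten nucleotides can be passed in directly.
    """
    compact = "".join(seq.split())
    return compact.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement(CYCLIN_B1))
```

The printed sequence can then be given to RNAfold, which reads sequences from standard input.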