Rfam-based RNA design Manual

Designing synthetic RNAs using the 3-step pipeline

You can bypass Steps 1 and 2 to go directly to Step 3, where you enter a target secondary structure and constraints. This allows you complete freedom in RNA design. However, if you wish to design synthetic RNAs to function in a similar manner as RNAs from the Rfam database, then you should follow the 3-step design protocol, which we used to design functional hammerhead type III ribozymes, described in

Complete RNA inverse folding: computational design of functional hammerhead ribozymes.
Dotu I, Garcia-Martin JA, Slinger BL, Mechery V, Meyer MM, Clote P.
Nucleic Acids Res. 2015;42(18):11752-62. doi: 10.1093/nar/gku740. Epub 2014 Sep 10.


  1. Step 1: Fill in your email -- not required, but useful if you wish to be notified by email in the case that computation time is long. Before selecting an Rfam family if you want include to sequences whose minimum free energy structure is not identical to the alignment consensus structure mark the checkbox.
    • In the default mode (unchecked) only those Rfam families which contains at least one sequence whose MFE structure is the target structure are shown. In this case step 2 will display only those sequences in the selected family whose MFE structure is identical to the consensus structure.
    • If this field is checked all Rfam 12.0 families are available for desig and Step 2 will display all sequences in the selected family ordered by base pair distance between the MFE structure and the consensus structure.
    Then select an Rfam family -- in the case of hammerhead III ribozymes, this is RF00008. Next, select the energy model -- either Turner'99 or Turner'04, whose energy parameter are described at the NNDB (Nearest Neighbor Database. The choice does not matter, though it is commonly held that Turner'04 parameters are more accurate. This is not necessarily the case, since RNAfold predicts the correct, functional structure for Peach Latent Mosaic Viroid (PLMVd) hammerhead ribozyme using the Turner'99 parameters (left image), while the incorrect structure is predicted using the Turner'04 parameters (right image). Also choose the treatment of dangles, also known as single-stranded, stacked nucleotides. Choices are no dangle energy parameters are used (-d 0), or the minimum energy choice is made in the case a single nucleotide can be treated as a 5'-dangle, 3'-dangle, or double-dangle (-d 1), or the choice is to treat a nucleotide as a double-dangle when available (-d 2). For design of our functional hammerheads, we selected Turner'99 energy parameters, with dangle treatment of minimum free energy (-d 1).


  2. Step 2: The behavior in this step is slightly different depending on if sequences whose minimum free energy structure is identical to the alignment consensus structure are allowed:
    • If you are using the default mode that shows only sequences whose minimum free energy structure is identical to the alignment consensus structure it is possible that there is no sequence for the specified energy model and dangling treatment (e.g. RF00008 and Turner'04). In this case you will see the message: "No matching sequences in RFXXXXX family using Vienna X.X.X (dangling end treatment X)"
      Otherwise you will see a pull-down tab of the list of Rfam seed alignment sequences, whose minimum free energy (MFE) structure agrees with the Rfam concensus structure. In our case, there is only one sequence with this property -- PLMVd with EMBL accession code AJ005312.1/282-335.
    • If you chose including sequences whose minimum free energy structure is not identical to the alignment consensus structure you the list will be a prioritized list of sequences, whose MFE structures differ from the Rfam consensus structure by the fewest number of base pairs. The number of base pairs on which the MFE structure differs to the Rfam consensus structure is between parenthesis.
      For this case you have the option to select the MFE structure or the consensus structure as design target.

    Next, you should specify a minimum conservation level, which will force all designed sequences to have the same nucleotides at all positions, whose sequence identity in the Rfam seed alignment is at least this level. The default value is 95%, the value we used in our hammerhead design. Then click "Get Conserved Positions" in order to see the percent sequence conservation in the Rfam seed alignment of various positions.
    You may choose to force those non-conserved positions to disagree with the specified Rfam sequence -- otherwise your "designed" sequence could end up to be similar or identical to the Rfam sequence.
    After these steps, we see that the target structure and sequence constraints (using IUPAC single-letter nucleotide codes) is as follows:

    Structure .((((((.(((((...))))).......((((........))))...)))))).
    Sequence: GAUGAGUCUGUGCUAAGCACACUGAUGAGUCUAUGAAAUGAGACGAAACUCAUA

    Now, click the button "Continue to Step 3" in the lower right portion of the screen of Step 2. This is important, since if you select the tab "Step 3" at the top of the screen, you will not carry over the target structure and sequence constraints from Step 2. To be absolutely clear, if you select the tab "Step 3" at the top of the screen, then you will be taken to a blank form where you need to fill in the target structure and constraints. This tab is used only for sequence design, where you have complete freedom and choose not to use the pipeline.

  3. Step 3: On the screen of Step 3, you may add additional sequence constraints, and even, in the case of designed messenger RNA, of constraints that the triplet codons code for a specific amino acid sequence -- here the length of the amino acid sequence must be exactly one-third the length of the desired nucleotide sequence. You may require that the sequences returned by RNAiFold not only fold into the target structure, but additionally be compatible with another secondary structure. You may prohibit certain base pairs from occurring by specifying a list of disallowed base pairs. If i j is a prohibited base pair, then the i-th and j-th positions of sequences returned by RNAiFold cannot be occupied by nucleotides that could form either Watson-Crick or wobble pair. Finally, you may specify the temperature for folding, and specify additional constraints, such as a range for the GC-content, a maximum number of successive nucleotides of the same type, and a range for different types of base pairs. This latter means that in the target structure, you can control the number of base pairs of various types. In our case, we selected hammerhead candidates with different levels of GC-content, not shown here. Finally, you should submit the job.

Complete the form at Step 1, then click the button "Continue to Step 2" (located in the lower right portion of the screen), subsequently complete the form at Step 2, then click the button "Continue to Step 3" (located in the lower right portion of the screen), and finally complete the form at Step 3 and launch the computation.


RNA-CPdesign Manual

RNA-CPdesign is an easy, fast method to find RNA sequences that fold into a desired target secondary structure.

Though not required, it is best to enter your email address, since the results will then be sent to by email if that field is filled in. If the email address is not entered, then it is necessary to leave a browser window open until the computation terminates and results are returned.

INPUT



The desired target structure may be pasted into the text field, or a file containing the target structure may be uploaded, or a verbose form may be used, discussed later in the section Verbose Form. Optional FASTA comment and optional sequence constraints may be included. The input must be in the following form:

  • > FASTA comment (optional)
  • target structure in dot bracket notation (required)
  • sequence constraints (optional)

where any permutation of the order of lines is allowed. If sequence constraints are given, then the input sequence length must be the same as that of the structure. An example is given below.

Example:

(((((.(..((((.........)))).(((((.......))))).....(((((.......)))))).))))).

or for an example with FASTA comment and sequence constraints:

> tRNA structure with sequence constraints
(((((.(..((((.........)))).(((((.......))))).....(((((.......)))))).))))).
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCUGGCUCG

where GC-content can be required to be from 40% to 60%, by giving thes bounds in the search constraints, discussed later. Note the presence of 'ACG' in positions 35-37 and 'CUGGCUCG' in positions 67-74. The sequence constraints ensure that any sequence returned by CPdesign have the stipulated nucleotides at the indicated positions ('N' mean any nucleotide). Allowable sequence constraints are 'A', 'C', 'G', 'U', 'N', 'B'='not A'='C' or 'G' or 'U', etc. -- i.e. any 1-character IUPAC code is permitted in the sequence constraint line.

In this example, the desired target structure is the cloverleaf structure of a transfer RNA, and the sequence constraints ensure an 'ACG' anticodon, corresponding the the CGU codon for arginine. As well, the last eight nucleotides 'CUGGCUCG', corresponding to the right side of the acceptor stem (binding site for aminoacyl synthetase), are the last eight nucleotides in a known arg-tRNA of the Chlamydomonas reinhardtii chloroplast, with EMBL accession code L13782.1/442-515. This example is interesting, since the Rfam consensus secondary structure of L13782.1/442-515 is the following:

> L13782.1_442-515
GAGCUUGUAGCUCAGUGGACUAGAGCACAUGGCUACGAACCAUGGGGUCGGGGGUUCGAAACCCUCCUGGCUCG
(((((.(..((((.........)))).(((((.......))))).....(((((.......)))))).))))).

while the minimum free energy structure of this (real) tRNA is not a cloverleaf structure, but rather the following:

> MFE structure for L13782.1_442-515
GAGCUUGUAGCUCAGUGGACUAGAGCACAUGGCUACGAACCAUGGGGUCGGGGGUUCGAAACCCUCCUGGCUCG
.(((((((.((((((....)).))))))).))))..........((((((((((.........)))))))))).

having -26.5 kcal/mol. In other words, none of the base pairs in the MFE structure appear in the biologically functional, consensus structure. The output of CPdesign is discussed below, in the OUTPUT section.

General search parameters and energy model


User can specify the maximum number of solutions returned by RNA-CPdesign by selecting, 1, 5, 10, 50 or all the solutions found in the in the webserver time limit (currently set to 120 minutes).

RNA-CPdesign can use either Turner '99 and Turner'04 energy models implemented in Vienna RNA 1.8.5 and 2.1.7 versions respectively, allowing different dangling end treatment:

  • No dangle energy parameters are used (-d 0)
  • Minimum energy choice is made in the case a single nucleotide can be treated as a 5'-dangle, 3'-dangle, or double-dangle (-d 1)
  • Treat a nucleotide as a double-dangle when available (-d 2)

It is also possible to define the folding temperature, set by default at 37 ºC.

Search constraints


One of the main features of RNA-CPdesign is the flexibility this software provides for RNA synthetic design due to the wide variety of design contraints included.
RNA-CPdesign allows the user to give additional constraints on all returned sequences. For usability constraints are organized in four sections:

  • Base pair constraints
    • Compatible base pairs: RNA-CPdesign allows the user to specify those positions in the returned sequence(s) that can form a base pair, even if these base pairs are not part of the target structure. In this fashion, one could design an RNA whose MFE structure is the given target structure, but which is compatible with another structure. To specify these positins, the input should include the desired "compatible" structure, given in dot bracket notation.

    • Incompatible base pairs: RNA-CPdesign allows the user to specify positions in the returned sequence(s) that cannot form a base pair.

      These positions can be specified in two different formats:
      • Dot bracket notation: A secondary structure where paired positions are disallowed in the solution. An example is given below.


      • Base pair list: A list of base pairs or base pair ranges separated by commas. Single base pairs are represented by 2 numbers (i j) and base pair ranges by the expression (P i j k), indication that i cannot form a base pair with any position between j and j+k-1.
        The following example represents the same incompatible base pairs of the previous example using dot bracket notation:
        Another example using ranges, note that in this case position 14 cannot pair with positions 19 or 20, but there are no incompatible base pairing constraints for position 15.


  • Amino acid constraints

    RNA-CPdesign allow three ways of constraining the amino acid sequence encoded by the output nucleotide sequences.

    • Compatible amino acid sequence: Solutions must code the given amino acid sequence, it is possible to use wildcards that represent a set of amino acids. For this purpose we created a dictionary for the most common amino acid classifications. If you want to use this type of amino acid constraints, write your sequence in the following box:


    • Compatible amino acid sequence based on similarity: A second way is auto generate amino acid constraints, allowing only codons that are similar to a target amino acid sequence. This similarity constraints can be:
      • BLOSUM62 matrix based: Allow amino acids with a similarity score over a given threshold (-4 to 4)
      • Classification based: We propose two classifications defined in our dictionary.


      If you want to use this type of amino acid constraints, fill out "Target amino acid sequence" field, set "Expand search to similar amino acids based on" to the desired option and leave "Search maximizing blosum similarity score" unchecked.


    • Maximize blosum score: Given a target amino acid sequence, RNA-CPdesign can find a nucleotide sequence that encodes an amino acid sequence where the sum of BLOSUM62 similarity scores is maximum. You reduce the run time of by constraining the search only to those amino acids that are similar to the target sequence selecting the desired similarity criteria.

      To enable maximization of blosum score, check "Search maximizing blosum similarity score" box, input your target amino acid sequence in "Target amino acid sequence" and optionally select the desired similarity criteria in "Expand search to similar amino acids based on".
    • Starting positions of amino acid sequences: This field indicates the starting nucleotide from which amino acid target sequences and constraints apply. In this fashion is possible to select the appropriate the reading frame.

    • Using multiple amino acid target/constraints: Multiple target amino acid sequences, amino acid constraints and starting are allowed. To use this feature enter the amino acid sequences and starting position in the respective field separated by commas taking in account that:
      • It is possible to add a target and an amino acid constraint at the same time, but they must have the same length. When both target amino acid sequence and amino acid constraints are in the same position target amino acid sequence will not be expanded independently of the value of "Expand search to similar amino acids based on". Thus this should only be done for blosum score maximization.

        In the following example the objective is to maxmimize the blosum score with respect to sequences FFREDLAFPQGKAREFS starting at position 1 and FLGKIWPSHKGRPGNFL starting at position 2. Sowever the search is constrained to sequences that fit into ==R-DLA@P&G+A+%=& constraint starting at position 1 and =LG+=WPSH+=+PG&FL constraint starting at at position 2. Nothe that "Blosum score > -1" is ignored because there are specific amino acid constraints.
      • It is possible to have empty cases in both amino acid target or constraints. In the following example the target sequence starting at position 1 is expanded using Classification 1 so any solution similar to IFSSLPGLVPKGKE is valid for positions 1-42, however amino acids at position 2-10 must code exactly for FSL.

  • Energy constraints
      Free energy: Allows to set limits to constraint the search to sequences whose free energy (in kcal/mol) when they are folded into the minimum free energy structure is in a specific range.

  • Nucleotide composition
    • Limit GC content: Minimum and máximum percent of GC-content (number of either G or C nucleotides divided by sequence length). Default is 0% for mímimum and 100% for máximum.

    • Limit consecutive nucleotides: Minimum and máximum number of consecutive nucleotides of each type (A,C,G,U) in all returned sequences. Note that the value 0 (zero) indicates that the MFE structure (target structure) of any returned sequence cannot contain any base pair of the specified type. Limit of consecutive nucleotides can be constrained to specific regions by indicating the start and end positions. The syntax is a series of conditions separated by commas and each must be either specifyng the number of nucleotides (N) or three numbers indicating the number of nucleotides, starting position and end position (N Start End).

    • Limit canonical base pairs: Minimum and maximum number of base pairs in the MFE structure of any returned sequence. Values can be stipulated for strong (GC), weak (AU) and wobble (GU) pairs.

    • Limit number of bases: Minimum and maximum number of occurences of each nucleotide in the MFE structure of any returned sequence. Minimum and maximum number of nucleotides can be constrained to specific regions by indicating the start and end positions. The syntax is a series of conditions separated by commas and each must be either specifyng the number of nucleotides (N) or three numbers indicating the number of nucleotides, starting position and end position (N Start End)

Constraints included in the input file


All constraints can be included directly into the input file using the appropriate label preceeded by the "pound" symbol ("#") and writing the value in the next line. Valid labes are:

  • MAXsol: Maximum number of solutions
  • dangles: Dangling end treatment
  • temp: Target temperature

  • RNAcompstr: Compatible RNA secondary structure
  • IncompBP: List of incompatible base pairs
  • AAtarget: List of target amino acid sequences for blosum similarity score maximization
  • AAseqcon: List of input amino acid sequences constraints
  • AAstartPos: List of starting positions for amino acid target and constraints.
  • AAsimilCstr: Blosum similarity threshold (-4 to 4, 5 forces aminoacids to be equal to those specified) or allow similar amino acids based on a specific classification (6- Classification 1, 7-Classification 2 )
  • MaxBlosumScore: Maximize blosum score of target sequences

  • minGCcont: Minimum GC content
  • maxGCcont: Maximum GC content
  • minAU: Minimum number of AU base pairs
  • maxAU: Maximum number of AU base pairs
  • minGC: Minimum number of GC base pairs
  • maxGC: Maximum number of GC base pairs
  • maxGU: Maximum number of GU base pairs
  • minGU: Minimum number of GU base pairs
  • consA: Maximum number of consecutive As
  • consC: Maximum number of consecutive Cs
  • consG: Maximum number of consecutive Gs
  • consU: Maximum number of consecutive Us
  • minA: Minimum number of As
  • maxA: Maximum number of As
  • minC: Minimum number of Cs
  • maxC: Maximum number of Cs
  • minG: Minimum number of Gs
  • maxG: Maximum number of Gs
  • maxU: Maximum number of Us
  • minU: Minimum number of Us

Example input files


  1. This input file contains commands to design a sequence that folds into a hammerhead ribozyme structure, contains sequence constraints for nucleotides having 95% sequence conservation in the Rfam alignment with AJ005312.1/282-335, sequence constraints for positions having less than 95% conservation to be distinct from the nucleotide present at that position in AJ005312.1/282-335, and to be compatible with another structure having only one stem-loop. RNAiFold 2.0 returns the following solution: AGGCGGUAAC CCGAUCCGGG UCUGAAGAGC UCGAGUUAAA GGGCGAAACC GCCC.
    > Ex 1: Designing synthetic hammerhead ribozymes
    .((((((.(((((...))))).......((((........))))...)))))).
    HBVHBGUDVHVHDVBBHDBDBCUGAVGAGVDVBVGBBAVHBGBCGAAACVDBVB
    #MAXsol
    1
    #dangles
    1
    #RNAcompstr
    ................(((((.......))))).....................
    #temp
    37
    #LNS
    0
    
  2. This input file contains commands to design a sequence that folds into GNRA-tetraloop at positions 6-9 and in which the first position is prohibited from pairing with any other nucleotide in this 27-nt sequence (P 1 2 26), position 3 is prohibited from pairing with positions 16 and 17 (P 3 16 2), and the nucleotides at positions 4,17 and 4,18 and 4,19 are prohibited from pairing. Note also that a partial target structure or fragment is given, where commas appear at position 16, 18-25, 27. All positions that contain a comma may be paired or unpaired; positions containing a dot must be unpaired; positions containing matching left and right parentheses must be paired together.

    WARNING: It is very important to stipulate incompatible (prohibited) base pairs with no space after each separating comma, and with no more than one space occurring between successive integers; e.g. "P 3 7 2,P 5 12 4,7 15,8,14", where mixing of syntax is permitted.
    > Ex 2: Partial target, IUPAC codes, incompatible base pairs
    .((((....))))..,(,,,,,,,,),
    ANNNNGNRANNNNNNNNNNNNNNNNNN
    #MAXsol
    1
    #IncompBP
    P 1 2 26,P 3 16 2,4 17,4 18,4 19
    #dangles
    1
    #temp
    37
    #LNS
    0
    
  3. This input file contains commands to design a sequence to produce a programmed -1 ribosomal frameshift at the Gag-Pol junction in retroviruses, such as that found in HIV-1. Solutions are required to begin by the known slippery sequence UUUUUUA, which reads as U UUU UUA in the Gag reading frame and as UUU UUU A in the Pol reading frame. The indicated peptides FFREDLAFPQGKAREFS [resp. FLGKIWPSHKGRPGNFL] are coded in the Pol reading frame (1631-1681) [resp. Gag reading frame (1632-1682)], where nucleotide positions refer to the complete HIV-1 genome in GenBank file AF033819.3.
    > Ex 3: overlapping amino acid constraints
    #RNAscdstr
    ......((((((..((((((((((((....))))))))))))...)))))).
    #RNAseqcon
    UUUUUUANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    #AAseqcon
    FFREDLAFPQGKAREFS,FLGKIWPSHKGRPGNFL
    #AAstartPos
    1,2
    #MAXsol
    0
    #AAsimilCstr
    3
    

Verbose Form


An alternate input format is given in the advanced form, depicted below. Here, the user may enter the formerly described target structure, sequence constraints, and constraints for compatible and incompatible base pairs. Any field may be left blank, with the exception of the field for the target structure.

In addition, the user can stipulate the temperature at which folding occurs (default is 37 degrees Celsius), and treatment of dangles. Since CPdesign uses a plug-in for Vienna RNA Package , the treatment of dangles (stacked, single-stranded nucleotides) follows that of the Vienna Package -- the user is referred to Vienna Package web site for further explanation. Default dangle treatment for CPdesign is 2, which is the default for Vienna Package partition function computation (RNAfold -p). However, the user can choose dangle treatment 0 (no dangle), 1 (default for Vienna Package minimum free energy structure computation), 2 (default for Vienna Package partition function computation), 3 (Vienna Package treatment of dangles and coaxial stacking). For technical reasons, CPdesign requires much less ocmputation time when dangle treatment is 2, rather than 1.

Cofold


RNA-CPdesign allows cofolding: given a hybridization structure, together with optional constraints, CPdesign returns two sequences, whose MFE hybridization is the input, target structure. Note that a hybridization may involve inter-molecular base pairs, as well as intra-molecular base pairs, provided that there are no pseudoknots.

The input format for the RNA inverse cofolding problem consists of the following form:

  • > FASTA comment (optional)
  • target hybridization structure, with ampersand ('&') (required)
  • sequence constraints with ampersand ('&') (optional)

The target hybridization structure is represented in dot bracket notation, following the convention of RNAcofold of the Vienna RNA Package . Namely, an ampersand ('&') separates the two single-molecule portions of the hybridization structure. Sequence constraints are represented by the concatenation of two sequences, separated by ampersand, where as before, the user may indicate that certain positions in all returned sequences must have specific nucleotides in indicated positions. The sequence constraint is optional, but if provided, it must have same length as the input hybridization structure including the ampersand.

In order to represent the hybridization of an unknown sequence of 20 nt with another unknown sequence of 10 nt, where there is a stem-loop structure from positions 1-10 for the unknown sequence of 20 nt, while positions 11-20 of the first sequence hybridize with positions 1-10 of the second sequence, we give the dot bracket notation (with ampersand): (((....))).(((((((((&))))))))). which corresponds to the following hybridization figure:

OUTPUT


If an email address is entered, then an email will be sent as soon as the results are available.

Three possible types of results can be returned:

  • Solution found: Results for the first sequence are displayed. A file of results will be attached, with the following information:
    • Input target structure
    • GC content and the number of base pairs of each type (strong, weak and wobble).
    • Amino acid sequence and blosum score if blosum maximization is active.
    • Free energy of the structure in kcal/mol
    • Additional measures
  • No solution found: Search time is limited to 120 minutes, so a solution may not be found within this time limit.

  • No possible solution: If the target structure (with specified constraints) has no solution and the time limit has not been reached, then CPdesign will state this in the results file.

On the earlier-explained example of a target transfer RNA structure with the sequence constraints for 'ACG' anticodon and 'CUGGCUCG' portion of acceptor, CPdesign returns the following:


CPdesign returns an input and output file. The input file contains the user-specified input, including constaints. The output file includes the solution sequence(s) and its MFE structure (guaranteed to be the target structure), the number of types of base pairs, the free energy of the structure in kcal/mol, as well as a number of values that measure the extent to which the ensemble of low energy structures resembles the MFE target structure. Measures related to quantifying the extent to which the ensemble of low energy structures resemble the target structure are defined in

Juan Antonio Garcia-Martin, Peter Clote, Ivan Dotu.
RNAiFold: A constraint programming algorithm for RNA inverse foldingand molecular design.
J Bioinform Comput Biol 11(2): 1350001, 2013.

All of these measure are defined in terms of the base pairing probabilities pi,j.
Expected pointwise entropy is defined by:

Expected base pair distance from the target structure S0 is:

where the Boltmann probability of the target structure is defined by

where Z denotes the partition function. Ensemble defect is defined to be the expected number of nucleotides, whose base pairing status differs from that of the target structure, taken over the ensemble of secondary structures of the solution sequence. Formally, if a denotes the sequence returned by CPdesign and S0 denotes the target structure, then ensemble defect n(a,S0) is defined by:
where p*i,j denotes the base pairing probability, provided that 1 ≤ i,j ≤ n and i,j are distinct, and otherwise p*i,n+1 is defined to be 1 minus the sum of p*i,j, for all 1 ≤ i,j ≤ n. As well, I denotes the indicator function, taking the value 1 if the expression in square brackets is true, and 0 otherwise. Since ensemble defect is a distance measure, the closer this value is to zero, the more most low energy structures resemble the target structure.
Probability of structure is the Boltmann probability of the target structure defined by:

Vienna structural diversity is:

Morgan Higgs structural diversity is: