Introduction to DiANNA

DiANNA (DiAminoacid Neural Network Application) is a web server that provides two services:

cysteine classification prediction
disulfide connectivity prediction

DiANNA 1.1 determines the cysteine species (free cysteine, half-cystine or ligand-bound) using a support vector machine (SVM) with a degree 2 polynomial kernel applied to the spectrum representation. Additionally, if a cysteine is predicted to be ligand-bound, the most likely of the four most common ligands (iron, zinc, cadmium, carbon) is proposed.

DiANNA 1.1 predicts the disulfide connectivity using a state-of-the-art method involving a neural network with a novel architecture. By disulfide connectivity we mean, for example, in the case of four half-cystines, determining whether (1,2) and (3,4) are the disulfide bonds, or (1,3) and (2,4) are the disulfide bonds, etc.
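To give a sense of how many alternatives the connectivity problem involves, the following minimal sketch counts the distinct ways of pairing up 2n half-cystines into n bonds, using the standard formula (2n)! / (2^n n!). The function name is hypothetical and the snippet is only an illustration; it is not part of DiANNA.

```python
from math import factorial

def count_connectivity_patterns(n_bonds):
    """Number of ways to pair 2 * n_bonds half-cystines into n_bonds bonds."""
    return factorial(2 * n_bonds) // (2 ** n_bonds * factorial(n_bonds))

# Four half-cystines (two bonds) admit three patterns:
# (1,2)(3,4), (1,3)(2,4) and (1,4)(2,3).
for n in range(2, 6):
    print(n, "bonds:", count_connectivity_patterns(n), "patterns")
```

For two bonds this gives 3 patterns, and for five bonds 945, which is why the choice among patterns has to be made by scoring rather than by inspection.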

Cysteine classification prediction
A ternary classification is attempted by applying a support vector machine (SVM) with spectrum kernel to determine whether a cysteine is reduced (free, in the sulfhydryl state), a half-cystine (involved in a disulfide bond) or bound to a metallic ligand. In the last case, DiANNA predicts the ligand among iron, zinc, cadmium and carbon. The SVMs are trained and tested on a non-redundant list of proteins in which each of the three classes is well represented (the complete list is available here). To apply SVMs to the ternary cysteine classification problem, we must encode amino acid sequences (the contents of windows of size w) as vectors with real coordinates. To that end, we use the spectrum representation of Leslie et al., which proved more effective than the more sophisticated mismatch (Leslie et al.) and profile (Kuang et al.) representations for amino acid sequences. We then use libSVM with a polynomial kernel (degree 2) to train a ternary predictor. A similar approach is used for binary classification of each pair of cysteine classes, i.e. ligand-bound vs. half-cystines, ligand-bound vs. free cysteines, and half-cystines vs. free cysteines. All three binary classifiers are available in addition to the ternary classifier.
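As a rough illustration of the encoding step described above, the sketch below maps a window to its k-mer counts (the spectrum representation) and evaluates a degree 2 polynomial kernel in the form libSVM uses, (gamma * u.v + coef0)^degree. The value k = 2, the toy windows, and the gamma and coef0 settings are assumptions chosen for illustration; the parameters DiANNA actually uses are not stated here.

```python
from collections import Counter

def spectrum(window, k=2):
    """Count every length-k substring (k-mer) of the window."""
    return Counter(window[i:i + k] for i in range(len(window) - k + 1))

def dot(u, v):
    """Inner product of two sparse k-mer count vectors."""
    return sum(count * v[kmer] for kmer, count in u.items())

def poly_kernel(u, v, degree=2, gamma=1.0, coef0=1.0):
    """Polynomial kernel in libSVM's form; gamma and coef0 are assumed values."""
    return (gamma * dot(u, v) + coef0) ** degree

a = spectrum("ACCDC")  # 2-mers: AC, CC, CD, DC
b = spectrum("CCDCA")  # 2-mers: CC, CD, DC, CA
print(dot(a, b))          # shared 2-mers: CC, CD, DC
print(poly_kernel(a, b))  # (1.0 * 3 + 1.0) ** 2
```

Storing the counts sparsely avoids materialising the full 20^k-dimensional vector, which is convenient for short windows where most k-mers never occur.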
Disulfide bond prediction
A diresidue neural network (Ferre and Clote) is trained to recognize pairs of bonded half-cystines, given as input the symmetric flanking regions of the two half-cystines. The network is trained using disulfide bond information derived from high-quality protein structures (the complete list is available here, and is derived from the Vullo and Frasconi and Fariselli et al. papers). The neural network input includes evolutionary as well as secondary structure information.
Given two windows of size w, centered at the N-terminal and the C-terminal putative half-cystine respectively, we run PSIPRED on the whole input sequence to predict the secondary structure (helix, coil, sheet) of each of the 2w residues, then we use the PsiBlast run performed by PSIPRED to produce the profile of each position 1 ≤ i ≤ 2w.
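The window construction can be sketched as follows. This is only an illustration of how the 2w residues for each candidate pair might be collected; the padding character used at the sequence ends, the requirement that w be odd, and the function names are assumptions, and the PSIPRED and PsiBlast steps are not reproduced.

```python
from itertools import combinations

def window(seq, center, w, pad="-"):
    """Return the size-w window centered at index `center` (w assumed odd)."""
    if w % 2 == 0:
        raise ValueError("w must be odd so the cysteine sits in the middle")
    half = w // 2
    padded = pad * half + seq + pad * half
    return padded[center:center + w]

def candidate_pairs(seq, w):
    """Yield ((i, j), 2w residues) for every pair of cysteines with i < j."""
    cysteines = [i for i, aa in enumerate(seq) if aa == "C"]
    for i, j in combinations(cysteines, 2):
        yield (i, j), window(seq, i, w) + window(seq, j, w)

for pair, residues in candidate_pairs("ACDEFCGHC", w=5):
    print(pair, residues)
```

Each emitted string has length 2w, matching the positions 1 ≤ i ≤ 2w for which secondary structure and profile information would then be looked up.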
Trained and tested on disulfide bonds extracted from a list of proteins having at least two and at most five bonds, the software achieves 81% accuracy and 43% Matthews' correlation coefficient (see Ferre and Clote). The connectivity prediction (i.e. the prediction of disulfide bond partners) is obtained using Ed Rothberg's implementation of the Edmonds-Gabow maximum weight matching algorithm (wmatch). This algorithm is applied to the graph whose nodes are the putative half-cystines and whose edges are pairs of half-cystines weighted by the diresidue neural network in the disulfide bond prediction module.
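The matching step can be illustrated with a small sketch. Note that this is not the Edmonds-Gabow algorithm and not wmatch: it simply enumerates every perfect matching and keeps the one with the highest total weight, which is feasible only because the number of bonds is small. The edge weights below are made-up values standing in for neural network scores.

```python
def perfect_matchings(nodes):
    """Yield every way to pair up an even-length list of nodes."""
    if not nodes:
        yield []
        return
    first, rest = nodes[0], nodes[1:]
    for i, partner in enumerate(rest):
        for matching in perfect_matchings(rest[:i] + rest[i + 1:]):
            yield [(first, partner)] + matching

def best_connectivity(nodes, weight):
    """Exhaustive search for the maximum weight perfect matching."""
    return max(perfect_matchings(nodes),
               key=lambda m: sum(weight[pair] for pair in m))

# Hypothetical pair scores for four putative half-cystines.
scores = {(1, 2): 0.9, (1, 3): 0.2, (1, 4): 0.4,
          (2, 3): 0.3, (2, 4): 0.1, (3, 4): 0.8}
best = best_connectivity([1, 2, 3, 4], scores)
print(best)
```

With these scores the pattern (1,2), (3,4) has total weight 1.7, against 0.3 and 0.7 for the two alternatives. A polynomial-time matching algorithm such as the one cited above becomes preferable as the number of cysteines grows.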
After training and testing on a list of proteins having at least two and at most five bonds, the connectivity prediction achieves a rate Qp of 49% for perfect predictions (i.e. the fraction of proteins for which no false positive or false negative predictions are made), 86% accuracy and 51% Matthews' correlation coefficient (see Ferre and Clote).
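For readers unfamiliar with these measures, the sketch below shows one common way to compute them from predicted and true bond sets, treating every pair of half-cystines in a protein as one binary decision. The exact definitions used by Ferre and Clote may differ in detail; the data structure and the toy example are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt

def evaluate(proteins):
    """proteins: list of (cysteines, true_bonds, predicted_bonds) triples."""
    tp = fp = fn = tn = perfect = 0
    for cysteines, true_bonds, predicted in proteins:
        true_set, pred_set = set(true_bonds), set(predicted)
        perfect += true_set == pred_set  # no false positives or negatives
        for pair in combinations(cysteines, 2):
            t, p = pair in true_set, pair in pred_set
            tp += t and p
            fp += p and not t
            fn += t and not p
            tn += not t and not p
    qp = perfect / len(proteins)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return qp, accuracy, mcc

# Two toy proteins with four half-cystines each: one predicted
# perfectly, one with both bonds wrong.
toy = [
    ([1, 2, 3, 4], [(1, 2), (3, 4)], [(1, 2), (3, 4)]),
    ([1, 2, 3, 4], [(1, 3), (2, 4)], [(1, 2), (3, 4)]),
]
qp, accuracy, mcc = evaluate(toy)
print(qp, accuracy, mcc)
```

On this toy data Qp is 0.5, since one of the two proteins is predicted without error, while accuracy and the correlation coefficient are computed over all twelve candidate pairs.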

References:

Fariselli P, Riccobelli P, Casadio R. (1999). Role of evolutionary information in predicting the disulfide-bonding state of cysteine in proteins. Proteins 36(3):340-6. PubMed
Fariselli P, Casadio R. (2001). Prediction of disulfide connectivity in proteins. Bioinformatics 17(10):957-64. PubMed
Ferre F, Clote P. (2005). Disulfide connectivity prediction using secondary structure information and diresidue frequencies. Bioinformatics. PubMed
Kuang R, Ie E, Wang K, Wang K, Siddiqi M, Freund Y, Leslie C (2004). Profile-based string kernels for remote homology detection and motif extraction. Proc IEEE Comput Syst Bioinform Conf. PubMed
Jones DT. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 292(2):195-202. PubMed. Web Site.
Leslie C, Eskin E, Noble WS (2002). The spectrum kernel: a string kernel for SVM protein classification. Pac Symp Biocomput. PubMed
Leslie CS, Eskin E, Cohen A, Weston J, Noble WS (2004). Mismatch string kernels for discriminative protein classification. Bioinformatics 20(4):467-76. PubMed
libSVM. Web Site.
Ed Rothberg's wmatch. Web Site.
Vullo A, Frasconi P. (2004). Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics 20(5):653-9. PubMed
