Protein structure prediction

While genome-wide sequencing projects produce torrents of sequence information about genes, transcripts and proteins, experimental methods for determining protein tertiary and quaternary structure lag considerably behind the need and face multiple technical difficulties. This gap necessitates the development of prediction methods that facilitate access to reliable macromolecular models and provide insights into structural features of proteins without solving those structures experimentally. The prediction can be simplified by projecting a 3D structure onto strings of structural assignments, which casts it as a classification problem. For example, one can assign a secondary structure state to each residue, or a number representing the solvent accessibility of that residue. Such strings of per-residue assignments are essentially one-dimensional (Figure 1). These 1D predictions are often the first step toward gaining insight into the 3D structure and function of a protein.
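As a rough illustration of this projection, the sketch below reduces per-residue structural annotations to 1D strings. It assumes 8-state DSSP-style secondary structure codes, reduced to three states by one common convention (H, G, I to 'H'; E, B to 'E'; everything else to 'C'), and it normalizes absolute accessible surface area by per-residue maxima. The function names and the small MAX_ASA table (a few values from the commonly cited Rost and Sander scale) are assumptions made for illustration and are not taken from SABLE.

```python
# Hypothetical helper names; the 8-to-3 state reduction below is one common
# convention, not necessarily the one used by any specific predictor.
DSSP_TO_3STATE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}

# Maximum accessible surface area (square angstroms) for a few residue types,
# from the commonly cited Rost and Sander scale (assumption, incomplete table).
MAX_ASA = {"A": 106.0, "G": 84.0, "L": 164.0, "K": 205.0}


def to_three_state(dssp_codes: str) -> str:
    """Map per-residue 8-state codes to 'H', 'E' or 'C'."""
    return "".join(DSSP_TO_3STATE.get(code, "C") for code in dssp_codes)


def relative_accessibility(sequence: str, asa: list[float]) -> list[float]:
    """Return RSA in percent (capped at 100), one value per residue."""
    if len(sequence) != len(asa):
        raise ValueError("sequence and asa must have the same length")
    return [min(100.0, 100.0 * a / MAX_ASA[res]) for res, a in zip(sequence, asa)]


def rsa_to_digits(rsa: list[float]) -> str:
    """Project RSA values onto a string of digits 0-9 (10% wide bins)."""
    return "".join(str(min(9, int(value // 10))) for value in rsa)
```

Binning RSA into digits is only one possible string encoding; the real-valued RSA can equally be kept as a list of numbers, one per residue.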

Accurate prediction of the relative solvent accessibility (RSA) of amino acid residues in proteins can facilitate protein structure prediction and functional annotation. Toward that goal, we have developed a novel method (SABLE) for improved RSA prediction (Ref 1). In contrast to other machine learning-based methods in the literature, we do not cast the task as a classification problem with arbitrary boundaries between the classes. Rather, we seek a continuous approximation of the real-valued RSA using non-linear regression with several feed-forward and recurrent neural networks, which are combined into a consensus predictor.
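The difference between the two formulations can be sketched as follows. The snippet combines real-valued RSA estimates from several predictors and only afterwards, if a discrete label is needed, derives a two-state assignment by thresholding. Plain averaging, the predictor callables, and the 25% threshold are assumptions made for illustration; the text above does not specify how the SABLE networks are combined.

```python
from statistics import mean
from typing import Callable, Sequence

# A predictor maps an encoded residue window to an RSA estimate in percent.
Predictor = Callable[[Sequence[float]], float]


def consensus_rsa(features: Sequence[float], predictors: Sequence[Predictor]) -> float:
    """Average the real-valued outputs of several regressors (assumed scheme)."""
    if not predictors:
        raise ValueError("at least one predictor is required")
    estimate = mean(p(features) for p in predictors)
    # Clamp to the valid RSA range of 0 to 100 percent.
    return max(0.0, min(100.0, estimate))


def to_two_state(rsa: float, threshold: float = 25.0) -> str:
    """Optional post hoc projection onto 'exposed' or 'buried'."""
    return "exposed" if rsa >= threshold else "buried"
```

Keeping the regression output continuous means that the class boundary becomes a choice made after prediction rather than a constraint built into training.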

Accurate RSA prediction using SABLE, as shown by validation on a large set of non-redundant proteins (with low or no sequence homology to the training set or to one another), was found to improve the performance of 1D prediction methods for other structural characteristics of proteins. By incorporating predicted RSA into the model, we were able to reduce the number of features used and to improve generalization. Specifically, we showed that predicted RSA can increase the accuracy of predicting secondary structure (Ref 2), trans-membrane domains (Ref 3), and disordered regions and phosphorylation sites (Ref 4).


Figure 1. Projection of protein structural aspects onto strings. For example, α-helices are assigned to 'H', β-sheets to 'E', and unstructured loops to 'C'; the area exposed to solvent can be represented by relative solvent accessibility expressed as a percentage; trans-membrane regions vs soluble parts can be mapped as 'T' and 'N', respectively (here, TM regions are highlighted in yellow).

References

  1. Adamczak R, Porollo A, Meller J. Accurate prediction of solvent accessibility using neural networks-based regression. Proteins. 2004 Sep 1;56(4):753-67. PubMed PMID: 15281128.
  2. Adamczak R, Porollo A, Meller J. Combining prediction of secondary structure and solvent accessibility in proteins. Proteins. 2005 May 15;59(3):467-75. PubMed PMID: 15768403.
  3. Cao B, Porollo A, Adamczak R, Jarrell M, Meller J. Enhanced recognition of protein transmembrane domains with prediction-based structural profiles. Bioinformatics. 2006 Feb 1;22(3):303-9. Epub 2005 Nov 17. PubMed PMID: 16293670.
  4. Swaminathan K, Adamczak R, Porollo A, Meller J. Enhanced prediction of conformational flexibility and phosphorylation in proteins. Adv Exp Med Biol. 2010;680:307-19. PubMed PMID: 20865514.

Protein-protein interaction

The recognition of protein interaction sites is an important intermediate step toward identifying functionally relevant residues and interacting partners and toward understanding protein function, and it can thereby guide experimental efforts. To this end, we propose a novel model representation for the prediction of protein-protein interaction sites that integrates enhanced RSA predictions (from SABLE) with high-resolution structural data. The observation that RSA predictions are biased toward the level of surface exposure consistent with protein complexes led us to investigate the difference between the predicted and actual (i.e., observed in an unbound structure) RSA of an amino acid residue as a fingerprint of interaction sites (Figure 2A).

RSA prediction methods that are based on sequence-derived evolutionary signatures of the local environment tend to be more consistent with the level of solvent exposure observed in complexes than with that observed in unbound structures (Ref 1). The ability of evolution-based RSA prediction methods to account implicitly for long-range contacts should also bias them toward the level of surface exposure observed in protein complexes for sites in interaction interfaces (Figure 2B). Consequently, the difference between the RSA predicted from sequence and the surface exposure observed in an unbound structure can be used for enhanced recognition of protein-protein interaction interfaces. We demonstrated that RSA prediction-based fingerprints of protein interactions significantly improve the discrimination between interacting and non-interacting sites, compared to evolutionary conservation, physicochemical characteristics, structure-derived and other previously considered features (Ref 1). On the basis of these observations, we have developed a new method (SPPIDER) for the prediction of protein-protein interaction sites, which uses machine learning to combine the most informative features into the final predictor. The method showed top performance in the field in a large-scale evaluation (Ref 2).


Figure 2. Novel fingerprints for prediction of protein-protein interaction sites, dSA = RSA(SABLE) - RSA(DSSP). A. SABLE systematically predicts residues at protein interfaces to be more buried than they appear in an unbound structure (dashed line, dSA is negative), whereas non-interacting residues are predicted more accurately (solid line, dSA is closer to 0). B. The concept underlying the observed bias in RSA prediction: an evolution-based RSA predictor can account for long-range contacts consistent with protein complexes.
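The dSA fingerprint defined in the caption can be computed per residue as in the minimal sketch below. It assumes that the predicted RSA (from sequence) and the observed RSA (from an unbound structure) are already available as aligned lists of percentages. The function names and the cutoff of -20 are hypothetical and serve only to show the idea; SPPIDER itself combines this signal with other features through machine learning rather than applying a single threshold.

```python
def dsa_fingerprint(predicted_rsa, observed_rsa):
    """Per-residue dSA = RSA(predicted) - RSA(observed), both in percent."""
    if len(predicted_rsa) != len(observed_rsa):
        raise ValueError("inputs must have the same length")
    return [p - o for p, o in zip(predicted_rsa, observed_rsa)]


def candidate_interface_sites(dsa, cutoff=-20.0):
    """Indices of residues predicted markedly more buried than observed.

    The cutoff is a hypothetical value chosen for illustration only.
    """
    return [i for i, value in enumerate(dsa) if value <= cutoff]
```

For example, a residue predicted at 10% RSA but observed at 50% in the unbound structure has dSA = -40 and would be flagged as a candidate interface site under this illustrative cutoff.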

References

  1. Porollo A, Meller J. Prediction-based fingerprints of protein-protein interactions. Proteins. 2007 Feb 15;66(3):630-45. PubMed PMID: 17152079.
  2. Porollo A, Meller J. Computational Methods for Prediction of Protein-Protein Interaction Sites. In: Protein-Protein Interactions - Computational and Experimental Tools; W. Cai and H. Hong, Eds. InTech 2012; 472: pp. 3-26.

Protein structure analysis

Despite enormous worldwide efforts to provide biologists with computational tools for different types of protein structure analysis, specialized software remains difficult to use and requires significant technical and programming knowledge, which deters many users. At the same time, educators in structural and molecular biology increasingly need an easy way to generate revealing images of macromolecular structures and complexes in order to convey their stories to students at different levels. To this end, we designed and implemented a web-based platform, dubbed POLYVIEW, which enables quick and versatile analysis of protein structures and high-quality visualization of macromolecules. The platform provides: (i) analysis of proteins and their complexes and display of the derived data using sequence profiles (POLYVIEW-2D, Ref 1); (ii) structural and functional annotation integrated with high-quality 3D rendering (POLYVIEW-3D, Ref 2); and (iii) quantitative analysis and animation of molecular motion and ensembles of alternative conformers from experiments and computer simulations (POLYVIEW-MM, Ref 3).

References

  1. Porollo AA, Adamczak R, Meller J. POLYVIEW: a flexible visualization tool for structural and functional annotations of proteins. Bioinformatics. 2004 Oct 12;20(15):2460-2. Epub 2004 Apr 8. PubMed PMID: 15073023.
  2. Porollo A, Meller J. Versatile annotation and publication quality visualization of protein complexes using POLYVIEW-3D. BMC Bioinformatics. 2007 Aug 29;8:316. PubMed PMID: 17727718; PubMed Central PMCID: PMC1978507.
  3. Porollo A, Meller J. POLYVIEW-MM: web-based platform for animation and analysis of molecular simulations. Nucleic Acids Res. 2010 Jul;38(Web Server issue):W662-6. Epub 2010 May 26. PubMed PMID: 20504857; PubMed Central PMCID: PMC2896192.

The POLYVIEW platform is currently used worldwide and has a growing record of use in educational courses and scientific publications. Below are selected examples of the use of POLYVIEW:

Cover of Clinical Pharmacology and Therapeutics, Issue November 2008, Vol 84 No 5

Hot paper, The Scientist, Issue January 1, 2009

Cover of the book, September 2012

Pharmacy Gate 4D, Project idea and concept by Peter Stasek, Germany, 2012

Cover of ACS Combinatorial Science, February 11, 2013: Vol. 15, Iss. 2

Protein mutations analysis

Cytochrome P450 monooxygenases (CYPs) represent a large and diverse family of enzymes involved in various biological processes in humans. Individual genome sequencing has revealed multiple mutations in human CYPs, and many missense mutations have been associated with a variety of diseases. Since 3D structures have not been resolved for most human CYPs, there is a need for a reliable sequence-based prediction method that discriminates between benign and disease-causing mutations.

A new prediction method (MutaCYP) has been developed for scoring how likely a de novo missense mutation is to have a deleterious effect. The method uses only five features, all of which are sequence-based: predicted relative solvent accessibility (RSA), the variance of predicted RSA among residues in close sequence proximity, the Z-score of Shannon entropy for a given position, the difference in similarity scores, and the weighted difference in size between the wild-type and new amino acids (Figure 3).

Evolution-based features indicate that disease-causing mutations in CYPs occur primarily at conserved sites and have unfavorable similarity scores for the mutant amino acids. In this respect, the distribution of Abs_dSS shows a tendency for deleterious mutations to have a larger difference in similarity scores between the mutant and wild-type amino acids.
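A minimal sketch of the two substitution-based features follows. It assumes that the similarity scores come from a position-specific table (one score per amino acid at the mutated position) and that the size difference is weighted by simple multiplication; neither detail is stated above, so both are labeled assumptions. The function names and the residue size table passed in by the caller are hypothetical.

```python
def abs_dss(position_scores, wild_type, mutant):
    """Absolute difference between similarity scores of wild type and mutant.

    position_scores is assumed to map each amino acid to its similarity
    score at the mutated position (for example, one row of a profile).
    """
    return abs(position_scores[wild_type] - position_scores[mutant])


def ss_abs_dsize(position_scores, sizes, wild_type, mutant):
    """Absolute size difference weighted by the similarity score difference.

    Multiplicative weighting is an assumption made for this sketch.
    """
    size_diff = abs(sizes[wild_type] - sizes[mutant])
    return size_diff * abs_dss(position_scores, wild_type, mutant)
```

With hypothetical scores of 4 for the wild-type residue and -3 for the mutant, Abs_dSS is 7; a substitution that the profile tolerates well would give a value close to 0.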

Predicted RSA (predRSA) is among the strongest sequence-based characteristics for capturing disease-causing mutations. As shown previously (see also above), although SABLE predicts RSA accurately overall, it tends to over-predict burial for residues located in trans-membrane regions, at protein-protein interaction interfaces, and within structurally restrained regions. These are exactly the places where deleterious missense mutations are most likely to occur. Therefore, a bias in predicted RSA toward the buried state is expected to correlate with such mutations.


Figure 3. Distribution of the features used in MutaCYP over benign and deleterious mutations. A. Abs_dSS − absolute difference between similarity scores of wild type amino acid and mutation for a given position. B. ss_Abs_dSize − absolute difference between sizes of wild type amino acid and mutation weighted by the difference of the corresponding similarity scores. C. zsEntropy21 − Z-score for Shannon entropy at a given position based on a window of 21 neighboring amino acids. D. predRSA − predicted RSA. E. varPredRSA21 − variance of predicted RSA for the window of 21 neighboring amino acids.
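The window-based features in panels C and E can be sketched as below. The code assumes that a per-position Shannon entropy has been computed from alignment columns and that the Z-score compares the entropy at the mutated position with the mean and standard deviation of entropies in the surrounding 21-residue window. That reading of zsEntropy21, the truncation of windows at sequence ends, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter
from statistics import mean, pstdev, pvariance


def shannon_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def window(values, index, size=21):
    """Slice of values centered on index, truncated at the sequence ends."""
    half = size // 2
    return values[max(0, index - half): index + half + 1]


def entropy_zscore(entropies, index, size=21):
    """Z-score of the entropy at index relative to its sequence window."""
    neighborhood = window(entropies, index, size)
    spread = pstdev(neighborhood)
    if spread == 0.0:
        return 0.0  # flat window: no deviation to report
    return (entropies[index] - mean(neighborhood)) / spread


def rsa_variance(predicted_rsa, index, size=21):
    """Population variance of predicted RSA within the window around index."""
    return pvariance(window(predicted_rsa, index, size))
```

A strongly negative entropy Z-score marks a position that is more conserved than its sequence neighborhood, which matches the observation that deleterious mutations occur primarily at conserved sites.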

Reference

  • Fechter K, Porollo A. MutaCYP: Classification of missense mutations in human cytochromes P450. BMC Med Genomics. 2014 Jul 30;7(1):47. PubMed PMID: 25073475; PubMed Central PMCID: PMC4119178.