1. Protein Identification by MS/MS in Proteomics:
Tandem mass spectrometry (MS/MS) involves multiple steps of mass selection or analysis and has been widely used to identify peptides and analyze complex mixtures of proteins. The two most frequently used computational approaches to inferring peptide sequences from mass spectra are:
(a) the peptide fragment fingerprinting (PFF) approach, in which theoretical model spectra are built for candidate proteins extracted from a database and the experimental spectra are compared with these theoretical model spectra. This approach is not suitable for proteins whose post-translational modifications (PTMs) are missing from the database, or for proteins from unsequenced genomes.
(b) the de novo sequencing approach, in which the inference of complete or partial peptide sequences is independent of information extracted from pre-existing protein or DNA databases. Sequence similarity search algorithms have been specially developed to compare the inferred complete or partial sequences with theoretical sequences. Once a protein has been sequenced by de novo methods, one can look for related proteins in a GTDB using a matching algorithm such as MS-BLAST.
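The spectrum comparison at the heart of approach (a) can be illustrated with a minimal Python sketch. It is not the scoring scheme of any particular search engine: it assumes singly charged b and y ions only, uses standard monoisotopic residue masses, and scores a candidate by a simple count of matched peaks. The function names and the 0.5 Da tolerance are hypothetical choices made for illustration.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def theoretical_fragments(peptide):
    """Return sorted m/z values of singly charged b and y ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(masses)):
        ions.append(sum(masses[:i]) + PROTON)          # b ion: N-terminal prefix
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y ion: C-terminal suffix
    return sorted(ions)


def count_matches(observed_mz, peptide, tol=0.5):
    """Count theoretical ions that lie within tol of any observed peak."""
    theoretical = theoretical_fragments(peptide)
    return sum(
        1 for t in theoretical
        if any(abs(t - o) <= tol for o in observed_mz)
    )


def rank_candidates(observed_mz, candidates, tol=0.5):
    """Rank candidate peptides by matched-ion count, best first."""
    scored = [(count_matches(observed_mz, p, tol), p) for p in candidates]
    return sorted(scored, key=lambda item: (-item[0], item[1]))
```

A candidate absent from the list, for example a peptide from an unsequenced genome, can never be ranked first, which is the limitation noted for approach (a).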
Consistency-checking of de novo sequencing tools: In this scenario, we investigate whether the accuracy of peptide sequence identification can be improved by consistency-checking the results of different de novo sequencing methods for MS/MS interpretation. The OpenKnowledge framework was used to automate the harvesting of results from different de novo sequencing tools for a target mass spectrum and to collate them, so that the answers can be compared easily and the consistency between them captured. Three example de novo sequencing tools, a web server (PEAKS) and two local programs (PepNovo and Lutefisk), each with its own OpenKnowledge Component (OKC), act as data sources that provide peptide identifications. A data-comparer peer then compares these answers to obtain a re-ranked list of candidate peptide sequences at three different confidence levels. A recursive function allows the submission and analysis of multiple target mass spectra.
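The comparison step can be sketched in Python as follows. The text does not define the three confidence levels, so this sketch assumes "high" means all tools agree, "medium" means more than one tool agrees, and "low" means a single tool proposed the sequence. Treating isoleucine and leucine as identical (they have the same mass) is a further assumption, and the function names are hypothetical.

```python
from collections import defaultdict


def normalise(seq):
    """I and L are isobaric, so treat them as the same residue (assumption)."""
    return seq.upper().replace("I", "L")


def consensus_rank(results):
    """Re-rank candidates by how many tools support them.

    results maps a tool name to its candidate peptides, best first.
    Returns (sequence, confidence level, supporting tools) tuples.
    """
    support = defaultdict(set)
    best_rank = {}
    for tool, peptides in results.items():
        for rank, peptide in enumerate(peptides):
            key = normalise(peptide)
            support[key].add(tool)
            best_rank[key] = min(rank, best_rank.get(key, rank))

    def level(n_tools):
        # Assumed definition of the three confidence levels.
        if n_tools == len(results):
            return "high"
        return "medium" if n_tools > 1 else "low"

    ordered = sorted(
        support, key=lambda k: (-len(support[k]), best_rank[k], k)
    )
    return [(k, level(len(support[k])), sorted(support[k])) for k in ordered]
```

For example, if PEAKS and Lutefisk report LVNELTEFAK and PepNovo reports IVNELTEFAK, the sketch treats all three answers as one sequence and places it at the top of the list with "high" confidence.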
Peer Ranking of protein identification by MS/MS tools: There is some confusion about which protein identification approach, and which specific tool, to select for a given task using validated MS/MS data. The peer ranking algorithm integrated into the OpenKnowledge system helps to evaluate the relative popularity or importance of protein sequence identification tools that employ different approaches. These include the PFF approach (the subscribed example peers MASCOT and OMSSA are available to play this role) and a combination of de novo sequencing (performed by the example peer PepNovo or Lutefisk) followed by database similarity searching (performed by the example peer MS-BLAST).
2. Protein Structure Prediction
Protein structure prediction is one of the best-known goals pursued by bioinformaticians. A protein’s three-dimensional (3-D) structure is crucial for understanding its function and for targeting it in applications such as drug discovery and enzyme design. However, there is a continually widening gap between the number of protein amino acid sequences deduced rapidly through ongoing genomics efforts and the number of proteins whose 3-D atomic coordinates have been deposited in the Protein Data Bank (PDB), i.e. those determined by structural biology techniques. To bridge this gap, many computational biology research groups have focussed on developing ever-improving methods for producing structural models of proteins from their amino acid sequences. Still, the resulting methods are far from perfect, and no single method always produces an accurate model. However, particularly in comparative modelling cases (where a protein of known structure can be used as a template for a protein of interest, based on the similarity between their sequences), high-quality modelled structures can be useful resources for biological research. Consistency checking and consensus building are commonly used strategies in the field for selecting high-quality models from the pool of models produced by different methods.
Consistency-checking of 3-D Models for Yeast Protein Structure Prediction: As in the consistency-checking experiment for de novo sequencing by MS/MS, in this experiment we aimed to check the consistency among pre-computed comparative models from three public repositories for the proteins encoded by the genome of the budding yeast Saccharomyces cerevisiae. The three example public repositories, SWISS-MODEL, ModBase, and SAM-T20, provide protein 3-D models generated by different structure prediction approaches and pipelines. The three data sources were systematically retrieved and compared across the yeast proteins. The results were made available as a new resource called the Comparison of Yeast 3-D Structure Prediction (CYSP).
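One way to quantify how consistent two models of the same protein are is sketched below in Python. The text does not state which measure was used, so the distance RMSD (dRMSD) here is an assumption chosen because it needs no structural superposition: it compares the C-alpha to C-alpha distances within each model. The input format (a mapping from residue number to C-alpha coordinates) and the function names are hypothetical.

```python
import math
from itertools import combinations


def drmsd(model_a, model_b):
    """Distance RMSD between two models over their shared residues.

    Each model maps a residue number to the (x, y, z) coordinates of
    its C-alpha atom, in angstroms.
    """
    shared = sorted(set(model_a) & set(model_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared residues")
    squared_sum = 0.0
    n_pairs = 0
    for i, j in combinations(shared, 2):
        dist_a = math.dist(model_a[i], model_a[j])
        dist_b = math.dist(model_b[i], model_b[j])
        squared_sum += (dist_a - dist_b) ** 2
        n_pairs += 1
    return math.sqrt(squared_sum / n_pairs)


def pairwise_consistency(models):
    """Return the dRMSD for every pair of named models."""
    return {
        (a, b): drmsd(models[a], models[b])
        for a, b in combinations(sorted(models), 2)
    }
```

A dRMSD of zero means the two models have identical internal geometry over the residues they share; larger values indicate disagreement. `math.dist` requires Python 3.8 or later.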
OK-omics is a new form of knowledge sharing for expression proteomics with the aim of (1) significantly increasing the percentage of peptides and proteins that can be sequenced and identified by mass-spectrometry-based analysis, and (2) significantly reducing the time needed for sequencing and identification. To this end, we combine current bioinformatics techniques for proteomics with novel multiagent system architectures and distributed knowledge coordination mechanisms in peer-to-peer networks, which have been developed in the context of the OpenKnowledge project.
Identifying known and new protein sequences using high-throughput methods is an important problem in proteomics. Protein sequences are usually stored in public databases. However, these protein sequences are mostly inferred by direct translation of gene sequences rather than determined directly by physical experiments. This means that neither proteins with post-translational modifications (PTMs) nor proteins from organisms whose genomes have not been sequenced would find exact matches in such databases. An efficient experimental technique for the identification of proteins is mass spectrometry (MS). However, among other factors the following issues complicate this task:
The Experiment
We have carried out an experiment in which we set up a P2P network of nine proteomics laboratories from ProteoRed, Spain’s National Proteomics Institute, each handling its own database of sequences and spectra. When queried, a laboratory looked for matches between the input sequence or spectrum and the information collected in its database. For our test data we decided to use pre-existing MS/MS data from the 2006 ABRF (Association of Biomolecular Resource Facilities) test sample. It consists of a mixture of 48 purified and recombinant proteins (plus an unknown number of protein contaminants) that was extensively tested during the ABRF Proteomics Standards Research Group 2006 worldwide survey. In total, 78 laboratories participated in the analysis of these mixtures. Among these, only 35% could correctly identify more than 40 protein components. Thus the sample, while convenient for testing the OK system, has a complexity close to that found in real proteomics work.
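The query behaviour described above, in which each laboratory searches only its own database, can be pictured with a small in-process Python sketch. It is not the OpenKnowledge implementation: the class `LabPeer`, the function `query_network`, and the substring match used as the matching rule are all hypothetical, and a real network would exchange messages between peers rather than call methods directly.

```python
class LabPeer:
    """A laboratory peer holding its own local set of sequences."""

    def __init__(self, name, sequences):
        self.name = name
        self.sequences = set(sequences)

    def query(self, fragment):
        """Return local sequences containing the fragment (assumed rule)."""
        return sorted(s for s in self.sequences if fragment in s)


def query_network(peers, fragment):
    """Send the query to every peer and collect the non-empty answers."""
    hits = {}
    for peer in peers:
        matches = peer.query(fragment)
        if matches:
            hits[peer.name] = matches
    return hits
```

Because each peer answers from its own data, a sequence held by only one laboratory can still be found by any participant that queries the network.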