

Bioinformatics for Glycan Expression - Integrated Technology Resource for Biomedical Glycomics: Technological Research and Development Project IV, A Project funded by NIH

ProPreO: comprehensive Proteomics data and process provenance ontology

Proteomics, and glycoproteomics in particular, focuses on two core objectives:

  • Identification of a biomolecule - what is it?
  • Quantification of the identified biomolecule - How much of it is there?

Experimental protocols for proteomics are rapidly maturing and approaching industrial-scale data generation, often termed high-throughput experimentation. As in genomics, the limiting step will be the computational and analytical tools available to process this large volume of data and generate useful information.

Many queries in proteomics involve comparing the proteome across different organisms, tissues, or cells in different developmental or disease states. However, proteomics experimental protocols are heterogeneous in the source of the sample (biological organism), the process used to generate data (separation techniques or mass spectrometric instruments), the parameters used in that process (instrument parameters or separation method parameters), and the formats used to store the data. Hence, proteomics research requires not only finding specific datasets obtained from the relevant biological sources, but also ensuring that those datasets are comparable. For example, differences in sample preparation, data acquisition, or data processing can invalidate a comparison. Provenance, the information about the ancestry of a dataset and the description of how the data was created, transformed, and processed, is the foundation that allows multiple proteomics datasets to be compared meaningfully. ProPreO, a proteomics process ontology, models not only data provenance but also process provenance, enabling consistent and coherent comparison and analysis of proteomics datasets. ProPreO is thus one of the first ontologies focused on capturing comprehensive provenance information in proteomics and related fields.
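To make the comparability requirement concrete, the following is a minimal sketch in Python of a provenance check between two datasets. It is not ProPreO's actual model: the `Provenance` record, its field names, and the sample values are all hypothetical and chosen only to mirror the sources of heterogeneity listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record for one proteomics dataset."""
    organism: str            # biological source of the sample
    separation: str          # separation technique used
    instrument: str          # mass spectrometric instrument
    parameters: tuple = ()   # (name, value) pairs for instrument/method settings


def provenance_differences(a, b):
    """Return the names of the provenance fields on which two datasets differ."""
    diffs = []
    for name in ("organism", "separation", "instrument"):
        if getattr(a, name) != getattr(b, name):
            diffs.append(name)
    # Compare parameters by name so a changed or missing setting is reported.
    pa, pb = dict(a.parameters), dict(b.parameters)
    for key in sorted(set(pa) | set(pb)):
        if pa.get(key) != pb.get(key):
            diffs.append("parameter:" + key)
    return diffs


def comparable(a, b):
    """Treat two datasets as comparable only if no provenance field differs."""
    return not provenance_differences(a, b)
```

A real application would likely relax the strict equality used here, for instance by tolerating differences in parameters that do not affect the comparison being made.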

As part of the bioinformatics research in the National Center for Research Resources (NCRR) Integrated Technology Resource (ITR) for Biomedical Glycomics, we have developed two ontologies, of which ProPreO is one. The other is GlycO, a domain ontology that models the structure and function of glycans. Together, these two ontologies enable a computational framework for the annotation, retrieval, and analysis of high-throughput experimental proteomics and glycoproteomics data, facilitating the discovery of the biological knowledge that the data embody.

We adhered to four major criteria during the development of ProPreO:

  • Logical rigor: We are using ProPreO to annotate experimental proteomics data. With this annotated data, information management applications will be able not only to store, retrieve, and integrate multiple datasets but also to infer implicit knowledge that gives proteomics researchers insight for hypothesis formulation and validation. Hence, to allow computational tools to use ProPreO for reasoning, we ensured that the ProPreO schema is free of incorrectly determined classes, incorrect or inappropriate naming schemes, and ill-defined relationships between concepts. The ProPreO schema includes 390 rigorously defined classes, 32 generic relations, and 172 specific restrictions on the generic relations to correctly describe each concept and its relation to other concepts.

  • Compatibility with existing biomedical ontologies: It is now well understood and accepted that the life sciences domain requires multiple ontologies to manage its inherent complexity. When multiple related ontologies are involved, it is critical that semantic applications can use them in an integrated manner. We have followed the Basic Formal Ontology (BFO) (Smith B. et al. 2002) approach to class and relationship creation in ProPreO. The three top-level classes of ProPreO are 'data' (datasets and parameter data), (experimental) 'instrument', and (experimental) 'task'. We defined generic, easily understandable relations at the level of these top-level classes, and then used restrictions to specify how each generic relation applies to each class, thereby modeling the characteristics of each concept and its relations to other concepts accurately and efficiently. Currently, we are working on the integration, mapping, and alignment of ProPreO with ontologies listed in the Open Biomedical Ontologies (OBO) repository.

  • Use of the OWL-DL language: The Web Ontology Language (OWL) has three flavors, namely OWL-Lite, OWL-DL, and OWL-Full. Because we intended ProPreO to be used by computational applications while expressing the inherent complexity of the proteomics experimental domain as accurately as possible, we chose OWL-DL as the language for ProPreO. OWL-DL lets us be expressive while retaining acceptable computational properties.

  • Populated ontology: We believe that an ontology schema is of limited use without real-world knowledge. We have populated ProPreO with instances corresponding to the concepts modeled in the ontology schema. ProPreO has 3.1 million instances and 18.6 million triples. Populating ProPreO with millions of real-world instances enables us to build computational tools that integrate large volumes of high-throughput experimental data within an overarching semantic framework and reason over it for knowledge discovery.

These four criteria have enabled ProPreO to provide the formal semantic foundation for modeling and incorporating comprehensive provenance information in wide-ranging, high-throughput proteomics research.
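The idea of declaring generic relations at the top-level classes and then narrowing them with class-specific restrictions can be sketched in a few lines of Python. This is an illustration only, not an excerpt from the ProPreO schema: apart from the three top-level classes 'data', 'instrument', and 'task', every class and relation name below is hypothetical, and the closed-world check shown here is much simpler than OWL-DL reasoning.

```python
# Hypothetical class hierarchy: child -> parent. Only 'data', 'instrument'
# and 'task' come from the text; every other name is invented for illustration.
SUBCLASS_OF = {
    "mass_spectrometer": "instrument",
    "ms_analysis": "task",
    "peak_list": "data",
    "parameter_list": "data",
}

# A generic relation declared once between top-level classes (hypothetical).
GENERIC_RELATIONS = {"has_output": ("task", "data")}

# Class-specific restrictions that narrow the generic relation (hypothetical).
RESTRICTIONS = {("ms_analysis", "has_output"): "peak_list"}


def ancestors(cls):
    """Return cls followed by all of its superclasses, nearest first."""
    chain = [cls]
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_valid(subject, relation, obj):
    """Check a (subject class, relation, object class) statement."""
    domain, range_ = GENERIC_RELATIONS[relation]
    if domain not in ancestors(subject) or range_ not in ancestors(obj):
        return False
    # Apply the most specific restriction declared on the subject's lineage.
    for cls in ancestors(subject):
        required = RESTRICTIONS.get((cls, relation))
        if required is not None:
            return required in ancestors(obj)
    return True
```

Under these assumptions, `is_valid("task", "has_output", "parameter_list")` holds because only the generic relation applies, whereas `is_valid("ms_analysis", "has_output", "parameter_list")` does not, because the restriction on the hypothetical `ms_analysis` class narrows the output to `peak_list`.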

Access ProPreO (version: 0.5)

  • ProPreO schema: The schema of the ontology, featuring its 390 classes and attendant relations. This is an *.owl file, best viewed using the Protege ontology development environment. The ProPreO schema file is relatively small and hence may be used to gain an understanding of the structure of the ontology and its applicability to various scenarios. Download

  • ProPreO populated ontology: This file includes the 3.1 million instances and hence is a relatively large *.owl file. ProPreO is currently populated with instances related to human tryptic peptides, their parent proteins, and related enzyme entities. The populated ProPreO ontology may be used as a foundation for developing semantic applications that leverage its instances and its comprehensive provenance framework. Download
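Because the populated file is large, loading it whole into memory may be impractical. As a rough sketch, assuming the *.owl file is serialized as RDF/XML with its resources as direct children of the root `rdf:RDF` element (an assumption about the file layout, not something stated here), the top-level elements can be counted in a streaming fashion with Python's standard library. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def count_top_level_nodes(source):
    """Count the direct children of the root element, grouped by tag.

    `source` is a filename or a binary file-like object. Tags are reported
    in ElementTree's '{namespace}localname' form.
    """
    counts = {}
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # first element seen is the document root
            depth += 1
            continue
        depth -= 1
        if depth == 1:
            # elem is a completed direct child of the root.
            counts[elem.tag] = counts.get(elem.tag, 0) + 1
            root.clear()  # drop processed children to keep memory bounded
    return counts
```

If instances are written as typed node elements, the resulting dictionary gives a per-class instance count; if they are written as `rdf:Description` elements with a separate `rdf:type`, they will all be grouped under the `rdf:Description` tag and a different approach would be needed.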

For citation and further details on ProPreO: Satya S. Sahoo, Christopher Thomas, Amit Sheth, William S. York, and Samir Tartir, "Knowledge Modeling and its application in Life Sciences: A Tale of two ontologies," the 15th World Wide Web (WWW 2006) conference, Edinburgh, UK, May 2006.

Funding: Bioinformatics of Glycan Expression (one of the four components of the "Integrated Technology Resource for Biomedical Glycomics," approx. $6 million+), National Institutes of Health, July 1, 2003 - June 30, 2008.