Bioinformatics for Glycan Expression - Integrated Technology Resource for Biomedical Glycomics: Technological Research and Development Project IV, A Project funded by NIH
The Glycomics Ontology GlycO focuses on the glycoproteomics domain to model the structure and functions of glycans and glycoconjugates, the enzymes involved in their biosynthesis and modification, and the metabolic pathways in which they participate. GlycO is intended to provide both a schema and a sufficiently large knowledge base, which will allow classification of concepts commonly encountered in the field of glycobiology in order to facilitate automated reasoning and information analysis in this domain.
The GlycO schema exploits the expressiveness of OWL-DL to place restrictions on relationships, which makes it suitable for classifying new instance data. These logical restrictions are necessary because of the chemical nature of glycans, which have complex, branched structures that cannot be represented in any simple way. Glycans are thus distinguished from DNA (e.g., genes) and proteins, which can be represented (at least in their most basic forms) as simple character strings. The structural knowledge in GlycO is modularized, in that larger structures are semantically composed from smaller canonical building blocks. In particular, glycan instances are modeled by linking together several instances of canonical monosaccharide residues, which embody knowledge of their chemical structure (e.g., β-D-GlcpNAc) and context (e.g., attached directly to the Asn residue of a protein). This bottom-up semantic modeling of large molecular structures using smaller building blocks allows structures in GlycO to be placed in a biochemical context by describing the specific interactions of their component parts with proteins, enzymes and other biochemical entities.
The information needed to populate the knowledge base is automatically extracted from several partially overlapping but divergent sources, including the widely used KEGG, SweetDB, and CARBBANK databases. Transformation and disambiguation techniques are applied in order to avoid multiple entries. The ultimate goal is to generate a large ontology that can be used for the retrieval of information from numerous and diverse information sources, including structural, functional and taxonomic databases, along with experimental data that is exposed on the Internet, and the discovery of knowledge implicit in that information.
The current release of GlycO (version 0.95) contains 573 classes and 113 types of named relationships. It has been semi-automatically populated with more than 480 N-glycans. In order to ensure correctness at the lower levels of granularity, several hundred instances that act as building blocks for glycans have been inserted manually. One of these concepts is the carbohydrate_residue, or basic unit of glycan structure. Carbohydrate residues are classified according to their structural features, such as absolute configuration (d or l), overall configuration (e.g., gluco or manno), anomeric configuration (α or β), ring form (f or p), and number of main-chain carbons (e.g., hexosyl or pentosyl). Thus, the concepts in GlycO can be mapped to language commonly used by the glycobiologist to describe the building blocks of glycans. By formalizing the specification of glycosyl linkages between carbohydrate residues, GlycO also provides a means to represent the chemical environment of specific instances of these residues. GlycO implements a powerful extension of this approach by defining "canonical" residue instances, as described in the following example. A typical N-glycan contains a single β-d-Manp residue in its core. This residue is glycosidically linked to a specific site (oxygen-4) of the next residue, which is invariably a β-d-GlcpNAc residue. The identity of the β-d-Manp residue and its precise location in the core of the N-glycan allow it to be unambiguously classified. In fact, glycobiologists often refer to this residue as "the core β-Man residue", with the implied assertion that this residue is in a particular molecular location and that its biosynthetic addition to the glycan was catalyzed by a specific class of glycosyl transferases (i.e., a GDP-mannose-dolichol diphosphochitobiose mannosyltransferase, EC 2.4.1.142). The trained glycobiologist can intuitively make a large number of structural and biochemical inferences when the core β-Man residue is invoked.
This can be viewed as a colloquial classification of a canonical glycosyl residue, as each unique N-glycan structure contains a single glycosyl residue called "the core β-Man residue." However, very few of the residues that make up N-glycans have a common name based on their chemical identity and context.
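The structural features used to classify carbohydrate residues can be read directly from a residue name. The following is a minimal sketch, assuming names of the form "β-D-Manp" or "b-D-GlcpNAc"; the ResidueFeatures class, the parse_residue function, and the small lookup tables are hypothetical illustrations and are not part of GlycO.

```python
import re
from dataclasses import dataclass

# Hypothetical lookup tables for illustration only; they cover a few common
# residues and are not taken from the GlycO class hierarchy.
ANOMERS = {"α": "alpha", "a": "alpha", "β": "beta", "b": "beta"}
RING_FORMS = {"p": "pyranose", "f": "furanose"}
# base name -> (overall configuration, number of main-chain carbons)
BASE_TYPES = {
    "Glc": ("gluco", "hexosyl"),
    "Man": ("manno", "hexosyl"),
    "Gal": ("galacto", "hexosyl"),
    "Xyl": ("xylo", "pentosyl"),
}


@dataclass(frozen=True)
class ResidueFeatures:
    anomer: str         # anomeric configuration: alpha or beta
    absolute: str       # D or L
    configuration: str  # e.g. gluco or manno
    chain: str          # e.g. hexosyl or pentosyl
    ring: str           # pyranose (p) or furanose (f)
    substituent: str    # trailing substituent such as "NAc", possibly empty


RESIDUE_NAME = re.compile(
    r"^(?P<anomer>[αβab])-(?P<absolute>[DLdl])-"
    r"(?P<base>[A-Z][a-z]{2})(?P<ring>[pf])(?P<substituent>\w*)$"
)


def parse_residue(name):
    """Split a residue name into the structural features listed in the text."""
    match = RESIDUE_NAME.match(name)
    if match is None or match.group("base") not in BASE_TYPES:
        raise ValueError(f"unrecognized residue name: {name!r}")
    configuration, chain = BASE_TYPES[match.group("base")]
    return ResidueFeatures(
        anomer=ANOMERS[match.group("anomer")],
        absolute=match.group("absolute").upper(),
        configuration=configuration,
        chain=chain,
        ring=RING_FORMS[match.group("ring")],
        substituent=match.group("substituent"),
    )


core_man = parse_residue("β-d-Manp")
core_glcnac = parse_residue("β-D-GlcpNAc")
```

A parser of this kind captures only the chemical identity of a residue. The context that makes a residue canonical (for example, its position in the N-glycan core) has to come from the linkage information described above.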
As a novelty for ontology design on the semantic web, we see the ontology not only as a means to achieve agreement among the users who subscribe to its view of the world, but also as a basis for the actual modeling of individual structures. The keywords for this modeling approach are modularity and canonical individuals. We are building a bottom-up model of the glycochemistry domain by providing building blocks, encoded as ontology individuals, that can be used to build larger structures. For example, a glycan moiety individual is composed of several monosaccharide residue individuals. The glycan moiety individual in turn can be used as a building block of larger molecules, such as glycoproteins or glycolipids.
Specification of canonical residues in GlycO applies this powerful concept to all of the monosaccharide residues within the glycan. For N-glycans, this is accomplished by defining a canonical tree that subsumes all possible N-glycans. That is, almost all known N-glycans can be completely specified by choosing a subset of the nodes of this canonical tree that form a connected (directed) graph. Such a graph (known as glycoTree) has been previously described (N. Takahashi and K. Kato, Trends Glycosci. Glycotech., 15: 235-251), and we have formalized that structure as a collection of interconnected, canonical residue instances in GlycO. This provides a mechanism by which the chemical and biological properties of each residue within the glycan, as well as the cellular machinery involved in its biosynthesis and degradation, can be semantically inferred. That is, other semantically defined objects (such as glycosyl transferases) and processes (such as metastasis) can be associated with the canonical residues that they depend on or interact with. Some of these associations may be indirect (via other objects in the ontology), or inferred by analysis of quantitative information (e.g., correlation of the abundance of glycans containing a specific canonical residue with the observation of a cellular property like invasiveness) that could be extracted from a semantically annotated database. An example is the specification (within GlycO) that addition of "N-glycan_b-D-GlcNAc_9" is catalyzed by an instance of the GNT-V class of glycosyl transferases, and that glycans containing this residue are recognized by the lectin LPHA. Then, the hypothesis that GNT-V overexpression is correlated with elevated invasiveness of various types of cancer cells can be inferred from a semantically annotated database that includes information regarding the binding of different lectins to various cancer cell lines and the physiological properties of these cell lines.
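The idea that a glycan is specified by a connected subset of canonical tree nodes can be illustrated with a small check. This is a sketch under stated assumptions: the tree below is a toy with made-up node identifiers, not the actual glycoTree or GlycO instance names, and is_connected_subtree is a hypothetical helper.

```python
from typing import Dict, Optional, Set

# Toy canonical tree stored as child -> parent. Node identifiers are
# hypothetical placeholders and do not correspond to GlycO instances.
PARENT: Dict[str, Optional[str]] = {
    "GlcNAc_1": None,        # root of the canonical tree
    "GlcNAc_2": "GlcNAc_1",
    "Man_3": "GlcNAc_2",
    "Man_4": "Man_3",
    "Man_5": "Man_3",
    "GlcNAc_6": "Man_4",
}


def is_connected_subtree(residues: Set[str]) -> bool:
    """Return True if the chosen canonical residues form one connected graph.

    In a rooted tree, a node subset is connected exactly when one and only
    one of its members has a parent outside the subset (or no parent at all).
    """
    if not residues or any(r not in PARENT for r in residues):
        return False
    local_roots = [r for r in residues if PARENT[r] not in residues]
    return len(local_roots) == 1


core = {"GlcNAc_1", "GlcNAc_2", "Man_3", "Man_4", "Man_5"}
broken = {"GlcNAc_1", "Man_3"}  # skips GlcNAc_2, so it is not connected
```

Because every residue in a valid subset is identified with a canonical node, any property attached to that node (such as the enzyme that adds it) applies to the residue wherever it occurs.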
Enzyme activity plays a crucial role in the synthesis of glycans. However, we believe that enzymes, as a subset of proteins, form a separate domain of interest. For this reason, rather than extending GlycO, we enriched it through the development of a domain specific ontology, EnzyO. EnzyO (for Enzyme Ontology) has been populated with rich descriptions of enzyme structures and reactions, thus embodying a deep knowledge of the domain.
Many disparate sources of information were used during the creation and population of EnzyO. It is therefore intended as a comprehensive source of knowledge that allows the efficient retrieval of semantically relevant enzyme information.
One major application goal of GlycO is to provide access to all pathways that are part of glycan biosynthesis. In contemporary glycan databases, pathways are cut off at some point to show that a particular part of the pathway can be seen as a single entity. In GlycO, path queries allow us to show complex relationships between glycans and biochemical processes that cross these assigned boundaries. Hence we can show complete pathways leading from any arbitrarily chosen compound to any other one.
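A path query that crosses pathway boundaries can be sketched as a search over the union of several pathway graphs. The example below is illustrative only: the compound names are abstract placeholders, the two pathway fragments are invented, and a plain breadth-first search stands in for whatever query mechanism is actually used with GlycO.

```python
from collections import deque

# Two hypothetical pathway fragments, each mapping a compound to the
# compounds it is converted into. A database that cuts pathways at
# compound_C would store them separately.
PATHWAY_1 = {"compound_A": ["compound_B"], "compound_B": ["compound_C"]}
PATHWAY_2 = {"compound_C": ["compound_D"], "compound_D": ["compound_E"]}


def merge_pathways(*pathways):
    """Combine pathway fragments into a single directed graph."""
    graph = {}
    for pathway in pathways:
        for source, targets in pathway.items():
            graph.setdefault(source, []).extend(targets)
    return graph


def find_path(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for successor in graph.get(node, []):
            if successor not in seen:
                seen.add(successor)
                queue.append(path + [successor])
    return None


full_graph = merge_pathways(PATHWAY_1, PATHWAY_2)
across = find_path(full_graph, "compound_A", "compound_E")
within = find_path(PATHWAY_1, "compound_A", "compound_E")
```

Searching a single fragment finds no route from compound_A to compound_E, while the merged graph yields the complete path through the shared compound.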
Automated protocols are being used to populate GlycO. In order to harvest this data, we use the Semagix Freedom toolkit that allows extraction of data from semi-structured internet sources, such as CarbBank, KEGG and SweetDB. Simply collecting this information is not enough, since database schemas are usually shallow and categorization is typically done by keywords rather than by a class hierarchy. Keywords rarely provide sufficient information for the complete classification of glycan instances after extraction from the source. For incorporation into GlycO, instances of glycans and their constituent residues have to be classified according to their structure. This process is facilitated by first converting the imported glycan structure (usually in IUPAC format) into the LINUCS format (Bohne-Lang A, Lang E, Forster T, von der Lieth CW. 2001. LINUCS: linear notation for unique description of carbohydrate sequences. Carbohydr Res. 336:1-11), and then to our GLYDE format. Both LINUCS and GLYDE are tree-based formats in which the natural topology of the glycan is mirrored in the data structure. GLYDE differs from LINUCS in that GLYDE is an XML-based format that can be readily parsed by widely available software. GLYDE-encoded glycan instance information can be parsed according to a canonical tree, such as GlycoTree (reference) embodied within GlycO (see above). In this process, the glycan is split up into its component residues and each residue is categorized according to its chemical structure and context.
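To illustrate how a tree-based XML encoding mirrors the topology of a glycan, the sketch below serializes the familiar N-glycan core pentasaccharide as nested elements and parses it back. The element name "residue" and the attributes "name" and "parent_position" are assumptions made for this example and are not the actual GLYDE schema.

```python
import xml.etree.ElementTree as ET

# Each node is (residue name, position on the parent it is linked to, children).
# The structure is the common N-glycan core; the encoding is hypothetical.
CORE = ("b-D-GlcpNAc", None, [
    ("b-D-GlcpNAc", "4", [
        ("b-D-Manp", "4", [
            ("a-D-Manp", "3", []),
            ("a-D-Manp", "6", []),
        ]),
    ]),
])


def to_xml(node):
    """Recursively turn a residue tuple into a nested XML element."""
    name, parent_position, children = node
    element = ET.Element("residue", name=name)
    if parent_position is not None:
        element.set("parent_position", parent_position)
    for child in children:
        element.append(to_xml(child))
    return element


def count_residues(element):
    """Count residues by walking the element tree, one node per residue."""
    return 1 + sum(count_residues(child) for child in element)


xml_text = ET.tostring(to_xml(CORE), encoding="unicode")
parsed = ET.fromstring(xml_text)
core_man = parsed[0][0]  # root GlcNAc -> second GlcNAc -> core beta-Man
```

Because each residue is an element nested inside the residue it is attached to, a standard XML parser recovers the branching structure directly, and each element can then be matched against a canonical tree.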
Access GlycO and EnzyO
When referring to GlycO, please cite:
Funding: Bioinformatics of Glycan Expression (one of the four components of the "Integrated Technology Resource for Biomedical Glycomics," approx. $6 million+), National Institutes of Health, July 1, 2003 - June 30, 2008.
©2005 LSDIS and the University of Georgia. All rights reserved.