The Semantic Vocabulary Interoperation Project

Goal

This project plans to investigate methods of providing semantic interoperation across multiple biomedical vocabularies.

Motivation

Making the healthcare IT infrastructure more efficient and responsive is viewed as a means of reducing the costs of medical treatment and healthcare. Standards-based information integration and interoperation across multiple healthcare information systems is a critical requirement for streamlining that infrastructure. This has been the focus of e-Government initiatives such as Consolidated Health Informatics (CHI) and of workshops such as Information Technologies for Healthcare: Barriers to Implementation (organized by NIST). In parallel, efforts are underway to integrate clinical information in the context of a medical record and to store the data in a clinical data warehouse for analysis and mining, as in the Clinical Research Information System being developed at the NIH.

Biomedical vocabularies and ontologies have a critical role to play in the integration of healthcare information. Clinical and hospital information systems have used terms from a variety of biomedical vocabularies to specify codes for healthcare transactions and other pieces of information. At the same time, initiatives such as CHI and government regulations such as HIPAA have standardized on biomedical vocabularies that are a part of the Unified Medical Language System (UMLS) effort, e.g., SNOMED, RxNORM and UMDNS. Support for mapping terms from and to the various biomedical vocabularies that are a part of the UMLS will therefore be a critical requirement.

The UMLS Metathesaurus and Semantic Network are biomedical knowledge resources that attempt to standardize the semantics of terms drawn from various biomedical vocabularies and to capture relationships between those terms, both within and across vocabularies. The main application focus of the UMLS has been Information Retrieval applications. In this project, we propose to leverage and apply this knowledge in the context of Information Integration applications, specifically for mapping and translation of terms across multiple biomedical vocabularies.

Problem Statement

Given a concept in a particular (source) biomedical terminology, e.g., ICD-9, how can one determine an equivalent or semantically closest concept, or concept expression (typically involving multiple concepts combined using boolean and other types of operators), in a target biomedical vocabulary, e.g., SNOMED?

The main research challenges in the above effort are:

The set of investigations that we plan to pursue as a part of this project is discussed below.

Demos

A simple demo of term mappings across vocabularies based on synonymy is available.
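
The demo itself is not reproduced here. As a minimal sketch of what a synonymy-based mapping could look like, the Python below assumes that terms from different vocabularies are linked whenever they share a Metathesaurus concept identifier. The rows, concept identifiers, vocabulary names and codes are invented placeholders, not UMLS data, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (concept_id, vocabulary, code, term).
# All values are made-up placeholders for illustration only.
ROWS = [
    ("C-001", "VOCAB_A", "A10", "Heart attack"),
    ("C-001", "VOCAB_B", "B77", "Myocardial infarction"),
    ("C-001", "VOCAB_B", "B78", "Cardiac infarction"),
    ("C-002", "VOCAB_A", "A20", "High blood pressure"),
    ("C-002", "VOCAB_C", "C05", "Hypertension"),
]


def build_index(rows):
    """Index rows by concept, and record which concept each code belongs to.

    Assumption: each (vocabulary, code) pair belongs to exactly one concept.
    """
    by_concept = defaultdict(list)
    concept_of = {}
    for concept_id, vocab, code, term in rows:
        by_concept[concept_id].append((vocab, code, term))
        concept_of[(vocab, code)] = concept_id
    return by_concept, concept_of


def map_by_synonymy(rows, source_vocab, source_code, target_vocab):
    """Return (code, term) pairs in target_vocab that share the source's concept."""
    by_concept, concept_of = build_index(rows)
    concept_id = concept_of.get((source_vocab, source_code))
    if concept_id is None:
        return []  # unknown source code: nothing to map
    return [
        (code, term)
        for vocab, code, term in by_concept[concept_id]
        if vocab == target_vocab
    ]
```

A call such as `map_by_synonymy(ROWS, "VOCAB_A", "A10", "VOCAB_B")` returns every synonym of the source term that the target vocabulary contains; an empty list signals that synonymy alone is not enough, which is the case the investigations below address.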

Project Plan

Investigation 1: Characterization of Semantic Distance Metrics

Semantic distance metrics will be used in the context of translating a term in a source biomedical vocabulary to the semantically closest term or term expression in a target biomedical vocabulary. In this context, we will investigate approaches for characterizing semantic distances across a broad set of computer science and medical informatics research areas.

Method

A systematic survey and exploration of semantic distance metrics used in a wide variety of disciplines will be undertaken. This will include a literature search of related work in disciplines such as Medical Informatics, Knowledge Representation, Statistical Clustering, Data Mining, Machine Learning, Information Retrieval and Natural Language Processing. The various semantic distance measures will be characterized based on criteria such as whether a measure is based on the probability distributions characterizing concept extensions or on intensional definitions of concepts. The relationship between the type of semantic distance measure, the nature of the applications using the measure, and the availability of resources will be explored. Prototypical semantic distance metrics based on these characterizations will be designed.
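
To make the extensional versus intensional distinction concrete, here is a small Python sketch. It is an illustration only, not one of the metrics the project will design: it uses a plain Jaccard distance as a stand-in, applied once to sets of instances (extensions) and once to sets of defining properties (intensions). The concept names, record identifiers and properties are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical extensions: IDs of records annotated with each concept.
EXTENSION = {
    "concept_x": {1, 2, 3, 4},
    "concept_y": {3, 4, 5, 6},
}

# Hypothetical intensions: defining (relationship, value) pairs of each concept.
INTENSION = {
    "concept_x": {("is_a", "disease"), ("site", "lung")},
    "concept_y": {("is_a", "disease"), ("site", "lung"), ("cause", "virus")},
}


def extensional_distance(c1, c2):
    """Distance based on how much the instance sets of two concepts overlap."""
    return jaccard_distance(EXTENSION[c1], EXTENSION[c2])


def intensional_distance(c1, c2):
    """Distance based on how much the definitions of two concepts overlap."""
    return jaccard_distance(INTENSION[c1], INTENSION[c2])
```

With this toy data the two views disagree: the concepts share only half of their instances but most of their definition. Which view is usable in practice depends on whether annotated instance data or formal definitions are available, which is the resource question raised above.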

Evaluation Plan

The various semantic distance metrics will be evaluated by applying them to the translation of concepts across biomedical vocabularies. For a brief description, refer to Investigations 3 and 4 below.

Investigation 2: Formal Representations of UMLS Knowledge

In order to exploit the knowledge available in the UMLS Metathesaurus and the Semantic Network in the context of algorithms that perform semantic translations across vocabularies, a formal representation of that knowledge is required. We shall investigate approaches to explicitly represent UMLS semantics in a formal manner using KR formalisms.

Method

A systematic evaluation of Semantic Web ontology languages, such as DAML+OIL and OWL, along with their associated Description Logics (DL) implementations will be performed. Various approaches for representing concepts and relationships in the UMLS Metathesaurus and Semantic Network will be considered and evaluated based on various criteria. The representation of a healthcare-based information model, the Reference Information Model (RIM), and its interrelationships with the Metathesaurus, Semantic Network and various biomedical terminologies will also be explored. In particular, a concept in a biomedical vocabulary can be associated with a combination of concepts, expressed by associated expressions and definitions. These relationships will be used by algorithms in the translation process. Semi-automatic approaches for acquiring current knowledge into formal KR specifications will also be designed and considered. In particular, definitions and associated expressions available in MRATX and MRDEF will be expressed as DL expressions using NLP techniques.
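
As a rough illustration of the kind of DL expression meant here, the Python sketch below models a concept as a conjunction of atomic concept names and existential role restrictions with atomic fillers. This is a deliberately tiny fragment chosen for the example, with no concept hierarchy, not the representation the project will adopt. The concept and role names (`Pneumonia`, `causativeAgent`, `Virus`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A conjunction of atomic concepts and (role, filler) existential restrictions."""
    atoms: frozenset = frozenset()
    restrictions: frozenset = frozenset()

    def __and__(self, other):
        # Conjunction simply pools the conjuncts of both operands.
        return Concept(self.atoms | other.atoms,
                       self.restrictions | other.restrictions)


def atom(name):
    """An atomic concept such as Pneumonia."""
    return Concept(atoms=frozenset([name]))


def some(role, filler):
    """An existential restriction: there is a `role` link to some `filler`."""
    return Concept(restrictions=frozenset([(role, filler)]))


def subsumes(general, specific):
    """True if every conjunct of `general` also appears in `specific`.

    This structural check is only adequate for the tiny fragment used here.
    """
    return (general.atoms <= specific.atoms
            and general.restrictions <= specific.restrictions)


def render(concept):
    """Render in a Manchester-like notation, e.g. 'A and (r some B)'."""
    parts = sorted(concept.atoms)
    parts += [f"({role} some {filler})"
              for role, filler in sorted(concept.restrictions)]
    return " and ".join(parts) if parts else "Thing"


pneumonia = atom("Pneumonia")
viral_pneumonia = pneumonia & some("causativeAgent", "Virus")
```

Here `render(viral_pneumonia)` yields a readable definition, and `subsumes(pneumonia, viral_pneumonia)` holds while the converse does not. A real DL reasoner over OWL would handle far richer constructs; the point is only that, once definitions are explicit, such relationships can be computed and used by the translation algorithms.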

Evaluation Plan

The evaluation plan has two parts:

Investigation 3: Direct Vocabulary Translations

In this phase of the investigation, we will propose algorithms that try to translate a term from a source to a target biomedical vocabulary directly using the UMLS Semantics.

Method

An initial set of biomedical vocabularies, e.g., SNOMED, RxNORM and UMDNS, will be chosen for experimentation. A baseline benchmark of the 1-1 mappings across multiple vocabularies will be computed. Algorithms that compute a translation of a term in a source vocabulary to terms in a target vocabulary will be designed. These algorithms will use the UMLS Semantics represented in a formal specification. Multiple candidate translations might consist of hypernyms and/or hyponyms, or concept expressions in the target vocabulary. Algorithms that merge multiple vocabularies and construct candidate translations by navigating the merged vocabularies will be developed and tested. The role of semantic constraints in guiding the navigation-based search for candidate translations will be explored and used to improve the algorithms. Various semantic distance measures identified in Investigation 1 will be applied to choose the best of the candidate translations. If only poor-quality translations are obtained for a particular term, a domain expert may need to be notified to identify the translation. Thresholding schemes that identify the situations requiring involvement of human experts will be developed.
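
The navigation and thresholding ideas above can be sketched as follows. This is a minimal Python illustration under strong assumptions, since work on this investigation has not started: the merged vocabularies are reduced to a toy graph of "broader than" edges, the semantic distance is simply the number of hops, and only hypernym candidates are searched. Node names, the `SRC:`/`TGT:` prefixes and the `max_hops` threshold are hypothetical.

```python
from collections import deque

# Hypothetical merged graph: node -> list of broader nodes.
# The prefix marks which vocabulary a node comes from.
PARENTS = {
    "SRC:viral pneumonia": ["SRC:pneumonia"],
    "SRC:pneumonia": ["TGT:lung infection", "SRC:lung disease"],
    "SRC:lung disease": ["TGT:respiratory disorder"],
    "TGT:lung infection": ["TGT:respiratory disorder"],
}


def hypernym_candidates(start, target_prefix, parents):
    """Breadth-first search upward; return [(target_node, hops)], nearest first."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if hops > 0 and node.startswith(target_prefix):
            found.append((node, hops))
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, hops + 1))
    return found


def translate(term, target_prefix, parents, max_hops=2):
    """Return (best_candidate, needs_expert_review).

    The translation is flagged for a domain expert when no candidate exists
    or when the nearest candidate is more than max_hops away.
    """
    candidates = hypernym_candidates(term, target_prefix, parents)
    if not candidates:
        return None, True
    best, hops = candidates[0]
    return best, hops > max_hops
```

For `"SRC:viral pneumonia"` the nearest target-vocabulary hypernym is two hops away, so it is accepted with the default threshold and flagged for review with `max_hops=1`. Swapping the hop count for one of the distance measures from Investigation 1 would change only how candidates are ranked and thresholded.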

Evaluation Plan

The various algorithms designed will be evaluated along the following dimensions:

Status

Work has yet to begin on this part of the project.

Investigation 4: Mediated Vocabulary Translations

In some cases, a translation of a term from a source vocabulary to a target vocabulary may not exist, or the translations obtained using the direct approach may be of very poor quality. In those cases, we may choose a third vocabulary to mediate a translation between the two biomedical vocabularies.
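
One simple way to picture mediation is as the composition of two mappings: source to mediator, then mediator to target, consulted only when the direct mapping yields nothing. The Python sketch below illustrates that idea under assumed inputs; it is not one of the approaches this investigation will explore, and all term identifiers and function names are hypothetical.

```python
def compose(source_to_mediator, mediator_to_target):
    """Chain two term mappings, keeping target terms in first-seen order."""
    result = {}
    for source_term, mediator_terms in source_to_mediator.items():
        targets = []
        for mediator_term in mediator_terms:
            for target_term in mediator_to_target.get(mediator_term, []):
                if target_term not in targets:
                    targets.append(target_term)
        result[source_term] = targets
    return result


def translate_with_fallback(term, direct, mediated):
    """Prefer a direct translation; otherwise fall back to the mediated one.

    Returns (translations, route), where route is 'direct', 'mediated' or 'none'.
    """
    hits = direct.get(term, [])
    if hits:
        return hits, "direct"
    hits = mediated.get(term, [])
    if hits:
        return hits, "mediated"
    return [], "none"


# Hypothetical toy mappings between term identifiers.
SOURCE_TO_MEDIATOR = {"s1": ["m1", "m2"], "s2": ["m3"]}
MEDIATOR_TO_TARGET = {"m1": ["t1"], "m2": ["t1", "t2"]}
DIRECT = {"s1": [], "s3": ["t9"]}

MEDIATED = compose(SOURCE_TO_MEDIATOR, MEDIATOR_TO_TARGET)
```

In this toy data, `s3` translates directly, `s1` has no direct translation but reaches `t1` and `t2` through the mediator, and `s2` cannot be translated by either route. Composition can multiply imprecise links, so mediated candidates would still need to be ranked with a semantic distance measure.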

Method

An initial set of (source and target) biomedical vocabularies, e.g., SNOMED, RxNORM and UMDNS, and a mediating vocabulary, e.g., MeSH, will be chosen for experimentation. The baseline benchmark in this case will be the translations obtained using the algorithms developed in Investigation 3. Two broad approaches will be explored in this investigation:

Evaluation Plan

The various algorithms designed will be evaluated along the following dimensions:

Presentations

Publications

Related Web Sites