The Semantic Vocabulary Interoperation Project
Goal
This project plans to investigate methods of providing semantic interoperation
across multiple biomedical vocabularies.
Motivation
Making the healthcare IT infrastructure more efficient and responsive is
viewed as a means of reducing the costs of medical treatment and healthcare.
Standards-based information integration and interoperation across multiple
healthcare information systems is a critical requirement for streamlining that
infrastructure. This has been the focus of e-Government initiatives such as
Consolidated Health Informatics (CHI) and of workshops such as Information
Technologies for Healthcare: Barriers to Implementation (organized by NIST).
In addition, efforts are underway to integrate clinical information in the
context of a medical record, and to store the data in a clinical data
warehouse for analysis and mining, in the Clinical Research Information System
at the NIH.
Biomedical vocabularies and ontologies have a critical role to play in the
process of integration of healthcare information. Clinical and hospital
information systems have used terms from a variety of biomedical vocabularies
to specify codes for healthcare transactions and other pieces of information.
At the same time, initiatives such as CHI and government regulations such as
HIPAA have standardized on biomedical vocabularies that are part of the
Unified Medical Language System (UMLS) effort, e.g., SNOMED, RxNORM and UMDNS.
There will be a critical requirement to support mapping of terms to and from
the various biomedical vocabularies that are part of the UMLS.
The UMLS Metathesaurus and Semantic Network are biomedical knowledge resources
in which an attempt has been made to standardize the semantics of the terms in
various biomedical vocabularies and to capture relationships between those
terms, both within and across vocabularies. The main application focus of the
UMLS has been Information Retrieval applications. In this project, we propose
to apply this knowledge in the context of Information Integration
applications, specifically for mapping and translation of terms across
multiple biomedical vocabularies.
Problem Statement
Given a concept in a particular (source) biomedical terminology, e.g., ICD-9,
how does one determine an equivalent or semantically closest concept or
concept expression (typically involving multiple concepts combined using
boolean and other types of operators) in a target biomedical vocabulary, e.g.,
SNOMED?
The main research challenges in the above effort are:
- The cases where there is a 1-1 mapping of a term in one vocabulary to a term
in another vocabulary are likely to be few. More often, terms will be related
to each other through hyponym/hypernym and other hierarchical relationships.
The key research issue is to estimate the semantic distance between a term
and its multiple translations and to minimize this distance.
- One strategy to minimize the semantic distance between a term and its
translations is the use of term definitions (available in MRATX and MRDEF)
to obtain a mapping to a composition of concepts in the target vocabulary. We
need to explore ways and means of representing the term definitions in a formal
manner to enable automatic translations.
- Whereas we recognize that the translation process is not always possible
without human intervention, we need to explore ways and means for minimizing
human intervention. For this we need to come up with metrics to evaluate our
techniques that will enable us to determine when human input is required to
determine a mapping, for example, when the semantic distance between the term
in the source vocabulary and the term in the target vocabulary is very large.
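As a rough illustration of the first and third challenges, the sketch below
measures semantic distance as the number of hypernym/hyponym links between two
terms and flags a mapping for expert review when that distance exceeds a
threshold. The toy hierarchy, the function names (path_distance, needs_expert)
and the threshold value are assumptions made for illustration only; they are
not taken from the UMLS or from the project's own methods.

```python
from collections import deque

# Hypothetical toy hierarchy (term -> broader terms); not from a real vocabulary.
PARENTS = {
    "viral pneumonia": ["pneumonia"],
    "bacterial pneumonia": ["pneumonia"],
    "pneumonia": ["lung disease"],
    "asthma": ["lung disease"],
    "lung disease": ["disease"],
    "disease": [],
}


def neighbors(term):
    """Terms one hierarchical step away (hypernyms and hyponyms)."""
    up = PARENTS.get(term, [])
    down = [child for child, parents in PARENTS.items() if term in parents]
    return up + down


def path_distance(a, b):
    """Number of hierarchical links on the shortest path, or None if unrelated."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        term, dist = queue.popleft()
        if term == b:
            return dist
        for nxt in neighbors(term):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def needs_expert(distance, threshold=2):
    """Flag a mapping for human review when no path exists or it is too long."""
    return distance is None or distance > threshold
```

Edge counting is only one of many possible distance measures; it is used here
because it is easy to inspect by hand.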
The set of investigations that we plan to pursue as a part of this project are
discussed below.
Demos
A simple demo of term mappings across vocabularies based on synonymy.
Project Plan
Investigation 1: Characterization of Semantic Distance Metrics
Semantic distance metrics will be used in the context of translating a term in a
source biomedical vocabulary to the semantically closest term or term expression
in a target biomedical vocabulary. In this context, we will investigate
approaches for characterizing semantic distances across a broad set of computer
science and medical informatics research areas.
Method
A systematic survey and exploration of semantic distance metrics used in a wide
variety of disciplines will be undertaken. This will include a literature
search of related work in disciplines such as Medical Informatics, Knowledge
Representation, Statistical Clustering, Data Mining, Machine Learning,
Information Retrieval and Natural Language Processing. The various semantic
distance measures will be characterized based on criteria such as whether a
measure is based on the probability distributions characterizing concept
extensions or on intensional definitions of concepts. The relationship between
the type of semantic distance measure and the nature of the applications using
the measure, in conjunction with the availability of resources, will be
explored. Prototypical semantic distance metrics based on the
characterizations will be designed.
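To make the extensional versus intensional distinction concrete, the sketch
below contrasts one measure of each kind. The extensional example follows the
Jiang-Conrath form (information content computed from concept probabilities),
and the intensional example is a Jaccard distance over sets of defining
features. Both are well-known measures from the literature chosen here as
assumptions for illustration; the project has not committed to either, and the
function names and inputs are hypothetical.

```python
import math


def extensional_distance(p_a, p_b, p_common):
    """Jiang-Conrath-style distance from concept probabilities.

    p_a and p_b are the probabilities of observing each concept, and p_common
    is the probability of their most specific shared ancestor.
    """
    def information_content(p):
        return -math.log(p)

    return (information_content(p_a) + information_content(p_b)
            - 2 * information_content(p_common))


def intensional_distance(features_a, features_b):
    """Jaccard distance between the sets of features that define two concepts."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

The first function needs usage statistics (for example, term frequencies in a
corpus), while the second needs only concept definitions, which reflects the
dependence on available resources noted above.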
Evaluation Plan
The various semantic distance metrics will be evaluated in the context of
investigation of approaches for translations of concepts across biomedical
vocabularies discussed below. For a brief description, refer to Investigations
3 and 4 below.
Investigation 2: Formal Representations of UMLS Knowledge
In order to exploit the knowledge available in the UMLS Metathesaurus and the
Semantic Network in the context of algorithms that perform semantic translations
across vocabularies, a formal representation of that knowledge is required. We
shall investigate approaches to explicitly represent UMLS semantics in a formal
manner using KR formalisms.
Method
A systematic evaluation of Semantic Web ontology languages, such as DAML+OIL
and OWL, along with their associated Description Logic implementations, will
be performed. Various approaches for representing concepts and relationships
in the UMLS Metathesaurus and Semantic Network will be considered and
evaluated based on various criteria. The representation of a healthcare
information model, the Reference Information Model (RIM), and its
interrelationships with the Metathesaurus, the Semantic Network and various
biomedical terminologies will also be explored.
In particular, a concept in a biomedical vocabulary can be associated with a
combination of concepts, expressed by associated expressions and definitions.
These relationships will be used by algorithms in the translation process.
Semi-automatic approaches for acquisition of current knowledge into formal KR
specifications will also be designed and considered. In particular,
definitions and associated expressions available in MRATX and MRDEF will be
expressed as DL expressions using NLP techniques.
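One possible in-memory shape for such concept expressions is sketched below:
a definition is a conjunction of named concepts and existential role
restrictions, which can then be printed in a Lisp-like DL syntax. The class
names (Atom, Exists, And), the role name causative_agent and the example
definition are all hypothetical; they illustrate the idea and are not the
project's chosen representation or the content of MRDEF.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Atom:
    """A named concept."""
    name: str


@dataclass(frozen=True)
class Exists:
    """An existential restriction: some <role> <filler>."""
    role: str
    filler: "Expr"


@dataclass(frozen=True)
class And:
    """A conjunction of concept expressions."""
    operands: Tuple["Expr", ...]


Expr = Union[Atom, Exists, And]


def render(expr):
    """Print a concept expression in a simple prefix notation."""
    if isinstance(expr, Atom):
        return expr.name
    if isinstance(expr, Exists):
        return f"(some {expr.role} {render(expr.filler)})"
    return "(and " + " ".join(render(e) for e in expr.operands) + ")"


# Hypothetical definition: pneumonia whose causative agent is a virus.
viral_pneumonia = And((Atom("Pneumonia"), Exists("causative_agent", Atom("Virus"))))
```

Calling render(viral_pneumonia) produces
"(and Pneumonia (some causative_agent Virus))". An expression of this kind
could be handed to a DL reasoner after conversion to the chosen ontology
language.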
Evaluation Plan
The evaluation plan has two parts:
- The impact of the formal specifications on the term translation process.
In particular, we will evaluate the formal specifications on criteria such as
expressiveness (do they capture sufficient information to enable the translation
process, does the specification represent possible relations), tractability (are
computations on the formal specifications such as reasoning efficient) and the
appropriateness of the KR constructs for computing semantic distances.
- The effectiveness of the semi-automatic acquisition process. The use of NLP
approaches to semi-automatically generate formal concept descriptions is
likely to create some incorrect concept expressions, especially among those
generated from the definitions. Experiments to evaluate the error rate of the
generated concept descriptions will be performed.
Investigation 3: Direct Vocabulary Translations
In this phase of the investigation, we will propose algorithms that try to
translate a term from a source to a target biomedical vocabulary directly using
the UMLS Semantics.
Method
An initial set of biomedical vocabularies, e.g., SNOMED, RxNORM and UMDNS, will
be chosen for experimentation. A baseline benchmark of the 1-1 mappings across
multiple vocabularies will be computed. Algorithms that compute a translation
of a term in a source vocabulary to terms in a target vocabulary will be
designed. These algorithms will use the UMLS Semantics represented in a formal
specification. Multiple candidate translations might consist of hypernyms and/or
hyponyms, or concept expressions in the target vocabulary. Algorithms that
merge multiple vocabularies and construct candidate translations by navigating
the merged vocabularies will be developed and tested. The role of semantic
constraints in guiding the navigation-based search for candidate translations
will be explored and used to improve the algorithms. The various semantic
distance measures identified in Investigation 1 will be applied to choose the
best of the candidate translations. Where only poor-quality translations are
obtained for a particular term, we may want to notify a domain expert to
identify the translation. Thresholding schemes that identify situations
requiring the involvement of human experts will be developed.
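The overall flow described above can be sketched as follows: look for a 1-1
mapping through a shared concept, otherwise climb to broader concepts until
one has a term in the target vocabulary, and flag the result for expert review
when the number of steps exceeds a threshold. The concept identifiers, terms,
hierarchy and the translate function are invented for illustration; a real
implementation would draw on the formal UMLS representation and on the
distance measures from Investigation 1 rather than on a simple step count.

```python
# Hypothetical toy data: identifiers, terms and hierarchy are invented.
SOURCE_TERMS = {"src:viral pneumonia": "C1", "src:pneumonia": "C2"}
TARGET_TERMS = {"C2": "tgt:pneumonia", "C3": "tgt:lung disease"}
BROADER = {"C1": "C2", "C2": "C3"}  # concept -> broader concept


def translate(term, max_distance=1):
    """Return (target_term, distance, needs_expert) for a source term.

    distance counts the hypernym steps taken before a target term was found;
    0 means a 1-1 mapping through a shared concept.
    """
    concept = SOURCE_TERMS.get(term)
    distance = 0
    while concept is not None:
        if concept in TARGET_TERMS:
            return TARGET_TERMS[concept], distance, distance > max_distance
        concept = BROADER.get(concept)
        distance += 1
    # No candidate at all: always hand over to a domain expert.
    return None, None, True
```

This sketch only walks upward to hypernyms; hyponyms and composite concept
expressions, which the method also allows as candidates, are left out to keep
the example short.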
Evaluation Plan
The various algorithms designed will be evaluated along the following
dimensions:
- The increase in coverage with respect to the baseline benchmarks.
- The quality of the translations, i.e., the number of relevant or irrelevant
translations returned. The gold standard for this will be developed by a
subjective evaluation of the candidate translations by domain experts.
- The quality of the semantic distance measures. The final choice of a term
translation might be different depending on the semantic distance measure
chosen. The choices made by the domain experts will be contrasted for semantic
closeness, which in turn will enable evaluation of the quality of the semantic
measures.
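The first two dimensions can be made concrete with a small sketch. Here
coverage is taken to be the fraction of source terms that receive at least one
candidate translation, and quality is taken to be the fraction of returned
candidates that domain experts judged relevant. These particular definitions,
the function names and the data layout are assumptions for illustration; the
project may settle on different measures.

```python
def coverage(translations):
    """Fraction of source terms that received at least one candidate.

    translations maps each source term to a list of candidate target terms.
    """
    if not translations:
        return 0.0
    covered = sum(1 for candidates in translations.values() if candidates)
    return covered / len(translations)


def precision(translations, relevant):
    """Fraction of returned candidates that experts judged relevant.

    relevant is a set of (source_term, target_term) pairs from the gold standard.
    """
    returned = [(term, cand)
                for term, candidates in translations.items()
                for cand in candidates]
    if not returned:
        return 0.0
    return sum(1 for pair in returned if pair in relevant) / len(returned)


def coverage_gain(baseline, improved):
    """Increase in coverage of an algorithm over the 1-1 mapping baseline."""
    return coverage(improved) - coverage(baseline)
```

For example, with three source terms of which two receive candidates, coverage
is 2/3; if two of the three returned candidates appear in the expert gold
standard, precision is also 2/3.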
Status
Work has yet to begin on this part of the project.
Investigation 4: Mediated Vocabulary Translations
In some cases, either there may not exist a translation of a term from a source
vocabulary to a target vocabulary, or the translations obtained by using the
direct approach might be of very poor quality. In those cases we may choose a
third vocabulary to mediate a translation between the two biomedical
vocabularies.
Method
An initial set of (source and target) biomedical vocabularies, e.g., SNOMED,
RxNORM and UMDNS, and a mediating vocabulary, e.g., MeSH, will be chosen for
experimentation. The baseline benchmark in this case will be the translations
obtained using algorithms developed in Investigation 3. Two broad approaches,
described in the first two items below, will be explored in this
investigation, together with some related issues:
- Application of algorithms developed in Investigation 3 in two phases: once
for translation of a term from the source to the mediating vocabulary, and next
for translation of the terms in the mediating to the target vocabulary. The key
challenge will be to combine the semantic distances across the two phases into
a composite semantic distance measure to evaluate candidate translations.
- Merging of the three vocabularies into a single integrated vocabulary using
the UMLS semantics. Algorithms need to be developed to navigate this integrated
structure to determine candidate translations. The algorithms for navigating
this integrated structure are likely to be similar to the algorithms developed
in Investigation 3, but with potential differences in the use of semantic
constraints and other information to guide the search process.
- Approaches to combine term translations obtained from the direct translation
approach with those obtained from the mediated translation approach. An
interesting issue that needs to be investigated is how the semantic distance
measure may be updated.
- As in the case of Investigation 3, the various semantic distance measures
identified in Investigation 1 will be applied to choose the best of the
candidate translations in both approaches. Where only poor-quality
translations are obtained for a particular term, we may want to notify a domain
expert to identify the translation. Thresholding schemes that identify
situations requiring the involvement of human experts will be developed.
- We will also explore adaptation of these techniques to discover matching
concepts across different biomedical domains, for example matching concepts
in the UMLS with concepts in the Gene Ontology.
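The two-phase approach in the first item can be sketched as follows. Each
phase yields candidate translations with a distance, and the composite
distance of a source-to-target path is taken here to be the sum of the two
phase distances, keeping the best path per target term. Summation is only one
possible way to combine the distances and is an assumption of this sketch, as
are the candidate tables and the function name.

```python
# Hypothetical candidate tables: term -> list of (translation, distance).
SOURCE_TO_MEDIATOR = {"s:term": [("m:a", 0.0), ("m:b", 1.0)]}
MEDIATOR_TO_TARGET = {
    "m:a": [("t:x", 2.0)],
    "m:b": [("t:x", 0.5), ("t:y", 0.0)],
}


def mediated_candidates(term):
    """Target candidates ranked by composite distance (sum of both phases)."""
    best = {}
    for mediator_term, d1 in SOURCE_TO_MEDIATOR.get(term, []):
        for target_term, d2 in MEDIATOR_TO_TARGET.get(mediator_term, []):
            total = d1 + d2
            # Keep only the shortest path found for each target term.
            if target_term not in best or total < best[target_term]:
                best[target_term] = total
    return sorted(best.items(), key=lambda item: item[1])
```

With the tables above, mediated_candidates("s:term") returns
[("t:y", 1.0), ("t:x", 1.5)]: the path through m:b reaches t:x more cheaply
than the path through m:a, and t:y is closest overall. The same ranked list
could be merged with candidates from the direct approach for comparison.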
Evaluation Plan
The various algorithms designed will be evaluated along the following
dimensions:
- The increase in coverage with respect to the baseline benchmarks, in this
case the results obtained after applying the algorithms developed in
Investigation 3.
- The quality of the translations, i.e., the number of relevant or irrelevant
translations returned. The gold standard for this will be developed by a
subjective evaluation of the candidate translations by domain experts.
- The quality of translations and the level of coverage will also be compared
across the two broad approaches for computing translations discussed in the
Method section above.
- The quality of the semantic distance measures. The final choice of a term
translation might be different depending on the semantic distance measure
chosen. The choices made by the domain experts will be contrasted for semantic
closeness, which in turn will enable evaluation of the quality of the semantic
measures.
- The quality of the mediating vocabulary. We propose to use different
mediating vocabularies to translate the same set of terms and try to determine
which of the vocabularies is a better mediating vocabulary.