| # of documents | SMART + NLP | LSI + TNE |
|---|---|---|
| 1000 | touchgraph 1000 NLP | touchgraph 1000 LSI + TNE |
| 2000 | touchgraph 2000 NLP | touchgraph 2000 LSI + TNE |
| 3000 | touchgraph 3000 NLP | touchgraph 3000 LSI + TNE |
| 4000 | touchgraph 4000 NLP | touchgraph 4000 LSI + TNE |
| 5000 | touchgraph 5000 NLP | touchgraph 5000 LSI + TNE |
| 6000 | touchgraph 6000 NLP | touchgraph 6000 LSI + TNE |
| 7000 | touchgraph 7000 NLP | touchgraph 7000 LSI + TNE |
| 8000 | touchgraph 8000 NLP | touchgraph 8000 LSI + TNE |
| 9000 | touchgraph 9000 NLP | touchgraph 9000 LSI + TNE |
The Taxonomy Generation Framework
The various components of the Taxonomy
Generation Framework are illustrated in Figure 1 below, followed by a brief
discussion of each component.
Data Extraction and Sampling
The Medical Subject Headings (MeSH)
hierarchy, in particular the sub-tree under the concept Cardiovascular
Diseases, which consists of 339 concepts, was chosen as the Gold Taxonomy.
We selected MEDLINE citations that were annotated with concepts from this
sub-tree and that had abstracts associated with them. These abstracts had
the relevant concepts marked as "preferred". The documents were sampled at
different sizes using different techniques based on the underlying
distribution of the documents with respect to the concepts in the taxonomy,
e.g., uniform versus density-biased sampling.
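The difference between the two sampling styles can be sketched as follows. This is a minimal illustration, not the sampling code used in the project: the function names, the one-concept-per-document mapping, and the weight of 1 / (number of documents sharing the concept) are all assumptions made for the example.

```python
import random
from collections import Counter

def uniform_sample(doc_concepts, n, seed=0):
    """Pick n document ids, every document being equally likely."""
    rng = random.Random(seed)
    return rng.sample(sorted(doc_concepts), n)

def density_biased_sample(doc_concepts, n, seed=0):
    """Pick n document ids, down-weighting documents of crowded concepts.

    Assumption: a document's weight is 1 / (number of documents annotated
    with the same concept), so sparsely populated concepts are less likely
    to vanish from a small sample.
    """
    rng = random.Random(seed)
    counts = Counter(doc_concepts.values())
    keyed = []
    for doc in sorted(doc_concepts):
        weight = 1.0 / counts[doc_concepts[doc]]
        # Weighted sampling without replacement: draw u in [0, 1), use
        # u ** (1 / weight) as the key, and keep the n largest keys.
        keyed.append((rng.random() ** (1.0 / weight), doc))
    keyed.sort(reverse=True)
    return [doc for _, doc in keyed[:n]]

# Hypothetical data: eight documents on a common concept, two on a rare one.
docs = {f"d{i}": "Heart Failure" for i in range(8)}
docs.update({"r1": "Aortic Rupture", "r2": "Aortic Rupture"})
print(uniform_sample(docs, 4))
print(density_biased_sample(docs, 4))
```

Both functions return document ids only; in practice a document may carry several concepts, which this sketch does not model.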
NLP Techniques for Pre-processing Sample Data
NLP tools such as a part-of-speech
tagger and a chunk parser were used to identify the noun phrases in
those documents. Variations in applying these techniques, e.g., extracting
"micro" noun phrases of 2-3 words versus "macro" noun phrases of 4-6 words,
and their impact on the results will be explored.
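The micro/macro distinction can be illustrated with a small chunker over text that has already been part-of-speech tagged. This is only a sketch: the chunking rule (a maximal run of adjectives and nouns that ends in a noun, using Penn Treebank style tags) and the function names are assumptions, and the project's actual tagger and chunk parser are not reproduced here.

```python
def noun_phrases(tagged):
    """Return noun phrases (as word lists) from (word, POS tag) pairs.

    Assumed rule: a maximal run of adjectives (JJ*) and nouns (NN*)
    that ends in a noun.
    """
    phrases, run = [], []
    for word, tag in list(tagged) + [("", "END")]:
        if tag.startswith(("JJ", "NN")):
            run.append((word, tag))
            continue
        # Drop trailing adjectives so that the phrase ends in a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            phrases.append([w for w, _ in run])
        run = []
    return phrases

def split_by_size(phrases):
    """Separate "micro" (2-3 words) from "macro" (4-6 words) phrases."""
    micro = [" ".join(p) for p in phrases if 2 <= len(p) <= 3]
    macro = [" ".join(p) for p in phrases if 4 <= len(p) <= 6]
    return micro, macro

# Hypothetical, hand-tagged sentence.
sentence = [
    ("Acute", "JJ"), ("myocardial", "JJ"), ("infarction", "NN"),
    ("affects", "VBZ"),
    ("chronic", "JJ"), ("ischemic", "JJ"), ("heart", "NN"),
    ("disease", "NN"), ("patients", "NNS"), (".", "."),
]
micro, macro = split_by_size(noun_phrases(sentence))
print(micro)  # ['Acute myocardial infarction']
print(macro)  # ['chronic ischemic heart disease patients']
```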
Document Indexing
A vector space model was used for indexing the
documents in the sampled data set. Either words or phrases can be chosen as
the features that define the dimensions of the vector space. Note that
word-based indexing may still be used in conjunction with "micro" or "macro"
noun phrases. We used the SMART Indexing System to index the document set.
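As a rough illustration of vector space indexing, the sketch below turns each document's feature list (words or noun phrases) into a length-normalized vector. It does not call SMART or reproduce its weighting schemes; the tf x log(N/df) weighting and the function name are assumptions chosen for brevity.

```python
import math
from collections import Counter

def index_documents(docs):
    """Map each document (a list of features) to a unit-length vector.

    Assumed weighting: term frequency times log(N / document frequency).
    Returns the sorted vocabulary and one vector per document.
    """
    vocab = sorted({feature for doc in docs for feature in doc})
    n_docs = len(docs)
    doc_freq = Counter(feature for doc in docs for feature in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vec = [term_freq[f] * math.log(n_docs / doc_freq[f]) for f in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm else vec)
    return vocab, vectors

# Features may be single words or extracted noun phrases.
vocab, vectors = index_documents([
    ["heart", "disease"],
    ["heart", "failure"],
    ["stroke"],
])
print(vocab)  # ['disease', 'failure', 'heart', 'stroke']
```

In the first document "disease" receives a higher weight than "heart", because "heart" also occurs in the second document.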
Document Clustering
Clusters of documents are identified using
K-means clustering. The document vectors obtained from the vector space
representation are used to compute the distance between documents. The K-means
algorithm is implemented using a "bisecting K-means" strategy. A hierarchy is
induced by assuming that a new level is created each time a 2-means run is
invoked. To stabilize the hierarchy at the Nth level, an N-means run is invoked.
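The way bisecting induces a hierarchy can be sketched in a few lines. This is an illustration under assumptions, not the project's implementation: it uses squared Euclidean distance, seeds each 2-means run deterministically (the first vector and the vector farthest from it), and omits the N-means stabilization step described above.

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean (centroid) of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def two_means(vectors, iters=20):
    """Split vectors into two lists of positions with plain 2-means."""
    c0 = vectors[0]
    c1 = max(vectors, key=lambda v: dist2(v, vectors[0]))
    prev = None
    for _ in range(iters):
        groups = ([], [])
        for i, v in enumerate(vectors):
            groups[0 if dist2(v, c0) <= dist2(v, c1) else 1].append(i)
        if groups == prev or not groups[0] or not groups[1]:
            break
        prev = groups
        c0, c1 = (mean([vectors[i] for i in g]) for g in groups)
    return groups

def bisect(vectors, ids, depth):
    """Build a binary hierarchy; each 2-means call adds one level."""
    node = {"docs": ids, "children": []}
    if depth == 0 or len(ids) < 2:
        return node
    left, right = two_means([vectors[i] for i in ids])
    if left and right:
        node["children"] = [
            bisect(vectors, [ids[i] for i in group], depth - 1)
            for group in (left, right)
        ]
    return node

# Two well-separated groups of toy document vectors.
vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
tree = bisect(vectors, list(range(len(vectors))), depth=1)
print([child["docs"] for child in tree["children"]])  # [[0, 1, 2], [3, 4, 5]]
```

Each node records the ids of the documents it contains, so a later step can compute centroids or cohesiveness per node.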
Taxonomy Extraction
The hierarchy generated by the document clustering
process does not by itself capture the notion of a taxonomy. According to our
taxonomy extraction hypothesis, nodes at lower levels in the taxonomy should
capture subject categories that correspond to a narrower information space
than nodes at higher levels, and successive levels in the taxonomy should be
sufficiently differentiated to be of interest to the user. The notion of
differentiation is captured by the difference in "cluster cohesiveness"
between successive layers of the taxonomy. The taxonomy user is expected to
suggest a set of "cohesiveness" levels which correspond to the differentiation
between the various layers of the taxonomy. Based on these levels, the taxonomy
extraction algorithm extracts a subset of nodes from the clustering hierarchy
and identifies the taxonomic structure as illustrated in the figure below.
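One possible reading of this extraction step is sketched here. The text does not spell out the algorithm, so the rule used (walking down from the root, a node joins taxonomy layer i when it is the first node on its path whose cohesiveness reaches the i-th user-supplied level) is an assumption, as are the node layout and the function name. Cohesiveness values are taken as given.

```python
def extract_taxonomy(node, levels, layer=0, parent=None, out=None):
    """Collect (name, taxonomy parent, layer) for the extracted nodes.

    node:   {"name": str, "cohesion": float, "children": [...]}
    levels: increasing cohesiveness levels, one per taxonomy layer
            (assumed to be supplied by the taxonomy user).
    """
    if out is None:
        out = []
    if layer < len(levels) and node["cohesion"] >= levels[layer]:
        out.append((node["name"], parent, layer + 1))
        parent, layer = node["name"], layer + 1
    for child in node["children"]:
        extract_taxonomy(child, levels, layer, parent, out)
    return out

def leaf(name, cohesion):
    return {"name": name, "cohesion": cohesion, "children": []}

# Hypothetical clustering hierarchy with made-up cohesiveness values.
hierarchy = {"name": "root", "cohesion": 0.2, "children": [
    {"name": "A", "cohesion": 0.5,
     "children": [leaf("A1", 0.6), leaf("A2", 0.85)]},
    {"name": "B", "cohesion": 0.35,
     "children": [leaf("B1", 0.55), leaf("B2", 0.9)]},
]}
for row in extract_taxonomy(hierarchy, levels=[0.5, 0.8]):
    print(row)
# ('A', None, 1)
# ('A2', 'A', 2)
# ('B1', None, 1)
# ('B2', None, 1)
```

Nodes such as root, B and A1 are skipped because they do not reach the level required at their position, which is how the extracted taxonomy ends up smaller than the clustering hierarchy.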
Label Assignment and Smoothing
The nodes corresponding to the extracted
taxonomy are assigned labels by analyzing the centroid vector. We identify the
terms corresponding to the dimensions with the highest weights in the centroid
vector and choose the top K. These labels are "smoothed" by various
techniques, such as propagating common labels to the parent nodes. We are
also exploring the use of domain-independent thesauri such as WordNet to help in
this process.
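Label assignment and one smoothing step might look like the sketch below. The function names and node layout are hypothetical, and the smoothing rule (a label shared by all children moves up to the parent and is removed from the children) is one assumed interpretation of "propagation of common labels".

```python
def top_k_labels(centroid, vocab, k):
    """Return the k terms with the highest positive centroid weights."""
    ranked = sorted(range(len(vocab)), key=lambda i: (-centroid[i], vocab[i]))
    return [vocab[i] for i in ranked[:k] if centroid[i] > 0]

def propagate_common_labels(node):
    """Move labels shared by all children of a node up to that node.

    node: {"labels": [...], "children": [...]}; processed bottom-up.
    """
    for child in node["children"]:
        propagate_common_labels(child)
    if node["children"]:
        common = set.intersection(
            *(set(child["labels"]) for child in node["children"])
        )
        for child in node["children"]:
            child["labels"] = [l for l in child["labels"] if l not in common]
        kept = [l for l in node["labels"] if l not in common]
        node["labels"] = sorted(common) + kept
    return node

vocab = ["disease", "failure", "heart", "stroke"]
print(top_k_labels([0.1, 0.7, 0.9, 0.0], vocab, k=2))  # ['heart', 'failure']

tree = {"labels": ["disease"], "children": [
    {"labels": ["heart", "failure"], "children": []},
    {"labels": ["heart", "attack"], "children": []},
]}
propagate_common_labels(tree)
print(tree["labels"])  # ['heart', 'disease']
```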
Taxonomy Evaluation
The generated taxonomy is evaluated with respect to
the gold taxonomy using a variety of measures. These measures
typically involve matching individual concepts in the two taxonomies and
checking whether they satisfy the same parent-child relationships in both.
Synonymy may also be used to improve the evaluation process.
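A simple measure of this kind can be sketched as precision and recall over parent-child pairs. The specific measure, the case-insensitive matching, and the synonym table are assumptions for illustration; the measures actually used in the project are not listed in this description.

```python
def parent_child_scores(generated, gold, synonyms=None):
    """Compare two taxonomies given as child -> parent dictionaries.

    Labels are matched case-insensitively; `synonyms` optionally maps a
    lower-cased label to its preferred lower-cased form.
    Returns (precision, recall) over parent-child pairs.
    """
    synonyms = synonyms or {}

    def norm(label):
        label = label.lower()
        return synonyms.get(label, label)

    gen_edges = {(norm(c), norm(p)) for c, p in generated.items()}
    gold_edges = {(norm(c), norm(p)) for c, p in gold.items()}
    matched = gen_edges & gold_edges
    precision = len(matched) / len(gen_edges) if gen_edges else 0.0
    recall = len(matched) / len(gold_edges) if gold_edges else 0.0
    return precision, recall

# Hypothetical fragments of a gold and a generated taxonomy.
gold = {
    "Heart Failure": "Heart Diseases",
    "Myocardial Infarction": "Heart Diseases",
    "Heart Diseases": "Cardiovascular Diseases",
}
generated = {
    "heart failure": "heart diseases",
    "heart attack": "heart diseases",
    "stroke": "heart diseases",
}
print(parent_child_scores(generated, gold))
print(parent_child_scores(generated, gold,
                          {"heart attack": "myocardial infarction"}))
```

Without the synonym table only one of the three generated pairs matches the gold taxonomy; with it, "heart attack" is treated as "myocardial infarction" and two pairs match.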
Project Participants
At NLM - Dr. Vipul Kashyap, Dr. Tom Rindflesch.
At LSDIS - Cartic Ramakrishnan, Christopher Thomas.
At Telcordia - Debasis Basu
Publications
- TaxaMiner: An Experimentation Framework for Automated Taxonomy Bootstrapping [.pdf]
V. Kashyap, C. Ramakrishnan, C. Thomas, D. Bassu, T. C. Rindflesch and A. Sheth, Technical Report, January 2004.
- Towards (Semi-)automatic Generation of Bio-medical Ontologies [Poster]
V. Kashyap, C. Ramakrishnan and T. C. Rindflesch, Poster Proceedings of the AMIA 2003 Annual Symposium, November 2003, Washington, DC.