The TaxaMiner Project

Goal

This project aims to design and develop a framework for automated taxonomy construction from a large corpus of documents, as a first step towards large-scale, automated ontology construction.

Motivation

Ontologies are a central component of the Semantic Web infrastructure. However, it is widely acknowledged that designing and constructing domain-specific ontologies is a human-intensive process that demands substantial cost and time. For the Semantic Web vision to be realized in a scalable manner, it is critical to investigate approaches that reduce human effort and resource commitments. While the broad goal of the endeavor should be the semi-automatic creation of domain ontologies, we begin with an attempt to create an initial thesaurus/taxonomy of concepts using an unsupervised (or minimally supervised) learning approach. This taxonomy forms the vital first step in bootstrapping an ontology from textual documents, which form an overwhelming proportion of the information content available on the web today.

Experimental Results

Experimental results are presented below. Please use "View Full Screen" to get a clear picture while viewing PowerPoint files. Matched labels in the generated taxonomies are shown in capital letters. Nodes with matching labels are displayed in orange; nodes that do not contain a matching label are displayed in turquoise.

# of documents SMART + NLP LSI + TNE
1000 touchgraph 1000 NLP touchgraph 1000 LSI + TNE
2000 touchgraph 2000 NLP touchgraph 2000 LSI + TNE
3000 touchgraph 3000 NLP touchgraph 3000 LSI + TNE
4000 touchgraph 4000 NLP touchgraph 4000 LSI + TNE
5000 touchgraph 5000 NLP touchgraph 5000 LSI + TNE
6000 touchgraph 6000 NLP touchgraph 6000 LSI + TNE
7000 touchgraph 7000 NLP touchgraph 7000 LSI + TNE
8000 touchgraph 8000 NLP touchgraph 8000 LSI + TNE
9000 touchgraph 9000 NLP touchgraph 9000 LSI + TNE

 

The Taxonomy Generation Framework

The components of the Taxonomy Generation Framework are illustrated in Figure 1 below, followed by a brief discussion of each component.

Data Extraction and Sampling

The Medical Subject Headings (MeSH) hierarchy, specifically the sub-tree under the concept Cardiovascular Diseases, which consists of 339 concepts, was chosen as the Gold Taxonomy. We selected MEDLINE citations that were annotated with concepts from this sub-tree and had abstracts associated with them. These abstracts had the relevant concepts marked as "preferred". The documents were sampled at different sizes using different techniques based on the underlying distribution of the documents with respect to the concepts in the taxonomy, e.g., uniform versus density-biased sampling.
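The difference between the two sampling styles can be illustrated with a minimal sketch. It assumes each document is a hypothetical (doc_id, concept) pair and that density-biased sampling weights a document by the inverse size of its concept group; the actual weighting used in the project is not specified here, so the function names and the `exponent` parameter are illustrative only.

```python
import random
from collections import Counter

def uniform_sample(docs, n, seed=0):
    """Draw n documents uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(docs, n)

def density_biased_sample(docs, n, exponent=1.0, seed=0):
    """Draw n documents without replacement, weighting each document by
    1 / (size of its concept group) ** exponent (an assumed weighting)."""
    rng = random.Random(seed)
    counts = Counter(concept for _, concept in docs)
    keyed = []
    for doc in docs:
        weight = 1.0 / counts[doc[1]] ** exponent
        # Efraimidis-Spirakis keys: the n largest u ** (1 / w) values
        # form a weighted sample without replacement.
        keyed.append((rng.random() ** (1.0 / weight), doc))
    keyed.sort(key=lambda kd: kd[0], reverse=True)
    return [doc for _, doc in keyed[:n]]
```

With `exponent=0` every weight is 1 and the draw reduces to uniform sampling; larger exponents shift the sample towards sparsely populated concepts.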

NLP Techniques for Pre-processing Sample Data

NLP tools such as a part-of-speech tagger and a chunk parser were used to identify the noun phrases in those documents. Variations in applying these techniques, e.g., extracting "micro" noun phrases of 2-3 words versus "macro" noun phrases of 4-6 words, and their impact on the results will be explored.
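As a rough illustration of the micro/macro distinction, the sketch below chunks text that has already been tagged. It assumes Penn Treebank style tags supplied by some external tagger and uses a deliberately simple chunk rule (a run of adjectives and nouns ending in a noun); the project's actual tagger and chunk grammar are not described here.

```python
def noun_phrases(tagged):
    """Return maximal runs of JJ*/NN* tokens that end in a noun.

    tagged: list of (word, tag) pairs from a part-of-speech tagger.
    """
    phrases, run = [], []
    # A sentinel token flushes the final run.
    for word, tag in tagged + [("", "END")]:
        if tag.startswith("JJ") or tag.startswith("NN"):
            run.append((word, tag))
            continue
        # Trim trailing adjectives so the phrase ends in a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            phrases.append([w for w, _ in run])
        run = []
    return phrases

def bucket(phrases):
    """Split phrases into micro (2-3 words) and macro (4-6 words)."""
    micro = [" ".join(p) for p in phrases if 2 <= len(p) <= 3]
    macro = [" ".join(p) for p in phrases if 4 <= len(p) <= 6]
    return micro, macro
```

Single-word phrases and phrases longer than six words fall into neither bucket in this sketch.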

Document Indexing

A vector space model was used for indexing the documents in the sampled data set. Words or phrases can be chosen as the features that define the dimensions of the vector space. Note that word-based indexing may still be used in conjunction with "micro" or "macro" noun phrases. We used the SMART Indexing System to index the document set.
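To make the vector space representation concrete, here is a small sketch that is not the SMART system itself. It assumes a plain tf * log(N / df) weighting, which is only one of many weighting schemes, and it treats each document as a list of features that may be words or noun phrases.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of feature lists (words or phrases).

    Returns one sparse vector (dict of feature -> weight) per document.
    """
    n = len(docs)
    # Document frequency: number of documents containing each feature.
    df = Counter(f for doc in docs for f in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({f: c * math.log(n / df[f]) for f, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Because features are just strings, the same code works whether a dimension is a single word or a multi-word phrase.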

Document Clustering

Clusters of documents are identified using K-means clustering. The document vectors obtained from the vector space representation are used to compute the distance between documents. The K-means algorithm is implemented using a "bisecting K-means" strategy. A hierarchy is induced by assuming that a new level is created each time a 2-means run is invoked. To stabilize the hierarchy at the Nth level, an N-means run is invoked.
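The bisecting strategy can be sketched as a recursive 2-means split, shown below under several assumptions: dense vectors, squared Euclidean distance, random initial centroids, and a hypothetical `min_size` stopping rule. The N-means stabilization step is omitted, so this is only an outline of how each 2-means call adds a level to the hierarchy.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def two_means(points, rng, iters=20):
    """Split points into two groups with plain Lloyd iterations."""
    c0, c1 = rng.sample(points, 2)
    left, right = [], []
    for _ in range(iters):
        new_left = [p for p in points if dist2(p, c0) <= dist2(p, c1)]
        new_right = [p for p in points if dist2(p, c0) > dist2(p, c1)]
        if not new_left or not new_right:
            break  # keep the last valid split, if any
        left, right = new_left, new_right
        c0, c1 = mean(left), mean(right)
    return left, right

def bisect(points, min_size=2, rng=None):
    """Return a nested dict tree; each 2-means call adds one level."""
    if rng is None:
        rng = random.Random(0)
    node = {"points": points, "children": []}
    if len(points) > min_size:
        left, right = two_means(points, rng)
        if left and right:
            node["children"] = [bisect(left, min_size, rng),
                                bisect(right, min_size, rng)]
    return node
```

A node is left unsplit when it is already small enough or when 2-means cannot separate it, for example when all of its points are identical.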

Taxonomy Extraction

The hierarchy generated by the document clustering process does not by itself capture the notion of a taxonomy. According to our taxonomy extraction hypothesis, nodes at lower levels in the taxonomy should capture subject categories that correspond to a narrower information space than nodes at higher levels, and successive levels in the taxonomy should be sufficiently differentiated to be of interest to the user. The notion of differentiation is captured by the difference in "cluster cohesiveness" between successive layers of the taxonomy. The taxonomy user is expected to suggest a set of "cohesiveness" levels that correspond to the differentiation between the various layers of the taxonomy. Based on these levels, the taxonomy extraction algorithm extracts a subset of nodes from the clustering hierarchy and identifies the taxonomic structure, as illustrated in the figure below.
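One possible reading of this extraction step is sketched below. It assumes each node of the clustering hierarchy is a dict carrying a precomputed "cohesiveness" score, that the user supplies increasing thresholds (one per taxonomy layer), and that a node's taxonomy children are its shallowest descendants meeting the next threshold. The node layout, the function names, and the rule that leaves are always kept are assumptions of the sketch, not the project's published algorithm.

```python
def frontier(node, threshold):
    """Shallowest descendants of node whose cohesiveness meets threshold.

    Leaves are kept even below the threshold so no cluster is dropped.
    """
    found = []
    for child in node["children"]:
        if child["cohesiveness"] >= threshold or not child["children"]:
            found.append(child)
        else:
            found.extend(frontier(child, threshold))
    return found

def extract_taxonomy(node, levels):
    """Keep node, then attach the frontier for each successive level."""
    taxon = {"name": node["name"], "children": []}
    if levels:
        for desc in frontier(node, levels[0]):
            taxon["children"].append(extract_taxonomy(desc, levels[1:]))
    return taxon
```

Intermediate clustering nodes that fall below a layer's threshold are skipped, which is how the extracted taxonomy becomes shallower and wider than the binary clustering tree.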

Label Assignment and Smoothing

The nodes of the extracted taxonomy are assigned labels by analyzing the centroid vector of each node. We identify the terms corresponding to the dimensions with the highest weights in the centroid vector and choose the top K. These labels are then "smoothed" by various techniques, such as propagating labels common to the child nodes up to the parent node. We are also exploring the use of domain-independent thesauri such as WordNet to help in this process.
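A minimal sketch of label assignment follows, assuming sparse document vectors stored as dicts and a simple smoothing rule in which labels shared by every child are appended to the parent's labels. The function names and the tie-breaking order are illustrative choices rather than details taken from the project.

```python
def centroid(vectors):
    """Average a non-empty list of sparse vectors (dict of term -> weight)."""
    total = {}
    for vec in vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}

def top_k_labels(vectors, k):
    """Return the k terms with the highest centroid weight."""
    c = centroid(vectors)
    # Sort by descending weight; ties are broken alphabetically.
    ranked = sorted(c.items(), key=lambda tw: (-tw[1], tw[0]))
    return [t for t, _ in ranked[:k]]

def propagate_common(parent_labels, children_labels):
    """Append labels shared by all children to the parent's labels."""
    if not children_labels:
        return list(parent_labels)
    common = set.intersection(*map(set, children_labels))
    return list(parent_labels) + sorted(common - set(parent_labels))
```

For example, if every child of a node carries the label "heart", the sketch adds "heart" to that node's own labels.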

Taxonomy Evaluation

The generated taxonomy is evaluated against the gold taxonomy using a variety of measures. These measures typically involve matching individual concepts in the two taxonomies and checking whether the matched concepts satisfy the same parent-child relationships in both. Synonymy may also be used to improve the evaluation process.
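One simple measure of this kind can be sketched as precision and recall over parent-child edges. The sketch assumes each taxonomy is given as a set of (parent, child) label pairs and that concepts match when their labels are equal ignoring case; synonym matching is left out, and this is not necessarily one of the measures the project used.

```python
def edge_scores(generated, gold):
    """Precision and recall of parent-child edges.

    generated, gold: sets of (parent, child) label pairs.
    """
    def norm(edges):
        # Case-insensitive label matching (an assumed matching rule).
        return {(p.lower(), c.lower()) for p, c in edges}

    gen, ref = norm(generated), norm(gold)
    matched = gen & ref
    precision = len(matched) / len(gen) if gen else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    return precision, recall
```

Precision here is the fraction of generated edges that also appear in the gold taxonomy, and recall is the fraction of gold edges that the generated taxonomy recovers.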

Project Participants

    At NLM - Dr. Vipul Kashyap, Dr. Tom Rindflesch.

    At LSDIS - Cartic Ramakrishnan, Christopher Thomas.

    At Telcordia - Debasis Basu.

Datasets

 

Publications