Altering Document Term Vectors for Classification - Ontologies as Expectations of Co-occurrence
This work is an investigative study on exploiting the semantic relatedness between terms in documents to alter document term vectors, demonstrated on the task of document classification. Because Ontologies, as semantic models of a domain, model relationships between entities, they are used as expectations of co-occurrence to alter the importance (weights) of terms in a document. The contribution of this work lies in altering document term vectors semantically and intuitively, not in the classifier algorithms. Although the classification algorithm used in this work is a simple centroid-based classifier, any classification algorithm can use the altered term vectors to achieve what we call semantic document classification. The following figure shows an overview of our approach to altering term vectors.
An example of how weights of terms are altered using the Ontology is shown here.
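As a rough illustration only, the idea of using Ontology relationships as expectations of co-occurrence could be sketched as follows. The weighting scheme is not specified on this page, so everything here is an assumption: the toy ontology, the placeholder entity names (`group_a`, `leader_x`, `region_y`), the `boost` parameter, and the multiplicative update rule are all hypothetical and are not the authors' formula.

```python
from collections import Counter

# Hypothetical toy ontology (assumption): entity -> set of related entities.
ONTOLOGY = {
    "group_a": {"leader_x", "region_y"},
    "leader_x": {"group_a"},
    "region_y": {"group_a"},
}


def syntactic_vector(tokens):
    """Plain term-frequency vector; stands in for Vsyn."""
    return dict(Counter(tokens))


def semantic_vector(v_syn, ontology, boost=0.5):
    """Raise a term's weight for every ontology-related term that also
    occurs in the same document; stands in for Vsem.

    The rule weight * (1 + boost * n_related_present) is an assumed,
    illustrative choice, not the scheme used in the study.
    """
    v_sem = {}
    for term, weight in v_syn.items():
        related_present = [r for r in ontology.get(term, ()) if r in v_syn]
        v_sem[term] = weight * (1.0 + boost * len(related_present))
    return v_sem


tokens = ["group_a", "leader_x", "attack", "group_a", "market"]
v_syn = syntactic_vector(tokens)
v_sem = semantic_vector(v_syn, ONTOLOGY)
print(v_syn)
print(v_sem)
```

In this sketch, terms that co-occur with an Ontology neighbour (`group_a` and `leader_x`) gain weight, while terms that are absent from the Ontology (`attack`, `market`) keep their syntactic weight.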
Evaluations and results
Dataset
  Categories: National Security domain
  Total number of categories: 60: Terror Groups (33), Terror Events (27)
  Total number considered in evaluations: 12 (list of categories)
  Training documents source: Homeland Security digital library, www.hsdl.org
  Testing documents sources:
    http://www.ict.org.il/inter_ter/orgattack.cfm?orgid=38
    http://www.buyandhold.com/index.html
    http://www.house.gov/house/MemStateSearch.shtml
    http://www.lead411.com/top-companies-list.taf
    http://www.state.gov/s/inr/rls/4250.htm
    http://www.senate.gov/general/contact_information/senators_cfm.cfm
    http://www.un.org/Docs/sc/committees/1267/tablelist.htm
    http://www.treas.gov/offices/enforcement/ofac/sdn/

Evaluation
  Comparison criteria
    syntactic term vector Vsyn vs. semantic term vector Vsem
    syntactic term vector Vsyn vs. enhanced term vector Venh-sem
    syntactic term vector Vsyn vs. [syntactic Vsyn U semantic term vector Vsem]
    syntactic term vector Vsyn vs. [syntactic Vsyn U enhanced semantic term vector Venh-sem]
  Metrics of evaluation
    Recall: Of all the documents that should have been classified in a category, how many of them were actually classified, given the application’s ranking of semantic relationships in the Ontology.
    Precision: Of all the documents classified in this category, how many of them were correctly classified, given the application’s ranking of semantic relationships in the Ontology.
  Classification algorithm
    Centroid-based classification [Eui-Hong (Sam) Han and George Karypis, Centroid-Based Document Classification: Analysis and Experimental Results, Principles of Data Mining and Knowledge Discovery, 2000.]
    Implementation supports plug and play of classifiers; the contribution is not in affecting classification algorithms but in changing document term vectors.
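The centroid-based classification and the per-category recall and precision described above can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: the function names, the placeholder terms and labels, and the tiny dataset are assumptions, cosine similarity is assumed as the document-to-centroid measure, and the sketch ignores the application's ranking of semantic relationships that the metric definitions mention.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two sparse term vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(labeled_docs):
    """Average the term vectors of each category into one centroid."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for label, vec in labeled_docs:
        counts[label] += 1
        for term, w in vec.items():
            sums[label][term] += w
    return {label: {t: w / counts[label] for t, w in terms.items()}
            for label, terms in sums.items()}


def classify(vec, centroids):
    """Assign the category whose centroid is most similar to the document."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def precision_recall(gold, predicted, category):
    """Precision and recall for one category over parallel label lists."""
    tp = sum(1 for g, p in zip(gold, predicted)
             if g == category and p == category)
    n_pred = sum(1 for p in predicted if p == category)
    n_gold = sum(1 for g in gold if g == category)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical training vectors (these could be Vsyn or an altered Vsem).
train = [
    ("groups", {"group_a": 3.0, "leader_x": 1.5}),
    ("groups", {"group_a": 1.0, "region_y": 1.0}),
    ("events", {"attack": 2.0, "region_y": 1.0}),
    ("events", {"attack": 1.0, "date_z": 1.0}),
]
centroids = train_centroids(train)

test_docs = [
    {"group_a": 1.0},
    {"attack": 1.0, "region_y": 1.0},
    {"region_y": 1.0, "leader_x": 1.0},
]
gold = ["groups", "events", "events"]
predicted = [classify(doc, centroids) for doc in test_docs]

print(predicted)
print(precision_recall(gold, predicted, "groups"))
print(precision_recall(gold, predicted, "events"))
```

Because the classifier only consumes term vectors, the same `train_centroids` and `classify` calls can be run once with syntactic vectors and once with altered vectors, which is the kind of side-by-side comparison the criteria above describe.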