Semantic Association Identification and Knowledge Discovery for National Security Applications

ITR-0219649

Principal Investigator

Amit Sheth
LSDIS Lab, Department of Computer Science
415 Boyd GSRC,
University of Georgia
Athens
GA 30602-7404

706-542-2310
706-542-4771
amit@cs.uga.edu

http://lsdis.cs.uga.edu/~amit
 

Co-PI

Asst. Prof. I. Budak Arpinar
LSDIS Lab, Department of Computer Science
415 Boyd GSRC,
University of Georgia
Athens
GA 30602-7404

706-583-8249
706-542-2911
budak@cs.uga.edu

Co-PI

Prof. Krys Kochut
Department of Computer Science
415 Boyd GSRC,
University of Georgia
Athens
GA 30602-7404

706-542-3441
706-542-2911
kochut@cs.uga.edu

Keywords

semantic Web
knowledge discovery
semantic association
complex relationships
RDF graphs

semantic metadata extraction
metadata analysis
ontology-driven information system

Project Summary

The role of information technology (IT) is recognized as a critical component of the effort to improve national security, including homeland defense. Applications important to national security, such as aviation security, pose significant challenges to current information technology and provide an excellent source of research problems for developing next-generation IT solutions. This project looks at the opportunity to discover complex relationships from large amounts of heterogeneous data. To achieve this, it relies on new capabilities for automatically extracting semantic metadata (i.e., large-scale semantic annotations) represented as RDF graphs, and defines meaningful complex relationships called semantic associations. It also looks at methods of computing semantic associations, along with the related issues of applying context and ranking the results.

Publications and Products

K. Anyanwu and A. Sheth. The ρ Operator: Discovering and Ranking Associations on the Semantic Web, SIGMOD Record, Vol. 31, No. 4, December 2002, pp. 42-47.

K. Anyanwu and A. Sheth. The ρ Operator: Discovering and Ranking Associations on the Semantic Web, The Twelfth International World Wide Web Conference, Budapest, Hungary, May 2003.

B. Aleman-Meza, C. Halaschek, I. B. Arpinar, and A. Sheth. Context-Aware Semantic Association Ranking, Int. Conf. on Semantic Web and Databases, September 2003, Berlin, Germany.

A. Sheth. Ontology-driven Integration and Analysis for Semantic Applications in Business Intelligence and National Security, Ontology and Semantic Web Technical Exchange Meeting, MITRE, McLean, VA, June 12-13, 2003.

 

A. Sheth. Ontology Driven Information Systems in Action (Capturing and Applying Existing Knowledge to Semantic Applications), invited talk at Sharing the Knowledge - International CIDOC CRM Symposium, March 26-27, Washington, DC.

 

A. Sheth. Relationships at the Heart of Semantic Web: Modeling, Discovering, Validating and Exploiting Complex Semantic Relationship, Keynote address, SOFSEM 2002 (29th Annual Conference on Current Trends in Theory and Practice of Informatics), Milovy, Czech Republic, November 2002.

 

A. Sheth. Semantic Content Management for Enterprises and National Security, Keynote at the Content and Semantic-based Information Retrieval session, in conjunction with the 6th World Multi-conference on Systemics, Cybernetics, and Informatics (SCI 2002), Orlando, Florida, USA, July 14-18, 2002.

Project Impact

The research is leading to new techniques, and improving the effectiveness of existing ones, for identifying semantic associations and discovering knowledge by exploiting a large knowledge base of semantic metadata.
It also contributes to efficient filtering and ranking of these associations by capturing context related to national security. As a result, the semantic associations most relevant to a threat analysis can be computed efficiently.

Furthermore, the project is training three to four graduate students (it funded four graduate students in FY2003: 3 PhD, 1 MS).

Goals, Objectives and Targeted Activities

Specific objectives include (a) ontology-driven lazy semantic metadata extraction (i.e., annotation) to complement traditional active metadata extraction techniques, (b) formal modeling of semantic associations and discovery techniques, (c) computation for semantic association identification, including ontology-based contextual processing, and (d) relevancy ranking of interesting relationships. Our approach involves bootstrapping from earlier research on semantic metadata extraction, multi-ontology query processing, and other tools from the ongoing InfoQuilt project, so that we can create knowledge bases and metadata from publicly available sources to enable meaningful evaluation of the techniques.

Accomplishments over the past year can be summarized as follows:

·        Semantic associations have been formalized based on semantic connectivity. A preliminary formalization of semantic similarity-based associations has also been produced.

·        An ontology for national security applications has been developed in RDFS. This comprehensive ontology has been populated with an RDF knowledge base using extractors run on trusted Web resources related to national security.

·        A PISTA prototype and testbed have been developed for discovering semantic associations. Initial discovery techniques have been implemented in the testbed and are being further improved.

·        Preliminary work has been completed on defining relevant context using ontological regions. An initial ranking scheme has also been developed.
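
The connectivity-based notion of a semantic association listed in the accomplishments above can be illustrated with a minimal Python sketch. It assumes one simple reading, namely that two entities are associated when a chain of properties links them, and that a property may be traversed in either direction. The triples, the entity and property names, and the function find_associations are hypothetical and are not taken from the PISTA prototype.

```python
from collections import defaultdict

# Hypothetical (subject, property, object) triples, for illustration only.
TRIPLES = [
    ("passengerA", "purchasedTicketWith", "card123"),
    ("passengerB", "purchasedTicketWith", "card123"),
    ("passengerB", "memberOf", "organizationX"),
    ("organizationX", "locatedIn", "cityY"),
]


def build_adjacency(triples):
    """Index triples so each property can be followed forward or backward."""
    adjacency = defaultdict(list)
    for subject, prop, obj in triples:
        adjacency[subject].append((prop, obj))
        # Assumption: the inverse direction also counts as a connection.
        adjacency[obj].append((prop + "^-1", subject))
    return adjacency


def find_associations(triples, start, end, max_hops=4):
    """Return every simple path of at most max_hops edges from start to end."""
    adjacency = build_adjacency(triples)
    results = []

    def walk(node, visited, path):
        if node == end:
            results.append(list(path))
            return
        if len(path) == max_hops:
            return
        for prop, neighbor in adjacency.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                path.append((node, prop, neighbor))
                walk(neighbor, visited, path)
                path.pop()
                visited.remove(neighbor)

    walk(start, {start}, [])
    return results


if __name__ == "__main__":
    for path in find_associations(TRIPLES, "passengerA", "organizationX"):
        print(" ; ".join(f"{s} -{p}-> {o}" for s, p, o in path))
```

Exhaustive path enumeration of this kind grows quickly with graph size and path length, which is why a hop limit is used here; it is meant only to make the definition concrete, not to suggest how the discovery algorithms are implemented.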

Objectives for the next year will include the following:

·        Further work on semantic similarity definition and discovery algorithms will be performed.

·        Semantic connectivity finding algorithms will be optimized to scale better to large knowledge bases.

·        New techniques will be explored for a more user-friendly capture of the context. The ranking scheme will also be tested and improved according to the test results.

·        Ontology-driven lazy semantic metadata extraction techniques will be explored.

·        The PISTA prototype will be released with an extended knowledge base.
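
The idea of capturing context with ontological regions and using it to rank associations can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's ranking scheme: a region is assumed to be a weighted set of ontology classes and properties, and an association is scored by the average region weight of the terms along its path. The region names, weights, terms, and function names are all hypothetical.

```python
# Hypothetical regions: name -> (weight, ontology classes and properties in it).
REGIONS = {
    "financial": (0.3, {"purchasedTicketWith", "PaymentCard"}),
    "organizational": (0.7, {"memberOf", "Organization"}),
}


def context_score(path_terms, regions):
    """Average, over the terms on a path, of the best matching region weight.

    Terms that fall in no region contribute 0, so paths that stay inside
    highly weighted regions score highest.
    """
    if not path_terms:
        return 0.0
    total = 0.0
    for term in path_terms:
        total += max(
            (weight for weight, members in regions.values() if term in members),
            default=0.0,
        )
    return total / len(path_terms)


def rank_associations(associations, regions):
    """Order associations (lists of ontology terms) by descending score."""
    return sorted(
        associations,
        key=lambda terms: context_score(terms, regions),
        reverse=True,
    )


if __name__ == "__main__":
    candidates = [
        ["locatedIn", "City"],
        ["purchasedTicketWith", "PaymentCard"],
        ["memberOf", "Organization"],
    ]
    for terms in rank_associations(candidates, REGIONS):
        print(round(context_score(terms, REGIONS), 2), terms)
```

Averaging is only one possible design choice; a scheme could equally penalize long paths or combine context with other criteria.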

 

Area Background

Recently, there have been significant advances in applying techniques from database and information systems, knowledge representation, AI, information retrieval (including text categorization), lexical and language analysis, and other fields to develop a new generation of semantic technologies. Semantic technologies help in associating meaning with data, in organizing and correlating data more meaningfully, in converting data into information for more effective decision making, and in finding information that is contextually relevant to users’ needs. They help with syntactic and representational as well as semantic interoperability. This general area of research is also receiving renewed attention now that there is considerable excitement about the vision of the Semantic Web, characterized as the next phase of the Web.

Results from several of our past and continuing research projects have led to the development of a semantic technology called the Semantic Content Organization and Retrieval Engine (SCORE). Using SCORE’s ability to quickly create ontology-driven agents without programming, it has been possible to (a) quickly create and maintain large knowledge bases (such as over one million entities and relationships per domain) from multiple semi-structured and structured sources of knowledge in largely (but not fully) automated ways, and (b) create semantic (domain-specific) metadata from unstructured (text), semi-structured, and structured sources of static and dynamic (e.g., query-driven) content. This technology has also been commercialized and is being used in aviation security and intelligence applications. While the specifics of these applications cannot be discussed due to government and agency regulations, and many technologically possible capabilities have yet to pass through policy considerations, we imagine a prototype application of homeland security interest that helps identify and screen a passenger with respect to security risk, and use it to develop requirements for relevant IT research. Two important challenges posed by such an application are (a) rapid identification of semantic associations involving entities (such as a passenger or a group of passengers on a flight), and (b) knowledge discovery that identifies semantic associations of interest (such as those that may pose a risk).

Area References

J. McHugh, J. Widom, S. Abiteboul, Q. Luo, and A. Rajaraman. Indexing Semistructured Data. Technical report, Stanford University, Computer Science Department, 1998. http://citeseer.nj.nec.com/mchugh98indexing.html

A. Aho, R. Sethi, J. Ullman. Compilers: Principles, Techniques, and Tools. Addison Wesley Longman, 1988.

http://www.python.org/doc/essays/graphs.html

http://www.hpl.hp.com/semweb/doc/tutorial/index.html

R. Tarjan. Fast Algorithms for Solving Path Problems. J. ACM, Vol. 28, No. 3, July 1981, pp. 594-614.

A. Sheth, C. Bertram, D. Avant, B. Hammond, K. Kochut, and Y. Warke. Semantic Content Management for Enterprises and the Web, IEEE Internet Computing, July/August 2002, pp. 80-87.

S. Alexaki, V. Christophides, G. Karvounarakis, D. Plexousakis, and K. Tolle. The RDFSuite: Managing Voluminous RDF Description Bases. In: Proc. of the 2nd Int. Workshop on the Semantic Web, Hong Kong, 2001.

K. Wilkinson, C. Sayers, and H. Kuno. Efficient RDF Storage and Retrieval in Jena2, Int. Conf. on Semantic Web and Databases, 2003, Berlin, Germany, to appear.
 

Project Websites

http://lsdis.cs.uga.edu/proj/SAI/
This is the main website for our project.
 

Illustrations

Although a very early testbed has been completed, a more robust Web-based demonstration will become available during the second year of the project.
 

Online Data

A prototype implementation of an aviation security application-specific ontology, a corresponding prototype application, and an initial testbed have been completed. Several open source terrorist databases and Web sites were extracted to populate the ontology. Additional sources were extracted to create semantic (ontology-driven) metadata, including information about terrorist events and locations (cities, countries, etc.). This enabled testing of the initial algorithms on a test knowledge base of entities and the relationships between them. The entities (RDF data) are stored in serialized RDF/XML syntax in a file separate from that of the ontology (RDF Schema). The file with the RDF data is currently 1.6 MB in size and contains 6,000 entities. The number of explicit relationships among these entities is over 11,000; this does not account for the relationships that are implicit in the RDF Schema describing the ontology. The testbed data set will grow substantially next year.
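
As a rough illustration of how entity and explicit-relationship counts like those above could be obtained from an RDF/XML data file, the following Python sketch uses only the standard library. It assumes a flat serialization in which every entity is a top-level element carrying rdf:about and every relationship is a child property element carrying rdf:resource; nested descriptions, blank nodes, and other RDF/XML abbreviations are not handled. The sample data, the example.org namespace, and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical sample data; the example.org schema is an assumption.
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/schema#">
  <ex:Person rdf:about="http://example.org/data#personA">
    <ex:name>Person A</ex:name>
    <ex:memberOf rdf:resource="http://example.org/data#orgX"/>
  </ex:Person>
  <ex:Organization rdf:about="http://example.org/data#orgX">
    <ex:locatedIn rdf:resource="http://example.org/data#cityY"/>
  </ex:Organization>
  <ex:City rdf:about="http://example.org/data#cityY"/>
</rdf:RDF>
"""


def count_entities_and_relationships(rdf_xml):
    """Count described entities and entity-to-entity property statements."""
    root = ET.fromstring(rdf_xml)
    entities = set()
    relationships = 0
    for node in root:
        about = node.get(f"{{{RDF_NS}}}about")
        if about is None:
            continue  # Not a plainly identified entity; skipped in this sketch.
        entities.add(about)
        for prop in node:
            # A property pointing at another resource is an explicit
            # relationship; literal-valued properties are plain attributes.
            if prop.get(f"{{{RDF_NS}}}resource") is not None:
                relationships += 1
    return len(entities), relationships


if __name__ == "__main__":
    print(count_entities_and_relationships(SAMPLE))
```

Relationships implied by the RDF Schema, such as those following from subclass or subproperty declarations, would not be counted by this kind of scan, which matches the distinction drawn above between explicit and implicit relationships. For real data, a full RDF parser would be a safer choice than direct XML traversal.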