
Semantic Discovery: Discovering Complex Relationships in Semantic Web

An NSF Medium ITR project

SwetoDblp

SwetoDblp is a large ontology (a spin-off of the SWETO ontology) focused on bibliographic data about Computer Science publications; its main data source is DBLP.
The latest version is now available at SwetoDblp Ontology in Knoesis.

SwetoDblp was created from a large XML document available at DBLP's website, together with other datasets that are used to add relationships to other entities such as Publishers, Companies and Universities. A SAX-parsing process (in Java) takes the XML document as input and creates the SwetoDblp ontology. The schema-vocabulary part of the ontology uses concepts and relationships from FOAF, Dublin Core and OPUS (an ontology used by the back-end system of the LSDIS Library). SwetoDblp can be maintained by reprocessing newer versions of the original XML document from the DBLP website.
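The SAX-based conversion described above can be sketched roughly as follows. This is a minimal Python illustration, not the project's Java code. The namespaces under example.org, the URI scheme for persons and publications, the record key, and the sample title are all assumptions made for the sketch; the real dblp.xml also relies on dblp.dtd, which the sample omits.

```python
import xml.sax
from urllib.parse import quote

# Hypothetical URI scheme and schema namespace, chosen only for this sketch.
PERSON_BASE = "http://example.org/person/"
PUB_BASE = "http://example.org/publication/"
AUTHOR = "http://example.org/schema#author"
# Real vocabulary terms from FOAF and Dublin Core.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"


class DblpHandler(xml.sax.ContentHandler):
    """Collects (subject, predicate, object) triples from DBLP-like records."""

    RECORD_TAGS = {"article", "inproceedings"}

    def __init__(self):
        super().__init__()
        self.triples = []
        self.current_pub = None  # URI of the record currently being read
        self.text = []           # character data of the current element

    def startElement(self, name, attrs):
        if name in self.RECORD_TAGS:
            self.current_pub = PUB_BASE + quote(attrs.get("key", ""), safe="/")
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if name in self.RECORD_TAGS:
            self.current_pub = None
        elif self.current_pub and name == "author" and value:
            # The author becomes an entity with its own URI, not just a string.
            person = PERSON_BASE + quote(value.replace(" ", "_"))
            self.triples.append((self.current_pub, AUTHOR, person))
            self.triples.append((person, FOAF_NAME, value))  # literal object
        elif self.current_pub and name == "title" and value:
            self.triples.append((self.current_pub, DC_TITLE, value))  # literal
        self.text = []


def parse_dblp(xml_text):
    """Parse a DBLP-like XML string and return the collected triples."""
    handler = DblpHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.triples


# Made-up record for illustration; only the author name comes from the text.
SAMPLE = """<dblp>
  <article key="journals/x/Example07">
    <author>Prabhakar Raghavan</author>
    <title>An Example Title</title>
  </article>
</dblp>"""

if __name__ == "__main__":
    for triple in parse_dblp(SAMPLE):
        print(triple)
```

A streaming (SAX) parser suits this task because the input document is large: each record is turned into triples as soon as its closing tag is seen, so the whole file never has to be held in memory.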

The SPARQL endpoint is currently unavailable :( because it kept failing. We're open to suggestions on which RDF store to use! Also, one of our servers is in the shop.

When referring to SwetoDblp, please cite the following:
  B. Aleman-Meza, F. Hakimpour, I.B. Arpinar, A.P. Sheth: SwetoDblp Ontology of Computer Science Publications. In: Web Semantics: Science, Services and Agents on the World Wide Web, volume 5, issue 3, pages 151-155, 2007.

Download Latest Version: August 2007


Version: August-2007
Statistics Summary
Resources:                      2,395,467
Literals:                       3,064,704
Resource-to-Resource Triples:   3,740,438
Resource-to-Literal Triples:    7,274,180

Number of Entities (in main classes)
  560,792  Person (foaf:Person)
  561,895  Articles in Proceedings
  340,488  Journal Articles
   10,610  Webpages of persons
    9,027  Proceedings
    2,530  Book Chapters
    1,235  Books

Number of Relationships (in main relationships)
  900,440  publication-has-author (author)
  438,531  contained in proceedings (isIncludedIn)
  112,303  cites publication
   10,639  has-homepage (foaf:homepage)
   10,461  has-publisher (dc:publisher)
    5,850  in series
    7,308  has affiliation (foaf:workplaceHomepage)
    2,013  owl:sameAs (between people)

 

SwetoDblp goes beyond one-to-one mapping of XML elements to RDF data

  • Every person in the original data becomes an entity with its own URI, which points to her/his DBLP entry page on the web. For example, a data value such as <author>Prabhakar Raghavan</author> from the original XML data becomes an RDF entity with a URI that is more likely to be (re-)used elsewhere.

  • Whenever the homepage of a person is known in the original dataset, that relationship is kept in the resulting RDF using a widely used vocabulary (foaf:homepage).

  • In some cases, the 'affiliation' of a person is automatically extracted from his/her homepage by looking at the actual URL.

  • The affiliation information can be automatically extracted with the help of one of the additional data sources, namely the Universities dataset or the Organizations dataset. The Universities dataset consists of two parts. The first is a list of universities obtained from a web source.

    Affiliation is also extracted from the note elements found in some of the XML elements describing authors' homepages, such as <note>University of Waterloo</note>. In these cases, a lookup operation can provide the affiliation relation by matching the name or an 'alternative' name of a university. Thus, the second part of the Universities dataset is a (much smaller) manually created list of universities containing synonyms and alternative spellings. It also includes universities not listed in the web source mentioned above. The Universities and Organizations datasets are encoded in RDF.

  • DBLP has done a great job of dealing with ambiguous names and name changes. Whenever the original data from DBLP indicates that a person can be referred to by more than one name, the corresponding entities in SwetoDblp are explicitly related with an owl:sameAs relationship.

  • Publisher information is converted to relationships to 'publisher' entities in RDF by using a data source of Publishers (encoded in RDF).

  • Series information, such as Lecture Notes in Computer Science, CEUR Workshops, etc., is converted to relationships to 'series' entities in RDF by using a data source of Series (encoded in RDF).
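The affiliation lookups described in the list above (matching a note value against a university's name or alternative names, and inspecting a homepage URL) might look roughly like the following Python sketch. The dictionary layout, the example.org URI, and the alternative spelling are assumptions for illustration only; the actual Universities dataset is encoded in RDF and is not reproduced here.

```python
from urllib.parse import urlparse

# Hypothetical entry; layout and URI are illustrative, not the real dataset.
UNIVERSITIES = [
    {
        "uri": "http://example.org/university/waterloo",
        "homepage": "http://www.uwaterloo.ca/",
        "names": ["University of Waterloo", "Univ. of Waterloo"],
    },
]


def _host(url):
    """Lower-cased host name of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def build_indexes(universities):
    """Index universities by every known name and by homepage host."""
    by_name, by_host = {}, {}
    for university in universities:
        for name in university["names"]:
            by_name[name.strip().lower()] = university["uri"]
        by_host[_host(university["homepage"])] = university["uri"]
    return by_name, by_host


def affiliation_from_note(note, by_name):
    """Match a <note> value against names and alternative names."""
    return by_name.get(note.strip().lower())


def affiliation_from_homepage(url, by_host):
    """Match a person's homepage URL against university hosts."""
    host = _host(url)
    # Try the full host first, then successively shorter parent domains.
    while host:
        if host in by_host:
            return by_host[host]
        _, _, host = host.partition(".")
    return None


if __name__ == "__main__":
    by_name, by_host = build_indexes(UNIVERSITIES)
    print(affiliation_from_note("University of Waterloo", by_name))
    print(affiliation_from_homepage("http://www.cs.uwaterloo.ca/~someone/", by_host))
```

Keeping synonyms and alternative spellings in the lookup data, rather than in the matching code, is what lets a small manually maintained list extend the coverage of the larger list.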

Schema Vocabulary:

The schema vocabulary of the ontology reuses existing vocabulary whenever possible (e.g., FOAF, DC). In addition, statements are included to indicate the equivalence of classes or properties with respect to other (similar) schemas for describing publications and researchers. In particular, we use owl:equivalentClass and owl:equivalentProperty (where applicable) to relate our schema to those of the MarcOnt Initiative, KnowledgeWeb Portal, SWRC Ontology, AKT Portal Ontology, SWPortal Ontology, and a bibTeX Ontology. A TouchGraph applet illustrates the equivalent classes for SwetoDblp (screenshot).

XML Datatypes

We did not include XML datatypes for literals that are of type string or for which no direct mapping is available; 'pages' is one such case, since its values can contain dashes or letters. We included XML datatypes for the following datatype properties:

  • chapter - xsd:integer
  • mdate - xsd:date
  • month - xsd:gMonth (we kept the original value of month in opus:month for backwards compatibility; we did not produce gMonth values for the few cases that had values such as January/February)
  • year - xsd:gYear
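The datatype rules above can be illustrated with a small Python sketch. The function name and the return shape are hypothetical, and the validation checks are assumptions about what a careful conversion would do; only the property-to-datatype pairs come from the list. The sketch uses the `--MM` lexical form for xsd:gMonth.

```python
import re
from datetime import datetime

XSD = "http://www.w3.org/2001/XMLSchema#"

MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]
MONTH_NUMBER = {name: i for i, name in enumerate(MONTH_NAMES, start=1)}


def typed_literal(prop, value):
    """Return (lexical_form, datatype_uri), with None for a plain literal."""
    value = value.strip()
    if prop == "chapter" and re.fullmatch(r"[0-9]+", value):
        return value, XSD + "integer"
    if prop == "mdate":
        try:
            datetime.strptime(value, "%Y-%m-%d")  # validate only
            return value, XSD + "date"
        except ValueError:
            return value, None
    if prop == "year" and re.fullmatch(r"[0-9]{4}", value):
        return value, XSD + "gYear"
    if prop == "month":
        number = MONTH_NUMBER.get(value.lower())
        if number is not None:
            return "--%02d" % number, XSD + "gMonth"
        # Values such as "January/February" have no gMonth form.
        return value, None
    # Strings and properties without a direct mapping (e.g. pages) stay untyped.
    return value, None


if __name__ == "__main__":
    for prop, value in [("year", "2007"), ("month", "January"),
                        ("month", "January/February"), ("pages", "151-155")]:
        print(prop, typed_literal(prop, value))
```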

Code:

SwetoDblp is created by a SAX-parser process that runs on dblp.xml (available at the DBLP website). A number of domain-dependent mappings are used to produce the RDF. The process reads the data files of Organizations, Universities, Publishers, and Series (available above) and uses them to look up values in order to establish relationships to the entities within them (instead of keeping just the literal values). These files are encoded in RDF (which facilitates the representation of synonyms) and are read using the SemDis API. Hence, the code needs a few jar files from various sources; we are not supposed to host those jar files here, but we indicate which ones are needed and where to get them. The code consists of a few files and is organized as an ant project (the file dblp.xml should be placed in the data directory; the file dblp.dtd should be placed in the working directory).
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

  • Code: code-swetodblp-april2007.zip
  • commons-lang-2.1.jar - from Apache's commons-lang
  • commons-logging.jar - from Apache
  • icu4j_3_0.jar - ICU4J v3.0 from IBM
  • jena_v2_3.jar - Jena's jena.jar version 2.3 (we renamed it to avoid confusion about the version number)
  • semdisAPI_v0_3.jar - from SemDis API
  • semdisImpl_v0_6.jar - from SemDis API (version 0.5 is also OK)
  • xercesImpl.jar - from Xerces

Latest Referrals:
- http://swat.cse.lehigh.edu/resources/onto/
- http://ebiquity.umbc.edu/blogger/2007/08/10/new-swetodblp-dataset-released-with-11m-triples/
- http://planetrdf.com/
- http://ivanherman.wordpress.com/2007/01/13/bibtex-in-rdf/
- http://www.rdfweb.org/topic/ExpertFinder_2fExtendedRDF
- http://clarkparsia.com/weblog/2007/03/19/why-not-sparql-for-dblp/
- http://www.ifi.unizh.ch/ddis/research/semweb/isparql/

Contact Person: Boanerges Aleman-Meza (baleman @ uga . edu)


This material is based upon work supported by the National Science Foundation under Grant No. IIS-0325464 titled "SemDis: Discovering Complex Relationships in Semantic Web". Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.