Semantic Discovery: Discovering Complex Relationships in Semantic Web

An NSF Medium ITR project

Semantic Web Technology Evaluation Ontology (SWETO)

Data Sources used for the Public Access part of the Ontology

  1. USGS gazetteer files. Lists of cities, counties, geographic features, manmade structures, etc.
  2. Rural Airports
  3. http://www.corporateinformation.com/ctrylist.asp?letter=non This site contains a list of worldwide companies; Estimated Number of Extractable Entities: 50,000 (Companies). Classes to Extract: Company (subclass of organization), Country (We may already have this).
    Relationships to Extract: Company thing_located_in Country
  4. http://www.weforum.org/site/knowledgenavigator.nsf/Content/KB+Contributors This page contains a list of people who are contributors to weforum (World Economic Forum). Estimated Number of Extractable Entities: 1400 people names, 1400 company names. Classes to Extract: Company (subclass of organization), Person, Country (We may already have this).
    Relationships to Extract: Company thing_located_in Place (most of the time it is a country). Example: the first line is the name of the person. The second line has the "Position" (create a class for it); add a relationship from Person to Position called "has_position". The position is typically followed by the name of the company/organization, but it is better to take that name from the third line, together with an "organization_description" attribute of the organization (the fourth line). The second line also contains a country name most of the time; make a "situated_at" relationship from the person to that country, and a "locates_at" relationship from the Organization to that country. Take the text of the person's profile as an attribute of Person (personal_profile). Also (manually) create an entity "World Economic Forum" (Organization) and add a "participates_in" relationship for each person extracted.
  5. http://www.aaadir.com/world.jsp This particular page contains lists of banks located throughout the world. Estimated Number of Extractable Entities: 1000+. Classes to Extract: Bank (name) (create class Financial_Organization as subclass of Organization), Country (We may already have this).
    Relationships to Extract: Bank thing_located_in Place (most of the time it is a country). Example: First you select a country. Then you may have to select a state. Then it presents you with a list of banks. The bank name is a hyperlink that takes you to the additional information (URL, Email, Country).
  6. Extract both http://ribbs.usps.gov/files/vendors/cassalln.TXT and http://ribbs.usps.gov/files/vendors/cassallO.TXT (from the parent source: http://www.usps.com/ncsc/ziplookup/vendorslicensees.htm) Extractable: Person, Company (subclass of Organization), City.
    Relationships: Person works_at Company; Company thing_located_in City; and other attributes without including email/phone (that is, not personal information)
  7. http://www.calle.com/world/ Cities of the world; extract the "located_in" relationship from each city to its respective country.
  8. http://www.state.gov/s/inr/rls/10543.htm Create class "Dependency" with attributes "long_name" and "code_FIPS", a "sovereignty_by" relationship to a Country, and a "capital" relationship to a City (last column of the table).
  9. publications Around 2500 "Publications" (create such class, with attributes title, year), with authors (Person); make the relationship "listed_author_of" from Person to Publication. To get the entities, click on "resources", then "Bibliographic References"; on the field "Title" choose the option "contains" and enter the value "a", and also choose "all" in the "Find up to" option. This will return over 2800 records. I believe you can copy/paste the search results to a local location and then extract from there.
  10. DBLP This links to TOC_OUT.txt from which we want to extract (Researcher), etc.
  11. http://portal.acm.org/lookup/ccsnoun.cfm Implicit Subject Descriptors in ACM. Extract node (do not "force new") and noun (do not "force new").
  12. http://www.acm.org/class/1998/overview.html Top Two Levels of The ACM Computing Classification System. We have specific classes in the ontology for them. Keep the relationship with the appropriate "implicit subject descriptor".
  13. http://www.sigmod.org/sigmod/dblp/db/indices/AUTHORS Computer Science (researchers) authors listed by DBLP.
  14. http://citeseer.nj.nec.com/mostcited.html 659481 Computer Science Researchers. Remember here to extract the full name when there is more than one "Miller", etc.
  15. http://www.bis.org/cbanks.htm Create class "Financial Organization" (from Organization). Then "Central Bank" will be a subclass of Financial Organization. Entities in this page are "Central Bank" entities that are "located" in a Country (we should already have something like located_in).
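
The four-line record layout described for the World Economic Forum contributors (source 4) can be sketched as a small function that emits (subject, relationship, object) triples. This is only an illustration, not the extractor actually used: the function name, the tuple representation, the rule that the position is the text before the first comma of the second line, and the list of known country names are all assumptions. The source names both thing_located_in and "locates_at" for the organization-to-country link; the sketch uses thing_located_in.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

WEF = "World Economic Forum"  # entity created manually, per the source description


def contributor_triples(record: List[str], profile: str, countries: Set[str]) -> List[Triple]:
    """Turn one four-line contributor record into triples (hypothetical helper).

    record[0] = person name, record[1] = position line (often with a country),
    record[2] = organization name, record[3] = organization description.
    """
    name, second_line, organization, org_description = (s.strip() for s in record[:4])

    # Assumption: the position is the text before the first comma of line 2.
    position = second_line.split(",", 1)[0].strip()

    triples: List[Triple] = [
        (name, "has_position", position),
        (organization, "organization_description", org_description),
        (name, "personal_profile", profile),
        (name, "participates_in", WEF),
    ]

    # Line 2 names a country most of the time, but not always.
    # Longest names are tried first so "Nigeria" is not matched as "Niger".
    by_length = sorted(countries, key=lambda c: (-len(c), c))
    country = next((c for c in by_length if c in second_line), None)
    if country is not None:
        triples.append((name, "situated_at", country))
        triples.append((organization, "thing_located_in", country))
    return triples
```

Records whose second line names no known country simply produce no location triples, which matches the "most of the time" caveat in the description.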

Data Sources used for the restricted access part of the ontology

  1. Either http://www.usdoj.gov/dea/fugitives/fuglist.htm or http://www.usdoj.gov/dea/directory.htm DEA Fugitives (links to http://www.usdoj.gov/dea/fugitives/*)
  2. http://www.usdoj.gov/dea/programs/explorers/attendees.html Entities of Class: Law Enforcement Organization
    Relationships: Law Enforcement Organization thing_located_in City; City located_in State; Law Enforcement Organization participates_in_event "2002 Explorer's Conference"; "2002 Explorer's Conference" occured_in Flagstaff (City); Flagstaff located_in Arizona (State); "2002 Explorer's Conference" event_date "2002-07-08"
  3. http://www.isp.state.il.us/sor/frames.htm Sex Offenders (create such class from Person). Relationships: "Sex Offender" is_situated_at City; City located_in Illinois (State); and other attributes like DOB, name, address
  4. http://www.ticic.state.tn.us/SEX_ofndr/search_short.asp Sex offenders in TN (create class "Sex Offender" from Person). This site requires entering a county; try "Hamilton", "Montgomery", "Knox", "Shelby", "Rutherford", "Davidson". Relationships to extract: Place (city & state if possible), and other attributes.
  5. If extraction of sex offenders is ok, then follow http://www.usatrace.com/sex.html
  6. http://www.txdps.state.tx.us/mpch/MissingPersons.asp?Date=new Missing Persons (create such class under Person). Relationships to extract: Place (city & state if possible) where last seen, DOB, and other attributes. (This is an example of a site that is updated frequently.)
  7. http://www.state.gov/s/ct/rls/pgtrpt/2002/html/19990.htm Terrorist_Attacks, include Date, Country, and description (for 2002). Relationships: Terrorist_Attack occurred_in Country
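
The relationships listed for the Explorer's Conference attendees (source 2) can be written out as triples. The sketch below is one possible representation, not the project's extractor: the function name and the (organization, city, state) input tuples are assumptions, while the relationship names and the fixed facts about the conference are copied from the list above, including the spelling "occured_in".

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

EVENT = "2002 Explorer's Conference"


def conference_triples(attendees: Iterable[Tuple[str, str, str]]) -> List[Triple]:
    """Build triples for attendee organizations given as (organization, city, state)."""
    # Facts about the event itself, stated once.
    triples: List[Triple] = [
        (EVENT, "occured_in", "Flagstaff"),  # relationship name spelled as in the source
        ("Flagstaff", "located_in", "Arizona"),
        (EVENT, "event_date", "2002-07-08"),
    ]
    # Facts repeated for every Law Enforcement Organization on the page.
    for organization, city, state in attendees:
        triples.append((organization, "thing_located_in", city))
        triples.append((city, "located_in", state))
        triples.append((organization, "participates_in_event", EVENT))
    # Drop exact duplicates (e.g. several organizations in one city), keeping order.
    return list(dict.fromkeys(triples))
```

Deduplicating at the end keeps a single City located_in State triple even when many organizations share a city.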

Sites for possible expansion of the ontology in a later phase:

  1. Databases of plants, animal science, agriculture, ... http://www.fao.org/ag/guides/resource/data.htm

Notes:

  1. 'thing_located_in' should be used instead of 'located_at'
  2. the attribute 'attack_type' seems to be the same as 'type_of_event', please verify this
  3. we currently use the entityID from Semagix as the entity URI, which may not be a good idea because it could change if Semagix is re-installed. Solution: we'll keep backups on different computers and in different locations
  4. change 'code' to 'airport_code'
  5. could we change the attribute 'place_of_birth' in Terrorist or Person (preferred for it to be in Person) so that it is a relationship to a Place entity?
  6. could we change the attribute 'location' in Event so that it is a relationship to a Place entity?
  7. Fix: SWEET_141024
  8. Fix extraction of "Egyptian" as a country (SWEET_141016), Egypt is the correct one (SWEET_130126); similarly for Saudi Arabian, etc.
  9. Merge SWEET_139558 with the correct one?
  10. change the relationship of Airport 'located_at' to: 'thing_located_in' (which is already defined in Thing)
  11. change 'is_assosiated_with' in one extractor of terrorist organizations to 'thing_located_in' (which is already defined in Thing); we won't need 'is_assosiated_with' anymore because it is easily confused with the one we have already defined from a Person to an Organization
  12. We should clarify why we need 'city_located_in_state'
  13. We should clarify why we need 'state_located_in_country'
  14. Need: Faculty
  15. Need: Researchers from Industry labs
  16. Need: papers in ACM by relating them to subject descriptors
  17. Can we get rid of 'located_in'?
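
Several of the notes above (1, 4, 8 and 10) amount to a cleanup pass over already extracted data. A minimal sketch of such a pass, assuming the data is available as (subject, relationship, object) triples, could look like the following. The function name, the triple representation, and the idea of applying the renames globally are assumptions; in particular, renaming 'code' everywhere is only safe if no class other than Airport uses that attribute.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Notes 1 and 10: use thing_located_in instead of located_at.
# Note 4: rename the attribute code to airport_code (assumed to occur only on Airport).
RENAMED_RELATIONSHIPS = {
    "located_at": "thing_located_in",
    "code": "airport_code",
}

# Note 8: "Egyptian" (SWEET_141016) was extracted as a country; Egypt is SWEET_130126.
# Further pairs (e.g. for "Saudi Arabian") would be added once their IDs are known.
MERGED_ENTITIES = {
    "SWEET_141016": "SWEET_130126",
}


def clean_triples(triples: Iterable[Triple]) -> List[Triple]:
    """Rename relationships and redirect merged entity IDs, dropping duplicates."""
    cleaned: List[Triple] = []
    seen = set()
    for subject, relationship, obj in triples:
        fixed = (
            MERGED_ENTITIES.get(subject, subject),
            RENAMED_RELATIONSHIPS.get(relationship, relationship),
            MERGED_ENTITIES.get(obj, obj),
        )
        if fixed not in seen:  # merging can make two triples identical
            seen.add(fixed)
            cleaned.append(fixed)
    return cleaned
```

Keeping the renames and merges in two small tables makes it easy to review each change against the notes before it is applied.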


This material is based upon work supported by the National Science Foundation under Grant No. IIS-0325464 titled "SemDis: Discovering Complex Relationships in Semantic Web". Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.