Semantic Discovery: Discovering Complex Relationships in Semantic Web
An NSF Medium ITR project
Data Sources used for Public Access part of the Ontology
USGS gazetteer files. Lists of cities, counties, geographic features, man-made structures, etc.
- http://www.corporateinformation.com/ctrylist.asp?letter=non This
site contains a list of worldwide companies; Estimated Number of Extractable
Entities: 50,000 (Companies). Classes to Extract: Company (subclass of organization),
Country (We may already have this).
Relationships to Extract: Company thing_located_in Country
- http://www.weforum.org/site/knowledgenavigator.nsf/Content/KB+Contributors This
particular page contains a list of people who are contributors to weforum
(World Economic Forum). Estimated Number of Extractable Entities: 1400
people names, 1400 company names. Classes to Extract: Company (subclass
of organization), Person, Country (We may already have this).
Relationships to Extract: Company thing_located_in Place (most of the time
it is a country). Example: the first line of each entry is the name of the
person. The second line has the "Position" (create a class for it); add a
relationship from Person to Position called "has_position". The position is
typically followed by the name of the company/organization, but it is better
to take that name from the third line, together with an
"organization_description" attribute of the organization (the fourth line).
Back on the second line, there is usually also a country name; make a
"situated_at" relationship from the person to that country, and a "locates_at"
relationship from the Organization to that country. Take the text of the
person's profile as an attribute of Person (personal_profile). Also (manually)
create an entity "World Economic Forum" (Organization) and add a
"participates_in" relationship for each person extracted.
- http://www.aaadir.com/world.jsp This
particular page contains lists of banks located throughout the world. Estimated
Number of Extractable Entities: 1000+. Classes to Extract: Bank (name)
(create class Financial_Organization as subclass of Organization), Country
(We may already have this).
Relationships to Extract: Bank thing_located_in Place (most of the time it
is a country). Example: first you select a country. Then you may need to select
a state. Then it presents you with a list of banks. The bank name is a hyperlink
that takes you to the additional information (URL, Email, Country).
- Extract both http://ribbs.usps.gov/files/vendors/cassalln.TXT and http://ribbs.usps.gov/files/vendors/cassallO.TXT (from
the parent source: http://www.usps.com/ncsc/ziplookup/vendorslicensees.htm)
Extractable: Person, Company (subclass of Organization), City.
Relationships: Person works_at Company; Company thing_located_in City; and
other attributes without including email/phone (that is, not personal information)
- http://www.calle.com/world/ Cities
of the world, extract the relationship "located_in" to its respective country
- http://www.state.gov/s/inr/rls/10543.htm Create
class "Dependency" with attribute: "long_name"; with attribute "code_FIPS".
With relationship "sovereignty_by" to a Country. With relationship "capital" to
a City (last column of the table).
- publications Around 2500 "Publications" (create
such class, with attributes title, year), with authors (Person); make
the relationship "listed_author_of" from Person to Publication. To get
the entities, click on "resources", then "Bibliographic References"; in
the field "Title" choose the option "contains" and enter the value "a",
and choose "all" in the "Find up to" option. This will return over 2800
records. I believe you can copy/paste the search results to a local file
and then extract from there.
- DBLP This
links to TOC_OUT.txt from which we want to extract (Researcher), etc.
- http://portal.acm.org/lookup/ccsnoun.cfm Implicit
Subject Descriptors in ACM. Extract node (do not "force new") and noun
(do not "force new").
- http://www.acm.org/class/1998/overview.html Top
Two Levels of The ACM Computing Classification System. We have specific
classes in the ontology for them. Keep the relationship with the appropriate "implicit
- http://www.sigmod.org/sigmod/dblp/db/indices/AUTHORS Computer
Science (researchers) authors listed by DBLP.
- http://citeseer.nj.nec.com/mostcited.html 659481
Computer Science Researchers. Remember here to extract the full name when
there is more than one "Miller", etc.
- http://www.bis.org/cbanks.htm Create
class "Financial Organization" (from Organization). Then "Central Bank" will
be a subclass of Financial Organization. Entities in this page are "Central
Bank" entities that are "located" in a Country (we should already have
something like located_in).
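As a rough illustration of the World Economic Forum contributor entry described earlier in this list (person name, position line, organization, description, profile), the sketch below turns one entry into (subject, relationship, object) triples. It rests on assumptions: the position line is assumed to be comma-separated with the country among its later fields, and the `instance_of` relationship, the `entry_to_triples` function and the sample entry are all hypothetical. The real page layout may differ.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

WEF = "World Economic Forum"  # entity created manually, per the notes


def entry_to_triples(lines: List[str], known_countries: Set[str]) -> List[Triple]:
    """Convert one contributor entry (a list of text lines) into triples.

    Assumed layout (not verified against the live page):
      line 1: person name
      line 2: "Position, Organization, Country" (comma-separated)
      line 3: organization name
      line 4: organization description
      rest:   personal profile text
    """
    if len(lines) < 4:
        raise ValueError("expected at least 4 lines per entry")
    name, second, org, org_desc = (s.strip() for s in lines[:4])
    profile = " ".join(s.strip() for s in lines[4:])

    fields = [f.strip() for f in second.split(",")]
    position = fields[0]
    # Look for a known country among the remaining fields, last field first.
    country: Optional[str] = next(
        (f for f in reversed(fields[1:]) if f in known_countries), None
    )

    triples: List[Triple] = [
        (name, "instance_of", "Person"),
        (position, "instance_of", "Position"),
        (name, "has_position", position),
        (org, "instance_of", "Organization"),
        (org, "organization_description", org_desc),
        (WEF, "instance_of", "Organization"),
        (name, "participates_in", WEF),
    ]
    if profile:
        triples.append((name, "personal_profile", profile))
    if country is not None:
        # The country is optional: the notes say it appears most of the time.
        triples.append((name, "situated_at", country))
        triples.append((org, "locates_at", country))
    return triples


# Hypothetical entry, made up for illustration only.
sample = [
    "Jane Doe",
    "Chief Economist, Example Corp, Canada",
    "Example Corp",
    "A fictional consulting firm.",
    "Jane Doe writes about trade policy.",
]
triples = entry_to_triples(sample, {"Canada", "Egypt"})
```

The organization name is taken from the third line rather than the second, as the notes recommend; the second line is only used for the position and the country.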
Data Sources used for Restricted Access part of the Ontology
- Either http://www.usdoj.gov/dea/fugitives/fuglist.htm or http://www.usdoj.gov/dea/directory.htm DEA
Fugitives ( links to http://www.usdoj.gov/dea/fugitives/* )
- http://www.usdoj.gov/dea/programs/explorers/attendees.html Entities
of Class: Law Enforcement Organization
Relationships: Law Enforcement Organization thing_located_in City; City located_in
State; Law Enforcement Organization participates_in_event "2002 Explorer's
Conference"; "2002 Explorer's Conference" occured_in Flagstaff (City); Flagstaff
located_in Arizona (State); "2002 Explorer's Conference" event_date "2002-07-08"
- http://www.isp.state.il.us/sor/frames.htm Sex
Offenders (create such class from Person). Relationships: "Sex Offender" is_situated_at
City; City located_in Illinois (State); and other attributes like DOB,
- http://www.ticic.state.tn.us/SEX_ofndr/search_short.asp Sex
offenders in TN (create class "Sex Offender" from Person). This site requires
entering a county; try "Hamilton", "Montgomery", "Knox", "Shelby", "Rutherford", "Davidson".
Relationships to extract, Place (city & state if possible), and other
- If extraction of sex offenders is ok, then follow http://www.usatrace.com/sex.html
- http://www.txdps.state.tx.us/mpch/MissingPersons.asp?Date=new Missing
Persons (create such class under Person). Relationships to extract, Place
(city & state if possible) where last seen, DOB, and other attributes.
(this is an example of a site that is updated frequently)
- http://www.state.gov/s/ct/rls/pgtrpt/2002/html/19990.htm Terrorist_Attacks,
include Date, Country, and description (for 2002). Relationships: Terrorist_Attack
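The Explorer's Conference relationships listed in the attendees entry above can be written out as triples to make the location chain explicit. This is a minimal sketch, assuming a simple in-memory list of triples. The organization name, its city and its state are hypothetical placeholders, while the conference facts are the ones given in that entry. The helper `places_of` is illustrative and not part of any existing extractor.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

EVENT = "2002 Explorer's Conference"
ORG = "Example Police Department"  # hypothetical Law Enforcement Organization

triples: List[Triple] = [
    (ORG, "thing_located_in", "Example City"),        # hypothetical city
    ("Example City", "located_in", "Example State"),  # hypothetical state
    (ORG, "participates_in_event", EVENT),
    (EVENT, "occured_in", "Flagstaff"),  # relationship name spelled as in the notes
    ("Flagstaff", "located_in", "Arizona"),
    (EVENT, "event_date", "2002-07-08"),
]

LOCATION_RELATIONSHIPS = {"thing_located_in", "located_in", "occured_in"}


def places_of(entity: str, data: List[Triple]) -> List[str]:
    """Follow location relationships outward from an entity (city, then state)."""
    found: List[str] = []
    seen: Set[str] = {entity}
    current = entity
    while True:
        # Take the first unvisited place the current node points to.
        nxt = next(
            (o for s, p, o in data
             if s == current and p in LOCATION_RELATIONSHIPS and o not in seen),
            None,
        )
        if nxt is None:
            return found
        found.append(nxt)
        seen.add(nxt)
        current = nxt
```

Calling `places_of(EVENT, triples)` walks from the conference to its city and then to the state, which is the kind of multi-step relationship the extracted data is meant to support.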
Sites for possible expansion of the ontology in a later phase:
- Databases of plants, animal science, agriculture, ... http://www.fao.org/ag/guides/resource/data.htm
- 'thing_located_in' should be used instead of 'located_at'
- the attribute 'attack_type' seems to be the same as 'type_of_event', please
- we currently use the entityID from Semagix as the entity URI, which may not
be a good idea, since it could change if Semagix is re-installed, right?
Solution: we'll have backups on different computers and in different locations
- change 'code' to 'airport_code'
- could we change the attribute 'place_of_birth' in Terrorist or Person (preferred
for it to be in Person) so that it is a relationship to a Place entity?
- could we change the attribute 'location' in Event so that it is a relationship
to a Place entity?
- Fix: SWEET_141024
- Fix extraction of "Egyptian" as a country (SWEET_141016), Egypt is the
correct one (SWEET_130126); similarly for Saudi Arabian, etc.
- Merge SWEET_139558 with the correct one?
- change the relationship of Airport 'located_at' to: 'thing_located_in'
(which is already defined in Thing)
- change 'is_assosiated_with' in one extractor of terrorist organizations
to 'thing_located_in' (which is already defined in Thing); we won't need this 'is_assosiated_with'
anymore because it is easily confused with the one we have already defined from
a Person to an Organization
- We should clarify why we need 'city_located_in_state'
- We should clarify why we need 'state_located_in_country'
- Need: Faculty
- Need: Researchers from Industry labs
- Need: papers in ACM by relating them to subject descriptors
- Can we get rid of 'located_in'?
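Several of the notes above amount to a cleanup pass over already-extracted data: renaming relationships and replacing demonyms that were extracted as countries. The sketch below shows one possible shape for such a pass, assuming triples are held as plain tuples and that each subject's class is available in a dictionary. The class name `Terrorist_Organization`, the scoping of 'code' to Airport, and the function `normalize` are assumptions for illustration, not the project's actual tooling.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# (subject class, old name) -> new name. Scoped by class so that the
# Person-to-Organization 'is_assosiated_with' is left untouched.
RENAMES: Dict[Tuple[str, str], str] = {
    ("Airport", "located_at"): "thing_located_in",
    ("Airport", "code"): "airport_code",  # assumes 'code' belongs to Airport
    ("Terrorist_Organization", "is_assosiated_with"): "thing_located_in",
}

# Demonyms wrongly extracted as countries -> the correct country.
DEMONYMS: Dict[str, str] = {"Egyptian": "Egypt", "Saudi Arabian": "Saudi Arabia"}

LOCATION_RELATIONSHIPS = {"thing_located_in", "located_in"}


def normalize(triples: List[Triple], class_of: Dict[str, str]) -> List[Triple]:
    """Apply the renames, then fix demonyms used as location objects."""
    cleaned: List[Triple] = []
    for subject, relationship, obj in triples:
        key = (class_of.get(subject, ""), relationship)
        relationship = RENAMES.get(key, relationship)
        if relationship in LOCATION_RELATIONSHIPS:
            obj = DEMONYMS.get(obj, obj)
        cleaned.append((subject, relationship, obj))
    return cleaned
```

Keying the rename table on the subject's class keeps the change local to the extractors named in the notes, so a relationship with the same name on another class is not rewritten by accident.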