SWETO Data Sources
Data Sources used for Public Access part of the Ontology
-
http://www.corporateinformation.com/ctrylist.asp?letter=non
This site contains a list of companies worldwide.
Estimated Number of Extractable Entities: 50,000 (Companies).
Classes to Extract: Company (subclass of organization), Country (We may already have this).
Relationships to Extract: Company thing_located_in Country
-
http://www.weforum.org/site/knowledgenavigator.nsf/Content/KB+Contributors
This particular page contains a list of people who are contributors to the World Economic Forum (weforum).
Estimated Number of Extractable Entities: 1400 people names, 1400 company names.
Classes to Extract: Company (subclass of organization), Person, Country (We may already have this).
Relationships to Extract:
Company thing_located_in Place (most of the time it is a country).
Example: the first line of each entry is the name of the person. The second line has the "Position" (create a class for it);
create a relationship from Person to Position called "has_position". The position is typically followed by the name of
the company/organization, but it is better to take that name from the third line, together with an "organization_description"
attribute of the organization (taken from the fourth line).
Back on the second line, there is usually a country name; create a relationship "situated_at" from the
person to that country, and a relationship "locates_at" from the Organization to that country.
Take the text of the person's personal profile as an attribute of Person (personal_profile).
Also (manually) create an entity "World Economic Forum" (Organization) and add a "participates_in" relationship
to it for each person extracted.
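The line-by-line rules above can be sketched roughly as follows. This is only an illustration: the function name record_to_triples, the known_countries argument, and above all the assumption that the second line is comma-separated (position first, country last) are not taken from the site and would need to be checked against the real page.

```python
WEF = "World Economic Forum"  # the manually created Organization entity


def record_to_triples(lines, known_countries):
    """Turn one contributor entry (a list of text lines) into triples.

    ASSUMPTION: line 2 is comma-separated, with the position first and
    the country (when present) last. The real layout may differ.
    """
    name = lines[0].strip()
    parts = [p.strip() for p in lines[1].split(",")]
    position = parts[0]
    # The organization name is taken from line 3, its description from line 4.
    organization = lines[2].strip()
    triples = [
        (name, "has_position", position),
        (organization, "organization_description", lines[3].strip()),
        (name, "participates_in", WEF),
    ]
    # Only treat the last field of line 2 as a country if we recognize it.
    country = parts[-1]
    if len(parts) > 1 and country in known_countries:
        triples.append((name, "situated_at", country))
        triples.append((organization, "locates_at", country))
    # Any remaining lines are treated as the personal profile text.
    profile = " ".join(line.strip() for line in lines[4:]).strip()
    if profile:
        triples.append((name, "personal_profile", profile))
    return triples
```

Checking the last field against a list of known countries is one way to cope with entries that have no country on the second line.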
-
http://www.aaadir.com/world.jsp
This particular page contains lists of banks located throughout the world.
Estimated Number of Extractable Entities: 1000+.
Classes to Extract: Bank (name) (create class Financial_Organization as subclass of Organization),
Country (We may already have this).
Relationships to Extract: Bank thing_located_in Place (most of the time it is a country).
Example: first you select a country, then you may need to select a state, and then the site presents a list of banks.
Each bank name is a hyperlink that leads to additional information (URL, Email, Country).
-
Extract both
http://ribbs.usps.gov/files/vendors/cassalln.TXT
and
http://ribbs.usps.gov/files/vendors/cassallO.TXT
(from the parent source: http://www.usps.com/ncsc/ziplookup/vendorslicensees.htm)
Extractable:
Person, Company (subclass of Organization), City.
Relationships:
Person works_at Company;
Company thing_located_in City;
plus other attributes, excluding email and phone (that is, no personal information)
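A minimal sketch of the rule above, assuming each vendor entry has already been parsed into a dictionary. The key names ("person", "company", "city") and the choice to attach the remaining attributes to the company are hypothetical; the actual layout of the two text files is not described here.

```python
EXCLUDED = {"email", "phone"}  # personal contact details are not extracted


def vendor_to_triples(record):
    """Build triples from one parsed vendor entry.

    ASSUMPTION: record is a dict with the hypothetical keys 'person',
    'company' and 'city', plus any number of extra attributes.
    """
    person, company, city = record["person"], record["company"], record["city"]
    triples = [
        (person, "works_at", company),
        (company, "thing_located_in", city),
    ]
    for key, value in record.items():
        if key in ("person", "company", "city"):
            continue
        if key.lower() in EXCLUDED:
            continue  # skip email/phone
        # ASSUMPTION: remaining attributes describe the company.
        triples.append((company, key, value))
    return triples
```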
-
http://www.calle.com/world/
Cities of the world. Extract the relationship "located_in" from each city to its respective country.
-
http://www.state.gov/s/inr/rls/10543.htm
Create class "Dependency" with attributes "long_name" and "code_FIPS",
a relationship "sovereignty_by" to a Country,
and a relationship "capital" to a City (last column of the table).
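As an illustration only, the "Dependency" class could be modeled like this before being loaded into the ontology. The column order assumed by row_to_dependency is a guess; the notes above only say that the capital is in the last column of the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dependency:
    long_name: str
    code_FIPS: str
    sovereignty_by: str       # name of the sovereign Country
    capital: Optional[str]    # name of the capital City, if one is listed


def row_to_dependency(row):
    """Convert one table row (a list of cell strings) to a Dependency.

    ASSUMPTION: cells 0-2 hold long name, FIPS code and sovereign country.
    Only 'capital is the last column' comes from the notes.
    """
    capital = row[-1].strip() or None  # an empty cell means no capital
    return Dependency(
        long_name=row[0].strip(),
        code_FIPS=row[1].strip(),
        sovereignty_by=row[2].strip(),
        capital=capital,
    )
```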
-
publications
Around 2500 "Publications" (create such class, with attributes title, year), with authors (Person)
and make the relationship "listed_author_of" from Person to Publication.
To get the entities, click on "resources", then "Bibliographic References".
In the "Title" field choose the option "contains" and enter the value "a";
also choose "all" in the "Find up to" option. This will return over 2800 records.
I believe you can copy/paste the search results to a local file and then extract from there.
-
DBLP
This links to TOC_OUT.txt from which we want to extract (Researcher), etc.
-
http://portal.acm.org/lookup/ccsnoun.cfm
Implicit Subject Descriptors in ACM. Extract node (do not "force new") and noun (do not "force new").
-
http://www.acm.org/class/1998/overview.html
Top Two Levels of The ACM Computing Classification System. We have specific classes in the ontology for them.
Keep the relationship with the appropriate "implicit subject descriptor".
-
http://www.sigmod.org/sigmod/dblp/db/indices/AUTHORS
Computer Science (researchers) authors listed by DBLP.
-
http://citeseer.nj.nec.com/mostcited.html
659481 Computer Science Researchers. Remember to extract the full name when more than one researcher shares a surname (for example, more than one "Miller").
-
http://www.bis.org/cbanks.htm
Create class "Financial Organization" (from Organization). Then "Central Bank" will be a subclass of Financial Organization.
Entities in this page are "Central Bank" entities that are "located" in a Country (we should already have something like located_in).
Data Sources used for restricted access part of the ontology
-
Either
http://www.usdoj.gov/dea/fugitives/fuglist.htm
or
http://www.usdoj.gov/dea/directory.htm
DEA Fugitives (links to http://www.usdoj.gov/dea/fugitives/*)
-
http://www.usdoj.gov/dea/programs/explorers/attendees.html
Entities of Class: Law Enforcement Organization
Relationships: Law Enforcement Organization thing_located_in City; City located_in State;
Law Enforcement Organization participates_in_event "2002 Explorer's Conference";
"2002 Explorer's Conference" occured_in Flagstaff (City); Flagstaff located_in Arizona (State);
"2002 Explorer's Conference" event_date "2002-07-08"
-
http://www.isp.state.il.us/sor/frames.htm
Sex Offenders (create such class from Person).
Relationships: "Sex Offender" is_situated_at City; City located_in Illinois (State); and other
attributes like DOB, name, address
-
http://www.ticic.state.tn.us/SEX_ofndr/search_short.asp
Sex offenders in TN (create class "Sex Offender" from Person).
This site requires entering a county; try "Hamilton", "Montgomery", "Knox", "Shelby", "Rutherford", "Davidson".
Relationships to extract: Place (city & state if possible), and other attributes.
-
If extraction of sex offenders is ok, then follow
http://www.usatrace.com/sex.html
-
http://www.txdps.state.tx.us/mpch/MissingPersons.asp?Date=new
Missing Persons (create such class under Person).
Relationships to extract: Place (city & state if possible) where last seen; also DOB and other attributes.
(this is an example of a site that is updated frequently)
-
http://www.state.gov/s/ct/rls/pgtrpt/2002/html/19990.htm
Terrorist_Attacks, include Date, Country, and description (for 2002).
Relationships: Terrorist_Attack occurred_in Country
Sites for possible expansion of the ontology in a later phase:
-
Databases of plants, animal science, agriculture, ...
http://www.fao.org/ag/guides/resource/data.htm
Notes:
1.- 'thing_located_in' should be used instead of 'located_at'
2.- the attribute 'attack_type' seems to be the same as 'type_of_event', please verify this
3.- we currently use as entity URI the entityID from Semagix, which may not be a good idea
considering it could change if Semagix is re-installed, right? Solution: we'll have backups in different
computers and locations
4.- change 'code' to 'airport_code'
5.-
6.- could we change the attribute 'place_of_birth' in Terrorist or Person (preferred for it to be in Person) so that it is a relationship to a Place entity?
6.- could we change the attribute 'location' in Event so that it is a relationship to a Place entity?
7.- Fix: SWEET_141024
8.- Fix extraction of "Egyptian" as a country (SWEET_141016), Egypt is the correct one (SWEET_130126);
similarly for Saudi Arabian, etc.
9.- Merge SWEET_139558 with the correct one?
10.- change the relationship of Airport 'located_at' to: 'thing_located_in' (which is already defined in Thing)
11.- change 'is_assosiated_with' in one extractor of terrorist organizations to 'thing_located_in' (which is already defined in Thing);
we won't need 'is_assosiated_with' anymore because it is easily confused with the one we have already defined from a Person to an Organization
12.- We should clarify why we need 'city_located_in_state'
13.- We should clarify why we need 'state_located_in_country'
14.- Can we get rid of 'located_in'?
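For note 8, one possible fix is a small lookup applied to a candidate country name before a Country entity is created or matched. This is a sketch, not the extractor's actual logic: the table only covers the two cases named in the note, and the function name normalize_country is hypothetical.

```python
# Adjective forms that were wrongly extracted as countries (see note 8).
DEMONYM_TO_COUNTRY = {
    "Egyptian": "Egypt",
    "Saudi Arabian": "Saudi Arabia",
}


def normalize_country(name):
    """Map an adjective form to its country name; leave other names as-is."""
    cleaned = name.strip()
    return DEMONYM_TO_COUNTRY.get(cleaned, cleaned)
```

With this in place, "Egyptian" would resolve to the existing "Egypt" entity instead of creating a second one; further adjective forms would have to be added to the table by hand.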
This material is based upon work supported by the National Science Foundation
under Grant No. <not assigned yet>. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the author(s) and do not
necessarily reflect the views of the National Science Foundation.
Contact person for this page content:
Boanerges Aleman-Meza ( baleman@uga.edu )