In particular I will touch on the following:
-Java Jena framework
-OpenCalais
-SPARQL
(Note: you should have a basic understanding of SPARQL and RDF, and some exposure to Jena)
OpenCalais and other text-mining toolkits are relevant to the semantic web and linked data because, in order to "semantify" anything and convert it to the precious triple, we need to know what this "thing" is: whether it is a "person", a "company/organization", "dog", "car", "term", "social tag", "protein", etc. This is called "named entity extraction". Only then does it make sense to wrap it in an RDF triple and mash it up with other conceptually similar "things", so that we end up with a linked data graph.
OpenCalais is particularly useful for business data such as RSS feeds about deals, investment news and other financial content, since it can identify business events such as 'mergers', 'acquisitions', 'IPO', etc.
In this example I will show how you can take raw text, call the OpenCalais web service through a client library in your favorite language (in this case Java), and obtain disambiguated "named entities". Afterwards we will use those to build an RDF model (as Jena calls it).
Prerequisites:
Java 1.5 +
Jena Framework download
OpenCalais API key
OpenCalais Java command line client library
Because of the dependencies of both Jena and the OpenCalais web service, the following jars are required to run the example:
wsdl4j-1.5.1.jar
commons-discovery-0.2.jar
icu4j_3_4.jar
iri.jar
axis.jar
xercesImpl.jar
xml-apis.jar
jaxrpc.jar
commons-logging.jar
arq.jar
jena.jar
calais-client.jar
Now let's get to work. The main class of the Calais Java client is CalaisClient. In the call to the web service's "enlighten" method you need to specify licenseID (which is your API key), the content to be parsed, and the configuration ("paramsXML").
The configuration file typically has the following content:
<c:params xmlns:c='http://s.opencalais.com/1/pred/' xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<c:processingDirectives c:contentType='text/txt' c:outputFormat='XML/RDF' >
</c:processingDirectives>
<c:userDirectives c:allowDistribution='true' c:allowSearch='true' c:externalID='YOUR CALAIS API KEY' c:submitter='ABC'>
</c:userDirectives>
<c:externalMetadata>
</c:externalMetadata>
</c:params>
This essentially says that you want the response from the service to be in XML/RDF format and that the content you are passing is plain text.
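If you prefer to assemble the configuration in code rather than keep it in a file, a minimal sketch could look like the following. The class and method names (ParamsXmlSketch, buildParamsXml, escapeAttr) are my own illustration and are not part of the Calais client library; the attribute values are simply the ones from the configuration shown above.

```java
// Sketch only: builds the paramsXML document as a String.
// ParamsXmlSketch, buildParamsXml and escapeAttr are hypothetical names,
// not part of the Calais client library.
class ParamsXmlSketch {

    /** Escapes characters that are unsafe inside a single-quoted XML attribute. */
    static String escapeAttr(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("'", "&apos;");
    }

    /** Returns the configuration with the given API key and submitter filled in. */
    static String buildParamsXml(String apiKey, String submitter) {
        return "<c:params xmlns:c='http://s.opencalais.com/1/pred/' "
             + "xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
             + "<c:processingDirectives c:contentType='text/txt' c:outputFormat='XML/RDF'>"
             + "</c:processingDirectives>"
             + "<c:userDirectives c:allowDistribution='true' c:allowSearch='true' "
             + "c:externalID='" + escapeAttr(apiKey) + "' "
             + "c:submitter='" + escapeAttr(submitter) + "'>"
             + "</c:userDirectives>"
             + "<c:externalMetadata></c:externalMetadata>"
             + "</c:params>";
    }

    public static void main(String[] args) {
        System.out.println(buildParamsXml("YOUR CALAIS API KEY", "ABC"));
    }
}
```

The resulting string can then be passed as the paramsXML argument in place of the file content.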
The content I cut and pasted from a random RSS post:
"
Press Release: Cherokee International (NASDAQ: CHRK) announced today that it has entered into a definitive merger agreement with Lineage Power Holdings, Inc., a Gores Group company, under which Lineage will acquire all of the outstanding shares of Cherokee International. Under the terms of the agreement, stockholders of Cherokee International will receive $3.20 per share of common stock held, in an all cash transaction, representing an aggregate enterprise value of approximately $105 million. The transaction has been unanimously approved by the board of directors of Cherokee International, and certain stockholders have agreed to vote their Cherokee International shares in favor of the transaction.
'We believe the sale of Cherokee to Lineage will add value and scale for our customers,' said Jeffrey Frank, Cherokee President and Chief Executive Officer. 'Over the past 30 years, Cherokee has earned a great reputation for our strong engineering team, manufacturing, quality and responsiveness, all of which come down to our outstanding employees and our focus on the customer. Going forward, our employees and customers will be well served by becoming part of Lineage and The Gores Group portfolio of companies. Gores has a stellar reputation for customer satisfaction and the proven ability to profitably grow its businesses.'
According to Ryan Wald, Managing Director of The Gores Group, Cherokee will become a division of Lineage and will continue to be a leader in the custom power solutions marketplace. 'We are impressed by the accomplishments that Jeff and his management team have made to date regarding Cherokee's North American and Asian operations,' said Mr. Wald. 'We look forward to partnering with them in those regions to create a more compelling value proposition for our combined customers.'
The transaction is subject to the approval of Cherokee International's stockholders and to regulatory approvals.
..........
"
You can pass the whole thing as a string to CalaisClient as follows:
CalaisClient calaisWebService= new CalaisClient(true);
String response = calaisWebService.enlighten(licenseID, content, paramsXML);
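Since the example program receives the name of a text file on the command line, the content string has to be read from disk first. A minimal sketch of that step, assuming Java 7 or later and UTF-8 input (ContentLoaderSketch and readContent are hypothetical names, not taken from the example source):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch only: loads the raw text that will be handed to the web service.
// ContentLoaderSketch is an illustrative name; UTF-8 is an assumption.
class ContentLoaderSketch {

    /** Reads the whole file into one String, decoding it as UTF-8. */
    static String readContent(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ContentLoaderSketch <text-file>");
            return;
        }
        String content = readContent(Paths.get(args[0]));
        System.out.println("Read " + content.length() + " characters");
    }
}
```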
Calais's SPARQL Details
The returned value is an XML/RDF document containing RDF elements specific to OpenCalais.
We are going to run a SPARQL query on that response.
The Calais web service is based on unique identifiers (much like Freebase's "guid"). To run any SPARQL query you need to understand how the Calais response is formed. The response looks something like this:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:c="http://s.opencalais.com/1/pred/">
<rdf:Description c:allowDistribution="true" c:allowSearch="true" c:calaisRequestID="5cc4c21c-2b61-7ab6-128e-7381a111daff" c:externalID="APIKEY" c:id="http://id.opencalais.com/KqNYvrMU8Jd0XY2IAMPEUA" rdf:about="http://d.opencalais.com/dochash-1/79f93e32-ce1d-3986-a00b-4c421b58c998">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/sys/DocInfo"/>
<c:document>....
To execute a SPARQL query against it you need to be familiar with the basic namespaces defined by OpenCalais. A good reference is:
Calais Schema Entity/Fact/Event Definitions and Descriptions
Also have a look at:
Response format
In the current example I have concentrated only on RSS feed posts that have information regarding executives involved in business events.
Considering the structure of the Calais RDF, I defined the following namespaces:
PREFIX ent: <http://s.opencalais.com/1/type/em/e/>
PREFIX evn: <http://s.opencalais.com/1/type/em/r/>
PREFIX pred: <http://s.opencalais.com/1/pred/>
which respectively define entity, event and predicate ("property" in RDF lingo).
A sample RDF triple as represented by OpenCalais looks like:
http://d.opencalais.com/pershash-1/84b30306-1139-38c2-9457-05f3f3e4e1c2
http://s.opencalais.com/1/pred/name
"Jeffrey Frank"
Here "http://s.opencalais.com/1/pred/name" is the predicate, whose object is the literal value "Jeffrey Frank", and
"http://d.opencalais.com/pershash-1/84b30306-1139-38c2-9457-05f3f3e4e1c2"
is the unique id of the resource ("Resource") that all nodes related to it refer to.
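To make the subject/predicate/object split concrete, here is a small sketch that holds the sample triple in a plain Java object and prints it in N-Triples notation. TripleSketch is a hypothetical helper written for this post; it is not a Jena or OpenCalais class, and it only handles literal objects.

```java
// Sketch only: a hand-rolled triple holder, not a Jena or OpenCalais class.
class TripleSketch {
    final String subject;   // URI of the resource
    final String predicate; // URI of the property
    final String literal;   // plain literal value of the object

    TripleSketch(String subject, String predicate, String literal) {
        this.subject = subject;
        this.predicate = predicate;
        this.literal = literal;
    }

    /** Escapes backslash, double quote and newline for an N-Triples literal. */
    static String escapeLiteral(String value) {
        return value.replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n");
    }

    /** Renders the triple as one N-Triples line. */
    String toNTriples() {
        return "<" + subject + "> <" + predicate + "> \""
             + escapeLiteral(literal) + "\" .";
    }

    public static void main(String[] args) {
        TripleSketch t = new TripleSketch(
            "http://d.opencalais.com/pershash-1/84b30306-1139-38c2-9457-05f3f3e4e1c2",
            "http://s.opencalais.com/1/pred/name",
            "Jeffrey Frank");
        System.out.println(t.toNTriples());
    }
}
```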
Once you have established the schema, it is easy to run SPARQL queries on the returned result. A great help for that is the W3C Validation Service: RDF Online Validation
For our specific example we are interested in the following: Person (in this case some executive), executive title and Company.
In the Calais RDF schema these are represented as
"http://s.opencalais.com/1/pred/person",
"http://s.opencalais.com/1/pred/position"
and "http://s.opencalais.com/1/pred/company".
You also need to consider the "literal" values.
So a query like "give me all nodes that have a 'person' with a 'name', a 'position' (the title) with a 'name', and a 'company' with a 'name'" is:
PREFIX ent: <http://s.opencalais.com/1/type/em/e/>
PREFIX evn: <http://s.opencalais.com/1/type/em/r/>
PREFIX pred: <http://s.opencalais.com/1/pred/>
SELECT ?s ?name ?title ?company WHERE {
?s a evn:PersonCareer .
?s pred:person ?p .
?p pred:name ?name .
?s pred:position ?k .
?k pred:name ?title .
?s pred:company ?d .
?d pred:name ?company }
"PersonCareer" is the OpenCalais type representing a graph that contains both http://s.opencalais.com/1/pred/person and http://s.opencalais.com/1/pred/position.
RDF model creation based on extracted entities
After running SPARQL query on the returned response from Calais's web service I end up with the following:
Ryan Wald Managing Director The Gores Group LLC
Jeffrey Frank President and Chief Executive Officer Cherokee International
What do I do with those? Well, now that I have extracted meaningful entities out of raw text I can "wrap" them in RDF and broadcast them on the "semantic web" to be indexed.
When converting anything into RDF you want to consider which "ontology" would be most suitable. An ontology is essentially a dictionary of meaning for representing the data at hand. Currently the semantic web has a rich set of ontologies (you can check some of those at Existing Ontologies).
For the present example FOAF (Friend-of-a-Friend), Relationship and vCard are the best candidates.
It is particularly important in Jena to register a prefix for each ontology namespace you use, since Jena has the unpleasant habit of writing namespaces it has no prefix for under generated prefixes such as "j.0".
So to represent the above I have done:
model=ModelFactory.createDefaultModel();
model.setNsPrefix("relationship", "http://purl.org/vocab/relationship/");
model.setNsPrefix("dbpedia", "http://dbpedia.org/property/");
model.setNsPrefix("foaf", "http://xmlns.com/foaf/0.1/");
This essentially assigns an abbreviation to each existing, known ontology. The task is to embed the person entity in a FOAF Person so that it is semantically recognizable by agents on the "open web". Thus I create a "Resource" (in Jena-speak) with name "Ryan Wald" and title "Managing Director" who works for a company "The Gores Group LLC" like this:
..
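// foaf:name is the property that will carry the person's name
Property name = model.createProperty("http://xmlns.com/foaf/0.1/name");
// foaf:Person is the class used as the rdf:type of the executive
Person = model.createProperty("http://xmlns.com/foaf/0.1/Person");
executive = model.createResource("http://www.yanago.com/example/html/example.rdf#Ryan Wald");
executive.addProperty(RDF.type, Person);
executive.addProperty(name, "Ryan Wald");
// the title comes from the vCard vocabulary shipped with Jena
executive.addProperty(VCARD.TITLE, "Managing Director");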
..
The essence of the "Subject-Property-Object" triple idea is as follows: my "subject" is a "resource" residing at "http://www.yanago.com/example/html/example.rdf#Ryan Wald"; it is of type "Person" and has a "property" VCARD.TITLE with the value "Managing Director". The "object" is another resource, which resides at "http://www.yanago.com/example/html/example.rdf#The Gores Group LLC".
In Jena this is achieved simply by writing:
model.add(executive, employedBY, company);
The above represents a Statement saying: "We have a resource by the name of 'Ryan Wald' who is a 'person' and has the property 'employed by', which connects him to another resource named 'The Gores Group LLC', which is of type 'company'."
This is how we hooked two "nodes/resources" together by means of an ontology and thus created a meaningful connection between them, to be interpreted by machines.
This is the underlying idea of linked data: to connect scattered resources on the web so that applications ("intelligent agents") can understand their meaning.
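One caveat: the resource URIs used in this example contain spaces ("#Ryan Wald"), which are not legal in a URI. If you want stricter identifiers, one option is to let java.net.URI percent-encode the name when it is used as the fragment. The sketch below is my own suggestion and is not what the example source does; ResourceUriSketch and resourceUri are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Sketch only: builds a resource URI whose fragment is the entity name,
// percent-encoding characters (such as spaces) that a URI may not contain.
class ResourceUriSketch {

    static String resourceUri(String host, String path, String name) {
        try {
            // The multi-argument URI constructor quotes illegal characters.
            return new URI("http", host, path, name).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URI for: " + name, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            resourceUri("www.yanago.com", "/example/html/example.rdf", "Ryan Wald"));
        // prints http://www.yanago.com/example/html/example.rdf#Ryan%20Wald
    }
}
```

The returned string can be handed to model.createResource in place of the hand-written URI.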
To run the above example, download the archive at:
Example src
Then:
-unpack the bundle above
-change the paramsXml file and assign your Calais API key to externalID
-edit the RDFExample.java file and replace the string "YOUR CALAIS KEY" with the same key as above
-compile the source
-create a text file with some raw textual content (in my case I used the enclosed "example.txt")
-execute as follows:
Linux
java -classpath ".:calais-client.jar:jena.jar:arq.jar:commons-logging.jar:jaxrpc.jar:xercesImpl.jar:axis.jar:iri.jar:icu4j_3_4.jar:commons-discovery-0.2.jar:wsdl4j-1.5.1.jar" RDFExample example.txt
Windows
java -classpath ".;calais-client.jar;jena.jar;arq.jar;commons-logging.jar;jaxrpc.jar;xercesImpl.jar;axis.jar;iri.jar;icu4j_3_4.jar;commons-discovery-0.2.jar;wsdl4j-1.5.1.jar" RDFExample example.txt
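The two commands differ only in the classpath separator (":" on Linux, ";" on Windows). As a small illustration, the sketch below prints the command for whatever platform it runs on by using File.pathSeparator. ClasspathSketch is a hypothetical helper, not part of the example bundle; the jar names are the ones listed above.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Sketch only: prints the launch command with the platform's own
// classpath separator. ClasspathSketch is an illustrative name.
class ClasspathSketch {

    static final List<String> JARS = Arrays.asList(
        ".", "calais-client.jar", "jena.jar", "arq.jar", "commons-logging.jar",
        "jaxrpc.jar", "xercesImpl.jar", "axis.jar", "iri.jar", "icu4j_3_4.jar",
        "commons-discovery-0.2.jar", "wsdl4j-1.5.1.jar");

    /** Joins the entries with the given separator. */
    static String join(List<String> entries, String separator) {
        StringBuilder sb = new StringBuilder();
        for (String entry : entries) {
            if (sb.length() > 0) {
                sb.append(separator);
            }
            sb.append(entry);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String classpath = join(JARS, File.pathSeparator);
        System.out.println("java -classpath \"" + classpath + "\" RDFExample example.txt");
    }
}
```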
The sample output is something like this:
<rdf:RDF
    xmlns:dbpedia="http://dbpedia.org/property/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#"
    xmlns:relationship="http://purl.org/vocab/relationship/">
  <foaf:Person rdf:about="http://www.example.com#Jeffrey Frank">
    <relationship:employedBy>
      <foaf:Organisation rdf:about="http://www.example.com#Cherokee International">
        <relationship:employerOf rdf:resource="http://www.example.com#Jeffrey Frank"/>
        <foaf:name>Cherokee International</foaf:name>
      </foaf:Organisation>
    </relationship:employedBy>
    <vcard:TITLE>President and Chief Executive Officer</vcard:TITLE>
    <foaf:name>Jeffrey Frank</foaf:name>
  </foaf:Person>
  <foaf:Person rdf:about="http://www.example.com#Ryan Wald">
    <relationship:employedBy>
      <foaf:Organisation rdf:about="http://www.example.com#The Gores Group LLC">
        <relationship:employerOf rdf:resource="http://www.example.com#Ryan Wald"/>
        <foaf:name>The Gores Group LLC</foaf:name>
      </foaf:Organisation>
    </relationship:employedBy>
    <vcard:TITLE>Managing Director</vcard:TITLE>
    <foaf:name>Ryan Wald</foaf:name>
  </foaf:Person>
</rdf:RDF>