Keyword Extraction
Extracts a set of relevant keywords and keyphrases from text input or from a file that contains text (PDF, DOC, TXT). It applies several stages of lexical and statistical analysis and is built on top of the GATE framework from the University of Sheffield.
Overview
Key Functionality
Provides a list of keywords, keyphrases and terminology for a given document that contains a sufficient amount of natural language (English only for now). The returned structure is a Map with keyphrases as keys and their relevance (significance) as values, ordered by significance. Significance is a double value in the interval [0, 1] and indicates the relative importance of the associated keyphrase for the document in question.
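Because the result is a plain Map of keyphrase to significance, it can be post-processed with standard collection code. The following is a minimal sketch, not part of the service API: the class name KeyphraseFilter, the method atLeast and the cut-off value are hypothetical, and the sample map is hand-built to stand in for a real result.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeyphraseFilter {

    // Keeps only entries whose significance is at least minSignificance,
    // preserving the iteration order of the input map.
    static Map<String, Double> atLeast(Map<String, Double> keyphrases, double minSignificance) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : keyphrases.entrySet()) {
            if (entry.getValue() >= minSignificance) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hand-built stand-in for a map returned by the service (assumption).
        Map<String, Double> sample = new LinkedHashMap<>();
        sample.put("inertial frame", 0.524);
        sample.put("relativity", 0.513);
        sample.put("special case", 0.023);
        System.out.println(atLeast(sample, 0.5).keySet()); // prints [inertial frame, relativity]
    }
}
```

A LinkedHashMap is used for the result so that the significance ordering of the input carries over to the filtered map.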
Use Cases
- Provide the URL of a blog post, wiki page, or any other HTTP-accessible document as input, and use the recommended keyphrases as tags.
- Extract textual content as a String via the Aperture services and extract keyphrases from it.
- Extract keyphrases from PDF, DOC, RTF or TXT documents on your desktop, and use those recommendations to fill in missing metadata descriptions in the documents (e.g. the keywords field in PDF/MS Word documents).
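For the tagging and metadata use cases, the returned Map has to be turned into a short list of tags or a single keywords string. The sketch below shows one way this might be done; the class KeywordTags and its methods are hypothetical helpers, and the comma-separated format is only an assumption about what a keywords field expects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class KeywordTags {

    // Returns the n most significant keyphrases. The entries are sorted
    // explicitly by value (descending), so the result does not depend on
    // the iteration order of the input map.
    static List<String> topTags(Map<String, Double> keyphrases, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(keyphrases.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        int limit = Math.max(0, Math.min(n, entries.size()));
        List<String> tags = new ArrayList<>();
        for (Map.Entry<String, Double> entry : entries.subList(0, limit)) {
            tags.add(entry.getKey());
        }
        return tags;
    }

    // Joins the tags into one comma-separated string.
    static String asKeywordsField(List<String> tags) {
        return String.join(", ", tags);
    }

    public static void main(String[] args) {
        // Hand-built stand-in for a map returned by the service (assumption).
        Map<String, Double> sample = new LinkedHashMap<>();
        sample.put("inertial frame", 0.524);
        sample.put("relativity", 0.513);
        sample.put("time", 0.489);
        System.out.println(asKeywordsField(topTags(sample, 2))); // prints inertial frame, relativity
    }
}
```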
Web Interface
A web interface to the KeywordExtraction service, for testing the functionality, is available on the DERI server.
Authors/Developers
KeyphraseExtraction: AlexanderSchutz, DERI/NUIG, conceptual planning, implementation, documentation.
Installation Instructions
Installation via checkout from Nepomuk SVN
The service and its dependencies have been integrated as a bundle that is available for the Eclipse Target Platform.
Using the component
Javadoc: TextAnalytics Extraction Services Interface
Webservice / WSDL / etc
alternative WSDL Location for NepomukType
Web Service Target NS (this is defunct for some reason)
Examples
Extracting from a URL (may be on the web or on the desktop)
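// Snippet from a calling class; java.util.Map must be imported.
// Both variables are assumed to be fields of that class (so they default to null);
// 'services' is expected to be supplied by the Nepomuk platform before this code runs.
NepomukServices services;
TextAnalyticsService taservice;
// look up the service lazily on first use
if (taservice == null) {
    taservice = services.getService(TextAnalyticsService.class);
}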
// state the location of a document via a well-formed URL (as String)
String documentUrl = "http://nobelprize.org/nobel_prizes/physics/laureates/1921/einstein-lecture.pdf";
Map<String, Double> keyphraseMap = taservice.getKeywordsFromUrl(documentUrl);
// do something with the returned Map, e.g. print each keyphrase with its relevance
for (Map.Entry<String, Double> keyphrase : keyphraseMap.entrySet()) {
    String phrase = keyphrase.getKey();
    double relevance = keyphrase.getValue();
    System.out.println(phrase + " -- " + relevance);
}
This should produce output similar to the following:
inertial frame -- 0.524
relativity -- 0.513
time -- 0.489
system -- 0.485
Nature -- 0.475
stipulation -- 0.469
physically -- 0.456
theory -- 0.416
motion -- 0.206
laws -- 0.390
states -- 0.371
bodies -- 0.364
point -- 0.338
terms -- 0.331
gravitational field -- 0.312
problem -- 0.300
identical clocks -- 0.242
principles -- 0.213
concept -- 0.197
mechanics -- 0.187
light -- 0.175
frame -- 0.142
equations -- 0.131
rigid -- 0.117
field -- 0.088
space -- 0.081
rest relative -- 0.081
important -- 0.023
special case -- 0.023
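The listing shows relevance values with three decimal places, whereas printing a raw double with System.out.println usually shows more digits. Presumably the values were rounded for display. A minimal sketch of such formatting follows; the class KeyphraseFormat and the method line are hypothetical and not part of the service.

```java
import java.util.Locale;

class KeyphraseFormat {

    // Formats one result line as "phrase -- 0.123" with three decimal places.
    // Locale.ROOT keeps the decimal separator a dot on every system.
    static String line(String phrase, double relevance) {
        return String.format(Locale.ROOT, "%s -- %.3f", phrase, relevance);
    }

    public static void main(String[] args) {
        System.out.println(line("relativity", 0.5131)); // prints relativity -- 0.513
    }
}
```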
Extracting from Plain Text / String
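// Snippet from a calling class; java.util.Map must be imported.
// Both variables are assumed to be fields of that class (so they default to null);
// 'services' is expected to be supplied by the Nepomuk platform before this code runs.
NepomukServices services;
TextAnalyticsService taservice;
// look up the service lazily on first use
if (taservice == null) {
    taservice = services.getService(TextAnalyticsService.class);
}
// this is the input string; in practice it is better to load the content of a file into a String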
String text =
"Two of the world's richest men have launched a campaign aiming to tackle smoking "
+"in the developing world. New York City Mayor Michael Bloomberg and Microsoft founder "
+"Bill Gates warn one billion people could die this century from smoking-related "
+"illnesses. The billionaire philanthropists have pledged $500m (£250m) in the next "
+"five years to help people quit smoking. The two men want to run public-information "
+"campaigns warning of the dangers of tobacco. There are more than 1 billion smokers "
+"worldwide. As the developed world curbs tobacco use - "
+"with such moves as the ban on smoking in public places in London, New York and Dublin - "
+"so the tobacco companies are shifting their focus to Asia and Africa. "
+"Bill and I want to highlight the enormity of this problem and catalyse a global movement "
+"of governments and civil society to stop the tobacco epidemic, said Mr Bloomberg. "
+"Mr Gates and Mr Bloomberg are aiming to help governments with policies which have been "
+"shown to curb smoking, such as raising tobacco taxes and banning tobacco advertising. "
+"The question is how effective $500m can be against the concerted efforts of the tobacco "
+"companies to expand their markets.";
Map<String, Double> keyphraseMap = taservice.getKeywordsFromString(text);
// do something with the returned Map, e.g. iterate over its entries as in the URL example
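The comment in the example suggests loading the input text from a file rather than hard-coding it. A minimal sketch using only the Java standard library is given below; the class TextFileLoader is hypothetical, and UTF-8 is an assumed encoding that may need to be adjusted for the files at hand.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TextFileLoader {

    // Reads the whole file into a String, decoding it as UTF-8 (assumed encoding).
    static String load(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: TextFileLoader <path-to-text-file>");
            return;
        }
        String text = load(Paths.get(args[0]));
        System.out.println(text.length() + " characters loaded");
    }
}
```

The resulting String can then be passed to getKeywordsFromString in the same way as the hard-coded text.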
Developing the component
Full Javadoc is available; see the Javadoc overview.
SVN browsing
Link to Nepomuk SVN
