Keyword Extraction

Extracts a set of relevant keywords/keyphrases from textual input or from a file that contains text (pdf, doc, txt). It applies several stages of lexical and statistical analysis and is built on top of the GATE framework from the University of Sheffield.

Overview

Key Functionality

Provides a list of keywords/keyphrases/terminology for a given document containing a sufficient amount of natural language (English only for now). The returned structure is a Map with keyphrases as keys and their relevance/significance as values, ordered by significance. Significance is a double value in the interval [0, 1] and indicates the relative importance of the associated keyphrase for the document in question.
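
To picture the shape of the returned structure, here is a minimal sketch that orders keyphrases by descending significance in an insertion-ordered map. The class SignificanceOrder and its method are hypothetical helpers for illustration, not part of the Nepomuk API; which Map implementation the service actually returns is not stated here, so a LinkedHashMap is assumed. The sample values are taken from the example output further down this page.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of the Nepomuk API.
class SignificanceOrder {

    // Returns a copy of the input whose iteration order is descending significance.
    static Map<String, Double> sortBySignificance(Map<String, Double> keyphrases) {
        return keyphrases.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (first, second) -> first,   // keys are unique, so this merge is never used
                        LinkedHashMap::new));       // assumed: an insertion-ordered map
    }

    public static void main(String[] args) {
        Map<String, Double> sample = new LinkedHashMap<>();
        sample.put("time", 0.489);
        sample.put("inertial frame", 0.524);
        sample.put("relativity", 0.513);

        // Prints one "phrase -- significance" line per entry, most significant first.
        sortBySignificance(sample).forEach(
                (phrase, significance) -> System.out.println(phrase + " -- " + significance));
    }
}
```

A caller that only iterates over the result from the service should not need this step; it may be useful when entries from several documents are merged and have to be re-ordered.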

Use Cases

  • provide the URLs of blog posts, wiki pages (or any other HTTP-accessible document) as input, and use the recommended keyphrases as tags
  • extract textual content as a String via Aperture Services and extract keyphrases from it
  • extract keyphrases from {pdf,doc,rtf,txt} documents on your desktop, and use those recommendations to fill in missing metadata descriptions in the documents (e.g. the keywords field in PDF/MS Word documents)
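
For the tagging use case, one possible way to turn the keyphrase map into a short tag list is to keep only phrases above a significance threshold and cap their number. The sketch below assumes a hypothetical helper class TagSuggester; the threshold 0.5 and the limit of 5 are arbitrary illustration values, not recommendations from the service.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of the Nepomuk API.
class TagSuggester {

    // Picks at most maxTags keyphrases with significance >= minSignificance,
    // most significant first.
    static List<String> suggestTags(Map<String, Double> keyphrases,
                                    double minSignificance, int maxTags) {
        return keyphrases.entrySet().stream()
                .filter(entry -> entry.getValue() >= minSignificance)
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(maxTags)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample values copied from the example output on this page.
        Map<String, Double> keyphrases = new LinkedHashMap<>();
        keyphrases.put("inertial frame", 0.524);
        keyphrases.put("relativity", 0.513);
        keyphrases.put("time", 0.489);
        keyphrases.put("important", 0.023);

        // prints [inertial frame, relativity]
        System.out.println(suggestTags(keyphrases, 0.5, 5));
    }
}
```

Sorting inside the helper keeps it independent of the order in which the entries arrive.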

Web Interface

A web interface for testing the functionality of the KeywordExtraction Service is available on the DERI server.

Authors/Developers

KeyphraseExtraction: AlexanderSchutz, DERI/NUIG, conceptual planning, implementation, documentation.

Installation Instructions

Installation via checkout from Nepomuk SVN

The service and its dependencies have been integrated as a bundle available for the Eclipse Target Platform.

Using the component

Javadoc: TextAnalytics Extraction Services Interface

Webservice / WSDL / etc

WSDL NepomukType

alternative WSDL Location for NepomukType

Web Service Target NS (this is defunct for some reason)

Examples

Extracting from a URL (may be on the web or on the desktop)

import java.util.Map;

// Both references are assumed to be fields of your component;
// 'services' must be supplied by the Nepomuk runtime before use.
NepomukServices services;
TextAnalyticsService taservice;

// look up the service lazily on first use
if (taservice == null) {
  taservice = services.getService(TextAnalyticsService.class);
}

// state the location of a document via a well-formed URL (as String)
String documentUrl = "http://nobelprize.org/nobel_prizes/physics/laureates/1921/einstein-lecture.pdf";

Map<String, Double> keyphraseMap = taservice.getKeywordsFromUrl(documentUrl);

// print each keyphrase with its relevance, in the order returned by the service
for (Map.Entry<String, Double> keyphrase : keyphraseMap.entrySet()) {
  String phrase = keyphrase.getKey();
  double relevance = keyphrase.getValue();
  System.out.println(phrase + " -- " + relevance);
}

This should produce output similar to the following:

inertial frame -- 0.524
relativity -- 0.513
time -- 0.489
system -- 0.485
Nature -- 0.475
stipulation -- 0.469
physically -- 0.456
theory -- 0.416
motion -- 0.206
laws -- 0.390
states -- 0.371
bodies -- 0.364
point -- 0.338
terms -- 0.331
gravitational field -- 0.312
problem -- 0.300
identical clocks -- 0.242
principles -- 0.213
concept -- 0.197
mechanics -- 0.187
light -- 0.175
frame -- 0.142
equations -- 0.131
rigid -- 0.117
field -- 0.088
space -- 0.081
rest relative -- 0.081
important -- 0.023
special case -- 0.023

Extracting from Plain Text / String

import java.util.Map;

// Both references are assumed to be fields of your component;
// 'services' must be supplied by the Nepomuk runtime before use.
NepomukServices services;
TextAnalyticsService taservice;

// look up the service lazily on first use
if (taservice == null) {
  taservice = services.getService(TextAnalyticsService.class);
}

// this is the input string; in practice, it is better to load the content of a file into a String
// (each fragment ends with a space so that sentences are not fused together)
String text =
   "Two of the world's richest men have launched a campaign aiming to tackle smoking "
  +"in the developing world. New York City Mayor Michael Bloomberg and Microsoft founder "
  +"Bill Gates warn one billion people could die this century from smoking-related "
  +"illnesses. The billionaire philanthropists have pledged $500m (£250m) in the next "
  +"five years to help people quit smoking. The two men want to run public-information "
  +"campaigns warning of the dangers of tobacco. There are more than 1 billion smokers "
  +"worldwide. As the developed world curbs tobacco use - "
  +"with such moves as the ban on smoking in public places in London, New York and Dublin - "
  +"so the tobacco companies are shifting their focus to Asia and Africa. "
  +"Bill and I want to highlight the enormity of this problem and catalyse a global movement "
  +"of governments and civil society to stop the tobacco epidemic, said Mr Bloomberg. "
  +"Mr Gates and Mr Bloomberg are aiming to help governments with policies which have been "
  +"shown to curb smoking, such as raising tobacco taxes and banning tobacco advertising. "
  +"The question is how effective $500m can be against the concerted efforts of the tobacco "
  +"companies to expand their markets.";

Map<String, Double> keyphraseMap = taservice.getKeywordsFromString(text);

// do something with the returned Map
// ...
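
The comment in the snippet above suggests loading the text from a file instead of hard-coding it. A minimal sketch of that step using only the Java standard library follows. The class TextFileLoader is a hypothetical helper, and UTF-8 encoding is an assumption; this only suits plain-text files, since pdf or doc files would first need their text extracted (for example via Aperture Services, as mentioned in the use cases).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of the Nepomuk API.
class TextFileLoader {

    // Reads the whole file into a String, assuming UTF-8 encoding.
    static String load(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: TextFileLoader <path-to-text-file>");
            return;
        }
        String text = load(Paths.get(args[0]));
        System.out.println("Loaded " + text.length() + " characters");

        // Inside a Nepomuk component, the text could then be passed on:
        // Map<String, Double> keyphraseMap = taservice.getKeywordsFromString(text);
    }
}
```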

Developing the component

Full Javadoc is available; see the Javadoc overview.

SVN browsing

Link to Nepomuk SVN

External References

A web interface for testing the functionality of the KeywordExtraction Service is available on the DERI server.