SummerSchool/LiftingProjectOne

Lifting Project Option One - Relation Extraction

  • Participants: TBE
  • Goal: Extract RDF triples from a text using Knowledge Based Information Extraction
  • Mentor: BrianDavis

Background

Do you want to teach a machine how to understand text? Natural Language Processing (NLP)? What's that? Wanna be an NLP engineer? How can I teach my Semantic Desktop to understand simple relations within text without needing a degree in Linguistics? It's easy, even your grandmother could do it!

Relation Extraction

Human Language Technology (HLT) plays a crucial role in lifting non-semantic web data, specifically unstructured data, into the Social Semantic Desktop. Extracting relations, a subtask of Information Extraction, is not easy, however.

For instance, look at this example:

IBM (NYSE: IBM) and Cognos (NASDAQ: COGN) (TSX: CSN) today announced that the two companies have entered into a definitive agreement for IBM to acquire Cognos...

We could write shallow grammars in Knowledge Based Information Extraction, such as {Company} {acquiresRelation} {Company}, to extract a Subject Predicate Object triple and generate RDF statements for semantic annotation and/or ontology population.
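To make the {Company} {acquiresRelation} {Company} idea concrete, here is a minimal sketch in plain Python. It uses a regular expression as a stand-in for a real shallow grammar formalism. The gazetteer, the trigger word list, the `http://example.org/` namespace and the function names are all assumptions made up for illustration; none of them come from the project itself.

```python
import re

# Hypothetical gazetteer of company names (illustrative only).
COMPANIES = ["IBM", "Cognos", "Oracle", "BEA", "SAP", "Business Objects"]
# Hypothetical trigger words standing in for {acquiresRelation}.
ACQUIRE_TRIGGERS = ["acquire", "acquires", "acquired", "buy", "buys", "bought"]
# Hypothetical namespace for the generated RDF resources.
EX = "http://example.org/"

company = "|".join(re.escape(c) for c in COMPANIES)
trigger = "|".join(ACQUIRE_TRIGGERS)

# {Company} {acquiresRelation} {Company}, allowing up to three
# plain words (e.g. "to") between the subject and the trigger.
PATTERN = re.compile(
    rf"\b(?P<subj>{company})\b(?:\s+\w+){{0,3}}?"
    rf"\s+(?:{trigger})\s+(?P<obj>{company})\b"
)


def extract_triples(text):
    """Return (subject, predicate, object) tuples matched by the rule."""
    return [(m.group("subj"), "acquires", m.group("obj"))
            for m in PATTERN.finditer(text)]


def to_ntriples(triples):
    """Serialise the tuples as N-Triples-style lines."""
    def uri(name):
        return f"<{EX}{name.replace(' ', '_')}>"
    return [f"{uri(s)} {uri(p)} {uri(o)} ." for s, p, o in triples]


SENTENCE = ("IBM (NYSE: IBM) and Cognos (NASDAQ: COGN) (TSX: CSN) today "
            "announced that the two companies have entered into a definitive "
            "agreement for IBM to acquire Cognos...")

for line in to_ntriples(extract_triples(SENTENCE)):
    print(line)
```

On the example sentence the rule fires on "IBM to acquire Cognos" and yields a single IBM acquires Cognos statement. A real system would use proper named entity recognition rather than a fixed word list.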

But what about the following?

Negation

BEA, which recently rebuffed a hostile takeover bid by Oracle,...
SAP no longer plans to buy Business Objects..

Outdated Facts which may no longer be of interest

IBM, already a powerhouse, has purchased Ascential, MRO and Filenet in the past two years.

In cases like these, our rules will overgeneralise: they still fire on such sentences and extract false or stale data for lifting.

The Research and Language Engineering Challenge

How do we write rules that do not overgeneralise and extract false positives? We want to pull out the correct information and nothing false. And if we add constraints, how do we prevent our existing rules from undergeneralising, that is, becoming so over-constrained that they no longer pull out the facts we are actually interested in?
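The tension can be seen in a small Python sketch. It is only an illustration under stated assumptions: the regular expression stands in for a shallow grammar rule, the company list and negation cues are invented, and the second test sentence is made up for this example (the first is the SAP sentence from above). The `block_negation` flag is a hypothetical, deliberately crude constraint that rejects any sentence containing a negation cue.

```python
import re

# Hypothetical mini-gazetteer and negation cues (illustrative only).
COMPANIES = ["IBM", "Cognos", "SAP", "Business Objects"]
NEGATION = re.compile(r"\b(?:no longer|not|never)\b", re.IGNORECASE)

company = "|".join(re.escape(c) for c in COMPANIES)

# Naive rule: {Company} ... acquire/buy {Company}, with up to four
# plain words allowed between the subject and the trigger word.
RULE = re.compile(
    rf"\b(?P<subj>{company})\b(?:\s+\w+){{0,4}}?"
    rf"\s+(?:acquire|buy)\s+(?P<obj>{company})\b"
)


def extract(sentence, block_negation=False):
    """Apply the rule; optionally drop sentences containing a negation cue."""
    if block_negation and NEGATION.search(sentence):
        return []
    return [(m.group("subj"), "acquires", m.group("obj"))
            for m in RULE.finditer(sentence)]


NEGATED = "SAP no longer plans to buy Business Objects.."
# Invented sentence: the acquisition is asserted, "not" is unrelated to it.
ASSERTED = "IBM will acquire Cognos, a deal analysts did not expect."

print(extract(NEGATED))                        # naive rule: false positive
print(extract(NEGATED, block_negation=True))   # constraint removes it
print(extract(ASSERTED))                       # naive rule: correct triple
print(extract(ASSERTED, block_negation=True))  # constraint loses the correct triple
```

Without the constraint the rule overgeneralises and lifts a false SAP acquires Business Objects triple. With the sentence-wide constraint that false positive disappears, but so does the correct IBM acquires Cognos triple, because an unrelated "not" appears later in the sentence. Scoping the constraint more precisely is exactly the engineering challenge.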

Progress

Results