wiki:LocalDataAlignment

Local Data Alignment

Analyses local data and aligns it. The generated metadata can be used to connect different representations of the same entities. Possible techniques are string similarity or Soundex-style phonetic similarity, text analysis, analysis of the crawled data generated by ApertureDataWrapper, and other alignment algorithms.

Overview

The Local Data Alignment component analyzes information in the RdfRepository and suggests new links or annotations based on existing data. Its most prominent task is to find entities that are usable as entries in the user's PimoOntology, for example creating a person from an nco:PersonContact address. Second, it unifies multiple representations of the same entity into one thing (for example, merging two information elements that represent the same person into one pimo:Person).

Suggestions found by the component are confirmed or rejected in a user interface. The component runs regularly as a background task, analysing changed resources after they have been crawled by the DataWrapper. The analysis itself is implemented in individual plugins, which separates concerns and allows developers to add new algorithms to the component.

Authors/Developers

Status

Installation Instructions

First, a working Nepomuk Server (RDFRepository and PimoService) is needed. To use Local Data Alignment you also need the project org.semanticdesktop.nepomuk.comp.localdataalignment. It is part of the TargetPlatform and contains the HypothesisGenerators and the browser-based review of the hypotheses. To create your own algorithms you must check out the project from the SVN server.

One HypothesisGenerator requires the project org.semanticdesktop.nepomuk.comp.strucrec, which is also in the target platform. If you cannot see it, please update. This project in turn requires a proprietary library from IBM LanguageWare, which is available for free for non-commercial use. It will be available from IBM AlphaWorks http://www.alphaworks.ibm.com/tech/galaxy, and is also available to Nepomuk developers from the Nepomuk private SVN https://svn.nepomuk.semanticdesktop.org/repos/trunk/component/Comp-StrucRec-Dependencies.

To install it, create a link file in the folder all/eclipse/links of the target platform with the following content: "path=../../EclipseTargetPlatformSecret". Put the jar files from IBM LanguageWare into this secret platform folder. If this does not work, go to the configuration of the TargetPlatform in Eclipse and add the path to the libraries directly.

For further information about this component see the Galaxy Section at TextAnalytics.

You also have to select these plugins in the "run dialog". Be sure to select the plugin (org.semanticdesktop.nepomuk.comp.strucrec) from the target platform. There seems to be a problem if you check out this plugin from the repository and use that copy instead of the one from the target platform.

Using the component

You can start the plugin like every other plugin. The service automatically aligns any changed resources every minute as a background task. Only hypotheses with a high probability are aligned automatically.

Users should call LocalDataAlignment manually once a day to review the remaining hypotheses.

LocalDataAlignment can be used either with the user interface or without. See below for interfaces and examples.

Architecture

LocalDataAlignment works as an engine with plugins. The plugins are called HypothesisGenerators and suggest possible matches. Each generator supports the three modes all, crawlingreport, and singleresource. After Local Data Alignment has been started, the HypothesisGenerators produce a pool of hypotheses.

The LocalDataAlignment component is the management part of the system. It has a Leverager and a FeedbackAgent. The Leverager executes all HypothesisGenerators and collects their hypotheses. The FeedbackAgent presents the hypotheses in a human-readable form and passes the result back to LocalDataAlignment. If the user accepts a suggestion, the planned changes of that hypothesis are performed. If the user discards a suggestion, no changes are performed. In both cases, all HypothesisGenerators that generated this specific suggestion receive feedback. This way the system can "learn" and prefer the HypothesisGenerators that generate good hypotheses (those often accepted by the user).

One of the components used by LocalDataAlignment is AlignmentPostProcessing. It analyzes the generated hypotheses and filters out all duplicates. The Leverager uses it internally before it returns the hypotheses.
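
The duplicate filtering step could look roughly like the following sketch. It is only an illustration under stated assumptions, not the code of AlignmentPostProcessing: the classes Candidate and DuplicateFilter are hypothetical, two hypotheses are assumed to be duplicates when their abstract (here reduced to a string key) is identical, and keeping the candidate with the highest belief value is an assumed policy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a hypothesis: an abstract key plus a belief value. */
final class Candidate {
    final String abstractKey; // assumed canonical text form of the hypothesis abstract
    final double belief;      // belief value in [0, 1]

    Candidate(String abstractKey, double belief) {
        this.abstractKey = abstractKey;
        this.belief = belief;
    }
}

/** Hypothetical duplicate filter; not the real AlignmentPostProcessing. */
final class DuplicateFilter {
    /** Keeps one candidate per abstract key, preferring the highest belief value. */
    static List<Candidate> filter(List<Candidate> candidates) {
        Map<String, Candidate> best = new LinkedHashMap<>();
        for (Candidate c : candidates) {
            Candidate seen = best.get(c.abstractKey);
            if (seen == null || c.belief > seen.belief) {
                best.put(c.abstractKey, c); // replacing a value keeps the first-seen order
            }
        }
        return new ArrayList<>(best.values());
    }
}
```

Calling DuplicateFilter.filter with three candidates, two of which share the same key, returns two candidates in first-seen order.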

At the moment there is an AddressbookHypothesisGenerator for NCO.PersonContact. Current work: complete the interfaces and implement the algorithms. The code that comes with Frank Osterfeld's thesis cannot be reused. Only the main idea behind his thesis (Dempster's theory) offers some possibilities for reuse (BeliefFunction, BeliefFunctionCombiner, ...).

Running Local Data Alignment

Go to the user interface and invoke the alignment process. Then accept or reject the suggested hypotheses. Obvious decisions do not need to be made by hand: a few threshold values are used to automatically accept hypotheses that are probably true and automatically discard hypotheses that are probably false.

  • AUTOMATIC: belief values greater than or equal to this value are accepted automatically and are not reviewed by the user. Value is 0.75.
  • HIGH: belief values greater than or equal to this value are realistically likely to be true. They should be shown in a GUI to be checked by a user, but the user will typically accept them. Value is 0.5.
  • MEDIUM: belief values greater than or equal to this value may be true, but not with a high probability. They must be reviewed by the user before being accepted. Value is 0.25.
  • LOW: belief values greater than or equal to this value are probably wrong and are mere guesses. They may be reviewed by the user, but should generally be auto-rejected and not shown to the user. Value is 0.0.

Kudos to Andreas Dengel and Benjamin Adrian, who defined these thresholds for another project; we reviewed them and adapted them for NEPOMUK.

A user interface showing suggestions must interpret and adhere to the thresholds. The individual probability of a hypothesis is available via Hypothesis.getBeliefValue(). After the hypothesis has been reviewed, the user interface sets the answer via Hypothesis.setAnswer(URI). Possible answer values, such as answerYes, are defined in the serverconf ontology.
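
How a user interface might map a belief value onto the four thresholds can be sketched as follows. This is a minimal illustration, not part of the Nepomuk API: the enum BeliefLevel and its classify method are hypothetical names, and only the threshold values 0.75, 0.5, 0.25, and 0.0 are taken from the list above.

```java
/** Hypothetical helper (not part of the Nepomuk API): maps a belief value to a review category. */
enum BeliefLevel {
    // Declared from the highest threshold to the lowest; classify() relies on this order.
    AUTOMATIC(0.75), HIGH(0.5), MEDIUM(0.25), LOW(0.0);

    final double threshold;

    BeliefLevel(double threshold) {
        this.threshold = threshold;
    }

    /** Returns the highest level whose threshold is reached by the given belief value. */
    static BeliefLevel classify(double belief) {
        // Written as a negation so that NaN is rejected as well.
        if (!(belief >= 0.0 && belief <= 1.0)) {
            throw new IllegalArgumentException("belief must be in [0, 1]: " + belief);
        }
        for (BeliefLevel level : values()) {
            if (belief >= level.threshold) {
                return level;
            }
        }
        throw new AssertionError("unreachable: LOW has threshold 0.0");
    }
}
```

A review GUI following the rules above would then skip hypotheses classified as AUTOMATIC (already accepted), show HIGH and MEDIUM ones for review, and hide LOW ones.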

After the hypotheses have been reviewed in the user interface, the pool of answered (and unanswered) hypotheses is passed back to the LocalDataAlignment service via the method LocalDataAlignment.processAnsweredHypothesisPool().

Suggestion review GUI

After hypotheses have been generated, they can be reviewed in a GUI.

Examples

This section describes how to use LocalDataAlignment.

Before connecting to LocalDataAlignment, you need a reference to NepomukServices first. This is usually done via OSGi or by using the NepomukClient.

// get the Nepomuk services first; this example uses OSGi
ServiceReference ref = bundleContext.getServiceReference(NepomukServices.class.getName());
NepomukServices nepomukServices = (NepomukServices) bundleContext.getService(ref);
// get the alignment service
LocalDataAlignment align = nepomukServices.getService(LocalDataAlignment.class);

To align the whole repository without showing a user interface, call autoAlignRepository. It auto-accepts all hypotheses with a belief value of at least BELIEF_AUTOMATIC (0.75):

LocalDataAlignment align = ... (see above);
align.autoAlignRepository(LocalDataAlignment.BELIEF_AUTOMATIC);
// the method blocks, so call it from a separate thread; this can take hours

If you want to align one resource and get the hypotheses back in one call, do:

LocalDataAlignment align = ... (see above);
URI resource = new URIImpl("http://..andSoOn..claudia/Dirk");
LeveragerMonitor monitor = align.generateHypothesesForResource(resource, null);
HypothesisPool pool = monitor.get(); // THIS will block now

If you want to align one resource and receive the hypotheses one by one as they arrive, use the listener.

LocalDataAlignment align = ... (see above);
URI resource = new URIImpl("http://..andSoOn..claudia/Dirk");
LeveragerListener myListener = new LeveragerListener() {
    public void startedSuggestions() {}

    public void hypothesesGenerated(Collection<Hypothesis> hypos) {
        for (Hypothesis hypo : hypos) {
            System.out.println("I got a hypothesis: " + hypo.getDescription());
        }
    }

    public void hypothesisPoolFilled(HypothesisPool pool) {
        System.out.println("These are all the hypotheses I got, and there are " + pool.size());
    }
};
LeveragerMonitor monitor = align.generateHypothesesForResource(resource, myListener);
// you don't have to call monitor.get() now, the listener will be informed

After you have received the hypotheses, you can review them and, if you want, set the answer of each one. Once all hypotheses are answered, pass the pool back to the service so that the accepted hypotheses are stored in the main repository and the rejected hypotheses are remembered in the localdataalignment repository.

Developing the component

Extending LocalDataAlignment is typically done by developing a new HypothesisGenerator algorithm or extending an existing one.

Developing a new HypothesisGenerator

To create new algorithms, look at LocalDataAlignment/TagsToTopic as a prototype implementation of a HypothesisGenerator. New generators should be implemented in the package org.semanticdesktop.nepomuk.comp.localdataalignment.modules. A new module needs two classes:

  • the HypothesisGenerator class
  • its corresponding factory

Each generator has a unique identifying URI.

See also the section about the innards of hypothesis generators below.

Registering a new algorithm

Usually you do not need to work inside LocalDataAlignment, but if you develop a new HypothesisGenerator, you need to register it in LocalDataAlignment.

Registration is performed in the plugin Activator. Remember that you do not register the generator itself but the factory that creates generators.

ServiceReference sr = context.getServiceReference(OSGIRegistry.class.getName());
...
registry = (OSGIRegistry) context.getService(sr);
...
ServiceReference sr2go = context.getServiceReference(RDF2GoRepository.class.getName());
RDF2GoRepository repository = (RDF2GoRepository) context.getService(sr2go);
PimoClient pimoclient = new PimoClient(repository);
NepomukServices neposervices = (NepomukServices)context.getService(context.getServiceReference(NepomukServices.class.getName()));
...

leverager = new LeveragerImpl(repository, pimoclient, neposervices);

// register the hypothesis generator factory
leverager.addHypothesisGeneratorFactory(new TagsToTopicFactory());
localdataalignment = new LocalDataAlignmentImpl(repository, leverager);

Hypothesis Generators

List of the existing and planned hypothesis generators.

Planned:

  • Email Analysis: analyzes email content.
  • FolderToPimo: analyzes folder structures to guess projects, topics, people, and other structures based on the folder hierarchy.

Interfaces

To interact with the service:

  • LocalDataAlignment: the interface through which other services call the component's functionality.
  • Hypothesis: the result generated by a HypothesisGenerator, made available to the outside via the service.

Inside the service:

There are different types of automatic matching:

  • new instances (rdfs:Resource)
  • new classes (rdfs:Class) (difficult)
  • connections between instances (triples)
  • new connection types (rdf:Property)
  • matching information elements (emails, files, ...) to things (persons, topics, places, ...) (pimo:occurrence)
  • finding possible errors: merging two or more instances of one thing (classes, ...)

The DataWrapper (Aperture|Beagle) writes RDF annotations into a special RDF repository (there are named repositories inside the repository component, such as "main" and "config"; this one would be called "reports" or something similar). The wrappers write to this store whenever a resource is modified, deleted, or added. We envisioned that these logs are called "CrawlReport". This is not done yet, but Antoni Mylka and Leo Sauermann agreed on it, and Leo thinks it is "obvious and good".

Based on that, the Metadata Alignment plugins get a Java object called "crawl report" that has handy methods for getting the URIs of changed, added, and deleted resources, so that they can focus on computing the alignment for the changes.

This can be done once a day; the crawl reports of that day are then passed in.

The LocalDataAlignment service runs in three modes:

  • all: initial run, analyzing all data for possible matches (automatic); method generateHypothesesForRepository()
  • crawlingreport: matching changed and new resources based on the crawling reports of Aperture (automatic); method generateHypothesesForCrawlReport()
  • singleresource: matching one resource (called manually when needed, e.g. to support users when they classify e-mails or websites); method generateHypothesesForResource()

The last functionality would allow us to make an interesting evaluation: first all data is aligned, then only one resource, where we show related documents of the resource.

The result of an alignment process is a set of hypotheses that suggest changes to the database. The suggestions can be accepted automatically, based on a probability measure, or in a user interface. Accepted suggestions are executed on the RdfRepository; rejected suggestions are discarded.

The hypotheses are represented as Hypothesis objects.

How the hypotheses are generated is up to the HypothesisGenerator plugins. All HypothesisGenerators support the same modes as the Leverager (all, report, single). Some algorithms work well for a given user and others make many errors, but at the beginning we do not know which algorithm is "good". To solve this problem, we allow the HypothesisGenerators to leave some uncertainty in their suggestions. HypothesisGenerators that provide good suggestions (often accepted) get a lower uncertainty, and conversely HypothesisGenerators that provide bad suggestions should get an increased uncertainty.
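
One simple way to turn accept/reject feedback into a per-generator reliability score is sketched below. It is an assumption made for illustration, not the mechanism used by the component: the class GeneratorFeedback, its methods, and the choice of a Laplace-smoothed acceptance rate are all hypothetical. The uncertainty of a generator would then be one minus its reliability.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical feedback tracker; the real FeedbackAgent may work differently. */
final class GeneratorFeedback {
    // Per generator URI: counts[0] = accepted suggestions, counts[1] = all answered suggestions.
    private final Map<String, int[]> counts = new HashMap<>();

    /** Records one user decision for a suggestion made by the given generator. */
    void record(String generatorUri, boolean accepted) {
        int[] c = counts.get(generatorUri);
        if (c == null) {
            c = new int[2];
            counts.put(generatorUri, c);
        }
        if (accepted) {
            c[0]++;
        }
        c[1]++;
    }

    /**
     * Laplace-smoothed acceptance rate in (0, 1).
     * A generator without any feedback starts at 0.5, i.e. "we do not know yet".
     */
    double reliability(String generatorUri) {
        int[] c = counts.get(generatorUri);
        if (c == null) {
            return 0.5;
        }
        return (c[0] + 1.0) / (c[1] + 2.0);
    }
}
```

The smoothing keeps a single early rejection from driving a generator's score to zero, which matches the idea that the quality of a generator is unknown at the beginning.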

When different HypothesisGenerators generate the same hypothesis, this changes how certain the hypothesis is. For example, one HypothesisGenerator generates a hypothesis with a certainty factor of 90% and another HypothesisGenerator generates the same hypothesis with a certainty factor of only 50%. The resulting certainty factor should be lower than 90%. The exact value depends on the quality of the HypothesisGenerators: if one HypothesisGenerator was classified as unsuitable for this user, it should not strongly influence the value from a suitable HypothesisGenerator. This is not implemented at the moment, because the theory is too complex and would take too much time to implement. For now, we do not consider what should happen if two generators create the same hypothesis.
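
To make the 90%/50% example concrete, here is one possible combination rule, a reliability-weighted average. It is a deliberately simple stand-in and not Dempster's rule of combination; nothing like it exists in the component according to the text above. The classes Vote and BeliefCombiner and the notion of a numeric reliability weight per generator are assumptions made for this sketch.

```java
import java.util.List;

/** Hypothetical value object: one generator's belief in a hypothesis plus that generator's weight. */
final class Vote {
    final double belief;       // certainty reported by the generator, in [0, 1]
    final double reliability;  // assumed quality weight of the generator, >= 0

    Vote(double belief, double reliability) {
        this.belief = belief;
        this.reliability = reliability;
    }
}

/** Hypothetical combiner using a weighted average (not Dempster's rule). */
final class BeliefCombiner {
    /** Combines the votes for one hypothesis; unreliable generators contribute little. */
    static double combine(List<Vote> votes) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Vote v : votes) {
            weightedSum += v.belief * v.reliability;
            totalWeight += v.reliability;
        }
        if (totalWeight <= 0.0) {
            throw new IllegalArgumentException("at least one vote needs a positive reliability");
        }
        return weightedSum / totalWeight;
    }
}
```

With equal weights, beliefs of 0.9 and 0.5 combine to about 0.7, which is below 90% as required. If the second generator has a reliability of only 0.1 against 1.0 for the first, the result stays close to 0.9, so an unsuitable generator hardly influences a suitable one.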

The prototype Hypothesis Generator TagsToTopic

An example implementation of a HypothesisGenerator is TagsToTopic. It searches for NAO.Tag instances in the store. If there is no corresponding Pimo.Topic for a tag, the generator suggests creating one.

Working inside a Hypothesis Generator

A hypothesis generator analyses the data under consideration (the whole repository, one resource, or a set of changes). For each possible hypothesis, the generator must check whether the hypothesis is already aligned in the RdfRepository (main repository) before suggesting it.

There are also rejected hypotheses (hypotheses that were suggested in the past but rejected by the user as being wrong), which must not be suggested again as new hypotheses. To check whether a hypothesis was rejected before, AlignmentInput provides a method that compares the suggested hypothesis with all rejected hypotheses. The input to this method is an abstract of the hypothesis, consisting of a named graph (a set of triples) that is defined only through the triples inside it. The abstract can contain blank nodes to identify where in the hypothesis a new resource would be created, but it has to be concrete enough to be matched. The abstracts of all rejected hypotheses are stored in a separate repository (localdataalignment) to keep them for comparison. Hypothesis generators may also use the abstract to check whether the data that the hypothesis suggests adding is already present in the main repository. AlignmentInput provides another method for this check.

After the hypothesis generator has run several times:

  • rejected hypotheses should not be suggested again
  • hypotheses that have already been accepted should not be suggested again

(as of 24.6.2008, some of these things are not implemented YET but will be this week).

The difference between the suggested changes of a hypothesis and its abstract is that the changes contain much more data (such as graph metadata of newly created contexts, modification dates, access rights, and other metadata of the suggested triples). The abstract, on the other hand, contains only the core statement of the hypothesis (for example: this information element x is an occurrence of a new thing, which would be expressed as _blanknode pimo:occurrence <uri-of-x>).

Besides that, all implementors should take care that their generators are effective and efficient. The generators run often and in the background while the user works, so they should not take up much processing power. As the abstract should be the minimal signature of the hypothesis, it is recommended to create the abstract by hand (that is, to define it explicitly) and not to use the ClientSession for it, as the ClientSession creates more than the minimal data.

Example: the abstract of creating a new person from an address book entry (creating a new pimo:Thing from an information element) would be:

_blanknode1 pimo:groundingOccurrence <uri-of-addressbook-personcontact>.

Compare this to the data that will be suggested by the hypothesis generator:

claudia:Dirk a pimo:Person;
 pimo:groundingOccurrence <uri-of-addressbook-personcontact>;
 nao:created "2007-02-03T11:15:00";
 nao:prefLabel "Dirk Hageman";
 ....

As we see, the "abstract" is the minimal data needed to match if the hypothesis was previously rejected, or already accepted.

To check whether the abstract exists in the main or localdataalignment repository, all blank nodes are replaced with named variables and the result is passed as a SPARQL ASK query to the respective store. This is implemented in AlignmentInput.
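
The blank-node replacement can be illustrated with a small string-based sketch. It is not the code of AlignmentInput: the class AbstractToAsk and its methods are hypothetical, triples are assumed to be given as subject/predicate/object strings, and blank nodes are assumed to be written with a leading underscore, as in the example abstract above. PREFIX declarations are left out, so a real query would have to add them or use full URIs.

```java
/** Hypothetical helper that turns the triples of a hypothesis abstract into a SPARQL ASK query. */
final class AbstractToAsk {

    /** Maps a blank-node label such as "_blanknode1" or "_:b1" to a SPARQL variable. */
    static String toTerm(String term) {
        if (term.startsWith("_:")) {
            return "?" + term.substring(2);
        }
        if (term.startsWith("_")) {
            return "?" + term.substring(1);
        }
        return term; // URIs, prefixed names, and literals are used unchanged
    }

    /** Builds the ASK query; each triple is an array {subject, predicate, object}. */
    static String toAskQuery(String[][] triples) {
        StringBuilder query = new StringBuilder("ASK { ");
        for (String[] t : triples) {
            query.append(toTerm(t[0])).append(' ')
                 .append(t[1]).append(' ')      // predicates are never blank nodes
                 .append(toTerm(t[2])).append(" . ");
        }
        return query.append('}').toString();
    }
}
```

For the address book abstract shown earlier, the triple {"_blanknode1", "pimo:groundingOccurrence", "<uri-of-addressbook-personcontact>"} becomes a query body that asks whether any resource has this grounding occurrence.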

SVN browsing

Tests and Example code

Tests for LocalDataAlignment can be found in this package:

External References

In this section you find links to work that was done before.

  • The Nepomuk TextAnalytics component.
  • http://opennlp.sourceforge.net/ - OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects.

Benjamin Horak Diploma thesis ConTag

Benjamin Horak did a diploma thesis on parts of LocalDataAlignment, supervised by LeoSauermann. ConTag helps users annotate documents written in natural language with textual tags that relate to personal concepts managed in a semantic desktop environment.

The basic idea is: the system gets a text and the PimoOntology of the user. It extracts a topic map around the text and compares it with the user's PIMO, possibly suggesting changes and extensions to the user's PIMO. Mainly, however, it suggests tags (see page 22 of the thesis). Internally it uses online services by Yahoo and Tagthe.net.

Code (internal, only accessible to DFKI members):

  • Topics are stemmed, at the moment only for English: Porter: ontologies -> ontolog, Kuhlen: ontologies -> ontology
  • Testing is done using Validator
  • Key algorithms: Porter Stemmer, Kuhlen Normalizer, ConTag Alignment Generator

Frank Osterfeld – Diploma Thesis

Frank Osterfeld did a diploma thesis at DFKI on folder-to-pimo matching. This was supervised by Ludger van Elst from DFKI.

GoPimo creates or modifies a PIMO based on the folder structure of the file system. Each hypothesis has a certainty level, which is used to rate algorithms that generate many wrong hypotheses. If an algorithm reports a high certainty level but the user rates the hypothesis as wrong, the accuracy of the algorithm is automatically lowered, making it a self-learning system (Chapter 5.2). The thesis shows the mathematical basis for combining multiple hypotheses. Hypotheses are defined in a mix of XML, SPARQL, Java snippets, scripting snippets, and rules.

Code only available to DFKI members
