Local Storage / RdfRepository

The RdfRepository service provides storage of and access to semantic resources and their descriptions. It supports storage and retrieval of RDF data. RDF is the data structure used by all NEPOMUK services, and all NEPOMUK data is stored in the RdfRepository, which makes it a core service. Full-text search is supported through the Apache Lucene text indexing and search engine. Limited support for NRL inferencing is implemented.

Overview

The Local Storage (or RdfRepository) is the central store for metadata and structured data in Nepomuk. It contains all information extracted from the filesystem by the DataWrapper, the ontologies, and the Personal Information Model (PIMO) of the user. All data that Nepomuk handles locally should be kept here to benefit most from data integration. Storage of binary files is not part of the repository, but members of Nepomuk are working on ways to do this: Mikhail Kotelnikov suggested using JCR (Java Content Repository), and Max Völkel is working on another solution for binary content.

The features of the RdfRepository at the moment are:

  • Storing and querying RDF data
  • The underlying database is Sesame 2 alpha 3
  • SPARQL support as far as implemented in Sesame
  • SeRQL support
  • Performant inference engine for subclass/subproperty/inverseProperty closure on insert (uses transactions)

All desktop applications can use the database to share metadata about resources, and through this sharing they can be integrated on the level of data and ontologies. Information about the person Dirk edited by application A can be changed and annotated by application B, and vice versa. If one of the applications expresses information in a format the other cannot read, the RDF model allows both facets to co-exist without disturbing each other. For example, if A can work with telephone numbers but B cannot, the telephone numbers added by A will not disturb the correct functioning of B. The mechanism behind this is the concept of extensible ontologies, a basic concept of RDF.

For the applications and the integration to work, it is important that all ontologies and data are stored in the RdfRepository.

The RdfRepository is not an application on its own but an infrastructure for other applications. To make the integration work, it is important that there is only one RDF repository available. Therefore we provide it as a single NEPOMUK component, which is able to store data generated by various applications and extracted by the DataWrapper.

Authors/Developers

Installation Instructions

Using the component

The RdfRepository starts together with the other core services on system startup. It is then available through its API.

The easiest way to program with it is through the RDF2Go interface; read "using RdfToGo with RdfRepository" for details. A more convenient way is to use the PimoClient.

Manipulating the data inside the repository requires familiarity with the OntologiesHowTo, which describes the stored data formats, with the RecommendedUris, which describe the URI identifiers to use, and with the other Recommendations.

SPARQL Endpoint

A SPARQL endpoint that conforms to the SPARQL protocol and is implemented using Sesame is available. For the main repository it is at http://localhost:8181/org.semanticdesktop.services.rdfrepository/repositories/main.

Users can send queries with standard HTTP requests, passing the query as the query parameter. Query:

SELECT ?s ?p WHERE { ?s ?p "Person" . }

Request:

http://localhost:8181/org.semanticdesktop.services.rdfrepository/repositories/main?query=SELECT%20%3Fs%20%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%22Person%22%20.%20%7D

Result  example query

To learn more about SPARQL usage, see RdfRepository/SparqlQuery.
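The request above can also be assembled programmatically. The following is a minimal sketch, assuming only the endpoint URL and the query parameter name shown in the request; the class and method names are made up for this illustration and are not part of NEPOMUK or Sesame. It percent-encodes the query so that characters such as ?, { and " survive the HTTP request.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative helper, not part of the NEPOMUK or Sesame API.
class SparqlRequestUrl {
    // Endpoint of the main repository, copied from the request shown above.
    static final String ENDPOINT =
        "http://localhost:8181/org.semanticdesktop.services.rdfrepository/repositories/main";

    // Builds a GET URL that carries the SPARQL query in the "query" parameter.
    static String build(String endpoint, String sparql) {
        return endpoint + "?query=" + URLEncoder.encode(sparql, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String query = "SELECT ?s ?p WHERE { ?s ?p \"Person\" . }";
        System.out.println(build(ENDPOINT, query));
    }
}
```

Note that URLEncoder applies form encoding, so spaces become + rather than %20. Servers that decode query parameters in the usual way should treat both forms the same, but this has not been checked against the Sesame endpoint.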

Multiple Repositories and Contexts

The RdfRepository is an RDF storage server that supports multiple databases. Similar to an SQL server that has multiple databases and tables, an RdfRepository has multiple repositories and, inside one repository, multiple contexts. See RecommendationStoringInformationElements for details on how to store Information Elements.

The preconfigured repositories are:

identifier | purpose | contexts used for
main | Stores all RDF data of the user: all ontologies, all RDF data generated by applications, all RDF data extracted from files and other sources. | Each ontology is two contexts (one data, one metadata), the user's PIMO is multiple contexts, each crawled information element is one context; see RecommendationStoringInformationElements and RecommendedUris.
config | Stores configuration data (username, e-mail address of user, application configuration, service configurations, information about data sources). | Only a few contexts; most data is in the default context. Data source configurations from DataWrapperAperture use one context each.

As you can see, you will typically work on the main repository. If you need additional repositories for statistical data or data mining in RDF, you can create separate repositories for your needs. The RdfRepository has the methods createRepository(), removeRepository(), and listRepositoryIds() to help you.
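To make the "server, repositories, contexts" analogy concrete, here is a toy in-memory model. It is only a sketch: the three method names are taken from the text above, but their parameters and return types, the add() and size() helpers, the "statistics" repository, and the urn:example: URIs are all assumptions made for this illustration and do not describe the real RdfRepository API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model, NOT the NEPOMUK implementation.
class ToyRdfServer {
    // repository id -> (context URI -> statements stored in that context)
    private final Map<String, Map<String, List<String[]>>> repositories = new LinkedHashMap<>();

    // Method names come from the text; signatures are assumptions.
    void createRepository(String id) {
        repositories.putIfAbsent(id, new LinkedHashMap<>());
    }

    void removeRepository(String id) {
        repositories.remove(id);
    }

    List<String> listRepositoryIds() {
        return new ArrayList<>(repositories.keySet());
    }

    // Hypothetical helper: adds one triple to a context of a repository.
    void add(String repositoryId, String context, String s, String p, String o) {
        Map<String, List<String[]>> contexts = repositories.get(repositoryId);
        if (contexts == null) {
            throw new IllegalArgumentException("unknown repository: " + repositoryId);
        }
        contexts.computeIfAbsent(context, k -> new ArrayList<>()).add(new String[] {s, p, o});
    }

    // Hypothetical helper: number of triples in one context (0 if unknown).
    int size(String repositoryId, String context) {
        Map<String, List<String[]>> contexts = repositories.getOrDefault(repositoryId, Map.of());
        return contexts.getOrDefault(context, List.of()).size();
    }

    public static void main(String[] args) {
        ToyRdfServer server = new ToyRdfServer();
        server.createRepository("main");
        server.createRepository("config");
        server.createRepository("statistics"); // extra repository for your own needs
        server.add("statistics", "urn:example:run1", "urn:example:a", "urn:example:count", "42");
        System.out.println(server.listRepositoryIds());
        server.removeRepository("statistics");
        System.out.println(server.listRepositoryIds());
    }
}
```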

Inference

Inference is performed when data is inserted into the store. This means that all inferred data is already available once the data has been inserted, and no inference is needed when querying. This makes NEPOMUK very fast when users interact with it (reading) and a little slower when data is manipulated. To keep the storage requirements low and to allow efficient inference (within milliseconds) on large datasets, we implement only TBox inferencing.

Read NepomukInferencing to learn what to expect from the inference engine.
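The idea of inferring at insert time can be illustrated with a small sketch. It is not the NEPOMUK inference engine: it is a toy store, assuming plain string class names, that materialises the transitive closure of subclass statements whenever one is inserted, so that a later query is a simple lookup. The class names Student, Person, and Agent are invented for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy illustration of inference on insert, not the NEPOMUK engine.
class SubclassClosureStore {
    // For each class, every superclass reachable through subclass statements.
    private final Map<String, Set<String>> supers = new HashMap<>();

    // Inserts "sub subClassOf sup" and stores everything that follows by transitivity.
    void addSubClassOf(String sub, String sup) {
        // `sub` itself and every class already known to be below `sub` ...
        Set<String> lower = new HashSet<>();
        lower.add(sub);
        for (Map.Entry<String, Set<String>> e : supers.entrySet()) {
            if (e.getValue().contains(sub)) {
                lower.add(e.getKey());
            }
        }
        // ... become subclasses of `sup` and of everything already above `sup`.
        Set<String> upper = new HashSet<>(supers.getOrDefault(sup, Set.of()));
        upper.add(sup);
        for (String c : lower) {
            supers.computeIfAbsent(c, k -> new HashSet<>()).addAll(upper);
        }
    }

    // Query time: a plain lookup, no reasoning is done here.
    boolean isSubClassOf(String sub, String sup) {
        return supers.getOrDefault(sub, Set.of()).contains(sup);
    }

    public static void main(String[] args) {
        SubclassClosureStore store = new SubclassClosureStore();
        store.addSubClassOf("Student", "Person");
        store.addSubClassOf("Person", "Agent");
        System.out.println(store.isSubClassOf("Student", "Agent")); // prints true
    }
}
```

The trade-off matches the one described above: inserts do extra work and store extra statements, while reads never need to reason.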

Fulltext search

Read LuceneSail to learn how fulltext indexing and searching works. Besides that, Local Search provides more sophisticated search methods.

Data Storage Folder

For end users, the RdfRepository stores its data in the folder specified in the NEPOMUK/configuration/config.ini file by the line org.semanticdesktop.nepomuk.data=$user.home/.nepomuk. With this setting, it is the folder .nepomuk in the user's home folder.

When NEPOMUK is started for development from within Eclipse (as described on EclipseDevelopment), the property org.semanticdesktop.nepomuk.data is not set in the config.ini of the product, and the RdfRepository stores its data in the OSGi bundle's data directory. This is somewhere in ./configuration/.metadata/plugins, and the location is printed in the log messages on startup.

If you develop NEPOMUK in Eclipse, the property is by default NOT set, which makes it easy to wipe the data at will and start over. If you downloaded the nightly build, the property should be set and your data will be persistent, even if you download a new nightly build.
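The lookup described above can be sketched as follows. This is an illustration only: the property key and the value $user.home/.nepomuk come from the text, but the way NEPOMUK actually expands $user.home is not documented here, so the simple string replacement and the DataFolderResolver class are assumptions. A null result stands for the "property not set" case, in which the OSGi bundle's data directory would be used instead.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Illustrative helper, not NEPOMUK code.
class DataFolderResolver {
    static final String KEY = "org.semanticdesktop.nepomuk.data";

    // Returns the configured data folder, or null when the property is not set.
    // Assumption: "$user.home" is replaced textually by the user's home folder.
    static Path resolve(Properties config, String userHome) {
        String value = config.getProperty(KEY);
        if (value == null) {
            return null;
        }
        return Paths.get(value.replace("$user.home", userHome));
    }

    public static void main(String[] args) throws IOException {
        Properties config = new Properties();
        // Same line as in config.ini, loaded from a string for the example.
        config.load(new StringReader("org.semanticdesktop.nepomuk.data=$user.home/.nepomuk"));
        System.out.println(resolve(config, System.getProperty("user.home")));
    }
}
```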

  • DataFolder - read on about the storage location and how to configure it!

Examples

See how to...

Developing the component

Check out the component from SVN

You must use the JUnit tests while developing:

To learn about development as such:

Development should be easy when you start with the JUnit tests. You also have to check out the mockup.

Known Problems

Limited Multi-Threading

RdfRepository supports multi-threading only through a simple scheme: all calls run through one synchronized SAIL (in the SAIL stack). This can cause deadlocks. There is no simple way to make RdfRepository support concurrency: providing thread-safe read and write access to RdfRepository would require a major rewrite of RDF2Go and NEPOMUK. This problem has tickets attached: the general ticket:576 and the more precise ticket:415.

Unicode 0xb breaks RDF Repository

The RDF standard does not limit the characters valid in RDF literals, hence literals containing characters that are forbidden in XML are valid in RDF. If such literals are added to the store (which is valid RDF), remote access and backup, which are based on XML, will break. This is a general problem of the RDF standard being incompatible with RDF/XML. To circumvent it, we could filter out all forbidden characters from literals before they are entered into the store, which may have other side effects. For now, ApertureDataWrapper filters out illegal Unicode characters. Aperture is the main source from which such bad characters can originate, so filtering there solves most of this problem.

Another problem of RDF is that it allows URIs whose localname is not a valid XML localname. XML localnames must start with a letter or underscore, whereas URIs can have a number or symbol at the beginning of the localname. This breaks XML serialization at another point. We see that XML, RDF, and URIs need a better alignment when it comes to details, but leave this problem to the W3C members of our consortium to address.
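A filter of the kind described above can be sketched in a few lines. This is not the ApertureDataWrapper code; it is a minimal illustration, with an invented class name, that drops every code point outside the Char production of XML 1.0 (which excludes, among others, the vertical tab 0xB named in the heading).

```java
// Illustrative filter, not the ApertureDataWrapper implementation.
class XmlSafeLiterals {
    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the literal with all characters removed that XML 1.0 forbids.
    static String strip(String literal) {
        StringBuilder out = new StringBuilder(literal.length());
        literal.codePoints()
               .filter(XmlSafeLiterals::isXmlChar)
               .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "Dirk" + (char) 0x0B + " Hagemann"; // contains a vertical tab
        System.out.println(strip(dirty)); // prints Dirk Hagemann
    }
}
```

Silently dropping characters changes the stored literal, which is one of the side effects mentioned above; replacing them with a placeholder would be an alternative design.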

SVN browsing

  • you find this component at  svn, browse online here

Tests and Example code

Tests for RdfRepository are in

External References

The RDF repository uses Sesame 2 underneath. We were one of the first users of Sesame 2 and helped develop its HTTP client layer.

For fulltext indexing we use LuceneSail, developed jointly by Aduna, DFKI, and L3S. It is based on Apache Lucene.

Publications about the RDF Repository are (not exhaustive):
