wiki:LuceneSail

The LuceneSail is an Apache Lucene-based, fulltext-enabled RDF storage layer that sits on top of an existing store. It is based on the Sesame2 platform but can (in theory) be used on any RDF store, because Sesame's stacked architecture allows this. Currently, there are three flavors of the LuceneSail.

The paper:

Contact

  • Chris Fluit - developer
  • LeoSauermann - developing
  • Enrico Minack - developing
  • Gunnar Grimnes - developing
  • Alex Vigdor - developing

There is no "lead" developer; individual developers may or may not update the code and care for patches. For issues, contacting everyone is recommended.

Status

This is stable enough for us.

  • it is in extensive use by NEPOMUK developers and it works fine
  • it slows down changes to the database compared to a store without the LuceneSail

The version used for NEPOMUK can be browsed here:

Installation

LuceneSail is part of RdfRepository in NEPOMUK. It can also be used in a sail stack inside a normal openrdf/Sesame installation. See http://www.openrdf.org/forum/mvnforum/viewthread?thread=1528 for a discussion about the factory and configuration.

Support

NEPOMUK does not offer free support for the LuceneSail. You can ask Aduna or DFKI for commercial support, or try the Sesame forum.

Query language

Example

Search for any resource that has an rdfs:label value containing the string "person":

PREFIX search: <http://www.openrdf.org/contrib/lucenesail#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?x ?score ?snippet WHERE {
  ?x search:matches ?match .
  ?match search:query "person" ;
         search:property rdfs:label ;
         search:score ?score ;
         search:snippet ?snippet .
}

Details

The query is expressed as a virtual resource with virtual properties, and this query resource is connected to the resource to be found by another virtual property. If this is too much "virtual" to understand, read on here.

The parameters are:

  • search:matches - connecting the resource to be found with the query. subject = resource to be found. object = formulated query
  • search:query - the Lucene fulltext query string of the query resource
  • search:property - [optional] restrict the search to only this property. If omitted, all literal properties will be searched
  • search:score - [optional] bind the score for an individual returned hit to this variable (must be a variable)
  • search:snippet - [optional] bind a highlighted snippet for each hit to this variable (must be a variable)
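
As a rough illustration of how these parameters combine, the following sketch assembles a query string from them. The helper names (escape_literal, build_search_query) and their arguments are made up for this example and are not part of the LuceneSail; only the search: vocabulary and its namespace come from this page.

```python
SEARCH_NS = "http://www.openrdf.org/contrib/lucenesail#"


def escape_literal(text):
    """Escape backslashes, double quotes and newlines for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def build_search_query(text, property_uri=None, score=True, snippet=True):
    """Assemble a SPARQL query that uses the search: vocabulary.

    Hypothetical helper for illustration only; not shipped with the LuceneSail.
    """
    select_vars = ["?x"]
    # search:query is the only mandatory property of the query resource.
    clauses = ['search:query "%s"' % escape_literal(text)]
    if property_uri is not None:
        # Optional: restrict the search to one property.
        clauses.append("search:property <%s>" % property_uri)
    if score:
        select_vars.append("?score")
        clauses.append("search:score ?score")
    if snippet:
        select_vars.append("?snippet")
        clauses.append("search:snippet ?snippet")
    lines = [
        "PREFIX search: <%s>" % SEARCH_NS,
        "SELECT %s WHERE {" % " ".join(select_vars),
        "  ?x search:matches ?match .",
        "  ?match " + " ;\n    ".join(clauses) + " .",
        "}",
    ]
    return "\n".join(lines)


print(build_search_query("person", "http://www.w3.org/2000/01/rdf-schema#label"))
```

Leaving out property_uri gives a query that searches all literal properties, as described for search:property.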

The query part can be any Lucene term expression; you can use the documented Lucene Term Modifiers in your query. In short, those are:

Highlighting is when "... you get a small excerpt of the document, with the key words highlighted so that you can spot the context where the word appeared....". In this implementation, the result uses HTML's <b></b> markers around the highlighted word.
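
To make the shape of such a snippet concrete, here is a toy sketch in plain Python. It is not the Lucene highlighter and does not reproduce its excerpt selection or scoring; the function name highlight and the context parameter are assumptions made for this illustration.

```python
import re


def highlight(text, term, context=30):
    """Return an excerpt around the first hit, with every hit wrapped in <b></b>.

    Toy stand-in for illustration; returns None when the term does not occur.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    first = pattern.search(text)
    if first is None:
        return None
    # Keep up to `context` characters on each side of the first hit.
    start = max(0, first.start() - context)
    end = min(len(text), first.end() + context)
    excerpt = text[start:end]
    return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", excerpt)


print(highlight("A person is a human being", "person"))
```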

Details: what is stored, how

The LuceneSail stores the fulltext of all literal values stored into the RdfRepository. The sail is part of the sail stack of the RdfRepository and is triggered before inference (inferred triples are not indexed, to save storage). When resources extracted by the DataWrapper are stored (crawled resources from datasources), the fulltext of the resource is also stored into the RDF repository and therefore into the LuceneSail. The fulltext is stored as plaintext without markup (formatting); alphanumerical characters and punctuation are indexed. The conversion to plaintext is done by the DataWrapper, not by the LuceneSail or the RdfRepository.

Inside the LuceneSail, the fulltext is stored as Lucene documents. A Lucene Document consists of key-value pairs, allowing you to store and search on many different properties. Each Lucene Document represents one RDF resource. There is a special field "uri" holding the URI of the resource. Triples are then stored by using the predicate URI as field name and the object literal value as the field value. Another field, "context", is used to capture all context(s) that contributed to a Lucene Document (here the word context means "the fourth column in a triplestore making it a quadstore"; do not mix it up with other meanings of the word context). Usually a resource is defined in one context, but there can be multiple.
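
The mapping from statements to documents can be sketched with plain dictionaries. This is only a model of the layout described above, not LuceneSail code: build_documents and the quad tuple layout are assumptions for this example, and a dict stands in for a Lucene Document.

```python
def build_documents(quads):
    """Group literal statements by subject into dict-based "documents".

    Each quad is (subject_uri, predicate_uri, literal_value, context_or_None).
    """
    docs = {}
    for subject, predicate, value, context in quads:
        # One document per resource, identified by the special "uri" field.
        doc = docs.setdefault(subject, {"uri": [subject], "context": []})
        # The predicate URI is the field name, the literal is the field value.
        doc.setdefault(predicate, []).append(value)
        # Record every context that contributed to this document, once each.
        if context is not None and context not in doc["context"]:
            doc["context"].append(context)
    return docs


LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
docs = build_documents([
    ("urn:ex:alice", LABEL, "Alice", "urn:ex:ctx1"),
    ("urn:ex:alice", COMMENT, "A person", "urn:ex:ctx2"),
    ("urn:ex:bob", LABEL, "Bob", "urn:ex:ctx1"),
])
print(docs["urn:ex:alice"])
```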

All fields in Lucene are stored as "STORED" fields (in comparison to "INDEXED"). There are two reasons for this:

  • The index has to be updateable. Changes to the properties of a resource cause the Lucene Document to be re-created and the existing document to be replaced. This is due to the architecture of Lucene, which does not allow "editing" stored documents.
  • Result highlighting (available as a Lucene Contribution) requires the fields to be stored. Highlighting is when "... you get a small excerpt of the document, with the key words highlighted so that you can spot the context where the word appeared...."
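
The first point can be illustrated with a small in-memory model: because a document cannot be edited in place, every change removes the old document, rebuilds it from its stored values, and adds the result back. The class TinyIndex and its methods are hypothetical names for this sketch and do not reflect the actual LuceneSail classes.

```python
class TinyIndex:
    """In-memory stand-in for an index whose documents cannot be edited in place."""

    def __init__(self):
        self.docs = {}  # resource URI -> {field name: tuple of values}

    def add_statement(self, uri, predicate, value):
        # Take the existing document out (or start a fresh one) ...
        old = self.docs.pop(uri, {"uri": (uri,)})
        # ... rebuild it from the stored values plus the new one ...
        rebuilt = dict(old)
        rebuilt[predicate] = old.get(predicate, ()) + (value,)
        # ... and put the replacement document in.
        self.docs[uri] = rebuilt

    def remove_statement(self, uri, predicate, value):
        old = self.docs.pop(uri, None)
        if old is None:
            return
        rebuilt = dict(old)
        remaining = tuple(v for v in old.get(predicate, ()) if v != value)
        if remaining:
            rebuilt[predicate] = remaining
        else:
            rebuilt.pop(predicate, None)
        # Keep the document only if something besides the "uri" field is left.
        if len(rebuilt) > 1:
            self.docs[uri] = rebuilt


index = TinyIndex()
index.add_statement("urn:ex:alice", "urn:ex:name", "Alice")
index.add_statement("urn:ex:alice", "urn:ex:name", "Al")
print(index.docs["urn:ex:alice"])
```

Without the stored values, the rebuild step would have to read the literals back from the underlying store.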

Storing the fulltext both in the RdfRepository and in the Lucene index doubles the needed disk space but allows a quick update of the Lucene storage. This could be optimized by storing the fulltext only in the RdfRepository, but that needs a tighter integration with the underlying sails (it would be good to integrate the fulltext storage right into the storage mechanisms of the NativeSail, so that the read operations needed to update the Lucene index could access the data directly). If you are interested in implementing this optimization, contact the developers (see the Contact section above).

The LuceneSail fulltext index is activated for the main repository only. For extra repositories, like the config repository, we have not added fulltext search support. This could be changed so that you can pass options when creating repositories (if you want to code this configuration, contact LeoSauermann).

Re-Indexing

You may notice that the fulltext index gets corrupted after a few weeks of usage. This can happen when the system is not shut down gracefully, or because of bugs. We have anticipated this: there is a "reindexing option". Go to your debug RDFRepository page and press the "re-index this repository" button; it is at the bottom.

Dependencies

You may notice that the Lucene OSGi jars in the Eclipse Target Platform cannot be downloaded from anywhere; they just appeared.

Leo created them in 5 minutes by taking the release and fumbling with the manifests. If you want to update them, do the same: download a release of Lucene (best done using Maven), use the Eclipse option "create new plugin from JAR", and use the manifests in the existing jars as a basis.

Development

We use Aduna's code repository for development, so that the results will outlive the NEPOMUK project. You can read the documentation on developing Sesame, which also applies here.

For a much tighter integration of Maven into Eclipse, see Aduna Maven wiki.

Note: if you want to commit code to the LuceneSail SVN repository, you need an Aduna account and you have to use the https URL of the SVN repository.

To test whether you did everything right, go to the command line inside the lucene folder and run the following command. It will compile the project, run the tests, and print some results:

mvn test

To build a release, run:

mvn jar:jar

Since you have set up Eclipse already, you can also open the src/test/java folder of the LuceneSail project, right-click on org.openrdf.sail.lucene.TestAll.java and select Run As -> JUnit Test.

If you have problems:

External References

The paper:

Last modified on 03/03/09 13:14:32