vote up 2 vote down

If I wanted to do semantic web application development in [some obscure language] and nothing is currently available, where would I go to find out how triple stores work and how to build a production-quality one in my obscure language? What API standards are out there that I ought to adhere to?

Specifically, how do I go about indexing triples efficiently in terms of space and search time? Are B-tree derivative data structures what I should look at, or is something else better? What optimisations are known for compaction and for speeding up, say, data retrieval to support reasoners and SPARQL queries?


7 Answers

vote up 3 vote down

Some stores index everything; others assume that the predicate is always present in a triple pattern. With that assumption, only two indexes are needed, POS and PSO, rather than three (e.g. SPO, POS, OSP). The saving is greater for quads.

Once indexes are created, if they are not used, they just sit on disk out of the way. They don't take cache space so don't affect query execution speed.
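To make the index orderings concrete, here is a minimal in-memory sketch of how three orderings (SPO, POS, OSP) can answer every triple pattern by a prefix lookup. It is an illustration only: the class name `TripleStore`, its methods, and the nested-dict layout are assumptions made for this example, not the design of any particular store, and a real store would keep sorted on-disk structures instead.

```python
from collections import defaultdict


class TripleStore:
    """Toy triple store with three nested-dict indexes (hypothetical design)."""

    def __init__(self):
        # Each index maps first key -> second key -> set of third keys.
        self.indexes = {
            name: defaultdict(lambda: defaultdict(set))
            for name in ("spo", "pos", "osp")
        }

    def add(self, s, p, o):
        self.indexes["spo"][s][p].add(o)
        self.indexes["pos"][p][o].add(s)
        self.indexes["osp"][o][s].add(p)

    def match(self, s=None, p=None, o=None):
        """Yield (s, p, o) triples matching the pattern; None is a wildcard."""
        if s is not None and (p is not None or o is None):
            name, keys = "spo", (s, p, o)   # patterns: s p o, s p ?, s ? ?
        elif p is not None:
            name, keys = "pos", (p, o, s)   # patterns: ? p o, ? p ?
        elif o is not None:
            name, keys = "osp", (o, s, p)   # patterns: s ? o, ? ? o
        else:
            name, keys = "spo", (s, p, o)   # no bound term: full scan
        for a, b, c in self._scan(self.indexes[name], keys):
            found = dict(zip(name, (a, b, c)))
            yield (found["s"], found["p"], found["o"])

    @staticmethod
    def _scan(index, keys):
        k1, k2, k3 = keys
        # .get() is used so that lookups never create empty index entries.
        for a in ([k1] if k1 is not None else list(index)):
            level2 = index.get(a, {})
            for b in ([k2] if k2 is not None else list(level2)):
                for c in level2.get(b, ()):
                    if k3 is None or c == k3:
                        yield (a, b, c)
```

With this layout, a pattern such as `match(p="knows", o="carol")` is served from POS and `match(s="alice", o="bob")` from OSP. If every pattern is known to bind the predicate, the OSP and SPO indexes in this sketch would never be the only option, which is the saving described above.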

Hi Andy, Is there, yet, a state-of-the-art design that is most often used? Perhaps there's a definitive paper or review paper that would give me guidelines on data structures, algorithms, and trade-offs to consider? – Andrew Matthews Nov 10 at 21:51
vote up 3 vote down

I recently wrote an article about the way we implemented a triple store at our company, Procurios. I go into the low-level details of implementing it in a relational database:

Semantic web marvels in a relational database - part I: Case Study

Hi Patrick, Thanks for the links. Although I am specifically interested in the design of a dedicated triple store, I'm also very interested in how your approach compares to such a solution in terms of space and time complexity for a standard query benchmark. Do you have such comparison figures yet? Also, just out of interest, what was your implementation language? Have you tried to use this store with an inference engine? How did it perform for that sort of use? – Andrew Matthews Nov 15 at 21:48
Hi Andrew. PHP is our implementation language. Since we don't use a semantic web query language, it's hard to compare performance with other stores. The article is just to give you some ideas. My advice would be to set up several test situations and create your own performance stats. – Patrick van Bergen Nov 17 at 21:13
vote up 3 vote down

Thomas Neumann and Gerhard Weikum put a lot of thought into your question while developing their RDF-3X engine. The following paper includes performance testing against multiple data sets, information on index optimizations, triple compression, etc.:

http://www.vldb.org/pvldb/1/1453927.pdf
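As a rough illustration of what triple compression can look like in general, the sketch below maps terms to integer IDs, sorts the ID triples, and stores only the difference from the previous triple as variable-length integers. This is a simplified scheme written for this thread, not the on-disk format from the paper; every name in it (`TermDictionary`, `compress`, `decompress`) is hypothetical.

```python
class TermDictionary:
    """Maps each distinct term (URI or literal) to a small integer ID."""

    def __init__(self):
        self.ids = {}
        self.terms = []

    def encode(self, term):
        if term not in self.ids:
            self.ids[term] = len(self.terms)
            self.terms.append(term)
        return self.ids[term]

    def decode(self, term_id):
        return self.terms[term_id]


def write_varint(out, n):
    """Append a non-negative int to bytearray `out`, 7 bits per byte."""
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)


def read_varint(data, pos):
    """Read one varint starting at `pos`; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def compress(triples):
    """Delta-encode sorted, distinct triples of non-negative integer IDs."""
    out = bytearray()
    prev = (0, 0, 0)
    for t in sorted(set(triples)):
        # Count leading components shared with the previous triple (0 to 2).
        shared = 0
        while shared < 2 and t[shared] == prev[shared]:
            shared += 1
        out.append(shared)
        write_varint(out, t[shared] - prev[shared])  # gap in first differing slot
        for value in t[shared + 1:]:
            write_varint(out, value)                 # remaining slots in full
        prev = t
    return bytes(out)


def decompress(data):
    """Inverse of compress(); returns the sorted list of ID triples."""
    triples, prev, pos = [], (0, 0, 0), 0
    while pos < len(data):
        shared = data[pos]
        pos += 1
        current = list(prev[:shared])
        delta, pos = read_varint(data, pos)
        current.append(prev[shared] + delta)
        while len(current) < 3:
            value, pos = read_varint(data, pos)
            current.append(value)
        prev = tuple(current)
        triples.append(prev)
    return triples
```

Because neighbouring triples in a sorted index tend to share a prefix, most entries shrink to a header byte plus one small gap. The same idea applies to each index ordering separately.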

vote up 1 vote down

http://www.openlinksw.com/weblog/oerling I've found it hard to follow at times, but you get the idea that Orri has thought a lot about it.

There have been some academic(-ish) papers here and there. I recently read a good one about a distributed triple store. I think it was about 4store(.org) but I can't remember where I found it. Anyone else know?

Otherwise, you probably have to ping the people that have built them for ideas. For instance, in the SemWeb.NET [1] triple store that I built, I found a simple MySQL structure [2] worked well enough to scale to 1B triples, though it was very space-hungry with many indexes.

[1] http://razor.occams.info/code/semweb/ [2] http://razor.occams.info/code/repo/?/semweb/src/SQLStore.cs

vote up 1 vote down

Most triple stores and frameworks are open source under liberal licenses (e.g. virtuoso, sesame, 4store, jena, openanzo, redland, semweb.net, mulgara). I think the best place to learn about building a triple store is by looking at those and spending time understanding how they work and what design decisions were made.

Hi Ian, I'm motivated not only by a desire to get questions answered on this_site but to also get a discussion going about the relative merits of different approaches. For example, as Josh mentioned, there seems no easy way to get around the need for at least 3 indexes. That's potentially a big bloat that might be off-putting for some. How can you minimize such inflation? Are there accepted encoding schemes for URLs in a triple store? I have also read elsewhere about partitioning schemes that can speed up searches... – Andrew Matthews Nov 4 at 21:51
vote up 0 vote down

I wonder whether, once you have built an application on top of your triple store and know the queries you run, it would be possible to "trim" from the indexes the things your queries will never use. That would remove the unnecessary bloat; think of it as a "database packing" operation.

Trimming in place is probably not doable, but what about re-indexing the data while skipping what won't be used, with the no-skip list generated from a collection of your SPARQL queries?
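A minimal sketch of how such a no-skip list might be derived, assuming the triple patterns have already been extracted from the SPARQL queries as `(s, p, o)` tuples with `None` for variables (the extraction step is not shown). The candidate orderings and the function names `serving_indexes` and `minimal_index_set` are hypothetical:

```python
from itertools import combinations

# Candidate index orderings; an assumption for this sketch.
CANDIDATE_INDEXES = ("spo", "pos", "osp")


def serving_indexes(pattern):
    """Return the candidate indexes whose leading keys are exactly the bound terms."""
    bound = {pos for pos, term in zip("spo", pattern) if term is not None}
    return {
        name for name in CANDIDATE_INDEXES
        if set(name[:len(bound)]) == bound
    }


def minimal_index_set(patterns):
    """Smallest set of candidate indexes that serves every pattern by prefix lookup."""
    options = [serving_indexes(p) for p in patterns]
    for size in range(1, len(CANDIDATE_INDEXES) + 1):
        for subset in combinations(CANDIDATE_INDEXES, size):
            chosen = set(subset)
            if all(opts & chosen for opts in options):
                return chosen
    return set(CANDIDATE_INDEXES)
```

For a workload whose patterns always bind the predicate, for example `(None, "rdf:type", "ex:Person")` and `(None, "foaf:name", None)`, this selects only POS, so a re-indexing pass could skip building the other orderings. Any query that later falls outside the collected patterns would then need a scan or a rebuild.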

Thanks, Laurian. For this question I'm more interested in how to construct a general purpose triple store, so I am also more interested in generally applicable tuning techniques as well... – Andrew Matthews Nov 10 at 21:48
vote up 0 vote down

A search on Google brought up the paper "Design and implementation of an RDF Triple Store".

If you're more the source-code-reading type, you may want to take a look at the sources of TDB, which is the native storage engine in Jena, or at the sources of Sesame.

Update: Another place where you could learn something about the topic is the BigData project. They write a lot about technical details in their blog, and it's open source, so you can take a look at the sources, too.

I read the paper. Not very impressive. It does score points in the (some obscure language) stakes though. );^}> – Andrew Matthews Nov 18 at 10:48
