show/hide this revision's text 2 Moved tedious testing details to blog post.; added 20 characters in body

The reason I didn't realize this simple solution was that I was unaware of the procesing model of SPARQL, which in fact involves "involves generating every possible combination of bindings of values in the data to variables in the query, where the bindings match the query pattern" ... and "Each set of bindings then being evaluated against each FILTER expression and those sets that don't evaluate to true are discarded.", citing Brandon's discarded." according to a clarifying explanation by Brandon on the mailing list.

So far so good. Unfortunately though, this query scales very bad - exponentially with the number of variables / shift values tested for.

To verify this, I tried going from # = 1 as documented in this blog post. (where # denotes numbers of Using six similar shift values & variables) to # = 6. As exemplified with code (with prefixes omitted):

With # = 1:

SELECT ?s ?s1WHERE {\   ?s nmr:hasPeak [ nmr:hasShift ?s1 ] .FILTER ( fn:abs(?s1 - 17.6) < 0.3 ).} LIMIT 1

...To # = 2:

SELECT ?s ?s1 ?s2WHERE {\   ?s nmr:hasPeak [ nmr:hasShift ?s1 ] ,                  [ nmr:hasShift ?s2 ] .FILTER ( fn:abs(?s1 - 17.6) < 0.3 &&          fn:abs(?s2 - 18.3) < 0.3 ).} LIMIT 1

... etc, up to # = 6. The results from timing the queries follow below:

# |  Reasoning time (s)  |  log10(Reasoning time)1 |              0.043   |                 -1.372 |              0.058   |                 -1.243 |              0.295   |                 -0.534 |              3.332   |                  0.525 |             44.183   |                  1.656 |            488.612   |                  2.69

The log10(reasoning time) results in a straight line when plotting (except in the very beginning).

(I'm using Pellet 2.0.0 in Bioclipsesearch, takes some 488 seconds, on a 1.3 GHz Intel Su7300 Dual Core laptop (ASUS UL30A))dataset with 1 spectrum!)

show/hide this revision's text 1

With the help of advice from Brandon Ibach on the [pellet-users] mailing list, we found a query that works (but unfortunatly has serious scaling problems, as explained below).

First, the sample data I'm using now:

(Yes, one single spectra as for now. That is due to the scaling problems.)

The file contain a spectrum with peaks (nmr:hasPeak) with one shift each (nmr:hasShift), with the following values (for convenience if anyone wants to try): [17.6, 18.3, 22.6, 26.5, 31.7, 33.5, 33.5, 41.8, 42.0, 42.2, 78.34, 140.99, 158.3, 193.4, 203.0, 0]

Trying to search for just a few of these peaks (so as to not run into performance problem) against the 1 spectrum file, results in this SPARQL query, which works fine, and finishes in 3 sec:

PREFIX owl:<http://www.w3.org/2002/07/owl#>
PREFIX afn:<http://jena.hpl.hp.com/ARQ/function#>
PREFIX fn:<http://www.w3.org/2005/xpath-functions#>
PREFIX nmr:<http://www.nmrshiftdb.org/onto#>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?s1 ?s2 ?s3 ?s4
WHERE {
   ?s nmr:hasPeak [ nmr:hasShift ?s1 ] ,
                  [ nmr:hasShift ?s2 ] ,
                  [ nmr:hasShift ?s3 ] ,
                  [ nmr:hasShift ?s4 ] .
FILTER ( fn:abs(?s1 - 17.6) < 0.3 &&
          fn:abs(?s2 - 18.3) < 0.3 &&
          fn:abs(?s3 - 22.6) < 0.3 &&
          fn:abs(?s4 - 26.5) < 0.3 ) .
} LIMIT 1

The reason I didn't realize this simple solution was that I was unaware of the procesing model of SPARQL, which in fact involves "generating every possible combination of bindings of values in the data to variables in the query, where the bindings match the query pattern" ... and "Each set of bindings then being evaluated against each FILTER expression and those sets that don't evaluate to true are discarded.", citing Brandon's clarifying explanation.

So far so good. Unfortunately though, this query scales very bad - exponentially with the number of variables / shift values tested for.

To verify this, I tried going from # = 1 (where # denotes numbers of shift values & variables) to # = 6. As exemplified with code (with prefixes omitted):

With # = 1:

SELECT ?s ?s1
WHERE {\
   ?s nmr:hasPeak [ nmr:hasShift ?s1 ] .
FILTER ( fn:abs(?s1 - 17.6) < 0.3 ).
} LIMIT 1

...To # = 2:

SELECT ?s ?s1 ?s2
WHERE {\
   ?s nmr:hasPeak [ nmr:hasShift ?s1 ] ,
                  [ nmr:hasShift ?s2 ] .
FILTER ( fn:abs(?s1 - 17.6) < 0.3 &&
          fn:abs(?s2 - 18.3) < 0.3 ).
} LIMIT 1

... etc, up to # = 6. The results from timing the queries follow below:

# |  Reasoning time (s)  |  log10(Reasoning time)
---------------------------------------------------
1 |              0.043   |                 -1.37
2 |              0.058   |                 -1.24
3 |              0.295   |                 -0.53
4 |              3.332   |                  0.52
5 |             44.183   |                  1.65
6 |            488.612   |                  2.69

The log10(reasoning time) results in a straight line when plotting (except in the very beginning).

(I'm using Pellet 2.0.0 in Bioclipse, on a 1.3 GHz Intel Su7300 Dual Core laptop (ASUS UL30A)))

This is of course problematic as one might want to search for spectra with up to as many as 50 peaks in the worst case, and of course with datasets bigger than 1 spectrum, so any hints about alternative strategies with more linear scaling behavior are still highly welcome.