SPARQL Best Practices

Writing efficient SPARQL queries for TopBraid requires a good understanding of both the data model and how TopBraid executes SPARQL queries. TopBraid bundles the Apache Jena SPARQL engine, which does not optimize queries automatically. The onus is therefore on the query author to follow the best practices outlined below.

How SPARQL queries are processed

The WHERE clause of a SPARQL query produces bindings of variables. In the following query, the variable ?continent will be bound to each instance of the class g:Continent.

SELECT ?continent
WHERE {
    ?continent a g:Continent .
}

The row in the WHERE clause above is a triple pattern. One or more consecutive triple patterns form a Basic Graph Pattern, which is matched against the triples found in the current query graph. In general, a query is executed from top to bottom, so that the variable bindings from one row are used as input to the next row.
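As a minimal sketch of how bindings flow from one row to the next, the query below adds a second row to the example above. The g: namespace IRI is a hypothetical placeholder, since this page does not show the prefix declarations; substitute the namespace of your own data model.

```sparql
PREFIX g:    <http://example.org/geography#>   # hypothetical namespace, an assumption
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?continent ?country
WHERE {
    # Row 1: binds ?continent to each instance of g:Continent.
    ?continent a g:Continent .
    # Row 2: evaluated with the ?continent bindings from row 1 as input;
    # binds ?country to each resource that has that continent as skos:broader.
    ?country skos:broader ?continent .
}
```

Each result row pairs one ?continent binding from the first row with one matching ?country binding from the second.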

Order of Basic Graph Patterns

The most important rule of writing efficient SPARQL queries is to narrow down the candidate variable bindings as early as possible. The following (bad) query will return all continents together with their labels:

SELECT ?continent ?label
WHERE {
    ?continent skos:prefLabel ?label .
    ?continent a g:Continent .
}

When the above query is executed, the engine will first walk through all triples that have skos:prefLabel as predicate and only then combine those triple matches with the next row, which checks whether the bindings of ?continent are in fact instances of g:Continent.

This is very inefficient because there may be hundreds of thousands of triples with skos:prefLabel but only a handful of continents. The engine would do a lot of extra work that could be eliminated by reordering the clauses to the following (better) query:

SELECT ?continent ?label
WHERE {
    ?continent a g:Continent .
    ?continent skos:prefLabel ?label .
}

The difference is that here the engine first finds the (few) instances of g:Continent and then queries the labels only for those.
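The same rule extends to longer queries. The following sketch assumes, as above, that the graph contains few continents but many labels, and that countries point to their continent via skos:broader; the g: namespace IRI is a hypothetical placeholder. Under those assumptions, the most selective row goes first and the broad label lookup goes last.

```sparql
PREFIX g:    <http://example.org/geography#>   # hypothetical namespace, an assumption
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?country ?label
WHERE {
    # Most selective row first: only a handful of continents are expected.
    ?continent a g:Continent .
    # Next: only the resources directly below those continents.
    ?country skos:broader ?continent .
    # Broadest predicate last: labels are fetched only for the remaining ?country bindings.
    ?country skos:prefLabel ?label .
}
```

Whether this order is really the best one depends on the actual sizes in your data, so treat the comments as assumptions to verify rather than guarantees.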

Joining Variables

If you want to join two usages of the same value, avoid comparing them with = in a FILTER. Instead, reuse the same variable so that the engine can look up the matching triples directly. For example, the following query will deliver all instances of Country together with instances of Address that have the same country code.

SELECT ?country ?address
WHERE {
    ?country a g:Country .
    ?country g:isoCode2 ?countryCode .
    ?address a x:Address .
    ?address x:countryCode ?addressCode .
    FILTER (?countryCode = ?addressCode) .
}

The above is very inefficient: for each value of ?countryCode, the engine walks through all instances of x:Address, fetches each ?addressCode, and only then compares ?countryCode and ?addressCode with the = operator. With 200 countries and 1000 addresses, that is 200 x 1000 = 200,000 comparisons.

A far more efficient variation would be this:

SELECT ?country ?address
WHERE {
    ?country a g:Country .
    ?country g:isoCode2 ?countryCode .
    ?address x:countryCode ?countryCode .
    ?address a x:Address .
}

This (better) variation loops over the 200 countries and, for each one, visits only the addresses that actually have a matching country code. The engine does a direct value match, querying exactly the triples that have the given ?countryCode as their x:countryCode. Only then does it verify that the ?address is also an instance of x:Address.

So always check if you can reuse a variable that already has a value before introducing a new variable.
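The same idea applies when the value is a constant rather than a variable. The sketch below assumes that g:isoCode2 values are stored as plain string literals such as "DE"; the g: namespace IRI is a hypothetical placeholder. Putting the constant directly into the triple pattern lets the engine look up the matching triples instead of binding every country code and filtering afterwards.

```sparql
PREFIX g: <http://example.org/geography#>   # hypothetical namespace, an assumption

SELECT ?country
WHERE {
    # Direct lookup: matches only triples whose object is the literal "DE".
    ?country g:isoCode2 "DE" .
    # Then verify the type for the few remaining bindings.
    ?country a g:Country .

    # Slower alternative that this replaces:
    #   ?country a g:Country .
    #   ?country g:isoCode2 ?code .
    #   FILTER (?code = "DE")
}
```

Note that a constant in a triple pattern must match the stored literal exactly, including any datatype or language tag, so check how the values are stored before rewriting a FILTER this way.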

FILTER Placement

TopBraid’s SPARQL engine will usually move FILTER clauses automatically to the end of their surrounding { … } block. For example, the following FILTER clause will be executed after all basic graph patterns have been processed:

SELECT ?continent ?label ?country
WHERE {
    ?continent a g:Continent .
    ?continent skos:prefLabel ?label .
    FILTER langMatches(lang(?label), 'en') .
    ?country skos:broader ?continent .
}

The “real” execution order of the query above will be as follows:

SELECT ?continent ?label ?country
WHERE {
    ?continent a g:Continent .
    ?continent skos:prefLabel ?label .
    ?country skos:broader ?continent .
    FILTER langMatches(lang(?label), 'en') .
}

As a result, the line that fetches the skos:broader matches will be executed even if the language of the label is not English. In practice this means that the engine would do a lot of unnecessary work with triples that will later be filtered out anyway.

To avoid this, you can introduce extra {…} blocks to make sure that the FILTER is executed earlier:

SELECT ?continent ?label ?country
WHERE {
    {
        ?continent a g:Continent .
        ?continent skos:prefLabel ?label .
        FILTER langMatches(lang(?label), 'en') .
    }
    ?country skos:broader ?continent .
}

Hint

It is important to understand that SPARQL is executed “from the inside out”. This means that the engine will first execute the inner {…} blocks.
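A consequence of this is that nested blocks should be used deliberately. The sketch below shows a pattern to avoid; it assumes the standard bottom-up evaluation described in the hint, and the actual execution plan may differ between engine versions. The g: namespace IRI is a hypothetical placeholder.

```sparql
PREFIX g:    <http://example.org/geography#>   # hypothetical namespace, an assumption
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?continent ?country
WHERE {
    ?continent a g:Continent .
    {
        # This inner block is matched on its own. ?continent is not yet
        # bound here, so the row may match every skos:broader triple in
        # the graph before the results are joined with the outer row.
        ?country skos:broader ?continent .
    }
}
```

Without the inner braces, the skos:broader row would receive the ?continent bindings from the row above it. Extra blocks are therefore best reserved for cases where a FILTER needs to run early on a small set of bindings.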

How to Identify Slow SPARQL Queries

The Usage Statistics Admin Page records any SPARQL query that took longer than a configured value. If you are an administrator or power user, you may want to check that page occasionally and inform users about best practices.