gehacktes ///

0016Querying Wikidata About Vienna’s Tram Lines: An Introduction to SPARQL

Recently I attended a workshop about Wikidata and its query language SPARQL. Playing around with Wikidata queries turned out to be fun, and so I ended up writing a blog post.

In Wikidata, data is organized in statements, with each statement being a triplet of the form “subject – predicate – object”, for example:

Things that can be used as subject or object are called items, and things that can be used as predicate are called properties. There are a few thousand properties in Wikidata, and literally millions of items.

Let’s use this data for some cool queries!

What Tram Lines are There in Vienna?

The item Q15145593 represents a “tram line”. But if we simply query for “every item that is a tram line”, the result would be all tram lines, not just the ones in Vienna.

Fortunately, there is also the item Q129612, “tram transport in Vienna”. To narrow down the results we can thus go for “every item that is a tram line, and is also part of tram transport in Vienna”.

Here’s how to express this in a query:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q15145593. # ?item – is a – tram line
  ?item wdt:P361 wd:Q129612.  # ?item – is part of – tram transport in Vienna
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
ORDER BY xsd:integer(?itemLabel) ?itemLabel

→ Try this query on Wikidata!

If you know SQL this syntax might look vaguely familiar. In fact it is written in SPARQL, a semantic query language for databases like Wikidata.

Wikdata’s query editor explains items and properties on mouseover.

Our conditions are specified in the second and third line: ?item is a variable, the other terms represent properties (wdt:…) and items (wd:…).1 If you use the query editor on Wikidata you can simply mouse over these terms to see their explanations.

The line starting with SERVICE … applies some magic to the results: For each variable it creates an additional variable (e.g. ?itemLabel for ?item) containing the item’s actual label. In our case that’s the name of the tram line (like „D“ or „49“).2

Finally there’s an ORDER BY clause sorting the results by their label’s integer value (because sorting alphabetically would result in “1, 10, 18, 2, …“).3

Tram Lines and Districts

Now look at a tram line item in detail. You’ll notice that amongst others it has the property P131 “located in”, specifying all the districts the line runs in.4 This is true for all Viennese tram lines, and it allows for some cool queries.

For example: How many tram lines are there in each district? Here’s the query:

SELECT ?district ?districtLabel
       (COUNT(?line) AS ?lineCount)
       (GROUP_CONCAT(?lineLabel) AS ?lines)
  ?district wdt:P31 wd:Q261023;         # ?district is a district of Vienna …
            rdfs:label ?districtLabel.  # … and has the label ?districtLabel.
  ?line wdt:P31 wd:Q15145593;   # ?line is a tram line …
        wdt:P131 ?district;     # … and is located in ?district …
        rdfs:label ?lineLabel.  # … and has the label ?lineLabel.
  FILTER(lang(?lineLabel)="en" && lang(?districtLabel)="en")
GROUP BY ?district ?districtLabel
ORDER BY DESC(?lineCount)

→ Try this query on Wikidata!

Again, if you know SQL the concept behind this query should look familiar: Select all lines with all their districts, then group and aggregate these results by district.5

The WHERE block makes use of two features we haven’t seen yet:

Note that we don’t use the label service (SERVICE wikibase:label …) in this query: ?lineLabel is used in an aggregate function, and the label service can’t do its magic there. Fortunately it’s relatively simple to manually replicate: A condition like ?item rdfs:label ?itemLabel will give you the item’s label … but in all languages. So an appropriate FILTER expression is needed to restrict the results to labels in one specific language.

A Graph of Tram Lines and Districts

Wikidata Graph Builder is a tool that shows you the results of a Wikidata query as directed graph. Of course not all queries make sense to be displayed in that way, but there are cases where a graph makes the results easier to understand. For example, the following query generates a graph of all districts and tram lines, showing which lines run in which districts:

# Graph Builder requires variables to be named `?item`, `?itemLabel`, and `?linkTo`
SELECT ?item ?itemLabel ?linkTo WHERE {
    ?item wdt:P31 wd:Q15145593;   # ?item (tram line) is a tram line …
          wdt:P361 wd:Q129612;    # … and is part of the Vienna tram system
          wdt:P131 ?linkTo.       # … and is located in ?linkTo (district)
  } UNION {
    ?item wdt:P31 wd:Q261023;     # ?item (district) is a district of Vienna
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }

→ Try this query on Graph Builder or on Wikidata!

Apart from displaying the results as a graph, the only new feature in this query is the usage of UNION to combine two independent result sets.6

Adjacent Districts not Connected by a Tram Line

The property P47 “shares a border with” tells us which districts are adjacent. We can combine this information with the data about tram lines to get a list of adjacent districts that are not connected by a tram line:

SELECT ?aLabel ?bLabel WHERE {
  ?a wdt:P31 wd:Q261023;       # ?a is a district of Vienna …
     wdt:P1545 ?aNr.           # … and has number ?aNr.
  ?b wdt:P31 wd:Q261023;       # ?b is a district of Vienna …
     wdt:P1545 ?bNr;           # … and has number ?bNr …
     wdt:P47 ?a.               # … and shares a border with ?a.
    ?line wdt:P361 wd:Q129612; # ?line is part of tram transport Vienna …
          wdt:P131 ?a, ?b.     # … and is located in ?a and ?b.
  FILTER (?aNr < ?bNr)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
ORDER BY ?aNr ?bNr

→ Try this query on Wikidata!

The query works by building a list of all adjacent districts, then excluding those that are connected by a tram line. Exclusion is achieved by a FILTER NOT EXISTS block, which – contrary to a simple FILTER constraint – can contain arbitrary triplets and even introduce new variables.7 Both types of filters can be combined freely.

The second filter block FILTER (?aNr < ?bNr) makes sure that each pair of districts appears in the results only once, with the lower numbered district first. Otherwise, a pair of districts like “Favoriten – Liesing” would also show up as “Liesing – Favoriten” etc.

There’s also some new syntax present: Triplets sharing the same subject and predicate can be further shortened. Thus

?line wdt:P361 ?a, ?b.

is simply a more succinct way of writing

?line wdt:P361 ?a;
      wdt:P361 ?b.


A common issue with real world data is that it is prone to change over time. For example the length of a tram line might change when the line gets cut or extended at some point in time.

On Wikidata, “qualifiers” can be used to annotate statement triplets. So for example the statement “line 26 has a length of 11.6km” can be complemented by the qualifier “since October 5, 2013”. The item for line 26 even contains a statement about the line’s previous length, with appropriate qualifiers “since December 22, 2012” and “until October 4, 2013”.

Unsurprisingly, these qualifiers can also be used in SPARQL queries. For example, here’s a list of all tram lines in Vienna ordered by the date their length last changed:

SELECT ?line ?lineLabel (MAX(?changed) AS ?lastChanged) WHERE {
  ?line wdt:P361 wd:Q129612; # ?line is part of tram transport Vienna …
        p:P2043 [            # … and has property “length” …
          pq:P580 ?changed   # … with qualifier “since”, stored in ?changed
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
GROUP BY ?line ?lineLabel
ORDER BY DESC(?lastChanged)

→ Try this query on Wikidata!

Again, some new syntax in this query: Brackets can be used to avoid intermediary variables. Take as an example the follwing expression:

?line p:P2043 ?length.
?length pq:P580 ?changed.

If the variable ?length isn’t needed anywhere else, this expression can be shortened to

?line p:P2043 [pq:P580 ?changed].

Note that to access the qualifier we need to use the property with a different prefix (p:P2043 instead of wdt:P2043). You can think of using wdt:P… as a shortcut that directly yields the property’s value, while p:P… yields a property item that then can be queried with triplets like any other item.8

Finally, grouping and an aggregate function (MAX) are used to restrict the results to the last date of change for each line. Otherwise lines with more than one statement about their length (like line 26) would appear in the results multiple times.

Further Reading

If you want to know more about SPARQL, Wikidata has a some great tutorials, as well as a long list of example queries.

You may also want to look at the SPARQL specification itself.

Finally, if there are Wikidata events in your location, just drop by!

  1. Like in most databases, items and properties in Wikidata are identified by unique IDs. These are simply numbers, prefixed with P (properties) or Q (items). The very first item Q1 is, appropriately, the universe. And as manifestation of infinite improbability, Q42 is Douglas Adams.) 

  2. A cool feature of this service is that is supports internationalization: Wikidata items may have labels in multiple languages, and the service let’s you specify your preferred language. More languages can be added as fallback for the case when a label is not available in the preferred language. If you specify [AUTO_LANGUAGE] as language this means “use the language currently selected in the Wikidata user interface”. 

  3. In fact it’s a bit more complicated: Because there are some non-numeric lines like “D” and “O” (that’s letter Oh, not zero), all having an integer value of zero, we additionally need to sort by the label itself to get the expected ordering “D, O, 1, 2, …”. 

  4. This also demonstrates that a property can have more than a single value. Or more general: a “subject – predicate – object” statement can have multiple objects. But technically this is mere syntacic sugar for a set of single-object statements (“line 60 – is located in – Hietzing”, “line 60 – is located in – Penzing”, and so on). 

  5. The line with GROUP_CONCAT(…) isn’t necessary for the query to work, but nice to have: It adds a string containing all tram lines for each district.9 

  6. If you’ve wondered how to express a “logical or” in SPARQL conditions: UNION is the anwer. 

  7. There’s also FILTER EXISTS which is useful if you want to restrict the results to those having a specific property, but you don’t care about the actual value(s) of this property. 

  8. It’s also possible to refer to the property’s value when using the p:P… syntax – this is where the prefix ps: comes into play: ?line p:P2043 [ps:P2043 ?length] is a complicated way of expressing ?line wdt:P2043 ?length, but makes sense if used in conjunction with other qualifiers. 

  9. Did you know that GROUP_CONCAT is not specific to SPARQL but also available for example in MySQL (as group_concat) and PostgreSQL (as string_agg)?10 

  10. Did you notice that I really like footnotes? Especially second order footnotes, i.e. footnotes in footnotes. 

0015Cookie Cutters from the 3D Printer