0016Querying Wikidata About Vienna’s Tram Lines: An Introduction to SPARQL
2018-02-18
Recently I attended a workshop about Wikidata and its query language SPARQL. Playing around with Wikidata queries turned out to be fun, and so I ended up writing a blog post.
In Wikidata, data is organised in statements, with each statement being a triplet of the form “subject – predicate – object”, for example:
Vienna
–is a
–capital city
Stephansdom
–is located in
–Vienna
Things that can be used as subject or object are called items, and things that can be used as predicate are called properties. There are a few thousand properties in Wikidata, and literally millions of items.
Let’s use this data for some cool queries!
What Tram Lines are There in Vienna?
The item Q15145593 represents a “tram line”. But if we simply query for “every item that is a tram line”, the result would be all tram lines, not just the ones in Vienna.
Fortunately, there is also the item Q129612, “tram transport in Vienna”. To narrow down the results we can thus go for “every item that is a tram line, and is also part of tram transport in Vienna”.
Here’s how to express this in a query:
SELECT ?item ?itemLabel WHERE {
?item wdt:P31 wd:Q15145593. # ?item – is a – tram line
?item wdt:P361 wd:Q129612. # ?item – is part of – tram transport in Vienna
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY xsd:integer(?itemLabel) ?itemLabel
If you know SQL this syntax might look vaguely familiar. In fact it is written in SPARQL, a semantic query language for databases like Wikidata.
Our conditions are specified in the second and third line: ?item
is a variable, the other terms represent properties (wdt:…
) and items (wd:…
).1 If you use the query editor on Wikidata you can simply mouse over these terms to see their explanations.
The line starting with SERVICE …
applies some magic to the results: For each variable it creates an additional variable (e.g. ?itemLabel
for ?item
) containing the item’s actual label. In our case that’s the name of the tram line (like „D“ or „49“).2
Finally there’s an ORDER BY
clause sorting the results by their label’s integer value (because sorting alphabetically would result in “1, 10, 18, 2, …“).3
Tram Lines and Districts
Now look at a tram line item in detail. You’ll notice that amongst others it has the property P131 “located in”, specifying all the districts the line runs in.4 This is true for all Viennese tram lines, and it allows for some cool queries.
For example: How many tram lines are there in each district? Here’s the query:
SELECT ?district ?districtLabel
(COUNT(?line) AS ?lineCount)
(GROUP_CONCAT(?lineLabel) AS ?lines)
WHERE {
?district wdt:P31 wd:Q261023; # ?district is a district of Vienna …
rdfs:label ?districtLabel. # … and has the label ?districtLabel.
?line wdt:P31 wd:Q15145593; # ?line is a tram line …
wdt:P131 ?district; # … and is located in ?district …
rdfs:label ?lineLabel. # … and has the label ?lineLabel.
FILTER(lang(?lineLabel)="en" && lang(?districtLabel)="en")
}
GROUP BY ?district ?districtLabel
ORDER BY DESC(?lineCount)
Again, if you know SQL the concept behind this query should look familiar: Select all lines with all their districts, then group and aggregate these results by district.5
The WHERE
block makes use of two features we haven’t seen yet:
- Several conditions for the same variable can be combined into a single condition by using a
;
as delimiter:?foo x 1; y 2.
is equivalent to?foo x 1. ?foo y 2.
- A
FILTER(…)
expression can be used to further restrict the results.
Note that we don’t use the label service (SERVICE wikibase:label …
) in this query: ?lineLabel
is used in an aggregate function, and the label service can’t do its magic there. Fortunately it’s relatively simple to manually replicate: A condition like ?item rdfs:label ?itemLabel
will give you the item’s label … but in all languages. So an appropriate FILTER
expression is needed to restrict the results to labels in one specific language.
A Graph of Tram Lines and Districts
Wikidata Graph Builder is a tool that shows you the results of a Wikidata query as directed graph. Of course not all queries make sense to be displayed in that way, but there are cases where a graph makes the results easier to understand. For example, the following query generates a graph of all districts and tram lines, showing which lines run in which districts:
# Graph Builder requires variables to be named `?item`, `?itemLabel`, and `?linkTo`
SELECT ?item ?itemLabel ?linkTo WHERE {
{
?item wdt:P31 wd:Q15145593; # ?item (tram line) is a tram line …
wdt:P361 wd:Q129612; # … and is part of the Vienna tram system
wdt:P131 ?linkTo. # … and is located in ?linkTo (district)
} UNION {
?item wdt:P31 wd:Q261023; # ?item (district) is a district of Vienna
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
→ Try this query on Graph Builder or on Wikidata!
Apart from displaying the results as a graph, the only new feature in this query is the usage of UNION
to combine two independent result sets.6
Adjacent Districts not Connected by a Tram Line
The property P47 “shares a border with” tells us which districts are adjacent. We can combine this information with the data about tram lines to get a list of adjacent districts that are not connected by a tram line:
SELECT ?aLabel ?bLabel WHERE {
?a wdt:P31 wd:Q261023; # ?a is a district of Vienna …
wdt:P1545 ?aNr. # … and has number ?aNr.
?b wdt:P31 wd:Q261023; # ?b is a district of Vienna …
wdt:P1545 ?bNr; # … and has number ?bNr …
wdt:P47 ?a. # … and shares a border with ?a.
FILTER NOT EXISTS {
?line wdt:P361 wd:Q129612; # ?line is part of tram transport Vienna …
wdt:P131 ?a, ?b. # … and is located in ?a and ?b.
}
FILTER (?aNr < ?bNr)
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY ?aNr ?bNr
The query works by building a list of all adjacent districts, then excluding those that are connected by a tram line. Exclusion is achieved by a FILTER NOT EXISTS
block, which – contrary to a simple FILTER
constraint – can contain arbitrary triplets and even introduce new variables.7 Both types of filters can be combined freely.
The second filter block FILTER (?aNr < ?bNr)
makes sure that each pair of districts appears in the results only once, with the lower numbered district first. Otherwise, a pair of districts like “Favoriten – Liesing” would also show up as “Liesing – Favoriten” etc.
There’s also some new syntax present: Triplets sharing the same subject and predicate can be further shortened. Thus
?line wdt:P361 ?a, ?b.
is simply a more succinct way of writing
?line wdt:P361 ?a;
wdt:P361 ?b.
Qualifiers
A common issue with real world data is that it is prone to change over time. For example the length of a tram line might change when the line gets cut or extended at some point in time.
On Wikidata, “qualifiers” can be used to annotate statement triplets. So for example the statement “line 26 has a length of 11.6km” can be complemented by the qualifier “since October 5, 2013”. The item for line 26 even contains a statement about the line’s previous length, with appropriate qualifiers “since December 22, 2012” and “until October 4, 2013”.
Unsurprisingly, these qualifiers can also be used in SPARQL queries. For example, here’s a list of all tram lines in Vienna ordered by the date their length last changed:
SELECT ?line ?lineLabel (MAX(?changed) AS ?lastChanged) WHERE {
?line wdt:P361 wd:Q129612; # ?line is part of tram transport Vienna …
p:P2043 [ # … and has property “length” …
pq:P580 ?changed # … with qualifier “since”, stored in ?changed
].
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?line ?lineLabel
ORDER BY DESC(?lastChanged)
Again, some new syntax in this query: Brackets can be used to avoid intermediary variables. Take as an example the follwing expression:
?line p:P2043 ?length.
?length pq:P580 ?changed.
If the variable ?length
isn’t needed anywhere else, this expression can be shortened to
?line p:P2043 [pq:P580 ?changed].
Note that to access the qualifier we need to use the property with a different prefix (p:P2043
instead of wdt:P2043
). You can think of using wdt:P…
as a shortcut that directly yields the property’s value, while p:P…
yields a property item that then can be queried with triplets like any other item.8
Finally, grouping and an aggregate function (MAX
) are used to restrict the results to the last date of change for each line. Otherwise lines with more than one statement about their length (like line 26) would appear in the results multiple times.
Addition 2019-10: Lines and Districts Sharing a Number
Recently there was a tweet posing an interesting question:
Which of the tram lines in Vienna pass through the district sharing the same number (like line 1 passing through district 1 etc.)?
Unsurprisingly, a SPARQL query can be constructed to answer this question:
SELECT distinct ?Line ?LineLabel ?LineNr ?District ?DistrictNr WHERE {
?District wdt:P31 wd:Q261023; # ?District / is a / district of Vienna
p:P31 [pq:P1545 ?DistrictNr]. # … and its property “is a” / has qualifier “ordinal number” / as ?DistrictNr
?Line wdt:P31 wd:Q15145593; # ?Line / is a / tram line
wdt:P131 ?District; # … and passes through / ?District
rdfs:label ?LineLabel. # … and has label / ?LineLabel
# Restrict to english labels only, and extract ?LineNr from ?LineLabel
FILTER(lang(?LineLabel) = "en").
BIND(STRBEFORE(STRAFTER(?LineLabel, "Tram line "), " (Vienna)") as ?LineNr).
# Restrict to lines having the same number as the district
FILTER(xsd:integer(?LineNr) = xsd:integer(?DistrictNr)).
}
ORDER BY xsd:integer(?LineNr)
The query is rather straightforward, there’s only one new feature: BIND
is used to calculate an expression (a string operation extracting the actual line number from string like tram line 2 (Vienna)
) and making the result available in the variable ?LineNr
. This variable can then be uses like any other variable, for example in the FILTER
clause.
Further Reading
If you want to know more about SPARQL, Wikidata has a some great tutorials, as well as a long list of example queries.
You may also want to look at the SPARQL specification itself.
Finally, if there are Wikidata events in your location, just drop by!
-
Like in most databases, items and properties in Wikidata are identified by unique IDs. These are simply numbers, prefixed with
P
(properties) orQ
(items). The very first item Q1 is, appropriately, the universe. And as manifestation of infinite improbability, Q42 is Douglas Adams.) ↩ -
A cool feature of this service is that is supports internationalisation: Wikidata items may have labels in multiple languages, and the service let’s you specify your preferred language. More languages can be added as fallback for the case when a label is not available in the preferred language. If you specify
[AUTO_LANGUAGE]
as language this means “use the language currently selected in the Wikidata user interface”. ↩ -
In fact it’s a bit more complicated: Because there are some non-numeric lines like “D” and “O” (that’s letter Oh, not zero), all having an integer value of zero, we additionally need to sort by the label itself to get the expected ordering “D, O, 1, 2, …”. ↩
-
This also demonstrates that a property can have more than a single value. Or more general: a “subject – predicate – object” statement can have multiple objects. But technically this is mere syntactic sugar for a set of single-object statements (“line 60 – is located in – Hietzing”, “line 60 – is located in – Penzing”, and so on). ↩
-
The line with
GROUP_CONCAT(…)
isn’t necessary for the query to work, but nice to have: It adds a string containing all tram lines for each district.9 ↩ -
If you’ve wondered how to express a “logical or” in SPARQL conditions:
UNION
is the answer. ↩ -
There’s also
FILTER EXISTS
which is useful if you want to restrict the results to those having a specific property, but you don’t care about the actual value(s) of this property. ↩ -
It’s also possible to refer to the property’s value when using the
p:P…
syntax – this is where the prefixps:
comes into play:?line p:P2043 [ps:P2043 ?length]
is a complicated way of expressing?line wdt:P2043 ?length
, but makes sense if used in conjunction with other qualifiers. ↩ -
Did you know that
GROUP_CONCAT
is not specific to SPARQL but also available for example in MySQL (asgroup_concat
) and PostgreSQL (asstring_agg
)?10 ↩ -
Did you notice that I really like footnotes? Especially second order footnotes, i.e. footnotes in footnotes. ↩
Leave a comment