{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The Internet Is My Database\n", "\n", "As a developer, you may be familiar with using SQL to query your SQLite/MySql/PostgreSQL database. Thanks to the Semantic Web and its people, you can treat the Internet as a database and query it with SPARQL. We'll take a look at what we can query, write ourselves some utilities to make querying easier and test out some queries.\n", "\n", "Please note, this is only an introduction to using SPARQL with Python to query open SPARQL endpoints on the web. If you wish to use data obtained using SPARQL within an application, just like with any API, you should store the results locally and serve this local data to your users, refreshing periodically. This is faster for your users and being nice to the data provider.\n", "\n", "## What Can I Query?\n", "\n", "There are only a limited number of public SPARQL endpoints you can query. Submitting a SPARQL query is much like submitting an SQL query straight to the database, except that database is public. Due to the database being public, most SPARQL endpoints don't accept \"UPDATE\"/\"DELETE\" queries! Unfortunately, \"UPDATE\"/\"DELETE\" isn't the only malicious thing an end-user can do with a query, it is relatively easy to DDOS a SPARQL endpoint by sending queries that take a long time to execute. The more common REST/RESTFUL API's restrict what you can ask the database and sometimes implement rate-limiting, hence some Semantic Web applications, such as [Thomson Reuters PermID](https://permid.org/) only expose an API. Unfortunately the technology I'm currently using on my website also doesn't permit me to expose a SPARQL endpoint. Like many other organisations I only permit the RDF to be downloaded and queried locally.\n", "\n", "So who does expose a SPARQL endpoint that we can access? The most famous, general purpose one is provided by [DBpedia](http://wiki.dbpedia.org/), which gathers all the data in those little Wikipedia boxes. We'll use their SPARQL endpoint. \n", "\n", "## import What?\n", "\n", "Python is famously \"batteries included\", unfortunately we're lacking in the SPARQL query space; there's no SPARQLAlchemy, *yet*! So far in the blog we've used RDFLib when making Semantic Web applications, but that SPARQL interface is for local data, so we're switching to SPARQLWrapper. It's pretty bare-bones, so we'll make some utility functions to make it easy to use." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# pip install sparqlwrapper\n", "from SPARQLWrapper import SPARQLWrapper, JSON\n", "\n", "endpoint = \"http://dbpedia.org/sparql/\"\n", "sparql = SPARQLWrapper(endpoint)\n", "# Return JSON so we can access results like a dict\n", "sparql.setReturnFormat(JSON)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Make Some Utilities\n", "\n", "You can copy these and adapt them. We're going to start with easy prefixes.\n", "\n", "Prefixes let us use short-hand for URIs, such as `rdf:type` instead of `http://www.w3.org/1999/02/22-rdf-syntax-ns#type`. Obviously we love them. But we don't want to write them out everytime we make a query. So we'll store our namespaces in a dict for easy code updates, use a function to get that data into the PREFIX syntax, and join them all together. Finally we create the function `make_query` to add the prefixes onto a given query." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "namespaces = dict(db=\"http://dbpedia.org/\",\n", " dbo=\"http://dbpedia.org/ontology/\",\n", " dbr=\"http://dbpedia.org/resource/\",\n", " dbc=\"http://dbpedia.org/page/Category:\",\n", " dbp=\"http://dbpedia.org/property/\",\n", " rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\",\n", " rdfs=\"http://www.w3.org/2000/01/rdf-schema#\",\n", " owl=\"http://www.w3.org/2002/07/owl#\")\n", "\n", "# function to put data into PREFIX syntax\n", "def prefix(ns, uri):\n", " return f\"PREFIX {ns}: <{uri}>\"\n", "\n", "# Join all the prefixes together\n", "prefixes = \"\\n\".join(prefix(ns=ns, uri=uri) \n", " for ns, uri in namespaces.items())\n", "\n", "# Utility function to add prefixes to a query\n", "def make_query(query:str) -> str:\n", " return \"{}\\n\\n{}\".format(prefixes, query)\n", "\n", "# Test it\n", "print(make_query(\"SELECT ?test WHERE {?test a magic:success.}\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need an easy way to make the query and parse the results. SPARQLWrapper requires that we set the query, then call the `query()` method. It returns JSON that contains more details than we need for this tutorial so we'll slim that down into a nice generator of dictionaries that associate the variable with its value. We're using `partial`, if you've not met it check out [Python Partial: Code Your Intention](http://www.paulbrownmagic.com/blog/python_partial_application)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from functools import partial\n", "\n", "\n", "def result_variables(result:dict) -> [str]:\n", " \"\"\"Get a list of the variable names used in the query.\"\"\"\n", " return result['head']['vars']\n", "\n", "def get_data(variables:[str], data:dict) -> dict:\n", " \"\"\"Get a dictionary with the variables as keys\n", " and their values as values.\"\"\"\n", " return {v:data[v]['value'] for v in variables}\n", "\n", "def parse_results(results:dict) -> map:\n", " \"\"\"Takes the JSON, partially applies get data to the\n", " variable names, and maps this over the results to\n", " return a generator of dictionaries.\"\"\"\n", " read_data = partial(get_data, result_variables(results))\n", " return map(read_data, results['results']['bindings'])\n", "\n", "def query(query:str) -> map:\n", " \"\"\"Take a query and return generator of results.\"\"\"\n", " sparql.setQuery(make_query(query))\n", " return parse_results(sparql.query().convert())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's **SPARQL!**\n", "\n", "Let's start with some basic \"SELECT\" queries. It's possible to do amazing things with SPARQL, including query multiple endpoints in a single query, but we've got to start at the beginning. \n", "\n", "Everything in Semantic Web is a triple: **subject, predicate, object**. Subject is what we're talking about, predicate is the relationship, object is what the subject is related to by that predicate. DBPedia is too big to ask for all the triples, so let's pick a subject. Go to [Wikipedia](https://en.wikipedia.org/) and choose an article that's not a stub *and doesn't have parenthesis () in the title*. We'll discuss how to deal with DBpedia's parenthesis in the next query. \n", "\n", "I'm going to use my favourite magic trick: https://en.wikipedia.org/wiki/Cups_and_balls. The wikipedia URL maps to a DBpedia resource, for my subject it will be `dbr:Cups_and_balls`. Your's will be `dbr:` followed by the last part of your Wikipedia URL. Or to explain using Python: `\"https://en.wikipedia.org/wiki/Cups_and_balls\".replace(\"https://en.wikipedia.org/wiki/\", \"dbr:\")`\n", "\n", "Variable names are prefixed with a '?' in SPARQL, so here we are asking for all the ?predicate and ?object that are in triples where the subject is `dbr:Cups_and_balls`. Not too disimilar from SQL. Just *watch out for the full-stops* on the end of each line in the \"WHERE\" clause, they cause many a debugging headache!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "q = \"\"\"SELECT ?predicate ?object\n", "WHERE {\n", " dbr:Cups_and_balls ?predicate ?object.\n", "}\"\"\"\n", "\n", "for result in query(q):\n", " print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you read the results you'll see what looks like a big mess to the untrained eye. That's because the predicates and some of the objects are full URIs. We could use our `namespace` dictionary to change these to prefix format, but that will be left as an exercise for you. In Semantic Web we use URIs so we can share terms with common definitions; if I do a blog post and include the triple to say it has `dcterms:subject dbr:Cups_and_balls` then there can be no ambiguity as to whether it is about https://en.wikipedia.org/wiki/Cups_and_balls or https://en.wikipedia.org/wiki/Cup-and-ball.\n", "\n", "I promised a solution to DBpedia's use of parenthesis, sadly we can't use parenthesis with prefixes, so we need to type the whole URI. To tell SPARQL it's a URI we put it in angle brackets <>. In this query I'm going to use ``.\n", "\n", "Let's do a more useful SELECT query and get some objects that we want. In this query, we'll make use of DISTINCT and FILTER; many of the familiar SQL keywords are also in SPARQL, including SUM, COUNT, GROUP BY, ORDER BY and more. There's also an additional OPTIONAL keyword that will return a result if there is one, and won't if there's not. However, our simple utility functions would need improving to not throw a `KeyError` when no result is returned.\n", "\n", "Popular Wikipedia articles are in multiple languages, so here we make sure to FILTER to only get the English label and abstract." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "q = \"\"\"SELECT DISTINCT ?label ?abstract\n", "WHERE {\n", " rdfs:label ?label.\n", " OPTIONAL { dbo:abstract ?abstract. }\n", " FILTER (lang(?label) = \"en\")\n", " FILTER (lang(?abstract) = \"en\")\n", "}\"\"\"\n", "\n", "from textwrap import wrap\n", "\n", "for result in query(q): \n", " print(result['label'], \n", " \"\\n----------------\\n\", \n", " \"\\n\".join(wrap(result['abstract'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, you're not limited to searching for predicates on objects. In this final query I get the name and abstract for all the people on DBpedia who share the same birthday as one of my magic heros, Patrick Page." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "q = \"\"\"SELECT DISTINCT ?person ?name ?about\n", "WHERE {\n", " dbo:birthDate ?birthday.\n", " ?person dbo:birthDate ?birthday.\n", " ?person foaf:name ?name.\n", " ?person dbo:abstract ?about.\n", " FILTER (lang(?about) = \"en\")\n", "} ORDER BY ?name\"\"\"\n", "\n", "for result in query(q):\n", " print(\"{} ( <{}> )\".format(result['name'], result['person']))\n", " print(result['about'], end=\"\\n\\n\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "We've only looked at one SPARQL endpoint and only covered the most basic of queries, but hopefully this has been enough to spark your imagination. Imagine if you didn't have to write all the nitty-gritty content of your website yourself, but could just pull in data from Wikipedia. The BBC did it during the London Olympics so every sport and athlete could have their own page, they've also done it for animals. I'll be doing it to describe my tags, just as soon as I get a round tuit." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python3.6" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }