DBpedia Tutorial


Introduction

DBpedia is a crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects. This structured information forms an open knowledge graph (OKG) which is available to everyone on the Web. A knowledge graph is a special kind of database which stores knowledge in a machine-readable form and provides a means for information to be collected, organised, shared, searched and utilised. Google uses a similar approach to create the knowledge cards shown alongside its search results. We hope that this work will make it easier for the huge amount of information in Wikimedia projects to be used in new and interesting ways.

DBpedia data is served as Linked Data, which is revolutionizing the way applications interact with the Web. One can navigate this Web of facts with standard Web browsers or automated crawlers, or pose complex queries with SQL-like query languages (e.g. SPARQL). Have you thought of asking the Web about all cities with low crime rates, warm weather and open jobs? That's the kind of query we are talking about.

In essence, DBpedia extracts and summarizes the facts contained in Wikipedia articles. It breaks that factual information down into N-tuple dumps in which every entity is related to others in some way. As a result, people can look up items drawn from different Wikipedia articles, and can explore topics they are unsure about, because all related information is collected together and connected by internal links.


SPARQL


Okay... So what is SPARQL?

SPARQL is a query language for data stored in the Resource Description Framework (RDF) format. RDF is a labeled and directed graph format designed specifically for representing data on the web. It can be used to encode information about almost anything, and more importantly, it allows for loose integration between differing sources of information. It uses Uniform Resource Identifiers (URIs) to name not only the endpoints of a link in the graph but also the link itself. This grouping of two endpoints and the link between them, usually called an RDF triple, forms the basis for how information is stored in RDF and queried using SPARQL.
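
To make the triple idea concrete, here is a minimal sketch of a query containing a single triple pattern written with full URIs. It assumes that the DBpedia resource Green_Day carries a genre property in the http://dbpedia.org/ontology/ namespace; that property choice is illustrative and not taken from the text above.

```sparql
# One triple pattern: subject, predicate (the link), object.
# The subject and the predicate are both named by URIs; the object is a variable.
SELECT ?genre
WHERE {
  <http://dbpedia.org/resource/Green_Day> <http://dbpedia.org/ontology/genre> ?genre .
}
LIMIT 10
```

Each row of the result corresponds to one stored triple whose subject and predicate match the two URIs.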

The SPARQL schema can be found here. You may also find the Virtuoso online SPARQL editor useful for playing around with SPARQL syntax and the faceted browser useful for exploring the database.

The first step of building a SPARQL query is to define one or more prefixes. This is done using the PREFIX keyword. Prefixes are essentially namespaces; if you have some experience programming, you can think of a PREFIX statement as being more or less equivalent to the import or include statements common to many programming languages. It tells the SPARQL processor where to find the names of the resources in your query. The following prefixes are often useful:
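
The declarations below are a sketch of prefixes that frequently appear in DBpedia queries. The exact selection, and the short names dbo, dbr and dbp, are assumptions and may differ from the list this tutorial originally intended. A small query follows the declarations to show how a prefixed name such as dbr:Green_Day stands in for the full URI.

```sparql
# Standard RDF and RDFS vocabularies
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Friend of a Friend vocabulary
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
# DBpedia namespaces: ontology terms, resources, and raw infobox properties
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX dbp:  <http://dbpedia.org/property/>

# dbr:Green_Day expands to <http://dbpedia.org/resource/Green_Day>
SELECT ?label
WHERE {
  dbr:Green_Day rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
```

Declaring a prefix that a query never uses is harmless, so a shared header like this can be pasted on top of most queries.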

Now let's start with something really simple:

SELECT ?x WHERE { ?x ?v 42 .} LIMIT 100

A ? in SPARQL denotes a _variable_, i.e. something that we intend to fill with data. In this case, we SELECT x as the variable we want populated in our results, and we ask the processor to match all results satisfying the triple ?x ?v 42, which could be pretty much anything involving 42 (and we know anything involving 42 must be pretty important!).
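
As a small variation on the query above, the sketch below also selects ?v, so each result row shows which property links ?x to 42. The lower LIMIT is an arbitrary choice to keep the output short.

```sparql
# Both variables are returned: ?x is the subject, ?v is the property pointing at 42.
SELECT ?x ?v
WHERE {
  ?x ?v 42 .
}
LIMIT 10
```

Any variable that appears in the WHERE pattern can be listed after SELECT; variables left out are still matched but not reported.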

Let's modify our query a bit to get more focused results:
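
One query of that shape, assuming the dbo: and dbr: prefixes for the DBpedia ontology and resource namespaces, could read as follows. The choice of dbo:genre as the genre property is an assumption; the same idea applies to any property with that meaning.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# The predicate and the object are now fixed; only the subject ?x is left open.
SELECT ?x
WHERE {
  ?x dbo:genre dbr:Rock_music .
}
LIMIT 100
```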

This will return all results x that have Rock_music as the value of a property labeled genre.

As a full example, this is a query that finds all songs by Green Day paired with the albums they are found on.

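PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX albums: <http://dbpedia.org/resource/Category:Green_Day_albums>

SELECT ?album ?title
WHERE {
  ?album ?property albums: .               # the album is linked to the Green Day albums category
  ?album dbpedia2:title ?title .           # each title entry listed on the album
  ?title dbo:musicalArtist :Green_Day .    # keep only entries whose artist is Green Day
}
ORDER BY ?album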

Long story short, a SPARQL query consists of five parts, in order:

Structurally, it looks like:
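
The sketch below labels the five parts as they are commonly described: prefix declarations, a result clause, an optional dataset definition, a query pattern, and query modifiers. This breakdown follows the order required by the SPARQL grammar and may be worded differently from the tutorial's original list. The graph URI in the FROM clause and the dbo:genre property are assumptions made for illustration.

```sparql
# 1. Prefix declarations: abbreviate long URIs
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# 2. Result clause: which variables to return
SELECT ?band

# 3. Dataset definition (optional): which graph to query
FROM <http://dbpedia.org>

# 4. Query pattern: the triple patterns that must match
WHERE {
  ?band dbo:genre dbr:Rock_music .
}

# 5. Query modifiers: ordering and slicing of the results
ORDER BY ?band
LIMIT 10
```

Only the result clause and the query pattern are needed for a minimal SELECT query; the other parts can be dropped when they are not required.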