
State of the LOD Cloud

LATC Project
PlanetData Project
Anja Jentzsch (Hasso-Plattner-Institut)
Richard Cyganiak (DERI, NUI Galway)
Chris Bizer (Universität Mannheim)

Version 0.3, 09/19/2011

This document provides statistics about the structure and content of the LOD cloud. It also analyzes the extent to which LOD data sources implement nine best practices that are either recommended by W3C or have emerged within the LOD community.

All statistics within this document are based on the LOD data set catalog that is maintained on CKAN. This document contains a preliminary release of the statistics. If you spot any errors in the data describing the LOD data sets, it would be great if you would correct them directly on CKAN. For information on how to describe datasets on CKAN please refer to the Guidelines for Collecting Metadata on Linked Datasets in CKAN.

Table of Contents

  1. Structure of LOD Cloud
    1. LOD cloud diagram
    2. Linked Data by domain
    3. Dataset published by data producer or by third-party
  2. Compliance with Best Practices for Data Provisioning
    1. Provide dereferencable URIs
    2. Set RDF links pointing at other data sources
    3. Use terms from widely deployed vocabularies
    4. Make proprietary vocabulary terms dereferencable
    5. Map proprietary vocabulary terms to other vocabularies
    6. Provide provenance metadata
    7. Provide licensing metadata
    8. Provide data-set-level metadata
    9. Refer to additional access methods

1. Structure of LOD Cloud

This section gives an overview of the structure of the LOD cloud.

1.1 LOD cloud diagram

The following diagram visualizes the data sets in the LOD cloud as well as their interlinkage relationships. Each node in this cloud diagram represents a distinct data set published as Linked Data. The arcs indicate that RDF links exist between items in the two connected data sets. Heavier arcs roughly correspond to a greater number of links between two data sets, while bidirectional arcs indicate that each of the two data sets contains outward links to the other.

Linking Open Data cloud diagram, large version

Detailed information about all data sets that are currently part of the LOD cloud can be found in the LOD Data Set Catalog on CKAN.

1.2 Linked Data by Domain

Linked Data technologies are being used to share data covering a wide range of topical domains. The table below gives an overview of the number of triples as well as the number of RDF links per domain. The number of RDF links refers to outgoing links that are set from data sources within a domain to other data sources.

| Domain | Number of datasets | Triples | Share of triples | (Out-)Links | Share of links |
| --- | --- | --- | --- | --- | --- |
| Media | 25 | 1,841,852,061 | 5.82 % | 50,440,705 | 10.01 % |
| Geographic | 31 | 6,145,532,484 | 19.43 % | 35,812,328 | 7.11 % |
| Government | 49 | 13,315,009,400 | 42.09 % | 19,343,519 | 3.84 % |
| Publications | 87 | 2,950,720,693 | 9.33 % | 139,925,218 | 27.76 % |
| Cross-domain | 41 | 4,184,635,715 | 13.23 % | 63,183,065 | 12.54 % |
| Life sciences | 41 | 3,036,336,004 | 9.60 % | 191,844,090 | 38.06 % |
| User-generated content | 20 | 134,127,413 | 0.42 % | 3,449,143 | 0.68 % |
| Total | 295 | 31,634,213,770 | | 503,998,829 | |
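
As a minimal sketch of how the triple shares in the table can be reproduced, the snippet below divides each domain's triple count by the total stated in the last row and rounds to two decimals. The names `TRIPLES`, `STATED_TOTAL` and `share` are illustrative and not part of the original report.

```python
# Triple counts per domain, copied from the table above.
TRIPLES = {
    "Media": 1_841_852_061,
    "Geographic": 6_145_532_484,
    "Government": 13_315_009_400,
    "Publications": 2_950_720_693,
    "Cross-domain": 4_184_635_715,
    "Life sciences": 3_036_336_004,
    "User-generated content": 134_127_413,
}

# Total number of triples as stated in the last row of the table.
STATED_TOTAL = 31_634_213_770


def share(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


# Percentage share of triples per domain, relative to the stated total.
shares = {domain: share(count, STATED_TOTAL) for domain, count in TRIPLES.items()}

for domain, pct in shares.items():
    print(f"{domain}: {pct:.2f} %")
```

The same function can be applied to the (Out-)Links column by swapping in the link counts and the stated link total.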

The diagram below shows the distribution of triples by domain.

The diagram below shows the distribution of links by domain.

1.3 Dataset published by data producer or by third-party

Initially, the Linked Data best practices were adopted mainly by research projects and Web enthusiasts. These third parties took existing data sets, converted them into RDF and served them on the Web. Alternatively, they implemented Linked Data wrappers around existing Web APIs. Today, Linked Data technologies are increasingly adopted by the primary data producers themselves, who use them to provide access to their data sets.

As of August 2011, out of the 295 datasets in the LOD cloud, 113 (38.57 %) are published by the data producers themselves, while 180 (61.43 %) are published by third parties.

2. Compliance with Best Practices for Data Provisioning

The promise of the Web of Linked Data is to enable applications to discover and integrate data from a global Web of interconnected data sources. In order to make it as easy as possible for applications to access and process Linked Data, data providers should publish data according to a set of best practices. On the one hand, these best practices recommend making data accessible using the Web's standard access mechanism (HTTP) and representing data using standardized Web formats (i.e. RDF/XML, RDFa, XML with GRDDL). On the other hand, the best practices aim at making data as self-descriptive as possible. The best practices are either recommended directly by W3C or have emerged within the LOD community.

This section analyses the extent to which data sources in the LOD cloud implement these best practices.

2.1 Provide dereferencable URIs

The basic idea of Linked Data is to make data accessible using the Web's standard retrieval mechanism. This means that every entity of interest, for instance a person, place, company or abstract concept, should be identified with its own HTTP URI (Linked Data Principle 1). In addition, these URIs should be made dereferencable into an RDF description of the entity (Linked Data Principles 2 and 3). The W3C Interest Group Note Cool URIs for the Semantic Web describes the different technical options for realizing such lookup operations (hash vs. slash URIs).

100% of the data sources in the LOD cloud fulfill this best practice as it is a pre-condition for being included into the LOD cloud.
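
The lookup operation described above can be sketched from the client side as follows. This is a minimal illustration that only builds the HTTP request and does not send it; the function name `build_lookup_request`, the choice of `application/rdf+xml` as the requested format, and the `example.org` URIs are assumptions made for the example. For a hash URI the fragment is stripped, because fragments are never sent to the server; for a slash URI the server is expected to redirect to a document describing the entity.

```python
from urllib.parse import urldefrag
from urllib.request import Request

# Media type requested from the server (an assumption for this sketch).
RDF_ACCEPT = "application/rdf+xml"


def build_lookup_request(entity_uri):
    """Build a GET request asking for an RDF description of entity_uri.

    Returns the request and a flag telling whether entity_uri is a hash URI.
    """
    # Split off the fragment: only the document part is requested over HTTP.
    document_uri, fragment = urldefrag(entity_uri)
    request = Request(document_uri, headers={"Accept": RDF_ACCEPT})
    return request, bool(fragment)


# Hypothetical hash URI and slash URI.
hash_request, hash_flag = build_lookup_request("http://example.org/people#alice")
slash_request, slash_flag = build_lookup_request("http://example.org/people/alice")

print(hash_request.full_url, hash_flag)
print(slash_request.full_url, slash_flag)
```

Passing such a request to `urllib.request.urlopen` would perform the actual lookup; the standard opener follows HTTP 303 redirects, which is the mechanism usually used for slash URIs.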

There are two validators available for checking whether URIs fulfill this best practice:

2.2 Set RDF links pointing at other data sources

The 4th Linked Data principle is to set RDF links pointing into other data sources. These RDF links connect data from different sources into a single global RDF graph and enable Linked Data browsers and crawlers to navigate between data sources.
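
To make the notion of an outgoing RDF link concrete, the following sketch counts triples whose subject belongs to one data source and whose object is a URI in a different data source. It assumes, purely for illustration, that a data source can be identified by the host name of its URIs and that triples are given as plain string tuples; the report does not state how its link counts were computed, and all `example.*` URIs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def host(uri):
    """Return the lower-cased host part of a URI."""
    return urlsplit(uri).netloc.lower()


def count_outgoing_links(triples, source_host):
    """Count links from source_host to other hosts, grouped by target host."""
    counts = Counter()
    for subject, _predicate, obj in triples:
        if host(subject) != source_host:
            continue  # the triple does not describe an entity of this source
        if not obj.startswith(("http://", "https://")):
            continue  # the object is a literal, not a link
        target = host(obj)
        if target != source_host:
            counts[target] += 1
    return counts


# Hypothetical sample data.
triples = [
    ("http://example.org/city/berlin", OWL_SAME_AS, "http://example.com/place/berlin"),
    ("http://example.org/city/berlin", "http://www.w3.org/2000/01/rdf-schema#label", "Berlin"),
    ("http://example.org/city/paris", OWL_SAME_AS, "http://example.com/place/paris"),
    ("http://example.org/city/paris", "http://www.w3.org/2000/01/rdf-schema#seeAlso", "http://example.net/paris"),
    ("http://example.org/city/paris", "http://example.org/vocab#twinCity", "http://example.org/city/berlin"),
]

links = count_outgoing_links(triples, "example.org")
print("outgoing links:", sum(links.values()))
print("linked data sources:", len(links))
```

The sum over all targets corresponds to the number of outgoing links, and the number of distinct targets corresponds to the number of linked data sources.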

The absolute number of RDF links in the LOD cloud is given in Section 1.2.

The table below categorizes the datasets in the LOD cloud by the absolute number of outgoing RDF links.

| (Out-)Links | Number of datasets |
| --- | --- |
| up to 1,000 | 30 (10.17 %) |
| up to 1,000 | 90 (30.51 %) |
| 1,000 to 10,000 | 58 (19.66 %) |
| 10,000 to 100,000 | 45 (15.25 %) |
| 100,000 to 1,000,000 | 43 (14.58 %) |
| more than 1,000,000 | 29 (9.83 %) |
| | 237 |

The next table categorizes the LOD data sets by the number of other data sources that are targets of their outgoing RDF links.

| Number of linked datasets | Number of datasets |
| --- | --- |
| more than 10 | 27 (9.15 %) |
| 6 to 10 | 17 (5.76 %) |
| 5 | 5 (1.69 %) |
| 4 | 19 (6.44 %) |
| 3 | 38 (12.88 %) |
| 2 | 62 (21.02 %) |
| 1 | 98 (33.22 %) |
| Total | 266 |

The table below lists the 10 LOD data sets that link to the highest number of other LOD data sets.

| Number of linked datasets | Dataset |
| --- | --- |
| 35 | rkb-explorer-dblp |
| 31 | rkb-explorer-southampton |
| 31 | rkb-explorer-eprints |
| 31 | rkb-explorer-acm |
| 29 | dbpedia |
| 29 | rkb-explorer-wiki |
| 27 | rkb-explorer-rae2001 |
| 27 | rkb-explorer-citeseer |
| 25 | rkb-explorer-newcastle |
| 25 | rkb-explorer-kisti |

2.3 Use terms from widely deployed vocabularies

In order to make it easier for applications to understand Linked Data, data providers should use terms from widely deployed vocabularies to represent data wherever possible.

Nearly all data sources in the LOD cloud use terms from the W3C base vocabularies RDF, RDF Schema, and OWL. In addition, 191 (64.75 %) of the 295 data sources in the LOD cloud use terms from other non-proprietary vocabularies. When calculating this number, we consider a vocabulary to be proprietary if it is defined in the same top-level domain that is also used to serve the instance data.
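
The proprietary-vocabulary criterion can be approximated in a few lines. The sketch below is an interpretation, not the procedure used for this report: it reads "same top-level domain" as "same registered domain" (for example `dbpedia.org`) and naively takes the last two labels of the host name, which gives wrong results for domains such as `co.uk`. The function names are illustrative.

```python
from urllib.parse import urlsplit


def registered_domain(uri):
    """Naive approximation: the last two labels of the URI's host name."""
    labels = urlsplit(uri).hostname.lower().split(".")
    return ".".join(labels[-2:])


def is_proprietary(vocabulary_uri, instance_uri):
    """True if the vocabulary and the instance data share a domain."""
    return registered_domain(vocabulary_uri) == registered_domain(instance_uri)


# Vocabulary namespaces taken from the table of widely used vocabularies.
print(is_proprietary("http://dbpedia.org/ontology/", "http://dbpedia.org/resource/Berlin"))
print(is_proprietary("http://xmlns.com/foaf/0.1/", "http://dbpedia.org/resource/Berlin"))
```

A production check would more likely rely on a public suffix list instead of counting labels.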

The table below lists the most widely used vocabularies and provides links to the data sources that use a specific vocabulary.

| Vocabulary prefix | Vocabulary link | Number of usages in data sets | Data sets that use the vocabulary |
| --- | --- | --- | --- |
| dc | http://purl.org/dc/elements/1.1/ | 92 (31.19 %) | Data sets that use dc |
| foaf | http://xmlns.com/foaf/0.1/ | 81 (27.46 %) | Data sets that use foaf |
| skos | http://www.w3.org/2004/02/skos/core# | 58 (19.66 %) | Data sets that use skos |
| geo | http://www.w3.org/2003/01/geo/wgs84_pos# | 25 (8.47 %) | Data sets that use geo |
| xhtml | http://www.w3.org/1999/xhtml/vocab# | 19 (6.44 %) | Data sets that use xhtml |
| akt | http://www.aktors.org/ontology/portal# | 17 (5.76 %) | Data sets that use akt |
| bibo | http://purl.org/ontology/bibo/ | 14 (4.75 %) | Data sets that use bibo |
| mo | http://purl.org/ontology/mo/ | 13 (4.41 %) | Data sets that use mo |
| vcard | http://www.w3.org/2006/vcard/ns# | 10 (3.39 %) | Data sets that use vcard |
| sioc | http://rdfs.org/sioc/ns# | 10 (3.39 %) | Data sets that use sioc |
| cc | http://creativecommons.org/ns# | 8 (2.71 %) | Data sets that use cc |
| geonames | http://www.geonames.org/ontology# | 6 (2.03 %) | Data sets that use geonames |
| frbr | http://purl.org/vocab/frbr/core# | 6 (2.03 %) | Data sets that use frbr |
| xsd | http://www.w3.org/2001/XMLSchema# | 6 (2.03 %) | Data sets that use xsd |
| time | http://www.w3.org/2006/time# | 5 (1.69 %) | Data sets that use time |
| event | http://purl.org/NET/c4dm/event.owl# | 5 (1.69 %) | Data sets that use event |
| dbpedia | http://dbpedia.org/resource/ | 5 (1.69 %) | Data sets that use dbpedia |
| gr | http://purl.org/goodrelations/v1# | 4 (1.36 %) | Data sets that use gr |
| dbo | http://dbpedia.org/ontology/ | 4 (1.36 %) | Data sets that use dbo |
| ore | http://www.openarchives.org/ore/terms/ | 4 (1.36 %) | Data sets that use ore |
| bio | http://purl.org/vocab/bio/0.1/ | 4 (1.36 %) | Data sets that use bio |
| dbp | http://dbpedia.org/property/ | 4 (1.36 %) | Data sets that use dbp |
| tag | http://www.holygoat.co.uk/owl/redwood/0.1/tags/ | 3 (1.02 %) | Data sets that use tag |
| void | http://rdfs.org/ns/void# | 3 (1.02 %) | Data sets that use void |
| scovo | http://purl.org/NET/scovo# | 3 (1.02 %) | Data sets that use scovo |
| http | http://www.w3.org/2006/http# | 3 (1.02 %) | Data sets that use http |
| uniprot | http://purl.uniprot.org/core/ | 3 (1.02 %) | Data sets that use uniprot |
| umbel | http://umbel.org/umbel# | 3 (1.02 %) | Data sets that use umbel |
| rev | http://purl.org/stuff/rev# | 3 (1.02 %) | Data sets that use rev |
| vu-wordnet | | 2 (0.68 %) | Data sets that use vu-wordnet |
| xml | | 2 (0.68 %) | Data sets that use xml |
| qb | http://purl.org/linked-data/cube# | 2 (0.68 %) | Data sets that use qb |
| geospecies | http://rdf.geospecies.org/ont/geospecies# | 2 (0.68 %) | Data sets that use geospecies |
| sdmx | http://purl.org/linked-data/sdmx# | 2 (0.68 %) | Data sets that use sdmx |
| sawsdl | http://www.w3.org/ns/sawsdl# | 2 (0.68 %) | Data sets that use sawsdl |
| org | http://www.w3.org/ns/org# | 2 (0.68 %) | Data sets that use org |
| api | | 2 (0.68 %) | Data sets that use api |
| vann | http://purl.org/vocab/vann/ | 2 (0.68 %) | Data sets that use vann |
| admingeo | http://data.ordnancesurvey.co.uk/ontology/admingeo/ | 2 (0.68 %) | Data sets that use admingeo |
| wdrs | http://www.w3.org/2007/05/powder-s# | 2 (0.68 %) | Data sets that use wdrs |
| doap | http://usefulinc.com/ns/doap# | 2 (0.68 %) | Data sets that use doap |
| metalex | | 2 (0.68 %) | Data sets that use metalex |
| txn | http://lod.taxonconcept.org/ontology/txn.owl# | 2 (0.68 %) | Data sets that use txn |
| wot | http://xmlns.com/wot/0.1/ | 2 (0.68 %) | Data sets that use wot |
| compass | http://purl.org/net/compass# | 2 (0.68 %) | Data sets that use compass |
| rdfg | http://www.w3.org/2004/03/trix/rdfg-1/ | 2 (0.68 %) | Data sets that use rdfg |
| mpeg7 | | 2 (0.68 %) | Data sets that use mpeg7 |
| tl | http://purl.org/NET/c4dm/timeline.owl# | 2 (0.68 %) | Data sets that use tl |
| dcam | http://purl.org/dc/dcam/ | 2 (0.68 %) | Data sets that use dcam |
| swrc | http://swrc.ontoware.org/ontology# | 2 (0.68 %) | Data sets that use swrc |
| bib | http://zeitkunst.org/bibtex/0.1/bibtex.owl# | 2 (0.68 %) | Data sets that use bib |
| po | http://purl.org/ontology/po/ | 2 (0.68 %) | Data sets that use po |

The diagram below shows the distribution of the most widely used vocabularies.

2.4 Make proprietary vocabulary terms dereferencable

Widely deployed vocabularies often do not provide all terms that are needed to publish the complete content of a data set on the Web. Thus, data providers often define proprietary terms that are used in addition to terms from widely deployed vocabularies.

Currently:

In order to enable applications to automatically retrieve the definition of vocabulary terms from the Web, URIs identifying vocabulary terms should be made dereferencable. Guidelines for doing this are given in the W3C Note Best Practice Recipes for Publishing RDF Vocabularies.

Currently:

Split by topical domain, the numbers look as follows:

| Domain | Proprietary vocabulary terms | Dereferencable proprietary term URIs | Not dereferencable proprietary term URIs | No proprietary vocabulary terms |
| --- | --- | --- | --- | --- |
| Media | 15/25 (60.00 %) | 12/25 (48.00 %) | 3/25 (12.00 %) | 10/25 (40.00 %) |
| Geographic | 25/31 (80.65 %) | 16/31 (51.61 %) | 9/31 (29.03 %) | 6/31 (19.35 %) |
| Government | 39/49 (79.59 %) | 33/49 (67.35 %) | 6/49 (12.24 %) | 10/49 (20.41 %) |
| Publications | 57/87 (65.52 %) | 50/87 (57.47 %) | 7/87 (8.05 %) | 30/87 (34.48 %) |
| Cross-domain | 31/41 (75.61 %) | 28/41 (68.29 %) | 3/41 (7.32 %) | 9/41 (21.95 %) |
| Life sciences | 13/41 (31.71 %) | 11/41 (26.83 %) | 2/41 (4.88 %) | 28/41 (68.29 %) |
| User-generated content | 9/20 (45.00 %) | 8/20 (40.00 %) | 1/20 (5.00 %) | 9/20 (45.00 %) |

2.5 Map proprietary vocabulary terms to other vocabularies

Proprietary vocabulary terms should be related to corresponding terms within other (widely used) vocabularies in order to enable applications to understand as much data as possible and to translate data into their target schemata (see RDF and the Self-Describing Semantic Web). The W3C recommendations define the following terms for representing such correspondences (mappings): owl:equivalentClass and owl:equivalentProperty or, if a looser mapping is desired, rdfs:subClassOf, rdfs:subPropertyOf, skos:broadMatch, and skos:narrowMatch.
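
As an illustration of what such a mapping looks like as data, the sketch below serializes a single mapping statement as an N-Triples line. The helper `mapping_triple` and the proprietary term `http://example.org/vocab#Person` are hypothetical; the property URIs are the standard expansions of the prefixed names listed above.

```python
OWL = "http://www.w3.org/2002/07/owl#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Full URIs of the mapping properties named in the text.
MAPPING_PROPERTIES = {
    "owl:equivalentClass": OWL + "equivalentClass",
    "owl:equivalentProperty": OWL + "equivalentProperty",
    "rdfs:subClassOf": RDFS + "subClassOf",
    "rdfs:subPropertyOf": RDFS + "subPropertyOf",
    "skos:broadMatch": SKOS + "broadMatch",
    "skos:narrowMatch": SKOS + "narrowMatch",
}


def mapping_triple(proprietary_term, relation, target_term):
    """Serialize one vocabulary mapping as an N-Triples line."""
    predicate = MAPPING_PROPERTIES[relation]
    return f"<{proprietary_term}> <{predicate}> <{target_term}> ."


# Hypothetical proprietary class mapped loosely to foaf:Person.
line = mapping_triple(
    "http://example.org/vocab#Person",
    "rdfs:subClassOf",
    "http://xmlns.com/foaf/0.1/Person",
)
print(line)
```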

Currently 15 (7.89 %) out of the 190 data sources that use proprietary terms provide mappings to other vocabularies for their terms.

2.6 Provide provenance metadata

In order to enable applications to be sure about the origin of data as well as to enable them to assess the quality of data, data sources should publish provenance metadata together with the primary data. A common means for providing provenance information is to represent it as document-level metadata as described in the How to publish Linked Data tutorial. A widely deployed vocabulary for representing provenance information is Dublin Core (dc:creator, dc:publisher, dc:date). Alternative vocabularies that allow more details about the data creation process to be represented include the Open Provenance Model as well as the vocabularies examined by the W3C Provenance Incubator Group.
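
A minimal sketch of document-level provenance metadata, assuming the three Dublin Core properties named above are attached to the RDF document that describes an entity. The helper names, the document URI and the literal values are hypothetical and only serve the example; the output is a list of N-Triples lines.

```python
DC = "http://purl.org/dc/elements/1.1/"


def literal(value):
    """Format a plain string as an N-Triples literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def provenance_triples(document_uri, creator, publisher, date):
    """Return dc:creator, dc:publisher and dc:date statements about a document."""
    fields = (("creator", creator), ("publisher", publisher), ("date", date))
    return [
        f"<{document_uri}> <{DC}{name}> {literal(value)} ."
        for name, value in fields
    ]


# Hypothetical document and values.
lines = provenance_triples(
    "http://example.org/data/alice.rdf",
    "Example Research Group",
    "Example University",
    "2011-09-19",
)
print("\n".join(lines))
```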

Currently:

Split by topical domain, the figures look as follows:

| Domain | Provenance information |
| --- | --- |
| Media | 4/25 (16.00 %) |
| Geographic | 15/31 (48.39 %) |
| Government | 10/49 (20.41 %) |
| Publications | 41/87 (47.13 %) |
| Cross-domain | 10/41 (24.39 %) |
| Life sciences | 2/41 (4.88 %) |
| User-generated content | 6/20 (30.00 %) |

2.7 Provide licensing metadata

Web data should be self-descriptive concerning any restrictions that apply to its usage. A common way to express such restrictions is to attach a data license to published data. Doing so is essential to enable applications to use Web data on a secure legal basis. A common means to attach licenses to Linked Data is to use dc:rights links pointing at the license as document-level metadata. An example of this is given in the How to publish Linked Data tutorial.
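
From the consumer's perspective, checking for licensing metadata amounts to looking for license links on the document. The sketch below assumes triples are available as plain string tuples and looks for dc:rights, as described above; treating cc:license from the Creative Commons vocabulary as a second license property is an assumption added for illustration. The document URI is hypothetical.

```python
DC_RIGHTS = "http://purl.org/dc/elements/1.1/rights"
# Assumption for this sketch: also accept cc:license as a license link.
CC_LICENSE = "http://creativecommons.org/ns#license"
LICENSE_PROPERTIES = {DC_RIGHTS, CC_LICENSE}


def find_licenses(triples, document_uri):
    """Return the license URIs attached to document_uri."""
    return [
        obj
        for subject, predicate, obj in triples
        if subject == document_uri and predicate in LICENSE_PROPERTIES
    ]


# Hypothetical document-level metadata.
triples = [
    ("http://example.org/data/alice.rdf", DC_RIGHTS, "http://creativecommons.org/licenses/by/3.0/"),
    ("http://example.org/data/alice.rdf", "http://purl.org/dc/elements/1.1/creator", "Example Research Group"),
    ("http://example.org/data/bob.rdf", "http://purl.org/dc/elements/1.1/creator", "Example Research Group"),
]

print(find_licenses(triples, "http://example.org/data/alice.rdf"))
print(find_licenses(triples, "http://example.org/data/bob.rdf"))
```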

Currently:

Split by topical domain, the figures look as follows:

| Domain | Licensing information |
| --- | --- |
| Media | 4/25 (16.00 %) |
| Geographic | 9/31 (29.03 %) |
| Government | 7/49 (14.29 %) |
| Publications | 9/87 (10.34 %) |
| Cross-domain | 8/41 (19.51 %) |
| Life sciences | 1/41 (2.44 %) |
| User-generated content | 4/20 (20.00 %) |

2.8 Provide dataset-level Metadata

In addition to making instance data self-descriptive, it is also desirable that data publishers provide metadata describing characteristics of complete data sets, for instance the topic of a dataset and more detailed statistics. A vocabulary for representing such metadata is the voiD vocabulary. A second means for representing dataset-level metadata is Semantic Sitemaps.

Currently,

Split by topical domain, the figures look as follows:

| Domain | voiD | Semantic Sitemap | voiD or Semantic Sitemap |
| --- | --- | --- | --- |
| Media | 5/25 (20.00 %) | 0/25 (0.00 %) | 5/25 (20.00 %) |
| Geographic | 12/31 (38.71 %) | 3/31 (9.68 %) | 13/31 (41.94 %) |
| Government | 21/49 (42.86 %) | 3/49 (6.12 %) | 23/49 (46.94 %) |
| Publications | 39/87 (44.83 %) | 34/87 (39.08 %) | 43/87 (49.43 %) |
| Cross-domain | 9/41 (21.95 %) | 6/41 (14.63 %) | 11/41 (26.83 %) |
| Life sciences | 3/41 (7.32 %) | 6/41 (14.63 %) | 7/41 (17.07 %) |
| User-generated content | 5/20 (25.00 %) | 1/20 (5.00 %) | 6/20 (30.00 %) |

2.9 Refer to additional access methods

The primary way to publish Linked Data on the Web is to make the URIs that identify data items dereferencable into RDF descriptions. In addition, various LOD data providers have chosen to provide two alternative means of access to their data:

  1. SPARQL endpoints, which allow expressive queries to be asked against the datasets.
  2. RDF dumps of the complete dataset, which are provided for download from a separate URL.

Altogether 201 (68.14 %) out of the 295 data sources provide a SPARQL endpoint. 117 (39.66 %) out of the 295 data sources provide RDF dumps.
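
To illustrate how a client would use a SPARQL endpoint, the sketch below builds the URL for a query sent via HTTP GET, where the SPARQL protocol carries the query text in the `query` parameter. The endpoint URL `http://example.org/sparql` is hypothetical, the function name is illustrative, and the sketch assumes the endpoint URL has no query string of its own. No request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sparql_get_url(endpoint, query):
    """Return the GET URL that submits query to a SPARQL endpoint."""
    return endpoint + "?" + urlencode({"query": query})


QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"

# Hypothetical endpoint.
url = sparql_get_url("http://example.org/sparql", QUERY)
print(url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```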

The voiD vocabulary provides terms for pointing applications from the description of a single entity to these alternative means of access. For this, voiD recommends using a link of the type dcterms:isPartOf to point from the entity description to a voiD description of the complete dataset. This dataset description in turn may contain void:sparqlEndpoint and void:dataDump links pointing at the SPARQL endpoint and the download URIs of RDF dataset dumps.
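
The two-step navigation described above can be sketched as follows, assuming the relevant triples have already been retrieved and are held as plain string tuples. The function names and all `example.org` URIs are hypothetical; the property URIs are the standard expansions of dcterms:isPartOf, void:sparqlEndpoint and void:dataDump.

```python
DCTERMS_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"
VOID_SPARQL_ENDPOINT = "http://rdfs.org/ns/void#sparqlEndpoint"
VOID_DATA_DUMP = "http://rdfs.org/ns/void#dataDump"


def objects(triples, subject, predicate):
    """Return all objects of triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def access_methods(triples, entity_document_uri):
    """Follow dcterms:isPartOf to the dataset and collect its access links."""
    result = {"sparql_endpoints": [], "data_dumps": []}
    for dataset in objects(triples, entity_document_uri, DCTERMS_IS_PART_OF):
        result["sparql_endpoints"] += objects(triples, dataset, VOID_SPARQL_ENDPOINT)
        result["data_dumps"] += objects(triples, dataset, VOID_DATA_DUMP)
    return result


# Hypothetical entity description and dataset description.
triples = [
    ("http://example.org/data/alice.rdf", DCTERMS_IS_PART_OF, "http://example.org/void#dataset"),
    ("http://example.org/void#dataset", VOID_SPARQL_ENDPOINT, "http://example.org/sparql"),
    ("http://example.org/void#dataset", VOID_DATA_DUMP, "http://example.org/dumps/all.nt.gz"),
]

methods = access_methods(triples, "http://example.org/data/alice.rdf")
print(methods)
```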

Currently,

Revision history