Version 0.3, 09/19/2011
This document provides statistics about the structure and content of the LOD cloud. It also analyzes the extent to which LOD data sources implement nine best practices that are either recommended by W3C or have emerged within the LOD community.
All statistics within this document are based on the LOD data set catalog that is maintained on CKAN. This document contains a preliminary release of the statistics. If you spot any errors in the data describing the LOD data sets, it would be great if you would correct them directly on CKAN. For information on how to describe datasets on CKAN please refer to the Guidelines for Collecting Metadata on Linked Datasets in CKAN.
This section gives an overview of the structure of the LOD cloud.
The following diagram visualizes the data sets in the LOD cloud as well as their interlinkage relationships. Each node in this cloud diagram represents a distinct data set published as Linked Data. The arcs indicate that RDF links exist between items in the two connected data sets. Heavier arcs roughly correspond to a greater number of links between two data sets, while bidirectional arcs indicate that each data set contains outward links to the other.
Detailed information about all data sets that are currently part of the LOD cloud can be found in the LOD Data Set Catalog on CKAN.
Linked Data technologies are being used to share data covering a wide range of topical domains. The table below gives an overview of the number of triples as well as the number of RDF links per domain. The number of RDF links refers to outgoing links that are set from data sources within a domain to other data sources.
|Domain||Number of datasets||Triples||%||(Out-)Links||%|
|Media||25||1,841,852,061||5.82 %||50,440,705||10.01 %|
|Geographic||31||6,145,532,484||19.43 %||35,812,328||7.11 %|
|Government||49||13,315,009,400||42.09 %||19,343,519||3.84 %|
|Publications||87||2,950,720,693||9.33 %||139,925,218||27.76 %|
|Cross-domain||41||4,184,635,715||13.23 %||63,183,065||12.54 %|
|Life sciences||41||3,036,336,004||9.60 %||191,844,090||38.06 %|
|User-generated content||20||134,127,413||0.42 %||3,449,143||0.68 %|
The diagram below shows the distribution of triples by domain.
The diagram below shows the distribution of links by domain.
Initially, the Linked Data best practices were adopted mainly by research projects and Web enthusiasts. These third parties took existing data sets, converted them into RDF and served them on the Web. Alternatively, they implemented Linked Data wrappers around existing Web APIs. Today, Linked Data technologies are increasingly adopted by the primary data producers themselves, who use them to provide access to their data sets.
As of August 2011, out of the 295 datasets in the LOD cloud, 113 (38.57 %) are published by the data producers themselves, while 180 (61.43 %) are published by third parties. The percentages are relative to the 293 datasets covered by these two counts.
The promise of the Web of Linked Data is to enable applications to discover and integrate data from a global Web of interconnected data sources. In order to make it as easy as possible for applications to access and process Linked Data, data providers should publish data according to a set of best practices. On the one hand, these best practices recommend making data accessible using the Web's standard access mechanism (HTTP) and representing data using standardized Web formats (i.e. RDF/XML, RDFa, XML with GRDDL). On the other hand, the best practices aim at making data as self-descriptive as possible. The best practices are either recommended directly by W3C or have emerged within the LOD community.
This section analyzes the extent to which data sources in the LOD cloud implement these best practices.
The basic idea of Linked Data is to make data accessible using the Web's standard retrieval algorithm. This means that every entity of interest, for instance a person, place, company or abstract concept, should be identified with its own HTTP URI (Linked Data Principle 1). In addition, these URIs should be made dereferenceable into an RDF description of the entity (Linked Data Principles 2 and 3). The W3C Interest Group Note Cool URIs for the Semantic Web describes the different technical options for realizing such lookup operations (hash vs. slash URIs).
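The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `example.org` URIs, the function name `build_lookup_request` and the weights in the `Accept` header are made up for this sketch, and the request is built but not sent.

```python
from urllib.parse import urldefrag
from urllib.request import Request

# Media types to request; the q-values are an illustrative assumption.
RDF_ACCEPT = "application/rdf+xml, text/turtle;q=0.9, */*;q=0.1"


def build_lookup_request(entity_uri: str) -> Request:
    """Build (but do not send) the GET request for dereferencing a URI."""
    # The fragment of a hash URI is never sent to the server, so the
    # client requests the document part; a slash URI is left unchanged.
    document_uri, _fragment = urldefrag(entity_uri)
    return Request(document_uri, headers={"Accept": RDF_ACCEPT})


# Hypothetical entity URIs, one per URI style.
hash_request = build_lookup_request("http://example.org/people#alice")
slash_request = build_lookup_request("http://example.org/id/alice")

for request in (hash_request, slash_request):
    print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Passing such a request to `urllib.request.urlopen` would perform the actual lookup. For slash URIs, servers typically answer with a 303 redirect to the describing document, which `urlopen` follows by default.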
100% of the data sources in the LOD cloud fulfill this best practice as it is a pre-condition for being included in the LOD cloud.
There are two validators available for checking whether URIs fulfill this best practice:
The 4th Linked Data principle is to set RDF links pointing into other data sources. These RDF links connect data from different sources into a single global RDF graph and enable Linked Data browsers and crawlers to navigate between data sources.
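As a rough illustration of what counts as an outgoing RDF link, the sketch below tallies triples whose subject belongs to one data set and whose object is a URI served from a different host. Comparing host names is a simplifying assumption made for this example, not the method used to compile the statistics in this document, and the sample triples are invented.

```python
from collections import Counter
from urllib.parse import urlsplit

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def count_outgoing_links(triples, source_host):
    """Count links from entities on source_host to URIs on other hosts."""
    links = Counter()
    for subject, _predicate, obj in triples:
        if not obj.startswith(("http://", "https://")):
            continue  # literal value, not a link
        if urlsplit(subject).hostname != source_host:
            continue  # statement about an entity of another data set
        target_host = urlsplit(obj).hostname
        if target_host != source_host:
            links[target_host] += 1
    return links


# Invented sample data: (subject, predicate, object) tuples of strings.
triples = [
    ("http://example.org/id/berlin", OWL_SAME_AS, "http://dbpedia.org/resource/Berlin"),
    ("http://example.org/id/berlin", OWL_SAME_AS, "http://sws.geonames.org/2950159/"),
    ("http://example.org/id/berlin", RDFS_LABEL, "Berlin"),
    ("http://example.org/id/berlin", "http://example.org/vocab/partOf", "http://example.org/id/germany"),
]

links = count_outgoing_links(triples, "example.org")
print(dict(links))
```

The keys of the returned counter correspond to the linked data sets, and the sum of its values to the absolute number of outgoing links.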
The absolute number of RDF links in the LOD cloud is given in Section 1.2.
The table below categorizes the datasets in the LOD cloud by the absolute number of outgoing RDF links.
|(Out-)Links||Number of datasets|
|up to 1,000||30 (10.17 %)|
|up to 1,000||90 (30.51 %)|
|1,000 to 10,000||58 (19.66 %)|
|10,000 to 100,000||45 (15.25 %)|
|100,000 to 1,000,000||43 (14.58 %)|
|more than 1,000,000||29 (9.83 %)|
The next table categorizes the LOD data sets by the number of other data sources that are targets of outgoing RDF links.
|Number of linked datasets||Number of datasets|
|more than 10||27 (9.15 %)|
|6 to 10||17 (5.76 %)|
|5||5 (1.69 %)|
|4||19 (6.44 %)|
|3||38 (12.88 %)|
|2||62 (21.02 %)|
|1||98 (33.22 %)|
The table below lists the 10 LOD data sets that link to the highest number of other LOD data sets.
|Number of linked datasets||Dataset|
In order to make it easier for applications to understand Linked Data, data providers should use terms from widely deployed vocabularies to represent data wherever possible.
Nearly all data sources in the LOD cloud use terms from the W3C base vocabularies RDF, RDF Schema, and OWL. In addition, 191 (64.75 %) of the 295 data sources in the LOD cloud use terms from other non-proprietary vocabularies. When calculating this number, we consider a vocabulary to be proprietary if it is defined in the same top-level domain that is also used to serve the instance data.
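The proprietary-vocabulary criterion can be approximated with a small check. The sketch below assumes that "same top-level domain" means that the last two labels of the host name match; this interpretation and the helper names are assumptions made for this example. The shortcut fails for hosts under multi-part suffixes such as `co.uk`, where a public-suffix list would be needed.

```python
from urllib.parse import urlsplit


def site_domain(uri: str) -> str:
    """Return the last two labels of the host name, e.g. 'dbpedia.org'."""
    host = urlsplit(uri).hostname or ""
    return ".".join(host.split(".")[-2:])


def is_proprietary(vocabulary_uri: str, instance_uri: str) -> bool:
    """True if vocabulary and instance data are served from the same domain."""
    return site_domain(vocabulary_uri) == site_domain(instance_uri)


instance = "http://dbpedia.org/resource/Berlin"
print(is_proprietary("http://dbpedia.org/ontology/", instance))
print(is_proprietary("http://xmlns.com/foaf/0.1/", instance))
```

Under this reading, the DBpedia ontology is proprietary with respect to DBpedia instance data, while FOAF is not.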
The table below lists the most widely used vocabularies and provides links to the data sources that use a specific vocabulary.
|Vocabulary prefix||Vocabulary link||Number of usages in data sets||Data sets that use the vocabulary|
|dc||http://purl.org/dc/elements/1.1/||92 (31.19 %)||Data sets that use dc|
|foaf||http://xmlns.com/foaf/0.1/||81 (27.46 %)||Data sets that use foaf|
|skos||http://www.w3.org/2004/02/skos/core#||58 (19.66 %)||Data sets that use skos|
|geo||http://www.w3.org/2003/01/geo/wgs84_pos#||25 (8.47 %)||Data sets that use geo|
|xhtml||http://www.w3.org/1999/xhtml/vocab#||19 (6.44 %)||Data sets that use xhtml|
|akt||http://www.aktors.org/ontology/portal#||17 (5.76 %)||Data sets that use akt|
|bibo||http://purl.org/ontology/bibo/||14 (4.75 %)||Data sets that use bibo|
|mo||http://purl.org/ontology/mo/||13 (4.41 %)||Data sets that use mo|
|vcard||http://www.w3.org/2006/vcard/ns#||10 (3.39 %)||Data sets that use vcard|
|sioc||http://rdfs.org/sioc/ns#||10 (3.39 %)||Data sets that use sioc|
|cc||http://creativecommons.org/ns#||8 (2.71 %)||Data sets that use cc|
|geonames||http://www.geonames.org/ontology#||6 (2.03 %)||Data sets that use geonames|
|frbr||http://purl.org/vocab/frbr/core#||6 (2.03 %)||Data sets that use frbr|
|xsd||http://www.w3.org/2001/XMLSchema#||6 (2.03 %)||Data sets that use xsd|
|time||http://www.w3.org/2006/time#||5 (1.69 %)||Data sets that use time|
|event||http://purl.org/NET/c4dm/event.owl#||5 (1.69 %)||Data sets that use event|
|dbpedia||http://dbpedia.org/resource/||5 (1.69 %)||Data sets that use dbpedia|
|gr||http://purl.org/goodrelations/v1#||4 (1.36 %)||Data sets that use gr|
|dbo||http://dbpedia.org/ontology/||4 (1.36 %)||Data sets that use dbo|
|ore||http://www.openarchives.org/ore/terms/||4 (1.36 %)||Data sets that use ore|
|bio||http://purl.org/vocab/bio/0.1/||4 (1.36 %)||Data sets that use bio|
|dbp||http://dbpedia.org/property/||4 (1.36 %)||Data sets that use dbp|
|tag||http://www.holygoat.co.uk/owl/redwood/0.1/tags/||3 (1.02 %)||Data sets that use tag|
|void||http://rdfs.org/ns/void#||3 (1.02 %)||Data sets that use void|
|scovo||http://purl.org/NET/scovo#||3 (1.02 %)||Data sets that use scovo|
|http||http://www.w3.org/2006/http#||3 (1.02 %)||Data sets that use http|
|uniprot||http://purl.uniprot.org/core/||3 (1.02 %)||Data sets that use uniprot|
|umbel||http://umbel.org/umbel#||3 (1.02 %)||Data sets that use umbel|
|rev||http://purl.org/stuff/rev#||3 (1.02 %)||Data sets that use rev|
|vu-wordnet||2 (0.68 %)||Data sets that use vu-wordnet|
|xml||2 (0.68 %)||Data sets that use xml|
|qb||http://purl.org/linked-data/cube#||2 (0.68 %)||Data sets that use qb|
|geospecies||http://rdf.geospecies.org/ont/geospecies#||2 (0.68 %)||Data sets that use geospecies|
|sdmx||http://purl.org/linked-data/sdmx#||2 (0.68 %)||Data sets that use sdmx|
|sawsdl||http://www.w3.org/ns/sawsdl#||2 (0.68 %)||Data sets that use sawsdl|
|org||http://www.w3.org/ns/org#||2 (0.68 %)||Data sets that use org|
|api||2 (0.68 %)||Data sets that use api|
|vann||http://purl.org/vocab/vann/||2 (0.68 %)||Data sets that use vann|
|admingeo||http://data.ordnancesurvey.co.uk/ontology/admingeo/||2 (0.68 %)||Data sets that use admingeo|
|wdrs||http://www.w3.org/2007/05/powder-s#||2 (0.68 %)||Data sets that use wdrs|
|doap||http://usefulinc.com/ns/doap#||2 (0.68 %)||Data sets that use doap|
|metalex||2 (0.68 %)||Data sets that use metalex|
|txn||http://lod.taxonconcept.org/ontology/txn.owl#||2 (0.68 %)||Data sets that use txn|
|wot||http://xmlns.com/wot/0.1/||2 (0.68 %)||Data sets that use wot|
|compass||http://purl.org/net/compass#||2 (0.68 %)||Data sets that use compass|
|rdfg||http://www.w3.org/2004/03/trix/rdfg-1/||2 (0.68 %)||Data sets that use rdfg|
|mpeg7||2 (0.68 %)||Data sets that use mpeg7|
|tl||http://purl.org/NET/c4dm/timeline.owl#||2 (0.68 %)||Data sets that use tl|
|dcam||http://purl.org/dc/dcam/||2 (0.68 %)||Data sets that use dcam|
|swrc||http://swrc.ontoware.org/ontology#||2 (0.68 %)||Data sets that use swrc|
|bib||http://zeitkunst.org/bibtex/0.1/bibtex.owl#||2 (0.68 %)||Data sets that use bib|
|po||http://purl.org/ontology/po/||2 (0.68 %)||Data sets that use po|
The diagram below shows the distribution of the most widely used vocabularies.
Widely deployed vocabularies often do not provide all terms that are needed to publish the complete content of a data set on the Web. Thus, data providers often define proprietary terms that are used in addition to terms from widely deployed vocabularies.
In order to enable applications to automatically retrieve the definition of vocabulary terms from the Web, URIs identifying vocabulary terms should be made dereferenceable. Guidelines for doing this are given in the W3C Note Best Practice Recipes for Publishing RDF Vocabularies.
Split by topical domain, the numbers look as follows:
|Domain||Proprietary vocabulary terms||Dereferenceable proprietary term URIs||Non-dereferenceable proprietary term URIs||No proprietary vocabulary terms|
|Media||15/25 (60.00 %)||12/25 (48.00 %)||3/25 (12.00 %)||10/25 (40.00 %)|
|Geographic||25/31 (80.65 %)||16/31 (51.61 %)||9/31 (29.03 %)||6/31 (19.35 %)|
|Government||39/49 (79.59 %)||33/49 (67.35 %)||6/49 (12.24 %)||10/49 (20.41 %)|
|Publications||57/87 (65.52 %)||50/87 (57.47 %)||7/87 (8.05 %)||30/87 (34.48 %)|
|Cross-domain||31/41 (75.61 %)||28/41 (68.29 %)||3/41 (7.32 %)||9/41 (21.95 %)|
|Life sciences||13/41 (31.71 %)||11/41 (26.83 %)||2/41 (4.88 %)||28/41 (68.29 %)|
|User-generated content||9/20 (45.00 %)||8/20 (40.00 %)||1/20 (5.00 %)||9/20 (45.00 %)|
Proprietary vocabulary terms should be related to corresponding terms within other (widely used) vocabularies in order to enable applications to understand as much data as possible and to translate data into their target schemata (see RDF and the Self-Describing Semantic Web). The W3C recommendations define the following terms for representing such correspondences (mappings): owl:equivalentClass and owl:equivalentProperty or, if a looser mapping is desired, rdfs:subClassOf, rdfs:subPropertyOf, skos:broadMatch, and skos:narrowMatch.
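To make the idea of such mappings concrete, the following sketch serializes two mapping statements as N-Triples lines. The proprietary terms under `http://example.org/vocab#` and the helper `mapping_triple` are hypothetical. The FOAF, OWL and RDFS namespaces are the real ones.

```python
OWL = "http://www.w3.org/2002/07/owl#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
FOAF = "http://xmlns.com/foaf/0.1/"


def mapping_triple(proprietary_term: str, relation: str, shared_term: str) -> str:
    """Serialize one vocabulary mapping as an N-Triples statement."""
    return f"<{proprietary_term}> <{relation}> <{shared_term}> ."


# Hypothetical proprietary terms mapped to terms of a widely used vocabulary.
mappings = [
    # Looser mapping: every example Author is also a foaf:Person.
    mapping_triple("http://example.org/vocab#Author", RDFS + "subClassOf", FOAF + "Person"),
    # Strict mapping: both properties are declared interchangeable.
    mapping_triple("http://example.org/vocab#fullName", OWL + "equivalentProperty", FOAF + "name"),
]

for line in mappings:
    print(line)
```

An application that understands FOAF could use the first statement to treat instances of the proprietary class as persons without knowing the proprietary vocabulary in advance.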
Currently 15 (7.89 %) out of the 190 data sources that use proprietary terms provide mappings to other vocabularies for their terms.
In order to enable applications to be sure about the origin of data as well as to enable them to assess the quality of data, data sources should publish provenance metadata together with the primary data. A common means for providing provenance information is to represent it as document-level metadata as described in the How to publish Linked Data tutorial. A widely deployed vocabulary for representing provenance information is Dublin Core (dc:creator, dc:publisher, dc:date). Alternative vocabularies that allow representing more details about the data creation process include the Open Provenance Model as well as the vocabularies examined by the W3C Provenance Incubator Group.
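A minimal sketch of document-level provenance metadata is shown below. It emits N-Triples statements about the RDF document itself using the three Dublin Core properties named above. The document, creator and publisher URIs and the function name `provenance_triples` are invented for this illustration.

```python
DC = "http://purl.org/dc/elements/1.1/"


def provenance_triples(document_uri, creator_uri, publisher_uri, date):
    """Describe who created and published an RDF document, and when."""
    return [
        f"<{document_uri}> <{DC}creator> <{creator_uri}> .",
        f"<{document_uri}> <{DC}publisher> <{publisher_uri}> .",
        # The date is written as a plain literal in ISO 8601 form.
        f'<{document_uri}> <{DC}date> "{date}" .',
    ]


# Hypothetical document and agents.
statements = provenance_triples(
    "http://example.org/data/berlin.rdf",
    "http://example.org/id/alice",
    "http://example.org/id/example-org",
    "2011-09-19",
)
print("\n".join(statements))
```

Note that the subject of each statement is the document that carries the data, not the entity the data describes.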
Split by topical domain, the figures look as follows:
|Domain||Data sets providing provenance metadata|
|Media||4/25 (16.00 %)|
|Geographic||15/31 (48.39 %)|
|Government||10/49 (20.41 %)|
|Publications||41/87 (47.13 %)|
|Cross-domain||10/41 (24.39 %)|
|Life sciences||2/41 (4.88 %)|
|User-generated content||6/20 (30.00 %)|
Web data should be self-descriptive concerning any restrictions that apply to its usage. A common way to express such restrictions is to attach a data license to published data. Doing so is essential to enable applications to use Web data on a secure legal basis. A common means to attach licenses to Linked Data is to use dc:rights links pointing at the license as document-level metadata. An example of this is given in the How to publish Linked Data tutorial.
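From the consumer's side, checking for a license can be sketched as a lookup of dc:rights statements about the retrieved document. The triples are assumed to be already parsed into (subject, predicate, object) tuples. The document URI and the function name `licenses_of` are hypothetical.

```python
DC_RIGHTS = "http://purl.org/dc/elements/1.1/rights"


def licenses_of(triples, document_uri):
    """Return the targets of all dc:rights links attached to the document."""
    return [
        obj
        for subject, predicate, obj in triples
        if subject == document_uri and predicate == DC_RIGHTS
    ]


# Invented, already-parsed content of one RDF document.
document = "http://example.org/data/berlin.rdf"
triples = [
    (document, DC_RIGHTS, "http://creativecommons.org/licenses/by/3.0/"),
    ("http://example.org/id/berlin", "http://www.w3.org/2000/01/rdf-schema#label", "Berlin"),
]

found = licenses_of(triples, document)
print(found if found else "no license statement found")
```

An empty result means that the document does not state its usage terms in this way, which is the situation for most data sets in the figures that follow.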
Split by topical domain, the figures look as follows:
|Domain||Data sets providing license information|
|Media||4/25 (16.00 %)|
|Geographic||9/31 (29.03 %)|
|Government||7/49 (14.29 %)|
|Publications||9/87 (10.34 %)|
|Cross-domain||8/41 (19.51 %)|
|Life sciences||1/41 (2.44 %)|
|User-generated content||4/20 (20.00 %)|
In addition to making instance data self-descriptive, it is also desirable that data publishers provide metadata describing characteristics of complete data sets, for instance the topic of a dataset and more detailed statistics. A vocabulary for representing such metadata is the voiD vocabulary. Semantic Sitemaps are a second means for representing dataset-level metadata.
Split by topical domain, the figures look as follows:
|Domain||voiD||Semantic Sitemap||voiD or Semantic Sitemap|
|Media||5/25 (20.00 %)||0/25 (0.00 %)||5/25 (20.00 %)|
|Geographic||12/31 (38.71 %)||3/31 (9.68 %)||13/31 (41.94 %)|
|Government||21/49 (42.86 %)||3/49 (6.12 %)||23/49 (46.94 %)|
|Publications||39/87 (44.83 %)||34/87 (39.08 %)||43/87 (49.43 %)|
|Cross-domain||9/41 (21.95 %)||6/41 (14.63 %)||11/41 (26.83 %)|
|Life sciences||3/41 (7.32 %)||6/41 (14.63 %)||7/41 (17.07 %)|
|User-generated content||5/20 (25.00 %)||1/20 (5.00 %)||6/20 (30.00 %)|
The primary way to publish Linked Data on the Web is to make the URIs that identify data items dereferenceable into RDF descriptions. In addition, various LOD data providers have chosen to provide two alternative means of access to their data: SPARQL endpoints and RDF dumps.
Altogether 201 (68.14 %) out of the 295 data sources provide a SPARQL endpoint. 117 (39.66 %) out of the 295 data sources provide RDF dumps.
The voiD vocabulary provides terms for pointing applications from the description of a single entity to these alternative means of access. For this purpose, voiD recommends using a link of the type dcterms:isPartOf to point from the entity description to a voiD description of the complete dataset. This dataset description in turn may contain void:sparqlEndpoint and void:dataDump links pointing at the SPARQL endpoint and the download URI of RDF dataset dumps.
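The two-step discovery described above can be sketched as follows. For simplicity, the entity description and the voiD description are assumed to be available in one list of already-parsed triples, whereas in practice the voiD description would usually be fetched in a second request. All `example.org` URIs and the function name `access_points` are hypothetical.

```python
DCTERMS_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"
VOID = "http://rdfs.org/ns/void#"


def access_points(triples, entity_description_uri):
    """Follow dcterms:isPartOf to the dataset and collect its access links."""
    datasets = {
        obj
        for subject, predicate, obj in triples
        if subject == entity_description_uri and predicate == DCTERMS_IS_PART_OF
    }
    result = {"sparqlEndpoint": [], "dataDump": []}
    for subject, predicate, obj in triples:
        if subject in datasets and predicate.startswith(VOID):
            name = predicate[len(VOID):]
            if name in result:
                result[name].append(obj)
    return result


# Invented triples: one entity description plus a voiD dataset description.
triples = [
    ("http://example.org/data/berlin.rdf", DCTERMS_IS_PART_OF, "http://example.org/void#dataset"),
    ("http://example.org/void#dataset", VOID + "sparqlEndpoint", "http://example.org/sparql"),
    ("http://example.org/void#dataset", VOID + "dataDump", "http://example.org/dumps/all.nt.gz"),
]

points = access_points(triples, "http://example.org/data/berlin.rdf")
print(points)
```

A crawler could use the returned dump URI to download the whole data set instead of dereferencing each entity URI individually.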