% Trusty URI presentation sources
\title[Trusty URIs]{Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data\\\medskip\texttt{\#eswc2014Kuhn}}
\author[Tobias Kuhn]{\colorbold{Tobias Kuhn} and Michel Dumontier\\\medskip\small\url{} / \url{}\\\texttt{@txkuhn} / \texttt{@micheldumontier}}
\institute[ETH Zurich]{ETH Zurich / Stanford University}
\date{ESWC\\27 May 2014}
\frametitle{Motivation 1}
\colorbold{The Semantic Web:} Web content becomes machine-interpretable.
\coloremph{Machines} (i.e. algorithms) can then perform --- on large amounts of linked data --- tasks such as:
\emph{automated aggregation, complex searches, problem solving, recommendations, and much more ...}
\colorbold{But wait...} Even human users are often \coloremph{easily tricked by spam and fraudulent content} found on the web. \colorbold{We should be even more concerned in the case of machines!}
\frametitle{Motivation 2}
Sue publishes a script that allows everybody to \coloremph{replicate} her scientific analysis:\bigskip\\
\# Download data:\\
\# Analyze data\\
But what if the third party \coloremph{silently changes} that version of the dataset? What if the resource becomes \coloremph{unavailable} at this location? What if the website later gets hacked and the \coloremph{data manipulated}?
\frametitle{Motivation 3}
\colorbold{Nanopublications:} Atomic pieces of scientific results together with their provenance, all represented in RDF.
\item \coloremph{Citation networks:} nanopubs can cite or refer to other nanopubs
\item Nanopubs are supposed to be \coloremph{immutable}
\item A scientist citing something wants to be sure that it is \coloremph{not silently changed afterwards}
\item The current web has \coloremph{no mechanism} to enforce immutability
Given a URI for a digital artifact, there is no reliable standard procedure for checking whether a retrieved file really represents the \coloremph{correct and original state} of that artifact.
\frametitle{We need URIs we can Trust!}
\LARGE\textbf{Trusty URIs}
\frametitle{Trusty URIs}
\colorbold{Basic idea:} Use of \coloremph{cryptographic hash values} calculated on digital artifacts.
\item To allow for the verification of entire reference trees, \coloremph{the hash should be part of the reference} (i.e. the URI)
\item To allow for meta-data, digital artifacts should be allowed to contain \coloremph{self-references} (i.e. their own URI)
\item \coloremph{Format-independent} hash for \coloremph{different kinds of content}
\item The complete approach should be \coloremph{decentralized and open}
\item We want to use them \coloremph{right away}
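The basic idea can be sketched in a few lines of Python. This is only an illustration for plain byte content, not the reference implementation: it assumes SHA-256 as the hash function and URL-safe Base64 without padding as the encoding, and the function names and the \texttt{example.org} base URI are hypothetical.

```python
import base64
import hashlib


def artifact_code(content: bytes, module: str = "FA") -> str:
    """Module identifier followed by the Base64url-encoded SHA-256 digest."""
    digest = hashlib.sha256(content).digest()
    # Assumption: URL-safe Base64 with the trailing '=' padding removed.
    encoded = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return module + encoded


def make_trusty_uri(base_uri: str, content: bytes) -> str:
    """Append the artifact code to a (hypothetical) base URI."""
    return base_uri + artifact_code(content)


def verify(uri: str, content: bytes) -> bool:
    """True if the content hashes to the artifact code at the end of the URI."""
    return uri.endswith(artifact_code(content))


uri = make_trusty_uri("http://example.org/data.", b"some dataset")
print(uri)
print(verify(uri, b"some dataset"))   # True
print(verify(uri, b"some dataset!"))  # False
```

Because the hash is part of the URI, anyone holding the URI can run the check without contacting a central authority.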
\frametitle{Trusty URIs: Range of Verifiability}
With the hash as a part of the URI, the \coloremph{``range of verifiability''} extends to referenced artifacts (if they also use trusty URIs):
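A rough sketch of how such a reference tree could be checked recursively is given below. It assumes plain-text artifacts that mention other trusty URIs literally, the same SHA-256 and Base64url encoding as an assumption, and a hypothetical \texttt{fetch} callable that returns the bytes for a URI (or \texttt{None}).

```python
import base64
import hashlib
import re

# Assumption: references look like http(s) URIs ending in "FA" + 43 Base64url chars.
TRUSTY_REF = re.compile(r"https?://\S*?FA[A-Za-z0-9_-]{43}(?![A-Za-z0-9_-])")


def artifact_code(content: bytes) -> str:
    digest = hashlib.sha256(content).digest()
    return "FA" + base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def verify_tree(uri, fetch, seen=None):
    """Verify an artifact and, recursively, every trusty URI it references."""
    seen = set() if seen is None else seen
    if uri in seen:  # already checked on this walk
        return True
    seen.add(uri)
    content = fetch(uri)
    if content is None or not uri.endswith(artifact_code(content)):
        return False
    references = TRUSTY_REF.findall(content.decode("utf-8", errors="replace"))
    return all(verify_tree(ref, fetch, seen) for ref in references)
```

A single manipulated artifact anywhere in the tree makes the check of the root fail, which is what extends the range of verifiability beyond the artifact itself.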
\frametitle{Trusty URI Modules}
Currently, there are two trusty URI modules:
\item \colorbold{FA:} Plain files (i.e. byte sequences)
\item \colorbold{RA:} \coloremph{Sets of RDF graphs}
\item More to come in the future...
The first character (\colorbold{F} or \colorbold{R}) represents the \coloremph{type} of the module; the second character (\colorbold{A}) its \coloremph{version}.
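Reading the module identifier from a URI might look as follows. The sketch assumes that the artifact code consists of the two identifier characters followed by 43 Base64url characters, optionally followed by a file extension; both the pattern and the function name are illustrative assumptions.

```python
import re

# Assumption: type letter, version letter, 43 hash characters, optional extension.
ARTIFACT_CODE = re.compile(
    r"([A-Z])([A-Z])([A-Za-z0-9_-]{43})(?:\.[A-Za-z0-9]+)?$"
)


def parse_artifact_code(uri: str):
    """Split the artifact code into module type, module version, and hash."""
    match = ARTIFACT_CODE.search(uri)
    if match is None:
        return None  # not a trusty URI under the assumptions above
    module_type, version, hash_part = match.groups()
    return {"type": module_type, "version": version, "hash": hash_part}


info = parse_artifact_code("http://example.org/np1.RA" + "a" * 43)
print(info["type"], info["version"])  # R A
```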
\frametitle{Example: Nanobrowser}
\frametitle{Verifiable {\normalfont --- Immutable --- Permanent}}
Whether or not a given resource is the one a given trusty URI is supposed to represent can be \colorbold{verified with perfect confidence}.
(assuming that the trusty URI for the required artifact is known, e.g. because another artifact contains it as a link)
\frametitle{{\normalfont Verifiable ---} Immutable {\normalfont --- Permanent}}
Trusty URI artifacts are \colorbold{immutable}, as any change in the content also changes its URI, thereby making it a \coloremph{new} artifact.
(as soon as your trusty URI has been picked up by third parties, e.g. cached or linked from other resources, every change will be noticed)
\frametitle{{\normalfont Verifiable --- Immutable ---} Permanent}
Trusty URI artifacts are \colorbold{permanent}, as they can be retrieved from the caches of third-party websites if they are no longer available at their original location.
(if there are search engines and web archives regularly crawling and caching the artifacts on the web)
\frametitle{Permanent Digital Artifacts}
\colorbold{Ideally,} a (trusty) artifact should be \coloremph{retrievable via its URI}:
\item[\LARGE $\Rightarrow$] {\footnotesize\texttt{}}
\colorbold{But if not,} we can also retrieve it \coloremph{from third-party sources}:
\item[{\color{red}\LARGE $\nRightarrow$}] {\footnotesize\texttt{}}
\item[\LARGE $\Rightarrow$] {\footnotesize\texttt{}}
\item[\LARGE $\Rightarrow$] {\footnotesize\texttt{}}
\item[\LARGE $\Rightarrow$] {\footnotesize\texttt{}}
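The fallback strategy above can be sketched as a loop over candidate sources, where a copy is accepted only if it matches the hash in the URI. The \texttt{sources} are hypothetical callables (original location, caches, archives), and the hash encoding is the same assumption as before.

```python
import base64
import hashlib
from typing import Callable, Iterable, Optional


def artifact_code(content: bytes) -> str:
    digest = hashlib.sha256(content).digest()
    return "FA" + base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def retrieve(
    uri: str, sources: Iterable[Callable[[str], Optional[bytes]]]
) -> Optional[bytes]:
    """Return the first copy whose hash matches the URI, or None."""
    for fetch in sources:
        try:
            content = fetch(uri)
        except Exception:
            continue  # source unreachable, try the next one
        if content is not None and uri.endswith(artifact_code(content)):
            return content  # verified, so the source need not be trusted
    return None
```

Since every copy is verified against the URI, it does not matter which third party delivers it.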
\colorbold{(Partial) Implementations in:}
\item \coloremph{Java} {\footnotesize(\url{})}
\item \coloremph{Python} {\footnotesize(\url{})}
\item \coloremph{Perl} {\footnotesize(\url{})}
\item \emph{more to come...}
\item \coloremph{General:} CheckFile, RunBatch
\item \coloremph{Module FA only:} ProcessFile
\item \coloremph{Module RA only:} TransformRdf, TransformLargeRdf, TransformNanopub, CheckLargeRdf, CheckSortedRdf, CheckNanopubViaSparql
\item \emph{more to come...}
\frametitle{Evaluation 1: Nanopubs}
We took \coloremph{$\sim$150,000 nanopublications} from previous work, transformed them into \coloremph{different formats} (TriG, N-Quads, and TriX), and then generated trusty URIs for them.
\item[$\Rightarrow$] \coloremph{For any given nanopub, the same trusty URI was generated for the different formats}
Then we \coloremph{checked these trusty URIs}, also for corrupted copies of the files (one random byte changed).
\item[$\Rightarrow$] \coloremph{All non-corrupted files were successfully validated}
\item[$\Rightarrow$] \coloremph{All corrupted files either led to errors or failed validation} (except for $<$1\% harmless cases in TriX format where the changed byte was not part of the RDF content)
\item[$\Rightarrow$] \coloremph{Checking with Java in batch mode takes 0.001s per nanopub}
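The corruption test can be illustrated at the byte level as follows. This is not the evaluation code: it hashes raw bytes with the assumed SHA-256 and Base64url encoding, whereas the RDF module works on the RDF content, which is why a changed byte outside that content can be harmless there.

```python
import base64
import hashlib
import random


def artifact_code(content: bytes) -> str:
    digest = hashlib.sha256(content).digest()
    return "FA" + base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def corrupt_one_byte(content: bytes, rng: random.Random) -> bytes:
    """Return a copy of content with exactly one randomly chosen byte changed."""
    index = rng.randrange(len(content))
    # XOR with a non-zero value guarantees that the byte really changes.
    changed = content[index] ^ rng.randrange(1, 256)
    return content[:index] + bytes([changed]) + content[index + 1:]


rng = random.Random(42)
original = b"<http://example.org/s> <http://example.org/p> <http://example.org/o> ."
corrupted = corrupt_one_byte(original, rng)
print(artifact_code(original) == artifact_code(corrupted))  # False
```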
\frametitle{Evaluation 2: Bio2RDF}
To evaluate our approach on \coloremph{larger files}, we transformed and checked 858 RDF files from Bio2RDF.
\item File sizes ranging from 1.4kB to 177GB
\item[$\Rightarrow$] \coloremph{Files smaller than 10MB required less than 3 seconds to be transformed or checked}
\item[$\Rightarrow$] \coloremph{Large files of 2GB required $\sim$5min to be transformed and $\sim$2min to be checked}
\item[$\Rightarrow$] \coloremph{Largest file of 177GB (much larger than memory) required 29h to be transformed and 3h to be checked}
\frametitle{Make This a Community Effort}
\colorbold{Code on GitHub:} \url{}
\colorbold{Permissive Open Source License}
\colorbold{Open Development:} Let us know if you want to be involved!
\colorbold{Wiki} (including wish list):\\
\frametitle{Conclusions and Future Work}
\item \coloremph{Unambiguous URI references} for verifiable, immutable, and permanent digital artifacts
\item Proposal of a central \coloremph{technical pillar} of the (semantic) web
\item In particular for scientific data, where \coloremph{provenance and verifiability} are crucial
\colorbold{Planned usage:}
\item Next version of \coloremph{Bio2RDF}
\item Nanopublications for \coloremph{neXtProt} (currently $\sim$20 million nanopubs)
\item \coloremph{Nanopub server} (for publishing and archiving nanopubs)
\frametitle{Thank you for your Attention!}
\Large Twitter: \coloremph{\texttt{@txkuhn}} and \coloremph{\texttt{\#eswc2014Kuhn}}
Web: \coloremph{\url{}}
\frametitle{Some Additional Slides Follow...}
\frametitle{Related Approaches}
\colorbold{Git} inspired the design of trusty URIs: Git refers to \coloremph{commits} by hash values calculated in a recursive way.
\colorbold{Named Information (ni) URIs:}
(Trusty URIs can be mapped to ni-URIs.)
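One possible mapping is sketched below. The \texttt{ni:///sha-256;...} form follows RFC 6920; that the hash part of a trusty URI can be reused unchanged, and that the module is carried in a \texttt{module} query parameter, are assumptions of this sketch.

```python
import re

# Assumption: the URI ends with a 2-letter module id and 43 Base64url characters.
ARTIFACT_CODE = re.compile(r"([A-Z]{2})([A-Za-z0-9_-]{43})$")


def to_ni_uri(trusty_uri: str) -> str:
    """Map a trusty URI to an ni-URI without authority (RFC 6920 syntax)."""
    match = ARTIFACT_CODE.search(trusty_uri)
    if match is None:
        raise ValueError("no artifact code found at the end of the URI")
    module, hash_part = match.groups()
    # The query parameter name 'module' is hypothetical.
    return f"ni:///sha-256;{hash_part}?module={module}"
```

The mapping drops the location part of the URI, so it goes in one direction only.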
What is \colorbold{missing} in these approaches:
\item Digital artifacts on a \coloremph{more abstract level than byte sequences}
\item Support for \coloremph{self-references}
\frametitle{Skolemization of Blank Nodes}
The hash also helps us \coloremph{solve the problem of blank nodes} when canonicalizing RDF content: we use it to \coloremph{skolemize blank nodes}, i.e. to replace them with URIs that contain the hash.
These URIs are guaranteed to \coloremph{have never been used before} (except possibly for exactly the same content).
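A much simplified sketch of this idea: blank nodes are first replaced by URIs containing a placeholder, the hash is computed over the sorted statements, and the placeholder is then replaced by the resulting artifact code. The triple representation, the placeholder character, the URI suffix, and the numbering of blank nodes by input order are all assumptions of this sketch, not the actual algorithm.

```python
import base64
import hashlib

PLACEHOLDER = " "  # assumption: a character that cannot occur in a valid URI


def skolemize(triples, base_uri):
    """Replace blank nodes ('_:label') by URIs that contain the content hash."""
    numbers = {}

    def render(term, code):
        if term.startswith("_:"):
            # Number blank nodes in order of first appearance.
            n = numbers.setdefault(term, len(numbers) + 1)
            return f"{base_uri}{code}#_{n}"
        return term

    # Pass 1: hash the statements with the placeholder in place of the hash.
    pre = sorted(tuple(render(t, PLACEHOLDER) for t in tr) for tr in triples)
    serialized = "\n".join(" ".join(tr) for tr in pre).encode("utf-8")
    digest = hashlib.sha256(serialized).digest()
    code = "RA" + base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

    # Pass 2: emit the statements with the real artifact code filled in.
    return [tuple(render(t, code) for t in tr) for tr in triples], code
```

Because the new URIs contain a hash of the content they occur in, a clash with an existing URI would require exactly the same content. A real canonicalization would also have to make the numbering independent of the input order.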
\frametitle{Performance for Nanopubs in Batch Mode}
\frametitle{Performance for Large Files (Bio2RDF)}
\includegraphics[trim=15mm 11.5mm 15mm 3mm, clip=true, width=\textwidth]{bio2rdf-transform.pdf}\\
\includegraphics[trim=15mm 0mm 15mm 6mm, clip=true, width=\textwidth]{bio2rdf-check.pdf}
