@tkuhn
Created October 9, 2014 14:27
Trusty URI presentation sources
\documentclass{beamer}
\usepackage{graphicx}
\usepackage{multicol}
\usepackage{bchart}
\usepackage{wasysym}
\usepackage{amssymb}
\usepackage{multirow}
\usepackage{rotating}
\usepackage{tikz}
\frenchspacing
\beamertemplatenavigationsymbolsempty
\usefonttheme{structurebold}
\useoutertheme{infolines}
\useinnertheme{circles}
\setbeamersize{text margin left=8mm, text margin right=8mm}
\setbeamertemplate{frametitle}{\begin{centering}\insertframetitle\par\end{centering}}
\definecolor{emphcolor}{rgb}{0.19,0.22,0.68}
\definecolor{backcolor}{rgb}{0.84,0.85,0.95}
\newcommand{\coloremph}[1]{{\color{emphcolor}#1}}
\newcommand{\colorbold}[1]{{\color{emphcolor}\bfseries#1}}
\newcommand{\arrowitem}{\item[\boldmath$\Rightarrow$]}
\newcommand{\plainfootnote}[1]{\let\thefootnote\relax\footnote{\vspace{-3.5mm}~\\\hspace{1.5mm}\tiny #1\vspace{0.5mm}}}
\setbeamercolor{block title}{bg=emphcolor,fg=white}
\setbeamercolor{block body}{bg=backcolor,fg=black}
\setbeamertemplate{blocks}[rounded]
\begin{document}
\setbeamertemplate{headline}{}
\setbeamertemplate{footline}{}
\title[Trusty URIs]{Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data\\\medskip\texttt{\#eswc2014Kuhn}}
\author[Tobias Kuhn]{\colorbold{Tobias Kuhn} and Michel Dumontier\\\medskip\small\url{http://www.tkuhn.ch} / \url{http://dumontierlab.com}\\\texttt{@txkuhn} / \texttt{@micheldumontier}}
\institute[ETH Zurich]{ETH Zurich / Stanford University}
\date{ESWC\\27 May 2014}
\begin{frame}
\titlepage
\end{frame}
\setbeamertemplate{footline}{%
\color{gray}~~%
\insertshortauthor, \insertshortinstitute \hfill %
\insertshorttitle \hfill %
%\insertshortdate \hfill %
\insertframenumber\,/\,\inserttotalframenumber%
~~\smallskip\newline%
}
\begin{frame}
\frametitle{Motivation 1}
\colorbold{The Semantic Web:} Web content becomes machine-interpretable.
\par\bigskip
\coloremph{Machines} (i.e. algorithms) can then perform --- on large amounts of linked data --- tasks such as:
\emph{automated aggregation, complex searches, problem solving, recommendations, and much more ...}
\begin{center}
\includegraphics[width=2cm]{computer.pdf}
\end{center}
\colorbold{But wait...} even human users are often \coloremph{easily tricked by spam and fraudulent content} found on the web. \colorbold{We should be even more concerned in the case of machines!}
\end{frame}
\begin{frame}
\frametitle{Motivation 2}
Sue publishes a script that allows everybody to \coloremph{replicate} her scientific analysis:\bigskip\\
\begin{minipage}{1.6cm}
\includegraphics[width=1.2cm]{researcher.pdf}
\end{minipage}%
\begin{minipage}{7cm}
\texttt{\coloremph{\small%
\# Download data:\\
wget http://some-third-party.org/dataset/1.4\\
\# Analyze data\\
...
}}
\end{minipage}\bigskip\\
But what if the third party \coloremph{silently changes} that version of the dataset? What if the resource becomes \coloremph{unavailable} at this location? What if the web site later gets hacked and the \coloremph{data manipulated}?
\end{frame}
\begin{frame}
\frametitle{Motivation 3}
\colorbold{Nanopublications:} Atomic pieces of scientific results together with their provenance, all represented in RDF.
\begin{itemize}
\item \coloremph{Citation networks:} nanopubs can cite or refer to other nanopubs
\item Nanopubs are supposed to be \coloremph{immutable}
\end{itemize}
\vspace{-3mm}
\begin{center}
\includegraphics[width=8cm]{nanopubs.pdf}
\end{center}
\vspace{-3mm}
\colorbold{Problem:}
\begin{itemize}
\item A scientist citing something wants to be sure that it is \coloremph{not silently changed afterwards}
\item The current web has \coloremph{no mechanism} to enforce immutability
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Problem}
\begin{center}
{\texttt{http://some-third-party.org/dataset/1.4}}
\medskip\\
\color{emphcolor}{\scalebox{3}{\phantom{?} $\Updownarrow$ \textbf{?}}}
\medskip\\
\includegraphics[width=2cm]{file.png}
\end{center}
Given a URI for a digital artifact, there is no reliable standard procedure for checking whether a retrieved file really represents the \coloremph{correct and original state} of that artifact.
\end{frame}
\begin{frame}
\frametitle{We Need URIs We Can Trust!}
\begin{center}
\includegraphics[width=40mm]{trustyuri.png}\\
\LARGE\textbf{Trusty URIs}
\end{center}
\end{frame}
\begin{frame}
\frametitle{Trusty URIs}
\colorbold{Basic idea:} Use of \coloremph{cryptographic hash values} calculated on digital artifacts.
\par\bigskip
\colorbold{Requirements:}
\begin{itemize}
\item To allow for the verification of entire reference trees, \coloremph{the hash should be part of the reference} (i.e. the URI)
\item To allow for meta-data, digital artifacts should be allowed to contain \coloremph{self-references} (i.e. their own URI)
\item \coloremph{Format-independent} hash for \coloremph{different kinds of content}
\item The complete approach should be \coloremph{decentralized and open}
\item We want to use them \coloremph{right away}
\end{itemize}
\par\bigskip
\colorbold{Example:}
{\footnotesize
\coloremph{\texttt{http://example.org/r1.RA5AbXdpz5DcaYXCh9l3eI9ruBosiL5XDU3rxBbBaUO70}}
}
\end{frame}
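\begin{frame}[fragile]
\frametitle{Sketch: Self-References and Format Independence}
\footnotesize
A toy Python sketch of how a hash can tolerate self-references and ignore statement order. It is not the actual RA algorithm: the placeholder, the sorting of plain triples, and the name \texttt{toy\_trusty\_uri} are assumptions made for illustration only.
\par\smallskip
\tiny
\begin{verbatim}
```python
import base64
import hashlib
from typing import List, Tuple

Triple = Tuple[str, str, str]
PLACEHOLDER = " "  # hypothetical stand-in for the artifact's own URI


def toy_trusty_uri(base_uri: str, triples: List[Triple]) -> Tuple[str, List[Triple]]:
    """Return (URI with hash, triples whose self-references point to that URI)."""
    # Mask self-references so that the hash does not depend on itself.
    masked = set()
    for triple in triples:
        masked.add(tuple(PLACEHOLDER if t == base_uri else t for t in triple))
    # Sort the statements so that their order in the input is irrelevant.
    canonical = "\n".join(" ".join(triple) for triple in sorted(masked))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # Assumed encoding: module id + URL-safe base64 without padding (43 chars).
    code = "RA" + base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    uri = base_uri + "." + code
    final = [tuple(uri if t == base_uri else t for t in triple) for triple in triples]
    return uri, final
```
\end{verbatim}
\end{frame}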
\begin{frame}
\frametitle{Trusty URIs: Range of Verifiability}
\vfill
With the hash as a part of the URI, the \coloremph{``range of verifiability''} extends to referenced artifacts (if they also use trusty URIs):
\vfill\par
\makebox[\textwidth][c]{\includegraphics[width=1.1\textwidth]{verifiability.pdf}}
\vfill
\end{frame}
\begin{frame}
\frametitle{Trusty URI Modules}
Currently, there are two trusty URI modules:
\begin{itemize}
\item \colorbold{FA:} Plain files (i.e. byte sequences)
\item \colorbold{RA:} \coloremph{Sets of RDF graphs}
\item More to come in the future...
\end{itemize}
\par\bigskip
The first character (\colorbold{F} or \colorbold{R}) represents the \coloremph{type} of the module; the second character (\colorbold{A}) its \coloremph{version}.
\end{frame}
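\begin{frame}[fragile]
\frametitle{Sketch: Hashing and Checking a Plain File}
\footnotesize
A minimal Python sketch of how an artifact code for module FA might be computed and checked. It assumes SHA-256 with URL-safe base64 (which yields 43 characters, as in the example URI) and a dot as separator. The function names are hypothetical, and the exact encoding of the actual module may differ.
\par\smallskip
\tiny
\begin{verbatim}
```python
import base64
import hashlib


def fa_artifact_code(content: bytes) -> str:
    """Return a 45-character code: module id 'FA' plus the encoded digest."""
    digest = hashlib.sha256(content).digest()  # 32 bytes
    # 32 bytes give 44 base64 characters including one '=' pad; drop the pad.
    encoded = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return "FA" + encoded


def make_trusty_uri(base_uri: str, content: bytes) -> str:
    """Append the artifact code to a base URI (assumed separator: a dot)."""
    return base_uri + "." + fa_artifact_code(content)


def check(uri: str, content: bytes) -> bool:
    """Recompute the code from the content and compare it with the URI's tail."""
    return uri[-45:] == fa_artifact_code(content)


uri = make_trusty_uri("http://example.org/r1", b"hello")
ok = check(uri, b"hello")        # unchanged content
changed = check(uri, b"hellp")   # one byte changed
```
\end{verbatim}
\end{frame}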
\begin{frame}
\frametitle{Example: Nanobrowser}
\begin{center}
\makebox[\textwidth][c]{%
\scalebox{0.9}{%
\setlength{\fboxsep}{0mm}%
\begin{tikzpicture}
\node[inner sep=0mm] (img) at (0,0) {\fbox{\includegraphics[width=13.5cm]{nanobrowser.png}}};
\node[inner sep=0mm] (i) at (-1.5,-2) {\setlength{\fboxrule}{0.5mm}\fbox{\includegraphics[scale=0.4]{nanobrowser-valid.png}}};
\node[circle,fill=white,draw,very thick] at (i.north east) {\textbf{1}};
\draw[->,very thick] (i) -- (-6.1,2.7);
\node[inner sep=0mm] (l) at (1.5,0.5) {\setlength{\fboxrule}{0.5mm}\fbox{\includegraphics[scale=0.4]{nanobrowser-buttons.png}}};
\node[circle,fill=white,draw,very thick] at (l.north east) {\textbf{2}};
\draw[->,very thick] (l) -- (-1.2,2.7);
\end{tikzpicture}%
}}%
\smallskip\par
\url{http://nanobrowser.inn.ac}
\end{center}
\end{frame}
\begin{frame}
\frametitle{Verifiable {\normalfont --- Immutable --- Permanent}}
\begin{center}
\color{green}{\scalebox{7.0}{\checkmark}}
\end{center}
\vfill
Whether or not a given resource is the one a given trusty URI is supposed to represent can be \colorbold{verified with perfect confidence}.
\par\bigskip\footnotesize
(assuming that the trusty URI for the required artifact is known, e.g. because another artifact contains it as a link)
\end{frame}
\begin{frame}
\frametitle{{\normalfont Verifiable ---} Immutable {\normalfont --- Permanent}}
\begin{center}
\scalebox{5.0}{\includegraphics[width=0.5cm]{pen.pdf}}
\end{center}
\vfill
Trusty URI artifacts are \colorbold{immutable}, as any change in the content also changes its URI, thereby making it a \coloremph{new} artifact.
\par\bigskip\footnotesize
(as soon as your trusty URI has been picked up by third parties, e.g. cached or linked from other resources, every change will be noticed)
\end{frame}
\begin{frame}
\frametitle{{\normalfont Verifiable --- Immutable ---} Permanent}
\begin{center}
\color{brown}{\scalebox{9.0}{\clock}}
\end{center}
\vfill
Trusty URI artifacts are \colorbold{permanent}, as they can be retrieved from the caches of third-party websites if they are no longer available at their original location.
\par\bigskip\footnotesize
(if there are search engines and web archives regularly crawling and caching the artifacts on the web)
\end{frame}
\begin{frame}
\frametitle{Permanent Digital Artifacts}
\colorbold{Ideally,} a (trusty) artifact should be \coloremph{retrievable via its URI}:
\begin{itemize}
\item[\LARGE $\Rightarrow$] {\footnotesize\texttt{http://my-organization.org/datasets/RA5AbX...}}
\end{itemize}
\medskip\par
\colorbold{But if not,} we can also retrieve it \coloremph{from third-party sources}:
\begin{itemize}
\item[\color{red}{\LARGE $\nRightarrow$}] {\footnotesize\texttt{http://my-organization.org/datasets/RA5AbX...}}
\item[\LARGE $\Rightarrow$] {\footnotesize\texttt{http://hashcache.org/object/RA5AbX...}}
\item[\LARGE $\Rightarrow$] {\footnotesize\texttt{http://artifact-archive.com/artifacts/RA5AbX...}}
\item[\LARGE $\Rightarrow$] {\footnotesize\texttt{http://nasty-server.com/no-need-to-trust-me/RA5AbX...}}
\end{itemize}
\vspace{-3mm}
\begin{center}
\scalebox{5.0}{\includegraphics[width=15mm]{servers.pdf}}
\end{center}
\vspace{-7mm}
\end{frame}
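\begin{frame}[fragile]
\frametitle{Sketch: Retrieval from Untrusted Locations}
\footnotesize
A Python sketch of the fallback idea: try several locations and accept the first copy whose recomputed code matches. The \texttt{fetch} callback and the in-memory table stand in for HTTP requests, and the hash encoding (SHA-256, URL-safe base64, module id FA) is an assumption.
\par\smallskip
\tiny
\begin{verbatim}
```python
import base64
import hashlib
from typing import Callable, Iterable, Optional


def artifact_code(content: bytes) -> str:
    digest = hashlib.sha256(content).digest()
    return "FA" + base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def retrieve_verified(code: str, locations: Iterable[str],
                      fetch: Callable[[str], Optional[bytes]]) -> Optional[bytes]:
    """Return the first fetched content whose recomputed code matches."""
    for url in locations:
        content = fetch(url)  # None means the location is unavailable
        if content is not None and artifact_code(content) == code:
            return content  # verified, so the server need not be trusted
    return None


# Hypothetical stand-in for the web: what each server would return.
original = b"some dataset"
code = artifact_code(original)
servers = {
    "http://my-organization.org/datasets/" + code: None,  # offline
    "http://nasty-server.com/no-need-to-trust-me/" + code: b"tampered",
    "http://hashcache.org/object/" + code: original,
}
result = retrieve_verified(code, servers, servers.get)
```
\end{verbatim}
\end{frame}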
\begin{frame}
\frametitle{Implementations}
\colorbold{(Partial) Implementations in:}
\begin{itemize}
\item \coloremph{Java} {\footnotesize(\url{https://github.com/trustyuri/trustyuri-java})}
\item \coloremph{Python} {\footnotesize(\url{https://github.com/trustyuri/trustyuri-python})}
\item \coloremph{Perl} {\footnotesize(\url{https://github.com/trustyuri/trustyuri-perl})}
\item \emph{more to come...}
\end{itemize}
\vfill
\colorbold{Functions:}
\begin{itemize}
\item \coloremph{General:} CheckFile, RunBatch
\item \coloremph{Module FA only:} ProcessFile
\item \coloremph{Module RA only:} TransformRdf, TransformLargeRdf, TransformNanopub, CheckLargeRdf, CheckSortedRdf, CheckNanopubViaSparql
\item \emph{more to come...}
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Evaluation 1: Nanopubs}
\vspace{-5mm}
\begin{center}
\includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\includegraphics[width=4mm]{file.png} \includegraphics[width=3mm]{file.png}
\end{center}
\vspace{-3mm}
We took \coloremph{$\sim$150,000 nanopublications} from previous work, transformed them to \coloremph{different formats} (TriG, N-Quads, and TriX), and then generated trusty URIs for them.
\begin{itemize}
\item[$\Rightarrow$] \coloremph{For any given nanopub, the same trusty URI was generated for the different formats}
\end{itemize}
\par\bigskip
Then we \coloremph{checked these trusty URIs}, also for corrupted copies of the files (one random byte changed).
\begin{itemize}
\item[$\Rightarrow$] \coloremph{All non-corrupted files were successfully validated}
\item[$\Rightarrow$] \coloremph{All corrupted files either led to errors or failed validation} (except for $<$1\% harmless cases in TriX format where the changed byte is not part of the RDF content)
\item[$\Rightarrow$] \coloremph{Checking with Java in batch mode took 0.001\,s per nanopub}
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Evaluation 2: Bio2RDF}
\vspace{-5mm}
\begin{center}
\includegraphics[width=12mm]{file.png}
\includegraphics[width=22mm]{file.png}
\includegraphics[width=8mm]{file.png}
\includegraphics[width=15mm]{file.png}
\end{center}
\vspace{-3mm}
To evaluate our approach on \coloremph{larger files}, we transformed and checked 858 RDF files from Bio2RDF.
\begin{itemize}
\item File sizes ranging from 1.4\,kB to 177\,GB
\item[$\Rightarrow$] \coloremph{Files smaller than 10\,MB require less than 3 seconds to be transformed or checked}
\item[$\Rightarrow$] \coloremph{Large files of 2\,GB require $\sim$5\,min to be transformed and $\sim$2\,min to be checked}
\item[$\Rightarrow$] \coloremph{Largest file of 177\,GB (much larger than memory) required 29\,h to be transformed and 3\,h to be checked}
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Make This a Community Effort}
\begin{center}
\scalebox{5.0}{\includegraphics[width=12mm]{community.pdf}}
\end{center}
\vfill
\colorbold{Code on GitHub:} \url{https://github.com/trustyuri/}
\bigskip\par
\colorbold{Permissive Open Source License}
\bigskip\par
\colorbold{Open Development:} Let us know if you want to be involved!
\bigskip\par
\colorbold{Wiki} (including wish list):\\
\url{https://github.com/trustyuri/trustyuri/wiki}
\end{frame}
\begin{frame}
\frametitle{Conclusions and Future Work}
\colorbold{Contribution:}
\begin{itemize}
\item \coloremph{Unambiguous URI references} for verifiable, immutable, and permanent digital artifacts
\item Proposal of a central \coloremph{technical pillar} of the (semantic) web
\item In particular for scientific data, where \coloremph{provenance and verifiability} are crucial
\end{itemize}
\par\bigskip
\colorbold{Planned usage:}
\begin{itemize}
\item Next version of \coloremph{Bio2RDF}
\item Nanopublications for \coloremph{neXtProt} (currently $\sim$20 million nanopubs)
\item \coloremph{Nanopub server} (for publishing and archiving nanopubs)
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Thank You for Your Attention!}
\vfill
\vfill
\begin{center}
\Large Twitter: \coloremph{\texttt{@txkuhn}} and \coloremph{\texttt{\#eswc2014Kuhn}}
\par\medskip
Web: \coloremph{\url{http://trustyuri.net}}
\end{center}
\vfill
\end{frame}
\appendix
\newcounter{finalframe}
\setcounter{finalframe}{\value{framenumber}}
\begin{frame}
\frametitle{Some Additional Slides Follow...}
\end{frame}
\begin{frame}
\frametitle{Related Approaches}
\colorbold{Git} inspired the design of trusty URIs: Git refers to \coloremph{commits} by hash values calculated in a recursive way.
\par\bigskip
\colorbold{Named Information (ni) URIs:}
\par\smallskip
{~~~~~\footnotesize\url{ni:///sha-256;UyaQV-Ev4rdLoHyJJWCi11OHfrYv9E1aGQAlMO2X_-Q}}
\par\smallskip
(Trusty URIs can be mapped to ni-URIs.)
\vfill
What is \colorbold{missing} in these approaches:
\begin{itemize}
\item Digital artifacts on a \coloremph{more abstract level than byte sequences}
\item Support for \coloremph{self-references}
\end{itemize}
\end{frame}
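\begin{frame}[fragile]
\frametitle{Sketch: Mapping a Trusty URI to an ni-URI}
\footnotesize
A Python sketch of one possible mapping. The slides only state that a mapping exists, so the details here are assumptions: that the hash is SHA-256, that the URI ends in a two-letter module id followed by 43 base64url characters, and that the module is carried in a \texttt{?module=} query parameter.
\par\smallskip
\tiny
\begin{verbatim}
```python
import re

# Assumed shape of the tail: two-letter module id + 43 base64url characters.
TRUSTY_TAIL = re.compile(r"([A-Z][A-Z])([A-Za-z0-9_-]{43})$")


def to_ni_uri(trusty_uri: str) -> str:
    """Build an ni-URI from the hash part of a trusty URI (assumed mapping)."""
    match = TRUSTY_TAIL.search(trusty_uri)
    if match is None:
        raise ValueError("no artifact code found at the end of the URI")
    module, digest = match.groups()
    # The '?module=' parameter is an assumption, not taken from the slides.
    return "ni:///sha-256;" + digest + "?module=" + module


ni = to_ni_uri(
    "http://example.org/r1.RA5AbXdpz5DcaYXCh9l3eI9ruBosiL5XDU3rxBbBaUO70"
)
```
\end{verbatim}
\end{frame}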
\begin{frame}
\frametitle{Skolemization of Blank Nodes}
The hash also helps us to \coloremph{solve the problem of blank nodes} for canonicalization of RDF content: We use the hash to \coloremph{skolemize blank nodes}:
\medskip\par
{\footnotesize
\texttt{http://foo.org/r3.RACjKTA5dl23ed7JIpgPmS0E0dcU-XmWIBnGn6Iyk8B-U\#\_1}\\
\texttt{http://foo.org/r3.RACjKTA5dl23ed7JIpgPmS0E0dcU-XmWIBnGn6Iyk8B-U\#\_2}\\
\texttt{...}
}
\medskip\par
These URIs are guaranteed to \coloremph{have never been used before} (except possibly for exactly the same content).
\end{frame}
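\begin{frame}[fragile]
\frametitle{Sketch: Skolemizing Blank Nodes}
\footnotesize
A small Python sketch of the naming scheme shown above: each blank node label is replaced by the artifact URI followed by \texttt{\#\_1}, \texttt{\#\_2}, and so on. Numbering by first occurrence and the function name are assumptions; how the hash itself is obtained before this step is not shown here.
\par\smallskip
\tiny
\begin{verbatim}
```python
from typing import Dict, List


def skolemize(artifact_uri: str, blank_labels: List[str]) -> Dict[str, str]:
    """Map each distinct blank node label to '<artifact_uri>#_<n>'."""
    mapping: Dict[str, str] = {}
    for label in blank_labels:
        if label not in mapping:
            # Assumed numbering: order of first occurrence, starting at 1.
            mapping[label] = artifact_uri + "#_" + str(len(mapping) + 1)
    return mapping


uri = "http://foo.org/r3.RACjKTA5dl23ed7JIpgPmS0E0dcU-XmWIBnGn6Iyk8B-U"
names = skolemize(uri, ["_:a", "_:b", "_:a"])
```
\end{verbatim}
\end{frame}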
\begin{frame}
\frametitle{Performance for Nanopubs in Batch Mode}
\begin{center}
\includegraphics[width=\textwidth]{performance.png}
\end{center}
\end{frame}
\begin{frame}
\frametitle{Performance for Large Files (Bio2RDF)}
\begin{center}
\begin{minipage}{10cm}
\includegraphics[trim=15mm 11.5mm 15mm 3mm, clip=true, width=\textwidth]{bio2rdf-transform.pdf}\\
\includegraphics[trim=15mm 0mm 15mm 6mm, clip=true, width=\textwidth]{bio2rdf-check.pdf}
\end{minipage}
\end{center}
\end{frame}
\setcounter{framenumber}{\value{finalframe}}
\end{document}