\documentclass[conference,10pt]{IEEEtran}
\usepackage[utf8]{inputenc}
\ifCLASSINFOpdf
\usepackage[pdftex]{graphicx}
\else
\usepackage[dvips]{graphicx}
\fi
\usepackage{algpseudocode}
\usepackage{amsmath}
\usepackage{color}
\usepackage{doi}
\usepackage{enumitem}
\usepackage{etoolbox}
\usepackage{fancyvrb}
\usepackage{float}
\usepackage{hyperref}
\usepackage{multicol}
\usepackage{pbox}
\usepackage{romannum}
\usepackage{tabularx}
\makeatletter
\renewcommand\section{\@startsection {section}{1}{\z@}%
{-2.5ex \@plus -1ex \@minus -.2ex}%
{2ex \@plus.2ex}%
{\centering\scshape}}
\renewcommand{\doitext}{}
\hypersetup{
colorlinks,
linkcolor={black},
citecolor={black},
urlcolor={blue}
}
\title{Public Git Archive: a Big Code dataset for all}
\begin{document}
\author{\IEEEauthorblockN{Xxxxx Xxxxxxxx\IEEEauthorrefmark{1},
Yyyyy Yyyyy\IEEEauthorrefmark{2}}
\IEEEauthorblockA{zzzzzzz,
Zzzzz, Zzzzz\\
\IEEEauthorrefmark{1}\texttt{xxxxx@xxxxxx.xxx},
\IEEEauthorrefmark{2}\texttt{yyyyy@yyyyyy.yyy}}
}
\maketitle
\begin{abstract}
The number of open source software projects has been growing exponentially. The major online software repository host, GitHub, has accumulated tens of millions of publicly available Git version-controlled repositories. Although the research potential enabled by the available open source code is clearly substantial, no significant large-scale open source code datasets exist. In this paper, we present the Public Git Archive, a dataset of 186,895 top-starred Git repositories from GitHub. We describe the novel data retrieval pipeline to reproduce it. We also elaborate on the strategy for keeping the dataset up to date and on the related legal issues. The Public Git Archive occupies 2.6 TB on disk and is an order of magnitude larger than the current source code datasets. The dataset is made available through HTTP and provides the source code of the projects, the related metadata, and the development history. The data retrieval pipeline employs an optimized worker queue model and an optimized archive format to efficiently store forked Git repositories, reducing the amount of data to download and persist. Public Git Archive aims to open a myriad of new opportunities for ``Big Code'' research.
\end{abstract}
\begin{IEEEkeywords}
source code, git, GitHub, software repositories, development history, open dataset.
\end{IEEEkeywords}
\section{Introduction}
Big Code is revolutionizing software development. The revolution began with GitHub and its extensive collection of Git repositories. More than 24 million developers collaborated on over 67 million projects during 2017 \cite{octoverse}. GitHub has made version control accessible and, therefore, universal. The next stage of the revolution is permitting the automatic analysis of source code at scale, to support data-driven language design, to infer best (and worst) practices, and to provide the raw data for data-hungry machine learning techniques. These techniques will be the basis of the next generation of development tools \cite{Sutton13, Barr16} and require vast source code archives to be programmatically accessible for analysis.
% * <marcelo@sourced.tech> 2018-02-05T18:15:48.049Z:
%
% > GitHub
% In the first mention worth to add a footnote with the link
% Not enough space, the content is squeezed into 4 pages - Vadim.
%
% ^.
The GHTorrent project \cite{Gousi13} took the first steps in this direction, focusing on metadata in order to be scalable. Current source code datasets typically contain tens of thousands of projects at most \cite{Sutton13} and are dedicated to particular programming languages such as Java or JavaScript \cite{Raychev15}, thus lacking diversity and drawing criticism \cite{Cosentino16}. Software Heritage \cite{dicosmo:hal-01590958} is a recent attempt to archive all the open source code ever written; however, it has not published a public dataset yet.
We present the Public Git Archive, the first Big Code dataset designed for programmatic analysis at scale. It is by far the largest curated archive of top-rated\footnote{We use ``top-rated'', ``top-starred'', ``top-bookmarked'' and ``having most stargazers'' interchangeably. The number of stargazers is a proxy for the degree of public awareness and project quality within the community.} repositories on GitHub; see Table \ref{dataset_stats} for a comparison. The Public Git Archive targets large-scale quantitative research in the areas of Source Code Analysis (SCA) and Machine Learning on Source Code (MLoSC). The dataset is made available via HTTP as an index file and a set of files in the Siva format, a novel archive format tailored for storing Git repositories efficiently \cite{siva}. Every GitHub repository can be forked; forks typically introduce subtle changes not necessarily merged into the origin. The naive way to obtain forks is to clone them separately, requiring additional time and storage space. We describe the data retrieval pipeline which places forks into the original repository without mixing identities by reusing the existing Git objects \cite{6976092}. The dataset thus becomes smaller, requiring users to download and store less data.
% * <marcelo@sourced.tech> 2018-02-05T18:19:50.528Z:
%
% > forked
% Worth a footnote explaining or link to an explanation of what a fork really is
%
% Not enough space, the content is squeezed into 4 pages - Vadim.
%
% Francesc: I think it's safe to assume anyone reading this paper understands what a fork is.
% ^.
% * <marcelo@sourced.tech> 2018-02-05T18:14:54.205Z:
%
% > and curated
% only curated, best curated, …?
%
% This is academia cliche - Vadim.
%
% ^.
% * <marcelo@sourced.tech> 2018-02-05T18:13:08.275Z:
%
% > amendable
% What do you want to mean by that? Very weird in this context.
%
% Ask Earl Barr - I copied his text here. - Vadim.
%
% ^.
The main contributions of the paper are:
\begin{itemize}
\item The Public Git Archive dataset, which is the largest collection of Git repositories to date available for download.
\item The data retrieval pipeline which produces this dataset. Each part of that pipeline can scale horizontally to process millions of repositories.
\item The Siva repository archival format used in this dataset. This format allows forks to be stored efficiently.
\end{itemize}
\newcolumntype{M}[1]{>{\raggedright}m{#1}}
\bgroup
\def\arraystretch{1}
\begin{table*}
\begin{center}
\begin{tabular}{M{2.8cm} r r r r}
& \textbf{Qualitas Corpus} \cite{QualitasCorpus:APSEC:2010} & \textbf{Sourcerer} \cite{6976088} & \textbf{GitHub Java Corpus} \cite{Sutton13} & \textbf{Public Git Archive}\tabularnewline
\hline
Number of projects & 111 & 19,233 & 14,807 & 186,895\tabularnewline
Latest release (year) & 2013 & 2014 & 2013 & 2018 \tabularnewline
Code language & Java & Java & Java & 454 distinct \tabularnewline
Development history & No & No & No & Yes \tabularnewline
Number of files, million & 0.177 & 1.9 & 1.5 & 44.4 (HEAD) \tabularnewline
Lines of code, million & 37.1 & 320 & 352 & 12,100 (HEAD) \tabularnewline
Storage size & 1.3 GB & 19 GB & 14 GB & 2.6 TB \tabularnewline
\hline
\end{tabular}
\end{center}
\caption{Comparison of datasets}
\label{dataset_stats}
\vspace*{-2.5em}
\end{table*}
\egroup
\section{Dataset production}
The dataset production consists of three steps, as follows.
\subsection{Compiling the list of repositories}
Similarly to existing research on GitHub mining \cite{Padhye:2014:SEC:2597073.2597113}, the dataset focuses on the top-starred repositories. To compose the list of repository URLs, we make use of the metadata provided by GHTorrent \cite{Gousi13}: a scalable, queryable, offline database generated by listening to events through the GitHub API, available to the research community as a data-on-demand service \cite{Gousi14}. The repository list for the Public Git Archive is based on GHTorrent's MySQL dump dated January $1^{st}$, 2018.
We created a command-line application which streams the compressed GHTorrent MySQL dump, reads and processes the needed files, and stores the intermediate tables on disk. This tool can also be used to filter repositories based on the number of stargazers and on chosen programming languages, taking the intermediate tables as input. The Public Git Archive does not filter repositories by language but discards repositories with fewer than 50 stargazers. The resulting list contains 190,051 Git repository URLs.
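As a rough illustration of this filtering step, the sketch below keeps repositories with at least 50 stargazers from a hypothetical pre-aggregated intermediate table; the file name and column names are placeholders rather than the actual output of our tool.
\begin{verbatim}
import pandas as pd

# Hypothetical intermediate table: one row per repository
# with its aggregated number of stargazers.
repos = pd.read_csv("intermediate/repositories.csv")
# assumed columns: url, stargazers

# Keep repositories with at least 50 stargazers.
selected = repos[repos["stargazers"] >= 50]
selected["url"].to_csv("repository_list.txt",
                       index=False, header=False)
\end{verbatim}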
% * <marcelo@sourced.tech> 2018-02-05T18:38:05.045Z:
%
% > 190,051 Git repository URLs.
% Doesn't match the number from the abstract, are we talking about different things (URLs vs actual repos)? Check
%
% I checked and this is intended. - Vadim.
%
% Could you expand why those numbers are different? - Francesc
% ^.
\subsection{Cloning Git repositories}
Once the list of repository URLs is produced, we fetch the actual data with \texttt{borges} \cite{borges}: a container-friendly distributed system that clones and stores Git repositories at scale.
% Why "container-friendly"? I think that could be dropped - Francesc
Borges is designed as two separate standalone services: a producer and a consumer. The producer reads the repository URLs, determines which repositories should be processed next, and adds them as jobs to a message queue for the consumer. The consumer dispatches jobs to its thread worker pool. Multiple producers and consumers can run in parallel; the message queue is also scalable. A job is a request to update a repository, either new or existing. Each Git remote is fetched and each reference is pushed to the corresponding \textit{rooted repository}. We store all references from possibly different repositories that share the same initial commit, the \textit{root} (Fig. \ref{rooted_repo}), in a single Git repository, thus reducing data duplication.
% The producer reads URLs ... what are those URLs? The ones pointing to git repos? How does the producer decide
% what repo should be processed next? - Francesc
\begin{figure}[H]
\begin{center}
\includegraphics[width=0.48\textwidth]{rooted_repo}
\end{center}
\vspace*{-1em}
\caption{Rooted Repository}
\label{rooted_repo}
\end{figure}
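As an illustration of how repositories can be grouped by their shared initial commit, the sketch below uses the standard \texttt{git rev-list --max-parents=0} command on local clones; the paths are hypothetical and the snippet is not part of borges itself.
\begin{verbatim}
import subprocess
from collections import defaultdict

def root_commits(repo_path):
    # List the parentless (initial) commits of a local clone.
    out = subprocess.run(
        ["git", "-C", repo_path, "rev-list",
         "--max-parents=0", "HEAD"],
        capture_output=True, text=True, check=True)
    return out.stdout.split()

# Group clones sharing a root commit into one rooted repository.
rooted = defaultdict(list)
for path in ["clones/project", "clones/project-fork"]:
    for root in root_commits(path):
        rooted[root].append(path)
\end{verbatim}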
Subsequently, Borges consumer workers store the Git packfiles of the rooted repositories as Siva files. Siva is an archival format \cite{siva} similar to \texttt{tar} and \texttt{zip}: it allows constant-time random file access, concatenation of archive files, and seekable access to the contained files, which are written verbatim since packfiles are already compressed with \texttt{zlib}. Siva makes it possible to store rooted repositories in an efficient and convenient way with minimal storage overhead. Internally, placing Git repositories inside Siva files is implemented as a \texttt{git push}. Borges can store Siva files in the local file system or in the Hadoop Distributed File System (HDFS).
The repositories in the Public Git Archive were cloned in late January 2018. A total of 8 consumers ran 32 threads each for one week, utilizing a 1 Gbps internet connection. The ``smart HTTP'' Git protocol was used. The bulk download of 2.6 TB of data at the same connection speed takes under 6 hours, less than 1\% of the initial retrieval time normalized to a single consumer. The storage space saved due to fork embedding is 6.4\% of the total. This small percentage is explained by the fact that forks are usually not highly rated. We plan to include all existing forks in the future.
% * <marcelo@sourced.tech> 2018-02-05T18:58:24.300Z:
%
% > We plan to include all the forks in the future.
% What do you mean by that? Unclear what is missing now or the intent of this.
%
% No idea how to not let this single remark grow into 2 sentences - it will kill the layout. - Vadim.
%
% ^.
% * <marcelo@sourced.tech> 2018-02-05T18:55:36.880Z:
%
% > smart HTTP
% Is this something obvious/usual?
%
% Git internals. We have to mention this. No space to go deeper - Vadim.
%
% ^.
From the initial 190,051 URLs, 3,156 had become inaccessible by the time of cloning, including 82 removed for legal reasons (HTTP 451 error messages returned by the server). The final number of repositories cloned is 186,895, spread across 275,000 Siva files. 88\% of the repositories were cloned within the first 24 hours, and they constitute 46\% of the final dataset size.
\subsection{Generating the index file}
Users of the Public Git Archive might want to download only a subset of the collected terabytes of Siva files. The most common request the authors received was to filter files by specific programming languages. To support this, we make a pass over the finalized Siva files and generate a CSV file with per-repository metadata, detected license information, and basic statistics such as the number of files, lines, and bytes per programming language. Each row in the CSV file links to the Siva files which contain the Git references of the corresponding repository. This allows queries on the index file to determine the list of Siva files to download. The columns of the CSV file are explained in Table \ref{csv_columns}.
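As an example of such a query, the sketch below selects the Siva files of repositories whose default HEAD contains Java code; it assumes the index file is named \texttt{index.csv} and that the \textit{langs} and \textit{siva\_filenames} columns hold comma-separated lists, which is an assumption about the encoding rather than a specification.
\begin{verbatim}
import pandas as pd

index = pd.read_csv("index.csv")  # the index file

# Repositories whose default HEAD contains Java
# (the in-cell list encoding is assumed).
java_repos = index[index["langs"].fillna("").apply(
    lambda cell: "Java" in cell.split(","))]

# Collect the Siva files to download for those repositories.
siva_files = set()
for cell in java_repos["siva_filenames"]:
    siva_files.update(cell.split(","))
\end{verbatim}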
\bgroup
\def\arraystretch{1.5}
\begin{table}[b]
\vspace*{-1em}
\begin{center}
\begin{tabular}{M{2.9cm} M{5.4cm}}
\textbf{Column name} & \textbf{Description} \tabularnewline
\vspace*{-1.5em}
\rule{8.6cm}{.4pt} \tabularnewline
$url$ & URL of the GitHub repository. \tabularnewline
$siva\_filenames$ & Siva files which contain parts of that repo. \tabularnewline
$file\_count$ & Number of files in default HEAD. \tabularnewline
$langs$ & Languages encountered in default HEAD. \tabularnewline
$langs\_$$\{byte,lines,files\}$\\$\_count$ & Byte, line, file counts per each language, in the same order as \textit{langs}.\tabularnewline
$commits\_count$ & Number of the unique commits in the Siva files which refer to that repository. \tabularnewline
$branches\_count$ & Number of references, tags excluded. \tabularnewline
$fork\_count$ & Number of remotes in the referring Siva files. \tabularnewline
$\{empty,code,comment\}$\\
$\_lines\_count$ & Number of empty, code, commented lines in default HEAD. \tabularnewline
$license$ & License names and corresponding confidences. \tabularnewline
\vspace*{-1.5em}
\rule{8.6cm}{.4pt}
\end{tabular}
\end{center}
\caption{CSV columns}
\label{csv_columns}
\vspace*{-1em}
\end{table}
\egroup
\section{Using Public Git Archive}
The download links for the dataset, as well as all the relevant tools and scripts to reproduce it, are hosted in the official repository on GitHub\footnote{\href{https://github.com/xxxx/datasets}{github.com/xxxx/datasets}}. Public Git Archive consists of (a) 275,000 Siva files (2.6 TB) with Git repositories and (b) the index file in CSV format. We also created a command line application to automate, accelerate, and simplify downloading (a) and (b). In the columns of the CSV file, \textit{HEAD} denotes the latest commit of a branch; \textit{default HEAD} means the latest commit of the default branch. The default branch corresponds to the reference marked as the main branch on GitHub. Languages were detected with \textbf{enry} \cite{enry}. Licenses were detected with \textbf{go-license-detector} \cite{go-license-detector}. The lines of code were counted with \textbf{gocloc} \cite{gocloc}. \textbf{No GitHub API was used}, as we plan to extend the dataset to sources other than GitHub.
% "The dataset download links" is a very confusing sentence since download and links can be verbs or nouns.
% Please rephrase. - Francesc
The aggregated programming language statistics are depicted in Fig. \ref{langhist}.
Once the selected Siva files are downloaded, users can work with the dataset using \textbf{engine} \cite{engine}, a plug-in for Apache Spark which reads Siva files as a Spark pipeline source. This makes it possible to execute conventional Spark or PySpark code to analyze Git repositories. It is also possible to unpack Git repositories from Siva files using the Siva Go API or from the command line.
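A minimal PySpark sketch along these lines is shown below; it assumes that the engine exposes an \texttt{Engine} entry point in its Python bindings and that the Siva files were downloaded into a local directory. The exact API may differ between engine versions, so the snippet should be read as an illustration rather than a reference.
\begin{verbatim}
from pyspark.sql import SparkSession
from sourced.engine import Engine  # assumed Python bindings

spark = SparkSession.builder \
    .appName("pga-demo").getOrCreate()

# Point the engine at a directory of downloaded Siva files
# (the path is a placeholder).
engine = Engine(spark, "/data/pga/siva", "siva")

# Count the repositories in the downloaded subset.
print(engine.repositories.count())
\end{verbatim}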
\begin{figure}
\begin{center}
\includegraphics[width=0.48\textwidth]{lang_stats}
\end{center}
\vspace*{-1em}
\caption{\fontdimen2\font=1pt Most popular languages. 100\% is $3.5{\cdot}10^6$ files, $1.1{\cdot}10^9$ lines and 54 GB.}
\label{langhist}
\vspace*{-1em}
\end{figure}
\section{Significance}
Analysis of source code has recently received significant research attention. The areas which can benefit most from Public Git Archive are statistical machine learning and natural language processing on source code. For example, source code modeling studies \cite{Sutton13, Barr16} have shown that the performance of $n$-gram models critically depends on the dataset size. The presented dataset can therefore enhance research on topics like automatic naming suggestion \cite{Barr15}, program prediction \cite{Raychev15}, topic modeling and semantic clustering \cite{Kuhn07}, bug detection \cite{gao:icse:2017}, and automated software transpilation \cite{barr:issta:15}. It can also provide language designers with valuable insights into how their languages are used in the open source community.
Another promising research direction is inter-project source code clone detection. Social programming platforms with minimal boundaries between projects, such as GitHub, have facilitated code reuse across multiple repositories. A number of studies have been carried out on these ecosystems in recent years \cite{Nguyen:2013:SRC:3107656.3107682}. Code clones were first studied within single projects but, now that GitHub has emerged, several different causes of duplicated code snippets have been explored, e.g. ``accidental'' clones due to imprecise API usage protocols \cite{1541846} or automatic program repair \cite{6035728}. Public Git Archive enables the research community to study source code clones across project boundaries, without being limited to a single language and with large graphs of more than 10,000 projects \cite{Gharehyazie:2017:HCC:3104188.3104225}.
% Consider replacing `clone` with a different (less overloaded) word. Maybe copies?
% I do not see the link with auctomtati program repair - Francesc
\section{Updates}
In order to keep pace with the constantly changing open source landscape, Public Git Archive is regularly updated. There are several technical challenges with this requirement. The common way to organize dataset updates is to provide regular snapshots, as GHTorrent does, for example. However, every snapshot of our dataset would occupy considerable disk space. The solution is to provide incremental updates which consist of the difference from the previous snapshot. To our knowledge, there are two ways to implement this.
The first is to pull the changes into every packfile in every Siva file of the dataset. The \texttt{git pull} operation requires the whole new packfile to be read, which is exactly what we would like to avoid.
The second is to generate binary diffs of the Siva files. However, diffing Git packfiles is not straightforward. They are compressed, and even a single Git object removed at the beginning of a packfile changes the whole binary stream. Thus we would need to retain the old objects which are no longer referenced in the new packfile. GitHub always returns a single packfile during \texttt{git clone} and runs garbage collection from time to time, effectively breaking the binary diffs.
The technical disadvantages of the first solution are hard to avoid. The second solution sounds more promising, and we are planning to research it. The current plan for updating Public Git Archive is to publish complete snapshots but limit their lifetime. There will be Long Term Support (LTS) snapshots with an extended lifetime, and researchers are encouraged to use them. The exact schedule is subject to change and is updated on the Public Git Archive website.
% Add a link to the schedule page? - Francesc
\section{Privacy and licensing}
Public Git Archive contains the full commit history of each repository, including commit messages, timestamps, author names and emails. The GitHub Terms of Service (GHTS) explicitly allow passing personal information to third parties as long as the goal is research or archival \cite{ghtos}. Public Git Archive is maintained solely for research purposes, and the collected credentials are not used in any way that violates the GHTS.
As noted in Section \Romannum{2}, some developers may wish to remove their projects from the dataset. We provide a communication channel for repository removal requests; the details are given on the Public Git Archive website.
Each rooted repository inside each Siva file is licensed separately, according to the declared license of the corresponding project. Projects which do not have an explicit license are distributed under the same terms as stated in the GHTS, exclusively for research purposes. The index file is licensed under \textit{Attribution-NonCommercial-ShareAlike 4.0 International}.
\section{Limitations}
Dataset miners should take into account several potential threats to validity \cite{KGBSGD16}.
Regarding the data collection process and the traditional trade-off between freshness and curation \cite{Cosentino16}, we chose to emphasize the curation of the dataset rather than a more up-to-date but limited amount of data. We rely on GHTorrent for the list of repositories to retrieve, and thus our schedule of updates depends on theirs. The dataset and the pipeline to collect it are entirely transparent; however, the output is not always exactly the same due to GitHub's mutability, as some repositories change or disappear over time.
% What does "The dataset and the pipeline to collect it are entirely transparent" mean? - Francesc
Another notable concern is the generality of the dataset. Selecting repositories based on the number of stargazers is debatable and may introduce bias. Fair probabilistic sampling of the complete list of repositories should improve the diversity, e.g. stratified random sampling \cite{Nagappan:2013:DSE:2491411.2491415}. Other popularity indicators can be explored, such as the number of forks or accepted pull requests \cite{Sutton13}. By focusing on the number of stars as a measure of people's interest in and awareness of a project, there is a risk of missing good-quality samples. Therefore, various source code quality metrics can be considered. Finally, there are going to be duplicate files across different repositories \cite{Lopes:2017:DMC:3152284.3133908}. These suggestions constitute the basis for future work.
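To illustrate what stratified sampling could look like, the sketch below draws a fixed fraction from hypothetical star-count buckets; the file name, the \texttt{stars} column, and the bucket boundaries are illustrative assumptions, as the current index file does not include a star count.
\begin{verbatim}
import pandas as pd

# Hypothetical table with columns: url, stars.
repos = pd.read_csv("repositories_with_stars.csv")

# Bucket repositories by popularity and sample each bucket.
repos["bucket"] = pd.cut(
    repos["stars"], bins=[50, 100, 1000, 10000, 10**7])
sample = repos.groupby("bucket", observed=True) \
    .sample(frac=0.01, random_state=0)
\end{verbatim}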
\section{Conclusion}
In this paper, we presented Public Git Archive, the largest source code dataset of top-starred Git repositories, described the novel scalable pipeline to reproduce it, and introduced the tools to download and use it. Public Git Archive is made available through HTTP and includes the source code of the projects, their metadata, and their development history. The retrieval pipeline is efficient, and the dataset size is kept small thanks to the distributed cloning system and the custom Git repository archive format. We believe that Public Git Archive -- more than ten times bigger than any of the current datasets -- has the potential to boost the quality, confidence, and diversity of software engineering and mining research.
%\clearpage
\pagebreak
\bibliography{data_showcase}
\bibliographystyle{ieeetr}
\end{document}