@chrisdaaz
Last active November 14, 2021 06:00
Week one reading for "Git and GitHub for Librarians" course.

Git Version Control for Librarians

Git provides librarians with a workflow for managing, documenting, and distributing the full history and lifecycle of a digital project. While Git was designed for software development teams, its support for transparency and collaboration has made it a popular tool for scholars and librarians who work with digital collections, research, writing, data analysis, and websites.

Git and GitHub are not the same thing. Git is a command-line version control system for managing source code history. GitHub is a website where you can upload, download, and collaborate on projects that are managed by Git.

GitHub is a company that provides a web-based platform for projects that use Git. Several companies provide hosting and features for Git-based projects, such as GitLab and Bitbucket, but we will focus on GitHub because it is the most popular choice for open source software communities working in the academic or library technology space. GitHub offers many useful social, technical, educational, and management features for Git-based work. Most of the concepts we will cover in the course are applicable to non-GitHub hosting platforms, but we will cover a few features that are exclusive to GitHub.

Here are some examples of library scenarios for using Git repositories and Git hosting platforms:

  • Publishing spreadsheets and metadata about processed collections
  • Sharing documentation for library systems
  • Distributing OCR'd text extracted from digitized materials [example]
  • Storing scripts for executing metadata ingests and transformations [example, example]
  • Archiving the source texts of open educational resources [example]
  • Collaborating with faculty or students on a research project

Beyond the specifics of Git and GitHub, there is value in having a stronger familiarity with open source technologies and terminology. Having basic command-line knowledge and skills can create new opportunities for professional development and growth. Even if you're not a coder, I hope this resource equips you for future collaborations on library technology projects.

What is Version Control?

"Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later." Pro Git

Version control is the practice of tracking changes within project files over time. A version control system is a type of software that automates and enforces conventions for creating version histories. It lets us compare and restore versions of our projects. Although modern version control systems like Git can manage any file type, they are especially helpful for managing projects built from plain-text files, such as .xml, .html, .py, .csv, and .txt files, among many others.

Version control systems automatically track files in order to help us with questions like:

  • What file, line, or character was changed?
  • When did the change occur?
  • Who made the change?
  • Why was the change made?

Version control protects a project from human errors and data loss resulting from hardware or network failures because we can go back to any point in the project's history to correct or investigate where things went wrong.
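To make the four questions above concrete, here is a minimal sketch of the kind of record a version control system keeps for each change. The `ChangeRecord` class, the `history_for` function, and the sample entries are hypothetical names and data invented for illustration; Git stores this information in its own format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeRecord:
    """One entry in a hypothetical, hand-rolled change log (not Git's format)."""
    path: str        # what file was changed
    when: datetime   # when the change occurred
    who: str         # who made the change
    why: str         # why the change was made


def history_for(log, path):
    """Return the records for one file, oldest first."""
    return sorted((r for r in log if r.path == path), key=lambda r: r.when)


# Invented sample data for illustration only.
log = [
    ChangeRecord("metadata.csv", datetime(2021, 3, 2, tzinfo=timezone.utc),
                 "sam", "Normalize date fields to ISO 8601"),
    ChangeRecord("README.md", datetime(2021, 3, 1, tzinfo=timezone.utc),
                 "alex", "Describe the collection"),
    ChangeRecord("metadata.csv", datetime(2021, 3, 1, tzinfo=timezone.utc),
                 "alex", "Add first 50 records"),
]

for record in history_for(log, "metadata.csv"):
    print(record.when.date(), record.who, "-", record.why)
```

Keeping a log like this by hand is tedious and error-prone, which is why it helps to have software record it automatically with every saved version.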

"Documents" by XKCD, licensed under a Creative Commons Attribution-NonCommercial 2.5 License

Project teams that do not use any form of version control often run into problems like not knowing which changes have been made. If you have never used version control software, you may have instead marked the versions of files within the files' names, perhaps with suffixes like "FINAL" or "v2", and then later had to deal with a new, final version. Perhaps you've created new files with some experimental content, thinking that there may be a use for it later. Manually wrangling multiple versions of the same project can be time-consuming and frustrating, and it invites mistakes, like losing the version you need. Version control is a way out of these problems.

Final doc

Git helps us avoid this problem.

Benefits of Version Control Systems

Version control systems offer librarians benefits directly related to the management and preservation of digital projects, such as:

  • Having access to the complete long-term history of every file within a project, including creations and deletions of files in addition to the edits to their contents
  • Isolating independent contributions to team projects without the risk of deleting other people's work
  • Reverting to any previous version in a project's history
  • Linking specific changes in a project to the team's project management system for tracking progress, contributions, and milestones

Distributed Version Control Systems

There are many types of version control systems, but we will be focusing on a distributed version control system: Git.

Distributed version control systems package the project's history alongside the project files, which creates backup redundancy by default. Each team member who has a copy of the project files also has a copy of the project's version history, so any team member can fully restore the project if there is ever a server or system failure.

Distributed Version Control System Diagram

For example, when two people collaborate on a project using Git and GitHub, the project files and the project's history are copied to three locations: each person's computer and GitHub. Therefore, if one person loses their copy of the project, there is a copy on their collaborator's computer and online at GitHub. This means that nearly every Git action can take place on your computer; no need to visit a server to browse the project's history or take new snapshots of the project's files. The only time you would need to be connected to the internet is to push or pull updates to or from GitHub.

The term "local" used throughout this course refers to the repositories on our computers, as opposed to repositories on hosting platforms, like GitHub.

What is Git?

"Git is one of the most widely used version control systems in the world. It is a free, open source tool that can be downloaded to your local machine and used for logging all changes made to a group of designated computer files (referred to as a “git repository” or “repo” for short) over time. It can be used to control file versions locally by you alone on your computer, but is perhaps most powerful when employed to coordinate simultaneous work on a group of files shared among distributed groups of people." Library Carpentry, 2020

Git is designed to manage the version history of software development projects, but librarians, researchers, and archivists can use it on any project involving plain text files.

Plain Text

There are two main types of documents we use to write and edit text: plain text and rich text. Plain text exposes the raw, semantic characters within a document, whereas rich text displays the formatting features and styles. For librarians, plain text offers some advantages over rich text, as Tenen and Wythoff (2014) explain:

Plain text both ensures transparency and answers the standards of long-term preservation. Microsoft Word may go the way of Word Perfect in the future, but plain text will always remain easy to read, catalog, mine, and transform. Furthermore, plain text enables easy and powerful versioning of the document, which is useful in collaboration and organizing drafts. Your plain text files will be accessible on cell phones, tablets, or, perhaps, on a low-powered terminal in some remote library. Plain text is backwards compatible and future-proof. Whatever software or hardware comes along next, it will be able to understand your plain text files.

Librarians might use plain text files for metadata migration projects, digital exhibit websites, documentation, and academic publishing. Because rich text formats package content with formatting and graphical elements, they are often distributed in binary file formats that only specific programs can read. Binary formats like Microsoft Word documents and PDF files can be stored in Git repositories, but Git cannot show us the specific changes within those files because their contents are not stored as readable lines of text. Unlike binary files, plain text files do not need any specific software to be opened and read by humans. Git works best when managing versions of plain text files; it was not designed for managing versions of binary files.
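As a rough illustration of why plain text suits version tracking, the sketch below compares two versions of a small CSV file line by line. It uses Python's standard `difflib` module as a stand-in; Git has its own comparison machinery, and the file name and metadata rows here are invented for the example.

```python
import difflib

# Two versions of a hypothetical metadata file, split into lines.
old_version = """title,creator,date
Annual Report,Library Staff,1998
Campus Map,Unknown,2001
""".splitlines(keepends=True)

new_version = """title,creator,date
Annual Report,Library Staff,1998
Campus Map,Facilities Office,2001
""".splitlines(keepends=True)

# unified_diff reports removed lines with "-" and added lines with "+".
diff = list(difflib.unified_diff(old_version, new_version,
                                 fromfile="metadata.csv (old)",
                                 tofile="metadata.csv (new)"))
print("".join(diff), end="")
```

The output marks the old "Campus Map" row with a minus sign and the corrected row with a plus sign. A line-by-line comparison like this is only meaningful when the file's contents are readable lines of text, which is not the case for a .docx or .pdf file.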

Category  | Binary Formats         | Plain Text Formats
Documents | .docx, .rtf, .pdf      | .txt, .html, .tex, .md
Data      | .xlsx, .mdb, .dta      | .csv, .xml, .json, .sql
Images    | .jpg, .tif, .gif, .png | .svg
Code      |                        | .py, .xslt, .js, .R

Table: Common Binary and Plain-text File Formats

Coming to a plain text editor from a word processing program (like Microsoft Word) might feel like writing computer code rather than text for humans. That is because there is little material difference between plain text and code. Plain text is the format software developers use to write code. But plain text editors are not exclusive to writing code or reading data; people can write fiction in plain text (and some do).

How does Git work?

"Git thinks of its data more like a series of snapshots of a miniature filesystem" Pro Git

Git provides granular insight into a project's evolution. For example, Git automatically keeps a record of the files that were changed, the characters within the files that were changed, who made the changes, and when the files were changed. Because Git runs on your computer, you don't need to be connected to the internet to record updates to your project.

Before reading more about the details of how Git works, I highly recommend watching "Git for Humans" by Alice Bartlett, a conference presentation for User Experience (UX) professionals (content warning: there is some explicit language in the talk). This book covers all of the material in Bartlett's presentation, but she does an excellent job explaining the basic concepts.

Git takes snapshots of the files within a project folder, capturing what the files look like at that moment and storing a reference to that state. Each snapshot is called a commit. The first commit contains every tracked file within the project folder; each later commit records the edits to those files and any files added to or deleted from the project folder. If a file has not changed since the previous snapshot, Git does not store it again; instead, it links to the identical version it has already stored. Each snapshot is given a hash, a reference identifier (ID), which can be used to move back and forth between versions in the project's history.
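The snapshot idea can be sketched in a few lines of Python. This is a simplified, in-memory imitation and not Git's actual storage format: the `ToyRepo` class and its `commit` and `checkout` methods are hypothetical names, and the way IDs are derived here is an assumption made to keep the example short.

```python
import hashlib


class ToyRepo:
    """A simplified imitation of Git's snapshot model (not Git's real format)."""

    def __init__(self):
        self.blobs = {}     # content hash -> file contents, each stored once
        self.commits = {}   # commit ID -> message, parent ID, and file listing
        self.head = None    # ID of the most recent commit

    def commit(self, files, message):
        """Record a snapshot of `files`, a dict mapping file names to text."""
        snapshot = {}
        for name, text in sorted(files.items()):
            blob_id = hashlib.sha1(text.encode("utf-8")).hexdigest()
            # Unchanged content hashes to the same ID, so it is not stored again.
            self.blobs.setdefault(blob_id, text)
            snapshot[name] = blob_id
        # Derive the commit ID from the snapshot, its parent, and the message.
        payload = repr((snapshot, self.head, message)).encode("utf-8")
        commit_id = hashlib.sha1(payload).hexdigest()
        self.commits[commit_id] = {"message": message, "parent": self.head,
                                   "files": snapshot}
        self.head = commit_id
        return commit_id

    def checkout(self, commit_id):
        """Rebuild the files exactly as they were at `commit_id`."""
        snapshot = self.commits[commit_id]["files"]
        return {name: self.blobs[blob_id] for name, blob_id in snapshot.items()}


repo = ToyRepo()
first = repo.commit({"README.md": "Draft\n", "data.csv": "id,title\n"}, "Start project")
second = repo.commit({"README.md": "Final\n", "data.csv": "id,title\n"}, "Revise README")

print(len(repo.blobs))                    # 3 stored contents, not 4
print(repo.checkout(first)["README.md"])  # the original text is still retrievable
```

Two commits of two files each produce only three stored contents, because data.csv did not change between snapshots. The second commit also remembers the ID of the first as its parent, which is what links snapshots into a history.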

Commit: "With Git, every time you commit, or save the state of your project, Git basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot." Pro Git

Git stores each snapshot's hash in a .git subfolder that contains the project's Git history. This subfolder is created when you run the git init command from within the project directory. In most cases, you won't ever need to go into the .git directory, and it is usually hidden by default. Keeping the version history within a subfolder of the project ensures that each copy of the project is packaged with a record of its history.

Every time Git takes a snapshot of our project, Git creates a checksum as a reference ID for that version. Checksums are used to verify the integrity of computer files, which makes it impossible for a file to change without Git knowing about it. This is the same technique used by digital preservation and archiving systems to detect file corruption. The type of checksum Git uses is a SHA-1 hash (which looks something like this: 24b9da6552252987aa493b52f8696cd6d3b00373). Git stores these hashes in a database within the .git directory in your project.

When people talk about "Git repos" or "repositories", they are essentially referring to a computer folder containing the files and subfolders that make up the source material for a project, along with the .git subfolder that stores the version snapshots (i.e., commits).
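Following that definition, a simple way to tell whether a folder is a repository is to look for the .git subfolder. The sketch below does this with Python's standard library. The function name `looks_like_git_repo` and the folder name are hypothetical, and the check is deliberately basic: some setups, such as linked worktrees and submodules, use a .git file instead of a folder, and this check does not recognize them.

```python
from pathlib import Path
import tempfile


def looks_like_git_repo(folder):
    """Return True if `folder` contains a .git subfolder."""
    return (Path(folder) / ".git").is_dir()


# Demonstration in a throwaway folder. Creating an empty ".git" folder here
# only imitates what `git init` sets up; it does not make a usable repository.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "finding-aids"
    project.mkdir()
    before = looks_like_git_repo(project)
    (project / ".git").mkdir()
    after = looks_like_git_repo(project)

print(before, after)
```

The same folder goes from an ordinary project folder to something that looks like a repository once the .git subfolder exists, which is the only structural difference between the two.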

Credits
