Chubek/witty.rb

## witty.rb
#!/usr/bin/env ruby

#    ===   Witty.rb ===
#    A very simple Ruby Script
#    Author: Chubak Bidpaa (github.com/Chubek)
#
#  ** What does this do? **
#  This script demonstrates how to parse a Git index file (.git/index)
#  using nothing but the language's I/O facilities. This is perhaps best
#  done in a systems language, or a strongly-typed language where there
#  is a good distinction between integers, characters and bytes, however
#  since Ruby is a 'sweet' language, and I mean that both figuratively and
#  literally (syntactic diabetes?) I wrote the demonstration in the language.
#  One could do this in any language though. Even AWK! But I digress.
#  Enough language talk. Let's talk about .git/index, hereafter referred to as
#  `index`.
#
#  ** The Structure of `index` **
#  The structure of this file is plainly explained at this page:
#  https://git-scm.com/docs/index-format
#  It's nothing out of the ordinary. It is your regular binary file.
#  It is not a 'database'. A database file must have a structural form.
#  `index` is very structurally loose. It's just a linear list of items,
#  preceded by a 12-byte header: a 4-byte magic ('DIRC'), then a 32-bit
#  version number, then the number of items in the 'list', also 32 bits.
#
#  ___NOTE___: Besides the list, there's an 'extensions' section, which really
#  gives `index` no shot at being a genuine database! In this script, we do NOT
#  parse the extensions, because they may or may not occur, and they are beside
#  the point of gaining info on the files in our repo anyway.
#
#  The 'list' is a hodgepodge of unsigned 32-bit integers, 16-bit flags, padding,
#  and one null-terminated string which is the path to the file FROM THE ROOT. That means
#  absolute paths, and by that I mean 'POSIX absolute paths', are forbidden in this pathname.
#  Everything is given from the root of the repository, that is, the directory where '.git'
#  is located.
#
#  From this description, Git seems very hostile to non-POSIX systems. Windows
#  paths do not follow the POSIX conventions! But I guess the people who ported
#  this goddamn git of a software to that goddamn git of an operating system knew how to
#  deal with it (I have not used PipDooze in several years, but IIRC, git only works in
#  the Windows version of Bash? Dunno).
#
#  ** The Format of Each Index Entry **
#  This table explains the format of each index entry:
#
#     Field Description
#     -------------------------------------------------------------------
# 1.  32-bit ctime seconds, the last time a file's metadata changed
#       this is stat(2) data
# 2.  32-bit ctime nanosecond fractions
#       this is stat(2) data
# 3.  32-bit mtime seconds, the last time a file's data changed
#       this is stat(2) data
# 4.  32-bit mtime nanosecond fractions
#       this is stat(2) data
# 5.  32-bit dev
#       this is stat(2) data
# 6.  32-bit ino
#       this is stat(2) data
# 7.  32-bit mode, split into (high to low bits)
#       |  4-bit object type
#       |    valid values in binary are 1000 (regular file), 1010 (symbolic link)
#       |    and 1110 (gitlink)
#       |  3-bit unused
#       |  9-bit unix permission. Only 0755 and 0644 are valid for regular files.
#       |    Symbolic links and gitlinks have value 0 in this field.
# 8.  32-bit uid
#       this is stat(2) data
# 9.  32-bit gid
#       this is stat(2) data
# 10. 32-bit file size
#       This is the on-disk size from stat(2), truncated to 32-bit.
# 11. Object name for the represented object
# 12. A 16-bit 'flags' field split into (high to low bits)
#       |  1-bit assume-valid flag
#       |  1-bit extended flag (must be zero in version 2)
#       |  2-bit stage (during merge)
#       |  12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
#       |    is stored in this field.
# 13. (Version 3 or later) A 16-bit field, only applicable if the
#     "extended flag" above is 1, split into (high to low bits).
#       |  1-bit reserved for future
#       |  1-bit skip-worktree flag (used by sparse checkout)
#       |  1-bit intent-to-add flag (used by "git add -N")
#       |  13-bit unused, must be zero
# 14. Entry path name (variable length) relative to top level directory
#       (without leading slash). '/' is used as path separator. The special
#       path components ".", ".." and ".git" (without quotes) are disallowed.
#       Trailing slash is also disallowed.
# 15. (Version 4) In version 4, the entry path name is prefix-compressed
#       relative to the path name for the previous entry (the very first
#       entry is encoded as if the path name for the previous entry is an
#       empty string).  At the beginning of an entry, an integer N in the
#       variable width encoding (the same encoding as the offset is encoded
#       for OFS_DELTA pack entries; see pack-format.txt) is stored, followed
#       by a NUL-terminated string S.  Removing N bytes from the end of the
#       path name for the previous entry, and replacing it with the string S
#       yields the path name for this entry.
# 16. 1-8 nul bytes as necessary to pad the entry to a multiple of eight bytes
#     while keeping the name NUL-terminated.
#
#   I think this is clear enough, but let's address the pathname. The pathname is
#   a null-terminated string, which means the authors of Git did not reserve a fixed,
#   FILENAME_MAX-sized field for it. Keep in mind that the value of FILENAME_MAX
#   differs from platform to platform. And well,
#   MAYBE some pesky person is using a different file system, on the same OS that defines
#   FILENAME_MAX for its native filesystem? I mean, WHO KNOWS?
#
#   But there's another reason a null-terminated string is chosen over a fixed, FILENAME_MAX-sized
#   field. And that's got to do with encoding, and multibyte strings. Git does not
#   really care about what the encoding of your pathname is, or if it's multibyte or ASCII or
#   Extended ASCII. It just puts a null at the end of the byte sequence that represents the path.
#
#   In a way, null-term strings are cancer. But this is a good place for their use.
#
#   Now, as you can clearly read in the table given above, the low 12 bits of the 'flags' field
#   give the exact length of the string (as long as it is shorter than 0xFFF bytes). And in this
#   script, we read the name by that length. Mainly because, again, null-term strings are CANCER.
#   I mainly code in C and I use them a lot, but see, Ruby does not
#   support them, Python does not support them, Scheme does not support them, Java doesn't (?),
#   my mom and your mom do not support them, only systems languages support them, like D, Rust, etc.
#
#   So anyways, here I present to you Witty.rb, a Ruby script that reads `index`, aka .git/index.
#   Do with the info as you wish!


require 'pathname'

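# Read exactly n bytes from standard input, failing loudly on a truncated file.
def read_n_bytes(n)
  data = STDIN.read(n)
  raise EOFError, "Unexpected end of index file" if data.nil? || data.bytesize < n
  data
end

# All multi-byte integers in the index are stored in network (big-endian) byte order.
def read_uint16
  read_n_bytes(2).unpack('n').first
end

def read_uint32
  read_n_bytes(4).unpack('N').first
end

def read_uint64
  read_n_bytes(8).unpack('Q>').first
end

# Read bytes up to (and consuming) the next NUL terminator.
def read_string
  str = String.new
  loop do
    char = read_n_bytes(1)
    break if char == "\x00"
    str << char
  end
  str
end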

def match_signature
  raise "Not a valid Git Index file" if read_uint32 != "DIRC".unpack('N').first
end

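def read_version_number
  version = read_uint32
  # Version 4 prefix-compresses path names and drops the padding; the entry
  # parser below does not implement that, so only versions 2 and 3 are accepted.
  raise "Unsupported Git Index version: #{version}" unless [2, 3].include?(version)
  version
end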

def read_number_of_entries
  read_uint32
end

def read_index_entry
  entry = Hash.new
  entry[:ctime_seconds] = read_uint32
  entry[:ctime_nanoseconds] = read_uint32
  entry[:mtime_seconds] = read_uint32
  entry[:mtime_nanoseconds] = read_uint32
  entry[:dev] = read_uint32
  entry[:inode] = read_uint32
  mode = read_uint32
  entry[:type] = (mode >> 12) & 0b1111
  entry[:permissions] = mode & 0b111111111
  entry[:uid] = read_uint32
  entry[:gid] = read_uint32
  entry[:size] = read_uint32
  entry[:object_name] = read_n_bytes(20).unpack('H*').first
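  flags = read_uint16
  entry[:assume_valid] = (flags >> 15) & 0b1
  entry[:extended] = (flags >> 14) & 0b1
  entry[:stage] = (flags >> 12) & 0b11
  name_length = flags & 0xFFF
  # Fixed part of an entry: ten 32-bit fields, a 20-byte object name, 16-bit flags.
  header_length = (4 * 10) + 20 + 2
  # Version 3 and later: a second 16-bit field follows when the extended bit is set.
  if entry[:extended] == 1
    extended_flags = read_uint16
    entry[:skip_worktree] = (extended_flags >> 14) & 0b1
    entry[:intent_to_add] = (extended_flags >> 13) & 0b1
    header_length += 2
  end

  if name_length < 0xFFF
    entry[:name] = read_n_bytes(name_length).force_encoding('UTF-8')
    # 1-8 NUL bytes pad the entry to a multiple of eight bytes.
    read_n_bytes(8 - (header_length + name_length) % 8)
  else
    # 0xFFF means "0xFFF bytes or longer": scan for the terminating NUL instead.
    name = read_string
    entry[:name] = name.force_encoding('UTF-8')
    # read_string already consumed one NUL; skip the rest of the padding.
    read_n_bytes(7 - (header_length + name.bytesize) % 8)
  end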

  entry
end

def read_all_paths
  match_signature
  read_version_number # raises on versions this script cannot parse
  no_of_entries = read_number_of_entries

  Array.new(no_of_entries) { Pathname.new(read_index_entry[:name]) }
end


# Call `read_all_paths` (with .git/index piped to standard input) and do as you wish
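
The entry layout described in the header comment can be sanity-checked without a real repository. What follows is a minimal sketch, not part of the original script: it builds a two-entry version-2 index in memory and parses it back, which exercises the 62-byte fixed part of an entry, the 12-bit name length and the 1-8 bytes of NUL padding. `pack_entry` and `parse_entry` are hypothetical helpers invented for this illustration. They work on an IO-like object (a `StringIO` here) rather than `STDIN`, and they deliberately ignore extended flags, names of 0xFFF bytes or more, and version 4.

```ruby
require 'stringio'

# Hypothetical helper: serialise one version-2 index entry (no extended flags).
def pack_entry(path, mode: 0o100644, size: 0, sha_hex: '0' * 40)
  # Ten 32-bit big-endian fields: ctime, mtime, dev, ino, mode, uid, gid, size.
  stat = [0, 0, 0, 0, 0, 0, mode, 0, 0, size].pack('N10')
  # Flags: stage 0, assume-valid and extended bits clear, low 12 bits = name length.
  flags = [[path.bytesize, 0xFFF].min].pack('n')
  body = stat + [sha_hex].pack('H*') + flags + path.b
  # 1-8 NUL bytes bring the entry to a multiple of eight bytes.
  body + "\x00" * (8 - body.bytesize % 8)
end

# Hypothetical helper: parse one version-2 entry from any IO-like object.
def parse_entry(io)
  fields = io.read(40).unpack('N10')
  sha = io.read(20).unpack1('H*')
  flags = io.read(2).unpack1('n')
  name_length = flags & 0xFFF
  name = io.read(name_length)
  io.read(8 - (62 + name_length) % 8) # skip the NUL padding
  { mode: fields[6], size: fields[9], object_name: sha,
    stage: (flags >> 12) & 0b11, name: name }
end

# Header: 'DIRC' signature, version 2, two entries.
header = 'DIRC' + [2, 2].pack('NN')
index = StringIO.new(
  header +
  pack_entry('README.md', size: 12) +
  pack_entry('lib/witty.rb', mode: 0o100755)
)

signature, version, count = index.read(12).unpack('a4NN')
entries = Array.new(count) { parse_entry(index) }
entries.each { |e| puts format('%06o %s', e[:mode], e[:name]) }
```

Run as is, this prints `100644 README.md` and `100755 lib/witty.rb`. The first entry is 62 + 9 = 71 bytes before padding and gets one NUL; the second is 62 + 12 = 74 bytes and gets six, so both land on an eight-byte boundary.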