Skip to content

Instantly share code, notes, and snippets.

@despens
Created March 14, 2012 16:40
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save despens/2037764 to your computer and use it in GitHub Desktop.
Save despens/2037764 to your computer and use it in GitHub Desktop.
Table to store all leftover Geocities Files, existing ones and collections that still might pop up.
-- Table: files
-- DROP TABLE files;
CREATE TABLE files
(
-- Natural key for version ID.
-- Immutable for file and metadata.
-- Md5 checksum of concatenated fields
-- file_id, mime_type, mime_subtype, encoding, path_local,
-- archive_agent
archive_id character varying(32) NOT NULL,
-- Natural key for file ID.
-- Immutable for each ingested file.
-- md5 checksum of concatenated fields:
-- collection, hurl, last_modified, size, checksum_md5
file_id character varying(32) NOT NULL,
--
-- The following fields are used to calculate 'file_id'.
-- Ingested files never change, only their metadata
-- might be corrected later. The 'file_id' therefore can
-- be used to assign tagged properties to each file.
--
-- Name of collection the file belongs to.
-- Each collection is one harvesting process or bundled release,
-- for example 'geocities.archiveteam.torrent', 'JNC grab', 'oocities'.
collection text NOT NULL,
-- The URL the file was harvested from, guessed from its file name.
hurl text NOT NULL,
-- The last modified date of the file.
last_modified timestamp without time zone NOT NULL,
-- The file's size, as returned by GNU 'stat'.
size bigint NOT NULL
-- Md5 checksum of the file's contents.
checksum_md5 character varying(32) NOT NULL,
--
-- The following fields are used to calculate 'archive_id'.
-- They represent metadata that might be changed and corrected
-- after ingest and therefore require 'versioning'.
-- The date 'archive_ts' is not part of this checksum, as the
-- events can happen any time, for example by running a
-- script that is distributed after the release of the
-- database.
--
-- The mime type of the file, for example guessed by the GNU 'file'
-- command. Together with 'mime_subtype' and 'charset', this can be
-- used to serve the files via http.
-- Mime types are 'application', 'audio', 'image', 'text' or 'video'.
mime_type text NOT NULL,
-- See above. Subtypes are the second part of the mime type string
-- after the slash, for example in 'text/html' this is 'html'.
mime_subtype text NOT NULL,
-- Character encoding for the file, only relevant with mime types
-- of 'text'. Values might be 'us-ascii', 'utf-8', 'iso-8859-1', ...
-- The value can be guessed for example by the GNU 'file' command as
-- well.
charset text NOT NULL,
-- The file's path, relative to the local root of all collections.
path_local text NOT NULL,
-- A string identifying the agent that created this database row.
-- This might be names of persons, if the row was created manually,
-- or the name of a script if the row was created automatically.
archive_agent text NOT NULL,
-- Timestamp when this database row was created. Just for your info!
archive_ts timestamp with time zone NOT NULL DEFAULT now(),
-- The field 'archive_id' is the primary key.
CONSTRAINT files_pkey PRIMARY KEY (archive_id)
)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment