@reedstrm
Last active December 14, 2018 17:21
URL object hashes (shortening URL proposal)

PREAMBLE

We the developers of OpenStax/Connexions, in order to form a more readable URL, establish user harmony, ensure URL persistence, and support all the use cases, do suggest the following id hashing strategy for content objectIds.

ID FORMAT

The canonical form for an id is a uuid-v4, in the 8n-4n-4n-4n-12n format, e.g. 6f0881fc-4d30-43e1-9a3b-52a210ab5980. Python code for generating these:

  >>> import uuid
  >>> u=uuid.uuid4()
  >>> u
  UUID('6f0881fc-4d30-43e1-9a3b-52a210ab5980')
Or with an existing string:
  >>> uuid.UUID('6f0881fc-4d30-43e1-9a3b-52a210ab5980')
  UUID('6f0881fc-4d30-43e1-9a3b-52a210ab5980')

SQL code (in repository):

  repository=# select uuid_generate_v4();
             uuid_generate_v4           
  --------------------------------------
   a9b8564d-9c87-4744-af85-194ee9759e35

And the usual typecasting works:
  repository=# select '6f0881fc-4d30-43e1-9a3b-52a210ab5980'::uuid;
                   uuid                 
  --------------------------------------
   6f0881fc-4d30-43e1-9a3b-52a210ab5980

ALTERNATE ENCODING

In order to allow for shorter URLs, a base64-encoded and truncated form of the same uuid may be used. It is important to note that this is not the base64 encoding of the uuid string: it is the encoding of the underlying 128-bit number. Since we're going to be using these as URL components, we need to substitute '-' and '_' for the '+' and '/' characters (and optionally strip the '=' padding and the trailing newline).

TODO: check if there's a way to define a custom encoding 'base64url'

Python (2.x syntax):

>>> u
UUID('6f0881fc-4d30-43e1-9a3b-52a210ab5980')
>>> u.bytes.encode('base64')
'bwiB/E0wQ+GaO1KiEKtZgA==\n'
>>> b=u.bytes.encode('base64').replace('+','-').replace('/','_').replace('=','')[:-1]
>>> b
'bwiB_E0wQ-GaO1KiEKtZgA'

Converting back (this is ugly, but will not be needed very often):

>>> uuid.UUID(hex=(b.replace('_','/').replace('-','+')+'==').decode('base64').encode('hex'))
UUID('6f0881fc-4d30-43e1-9a3b-52a210ab5980')
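
On Python 3, bytes objects have no encode('base64') method, so the same round trip can be sketched with the standard base64 module instead. The helper names uuid_to_b64 and b64_to_uuid below are illustrative only and are not taken from the repository code:

```python
import base64
import uuid


def uuid_to_b64(u: uuid.UUID) -> str:
    """Encode the 16 raw bytes of a UUID as unpadded URL-safe base64."""
    # urlsafe_b64encode already uses '-' and '_' instead of '+' and '/'.
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")


def b64_to_uuid(s: str) -> uuid.UUID:
    """Decode a full 22-character unpadded id back into a UUID."""
    # Restore the '=' padding that was stripped during encoding.
    padded = s + "=" * (-len(s) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))


u = uuid.UUID("6f0881fc-4d30-43e1-9a3b-52a210ab5980")
short = uuid_to_b64(u)
print(short)               # bwiB_E0wQ-GaO1KiEKtZgA
print(b64_to_uuid(short))  # 6f0881fc-4d30-43e1-9a3b-52a210ab5980
```

Only the full 22-character form can be decoded this way; a truncated prefix has to be looked up against known ids.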

SQL:

repository=# select replace(replace(replace(encode(uuid_send('6f0881fc-4d30-43e1-9a3b-52a210ab5980'::uuid),'base64'),'+','-'),'/','_'),'=','');
        replace         
------------------------
 bwiB_E0wQ-GaO1KiEKtZgA
(1 row)

And back (again, not frequent):

repository=# select substring(decode(replace(replace('bwiB_E0wQ-GaO1KiEKtZgA','-','+'),'_','/')::text||'==','base64')::text,3)::uuid;
              substring               
--------------------------------------
 6f0881fc-4d30-43e1-9a3b-52a210ab5980

JavaScript:

http://stackoverflow.com/questions/6095115/javascript-convert-guid-in-string-format-into-base64

WHY BASE64?

All this does is demonstrate that the 22-character base64 encoding is completely equivalent to the full 36-character uuid. However, the base64 encoding has a nice property: the initial 6-8 characters are themselves random enough to serve as unique ids in the context of our repository. In fact, for existing content, 5 characters are unique (4 characters give 24 pairwise collisions). I also ran tests against randomly generated uuids: the process above found approximately one collision every 10 million ids for length 8.

Why 8, and not 6? We're going to need a lot more ids when we start serving fragments (sub page sections) as individual objects.
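
As a rough sanity check on the collision figures above, the birthday approximation gives an estimate of how many ids can be issued before prefixes start to collide. This is a sketch only: it assumes the prefix bits are uniformly random (the first 48 bits of a v4 uuid, i.e. the first 8 base64 characters, are), and the function names are made up for illustration.

```python
import math


def expected_collisions(n_ids: int, prefix_chars: int) -> float:
    """Expected number of colliding pairs among n_ids random prefixes."""
    space = 64 ** prefix_chars  # each base64 character carries 6 bits
    return n_ids * (n_ids - 1) / (2 * space)


def ids_for_half_chance(prefix_chars: int) -> float:
    """Approximate id count at which a collision becomes 50% likely."""
    space = 64 ** prefix_chars
    return math.sqrt(2 * space * math.log(2))


for length in (4, 5, 6, 8):
    print(length, round(ids_for_half_chance(length)))
print(expected_collisions(10_000_000, 8))
```

Under these assumptions a length-8 prefix reaches a 50% chance of at least one collision at roughly 20 million ids, while a length-6 prefix gets there at a few hundred thousand, which is consistent with preferring 8 once fragments multiply the number of ids.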

PROPOSED URLs

URL Examples:

Current (all College Physics, first page of Thermodynamics):

http://cnx.org/contents/031da8d3-b525-429c-80cf-6c8ed997733a@9.1:108/Introduction-to-Thermodynamics

Proposed complete:

http://cnx.org/contents/031da8d3-b525-429c-80cf-6c8ed997733a@9.1:d26dc35b-f794-427e-b39d-f4b5496fd118@4/Introduction-to-Thermodynamics

Proposed shortened:

http://cnx.org/contents/Ax2o07Ul@9.1:0m3DW_eU/Introduction-to-Thermodynamics

Note that even shorter URLs will redirect and "do the right thing". This URL:

http://cnx.org/contents/Ax2o07Ul:0m3DW_eU

will redirect to the Introduction to Thermodynamics page in the current version of the textbook.

Archive will redirect to the full canonical URL for any given object or context:page pairing.

For example, http://archive.cnx.org/contents/Ax2o07Ul:0m3DW_eU becomes:

http://archive.cnx.org/contents/031da8d3-b525-429c-80cf-6c8ed997733a@9.1:d26dc35b-f794-427e-b39d-f4b5496fd118@4/Introduction-to-Thermodynamics

Same hash encoding for extras:

Base64 ids (long and short) redirect to complete:

http://archive.cnx.org/extras/031da8d3-b525-429c-80cf-6c8ed997733a@9.1:d26dc35b-f794-427e-b39d-f4b5496fd118@4/Introduction-to-Thermodynamics

retrieves the extras info regarding the page in that particular book context, including the shortened base64 id, and any context specific info.

URLs that contain a version for the page in a context will ignore the version: any given context can only contain a single instance of any given page. Its presence in the complete form is advisory. (Contentious: good API design says return a 404 if a bad version is given.)

Title (last component of URL) is optional and ignored.
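
To make the shape of these URLs concrete, here is a minimal parsing sketch for the path forms shown in the examples above. It is not the BNF called for in the TODO list; the regular expression, the ContentRef type and the field names are assumptions made for illustration, and the trailing title is matched but discarded.

```python
import re
from typing import NamedTuple, Optional


class ContentRef(NamedTuple):
    book_id: str
    book_version: Optional[str]
    page_id: Optional[str]
    page_version: Optional[str]


# Hypothetical pattern: /contents/<id>[@<version>][:<page>[@<version>]][/<title>]
_CONTENTS_PATH = re.compile(
    r"^/contents/"
    r"(?P<book_id>[^@:/]+)"
    r"(?:@(?P<book_version>[^:/]+))?"
    r"(?::(?P<page_id>[^@/]+)(?:@(?P<page_version>[^/]+))?)?"
    r"(?:/.*)?$"  # optional title, ignored
)


def parse_contents_path(path: str) -> Optional[ContentRef]:
    """Split a /contents/ path into its id and version components."""
    m = _CONTENTS_PATH.match(path)
    if m is None:
        return None
    return ContentRef(**m.groupdict())


print(parse_contents_path(
    "/contents/Ax2o07Ul@9.1:0m3DW_eU/Introduction-to-Thermodynamics"))
print(parse_contents_path("/contents/Ax2o07Ul:0m3DW_eU"))
```

Components that are absent come back as None, which is the case where archive would be expected to redirect to a complete URL.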

Archive will interpret page numbers, and redirect to the appropriate complete URL form.

Shortened form for each object id will be served from archive as part of the /extras info for that object. Base64 version (22 chars) will be served as a new metadata field, as well as in the book JSON (base64 ids for each page in the book will be included).

It is expected that webview will maintain the short ID urls, and not redirect/rewrite them to the canonical versions. Redirection to versioned URLs is expected, however.

The response for a base64 id prefix that is too short, and therefore matches more than one object, is not currently defined.
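
Whatever response is eventually chosen, the lookup itself has three possible outcomes. The sketch below shows one way to separate them; it assumes the full 22-character ids are available as plain strings, the function name is hypothetical, and it deliberately leaves the policy for the ambiguous case to the caller.

```python
from typing import Iterable, List, Tuple


def resolve_prefix(prefix: str, full_ids: Iterable[str]) -> Tuple[str, List[str]]:
    """Classify a short id as 'not-found', 'unique' or 'ambiguous'."""
    matches = sorted(i for i in full_ids if i.startswith(prefix))
    if not matches:
        return "not-found", matches
    if len(matches) == 1:
        return "unique", matches
    # More than one object shares this prefix; the proposal does not yet
    # say what to return here, so the matches are simply handed back.
    return "ambiguous", matches


known = [
    "bwiB_E0wQ-GaO1KiEKtZgA",  # the example id from this document
    "bwiB" + "A" * 18,         # made-up id sharing the first 4 characters
]
print(resolve_prefix("bwiB_E0w", known))
print(resolve_prefix("bwiB", known))
print(resolve_prefix("zzzz", known))
```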

TODO:
  1. complete BNF for different id/version/context/page urls
  2. table of what archive vs. webview should return for each subcase
  3. step by step of what each component (varnish/nginx, webview, archive) should do for several cases
  4. General rule for archive: if a component is missing (version) redirect to a complete URL. If components conflict (that page uuid is not in that version of that book uuid), the URL doesn't exist, so 404
@karenc

karenc commented Sep 10, 2015

By the way, in Python there are already functions for doing base64 encoding that's safe for URLs:

>>> import base64, uuid
>>> u = uuid.uuid4(); print('{} {}'.format(repr(u.bytes.encode('base64')), repr(base64.urlsafe_b64encode(u.bytes))))
'OfEIVCNhTzO2eU96Qjr4iw==\n' 'OfEIVCNhTzO2eU96Qjr4iw=='
>>> u = uuid.uuid4(); print('{} {}'.format(repr(u.bytes.encode('base64')), repr(base64.urlsafe_b64encode(u.bytes))))
'J8lpfD/hRyK3Ma3+3S4rQA==\n' 'J8lpfD_hRyK3Ma3-3S4rQA=='
>>> u = uuid.uuid4(); print('{} {}'.format(repr(u.bytes.encode('base64')), repr(base64.urlsafe_b64encode(u.bytes))))
'Io6iSJvfSXq//4c7CITHvw==\n' 'Io6iSJvfSXq__4c7CITHvw=='

You can see that the + and / are replaced for urlsafe base64 encoding.

@reedstrm
Author

Ah, good. I had a TODO in there re: a custom codec. I kinda liked not importing base64, but ...

@philschatz

Hooray for shorter URLs! I think this addresses the long-chars part but could it be done with fewer bits (for uniqueness) and chars (for the URL)?

How many chars would be needed for uniqueness in base 24 (no "I" or "O" because they can be confused with "1" and "0") or base 34 (alpha + digits - "I" - "O")?

@reedstrm
Author

Base24 and base34 would each lead to longer URLs: they encode fewer values per character.
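
The arithmetic behind this can be checked in a few lines. The sketch assumes the goal is to carry the same number of random bits (48 for the 8-character base64 prefix, 128 for a full uuid); chars_needed is a hypothetical helper name.

```python
import math


def chars_needed(bits: int, alphabet_size: int) -> int:
    """Characters required to represent `bits` bits in a given alphabet."""
    return math.ceil(bits / math.log2(alphabet_size))


for base in (24, 34, 64):
    print(base, chars_needed(48, base), chars_needed(128, base))
```

For 48 bits this gives 11 characters in base 24 and 10 in base 34, against 8 in base 64; for the full 128 bits it gives 28 and 26 against 22.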

@rich-hart

Postgres Documentation:
