reedstrm/url_ids.md

## url_ids.md

      
PREAMBLE

We the developers of OpenStax/Connexions, in order to form a more readable URL, establish user harmony,
ensure URL persistence, and support all the use cases, do suggest the following id hashing strategy for content objectIds.

ID FORMAT

The canonical form for an id is a uuid-v4, in the 8n-4n-4n-4n-12n format, e.g. 6f0881fc-4d30-43e1-9a3b-52a210ab5980
Python code for generating these:
  >>> import uuid
  >>> u=uuid.uuid4()
  >>> u
  UUID('6f0881fc-4d30-43e1-9a3b-52a210ab5980')
Or with an existing string:
  >>> uuid.UUID('6f0881fc-4d30-43e1-9a3b-52a210ab5980')
  UUID('6f0881fc-4d30-43e1-9a3b-52a210ab5980')

SQL code (in repository):
  repository=# select uuid_generate_v4();
             uuid_generate_v4           
  --------------------------------------
   a9b8564d-9c87-4744-af85-194ee9759e35

And the usual typecasting works:
  repository=# select '6f0881fc-4d30-43e1-9a3b-52a210ab5980'::uuid;
                   uuid                 
  --------------------------------------
   6f0881fc-4d30-43e1-9a3b-52a210ab5980

ALTERNATE ENCODING

In order to allow for shorter URLs, a base64-encoded and truncated form of
the same uuid may be used. It is important to note that this is not the
base64 encoding of the uuid string: it is the encoding of the underlying
128-bit number. Since we're going to be using these as URL components, we need
to substitute '-' and '_' for the '+' and '/' characters (and optionally strip
the '=' padding and trailing newline).
TODO: check if there's a way to define a custom encoding 'base64url'
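Python (2.x; the 'base64' string codec used here does not exist in Python 3):
>>> u
UUID('6f0881fc-4d30-43e1-9a3b-52a210ab5980')
>>> u.bytes.encode('base64')
'bwiB/E0wQ+GaO1KiEKtZgA==\n'
>>> # swap the URL-unsafe characters, drop the '=' padding, then slice off the trailing newline
>>> b=u.bytes.encode('base64').replace('+','-').replace('/','_').replace('=','')[:-1]
>>> b
'bwiB_E0wQ-GaO1KiEKtZgA'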

Converting back (this is ugly, but will not be needed very often):
>>> uuid.UUID(hex=(b.replace('_','/').replace('-','+')+'==').decode('base64').encode('hex'))
UUID('6f0881fc-4d30-43e1-9a3b-52a210ab5980')
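
The interpreter sessions above rely on Python 2 string codecs. As a minimal sketch of the same round trip in Python 3, assuming only the standard library's `base64` and `uuid` modules, the URL-safe alphabet can be used directly, which may also cover the 'base64url' TODO. The function names `uuid_to_b64` and `b64_to_uuid` are illustrative, not part of any existing codebase.

```python
import base64
import uuid


def uuid_to_b64(u: uuid.UUID) -> str:
    """Encode the UUID's 16 raw bytes as unpadded URL-safe base64 (22 chars)."""
    # urlsafe_b64encode already uses '-' and '_' in place of '+' and '/',
    # and does not append a newline.
    return base64.urlsafe_b64encode(u.bytes).decode("ascii").rstrip("=")


def b64_to_uuid(s: str) -> uuid.UUID:
    """Decode a full 22-character id back into a UUID."""
    # 16 bytes always encode to 22 characters plus two '=' of padding.
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))


u = uuid.UUID("6f0881fc-4d30-43e1-9a3b-52a210ab5980")
short = uuid_to_b64(u)
print(short)               # bwiB_E0wQ-GaO1KiEKtZgA
print(b64_to_uuid(short))  # 6f0881fc-4d30-43e1-9a3b-52a210ab5980
```

Only the full 22-character form can be decoded this way; a truncated prefix has to be looked up against known ids instead.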

SQL:
repository=# select replace(replace(replace(encode(uuid_send('6f0881fc-4d30-43e1-9a3b-52a210ab5980'::uuid),'base64'),'+','-'),'/','_'),'=','');
        replace         
------------------------
 bwiB_E0wQ-GaO1KiEKtZgA
(1 row)

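And back (again, not frequent; this relies on bytea_output = 'hex', the PostgreSQL default since 9.0, so that the text form starts with '\x'):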
repository=# select substring(decode(replace(replace('bwiB_E0wQ-GaO1KiEKtZgA','-','+'),'_','/')::text||'==','base64')::text,3)::uuid;
              substring               
--------------------------------------
 6f0881fc-4d30-43e1-9a3b-52a210ab5980

JavaScript:
http://stackoverflow.com/questions/6095115/javascript-convert-guid-in-string-format-into-base64

WHY BASE64?

All this does is demonstrate that the 22-character base64 encoding is
completely equivalent to the full 36-character uuid.
However, the base64 encoding has the nice property that the initial 6-8
characters are themselves random enough to serve as unique ids in the context
of our repository. In fact, for existing content, 5 characters are unique (4
characters give 24 pairwise collisions). I ran tests against randomly generated
uuids: the process above found approximately one collision every 10 million ids
for length 8.
Why 8, and not 6? We're going to need a lot more ids when we start serving
fragments (sub-page sections) as individual objects.
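
A collision test of the kind described above could look like the following sketch. It is not the script that produced the figures quoted here; the helper names are hypothetical. It counts pairs of ids sharing a prefix and compares that with a birthday-style estimate, which assumes every prefix bit is uniformly random. That holds for prefixes of up to 8 characters, since they fall within the first 48 random bits of a v4 uuid. Under that assumption the estimate for 10 million ids at length 8 is roughly 0.18 expected colliding pairs, the same order of magnitude as the measurement reported above.

```python
import base64
import uuid
from collections import Counter


def short_id(u: uuid.UUID, length: int = 8) -> str:
    """First `length` characters of the URL-safe base64 form of the uuid."""
    return base64.urlsafe_b64encode(u.bytes).decode("ascii")[:length]


def count_colliding_pairs(ids, length: int = 8) -> int:
    """Number of unordered pairs of ids that share the same short prefix."""
    counts = Counter(short_id(u, length) for u in ids)
    return sum(n * (n - 1) // 2 for n in counts.values())


def expected_colliding_pairs(n: int, length: int = 8) -> float:
    """Birthday estimate: each base64 character carries 6 bits."""
    return n * (n - 1) / 2 / 2 ** (6 * length)


# Small empirical run; a realistic test would use far more ids.
sample = [uuid.uuid4() for _ in range(50_000)]
for length in (4, 5, 6, 8):
    print(length, count_colliding_pairs(sample, length),
          round(expected_colliding_pairs(len(sample), length), 3))
```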

PROPOSED URLs

URL Examples:
Current (all College Physics, first page of Thermodynamics):
http://cnx.org/contents/031da8d3-b525-429c-80cf-6c8ed997733a@9.1:108/Introduction-to-Thermodynamics
Proposed complete:
http://cnx.org/contents/031da8d3-b525-429c-80cf-6c8ed997733a@9.1:d26dc35b-f794-427e-b39d-f4b5496fd118@4/Introduction-to-Thermodynamics
Proposed shortened:
http://cnx.org/contents/Ax2o07Ul@9.1:0m3DW_eU/Introduction-to-Thermodynamics
Note that even shorter ones will redirect to "do the right thing". This url:
http://cnx.org/contents/Ax2o07Ul:0m3DW_eU
Will redirect to the Introduction to Thermodynamics page in the current version
of the textbook.
Archive will redirect to the full canonical URL for any given object or context:page pairing.
i.e. http://archive.cnx.org/contents/Ax2o07Ul:0m3DW_eU becomes:
http://archive.cnx.org/contents/031da8d3-b525-429c-80cf-6c8ed997733a@9.1:d26dc35b-f794-427e-b39d-f4b5496fd118@4/Introduction-to-Thermodynamics
Same hash encoding for extras:
Base64 ids (long and short) redirect to complete:
http://archive.cnx.org/extras/031da8d3-b525-429c-80cf-6c8ed997733a@9.1:d26dc35b-f794-427e-b39d-f4b5496fd118@4/Introduction-to-Thermodynamics
retrieves the extras info regarding the page in that particular book context,
including the shortened base64 id, and any context specific info.
URLs that contain a version for the page in a context will ignore the version:
any given context can only contain a single instance of any given page. Its
presence in the complete form is advisory. (Contentious: good API design says
to return 404 if a bad version is given.)
The title (last component of the URL) is optional and ignored.
Archive will interpret page numbers, and redirect to the appropriate complete
URL form.
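
To make the URL shapes above concrete, here is a minimal parsing sketch. The complete BNF is still a TODO, so the grammar below is an assumption inferred from the examples: `/contents/` or `/extras/`, a context id with optional `@version`, an optional `:page` with optional `@version`, and an optional title. The names `PATH_RE` and `parse_path` are hypothetical.

```python
import re
from typing import Dict, Optional

# Assumed token shapes, inferred from the example URLs only.
ID = r"[A-Za-z0-9_-]+"             # full uuid, base64 id/prefix, or legacy page number
VERSION = r"[0-9]+(?:\.[0-9]+)?"   # e.g. 9.1 or 4

PATH_RE = re.compile(
    rf"^/(?P<kind>contents|extras)/"
    rf"(?P<context>{ID})(?:@(?P<context_version>{VERSION}))?"
    rf"(?::(?P<page>{ID})(?:@(?P<page_version>{VERSION}))?)?"
    rf"(?:/(?P<title>[^/]*))?$"
)


def parse_path(path: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a URL path into its components; absent parts come back as None."""
    match = PATH_RE.match(path)
    return match.groupdict() if match else None


parts = parse_path("/contents/Ax2o07Ul@9.1:0m3DW_eU/Introduction-to-Thermodynamics")
print(parts["context"], parts["context_version"], parts["page"], parts["title"])
# Ax2o07Ul 9.1 0m3DW_eU Introduction-to-Thermodynamics
```

Because the hyphen is valid in both the uuid form and the URL-safe base64 alphabet, one character class covers both; telling a legacy page number apart from a short base64 prefix made only of digits would need an extra rule that the proposal does not yet specify.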
The shortened form for each object id will be served from archive as part of
the /extras info for that object. The base64 version (22 chars) will be served
as a new metadata field, as well as in the book json (base64 ids for each page
in the book will be included).
It is expected that webview will maintain the short ID urls, and not
redirect/rewrite them to the canonical versions. Redirection to versioned URLs
is expected, however.
The response for a base64 id prefix that is too short, and therefore matches
more than one object, is not currently defined.
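
One way to picture short-id lookup, including the undefined ambiguous case, is the sketch below. It assumes the known ids are available as an in-memory list, which is only for illustration; a real archive would presumably query the repository. Raising `AmbiguousPrefix` is a placeholder for whatever response is eventually chosen, and all names here are hypothetical.

```python
import base64
import uuid
from typing import Iterable, Optional


class AmbiguousPrefix(Exception):
    """Placeholder for the not-yet-defined 'prefix matches several ids' case."""


def uuid_to_b64(u: uuid.UUID) -> str:
    """Unpadded URL-safe base64 of the uuid's 16 raw bytes."""
    return base64.urlsafe_b64encode(u.bytes).decode("ascii").rstrip("=")


def resolve_prefix(prefix: str, known_ids: Iterable[uuid.UUID]) -> Optional[uuid.UUID]:
    """Return the single uuid whose base64 form starts with `prefix`.

    Returns None when nothing matches and raises AmbiguousPrefix when
    more than one id matches.
    """
    matches = [u for u in known_ids if uuid_to_b64(u).startswith(prefix)]
    if not matches:
        return None
    if len(matches) > 1:
        raise AmbiguousPrefix(prefix)
    return matches[0]


known = [
    uuid.UUID("031da8d3-b525-429c-80cf-6c8ed997733a"),
    uuid.UUID("d26dc35b-f794-427e-b39d-f4b5496fd118"),
]
print(resolve_prefix("Ax2o07Ul", known))  # 031da8d3-b525-429c-80cf-6c8ed997733a
print(resolve_prefix("0m3DW_eU", known))  # d26dc35b-f794-427e-b39d-f4b5496fd118
```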

TODO:

complete BNF for the different id/version/context/page URLs
table of what archive vs. webview should return for each subcase
step by step of what each component (varnish/nginx, webview, archive) should do for several cases
General rule for archive: if a component (e.g. the version) is missing, redirect to a complete URL. If components conflict (that page uuid is not in that version of that book uuid), the URL doesn't exist, so return 404.
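
The general rule for archive can be sketched as a small decision function. Everything concrete here is an assumption for illustration: the in-memory `BOOKS` and `LATEST` tables stand in for the repository, 302 is picked arbitrarily as the redirect status (the proposal does not name one), and a mismatched page version is ignored and redirected, following the "advisory" reading above rather than the 404 alternative. The title is left out of the complete URL since it is optional.

```python
from typing import Dict, Optional, Tuple

# Hypothetical catalogue: (book uuid, book version) -> {page uuid: page version}
BOOKS: Dict[Tuple[str, str], Dict[str, str]] = {
    ("031da8d3-b525-429c-80cf-6c8ed997733a", "9.1"): {
        "d26dc35b-f794-427e-b39d-f4b5496fd118": "4",
    },
}
# Hypothetical table of the current version of each book.
LATEST: Dict[str, str] = {"031da8d3-b525-429c-80cf-6c8ed997733a": "9.1"}


def archive_response(book: str, page: str,
                     book_version: Optional[str] = None,
                     page_version: Optional[str] = None) -> Tuple[int, Optional[str]]:
    """Return (status, path): 200 if complete, 302 to complete, or 404."""
    latest = LATEST.get(book)
    if latest is None:
        return 404, None                       # unknown book
    version = book_version or latest           # missing version: use current
    pages = BOOKS.get((book, version))
    if pages is None or page not in pages:
        return 404, None                       # components conflict
    complete = f"/contents/{book}@{version}:{page}@{pages[page]}"
    if book_version == version and page_version == pages[page]:
        return 200, complete                   # request was already complete
    return 302, complete                       # fill in what was missing


book = "031da8d3-b525-429c-80cf-6c8ed997733a"
page = "d26dc35b-f794-427e-b39d-f4b5496fd118"
print(archive_response(book, page))              # redirect to the complete form
print(archive_response(book, page, "9.1", "4"))  # already complete
print(archive_response(book, page, "8.0"))       # page not in that version: 404
```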