@Altai-man
Last active July 14, 2022 08:11
A story about search URL creation, or what

my $index-text = recurse-until-str($.pod).join;
my @indices    = $.pod.meta;
my $fragment = qq[index-entry{@indices ?? '-' !! ''}{@indices.join('-')}{$index-text ?? '-' !! ''}$index-text]
             .subst('_', '__', :g).subst(' ', '_', :g);

really means.

In the beginning, X<> was used in numerous places to mark the anchors that a text-to-index type of search should jump to.

The X<> syntax used was not consistent: sometimes it was X<Foo|Bar>, sometimes X<|Baz>, sometimes X<|Baz (Bazz)> or X<Foo|Bar,Baz>.

When we processed all those "there is more than one way to do it" codes, we generated URLs in this fashion:

  • Prefix with "index-entry"
  • If there is something after the | bit, append it, joining everything written with a -
  • If there is some text, append it, joining everything written with a -
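To make the rules above concrete, here is a minimal stand-alone sketch of the same logic. The sub name index-fragment is hypothetical, $text stands in for recurse-until-str($.pod).join and @meta for $.pod.meta. Passing meta as a flat list of strings is an assumption; how the real Pod parser populates .meta is not shown here.

```raku
# Hypothetical stand-alone version of the fragment-building code.
# Assumption: @meta is a flat list of strings (what follows the | in X<>).
sub index-fragment(Str $text, @meta --> Str) {
    my $fragment = 'index-entry'
        ~ (@meta ?? '-' !! '') ~ @meta.join('-')
        ~ ($text ?? '-' !! '') ~ $text;
    # Same post-processing as the original: double underscores, then spaces to _
    $fragment.subst('_', '__', :g).subst(' ', '_', :g);
}

say index-fragment('Foo', ['Bar']);         # X<Foo|Bar>      → index-entry-Bar-Foo
say index-fragment('',    ['Baz']);         # X<|Baz>         → index-entry-Baz
say index-fragment('',    ['Baz (Bazz)']);  # X<|Baz (Bazz)>  → index-entry-Baz_(Bazz)
say index-fragment('Foo', ['Bar', 'Baz']);  # X<Foo|Bar,Baz>  → index-entry-Bar-Baz-Foo
```

Four spellings of X<> give four differently shaped fragments, and each of them changes as soon as either the text or the meta is edited.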

The consequences are:

  • The resulting anchor part of the URL depends on both the exact text and the exact "meta" (what's after | in the anchor), i.e.
    • If you want to change the text - 404
    • If you want to change meta, e.g. append a search category instead of bare Reference or change the category - 404
    • Because numerous formats were used in the sources without consistency, the URL-forming code, which took everything into account, produced URLs without a unified structure - sometimes there is bare text, sometimes text + meta, sometimes only meta

Bonus points confusion

This is how a URL is formed for the search.js file, which holds the URLs the search takes users to:

value => escape($name), url => escape-json("/{ $kind.lc }/{ good-name($name) }"))

What does this good-name sub do? Surely does URL escaping, because we have a lot of special symbols in names, right? Right?

# these chars cannot appear in a unix filesystem path
sub good-name($name is copy --> Str) is export {
    # / => $SOLIDUS
    # % => $PERCENT_SIGN
    # ^ => $CIRCUMFLEX_ACCENT
    # # => $NUMBER_SIGN
    my @badchars  = ["/", "^", "%"];
    my @goodchars = @badchars
                    .map({ '$' ~ .uniname })
                    .map({ .subst(' ', '_', :g) });

    $name = $name.subst(@badchars[0], @goodchars[0], :g);
    $name = $name.subst(@badchars[1], @goodchars[1], :g);

    # if it contains escaped sequences (like %20) we do not
    # escape %
    if ( ! ($name ~~ /\%<xdigit>**2/) ) {
        $name = $name.subst(@badchars[2], @goodchars[2], :g);
    }

    return $name;
}

Not only is the code buggy (the comment lists four bad characters, but only three are in the array and replaced by index, so # is not replaced (thank god it isn't, btw)), it also does the wrong thing: URLs are URLs, and they shouldn't give a damn about the Unix filesystem.

There is a proper way of escaping special symbols in URIs, namely percent-encoding, but because we served the documentation as Unix files via nginx, this whole time we

  1. did not apply URI escaping to things like || and it worked by accident (| is not an allowed URL character) because nginx allows you to violate the RFC;
  2. saw scary things such as $SOLIDUS$SOLIDUS in the URL bar instead of normally escaped characters, because... serving static files the way we did it is a bad idea.
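For comparison, here is a minimal sketch of what percent-encoding does to the characters in question. The percent-encode sub is a hand-rolled, hypothetical helper for illustration only, not what the docs site uses; real code would more likely rely on an existing URI library.

```raku
# RFC 3986 "unreserved" characters, which may appear in a URI unescaped.
my $unreserved = (|('A'..'Z'), |('a'..'z'), |('0'..'9'), '-', '.', '_', '~').Set;

# Hypothetical helper: encode the string as UTF-8 and replace every byte
# that is not an unreserved ASCII character with %XX (uppercase hex).
sub percent-encode(Str $s --> Str) {
    $s.encode('utf8').list.map(-> $byte {
        my $char = chr($byte);
        $unreserved{$char} ?? $char !! '%' ~ $byte.fmt('%02X');
    }).join;
}

say percent-encode('||');  # %7C%7C
say percent-encode('//');  # %2F%2F  (instead of $SOLIDUS$SOLIDUS)
say percent-encode('%');   # %25
say percent-encode('^');   # %5E
```

Unlike good-name, this treats every special character the same way and does not need a special case for names that already contain something looking like %20.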

When you criticize - suggest solutions

Possible solutions:

  • Properly escape URLs in search, leaving the rewriting of the backward-compat SOLIDUS/PERCENT_SIGN/CIRCUMFLEX_ACCENT forms to the engine, but now with the main search code doing a consistent thing
  • Do a hard switch to a new schema of forming a URL for an anchor, that is:
    • if there is no text (invisible anchor to some paragraph), take the meta part without the category (that is, X<|Language,Foo> is #index-entry-Foo)
    • if there is text... We can either do the same as above, which would be a good unification and make URLs depend on fewer parts, or append the text after all to preserve more backward compat
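The second option could look roughly like this. It is only a sketch: the sub name proposed-anchor and the :append-text switch are hypothetical, and it assumes that the first element of the meta is always the category and the rest is the name.

```raku
# Hypothetical sketch of the proposed schema.
# Assumption: @meta is (category, name, ...), so the category is skipped.
sub proposed-anchor(Str $text, @meta, Bool :$append-text = False --> Str) {
    my $name   = @meta.skip(1).join('-');
    my $anchor = "index-entry-$name";
    # Optional variant: keep the text in the anchor for more backward compat
    $anchor ~= "-$text" if $append-text && $text;
    $anchor;
}

say proposed-anchor('',    ['Language', 'Foo']);                # index-entry-Foo
say proposed-anchor('bar', ['Language', 'Foo']);                # index-entry-Foo
say proposed-anchor('bar', ['Language', 'Foo'], :append-text);  # index-entry-Foo-bar
```

With the default behaviour, changing the text or the category no longer changes the anchor; with :append-text only the category is free to change.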

Ideas?

Also this: https://gist.github.com/antoniogamiz/9f7feb5b93a95d0c121ec0ba03d28b6a?permalink_comment_id=3399287#gistcomment-3399287

@patrickbkr

I'm unsure about the precise meaning of the X<|> tag. My guess:

  • There is an index of our documentation.
  • Altai-man recently managed to unify the hierarchy (categories) of the entries in that index.
  • One can add entries to that index. Each entry has a place in the index (Category + name) and a point in the documentation it points to. This is a bit like a Key/Value Store or a HashMap in that keys have to be unique, but values don't need to be.
  • The index entries are made unique by their category and name alone. Neither the page an entry points to nor the text it points at are part of the key of an entry.
  • It is possible to have multiple entries with the same name, but differing category.
  • The thing it points at can either be a piece of text or a pure position (no text).
  • The tag itself works like this: X<some text or none|category,name>.
  • It is not possible to provide multiple index entries in a single tag.
  • With the refactor of the index categories, no entries are ever without a category.

Given the above assumptions are true, I'd say:

  • The anchor must represent the index entry, nothing else.
  • Ideally the anchor is constructed from the actual index entry and not from the text of the second part of the X<> tag. I.e. even if we do some processing to construct an index entry from the text in the X<> tag (e.g. if the category is left off, we assume some default category), the anchor has to represent the actual index entry.
  • The first part of the X<|> must never be part of the anchor.
  • So in principle the system acts as if anchors were constructed and placed based on the index alone and had nothing to do with the X<> tags themselves. The tags just happen to be the mechanism by which the index is filled.
  • Use standard percent URL encoding.

I personally find the "#index-entry-" prefix a bit cumbersome. I'd have preferred "#index-". But I consider this a minor nit. If there are relevant reasons to keep it the way it is, we should do so.
