Altai-man/search-state.md Secret

## search-state.md

      
A story about search URL creation, or what
my $index-text = recurse-until-str($.pod).join;
my @indices    = $.pod.meta;
my $fragment = qq[index-entry{@indices ?? '-' !! ''}{@indices.join('-')}{$index-text ?? '-' !! ''}$index-text]
             .subst('_', '__', :g).subst(' ', '_', :g);

really means.
In the beginning, X<> was used in numerous places to mark the anchors that a text-to-index type of search should jump to.
The X<> syntax was not used consistently: sometimes it was X<Foo|Bar>, sometimes X<|Baz>, sometimes X<|Baz (Bazz)> or X<Foo|Bar,Baz>.
When we processed all those "there is more than one way to do it" codes, we generated URLs in this fashion:

- Prefix with "index-entry"
- If there is something after the | bit, append it, joining everything written with a -
- If there is some text, append it, joining everything written with a -
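
To make the rule concrete, here is a minimal stand-alone sketch of it. The `index-fragment` helper is hypothetical (it is not in the real code base), and it assumes the meta part has already been split into a flat list of strings; the ordering and the underscore/space mangling follow the snippet at the top.

```raku
# Hypothetical helper mirroring the fragment rule described above.
# $text is the visible part of X<text|meta>; @meta is assumed to be
# a flat list of the entries written after the |.
sub index-fragment(Str $text, @meta --> Str) {
    my $fragment = 'index-entry';
    $fragment ~= '-' ~ @meta.join('-') if @meta;
    $fragment ~= '-' ~ $text           if $text;
    # Underscores are doubled first, then spaces become underscores.
    $fragment.subst('_', '__', :g).subst(' ', '_', :g)
}

say index-fragment('Foo', ['Bar']);         # index-entry-Bar-Foo
say index-fragment('',    ['Baz (Bazz)']);  # index-entry-Baz_(Bazz)
say index-fragment('Foo', ['Bar', 'Baz']);  # index-entry-Bar-Baz-Foo
```

Every input contributes to the fragment, so touching either the text or the meta produces a different anchor.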

The consequences are:

- The resulting anchor part of the URL depends on both the exact text and the exact "meta" (what's after | in the anchor), i.e.
  - If you change the text: 404
  - If you change the meta, e.g. append a search category instead of a bare Reference, or change the category: 404
- Because numerous formats were used in the sources without consistency, the URL-forming code, which took everything into account, produced URLs without a unified structure: sometimes there is bare text, sometimes text + meta, sometimes only meta


Bonus points confusion

This is how a URL is formed for the search.js file, which uses it to take users to the right page:
value => escape($name), url => escape-json("/{ $kind.lc }/{ good-name($name) }"))

What does this good-name sub do? Surely does URL escaping, because we have a lot of special symbols in names, right? Right?
# these chars cannot appear in a unix filesystem path
sub good-name($name is copy --> Str) is export {
    # / => $SOLIDUS
    # % => $PERCENT_SIGN
    # ^ => $CIRCUMFLEX_ACCENT
    # # => $NUMBER_SIGN
    my @badchars  = ["/", "^", "%"];
    my @goodchars = @badchars
                    .map({ '$' ~ .uniname })
                    .map({ .subst(' ', '_', :g) });

    $name = $name.subst(@badchars[0], @goodchars[0], :g);
    $name = $name.subst(@badchars[1], @goodchars[1], :g);

    # if it contains escaped sequences (like %20) we do not
    # escape %
    if ( ! ($name ~~ /\%<xdigit>**2/) ) {
        $name = $name.subst(@badchars[2], @goodchars[2], :g);
    }

    return $name;
}

Not only is the code buggy (the comments list four bad characters, but only three are in @badchars and substituted by index, so # is never replaced; thank god it isn't, by the way), it also does the wrong thing: URLs are URLs, and they shouldn't give a damn about the Unix filesystem.
There is a proper way of escaping special symbols in URIs, percent-encoding, but because we served the documentation as Unix files via nginx, this whole time we

- did not apply URI escaping to things like ||, and it worked by accident (| is not an allowed URL character) because nginx lets you violate the RFC;
- saw scary things such as $SOLIDUS$SOLIDUS in the URL bar instead of normal escaped characters, because... serving static files the way we did is a bad idea.
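
For comparison, here is a rough sketch of what percent-encoding a name could look like. `uri-escape` is a hypothetical name, not an existing sub in the code base, and the "unreserved" character set is taken from RFC 3986; treat it as an illustration rather than a drop-in replacement for good-name.

```raku
# Hypothetical sketch: percent-encode every character outside the
# RFC 3986 "unreserved" set (letters, digits, '-', '.', '_', '~').
sub uri-escape(Str $name --> Str) {
    $name.subst(
        /<-[A..Z a..z 0..9 \- . _ ~]>/,
        -> $m {
            # Encode the matched character as UTF-8 and emit %XX per byte.
            $m.Str.encode('UTF-8').list.map({ sprintf '%%%02X', $_ }).join
        },
        :g
    )
}

say uri-escape('||');  # %7C%7C
say uri-escape('//');  # %2F%2F
say uri-escape('^');   # %5E
```

Note that this sketch always encodes %, so it is meant for raw names only; feeding it an already-escaped string would escape it a second time.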

When you criticize, suggest solutions

Possible solutions:

- Properly escape URLs in search and leave the rewriting of the backward-compatible SOLIDUS/PERCENT_SIGN/CIRCUMFLEX_ACCENT forms to the engine, so that the main search code does one consistent thing
- Do a hard switch to a new scheme for forming the URL of an anchor, that is:
  - if there is no text (an invisible anchor to some paragraph), take the meta part without the category (that is, X<|Language,Foo> is #index-entry-Foo)
  - if there is text, we can either do the same as above, which would be a good unification and would make URLs depend on fewer parts, or append the text after all to preserve more backward compatibility
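
A possible shape for the second option, purely as a sketch to discuss. `stable-fragment` and its `:keep-text` flag are hypothetical names; the code assumes that when the meta has two or more entries the first one is the category, and that the existing underscore/space mangling stays as it is.

```raku
# Hypothetical sketch of the proposed scheme.
# @meta is assumed to be the flat list written after the |.
sub stable-fragment(Str $text, @meta, Bool :$keep-text = False --> Str) {
    # Assumption: with two or more meta entries, the first is the category.
    my @terms = @meta.elems > 1 ?? @meta[1..*] !! @meta;
    # Use the text when there is no meta at all, or when the caller
    # asks to keep it for backward compatibility.
    @terms.push($text) if $text && ($keep-text || !@terms);
    ('index-entry', |@terms).join('-')
        .subst('_', '__', :g).subst(' ', '_', :g)
}

say stable-fragment('', ['Language', 'Foo']);      # index-entry-Foo
say stable-fragment('Foo', ['Bar']);               # index-entry-Bar
say stable-fragment('Foo', ['Bar'], :keep-text);   # index-entry-Bar-Foo
```

With `:keep-text` off, changing the visible text or the category no longer changes the anchor; with it on, the fragment for the single-meta case matches what the current code produces.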


Ideas?
Also this: https://gist.github.com/antoniogamiz/9f7feb5b93a95d0c121ec0ba03d28b6a?permalink_comment_id=3399287#gistcomment-3399287