A story about search URL creation, or what
my $index-text = recurse-until-str($.pod).join;
my @indices = $.pod.meta;
my $fragment = qq[index-entry{@indices ?? '-' !! ''}{@indices.join('-')}{$index-text ?? '-' !! ''}$index-text]
    .subst('_', '__', :g).subst(' ', '_', :g);
really means.
So in the beginning, we had X<> used in numerous places to mark anchors that a text-to-index type of search should jump to.
The X<> syntax was not used consistently: sometimes it was X<Foo|Bar>, sometimes X<|Baz>, sometimes X<|Baz (Bazz)> or X<Foo|Bar,Baz>.
When we processed all those "there is more than one way to do it" codes, we generated URLs in this fashion:
- Prefix with "index-entry"
- If there is something after the | bit, append it, joining everything written with a -
- If there is some text, append it, joining everything written with a -
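To make these rules concrete, here is a minimal sketch of them applied to a few of the X<> forms mentioned above. The old-fragment helper is hypothetical: it mirrors the fragment-building snippet quoted at the top, and it assumes the meta part arrives as a flat list of strings, which may not match how the real Pod parser structures it.

```raku
# Hypothetical helper mirroring the quoted fragment-building code.
# $text stands in for recurse-until-str($.pod).join, @meta for $.pod.meta
# (assumed here to be a flat list of strings).
sub old-fragment(Str $text, @meta --> Str) {
    my $fragment = 'index-entry'
        ~ (@meta ?? '-' !! '') ~ @meta.join('-')
        ~ ($text ?? '-' !! '') ~ $text;
    # Same post-processing as the quoted code:
    # double existing underscores, then turn spaces into underscores
    $fragment.subst('_', '__', :g).subst(' ', '_', :g);
}

say old-fragment('Foo', ['Bar']);         # X<Foo|Bar>      -> index-entry-Bar-Foo
say old-fragment('',    ['Baz (Bazz)']);  # X<|Baz (Bazz)>  -> index-entry-Baz_(Bazz)
say old-fragment('Foo', ['Bar', 'Baz']);  # X<Foo|Bar,Baz>  -> index-entry-Bar-Baz-Foo
say old-fragment('Foo', []);              # X<Foo>          -> index-entry-Foo
```

Under these assumptions, both the meta and the visible text end up in the anchor, so editing either one produces a different fragment.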
The consequences are:
- The resulting anchor part of the URL depends on both the exact text and the exact "meta" (what's after | in the anchor), i.e.
  - If you want to change the text - 404
  - If you want to change meta, e.g. append a search category instead of bare Reference or change the category - 404
- Because numerous formats were used in the sources without consistency, the URL-forming code, which took everything into account, produced URLs without a unified structure: sometimes there is bare text, sometimes text + meta, sometimes meta
This is how the URLs that search.js takes users to are formed:
value => escape($name), url => escape-json("/{ $kind.lc }/{ good-name($name) }"))
What does this good-name sub do? Surely it does URL escaping, because we have a lot of special symbols in names, right? Right?
# these chars cannot appear in a unix filesystem path
sub good-name($name is copy --> Str) is export {
    # / => $SOLIDUS
    # % => $PERCENT_SIGN
    # ^ => $CIRCUMFLEX_ACCENT
    # # => $NUMBER_SIGN
    my @badchars = ["/", "^", "%"];
    my @goodchars = @badchars
        .map({ '$' ~ .uniname })
        .map({ .subst(' ', '_', :g)});
    $name = $name.subst(@badchars[0], @goodchars[0], :g);
    $name = $name.subst(@badchars[1], @goodchars[1], :g);
    # if it contains escaped sequences (like %20) we do not
    # escape %
    if ( ! ($name ~~ /\%<xdigit>**2/) ) {
        $name = $name.subst(@badchars[2], @goodchars[2], :g);
    }
    return $name;
}
Not only is the code buggy (the comments list four bad characters, but only three are in the array and applied by index, so # is not replaced (thank god it isn't, btw)), it also does the wrong thing: URLs are URLs, they shouldn't give a damn about the Unix filesystem.
There is a proper way of escaping special symbols in URIs via percent encoding, but because we served the documentation as Unix files via nginx, this whole time we
- did not apply URI escaping to things like || and it worked by accident (| is not an allowed URL character) because nginx allows you to violate the RFC;
- saw scary things such as $SOLIDUS$SOLIDUS in the URL bar instead of normal escaped characters, because... serving static files in the way we did it is a bad idea.
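For comparison, percent encoding of a single path segment could look roughly like the sketch below. The percent-encode helper is hypothetical and not part of the existing code. It assumes that only the RFC 3986 "unreserved" characters (letters, digits, -, ., _ and ~) are left untouched and that every other character is encoded byte by byte as UTF-8.

```raku
# Hypothetical helper: percent-encode one URL path segment,
# leaving only the RFC 3986 unreserved characters as they are.
sub percent-encode(Str $segment --> Str) {
    $segment.subst(
        /<-[A..Z a..z 0..9 \- \. _ \~]>/,
        # Encode the matched character as UTF-8 and emit %XX for each byte
        { .Str.encode('utf8').list.map({ .fmt('%%%02X') }).join },
        :g
    );
}

say percent-encode('infix:</>');  # infix%3A%3C%2F%3E
say percent-encode('||');         # %7C%7C
say percent-encode('%');          # %25
```

With an encoding like this, / becomes %2F rather than $SOLIDUS, and the result no longer depends on what a Unix filesystem allows in a file name.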
Possible solutions:
- Properly escape URLs in search, leaving rewriting of backward compat SOLIDUS/PERCENT_SIGN/CIRCUMFLEX_ACCENT to the engine, but now with main search code doing a consistent thing
- Do a hard switch to a new schema of forming a URL for an anchor, that is:
  - if there is no text (invisible anchor to some paragraph), take the meta part without the category (that is, X<|Language,Foo> is #index-entry-Foo)
  - if there is text... We can either do the same as above, which would be a good unification and make URLs depend on fewer parts, or append the text after all to preserve more backward compat
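A rough sketch of the first variant of the new schema, where the anchor is built from the meta part only. The new-fragment helper is hypothetical, and it assumes the meta arrives as a non-empty flat list whose last element is the name and whose preceding elements, if any, are the category.

```raku
# Hypothetical sketch of the proposed scheme: the anchor depends only on the
# name from the meta part, not on the category and not on the visible text.
# Assumes @meta is a non-empty flat list with the name as its last element.
sub new-fragment(@meta --> Str) {
    'index-entry-' ~ @meta.tail;
}

say new-fragment(['Language', 'Foo']);   # X<|Language,Foo>  -> index-entry-Foo
say new-fragment(['Reference', 'Foo']);  # X<|Reference,Foo> -> index-entry-Foo
say new-fragment(['Foo']);               # X<|Foo>           -> index-entry-Foo
```

Under that assumption, adding or changing a category leaves the anchor as it was, which is the point of dropping it from the URL.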
Ideas?
I'm unsure about the precise meaning of the X<|> tag. My guess: X<some text or none|category,name>.
Given the above assumptions are true, I'd say:
X<> tag. I.e. even if we do some processing to construct an index entry from the text in the X<> tag (e.g. if the category is left off, we assume some default category), the anchor has to represent the actual index entry.
X<|> must never be part of the anchor.
X<> tags themselves. The tags just happen to be the mechanism by which the index is filled.
I personally find the "#index-entry-" prefix a bit cumbersome. I'd have preferred "#index-". But I consider this a minor nit. If there are relevant reasons to keep it the way it is, we should do so.