Skip to content

Instantly share code, notes, and snippets.

@mikesamuel
Last active October 30, 2017 16:20
Show Gist options
  • Save mikesamuel/e9720a0acc0601372deba3bf0896f33a to your computer and use it in GitHub Desktop.
Save mikesamuel/e9720a0acc0601372deba3bf0896f33a to your computer and use it in GitHub Desktop.
API for building URL classifiers

URL Classifier Builder

This is now implemented: https://github.com/OWASP/url-classifier

Problem

Matching URLs with regular expressions is hard. Even experienced programmers who are familiar with the URL spec produce code like /http:\/\/example.com/ which spuriously matches unintended URLs like

  • http://example.com.evil.com/
  • http://example.com@evil.com/
  • http://example_com/
  • javascript:alert(1)//http://example.com

while failing to match simple variants that were probably oversights:

  • HTTP://example.com/ which uses a ucase scheme
  • http://EXAMPLE.com/ which uses a ucase hostname
  • https://example.com/ which uses a scheme that is equivalent for most intents and purposes.

A common "fix" for that example, /^https?:\/\/example\.com\//i, spuriously fails to match other variants:

  • http://example.com./ which use a trailing dot to disable DNS suffix searching
  • http://example.com:80/ which makes the port explicit

Epicycles can be added to a regex to work around problems as they're found but there is a tradeoff between correctness and readability/maintainability.

There are similar hazards when trying to constrain other parts of the URL like the paths. /^(?:https?:\/\/example\.com)?\/foo\/.*/ looks like it should match only URLs that have a path under /foo/ but spuriously matches

  • http://example.com/foo/../../../../etc/passed

which, used in the wrong context, can cause problems

Goals

Allow specifying logically complex predicates over URLs including sub-predicates that operate on

  • scheme
  • authority
  • host
  • port
  • path element globs (e.g. foo/**/*.html)
  • query parameters that MAY/MUST appear
  • query paremeters that MAY appear at most once
  • query name/value pair tests
  • fragment tests
  • fragment as a relative URL
  • content mime-type
  • embedded content
  • ORs of the above in any combination.
  • NOT of path, query, and fragment predicates since black-listing of those is less error-prone.

Hide the complexity of avoiding hazards like the above.

Reliably & quickly yield a result of (matches, does-not-match, invalid) for any input. Reliably & quickly produce a result of (matches, does-not-match) for any URI that matches the grammar defined in RFCs 3986+6874+7320. Reliably produce a result of invalid for any that do not match that grammar unless in lax mode.

Work equivalently online & offline. Specifically it will make no network connections so is not vulnerable to DNS lock-up attacks, etc. It is not a goal to treat numeric IPs equivalently to hosts that resolve to those IPs and we do not assume that 127.0.0.1 is equivalent to localhost.

Provide separate matchers to extend to different definitions of URL:

  • URLs as apparent to an end-user. The kind of thing a user might copy paste into a browser address bar: 例.com/😀#
  • URLs as apparent to a coder. The kind of thing a developer might put into an HTML document: https://例.com/😀#
  • URLs as apparent to a machine: https://xn--fsq/%F0%9F%98%80#

Simplifying assumptions

UTF-8 centric

We assume all %-sequences outside data or blob content can be decoded into UTF-8 and mark as invalid any inputs that include code-unit sequences that are not valid UTF-8 or that are not minimally encoded.

Empty domain search list

We assume that all hostnames are complete. For example, http://www/ might actually resolve to http://www.myorganization.org/ after the domain search list is applied. We can't do this and have stable predicates that do not depend on external services and that do not potentially leak information about servers inside a firewall to anyone outside the firewall who can specify a partial URL.

API

interface URLClassifier extends Function<URLValue, Classification> {
  Classification apply(URLValue x);

  static URLClassifier or(URLClassifier... ps);
  static URLClassifier or(Iterable<? extends URLClassifier> ps);
}

enum Classification {
  MATCH,
  NOT_A_MATCH,
  /**
   * For malformed URLs or those that may be widely misinterpreted
   * by different user-agents.
   */
  INVALID,
}

/**
 * Context about a URL which can be used to convert scheme names
 * to Scheme values and to convert relative URLs to absolute.
 */
final class URLContext {
  ...
}

/**
 * A URLValue is a URI Reference as defined by Std 66, RFC 7230.
 */
final class URLValue {

  /**
   * @param urlText a string as seen by a machine.
   */
  static URLValue fromMachineText(String urlText, URLContext context);

  /**
   * @param urlText a URL that an end-user might copy/paste into a
   *    browser bar.  For example, "www.example.com" or "user@example.com".
   * @return a URLValue whose originalUrlText may have a scheme prepended.
   */
  static URLValue fromUserText(String urlText, URLContext context);

  String getOriginalUrlText();
  String getAbsoluteUrlText();

  /**
   * True iff the default URLContext's authority was inherited
   * by this URL value.
   * Only AuthorityClassifier's that match all hosts will
   * match when this is true.
   */
  boolean usesPlaceholderAuthority();

  /**
   * True iff simplifying the path interpreted ".." relative to "/" or "".
   * <p>
   * For example,
   * Interpreting "/../bar" relative to "http://example.com/foo/"
   * leads to simplying "http://example.com/foo/../bar" to
   * "http://example.com/bar".
   * But the "/.." is applied to "/foo" so root's parent is not reached.
   * <p>
   * On the other hand,
   * Interpreting "/../../bar" relative to "http://example.com/foo/"
   * leads to simplifying "http://example.com/foo/../../bar" to
   * "http://example.com/bar".
   * In this case, the first "/.." is applied to "/foo" and the second
   * is applied to "/" so the root's parent is reached.
   */
  boolean pathSimplificationReachedRootsParent();

  /** May be Scheme.UNKNOWN */
  Scheme getScheme();
  /**
   * Raw text of the absoluteUrlText's authority.
   *
   * <p>Decoding must be done after determining where userinfo/host/port
   * boundaries occur and after determining whether the host is
   * numeric or named.
   */
  String getAuthority();
  /**
   * Raw text of absoluteUrlText's path.
   */
  String getPath();
  ...
}


/**
 * Specifying multiple sub-predicates of the same kind ORs those
 * together.
 * Classifiers of one kind AND with predicates of another kind except
 * where stated below.
 * For example,
 * <pre>
 *    .matchesSchemes(HTTP, HTTPS)
 *    .matchesSchemes(FILE)
 *    .matchesHosts("example.com")
 *    .matchesPathGlobs("/foo/**")
 *    .matchesPathGlobs("/*.js")
 * </pre>
 * corresponds to pseudocode like
 * <pre>
 * ((url.scheme in (HTTP, HTTPS))
 *    or (url.scheme is FILE))
 * and (not url.scheme.naturallyHasAuthority
 *      or url.authority == "example.com")
 * and (not url.scheme.naturallyHasPath
 *      or glob("/foo/**").matches(url.path)
 *      or glob("/*.js").matches(url.path))
 * </pre>
 *
 * <p>If a URL's scheme does not naturally have an authority,
 * then it MUST not have an authority and any authority predicate
 * is ignored.
 * For example, `file:` URLs and `data:` URLs do not naturally
 * have an authority.  `file:` by the nature of the scheme, and
 * `data:` because it is not a hierarchical scheme.
 *
 * <p>If a URL's scheme naturally has an authority then it MUST have an
 * authority and any authority predicate must also pass.
 * For example: `http:///` will never pass any predicate.
 * <a href="https://w3c.github.io/FileAPI/#DefinitionOfScheme">Blobs</a>
 * naturally have an authority.
 *
 * <p>If a URL's scheme does not naturally have a path or query component
 * then path and query predicates will not be applied.
 * All hierarchical URLs naturally have both, so a `file:` URL MUST match
 * any query predicates.
 *
 * <p>All URLs are treated as URI References, so fragments are allowed
 * regardless of scheme.
 *
 * <p>If a URL's scheme does not naturally have embedded content then any content
 * predicate is ignored.
 * For example, `http:` and other hierarchical URLs do not have embedded content.
 *
 * <p>If a URL's scheme does naturally have embedded content, then it MUST
 * have embedded content and any content predicate must match that content.
 * For example: `data:text/plain;base64` will not match any predicate but
 * `data:text/plain,` will match if the content predicate matches the empty
 * string.
 * Schemes that naturally have embedded content include `about:`, `blob:`,
 * `data:`, and `javascript:`.
 */
class URLClassifierBuilder {
  URLClassifier build();

  //// Flags that affect multiple sub-predicates or that govern
  //// how the original URL is interpreted in context.
  // URLs with NULs are a common problem case.
  // By default they are not matched.
  // blob and data URLs that need to embed NULs in content typically
  // base64 encode and NULs in decoded will not cause a mismatch.
  // If allowing NULs is definitely required enable this.
  URLClassifierBuilder matchesNULs(boolean allow);
  // Microsoft uses back-slash ('\\') to separate file components and many
  // Microsoft systems helpfully treat ('\\') as equivalent to the normal
  // URL path component separator ('/').
  // Enable this flag if you want to emulate this behavior.
  // By default, you don't.
  URLClassifierBuilder matchMicrosoftPathBugForBug(boolean enable);
  // If not enabled (the default), apply(x) will return INVALID when
  // x.pathSimplificationReachedRootsParent().
  // <p>
  // It is safe to enable this if you plan on substituting x.urlText
  // for x.originalUrlText in your output, but not if you plan on
  // using x.originalUrlText.
  URLClassifierBuilder allowPathsThatReachRootsParent(boolean enable);

  //// Sub-predicates of kind scheme     MATCH_ME://...
  URLClassifierBuilder matchesSchemes(Scheme... schemes);
  URLClassifierBuilder matchesSchemes(Iterable<? extends Scheme> schemes);
  // We can match data with an additional constraint on the mime-type.
  // We special-case data because content-types are not attached to
  // URLs with other schemes and its rare to want to match a data: URL
  // without caring about the type of data.
  URLClassifierBuilder matchesData(MimeTypeClassifier types);

  //// Sub-predicates of kind authority  http://MATCH_ME/...
  URLClassifierBuilder matchesHosts(String... hostnames);
  URLClassifierBuilder matchesHosts(Iterable<? extends String> hostnames);
  URLClassifierBuilder matchesAuthority(AuthorityClassifier authMatcher);

  //// Sub-predicates of kind path       http://example.com/MATCH_ME?...
  // In the glob, `**` matches one or more path components and * matches
  // a single path component at most.  Matching is done after processing the
  // special path components ".." and ".".
  // If a glob ends in "/?" then a slash is optionally allowed at the end.
  // For example,
  //   "**/*.html" matches all paths that end with ".html"
  //   "*.html" matches all single-component paths that end with ".html"
  //   "foo/**/bar" matches all paths that start with a foo component,
  //   followed by zero or more other components and ending with bar.
  //   "foo/" matches "foo/" but not "foo" while "foo/?" matches both.
  //   "foo**" is not a valid glob.
  // The following code-points may be %-encoded in a path glob to allow them
  // to be treated literally as part of a path component: ('/', '*', '?', '%').
  URLClassifierBuilder matchesPathGlobs(String... pathGlobs);
  URLClassifierBuilder matchesPathGlobs(Iterable<? extends String> pathGlobs);
  // Does not match any of the path globs where not(INVALID) == INVALID.
  URLClassifierBuilder notMatchesPathGlobs(String... pathGlobs);
  URLClassifierBuilder notMatchesPathGlobs(Iterable<? extends String> pathGlobs);
  // By default, path components like ("%2e", "%2e%2e") that, post decoding
  // are ambiguous with the special path components (".", "..") will not be
  // matched.  If these must be matched, then enable this but ensure that the
  // server that processes these deals with these path components correctly.
  // Default is TREAT_AS_INVALID
  URLClassifierBuilder matchesEncodedDots(EncodedDotStrategy strategy);
  enum EncodedDotStrategy {
    TREAT_AS_INVALID, DO_NOT_MATCH, MATCH_AS_PATH, MATCH_AS_DECODED
  }

  //// Sub-predicates of kind query      http://example.com/?MATCH_ME#...
  URLClassifierBuilder matchesQuery(QueryClassifier queryClassifier);
  // Reverses the predicate where not(INVALID) == INVALID.
  URLClassifierBuilder notMatchesQuery(QueryClassifier queryClassifier);

  //// Sub-predicates of kind fragment   http://example.com/#MATCH_ME
  URLClassifierBuilder matchesFragment(FragmentClassifier fragmentClassifier);
  // Reverses the predicate where not(INVALID) == INVALID.
  URLClassifierBuilder notMatchesFragment(FragmentClassifier fragmentClassifier);

  //// Sub-predicates of kind content    javascript:MATCH_ME data:foo/bar,MATCH_ME
  // Matches when the scheme-specific part matches the predicate.
  // This is applied after any content metadata is stripped and after decoding.
  // For example,
  // data: URLs have the mime-type and any base64 specifier stripped, and if the
  // base64 is specified, the content is base64 decoded;
  // blob: URLs have the origin stripped.
  URLClassifierBuilder matchesContent(ContentClassifier p);
}


/**
 * A URL predicate that only looks at the authority
 * which normally comprises hostname:port
 */
interface AuthorityClassifier extends URLClassifier {
  static AuthorityClassifier or(AuthorityClassifier... ps);
  static AuthorityClassifier or(Iterable<? extends AuthorityClassifier> ps);
}

/**
 * Builds a predicate over username:password@hostname:port style authorities.
 */
class AuthorityClassifierBuilder (
  AuthorityClassifier build();

  // Accepts names or numeric literals
  AuthorityClassifierBuilder matchesHosts(String... hostnames);
  AuthorityClassifierBuilder matchesHosts(InternetDomainName... addresses);
  AuthorityClassifierBuilder matchesHosts(Inet4Address... addresses);
  AuthorityClassifierBuilder matchesHosts(Inet6Address... addresses);
  /**
   * matchesHostGlob("**.example.com") matches any subdomain of example.com
   * including example.com.
   * <p>
   * matchesHostGlob("*.example.com") matches foo.example.com but neither
   * foo.bar.example.com nor example.com.
   * <p>
   * matchesHostGlob("example.*") matches "example." followed by any entry on
   * <a href="http://publicsuffix.org/">Mozilla's public suffix list</a> so will
   * match "example.com", "example.org", and "example.co.uk".
   * <p>
   * matchesHostGlob("**") matches any valid host including numeric IPAs.
   * <p>
   * One of "**." and "*." may appear at the beginning and ".*" may appear at
   * the end but otherwise, '*' may not appear in a hostname glob.
   */
  AuthorityClassifierBuilder matchesHostGlob(String... globs);

  // If a port matcher is specified we assume default ports based on scheme, so
  // matching ports (80, 443) matches http://example.com/ but not
  // https://example.com/
  // and https://example.com:80/ but not https://example.com:10000/
  AuthorityClassifierBuilder matchesPort(
      Predicate<? super Integer> hostMatcher);
  AuthorityClassifierBuilder matchesPort(int... ports);

  // Unless an authentication part predicate is specified no
  // URL with an authentication part will match, so
  // http://@example.com/ will not match.
  // unameMatcher receives the decoded username.
  AuthorityClassifierBuilder matchesUserName(
      Predicate<? super String> unameMatcher);

  // We intentionally do not allow matching against a password.
  // There is no way to match http://msamuel:hello-kitty@google.com/
  // via this API.  Also, get your own password.
}

/** A URLClassifier that considers only the query portion. */
interface QueryClassifier extends URLClassifier {
  // TODO: static or(...)
}

/**
 * Builds a predicate over URL queries.
 *
 * <p>The operators below operate on a query string like
 * "?key0=value0&key1=value1" after it has been decomposed into
 * a sequence of decoded key/value pairs.
 *
 * <p>For example, the query "?a=b%20c&a=d&e" specifies the
 * key value pairs [("a", "b c"), ("a", "d"), ("e", absent)].
 */
class QueryClassifierBuilder (
  QueryClassifier build();

  /**
   * Specify the keys that MAY appear -- all keys must match.
   * If no variant of {@link #mayHaveKeys} is called,
   * then any keys may appear.
   * Multiple calls union.
   *
   * <p>All variants of this method operate on keys post-percent decoding.
   */
  QueryClassifierBuilder mayHaveKeys(String... keys);
  QueryClassifierBuilder mayHaveKeys(Iterable<? extends String> keys);
  QueryClassifierBuilder mayHaveKeys(Predicate<? super String> p);

  /**
   * Specifies that the keys may not appear more than once.
   *
   * <p>All variants of this method operate on keys post-percent decoding.
   */
  QueryClassifierBuilder mayNotRepeatKeys(String... keys);
  QueryClassifierBuilder mayNotRepeatKeys(Iterable<? extends String> keys);
  QueryClassifierBuilder mayNotRepeatKeys(Predicate<? super String> p);

  /**
   * Specify which keys MUST appear ignoring order.
   * Does not match if any of the specified keys are missing.
   */
  QueryClassifierBuilder mustHaveKeys(String... keys);
  QueryClassifierBuilder mustHaveKeys(Iterable<? extends String> keys);

  /**
   * Specify that any values associated with the given key must match the
   * given predicate.
   * <ul>
   *   <li>For valueMustMatch("foo", p) the URI "?foo=bar" will cause a call
   *       p.apply(of("bar")).
   *   <li>For valueMustMatch("foo", p) the URI "?foo=" will cause a call
   *       p.apply(of("")).
   *   <li>For valueMustMatch("foo", p) the URI "?foo" will cause a call
   *       p.apply(absent).
   * </ul>
   * <p>The value received has been decoded.
   *
   * <p>This does not require that key appear.  If key appears multiple
   * times, the predicate will be applied to each value in the order
   * it appears.
   */
  QueryClassifierBuilder valueMustMatch(
      String key,
      Predicate<? super Optional<String>> valueClassifier);
}


/** A URLClassifier that considers only the fragment portion. */
interface FragmentClassifier extends URLClassifier {
  // TODO: static or(...)
}

/**
 * Builds a predicate over the fragment content excluding the '#'.
 * An absent fragment is equivalent to the empty fragment per RFC 3986.
 */
class FragmentClassifierBuilder (
  FragmentClassifier build();

  // If there is no fragment then p will receive the Optional.absent.
  // RFC 3986 says
  //    two URIs that differ only by the suffix "#" are considered
  //    different regardless of the scheme.
  FragmentClassifierBuilder matches(Predicate<? super Optional<String>> p);

  // It's not uncommon for single-page apps to embed a path in a query.
  // URLs will match when there is a fragment whose content passes as a relative
  // URL that matches p in the context of http://[special-unkown-host]/
  //
  // This requires that the fragment will be present, so to allow
  // no fragment OR the fragments described above, use FragmentClassifier.or(...).
  FragmentClassifierBuilder matchFragmentAsIfRelativeURL(URLClassifier p);
}


interface ContentClassifier extends URLClassifier {
  Classification applyToTextContent(CharSequence chars);
  Classification applyToBinaryContent(ByteBuffer bytes);
}
// TODO: collect use cases for content predicate and maybe provide a builder.


/**
 * Encapsulates knowledge about how to
 * recognize a scheme in a URL and how to decompose the URL
 * into parts that can be further checked.
 */
class Scheme {
  final ImmutableSet<String> lcaseNames;
  final boolean isHierarchical;
  final boolean naturallyHasAuthority;
  final boolean naturallyHasPath;
  final boolean naturallyHasQuery;
  final boolean naturallyEmbedsContent;
  final int defaultPortOrNegOne;

  PartRanges decompose(String urlText);

  String recompose(String partSource, PartRanges ranges);

  // Content can either be backed by a CharSequence or a ByteBuffer.
  ContentValue decodeContent(String urlText, PartRanges ranges);

  // Part ranges identifies the range of characters within
  // the scheme-specific part of a URL corresponding to
  //   * authority
  //   * path
  //   * query
  //   * content (for non-hierarchical)
  //   * contentMetadata (e.g. mime-type, base64 for data:)
  //   * fragment

  ...
}


/** Schemes understood by this library. */
class BuiltinScheme {
  public static final Scheme ABOUT;
  public static final Scheme BLOB;
  public static final Scheme DATA;
  public static final Scheme FILE;
  public static final Scheme HTTP;
  public static final Scheme HTTPS;
  public static final Scheme JAVASCRIPT;
  public static final Scheme MAILTO;
  public static final Scheme TEL;

  ...
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment