fpapado/0_Validating_URLs.md

## 0_Validating_URLs.md

      
Validating URLs is hard! On the face of it, you could use a RegEx to match on the URL structure, but that is likely only going to match a narrow subset of URLs, whether intentionally or accidentally.

Beyond having a "valid" URL, an application might apply extra conditions on top. For example, it might accept only http: or https: URLs, but not file: and others. This might be ok depending on the application, but knowing that takes some active design work to figure out. For the purpose of handling user errors and offering suggestions, it also helps to know in what manner the URL failed validation. For example, we might be able to offer precise suggestions for our application-specific errors, while general URL parsing issues can be out of scope.

To make life easier, let's separate the terms "valid" and "invalid" into a few different ones:

- Valid URL: a URL that can be parsed successfully according to the standard. I am reluctant to dive into the standard itself, or rather I feel unqualified to give you an overview of it. For the purposes of this write-up, I refer to the WHATWG (Web Hypertext Application Technology Working Group) URL parser, which exists in browsers and some other JS environments. If you have a good overview of URLs and all the specs and addenda, please let me know.
- Invalid URL: a URL that fails to parse (the parser returns an error)
- Acceptable URL: a URL that is valid, and passes additional requirements set out by the application
- Unacceptable URL: a URL that is invalid, or does not pass additional requirements set out by the application
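
To make the four terms concrete, here is a minimal sketch. The names `classifyUrl` and `httpsOnly` are hypothetical, and the https-only rule is only an assumed example of an application requirement:

```javascript
// Hypothetical helper: classify a string using the four terms defined above.
// `isAcceptable` stands in for whatever extra rules an application imposes.
function classifyUrl(value, isAcceptable) {
  let parsed;
  try {
    parsed = new URL(value); // WHATWG parser; throws if parsing fails
  } catch (err) {
    // An invalid URL is always unacceptable as well
    return {valid: false, acceptable: false};
  }
  return {valid: true, acceptable: isAcceptable(parsed)};
}

// Assumed application rule, for illustration only: accept https: URLs
const httpsOnly = parsed => parsed.protocol === 'https:';

console.log(classifyUrl('https://example.org', httpsOnly)); // valid, acceptable
console.log(classifyUrl('ftp://example.org', httpsOnly)); // valid, unacceptable
console.log(classifyUrl('not a url', httpsOnly)); // invalid, unacceptable
```

Keeping "valid" and "acceptable" as separate fields makes it harder to conflate the two when reporting errors.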

Another issue that comes up with URL parsing, then, is ensuring that a client's and a server's definitions of "acceptable" are in sync. It would be a problem if the server could accept URLs that the client then throws an error for, or even worse, that one client (e.g. a mobile app) finds acceptable, but another one (e.g. a web app) rejects. The more specific the definition of acceptable, the more chance of these issues popping up. This can also happen if the definition of "valid" is shaky. If the front-end uses a RegEx, but the back-end uses an RFC-compliant parser, it can be a recipe for bugs!

### Example

For example, suppose we want to verify that a URL:

- is valid
- is absolute
- has a top-level domain (TLD)
- has a scheme of http: or https:

We would first validate the URL (e.g. by trying to parse, and seeing if the parser returns an error). We would then do the additional checks. If we have parsed the URL to something more structured, these checks will be simpler to perform. Finally, we should probably expose the error reason, with whatever granularity we want, or that the parser allows for (not all parsers give granular errors, and not all of them are actionable without more design):

- Invalid URL (a relative URL is invalid if there is no base URL)
- Must have a top-level domain, for example https://example.com
- Must have a scheme of http: or https:

TODO: Give examples of acceptable/unacceptable URLs here

Some of these errors might be "recoverable" automatically. For example, you could assume a scheme of http: if one is missing, though that might be an insecure default. You probably only want to do these on the front-end, and have the backend be stricter (TODO: explain why?).
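
As a rough sketch of such a recovery on the front-end: the helper below retries parsing once with a default scheme prepended. `parseWithDefaultScheme` is a hypothetical name, and defaulting to https: rather than http: is an assumption made here to avoid the insecure default.

```javascript
// Hypothetical front-end helper: if parsing fails, retry once with a default scheme.
// Assumption: https: is used as the default, not http:, to avoid an insecure default.
function parseWithDefaultScheme(input, defaultScheme = 'https:') {
  const trimmed = input.trim();
  try {
    // The input already parses as-is; nothing to recover
    return {url: new URL(trimmed), recovered: false};
  } catch (err) {
    // Fall through and retry with the default scheme
  }
  try {
    return {url: new URL(`${defaultScheme}//${trimmed}`), recovered: true};
  } catch (err) {
    // Still invalid; report this to the user instead of guessing further
    return {url: null, recovered: false};
  }
}

console.log(parseWithDefaultScheme('example.org/path').url.href); // 'https://example.org/path'
console.log(parseWithDefaultScheme('exa mple').url); // null
```

The `recovered` flag lets the UI show the corrected URL to the user before submitting it. Note that the retry only triggers when parsing fails: an input such as `localhost:3000` already parses (with `localhost:` as its scheme), so it is not recovered here and would need its own check.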

### Error suggestions

### In JavaScript

Now we get to JavaScript, and parsing URLs in a nice way. Browsers (and some other JS environments) offer the URL constructor as a means of parsing strings into URL objects, which give structured access to their parts. The constructor throws if a URL is invalid, so we can use it to validate URLs! We can then use the structured access to accept or reject based on application-specific requirements.
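
A short illustration of the constructor and the structured access it gives. The URL itself is made up for the example:

```javascript
// Parse an absolute URL and read its parts
const url = new URL('https://sub.example.org:8080/path/to?query=1#section');

console.log(url.protocol); // 'https:'
console.log(url.hostname); // 'sub.example.org'
console.log(url.port); // '8080'
console.log(url.pathname); // '/path/to'
console.log(url.search); // '?query=1'
console.log(url.hash); // '#section'

// An unparseable string makes the constructor throw
let threw = false;
try {
  new URL('not a url');
} catch (err) {
  threw = true;
}
console.log(threw); // true
```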

Using a RegEx is a bad idea, because people might forget cases, or conflate "invalid URL" with "unacceptable for our application". For example, a hand-written RegEx can easily mishandle:

- Non-ASCII characters (valid URL; optionally you could transform to Punycode)
- A lack of a TLD (valid URL, maybe not acceptable in some applications)
- Single versus double slash after the protocol (http:/example.org still parses as a valid URL)
- Absolute versus relative (relative is fine, if you have a base set)
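
The cases above can be checked directly against the URL constructor. This is a small sketch; `parses` is a hypothetical helper, and the example URLs are illustrative:

```javascript
// Hypothetical helper: does the string parse, optionally against a base URL?
function parses(value, base) {
  try {
    new URL(value, base);
    return true;
  } catch (err) {
    return false;
  }
}

// Non-ASCII characters: valid, and the hostname is converted to Punycode
const greek = new URL('http://παράδειγμα.gr');
console.log(greek.hostname.startsWith('xn--')); // true

// No TLD: still a valid URL
console.log(parses('http://localhost')); // true

// Single slash after the protocol: parsed as if it had two
const singleSlash = new URL('http:/example.org');
console.log(singleSlash.href); // 'http://example.org/'

// Relative URL: invalid on its own, valid once a base is provided
console.log(parses('./index.html')); // false
const relative = new URL('./index.html', 'https://example.org/docs/');
console.log(relative.href); // 'https://example.org/docs/index.html'
```

Whether each of these should be acceptable is a separate, application-specific decision.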

TODO: Talk about "parse, don't validate"

  
## 1_index.js
// Change to toggle debug logging
const DEBUG = true;

/**
 * Validate a subset of URLs, that:
 * - is a valid URL (and absolute, because we provide no base URL)
 * - has a TLD
 * - has a protocol of http: or https:
 *
 * NOTE: A URL not having a TLD or http(s) as the protocol does not make it an **invalid** URL.
 * That is a deliberate restriction that our app imposes, on top of being a valid URL.
 *
 * @see https://url.spec.whatwg.org/#urls
 */
function isAbsoluteHttpUrlWithTld(value) {
  let url;
  try {
    url = new URL(value);
  } catch (err) {
    // Invalid URL; maybe there is no protocol, or it is relative, or any of the many other possible reasons
    return {isAccepted: false, reason: 'Invalid URL'};
  }

  // Check whether the protocol is http or https, and reject otherwise
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    return {isAccepted: false, reason: 'Protocol not http(s)'};
  }

  // Check if there is a TLD, i.e. whether the hostname contains at least one dot
  // (Not sure if this check offers anything for machines, but it can be good to give hints to a user)
  if (url.hostname.split('.').length === 1) {
    return {isAccepted: false, reason: 'No TLD'};
  }

  return {isAccepted: true, url};
}

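// Place RegEx behind getter, to avoid shared state (a RegEx with the `g` flag tracks `lastIndex`)
const getUrlRegEx = () => /^(http(s)?:\/\/)[\w-]+(\.[\w-]+)+(\S*)$/gim;

// Returns the same {isAccepted, reason} shape as isAbsoluteHttpUrlWithTld,
// so that main() can apply the same `isAccepted` filter to both validators
const isUrlV0 = value => {
  if (typeof value === 'string' && value.match(getUrlRegEx())) {
    return {isAccepted: true};
  }
  return {isAccepted: false, reason: 'Did not match RegEx'};
};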

function main() {
  const acceptedUrls = [
    // Some "straightforward" cases
    'https://www.example.org',
    'https://subdomain.example.org',
    // A single slash is valid
    'http:/example.org',
    // Internationalisation
    // Non-ASCII characters are allowed, both in domain and TLD names
    // Under the hood, they are translated to an LDH representation
    // Concretely:
    //  http://παράδειγμα.δοκιμή -> http://xn--hxajbheg2az3al.xn--jxalpdlp/
    //  @see https://en.wikipedia.org/wiki/Punycode
    // The URL constructor handles this conversion already, even if not needed
    // The Greek example.com (registered with IANA, but sunset as of 2013)
    // @see https://en.wikipedia.org/wiki/IDN_Test_TLDs
    'http://παράδειγμα.δοκιμή',
    // Similar to the above, but TLD is in ASCII
    'http://παράδειγμα.gr',
    // The above, but written directly in LDH
    'http://xn--hxajbheg2az3al.xn--jxalpdlp/',
    // There's more things here, like ports, but out of scope
  ];

  const unacceptedUrls = [
    // Valid, but not accepted by our app
    'http://example', // No TLD
    'ftp://example.org', // Unsupported protocol
    'myprotocol://whatever.com/yep', // Made up protocol
    // Invalid URLs
    './index.html', // Relative URL, invalid because we provide no base URL
    '://example.org', // Missing protocol
    'bugfreefi', // All messed up
    'asd http://example.org', // Additional characters
    'http://example.org asd', // Additional characters
    'http//bugfree/path/to/here', // Missing colon
  ];

  // Validate the URLs with each candidate validator
  let candidates = [
    {label: 'Base', validator: isUrlV0},
    {label: 'Complete', validator: isAbsoluteHttpUrlWithTld},
  ];

  for (let {label, validator} of candidates) {
    const unacceptedRes = unacceptedUrls
      .map(validator)
      .map(debugLog)
      .filter(it => it.isAccepted === true);

    const acceptedRes = acceptedUrls
      .map(validator)
      .map(debugLog)
      .filter(it => it.isAccepted === true);

    console.log(
      `(${label}) Are all unacceptable URLs caught? ${unacceptedRes.length === 0}`,
    );

    console.log(
      `(${label}) Are all acceptable URLs ok? ${acceptedRes.length ===
        acceptedUrls.length}`,
    );
  }
}

function debugLog(val) {
  if (DEBUG === true) {
    console.log(val);
  }
  return val;
}

main();