Validating URLs is hard! On the face of it, you could use a RegEx to match on the URL structure, but that is likely only going to match a narrow subset of URLs, whether intentionally or accidentally.
Beyond having a "valid" URL, an application might be applying extra conditions on top. For example, accepting only https: URLs, but not file: and others. This might be ok depending on the application, but knowing that takes some active design work to figure out. For the purpose of handling user errors, and offering suggestions, it also helps to know in what manner the URL failed validation. For example, we might be able to offer precise suggestions for our application-specific errors, while general URL parsing issues can be out of scope.
To make life easier, let's separate the term "valid" and "invalid" into a few different ones:
- Valid URL: a URL that can be parsed successfully according to the standard. (I am reluctant to dive into the standard itself, or rather I feel unqualified to give you an overview of it. For the purpose of this, I will be referring to the WHATWG (Web Hypertext Application Technology Working Group) URL parser, which exists in browsers and some other JS environments. If you have a good overview of URLs and all the specs and addenda, please let me know.)
- Invalid URL: a URL that gives a validation error when parsing
- Acceptable URL: a URL that is valid, and passes additional requirements set out by the application
- Unacceptable URL: a URL that is invalid, or does not pass additional requirements set out by the application
Another issue that comes up with URL parsing, then, is ensuring that the client's and the server's definitions of "acceptable" are in sync. It would be a problem if the server could accept URLs that the client then throws an error for, or even worse, if one client (e.g. a mobile app) finds a URL acceptable, but another one (e.g. a web app) rejects it. The more specific the definition of acceptable, the more chance of these issues popping up. This can also happen if the definition of "valid" is shaky. If the front-end uses a RegEx, but the back-end uses an RFC-compliant parser, it can be a recipe for bugs!
For example, suppose we want to verify that a URL:
- Has a top-level domain (TLD)
- Has a scheme of http: or https:
We would first validate the URL (e.g. by trying to parse, and seeing if the parser returns an error). We would then do the additional checks. If we have parsed the URL to something more structured, these checks will be simpler to perform. Finally, we should probably expose the error reason, with whatever granularity we want, or that the parser allows for (not all parsers give granular errors, and not all of them are actionable without more design):
- Invalid URL (a relative URL will be invalid if there is no base URL)
- Must have a top-level domain, for example https://example.com
- Must have a scheme of http: or https:
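The validate-then-check flow and the three error reasons above can be sketched roughly as follows. This is a minimal sketch, assuming the WHATWG URL constructor is available as a global (as it is in browsers and recent Node.js). The names UrlCheck and checkUrl are hypothetical, and the TLD test is a naive heuristic of my own (at least two dot-separated labels, with a last label that is not purely numeric), not something the URL standard defines.

```typescript
// Hypothetical result type: either the parsed URL, or one coarse reason.
type UrlCheck =
  | { ok: true; url: URL }
  | { ok: false; reason: "invalid-url" | "bad-scheme" | "missing-tld" };

function checkUrl(input: string): UrlCheck {
  let url: URL;
  try {
    // Step 1: validity. The constructor throws when parsing fails,
    // e.g. for a relative URL with no base.
    url = new URL(input);
  } catch {
    return { ok: false, reason: "invalid-url" };
  }

  // Step 2: application-specific rules, checked on the structured result.
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { ok: false, reason: "bad-scheme" };
  }

  // Assumption: "has a TLD" is approximated as "hostname has at least two
  // non-empty labels and the last one is not all digits". This rejects
  // "localhost" and IPv4 addresses, but knows nothing about real TLDs.
  const labels = url.hostname.split(".").filter((label) => label.length > 0);
  const last = labels[labels.length - 1] ?? "";
  if (labels.length < 2 || /^\d+$/.test(last)) {
    return { ok: false, reason: "missing-tld" };
  }

  return { ok: true, url };
}
```

With this sketch, "https://example.com" is accepted, "/relative/path" is reported as invalid-url, "file:///etc/hosts" as bad-scheme, and "http://localhost" as missing-tld. Keeping the reason as a small closed set makes it easier to map each one to a user-facing message.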
TODO: Give examples of acceptable/unacceptable URLs here
Some of these errors might be "recoverable" automatically. For example, you could assume a scheme of http: if one is missing, though that might be an insecure default. You probably only want to do this kind of recovery on the front-end, and have the back-end be stricter (TODO: explain why?).
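As a rough illustration of front-end recovery, the sketch below retries a failed parse with a scheme prepended and returns the result as a suggestion rather than silently accepting it. The helper name suggestWithScheme is hypothetical, and defaulting to https: rather than http: is my assumption, chosen because of the insecure-default concern.

```typescript
// Hypothetical helper: returns a suggested absolute URL, or null when there
// is nothing to suggest (input already parses, or cannot be rescued).
function suggestWithScheme(
  input: string,
  defaultScheme: string = "https:" // assumption: prefer the secure scheme
): string | null {
  const trimmed = input.trim();

  try {
    new URL(trimmed);
    return null; // already parses on its own; no recovery needed
  } catch {
    // Fall through and try again with a scheme in front.
  }

  try {
    return new URL(`${defaultScheme}//${trimmed}`).href;
  } catch {
    return null; // still not parseable; leave it to the user
  }
}
```

Here suggestWithScheme("example.com/docs") yields "https://example.com/docs", which the UI could offer as a "did you mean" prompt. One caveat: input such as "localhost:3000" already parses (with "localhost:" read as the scheme), so this sketch will not offer a suggestion for it.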
In JavaScript, the WHATWG URL constructor (new URL(input)) will throw if a URL is invalid, so we can use it to validate URLs! We can then use the structured access to accept/reject based on application-specific parts.
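To make "throws if invalid" and "structured access" concrete, here is a small sketch assuming the parser in question is the global WHATWG URL constructor. The function name describeUrl and its output format are made up for illustration.

```typescript
// Hypothetical helper: summarise the parsed parts, or report a parse failure.
function describeUrl(input: string): string {
  try {
    const url = new URL(input); // throws on parse failure
    // Once parsed, the parts are plain properties; no string slicing needed.
    return `scheme=${url.protocol} host=${url.hostname} path=${url.pathname}`;
  } catch {
    return `invalid: ${input}`;
  }
}

console.log(describeUrl("https://example.com/a/b?x=1"));
console.log(describeUrl("not a url"));
```

The first call reports the https: scheme, the example.com host, and the /a/b path. The second falls into the catch branch because the input has no scheme and there is no base to resolve it against.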
Using a RegEx is a bad idea, because people might forget cases, or conflate "invalid URL" with "unacceptable for our application". For example:
- Non-ASCII characters (valid URL, optionally you could transform to Punycode)
- A lack of a TLD (valid URL, maybe not acceptable in some application)
- Single versus double slash after the protocol
- Absolute versus relative (relative is fine, if you have a base set)
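The cases in this list can be checked directly against a parser instead of being encoded in a pattern. The sketch below assumes the global WHATWG URL constructor; tryParse and the example hostnames are illustrative only.

```typescript
// Hypothetical helper: parse with an optional base, returning null on failure.
function tryParse(input: string, base?: string): URL | null {
  try {
    return new URL(input, base);
  } catch {
    return null;
  }
}

// Non-ASCII host: parses fine, and the hostname is exposed in Punycode form.
const idn = tryParse("https://münchen.example/");

// No TLD: still a valid URL; whether it is acceptable is up to the application.
const noTld = tryParse("http://localhost/");

// Single slash after an http(s) scheme: the parser tolerates and normalises it.
const singleSlash = tryParse("https:/example.com/path");

// Relative input: invalid on its own, valid once a base is supplied.
const relativeAlone = tryParse("/docs");
const relativeWithBase = tryParse("/docs", "https://example.com");

console.log(idn?.hostname, noTld?.hostname, singleSlash?.href);
console.log(relativeAlone, relativeWithBase?.href);
```

A hand-written RegEx would need an explicit decision for each of these, and it is easy for that decision to end up differing from what the parser on the other side of the wire does.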
TODO: Talk about "parse, don't validate"