@seandstewart
Last active October 3, 2019 08:51
url regex regular expression pattern python
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
I developed the following regex pattern after researching popular libraries which support some form of URL validation,
for use in my own forthcoming implementation in my library, `typical`. You can see a sneak peek here:
https://github.com/seandstewart/typical/blob/schema/typic/types/url.py

The pattern is largely based upon marshmallow's regex pattern, found here:
https://github.com/marshmallow-code/marshmallow/blob/298870ef6c089fb4d91efae9ca4168453ffe00d2/marshmallow/validate.py#L37
and on pydantic's implementation of the same, found here:
https://github.com/samuelcolvin/pydantic/blob/5015a7e48bc869adf99b78eb38075951549e9ea7/pydantic/utils.py#L156

However, I had a few issues with those implementations:
1. The pattern is generated at call time, based upon external parameters.
2. It's hard to debug inline while developing.
3. The individual pieces of the network address are unnamed, so after matching a string it's hard to see *what* was matched.

The following is a single pattern, compiled once when the module is imported. In my mind this gives us a few advantages:
1. Using one multi-line pattern with the `re.VERBOSE` flag makes it readable and easy to debug.
2. We have a single pattern that works for all network addresses.
3. Unlike `yarl`'s `URL` and `urllib.parse`, it properly identifies resources which are only a host.
   - `foo.bar`, for example, is matched to the `host` group rather than being treated as a path.
4. Named groups give us immediate insight into the properties of the resource we're validating against.

As a bonus, I've supplied a regex pattern to identify whether an IP address is internal-only.
"""
import re
NET_ADDR_PATTERN = re.compile(
    r"""
    ^
    (
        # Scheme
        ((?P<scheme>(?:[a-z0-9\.\-\+]*))://)?
        # Auth
        (?P<auth>(?:(?P<username>[^:@]+?)[:@](?P<password>[^:@]*?)[:@]))?
        # Host
        (?P<host>(?:
            # IPV4: tried before the domain, which would otherwise claim most dotted quads
            (?P<ipv4>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
            # IPV6
            |(?P<ipv6>\[[A-F0-9]*:[A-F0-9:]+\])
            # Domain
            |(?P<domain>
                (?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+
                (?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)
            )
            # Localhost
            |(?P<localhost>localhost)
            # Dotless host
            |(?P<dotless>(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.?))
        ))?
        # Port
        (:(?P<port>(?:\d+)))?
    )?
    # Path, Q-string & fragment
    (?P<relative>(?:/?|[/?#]\S+))
    $
    """,
    re.IGNORECASE | re.VERBOSE,
)
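INTERNAL_IP_PATTERN = re.compile(
    r"""
    # Group the alternatives so that ^ applies to every branch, not just the first.
    ^(?:
        # IPv4: loopback (127.0.0.0/8) and the RFC 1918 private ranges
        127\.
        |192\.168\.
        |10\.
        |172\.1[6-9]\.
        |172\.2[0-9]\.
        |172\.3[0-1]\.
        # IPv6: loopback and unique local addresses (fc00::/7)
        |::1$
        |f[cd][0-9a-f]{2}:
    )
    """,
    re.I | re.VERBOSE,
)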
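
The named groups are easiest to appreciate on a few concrete inputs. What follows is a minimal sketch rather than anything taken from `typical`: `parse_net_addr` is a hypothetical helper introduced only for illustration, and the pattern is restated so the block runs on its own, with the IP alternatives listed ahead of the domain alternative so that dotted quads land in the `ipv4` group.

```python
import re

# Restated so this snippet is self-contained. Assumption for this sketch:
# the IP alternatives are tried before the domain alternative.
NET_ADDR_PATTERN = re.compile(
    r"""
    ^
    (
        ((?P<scheme>(?:[a-z0-9\.\-\+]*))://)?
        (?P<auth>(?:(?P<username>[^:@]+?)[:@](?P<password>[^:@]*?)[:@]))?
        (?P<host>(?:
            (?P<ipv4>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
            |(?P<ipv6>\[[A-F0-9]*:[A-F0-9:]+\])
            |(?P<domain>
                (?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+
                (?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)
            )
            |(?P<localhost>localhost)
            |(?P<dotless>(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.?))
        ))?
        (:(?P<port>(?:\d+)))?
    )?
    (?P<relative>(?:/?|[/?#]\S+))
    $
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_net_addr(value):
    """Hypothetical helper: return the non-empty named groups, or None if nothing matches."""
    match = NET_ADDR_PATTERN.match(value)
    if match is None:
        return None
    # Drop groups that did not participate (None) or matched an empty string.
    return {name: text for name, text in match.groupdict().items() if text}


for candidate in (
    "https://user:pass@example.com:8080/path?q=1#frag",
    "foo.bar",
    "localhost:8000",
    "192.168.10.12:5000",
    "/just/a/path",
    "not a url",
):
    print(candidate, "->", parse_net_addr(candidate))
```

A bare `foo.bar` comes back with `host` and `domain` set and no `relative` part, which is the host-only behavior described in the docstring, while a string containing whitespace does not match at all. Note that the pattern only checks shape: octets in `ipv4` are not range-checked, so a stricter validator would still need a follow-up step.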