@kmike
Last active October 16, 2022 16:57
url_has_any_extension benchmark
{
"cells": [
{
"cell_type": "markdown",
"id": "fd8192cf",
"metadata": {},
"source": [
"Two implementations of url_has_any_extension:\n",
"\n",
"* The one merged in https://github.com/scrapy/scrapy/pull/5450 (url_has_any_extension_27)\n",
"* Version which is used in Scrapy 2.6 (url_has_any_extension_26)\n",
"\n",
"The new implementation is more correct, because it works for extensions like .tar.gz."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8f646b9b",
"metadata": {},
"outputs": [],
"source": [
"import posixpath\n",
"\n",
"from scrapy.utils.url import parse_url\n",
"\n",
"def url_has_any_extension_27(url, extensions):\n",
" \"\"\"Return True if the url ends with one of the extensions provided\"\"\"\n",
" lowercase_path = parse_url(url).path.lower()\n",
" return any(lowercase_path.endswith(ext) for ext in extensions)\n",
"\n",
"\n",
"def url_has_any_extension_26(url, extensions):\n",
" \"\"\"Return True if the url ends with one of the extensions provided\"\"\"\n",
" return posixpath.splitext(parse_url(url).path)[1].lower() in extensions"
]
},
{
"cell_type": "markdown",
"id": "ccf338bd",
"metadata": {},
"source": [
"Let's use extension list from Scrapy's linkextractors; it's going the be by far the most common list of extensions used."
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "3fbcbd78",
"metadata": {},
"outputs": [],
"source": [
"from scrapy.linkextractors import IGNORED_EXTENSIONS\n",
"\n",
"# Extensions must start with \".\"; FilteringLinkExtractor does the same.\n",
"# We're using set because LinkExtractor uses set for deny_extensions.\n",
"extensions = {'.' + e for e in IGNORED_EXTENSIONS}"
]
},
{
"cell_type": "markdown",
"id": "b1b93a99",
"metadata": {},
"source": [
"Case 1: an URL where an extension is present.\n",
"\n",
"Note than Scrapy's Link Extractor uses parse_url before passing URL to url_has_any_extension; we'll do the same."
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "305279c3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(True, True)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = parse_url(\"http://example.com/files/1ch124h1/video.mp4?hello\")\n",
"url_has_any_extension_26(url, extensions), url_has_any_extension_27(url, extensions)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "ebd8a37b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.14 µs ± 4.84 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)\n"
]
}
],
"source": [
"%timeit url_has_any_extension_26(url, extensions)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "60af1899",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"7.44 µs ± 6.11 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)\n"
]
}
],
"source": [
"%timeit url_has_any_extension_27(url, extensions)"
]
},
{
"cell_type": "markdown",
"id": "5ce4aa3d",
"metadata": {},
"source": [
"New version is slower, but both are super-fast. There is probably nothing to worry about.\n",
"\n",
"Case 2: extension is not present in URL."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "f56ff91c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(False, False)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = parse_url(\"http://example.com/files/1ch124h1/page.html?hello\")\n",
"url_has_any_extension_26(url, extensions), url_has_any_extension_27(url, extensions)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "9d16127d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.12 µs ± 7.14 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)\n"
]
}
],
"source": [
"%timeit url_has_any_extension_26(url, extensions)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "ad968d6d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10.4 µs ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)\n"
]
}
],
"source": [
"%timeit url_has_any_extension_27(url, extensions)"
]
},
{
"cell_type": "markdown",
"id": "43a62e75",
"metadata": {},
"source": [
"Again, the new version is slower, but both are very fast. There is probably nothing to worry about."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
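To make the .tar.gz difference concrete, here is a minimal standalone sketch. It assumes the standard library's urllib.parse.urlparse in place of Scrapy's parse_url, and the function names and the two-item extension set are hypothetical, chosen only for illustration; whether .tar.gz is actually part of IGNORED_EXTENSIONS is not checked here.

```python
import posixpath
from urllib.parse import urlparse


def has_ext_splitext(url, extensions):
    """Scrapy 2.6 style check: compare only the last suffix of the path."""
    return posixpath.splitext(urlparse(url).path)[1].lower() in extensions


def has_ext_endswith(url, extensions):
    """PR 5450 style check: test the whole lowercased path against each extension."""
    lowercase_path = urlparse(url).path.lower()
    return any(lowercase_path.endswith(ext) for ext in extensions)


# Hypothetical extension set, used only to show the behavioural difference.
extensions = {".tar.gz", ".mp4"}
url = "http://example.com/files/archive.tar.gz?hello"

# splitext yields ".gz", which is not in the set, so the old check misses it.
print(has_ext_splitext(url, extensions))  # False
# The path ends with ".tar.gz", so the new check matches.
print(has_ext_endswith(url, extensions))  # True
```

For single-suffix extensions such as .mp4 the two checks agree; they differ only when an extension itself contains a dot.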