kmike/notebook.ipynb

## notebook.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fd8192cf",
   "metadata": {},
   "source": [
    "Two implementations of url_has_any_extension:\n",
    "\n",
    "* The one merged in https://github.com/scrapy/scrapy/pull/5450 (url_has_any_extension_27)\n",
    "* The version used in Scrapy 2.6 (url_has_any_extension_26)\n",
    "\n",
    "The new implementation is more correct because it also matches multi-part extensions such as .tar.gz; the old one relies on posixpath.splitext, which returns only the last suffix (.gz)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "8f646b9b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import posixpath\n",
    "\n",
    "from scrapy.utils.url import parse_url\n",
    "\n",
    "def url_has_any_extension_27(url, extensions):\n",
    "    \"\"\"Return True if the url ends with one of the extensions provided\"\"\"\n",
    "    lowercase_path = parse_url(url).path.lower()\n",
    "    return any(lowercase_path.endswith(ext) for ext in extensions)\n",
    "\n",
    "\n",
    "def url_has_any_extension_26(url, extensions):\n",
    "    \"\"\"Return True if the url ends with one of the extensions provided\"\"\"\n",
    "    return posixpath.splitext(parse_url(url).path)[1].lower() in extensions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccf338bd",
   "metadata": {},
   "source": [
    "Let's use the extension list from Scrapy's linkextractors; it's going to be by far the most commonly used list of extensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "3fbcbd78",
   "metadata": {},
   "outputs": [],
   "source": [
    "from scrapy.linkextractors import IGNORED_EXTENSIONS\n",
    "\n",
    "# Extensions must start with \".\"; FilteringLinkExtractor does the same.\n",
    "# We're using set because LinkExtractor uses set for deny_extensions.\n",
    "extensions = {'.' + e for e in IGNORED_EXTENSIONS}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1b93a99",
   "metadata": {},
   "source": [
    "Case 1: a URL where an extension is present.\n",
    "\n",
    "Note that Scrapy's link extractor calls parse_url before passing the URL to url_has_any_extension; we'll do the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "305279c3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(True, True)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url = parse_url(\"http://example.com/files/1ch124h1/video.mp4?hello\")\n",
    "url_has_any_extension_26(url, extensions), url_has_any_extension_27(url, extensions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "ebd8a37b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.14 µs ± 4.84 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit url_has_any_extension_26(url, extensions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "60af1899",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7.44 µs ± 6.11 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit url_has_any_extension_27(url, extensions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ce4aa3d",
   "metadata": {},
   "source": [
    "The new version is about 6.5x slower (7.44 µs vs. 1.14 µs per call), but both are very fast in absolute terms. There is probably nothing to worry about.\n",
    "\n",
    "Case 2: the URL's extension (.html) is not in the list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "f56ff91c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(False, False)"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url = parse_url(\"http://example.com/files/1ch124h1/page.html?hello\")\n",
    "url_has_any_extension_26(url, extensions), url_has_any_extension_27(url, extensions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "9d16127d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.12 µs ± 7.14 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit url_has_any_extension_26(url, extensions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "ad968d6d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10.4 µs ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit url_has_any_extension_27(url, extensions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43a62e75",
   "metadata": {},
   "source": [
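    "Again, the new version is slower (10.4 µs vs. 1.12 µs per call, roughly 9x); with no match, any() has to check every extension in the set. Both are still very fast, so there is probably nothing to worry about."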
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
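
The notebook's opening claim, that the new implementation handles extensions like .tar.gz while the Scrapy 2.6 one does not, can be checked with a minimal sketch. It makes several assumptions: `urllib.parse.urlparse` stands in for `scrapy.utils.url.parse_url` so the snippet runs without Scrapy installed, the function names `has_ext_splitext` and `has_ext_endswith` are hypothetical, and the two-element extension set is illustrative rather than Scrapy's `IGNORED_EXTENSIONS`.

```python
import posixpath
from urllib.parse import urlparse  # assumed stand-in for scrapy.utils.url.parse_url


def has_ext_splitext(url, extensions):
    """Scrapy 2.6 style (hypothetical name): compare only the last suffix."""
    return posixpath.splitext(urlparse(url).path)[1].lower() in extensions


def has_ext_endswith(url, extensions):
    """PR 5450 style (hypothetical name): does the lowercased path end with any extension?"""
    lowercase_path = urlparse(url).path.lower()
    return any(lowercase_path.endswith(ext) for ext in extensions)


# Illustrative extension set, not Scrapy's IGNORED_EXTENSIONS.
extensions = {".tar.gz", ".mp4"}
url = "http://example.com/files/archive.tar.gz?hello"

# splitext() sees only ".gz", which is not in the set; endswith() sees ".tar.gz".
old_result = has_ext_splitext(url, extensions)
new_result = has_ext_endswith(url, extensions)
print(old_result, new_result)
```

For single-part extensions such as .mp4 the two functions agree, which matches the `(True, True)` result in Case 1.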