Created
December 7, 2021 17:13
Save jsoma/08b0dffabe95a8fcd99048efc0560d0c to your computer and use it in GitHub Desktop.
How to download a list of files using Python
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"id": "30f8ab6d", | |
"metadata": {}, | |
"source": [ | |
"# Downloading a list of files with Python \n", | |
"\n", | |
"Let's say I have a long list of PDFs I want to download, like... a list of school board minutes?\n", | |
"\n", | |
"http://www.vineland.org/board-of-education/board-meeting-minutes-2021" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "7377b0fa", | |
"metadata": {}, | |
"source": [ | |
"# Just the answer, immediately!\n", | |
"\n", | |
"## The perfect way" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"id": "bbb34bd6", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"application/vnd.jupyter.widget-view+json": { | |
"model_id": "d132c26a26c547b3995594d77e4e3f9f", | |
"version_major": 2, | |
"version_minor": 0 | |
}, | |
"text/plain": [ | |
" 0%| | 0/5 [00:00<?, ?it/s]" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"http://www.vineland.org/sites/default/files/10_06_21 Combined Meeting Minutes (1).pdf\n", | |
"http://www.vineland.org/sites/default/files/09_15_21 Combined Meeting Minutes (1).pdf\n", | |
"http://www.vineland.org/sites/default/files/08_04_21 Combined Meeting Minutes.pdf\n", | |
"http://www.vineland.org/sites/default/files/07_28_21 Board Retreat Minutes.pdf\n", | |
"http://www.vineland.org/sites/default/files/07_07_21 Combined Meeting Minutes.pdf\n" | |
] | |
} | |
], | |
"source": [ | |
"from pathlib import Path\n", | |
"from tqdm.auto import tqdm\n", | |
"import requests\n", | |
"\n", | |
"urls = open('files.txt').read().splitlines()\n", | |
"output_dir = Path('downloaded')\n", | |
"output_dir.mkdir(parents=True, exist_ok=True)\n", | |
"\n", | |
"for url in tqdm(urls):\n", | |
" print(url)\n", | |
" \n", | |
" filename = Path(url).name\n", | |
" \n", | |
" response = requests.get(url)\n", | |
" output_dir.joinpath(filename).write_bytes(response.content)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "4f4c8ffd", | |
"metadata": {}, | |
"source": [ | |
"## The fewer-lines-of-code-but-less-flexible way" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"id": "584b4463", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"http://www.vineland.org/sites/default/files/10_06_21%20Combined%20Meeting%20Minutes%20%281%29.pdf\n", | |
"http://www.vineland.org/sites/default/files/09_15_21%20Combined%20Meeting%20Minutes%20%281%29.pdf\n", | |
"http://www.vineland.org/sites/default/files/08_04_21%20Combined%20Meeting%20Minutes.pdf\n", | |
"http://www.vineland.org/sites/default/files/07_28_21%20%20Board%20Retreat%20Minutes.pdf\n", | |
"http://www.vineland.org/sites/default/files/07_07_21%20Combined%20Meeting%20Minutes.pdf\n" | |
] | |
} | |
], | |
"source": [ | |
"import urllib.request\n", | |
"from os import path\n", | |
"\n", | |
"urls = open('files-alt.txt').read().splitlines()\n", | |
"\n", | |
"for url in urls:\n", | |
" filename = path.basename(url)\n", | |
" urllib.request.urlretrieve(url, filename)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "3fcff50f", | |
"metadata": {}, | |
"source": [ | |
"# Downloading files one at a time\n", | |
"\n", | |
"* `http://www.vineland.org/sites/default/files/10_06_21%20Combined%20Meeting%20Minutes%20%281%29.pdf`\n", | |
"* `http://www.vineland.org/sites/default/files/10_06_21 Combined Meeting Minutes (1).pdf`" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "fb16c796", | |
"metadata": {}, | |
"source": [ | |
"### Method one: `urllib`" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"id": "f1b55ebe", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import urllib.request" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"id": "5276e652", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"url = \"http://www.vineland.org/sites/default/files/10_06_21%20Combined%20Meeting%20Minutes%20%281%29.pdf\"\n", | |
"filename = \"output.pdf\"" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"id": "0d707141", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"('output.pdf', <http.client.HTTPMessage at 0x112fd8be0>)" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"urllib.request.urlretrieve(url, filename)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"id": "082e9989", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"url = \"http://www.vineland.org/sites/default/files/10_06_21 Combined Meeting Minutes (1).pdf\"\n", | |
"filename = \"output.pdf\"\n", | |
"\n", | |
"# urllib rejects URLs that contain unencoded spaces, so this call is left commented out\n",
"# urllib.request.urlretrieve(url, filename)"
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "abcec260", | |
"metadata": {}, | |
"source": [ | |
"### Method two: `requests` and `pathlib`" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"id": "cb621e08", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# pip install requests" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"id": "aaafc851", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import requests\n", | |
"from pathlib import Path" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"id": "a32a4b88", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"175404" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"url = \"http://www.vineland.org/sites/default/files/10_06_21%20Combined%20Meeting%20Minutes%20%281%29.pdf\"\n", | |
"filename = \"output.pdf\"\n", | |
"\n", | |
"response = requests.get(url)\n", | |
"# PDFs are binary data, so save response.content with .write_bytes, not .write_text\n",
"Path(filename).write_bytes(response.content)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"id": "1a551770", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"175404" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"url = \"http://www.vineland.org/sites/default/files/10_06_21 Combined Meeting Minutes (1).pdf\"\n", | |
"filename = \"output.pdf\"\n", | |
"\n", | |
"response = requests.get(url)\n", | |
"Path(filename).write_bytes(response.content)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "a59f77b4", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "9da7f6cc", | |
"metadata": {}, | |
"source": [ | |
"# Automatic filenames\n", | |
"\n", | |
"### Method one: `os.path`" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"id": "ef8b5dc6", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Saving to 10_06_21 Combined Meeting Minutes (1).pdf\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"175404" | |
] | |
}, | |
"execution_count": 18, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"from os import path\n", | |
"\n", | |
"url = \"http://www.vineland.org/sites/default/files/10_06_21 Combined Meeting Minutes (1).pdf\"\n", | |
"\n", | |
"filename = path.basename(url)\n", | |
"print(\"Saving to\", filename)\n", | |
"\n", | |
"response = requests.get(url)\n", | |
"Path(filename).write_bytes(response.content)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "0c979f17", | |
"metadata": {}, | |
"source": [ | |
"### Method two: `pathlib`" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"id": "17178dda", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"I'm going to save to 10_06_21 Combined Meeting Minutes (1).pdf\n" | |
] | |
} | |
], | |
"source": [ | |
"filename = Path(url).name\n", | |
"print(\"I'm going to save to\", filename)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"id": "473cc29d", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"175404" | |
] | |
}, | |
"execution_count": 23, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"url = \"http://www.vineland.org/sites/default/files/10_06_21 Combined Meeting Minutes (1).pdf\"\n", | |
"\n", | |
"# Hey Path, pull out the filename we want to save it as\n", | |
"filename = Path(url).name\n", | |
"\n", | |
"# Hey requests, go get the file\n",
"response = requests.get(url)\n", | |
"\n", | |
"# Hey both of you, work together to save it to the filename\n", | |
"Path(filename).write_bytes(response.content)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "ae3d7ab1", | |
"metadata": {}, | |
"source": [ | |
"# From a plain list of files\n", | |
"\n", | |
"a.k.a. `readlines()` is awful and tricky" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"id": "8d6f222c", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"['http://www.vineland.org/sites/default/files/10_06_21 Combined Meeting Minutes (1).pdf',\n", | |
" 'http://www.vineland.org/sites/default/files/09_15_21 Combined Meeting Minutes (1).pdf',\n", | |
" 'http://www.vineland.org/sites/default/files/08_04_21 Combined Meeting Minutes.pdf',\n", | |
" 'http://www.vineland.org/sites/default/files/07_28_21 Board Retreat Minutes.pdf',\n", | |
" 'http://www.vineland.org/sites/default/files/07_07_21 Combined Meeting Minutes.pdf']" | |
] | |
}, | |
"execution_count": 29, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# open(\"files.txt\").readlines() is sad and bad and includes \\n\n", | |
"# whereas this is better and just includes the full URLs\n", | |
"urls = open(\"files.txt\").read().splitlines()\n", | |
"urls" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"id": "0d12249d", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"------\n", | |
"http://www.vineland.org/sites/default/files/10_06_21 Combined Meeting Minutes (1).pdf\n", | |
"I want to save this as 10_06_21 Combined Meeting Minutes (1).pdf\n", | |
"------\n", | |
"http://www.vineland.org/sites/default/files/09_15_21 Combined Meeting Minutes (1).pdf\n", | |
"I want to save this as 09_15_21 Combined Meeting Minutes (1).pdf\n", | |
"------\n", | |
"http://www.vineland.org/sites/default/files/08_04_21 Combined Meeting Minutes.pdf\n", | |
"I want to save this as 08_04_21 Combined Meeting Minutes.pdf\n", | |
"------\n", | |
"http://www.vineland.org/sites/default/files/07_28_21 Board Retreat Minutes.pdf\n", | |
"I want to save this as 07_28_21 Board Retreat Minutes.pdf\n", | |
"------\n", | |
"http://www.vineland.org/sites/default/files/07_07_21 Combined Meeting Minutes.pdf\n", | |
"I want to save this as 07_07_21 Combined Meeting Minutes.pdf\n" | |
] | |
} | |
], | |
"source": [ | |
"for url in urls:\n", | |
" print(\"------\")\n", | |
" print(url)\n", | |
" filename = Path(url).name\n", | |
" print(\"I want to save this as\", filename)\n", | |
" response = requests.get(url)\n", | |
" Path(filename).write_bytes(response.content)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "eb167704", | |
"metadata": {}, | |
"source": [ | |
"## Adding a progress bar" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"id": "bb60d003", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"#!pip install tqdm" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"id": "856964b8", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"application/vnd.jupyter.widget-view+json": { | |
"model_id": "f24b8465644f4e51a5ae9042f34fa03b", | |
"version_major": 2, | |
"version_minor": 0 | |
}, | |
"text/plain": [ | |
" 0%| | 0/5 [00:00<?, ?it/s]" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"from tqdm.auto import tqdm\n", | |
"\n", | |
"for url in tqdm(urls):\n", | |
" filename = Path(url).name\n", | |
" response = requests.get(url)\n", | |
" Path(filename).write_bytes(response.content)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "6214c706", | |
"metadata": {}, | |
"source": [ | |
"## Saving into a separate folder" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 43, | |
"id": "f2f92848", | |
"metadata": {}, | |
"outputs": [], | |
"source": [
"download_dir = Path('downloads/pdfs/secret-pdfs')"
]
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 44, | |
"id": "be6d420d", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"'07_07_21 Combined Meeting Minutes.pdf'" | |
] | |
}, | |
"execution_count": 44, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"filename" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 45, | |
"id": "842f80de", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"PosixPath('downloads/pdfs/secret-pdfs/07_07_21 Combined Meeting Minutes.pdf')" | |
] | |
}, | |
"execution_count": 45, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"download_dir.joinpath(filename)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 46, | |
"id": "4f71a8d1", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"application/vnd.jupyter.widget-view+json": { | |
"model_id": "1422a21eefb54d39bae9242d0d211d5a", | |
"version_major": 2, | |
"version_minor": 0 | |
}, | |
"text/plain": [ | |
" 0%| | 0/5 [00:00<?, ?it/s]" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"download_dir = Path('downloads/pdfs/secret-pdfs')\n", | |
"download_dir.mkdir(parents=True, exist_ok=True)\n", | |
"\n", | |
"for url in tqdm(urls):\n", | |
" filename = Path(url).name\n", | |
" response = requests.get(url)\n", | |
" download_dir.joinpath(filename).write_bytes(response.content)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "deef6759", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "dbcc3a04", | |
"metadata": {}, | |
"source": [ | |
"# From inside of a CSV/pandas dataframe\n", | |
"\n", | |
"With a custom filename!" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 47, | |
"id": "40898c35", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>date</th>\n", | |
" <th>url</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2021-10-06</td>\n", | |
" <td>http://www.vineland.org/sites/default/files/10...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2021-09-15</td>\n", | |
" <td>http://www.vineland.org/sites/default/files/09...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2021-08-04</td>\n", | |
" <td>http://www.vineland.org/sites/default/files/08...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2021-07-28</td>\n", | |
" <td>http://www.vineland.org/sites/default/files/07...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2021-07-07</td>\n", | |
" <td>http://www.vineland.org/sites/default/files/07...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" date url\n", | |
"0 2021-10-06 http://www.vineland.org/sites/default/files/10...\n", | |
"1 2021-09-15 http://www.vineland.org/sites/default/files/09...\n", | |
"2 2021-08-04 http://www.vineland.org/sites/default/files/08...\n", | |
"3 2021-07-28 http://www.vineland.org/sites/default/files/07...\n", | |
"4 2021-07-07 http://www.vineland.org/sites/default/files/07..." | |
] | |
}, | |
"execution_count": 47, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import pandas as pd\n", | |
"\n", | |
"df = pd.read_csv(\"filelist.csv\")\n", | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 52, | |
"id": "87ff2297", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Downloading http://www.vineland.org/sites/default/files/10_06_21 Combined Meeting Minutes (1).pdf\n", | |
"Downloading http://www.vineland.org/sites/default/files/09_15_21 Combined Meeting Minutes (1).pdf\n", | |
"Downloading http://www.vineland.org/sites/default/files/08_04_21 Combined Meeting Minutes.pdf\n", | |
"Downloading http://www.vineland.org/sites/default/files/07_28_21 Board Retreat Minutes.pdf\n", | |
"Downloading http://www.vineland.org/sites/default/files/07_07_21 Combined Meeting Minutes.pdf\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"0 None\n", | |
"1 None\n", | |
"2 None\n", | |
"3 None\n", | |
"4 None\n", | |
"dtype: object" | |
] | |
}, | |
"execution_count": 52, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"download_dir = Path('downloads/pdfs/secret-pdfs-from-pandas')\n", | |
"download_dir.mkdir(parents=True, exist_ok=True)\n", | |
"\n", | |
"def download_file(row):\n", | |
" url = row['url']\n", | |
" print(\"Downloading\", url)\n", | |
" \n", | |
" # filename = Path(url).name\n", | |
" # filename = row['date'] + \"-minutes.pdf\"\n", | |
" filename = f\"{row['date']}-minutes.pdf\"\n", | |
" response = requests.get(url)\n", | |
" download_dir.joinpath(filename).write_bytes(response.content)\n", | |
" \n", | |
"df.apply(download_file, axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 55, | |
"id": "c0edbe71", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from tqdm.auto import tqdm\n", | |
"tqdm.pandas()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 56, | |
"id": "3d7d2b31", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"application/vnd.jupyter.widget-view+json": { | |
"model_id": "cb4936d1f4bf4150a1b471f749f86470", | |
"version_major": 2, | |
"version_minor": 0 | |
}, | |
"text/plain": [ | |
" 0%| | 0/5 [00:00<?, ?it/s]" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Downloading http://www.vineland.org/sites/default/files/10_06_21 Combined Meeting Minutes (1).pdf\n", | |
"Downloading http://www.vineland.org/sites/default/files/09_15_21 Combined Meeting Minutes (1).pdf\n", | |
"Downloading http://www.vineland.org/sites/default/files/08_04_21 Combined Meeting Minutes.pdf\n", | |
"Downloading http://www.vineland.org/sites/default/files/07_28_21 Board Retreat Minutes.pdf\n", | |
"Downloading http://www.vineland.org/sites/default/files/07_07_21 Combined Meeting Minutes.pdf\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"0 None\n", | |
"1 None\n", | |
"2 None\n", | |
"3 None\n", | |
"4 None\n", | |
"dtype: object" | |
] | |
}, | |
"execution_count": 56, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"download_dir = Path('downloads/pdfs/secret-pdfs-from-pandas')\n", | |
"download_dir.mkdir(parents=True, exist_ok=True)\n", | |
"\n", | |
"def download_file(row):\n", | |
" url = row['url']\n", | |
" print(\"Downloading\", url)\n", | |
" \n", | |
" # filename = Path(url).name\n", | |
" # filename = row['date'] + \"-minutes.pdf\"\n", | |
" filename = f\"{row['date']}-minutes.pdf\"\n", | |
" response = requests.get(url)\n", | |
" download_dir.joinpath(filename).write_bytes(response.content)\n", | |
" \n", | |
"df.progress_apply(download_file, axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "7f3bb3a8", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "39b224bf", | |
"metadata": {}, | |
"source": [ | |
"## Saving a nice CSV file as a rough-and-tumble list of URLs"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 61, | |
"id": "858d32b0", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df.to_csv(\"filelist.txt\", index=False, columns=['url'], header=False)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "0e00fc18", | |
"metadata": {}, | |
"source": [ | |
"# Skipping Python and just using `wget`\n", | |
"\n", | |
"The easiest method" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 63, | |
"id": "6e4a6cee", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"--2021-12-07 11:57:31-- http://www.vineland.org/sites/default/files/07_07_21%20Combined%20Meeting%20Minutes.pdf\n", | |
"Resolving www.vineland.org (www.vineland.org)... 205.186.152.177\n", | |
"Connecting to www.vineland.org (www.vineland.org)|205.186.152.177|:80... connected.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 171931 (168K) [application/pdf]\n", | |
"Saving to: ‘07_07_21 Combined Meeting Minutes.pdf’\n", | |
"\n", | |
"07_07_21 Combined M 100%[===================>] 167.90K --.-KB/s in 0.1s \n", | |
"\n", | |
"2021-12-07 11:57:32 (1.28 MB/s) - ‘07_07_21 Combined Meeting Minutes.pdf’ saved [171931/171931]\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"# wget (or curl) can download files straight from the command line\n",
"# With Homebrew, install it using: brew install wget\n",
"!wget \"http://www.vineland.org/sites/default/files/07_07_21 Combined Meeting Minutes.pdf\"" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 64, | |
"id": "336696bc", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"--2021-12-07 11:57:55-- http://www.vineland.org/sites/default/files/10_06_21%20Combined%20Meeting%20Minutes%20(1).pdf\n", | |
"Resolving www.vineland.org (www.vineland.org)... 205.186.152.177\n", | |
"Connecting to www.vineland.org (www.vineland.org)|205.186.152.177|:80... connected.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 175404 (171K) [application/pdf]\n", | |
"Saving to: ‘10_06_21 Combined Meeting Minutes (1).pdf’\n", | |
"\n", | |
"10_06_21 Combined M 100%[===================>] 171.29K 1.11MB/s in 0.2s \n", | |
"\n", | |
"2021-12-07 11:57:55 (1.11 MB/s) - ‘10_06_21 Combined Meeting Minutes (1).pdf’ saved [175404/175404]\n", | |
"\n", | |
"--2021-12-07 11:57:55-- http://www.vineland.org/sites/default/files/09_15_21%20Combined%20Meeting%20Minutes%20(1).pdf\n", | |
"Reusing existing connection to www.vineland.org:80.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 208685 (204K) [application/pdf]\n", | |
"Saving to: ‘09_15_21 Combined Meeting Minutes (1).pdf’\n", | |
"\n", | |
"09_15_21 Combined M 100%[===================>] 203.79K --.-KB/s in 0.09s \n", | |
"\n", | |
"2021-12-07 11:57:55 (2.14 MB/s) - ‘09_15_21 Combined Meeting Minutes (1).pdf’ saved [208685/208685]\n", | |
"\n", | |
"--2021-12-07 11:57:55-- http://www.vineland.org/sites/default/files/08_04_21%20Combined%20Meeting%20Minutes.pdf\n", | |
"Reusing existing connection to www.vineland.org:80.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 181284 (177K) [application/pdf]\n", | |
"Saving to: ‘08_04_21 Combined Meeting Minutes.pdf’\n", | |
"\n", | |
"08_04_21 Combined M 100%[===================>] 177.04K --.-KB/s in 0.07s \n", | |
"\n", | |
"2021-12-07 11:57:55 (2.62 MB/s) - ‘08_04_21 Combined Meeting Minutes.pdf’ saved [181284/181284]\n", | |
"\n", | |
"--2021-12-07 11:57:55-- http://www.vineland.org/sites/default/files/07_28_21%20%20Board%20Retreat%20Minutes.pdf\n", | |
"Reusing existing connection to www.vineland.org:80.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 126985 (124K) [application/pdf]\n", | |
"Saving to: ‘07_28_21 Board Retreat Minutes.pdf’\n", | |
"\n", | |
"07_28_21 Board Ret 100%[===================>] 124.01K --.-KB/s in 0.04s \n", | |
"\n", | |
"2021-12-07 11:57:55 (3.38 MB/s) - ‘07_28_21 Board Retreat Minutes.pdf’ saved [126985/126985]\n", | |
"\n", | |
"--2021-12-07 11:57:55-- http://www.vineland.org/sites/default/files/07_07_21%20Combined%20Meeting%20Minutes.pdf\n", | |
"Reusing existing connection to www.vineland.org:80.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 171931 (168K) [application/pdf]\n", | |
"Saving to: ‘07_07_21 Combined Meeting Minutes.pdf.1’\n", | |
"\n", | |
"07_07_21 Combined M 100%[===================>] 167.90K --.-KB/s in 0.04s \n", | |
"\n", | |
"2021-12-07 11:57:55 (4.02 MB/s) - ‘07_07_21 Combined Meeting Minutes.pdf.1’ saved [171931/171931]\n", | |
"\n", | |
"FINISHED --2021-12-07 11:57:55--\n", | |
"Total wall clock time: 0.5s\n", | |
"Downloaded: 5 files, 844K in 0.4s (2.13 MB/s)\n" | |
] | |
} | |
], | |
"source": [ | |
"!wget -i files.txt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 65, | |
"id": "36f57167", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"--2021-12-07 11:58:20-- http://www.vineland.org/sites/default/files/10_06_21%20Combined%20Meeting%20Minutes%20(1).pdf\n", | |
"Resolving www.vineland.org (www.vineland.org)... 205.186.152.177\n", | |
"Connecting to www.vineland.org (www.vineland.org)|205.186.152.177|:80... connected.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 175404 (171K) [application/pdf]\n", | |
"Saving to: ‘downloads/10_06_21 Combined Meeting Minutes (1).pdf’\n", | |
"\n", | |
"10_06_21 Combined M 100%[===================>] 171.29K --.-KB/s in 0.1s \n", | |
"\n", | |
"2021-12-07 11:58:20 (1.24 MB/s) - ‘downloads/10_06_21 Combined Meeting Minutes (1).pdf’ saved [175404/175404]\n", | |
"\n", | |
"--2021-12-07 11:58:20-- http://www.vineland.org/sites/default/files/09_15_21%20Combined%20Meeting%20Minutes%20(1).pdf\n", | |
"Reusing existing connection to www.vineland.org:80.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 208685 (204K) [application/pdf]\n", | |
"Saving to: ‘downloads/09_15_21 Combined Meeting Minutes (1).pdf’\n", | |
"\n", | |
"09_15_21 Combined M 100%[===================>] 203.79K --.-KB/s in 0.08s \n", | |
"\n", | |
"2021-12-07 11:58:20 (2.46 MB/s) - ‘downloads/09_15_21 Combined Meeting Minutes (1).pdf’ saved [208685/208685]\n", | |
"\n", | |
"--2021-12-07 11:58:20-- http://www.vineland.org/sites/default/files/08_04_21%20Combined%20Meeting%20Minutes.pdf\n", | |
"Reusing existing connection to www.vineland.org:80.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 181284 (177K) [application/pdf]\n", | |
"Saving to: ‘downloads/08_04_21 Combined Meeting Minutes.pdf’\n", | |
"\n", | |
"08_04_21 Combined M 100%[===================>] 177.04K --.-KB/s in 0.05s \n", | |
"\n", | |
"2021-12-07 11:58:20 (3.47 MB/s) - ‘downloads/08_04_21 Combined Meeting Minutes.pdf’ saved [181284/181284]\n", | |
"\n", | |
"--2021-12-07 11:58:20-- http://www.vineland.org/sites/default/files/07_28_21%20%20Board%20Retreat%20Minutes.pdf\n", | |
"Reusing existing connection to www.vineland.org:80.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 126985 (124K) [application/pdf]\n", | |
"Saving to: ‘downloads/07_28_21 Board Retreat Minutes.pdf’\n", | |
"\n", | |
"07_28_21 Board Ret 100%[===================>] 124.01K --.-KB/s in 0.03s \n", | |
"\n", | |
"2021-12-07 11:58:20 (4.41 MB/s) - ‘downloads/07_28_21 Board Retreat Minutes.pdf’ saved [126985/126985]\n", | |
"\n", | |
"--2021-12-07 11:58:20-- http://www.vineland.org/sites/default/files/07_07_21%20Combined%20Meeting%20Minutes.pdf\n", | |
"Reusing existing connection to www.vineland.org:80.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 171931 (168K) [application/pdf]\n", | |
"Saving to: ‘downloads/07_07_21 Combined Meeting Minutes.pdf’\n", | |
"\n", | |
"07_07_21 Combined M 100%[===================>] 167.90K --.-KB/s in 0.04s \n", | |
"\n", | |
"2021-12-07 11:58:20 (4.24 MB/s) - ‘downloads/07_07_21 Combined Meeting Minutes.pdf’ saved [171931/171931]\n", | |
"\n", | |
"FINISHED --2021-12-07 11:58:20--\n", | |
"Total wall clock time: 0.5s\n", | |
"Downloaded: 5 files, 844K in 0.3s (2.48 MB/s)\n" | |
] | |
} | |
], | |
"source": [ | |
"!wget -i files.txt -P downloads" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "298a7b37", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.9.7" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
date | url
---|---
2021-10-06 | http://www.vineland.org/sites/default/files/10_06_21 Combined Meeting Minutes (1).pdf
2021-09-15 | http://www.vineland.org/sites/default/files/09_15_21 Combined Meeting Minutes (1).pdf
2021-08-04 | http://www.vineland.org/sites/default/files/08_04_21 Combined Meeting Minutes.pdf
2021-07-28 | http://www.vineland.org/sites/default/files/07_28_21 Board Retreat Minutes.pdf
2021-07-07 | http://www.vineland.org/sites/default/files/07_07_21 Combined Meeting Minutes.pdf
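If pandas feels like overkill for a two-column file, the same date-based filenames can be built with the standard library's `csv` module. This is only a sketch: it assumes the file is comma-separated with a `date,url` header row (which matches what `pd.read_csv` displays in the notebook), and `SAMPLE_CSV` and `plan_downloads` are made-up names for illustration. It only plans the downloads and does not touch the network.

```python
import csv
import io

# Two rows copied from the table above, laid out the way the CSV is assumed to be.
SAMPLE_CSV = """date,url
2021-10-06,http://www.vineland.org/sites/default/files/10_06_21 Combined Meeting Minutes (1).pdf
2021-09-15,http://www.vineland.org/sites/default/files/09_15_21 Combined Meeting Minutes (1).pdf
"""


def plan_downloads(csv_file):
    """Return (url, filename) pairs, naming each file after its date column."""
    return [
        (row["url"], f"{row['date']}-minutes.pdf")
        for row in csv.DictReader(csv_file)
    ]


plan = plan_downloads(io.StringIO(SAMPLE_CSV))
for url, filename in plan:
    print(filename, "<-", url)
```

With a real file, `plan_downloads(open("filelist.csv", newline=""))` would give the same kind of list, and each pair can then be handed to whichever download code you prefer.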
http://www.vineland.org/sites/default/files/10_06_21%20Combined%20Meeting%20Minutes%20%281%29.pdf
http://www.vineland.org/sites/default/files/09_15_21%20Combined%20Meeting%20Minutes%20%281%29.pdf
http://www.vineland.org/sites/default/files/08_04_21%20Combined%20Meeting%20Minutes.pdf
http://www.vineland.org/sites/default/files/07_28_21%20%20Board%20Retreat%20Minutes.pdf
http://www.vineland.org/sites/default/files/07_07_21%20Combined%20Meeting%20Minutes.pdf
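One catch with percent-encoded URLs like these: `path.basename(url)` and `Path(url).name` return the name exactly as written, `%20` and all, so the saved files end up with escapes in their names. A small sketch of cleaning that up with `urllib.parse.unquote` from the standard library (the variable names here are just for illustration):

```python
from os import path
from urllib.parse import unquote

url = "http://www.vineland.org/sites/default/files/10_06_21%20Combined%20Meeting%20Minutes%20%281%29.pdf"

raw_name = path.basename(url)   # still contains %20 and %28/%29
clean_name = unquote(raw_name)  # escapes decoded back to spaces and parentheses

print(raw_name)
print(clean_name)
```

Using `clean_name` as the filename gives the same names that the unencoded list produces.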
http://www.vineland.org/sites/default/files/10_06_21 Combined Meeting Minutes (1).pdf
http://www.vineland.org/sites/default/files/09_15_21 Combined Meeting Minutes (1).pdf
http://www.vineland.org/sites/default/files/08_04_21 Combined Meeting Minutes.pdf
http://www.vineland.org/sites/default/files/07_28_21 Board Retreat Minutes.pdf
http://www.vineland.org/sites/default/files/07_07_21 Combined Meeting Minutes.pdf
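URLs with raw spaces like these work with `requests`, which escapes them for you, but `urllib` wants them encoded first. As a minimal sketch, `urllib.parse.quote` can do the conversion. Keeping `:` and `/` in the `safe` argument is a choice made here so the scheme and path separators are left alone. It would double-encode a URL that already contains `%20`, so only use it on unencoded URLs.

```python
from urllib.parse import quote

url = "http://www.vineland.org/sites/default/files/10_06_21 Combined Meeting Minutes (1).pdf"

# Spaces and parentheses get escaped; ":" and "/" are kept as-is.
encoded = quote(url, safe=":/")
print(encoded)
```

The result is the same form as the first line of the percent-encoded list, so it can be passed to `urllib.request.urlretrieve`.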