Downloader

Works as a basic downloader: you specify the link of the page from which multiple files have to be downloaded. Using that page as a reference, you specify the type of files to download (for example, pptx / mp3 / mp4). The downloader does the rest for you.

Dependencies

Download the requirements.txt and main.py files, then run:

  • python -m venv myvenv
  • source myvenv/bin/activate (on Windows: myvenv\Scripts\activate)
  • pip install -r requirements.txt
  • python main.py

Follow the prompts that appear; a sample session is shown below.
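A hypothetical session (the prompts come from main.py; the URL, extension, and filename are placeholders, not real output):

Enter the source url to download from => https://example.com/lectures
Enter the extension of the files that have to be downloaded => pptx
Downloading Lecture 1.pptx
Lecture 1.pptx downloaded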

Updates

Planned: argparse support (a possible interface is sketched below) and the ability to resume interrupted downloads.
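Since argparse support is only planned, the sketch below is an assumption about what the command-line interface could look like (the positional argument names and help text are mine, not the author's); main.py currently reads the URL and extension from input() prompts.

import argparse

parser = argparse.ArgumentParser(
    description="Download every file of a given type linked from a page")
parser.add_argument("url", help="page to scrape for download links")
parser.add_argument("extension", help="file type to download, e.g. pptx / mp3 / mp4")
args = parser.parse_args()
scrape(args.url, args.extension)  # scrape() as defined in main.py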

Contact

Please reach out to the author at dewanshrawat15@gmail.com.

Index

  • README.md
  • main.py
  • requirements.txt
main.py

import os
import re
from urllib.parse import urljoin
from urllib.request import urlopen

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

def download(link, name):
    # Stream the file to disk; skip it if an identical-sized copy already exists.
    response = requests.get(link, stream=True)
    file_size = int(response.headers['content-length'])
    if os.path.isfile(name) and os.stat(name).st_size == file_size:
        print(name + " => File already exists")
        return
    print("Downloading " + name)
    with open(name, 'wb') as f:
        for data in tqdm(iterable=response.iter_content(chunk_size=1024),
                         total=file_size // 1024, unit='KB'):
            if data:
                f.write(data)
    print(name + " downloaded")

def scrape(link, extension_format):
    # Collect every link on the page whose href contains the wanted extension,
    # derive a filename from the link text, then download each file in turn.
    src = urlopen(link)
    codebase = BeautifulSoup(src, 'html.parser')
    urls = []
    names = []
    for anchor in codebase.find_all("a"):
        href = anchor.get("href")
        if href and extension_format in href:
            name = anchor.get_text().replace('\n', '')
            name = name.replace('/', "|")   # '/' is not allowed in filenames
            name = re.sub(' +', ' ', name)  # collapse runs of spaces
            names.append(name + "." + extension_format)
            urls.append(urljoin(link, href))  # resolve relative links
    for url, name in zip(urls, names):
        download(url, name)

url = input("Enter the source url to download from => ")
extension = input("Enter the extension of the files that have to be downloaded => ")
scrape(url, extension)
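The resume feature listed under Updates is not implemented yet; one possible approach (an assumption, not the author's design) is to request only the missing byte range with an HTTP Range header and append to the partial file:

# Hypothetical resume helper, not part of main.py.
import os
import requests

def resume_download(link, name):
    existing = os.stat(name).st_size if os.path.isfile(name) else 0
    headers = {'Range': 'bytes=%d-' % existing} if existing else {}
    response = requests.get(link, stream=True, headers=headers)
    # 206 Partial Content means the server honoured the range request,
    # so append to the partial file; otherwise start over from scratch.
    mode = 'ab' if response.status_code == 206 else 'wb'
    with open(name, mode) as f:
        for data in response.iter_content(chunk_size=1024):
            if data:
                f.write(data)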
requirements.txt

beautifulsoup4==4.8.1
certifi==2019.9.11
chardet==3.0.4
idna==2.8
requests==2.22.0
soupsieve==1.9.4
tqdm==4.36.1
urllib3==1.25.6