@Manouchehri
Last active May 29, 2022 11:44
Allowing gzip encoding with urllib
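__author__ = 'David Manouchehri'

import gzip
import urllib.request
import zlib

from bs4 import BeautifulSoup

url = 'http://yoururlgoesherehopefullythisisntavalidurl.com/pages.html'
# Browser-like headers; Accept-Encoding tells the server we accept compressed bodies.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.5',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0',
}
req = urllib.request.Request(url, headers=headers)

# urllib does not decompress response bodies itself, so inspect Content-Encoding.
with urllib.request.urlopen(req) as response:
    encoding = response.headers.get('Content-Encoding', '').lower()
    body = response.read()

if encoding == 'gzip':
    pagedata = gzip.decompress(body)
elif encoding == 'deflate':
    # "deflate" is normally zlib-wrapped, but some servers send a raw deflate stream.
    try:
        pagedata = zlib.decompress(body)
    except zlib.error:
        pagedata = zlib.decompress(body, -zlib.MAX_WBITS)
elif encoding:
    # Fail loudly instead of continuing with pagedata undefined.
    raise ValueError('Unknown Content-Encoding: ' + encoding)
else:
    pagedata = body

# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(pagedata, 'html.parser')
print(soup.prettify())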
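
The Content-Encoding dispatch above can also be factored into a small function, which makes it testable without a network request. This is a minimal sketch, not part of urllib: the name `decode_body` is hypothetical, and the raw-deflate fallback is an assumption about how some servers behave.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str = '') -> bytes:
    """Return the decompressed body for a Content-Encoding header value.

    Hypothetical helper for illustration; not part of urllib.
    """
    encoding = (content_encoding or '').strip().lower()
    if encoding in ('', 'identity'):
        # No compression was applied.
        return body
    if encoding == 'gzip':
        return gzip.decompress(body)
    if encoding == 'deflate':
        try:
            # In HTTP, "deflate" is meant to be a zlib-wrapped stream.
            return zlib.decompress(body)
        except zlib.error:
            # Assumption: some servers send a raw deflate stream instead.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    raise ValueError('Unknown Content-Encoding: ' + encoding)


if __name__ == '__main__':
    html = b'<html><body>hello</body></html>'
    # Round-trip locally compressed data through the helper.
    print(decode_body(gzip.compress(html), 'gzip') == html)
    print(decode_body(zlib.compress(html), 'deflate') == html)
```

With a urllib response, this could be called as `decode_body(response.read(), response.headers.get('Content-Encoding', ''))`.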

spex66 commented May 7, 2021

thx for sharing!
your snippet helped me fix httpie/http-prompt > cli.py, which failed when reading a compressed resource coming from the --spec parameter :)
