BeautifulSoup 4.3.2 has serious problems parsing nytimes.com under the Anaconda 2.2.0 distribution (Python 3.4.3): the same code that works elsewhere parses only a fraction of the page.
## Using Python 2.7.6, BS 4.2.1
# '2.7.6 (default, Mar 22 2014, 22:59:56) [GCC 4.8.2]'
from urllib import urlopen
from bs4 import BeautifulSoup
page = urlopen("http://www.nytimes.com").read()
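# Note: none of these runs pass an explicit parser, so BeautifulSoup silently
# picks the "best" parser installed in each environment (lxml if available)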
soup = BeautifulSoup(page)
## Count headlines
print(len(soup.select("h2")))
# 153
print(len(page))
# 278412
print(len(soup.prettify()))
# 247651
## Using Python 3.4.3, BS 4.3.2, via Anaconda distribution
# '3.4.3 |Anaconda 2.2.0 (x86_64)| (default, Mar 6 2015, 12:07:41) \n[GCC 4.2.1 (Apple Inc. build 5577)]'
from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen("http://www.nytimes.com").read()
soup = BeautifulSoup(page)
## Count headlines
print(len(soup.select("h2")))
# 1
print(len(page))
# 278412
print(len(soup.prettify()))
# 84650 ... WTF???
## Using pyenv build of Python 3.4.3, BS 4.3.2
# '3.4.3 (default, Mar 10 2015, 08:04:33) \n[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)]'
from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen("http://www.nytimes.com").read()
soup = BeautifulSoup(page)
## Count headlines
print(len(soup.select("h2")))
# 152
print(len(page))
# 277977
print(len(soup.prettify()))
# 247987
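A hedged diagnostic sketch, not part of the original gist: forcing the parser explicitly should show whether the truncation comes from the parser build each environment ships (e.g. Anaconda's lxml) rather than from BeautifulSoup itself. It mirrors the runs above; the exact counts will depend on the nytimes.com front page at the time.

## Diagnostic: compare explicit parsers under the problematic Python 3.4.3 install
from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen("http://www.nytimes.com").read()

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(page, parser)
    except Exception as exc:
        # Parser not installed in this environment
        print(parser, "unavailable:", exc)
        continue
    # A drastically smaller prettify() length or h2 count singles out the bad parser
    print(parser, "h2:", len(soup.select("h2")), "prettified:", len(soup.prettify()))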