@josht-jpg
Last active September 7, 2020 16:56
Getting Goodreads titles
from urllib import request

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# get_contents keeps only the second child (index 1) of each tag matched by
# .find_all() below, dropping the extraneous filler around it
def get_contents(tag):
    return tag.contents[1]

page_series = []
# Loop through each of the 11 pages in the Goodreads list
for i in np.arange(1, 12):
    url = "https://www.goodreads.com/list/show/16.Best_Books_of_the_19th_Century?page=" + str(i)
    page = request.urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    page_titles = pd.Series(soup.find_all(class_="bookTitle"))
    page_titles = page_titles.apply(get_contents)
    page_series.append(page_titles)

# Series.append was removed in pandas 2.0, so combine the pages with pd.concat
titles = pd.concat(page_series)
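
The extraction step can be tried offline, without hitting Goodreads. The sketch below is not the gist's method: it swaps BeautifulSoup for the standard library's html.parser so it runs with no third-party packages, and it returns plain strings rather than tags. SAMPLE_HTML, TitleParser, and extract_titles are hypothetical names introduced for illustration, and the sample markup is an assumption about the page structure, not a copy of it.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely modelled on a list page (an assumption, not copied from Goodreads)
SAMPLE_HTML = """
<table>
  <tr><td><a class="bookTitle" href="/book/1">
    <span>Pride and Prejudice</span>
  </a></td></tr>
  <tr><td><a class="bookTitle" href="/book/2">
    <span>Jane Eyre</span>
  </a></td></tr>
  <tr><td><a class="authorName" href="/author/1"><span>Jane Austen</span></a></td></tr>
</table>
"""


class TitleParser(HTMLParser):
    """Collects the text inside every <a class="bookTitle"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside = False  # True while between <a class="bookTitle"> and </a>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "bookTitle" in classes:
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        # Text nodes are only kept while inside a matching link
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            # Join the pieces and trim the whitespace around the title
            self.titles.append("".join(self._chunks).strip())
            self._inside = False


def extract_titles(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


sample_titles = extract_titles(SAMPLE_HTML)
print(sample_titles)  # ['Pride and Prejudice', 'Jane Eyre']
```

Stripping whitespace here plays the same role as get_contents in the gist: both discard the filler around the title and keep only the part of the link that carries it.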