Last active
September 7, 2020 16:56
Getting goodreads titles
from urllib import request

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# get_contents keeps the second child of each tag returned by .find_all() below,
# skipping the extraneous leading entry
def get_contents(tag):
    return tag.contents[1]

titles = pd.Series(dtype="string")
# Loops through each webpage in the goodreads list (pages 1 to 11)
for i in np.arange(1, 12):
    url = "https://www.goodreads.com/list/show/16.Best_Books_of_the_19th_Century?page=" + str(i)
    page = request.urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    page_titles = pd.Series(soup.find_all(class_="bookTitle"))
    page_titles = page_titles.apply(get_contents)
    # Series.append was removed in pandas 2.0; pd.concat is its replacement
    titles = pd.concat([titles, page_titles])
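
Depending on the page markup, the entries collected above may be tag objects rather than plain strings. As a rough offline sketch of pulling out just the title text, the example below swaps BeautifulSoup for the standard library's `html.parser` and runs on a small made-up snippet. `SAMPLE_HTML`, `TitleParser`, and `extract_titles` are hypothetical names, and the sample markup is an assumption about what a `bookTitle` link might look like, not a copy of the real goodreads page.

```python
from html.parser import HTMLParser

# Hypothetical sample markup (an assumption); the real goodreads page may differ.
SAMPLE_HTML = """
<a class="bookTitle" href="/book/show/1">
  <span>Pride and Prejudice</span>
</a>
<a class="authorName" href="/author/show/2"><span>Jane Austen</span></a>
<a class="bookTitle" href="/book/show/3">
  <span>Jane Eyre</span>
</a>
"""


class TitleParser(HTMLParser):
    """Collects the text inside every <a> element whose class includes bookTitle."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "bookTitle" in classes:
            self._in_title = True
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a bookTitle link
        if self._in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_title:
            # Join the collected text and trim the surrounding whitespace
            self.titles.append("".join(self._parts).strip())
            self._in_title = False


def extract_titles(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


sample_titles = extract_titles(SAMPLE_HTML)
print(sample_titles)
```

With BeautifulSoup itself, calling `.get_text(strip=True)` on each tag should give a similar result without a custom parser.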