@josht-jpg
Last active September 7, 2020 01:00
Filtering out Dostoevsky titles
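# Goodreads lists more than 50 pages of titles under Dostoevsky,
# but all the relevant titles are on the first two pages.
# Note: `titles` and `get_contents` are not defined in this snippet;
# they are assumed to come from earlier code.
from urllib import request

import pandas as pd
from bs4 import BeautifulSoup
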
page1_url = "https://www.goodreads.com/author/list/3137322.Fyodor_Dostoyevsky?page=1&per_page=30"
page2_url = "https://www.goodreads.com/author/list/3137322.Fyodor_Dostoyevsky?page=2&per_page=30"
page1 = request.urlopen(page1_url)
page2 = request.urlopen(page2_url)
page1_soup = BeautifulSoup(page1, "html.parser")
page2_soup = BeautifulSoup(page2, "html.parser")
# Collect the elements with class "bookTitle" from both pages.
# pd.concat is used because Series.append was removed in pandas 2.0.
dostoyevsky_titles = pd.concat([
    pd.Series(page1_soup.find_all(class_="bookTitle")),
    pd.Series(page2_soup.find_all(class_="bookTitle")),
])
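# Convert each matched element to its title text.
dostoyevsky_titles = dostoyevsky_titles.apply(get_contents)
# Keep only the titles that do not appear in the Dostoevsky list.
titles = titles[~titles.isin(dostoyevsky_titles)]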
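
The last line does the actual filtering, and it may be easier to follow on toy data. The sketch below is only an illustration: the two Series are hypothetical stand-ins for `titles` and the scraped `dostoyevsky_titles`, not data from Goodreads.

```python
import pandas as pd

# Hypothetical stand-ins for the real data: `titles` would come from earlier
# code, and `dostoyevsky_titles` from the scraped Goodreads pages.
titles = pd.Series(["Crime and Punishment", "Middlemarch", "The Idiot", "Ulysses"])
dostoyevsky_titles = pd.Series(["Crime and Punishment", "The Idiot", "Demons"])

# isin() marks every title that also appears in dostoyevsky_titles;
# ~ inverts the mask, so only the non-matching titles are kept.
mask = titles.isin(dostoyevsky_titles)
remaining = titles[~mask]

print(remaining.tolist())  # ['Middlemarch', 'Ulysses']
```

`isin` compares values exactly, so a title that differs in case, whitespace, or an appended subtitle would not be filtered out.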