@glickmac
Last active December 17, 2019 19:44
import requests
from bs4 import BeautifulSoup
import nltk
nltk.download("stopwords")
nltk.download('vader_lexicon')
### Pull the plain-text book from Project Gutenberg
url = 'http://www.gutenberg.org/files/501/501-0.txt'
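res = requests.get(url, timeout=30)
res.raise_for_status()  # stop early on a failed download instead of parsing an error page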
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
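# Collect every text node (string=True replaces the deprecated text=True)
text = soup.find_all(string=True)
### Join into one string and clean text to remove annotations
# Joining avoids the list repr that str(list) produces (brackets, quotes, escaped newlines)
text = " ".join(text)
text = text.replace("\n", " ").replace("\r", " ").replace("_", "").lower()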
text = text.split("the first chapter")[1].split("illustration: the end")[0]
### Save File to Data Folder
# The ../data directory must already exist
with open("../data/Doctor_Dolittle.txt", "w", encoding="utf-8") as f:
    f.write(text)
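
The marker-based trimming above raises an unhelpful `IndexError` if the start marker is missing, and silently keeps the trailing license text if the end marker is missing. A minimal sketch of a stricter variant follows. The helper name `extract_body` and the sample string are hypothetical, not part of the original gist. Unlike the script, it collapses all runs of whitespace to single spaces and keeps everything from the first start marker up to the next end marker.

```python
def extract_body(raw, start_marker, end_marker):
    """Return the cleaned text between start_marker and end_marker.

    Hypothetical helper: lowercases the text, drops underscores
    (Gutenberg's italics markup), and collapses whitespace before
    searching for the markers.
    """
    cleaned = " ".join(raw.replace("_", "").lower().split())

    start = cleaned.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    start += len(start_marker)

    end = cleaned.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")

    return cleaned[start:end].strip()


# Made-up sample standing in for the downloaded book text
sample = (
    "Header\r\nTHE FIRST CHAPTER\r\n_Puddleby_\r\n"
    "Once upon a time.\r\nIllustration: The End\r\nFooter"
)
body = extract_body(sample, "the first chapter", "illustration: the end")
print(body)
```

Failing loudly on a missing marker makes it obvious when the Project Gutenberg file layout changes, rather than writing a truncated or untrimmed file to disk.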