@ecdedios
Created May 25, 2020 17:46
Get coronavirus-related articles from npr.org using the newspaper library.
import pickle
import time

import newspaper

# Build a source object for NPR's coronavirus live-updates section.
npr = newspaper.build('https://www.npr.org/sections/coronavirus-live-updates')

corpus = []
count = 0

for article in npr.articles:
    time.sleep(1)  # pause between requests to go easy on the server
    article.download()
    article.parse()
    text = article.text
    corpus.append(text)
    count += 1
    # Report progress after every 10th article.
    if count % 10 == 0:
        print('Obtained {} articles'.format(count))

# Keep at most the first 300 articles.
corpus300 = corpus[:300]

# Despite the .txt extension, this file holds binary pickle data.
with open("npr_coronavirus.txt", "wb") as fp:  # Pickling
    pickle.dump(corpus300, fp)

# with open("npr_coronavirus.txt", "rb") as fp:  # Unpickling
#     corpus = pickle.load(fp)
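
The script above stops at the first article that fails to download or parse. A minimal sketch of a more tolerant variant follows, assuming only that each article object exposes `download()`, `parse()`, and a `.text` attribute, as the script uses them. The names `collect_texts`, `save_corpus`, `load_corpus`, and `FakeArticle`, the `limit` and `delay` parameters, and the `.pkl` file name are all hypothetical and not part of the original gist. `FakeArticle` is there only so the sketch runs offline.

```python
import pickle
import tempfile
import time
from pathlib import Path


def collect_texts(articles, limit=300, delay=1.0):
    """Download and parse article-like objects, skipping any that fail.

    `articles` is any iterable of objects with download(), parse() and a
    .text attribute. `limit` and `delay` are illustrative parameters.
    """
    texts = []
    for article in articles:
        if len(texts) >= limit:
            break
        try:
            article.download()
            article.parse()
        except Exception as exc:  # deliberately broad: this is only a sketch
            print('Skipping article: {}'.format(exc))
            continue
        if article.text:  # ignore pages that yielded no body text
            texts.append(article.text)
        time.sleep(delay)  # pause between successful requests
    return texts


def save_corpus(texts, path):
    """Pickle the list of article texts to `path`."""
    with open(path, 'wb') as fp:
        pickle.dump(texts, fp)


def load_corpus(path):
    """Load a list of article texts previously written by save_corpus."""
    with open(path, 'rb') as fp:
        return pickle.load(fp)


class FakeArticle:
    """Hypothetical stand-in for a real article object (no network needed)."""

    def __init__(self, text, fail=False):
        self._text = text
        self._fail = fail
        self.text = ''

    def download(self):
        if self._fail:
            raise RuntimeError('download failed')

    def parse(self):
        self.text = self._text


# Offline demonstration: the middle article fails and is skipped.
demo = [FakeArticle('first'), FakeArticle('broken', fail=True), FakeArticle('second')]
texts = collect_texts(demo, delay=0)

# Round-trip the corpus through a pickle file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'npr_coronavirus.pkl'
    save_corpus(texts, path)
    restored = load_corpus(path)
```

With the real library, `collect_texts(npr.articles)` would presumably take the place of the loop in the script. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.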