@sergiolucero
Last active June 22, 2022 01:04
Download the La Bicicleta magazines
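import glob
import time
from urllib.parse import urljoin

import fitz  # PyMuPDF
import requests
import wget
from bs4 import BeautifulSoup

# Memoria Chilena page that lists the La Bicicleta magazine PDFs
url = 'http://www.memoriachilena.gob.cl/602/w3-article-100795.html#documentos'
bs = BeautifulSoup(requests.get(url).text, 'lxml')
# Collect every PDF link; urljoin resolves relative hrefs and leaves absolute ones unchanged
links = [urljoin(url, link['href']) for link in bs.find_all('a')
         if '.pdf' in link.get('href', '')]
print(len(links))

t0 = time.time()
for ix, pdf_url in enumerate(links):
    if ix % 10 == 5:  # progress report every 10 downloads
        print('%d/%d [DT=%.2f]' % (ix, len(links), time.time() - t0))
    wget.download(pdf_url)

# Extract the text of each downloaded PDF, one string per page
for fn in glob.glob('*.pdf'):
    txt = [page.get_text() for page in fitz.open(fn)]
    print(fn, len(txt))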
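The link-collection step can be tried offline before hitting the site. The sketch below is not the gist's own method: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages, and the names `PdfLinkParser` and `extract_pdf_links`, as well as the sample markup, are hypothetical and not taken from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collects the href of every <a> tag that points at a PDF file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # Anchors without an href yield None, so fall back to ''
        href = dict(attrs).get('href') or ''
        if '.pdf' in href.lower():
            # Resolve relative hrefs against the page URL
            self.links.append(urljoin(self.base_url, href))


def extract_pdf_links(html, base_url):
    """Return absolute URLs for all PDF links found in an HTML string."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical sample markup, only for illustration
sample = '<a href="/docs/issue-01.pdf">1</a> <a href="index.html">home</a>'
print(extract_pdf_links(sample, 'http://www.memoriachilena.gob.cl/602/w3-article-100795.html'))
```

Keeping the extraction in a pure function like this makes it possible to check the filter on a saved copy of the page without downloading anything.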