Skip to content

Instantly share code, notes, and snippets.

@alanorth
Last active November 6, 2018 07:23
Show Gist options
  • Save alanorth/2206f24483fe5f0454fc to your computer and use it in GitHub Desktop.
Save alanorth/2206f24483fe5f0454fc to your computer and use it in GitHub Desktop.
Read CSV, fetch PDFs, and generate thumbnails. The filename column is derived from the dc.identifier.url field (using split in OpenRefine, but could be done elsewhere obviously), and we should watch out for URL encoded stuff (ugh).
cg.subject.ciat dc.contributor.author dc.contributor.corporate dc.cplace.country dc.date.issued dc.description.abstract dc.identifier.citation dc.identifier.status dc.identifier.uri dc.identifier.url filename dc.language.iso dc.publisher dc.rplace.region dc.subject dc.title dc.type.output
POLICY||NUTRITION Pachico, DH||Seré Rabé, C 1981 Pachico, Douglas H.; Seré Rabé, Carlos. 1981. Food consumption patterns and malnutrition in Latin America : Some issues for commodity priorities and policy analysis. Centro Internacional de Agricultura Tropical (CIAT), Cali, CO. 36 p. Open Access http://ciat-library.ciat.cgiar.org/ciat_digital/CIAT/64661.pdf 64661.pdf en Centro Internacional de Agricultura Tropical (CIAT) LATIN AMERICA FOOD CONSUMPTION||MALNUTRITION||LATIN AMERICA||CONSUMO DE ALIMENTOS||MALNUTRICIÓN||AMÉRICA LATINA Food consumption patterns and malnutrition in Latin America : some issues for commodity priorities and policy analysis Report
MONITORING AND REPORTING Woolley, JN||Pachico, DH 1987 Woolley, Jonathan N.; Pachico, Douglas H. 1987. Un marco metodológico para la investigación en campos de agricultores. Centro Internacional de Agricultura Tropical (CIAT), Cali, CO. 43 p. Open Access http://ciat-library.ciat.cgiar.org/ciat_digital/CIAT/64195.pdf 64195.pdf es Centro Internacional de Agricultura Tropical (CIAT) FIELD EXPERIMENTATION||METHODS||COLOMBIA||EXPERIMENTACIÓN EN CAMPO||MÉTODOS||COLOMBIA Un marco metodológico para la investigación en campos de agricultores Report
#!/usr/bin/env python
#
# generate-thumbnails.py 1.0.0
#
# Copyright 2018 Alan Orth.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#
# ---
# Reads the "filename" and "dc.identifier.url" fields from a CSV,
# fetches the PDF, and generates a thumbnail using GraphicsMagick.
#
# The script is written for Python 3+ and requires PETL and Requests:
#
# $ pip install petl requests
#
# See: https://petl.readthedocs.org/en/latest
# See: https://requests.readthedocs.org/en/master
import os.path
import petl as etl
import re
import requests
import signal
import subprocess
import sys
def signal_handler(signal, frame):
sys.exit(0)
# Process thumbnails from filename.pdf to filename.jpg using GraphicsMagick
# and Ghostscript. Equivalent to the following shell invocation:
#
# gm convert -quality 85 -thumbnail x400 -flatten 64661.pdf\[0\] cover.jpg
#
def create_thumbnail(record):
filename = record[0]
thumbnail = os.path.splitext(filename)[0] + '.jpg'
# check if we already have a thumbnail
if os.path.isfile(thumbnail):
print("> Thumbnail for", filename, "already exists")
else:
print("> Creating thumbnail for", filename)
subprocess.run(["gm", "convert", "-quality", "85", "-thumbnail", "x400", "-flatten", filename + "[0]", thumbnail])
return
def download_bitstream(record):
# some records have multiple URLs separated by "||"
pattern = re.compile("\|\|")
urls = pattern.split(record[0])
filenames = pattern.split(record[1])
for url, filename in zip(urls, filenames):
print("URL: " + url)
print("File: " + filename)
# check if file exists
if os.path.isfile(filename):
print(">", filename, "already downloaded")
else:
print("> Downloading", filename)
response = requests.get(url, stream=True)
if response.status_code == 200:
with open(filename, 'wb') as fd:
for chunk in response:
fd.write(chunk)
else:
print("> Download failed, I'll try again next time")
return
# make sure the user passed us the name of a CSV on the command line
if len(sys.argv) == 2:
# read records from the CSV
records = etl.fromcsv(sys.argv[1])
else:
print("Usage: " + sys.argv[0] + " filename.csv")
exit()
# set the signal handler for SIGINT (^C)
signal.signal(signal.SIGINT, signal_handler)
# get URL and filename fields for each record
# make sure other URL fields like dc.identifier.url[] etc are merged into this one and filename column exists!
for record in etl.values(records, 'dc.identifier.url', 'filename'):
download_bitstream(record)
# maybe only generate thumbnails if -t is passed?
#create_thumbnail(record)
$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
> Downloading 64661.pdf
> Creating thumbnail for 64661.pdf
Processing 64195.pdf
> Downloading 64195.pdf
> Creating thumbnail for 64195.pdf
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment