@jwoglom
Created October 13, 2020 02:35
Download Perusall readings as PDF
title = "The title of the article"
urls="""
<image URLs scraped from the page>
"""
# dependencies: imagemagick, img2pdf
from data import title, urls
folder = title.replace(' ','-')
import requests
import os
if not os.path.exists(folder):
    os.mkdir(folder)
# Download every image chunk listed in data.py into the folder.
i = 0
for u in urls.splitlines():
    if u:
        print('Downloading chunk', i, 'of', title)
        open('{}/{:0>2}.png'.format(folder, i), 'wb').write(requests.get(u.strip()).content)
        i += 1
# Stack each group of 6 chunks vertically into one page image.
pgno = 1
for j in range(0, i, 6):
    f = ' '.join(['{}/{:0>2}.png'.format(folder, k) for k in range(j, min(i, j+6))])
    print('Converting page', pgno)
    os.system('convert -append %s %s/page_%s.png' % (f, folder, pgno))
    pgno += 1
print('Converting to pdf')
pages = ' '.join(['{}/page_{}.png'.format(folder, k) for k in range(1, pgno)])
# Quote the output name so a title containing spaces stays a single argument.
os.system('img2pdf %s -o "%s.pdf"' % (pages, title))
print('Done')
/*
 * Click on a reading in the Perusall web interface,
 * and run this script in the developer console.
 * Copy-and-paste the console.info output to data.py.
 */
var len = 0;
var times = 0;
var i = setInterval(() => {
    var img = document.querySelectorAll("img.chunk");
    img[img.length-1].scrollIntoView();
    if (len < img.length) {
        len = img.length;
    } else if (times > 3) {
        var urls = [];
        img.forEach((e) => urls.push(e.src));
        var spl = location.pathname.split('/');
        console.info('urls = """\n'+urls.join('\n')+'\n"""\n\ntitle="'+spl[spl.length-1]+'"\n');
        clearInterval(i);
    } else {
        times++;
    }
}, 2000);
@tomasjmlopes

I would suggest adding magick to line 20 of download_perusall.py, since Windows often has other convert.exe commands. Apart from that, great little script.
os.system('magick convert -append %s %s/page_%s.png' % (f, folder, pgno))
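The same suggestion can be sketched so the script works on both setups: use "magick convert" when a magick executable is on the PATH, and fall back to plain "convert" otherwise. This is only a sketch; convert_command and append_command are hypothetical helper names, not part of the gist, and it assumes that finding magick on the PATH means ImageMagick is the intended tool.

```python
import shutil


def convert_command(which=shutil.which):
    """Return the argv prefix for ImageMagick's convert tool.

    Assumption: if a `magick` executable is on the PATH, prefer
    `magick convert` so that an unrelated convert.exe is not picked up.
    """
    if which('magick'):
        return ['magick', 'convert']
    return ['convert']


def append_command(chunks, out_path, which=shutil.which):
    """Build the argv list that stacks the chunk images vertically."""
    return convert_command(which) + ['-append'] + list(chunks) + [out_path]
```

In download_perusall.py, the resulting list could be passed to subprocess.run(append_command(files, out), check=True) in place of the os.system call. Passing a list also avoids shell quoting problems with file names.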

@ishanfdo18098

Thank you

@esspadoo

esspadoo commented Oct 5, 2023

I've tried running the script, and it works, but not well.
I think the problem is in the step where I scrape the reading on Perusall to get the URLs.
The get_urls script runs and returns URLs, but not all of them, so when download_perusall.py runs it creates a PDF that shows only the first and last pages.
The likely cause is that Perusall doesn't load the pages in time for them to be captured.
Any suggestions on what to do? Thanks.
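One way to confirm that chunks are missing before building the PDF is to count the scraped URLs and the pages they would produce, then compare that with the page count of the reading. The sketch below assumes six chunks per page, as download_perusall.py does; page_groups and summarize are hypothetical helper names, not part of the gist. It only diagnoses the problem and does not change how get_urls collects the URLs.

```python
def page_groups(n_chunks, per_page=6):
    """Return (start, end) chunk index ranges, one per output page."""
    return [(j, min(n_chunks, j + per_page))
            for j in range(0, n_chunks, per_page)]


def summarize(urls_text, per_page=6):
    """Return (number of chunk URLs, number of pages they would produce)."""
    # Skip blank lines, including lines that contain only whitespace.
    chunks = [u.strip() for u in urls_text.splitlines() if u.strip()]
    return len(chunks), len(page_groups(len(chunks), per_page))
```

For example, summarize(urls) with the urls string from data.py gives the chunk and page counts. If the page count is far below the length of the reading, the scrape stopped before all chunks had loaded.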
