@Lesik
Last active April 8, 2022 12:35
Fetch a lendable book from archive.org as individual page images

By viewing the book online, you have already (unwillingly and unknowingly) downloaded its pages to your computer. The only difference is that you have to keep the browser open, stay logged in, and have internet connectivity to read it. I see no moral concerns with taking data that's already on your computer and saving it in a more accessible way.

Open Chrome and devtools, then visit the book's online reading page and click "Borrow for 1 hour".

In the Network tab, select the XHR filter, copy the response of the BookReaderJISA.php?... request, and assign it to a variable named data in the console.

Then, run:

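// brOptions.data is an array of arrays; each inner item is a page object with a `uri`.
data.data.brOptions.data.forEach((first, firstI) => {
    first.forEach((second, secondI) => {
        // Stagger the requests with a hardcoded delay instead of firing them all at once.
        window.setTimeout(_ => {
            // Appending an <img> makes the browser fetch the page, so it shows up in the Network tab.
            const image = document.createElement("img");
            image.src = second.uri;
            document.body.append(image);
        }, (firstI + (secondI * 5)) * 4000);
    });
});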

If you're motivated, rewrite the code to wait until an image has fully loaded before firing off the next. I did a quick-and-dirty with hardcoded delay.
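
One way to do that, as a minimal sketch rather than a tested replacement: wrap each image in a Promise that settles on its load or error event, and await the images one after another. The names loadImage and loadSequentially are made up for this example, and the doc parameter exists only so the functions can be exercised outside a browser.

```javascript
// Load a single image and resolve once the browser has finished with it.
// Resolves to true on success and false on error, so one failed page
// does not stall the rest of the queue.
function loadImage(uri, doc) {
  return new Promise((resolve) => {
    const image = doc.createElement('img');
    image.onload = () => resolve(true);
    image.onerror = () => resolve(false);
    image.src = uri;
    doc.body.append(image);
  });
}

// Request the images one at a time, in the order given.
// Returns the number of images that loaded successfully.
async function loadSequentially(uris, doc = document) {
  let loaded = 0;
  for (const uri of uris) {
    if (await loadImage(uri, doc)) loaded += 1;
  }
  return loaded;
}
```

In the console, with data defined as above, this could be started with `loadSequentially(data.data.brOptions.data.flat().map(page => page.uri)).then(n => console.log(n, "images loaded"))`. It drops the fixed delay entirely, so add a pause between iterations if the requests should be spaced out.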

Switch to the Img filter and wait until everything has loaded. Then, right-click on any entry and select "Save all as HAR with content". Save the file as /tmp/archive.org.har, which is the path the Node script reads.

Lastly, run (in node):

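const fs = require('fs');
const path = require('path');

// HAR file saved from devtools in the previous step
const json = JSON.parse(fs.readFileSync('/tmp/archive.org.har', 'utf8'));

// Make sure the output directory exists before writing into it
fs.mkdirSync('/tmp/book', { recursive: true });

for (const entry of json.log.entries) {
  // Skip entries that were saved without a response body
  if (!entry.response.content.text) continue;
  const filename = path.basename(entry.request.url);
  // Image bodies are stored base64-encoded; new Buffer() is deprecated, so use Buffer.from()
  const buffer = Buffer.from(entry.response.content.text, "base64");
  fs.writeFileSync(`/tmp/book/${filename}`, buffer);
}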
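
A HAR export can also contain requests that are not page images. As a rough sketch, the entries could be filtered first using the mimeType and encoding fields that the HAR format defines on response.content. The helper name imageEntries is hypothetical, and the filter assumes the pages are served with an image/* content type, which is worth checking against your own HAR file.

```javascript
const path = require('path');

// Hypothetical helper: pick the image responses out of a parsed HAR object
// and decode each base64 body into a Buffer.
function imageEntries(har) {
  const images = [];
  for (const entry of har.log.entries) {
    const content = entry.response.content;
    // Keep only responses whose body was stored and is base64-encoded.
    if (!content.text || content.encoding !== 'base64') continue;
    // Assumption: page images are served with an image/* content type.
    if (!(content.mimeType || '').startsWith('image/')) continue;
    // Same file name rule as the original script.
    const filename = path.basename(entry.request.url);
    images.push({ filename, buffer: Buffer.from(content.text, 'base64') });
  }
  return images;
}
```

Each returned item can then be written with `fs.writeFileSync` exactly as in the loop above.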

Lesik commented Feb 2, 2021

Images are downloaded as jp2. To convert them to JPEG and resize them (requires bash and ImageMagick's convert):

$ for file in $(find . -name '*.jp2'); do convert "$file" -resize 50% "${file%.jp2}.jpg"; done

Combine into a pdf:

$ pdfjoin --fitpaper false --rotateoversize false *.jpg
