@oaustegard
Created April 21, 2025 15:36
Vibe Building a Client‑Side PDF Compressor with Ghostscript‑WASM

Hosted at: https://austegard.com/pdf-compressor.html


Modern browsers are incredibly powerful. If you’ve ever tried Squoosh.app’s client‑side image compression, you know just how slick a fully‑in‑browser workflow can feel. What if we extended that idea to PDFs—shrinking large documents without ever sending them to a server?

In this post I’ll walk through how we built PDF Compressor, a 100% client‑side PDF optimizer written in pure HTML/JS/WebAssembly. We started with a Claude 3.7 Sonnet–generated plan, iterated with ChatGPT o4‑mini‑high, and ended up with a reasonably fast, vector‑preserving compressor powered by Ghostscript‑WASM.


1. The Challenge

“I want a static, client‑side only PDF compressor.”
Inspired by Squoosh.app’s success with images, our goal was to let users drop a PDF into their browser and get a much smaller file back: no server, no privacy worries.

Key requirements:

  • Purely front‑end: host on GitHub Pages (which hosts austegard.com) with just a pdf-compressor.html.
  • True PDF optimization: ideally preserve text & vector graphics, not rasterize everything.
  • Good UX: keep the interface responsive, show progress, offer human‑friendly quality settings.

2. First Prototype: Canvas + JPEG (Too “Scanned”)

Our initial proof‑of‑concept used:

  1. PDF.js to render each page to a <canvas>.
  2. canvas.toDataURL('image/jpeg', quality) to re‑encode at a user‑selected JPEG quality.
  3. jsPDF to stitch the pages back into a new PDF.
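// Sketch: pdfDoc is a loaded PDF.js document, outPdf a jsPDF instance, canvas a <canvas> element
for (let n = 1; n <= pdfDoc.numPages; n++) {
  const page = await pdfDoc.getPage(n);
  const viewport = page.getViewport({ scale: 1.5 }); // render scale is an illustrative choice
  canvas.width = viewport.width;
  canvas.height = viewport.height;
  await page.render({ canvasContext: canvas.getContext('2d'), viewport }).promise;
  const img = canvas.toDataURL('image/jpeg', 0.8);   // 0.8 stands in for the user-selected quality
  if (n > 1) outPdf.addPage();                       // a new jsPDF document already has page 1
  outPdf.addImage(img, 'JPEG', 0, 0);
}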

👍 Lightning‑fast, tiny code footprint
👎 Flattened every page into a bitmap; text looked like a scanned printout

This approach was great for pictures, but made text and vector diagrams look crudely rasterized.


3. Ghostscript‑WASM to the Rescue

We needed a vector‑preserving solution. Enter Ghostscript, the venerable PDF toolkit—compiled to WebAssembly by the jspawn/ghostscript‑wasm project.

How it works

  • The Emscripten build exports a UMD factory (gs.js) or an ESM module (gs.mjs).
  • We load it in the browser and call:
    await gs.callMain([
      '-sDEVICE=pdfwrite',
      '-dCompatibilityLevel=1.4',
      '-dPDFSETTINGS=/ebook',
      '-dNOPAUSE','-dQUIET','-dBATCH',
      '-sOutputFile=out.pdf',
      'in.pdf'
    ]);
  • Ghostscript recompresses images inside the PDF and optimizes text/graphics streams, and in general keeps your vectors and fonts intact instead of rasterizing them.
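
The call above assumes the input is already sitting in Ghostscript's in-memory filesystem. A minimal sketch of the full round trip might look like the following. The helper names (buildGsArgs, compressPdf), the preset labels, and the in.pdf/out.pdf file names are illustrative assumptions rather than the tool's actual code; the sketch only assumes that gs exposes FS.writeFile, FS.readFile, FS.unlink and callMain, as Emscripten builds typically do.

```javascript
// Hypothetical mapping from UI labels to Ghostscript's built-in -dPDFSETTINGS presets.
const PRESETS = {
  smallest: '/screen',  // lowest image resolution, smallest files
  balanced: '/ebook',   // the preset used in this post
  high: '/printer',     // higher image resolution, larger files
};

// Build the argument list for a pdfwrite run.
function buildGsArgs(preset, inPath, outPath) {
  const setting = PRESETS[preset];
  if (!setting) throw new Error(`Unknown preset: ${preset}`);
  return [
    '-sDEVICE=pdfwrite',
    '-dCompatibilityLevel=1.4',
    `-dPDFSETTINGS=${setting}`,
    '-dNOPAUSE', '-dQUIET', '-dBATCH',
    `-sOutputFile=${outPath}`,
    inPath,
  ];
}

// Assumes `gs` is an initialized module exposing FS and callMain.
async function compressPdf(gs, inputBytes, preset = 'balanced') {
  gs.FS.writeFile('in.pdf', inputBytes);
  try {
    await gs.callMain(buildGsArgs(preset, 'in.pdf', 'out.pdf'));
    return gs.FS.readFile('out.pdf'); // Uint8Array with the compressed PDF
  } finally {
    // Free the in-memory copies; ignore files that were never created.
    for (const path of ['in.pdf', 'out.pdf']) {
      try { gs.FS.unlink(path); } catch (_) { /* nothing to remove */ }
    }
  }
}
```

In the browser, the returned bytes can be wrapped in new Blob([bytes], { type: 'application/pdf' }) and offered as a download.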

4. Loading Woes & Regression Lessons

“Cannot read properties of undefined (reading ‘writeFile’)”

We hit a nasty snag: our first Ghostscript integration kept failing because we loaded the UMD bundle as if it were an ES module. That meant Module.FS (the in‑memory filesystem) was never defined:

// ❌ Bad: gs.js is a UMD/CJS wrapper, not an ES module
import initGhostscript from '…/gs.js';

Every time we “fixed” that, we accidentally regressed into a frozen UI with no spinner, or a 404 on the wrong CDN path. The back‑and‑forth was a reminder of the annoyances of vibe coding, especially with ChatGPT...
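
One way to surface this kind of loading mistake earlier is a small guard that runs right after initialization. This is only a sketch: assertGsReady is a hypothetical helper, not part of ghostscript-wasm, and it checks only for the two members this post relies on.

```javascript
// Hypothetical guard: fail fast with a readable message if the module object
// is missing the pieces we need (FS.writeFile and callMain).
function assertGsReady(gs) {
  const missing = [];
  if (!gs || typeof gs.callMain !== 'function') missing.push('callMain');
  if (!gs || !gs.FS || typeof gs.FS.writeFile !== 'function') missing.push('FS.writeFile');
  if (missing.length > 0) {
    throw new Error(
      `Ghostscript module not ready (missing: ${missing.join(', ')}). ` +
      'Was the UMD bundle imported as an ES module?'
    );
  }
  return gs;
}
```

Calling it once on the object returned by the loader turns the vague "reading 'writeFile'" failure into an error that points at the likely cause.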


5. Final Solution: Clean ESM + Polished UX

We settled on the official ES module entrypoint gs.mjs from jsDelivr. That gives us a true initGhostscript() promise that resolves to a ready‑to‑use gs object, complete with gs.FS and gs.callMain().

<script type="module">
import initGhostscript from 'https://cdn.jsdelivr.net/npm/@jspawn/ghostscript-wasm@0.0.2/gs.mjs';

const gs = await initGhostscript({
  locateFile: file => 
    `https://cdn.jsdelivr.net/npm/@jspawn/ghostscript-wasm@0.0.2/${file}`
});

// now gs.FS.writeFile() and gs.callMain() work reliably
</script>

We also kept the UI from appearing frozen by deferring the heavy callMain() with a setTimeout(), which gives the browser a chance to paint a CSS spinner first so users know compression is underway.
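
The deferral can be sketched roughly as follows. runWithSpinner and its three callbacks are hypothetical names for illustration; the only technique assumed is setTimeout(…, 0), which hands control back to the event loop once before the heavy work starts.

```javascript
// Hypothetical helper: show a spinner, yield to the browser once, then run the heavy work.
function runWithSpinner(showSpinner, hideSpinner, work) {
  showSpinner();
  return new Promise((resolve, reject) => {
    // Deferring with setTimeout lets the current task finish, so the browser
    // usually gets a chance to render the spinner before the work begins.
    setTimeout(async () => {
      try {
        resolve(await work());
      } catch (err) {
        reject(err);
      } finally {
        hideSpinner(); // runs on success and on failure
      }
    }, 0);
  });
}
```

A call would look something like await runWithSpinner(show, hide, () => gs.callMain(args)). Note that the main thread is still busy while the work itself runs; a Web Worker would be the next step if the page needs to stay fully interactive during compression.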

The result is a single HTML page, with no build tools required, that runs entirely in the browser and produces vector‑quality compressed PDFs.


6. Try It Yourself

Check it out live at:

🔗 https://austegard.com/pdf-compressor.html


Conclusion

What began as a quick sketch in Claude 3.7 Sonnet matured through iterative AI‑assisted brainstorming into a robust, client‑side PDF compressor powered by WebAssembly. Along the way we:

  • Learned that pure JPEG‑raster approaches, while easy, compromise text fidelity.
  • Tackled module‑loading gotchas and UI freeze issues.
  • Delivered a seamless UX with vector‑preserving compression at your fingertips.

If you’re building browser‑only tooling, WebAssembly + Ghostscript might just be the secret sauce you need. Enjoy!

@oaustegard

All code was vibe coded -- I only served as product manager, deployment engineer and tester. 96.3% of the above blog post was also written by o4-mini-high: I served as editor, adding some links and removing some unnecessary detail about iterations.

All in, the tool was created, with 4 different working iterations, then documented in a little over an hour.

@oaustegard

Update

Adding Text Extraction: The Multi-Page Challenge

We also added a text extraction feature to our PDF compressor, which turned out to be trickier than anticipated. The initial implementation only extracted the last page of text—not very useful for multi-page documents!

The Page-by-Page Solution

The breakthrough came from understanding how Ghostscript's text extraction works. Instead of a single output file, we needed to tell Ghostscript to create individual files for each page using a special pattern syntax:

await gs.callMain([
  '-dNOPAUSE', '-dBATCH', '-dQUIET',
  '-sDEVICE=txtwrite',
  '-sOutputFile=page%d.txt', // The %d is the magic here!
  '-dTextFormat=3',
  'input.pdf'
]);

That %d placeholder instructs Ghostscript to create sequentially-numbered files (page1.txt, page2.txt, etc.) for each page's text content. We then stitch them together with page markers:

let allText = '';
let pageIndex = 1;
let hasMorePages = true;

while (hasMorePages) {
  try {
    const pageFilename = `page${pageIndex}.txt`;
    const pageText = new TextDecoder().decode(
      gs.FS.readFile(pageFilename)
    );
    allText += `--- Page ${pageIndex} ---\n${pageText}\n\n`;
    pageIndex++;
  } catch (e) {
    // No more pages found
    hasMorePages = false;
  }
}

This pattern uses exceptions as control flow: gs.FS.readFile() throws when asked for a file that does not exist, and that error tells us we've reached the end of the document.
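
An alternative sketch that avoids relying on a thrown error is to ask the filesystem whether the next page file exists. This assumes the build exposes Emscripten's FS.analyzePath alongside readFile and unlink. collectPageText is a hypothetical helper name, and deleting each page file after reading it is an added precaution, not something the original tool is known to do.

```javascript
// Assumes `fs` is Emscripten's FS object (e.g. gs.FS) with analyzePath, readFile and unlink.
function collectPageText(fs, prefix = 'page') {
  const decoder = new TextDecoder();
  const parts = [];
  // Walk page1.txt, page2.txt, ... until the next file is missing.
  for (let n = 1; fs.analyzePath(`${prefix}${n}.txt`).exists; n++) {
    const name = `${prefix}${n}.txt`;
    parts.push(`--- Page ${n} ---\n${decoder.decode(fs.readFile(name))}\n\n`);
    fs.unlink(name); // keep pages from a longer, earlier document out of the next run
  }
  return parts.join('');
}
```

Removing the files matters when the in-memory filesystem is reused: without cleanup, extracting a 3-page document after a 5-page one could pick up the old page4.txt and page5.txt.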

Text Cleanup & Formatting

We added a cleanupPDFText() function to post-process the extracted text, trimming excessive whitespace and improving readability:

function cleanupPDFText(text) {
  return text
    // Trim trailing whitespace on each line
    .replace(/[ \t]+$/gm, '')
    // Collapse 3 or more leading spaces/tabs on each line to a fixed 4-space indent
    .replace(/^[ \t]{3,}/gm, '    ');
}
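
To make the effect concrete, here is the same function applied to a small sample. The sample string is made up for illustration and is not taken from a real PDF.

```javascript
function cleanupPDFText(text) {
  return text
    // Trim trailing spaces/tabs on each line
    .replace(/[ \t]+$/gm, '')
    // Collapse 3 or more leading spaces/tabs to a 4-space indent
    .replace(/^[ \t]{3,}/gm, '    ');
}

// Line 1 has trailing spaces, line 2 has 8 leading spaces and a trailing tab,
// line 3 has only 2 leading spaces and is left alone.
const sample = 'Title   \n        deeply indented\t\n  short indent';
const cleaned = cleanupPDFText(sample);

console.log(JSON.stringify(cleaned));
// prints "Title\n    deeply indented\n  short indent"
```

Short indents survive, so lightly indented lists keep their shape while deeply indented lines are pulled back to a uniform 4 spaces.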

This simple addition makes our PDF utility more versatile—now you can grab both the optimized document and its textual content in one browser-based workflow. All with zero server dependency.
