@huned
Created May 11, 2020 16:00
Convert a PDF to JSON with pdf.js.
const fs = require('fs')
const util = require('util')
const pdfjs = require('pdfjs-dist')

// Read a PDF and return its metadata plus the text items of every page.
const readPdf = async filename => {
  const buf = await fs.promises.readFile(filename)
  // Some pdfjs-dist releases reject a Node Buffer, so pass a plain Uint8Array.
  const doc = await pdfjs.getDocument(new Uint8Array(buf)).promise
  const docBlob = {
    metadata: await doc.getMetadata(),
    pages: []
  }
  // pdf.js page numbers are 1-based.
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i)
    const pageBlob = {
      viewport: page.getViewport({ scale: 1.0 }),
      items: []
    }
    const textContent = await page.getTextContent()
    for (const item of textContent.items) {
      // Set some convenience properties: transform[4] and transform[5]
      // hold the item's x and y translation.
      item.x1 = item.transform[4]
      item.x2 = item.x1 + item.width
      item.y1 = item.transform[5]
      item.y2 = item.y1 + item.height
      pageBlob.items.push(item)
    }
    docBlob.pages.push(pageBlob)
  }
  // Dump the whole structure for inspection.
  console.log(util.inspect(docBlob, { depth: null }))
  return docBlob
}
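The convenience properties set in readPdf (x1, y1) can be used to rebuild lines of text from a page's items. The following is only a minimal sketch, not part of the original gist: the helper name groupItemsIntoLines and the tolerance parameter are hypothetical, and it assumes each item carries the str property that pdf.js puts on text items, plus the x1 and y1 values added above.

```javascript
// Hypothetical helper (not part of pdf.js): groups text items into lines
// using the x1/y1 convenience properties that readPdf sets on each item.
// `tolerance` is an assumed threshold in PDF units; tune it per document.
const groupItemsIntoLines = (items, tolerance = 2) => {
  // PDF y coordinates grow upward, so sort by descending y1 to go top to bottom.
  const sorted = [...items].sort((a, b) => b.y1 - a.y1 || a.x1 - b.x1)
  const lines = []
  for (const item of sorted) {
    const last = lines[lines.length - 1]
    if (last && Math.abs(last.y - item.y1) <= tolerance) {
      // Close enough to the current line's baseline: same line.
      last.items.push(item)
    } else {
      lines.push({ y: item.y1, items: [item] })
    }
  }
  // Order each line left to right and join the strings with a space.
  return lines.map(line => {
    line.items.sort((a, b) => a.x1 - b.x1)
    return line.items.map(i => i.str).join(' ')
  })
}

// Made-up sample items standing in for one entry of docBlob.pages[n].items.
const sampleItems = [
  { str: 'world', x1: 50, y1: 700.5 },
  { str: 'Hello', x1: 10, y1: 700 },
  { str: 'Second', x1: 10, y1: 680 }
]
console.log(groupItemsIntoLines(sampleItems))
```

With a real document, the same call would take docBlob.pages[0].items from the object returned by readPdf. Joining with a single space is a simplification; multi-column layouts or rotated text would need a more careful approach.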