@Gubolin
Last active May 30, 2023 01:52
PDF.js text extraction
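// Adapted from https://stackoverflow.com/questions/1554280/extract-text-from-pdf-in-javascript/9213180#9213180
// Assumes `page`, `renderContext` and `callback` are defined by the surrounding code.
page.render(renderContext).promise.then(function(pdf){
  // This code treats the text content as an array of blocks with `str`, `x` and `y`.
  page.getTextContent().then(function(textContent){
    var page_text = "";
    var last_block = null;
    for(var j = 0; j < textContent.length; j++){
      var block = textContent[j];
      // Only consider a line break if the previous block does not end with a space.
      if(last_block != null && last_block.str[last_block.str.length - 1] != ' '){
        if(block.x < last_block.x){
          // The block starts to the left of the previous one: new line.
          page_text += "\r\n";
        }else if(last_block.y != block.y && (last_block.str.match(/^(\s?[a-zA-Z])$|^(.+\s[a-zA-Z])$/) == null )){
          // Different y position, and the previous block does not end in a lone letter.
          page_text += "\r\n";
        }
      }
      page_text += block.str;
      last_block = block;
    }
    console.log(page_text);
  });
  // Note: this runs before the text extraction above has finished.
  callback(pdf);
});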
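
The line-break heuristic above can be tried without a PDF by pulling it into a plain function. This is only a sketch: `joinTextBlocks` and the sample data are hypothetical names introduced for illustration, and the blocks are assumed to have the same `str`, `x` and `y` fields the snippet reads.

```javascript
// Hypothetical helper (not part of PDF.js): joins text blocks into one string,
// inserting "\r\n" with the same heuristic as the snippet above.
// Each block is assumed to look like { str: string, x: number, y: number }.
function joinTextBlocks(blocks) {
  var text = "";
  var last = null;
  for (var i = 0; i < blocks.length; i++) {
    var block = blocks[i];
    // Only consider a break if the previous block does not end with a space.
    if (last !== null && last.str[last.str.length - 1] !== " ") {
      if (block.x < last.x) {
        // Block starts to the left of the previous one: treat as a new line.
        text += "\r\n";
      } else if (last.y !== block.y &&
                 last.str.match(/^(\s?[a-zA-Z])$|^(.+\s[a-zA-Z])$/) === null) {
        // Different y position, and the previous block does not end in a lone letter.
        text += "\r\n";
      }
    }
    text += block.str;
    last = block;
  }
  return text;
}

// Made-up sample blocks, for illustration only.
var sample = [
  { str: "Hello ", x: 10, y: 100 },
  { str: "world", x: 50, y: 100 },
  { str: "Next line", x: 10, y: 80 }
];
console.log(joinTextBlocks(sample));
```

Keeping the heuristic in a function that returns the string, rather than logging it inside the promise callback, also makes it easier to pass the result on once extraction has finished.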