@mogsdad
Last active April 10, 2024 01:38
For http://stackoverflow.com/questions/26613809, a question about getting pdf attachments in gmail as text. I got a little carried away - this does much more than asked.

Google Apps Script pdfToText Utility

This is a helper function that will convert a given PDF file blob into text, as well as offering options to save the original PDF, intermediate Google Doc, and/or final plain text files. Additionally, the language used for Optical Character Recognition (OCR) may be specified, defaulting to 'en' (English).

Note: Updated 12 May 2015 due to deprecation of DocsList. Thanks to Bruce McPherson for the getDriveFolderFromPath() utility.

    // Start with a Blob object
    var blob = gmailAttachment.getAs(MimeType.PDF);
    
    // fileId will be the ID of a saved text file (default behavior):
    var fileId = pdfToText( blob );

    // filetext will contain text from pdf file, no residual files are saved:
    var filetext = pdfToText( blob, {keepTextfile: false} );

    // we can save other converted file types, too:
    var options = {
       keepPdf : true,            // Keep a copy of the original PDF file.
       keepGdoc : true,           // Keep a copy of the OCR Google Doc file.
       keepTextfile : true,       // Keep a copy of the text file. (default)
       path : "attachments/today" // Folder path to store file(s) in.
    }
    fileId = pdfToText( blob, options );   // keepTextfile is true and textResult is unset, so this is a file ID
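
Whether pdfToText() hands back a file ID or the text itself depends on two options, which is easy to misread. The following is a minimal sketch of that decision, following the rules stated in the function's header comment. `describeResult` is a hypothetical helper for illustration only; it is not part of the gist.

```javascript
// Hypothetical helper (not part of the gist): reports what pdfToText()
// would return for a given options object.
function describeResult(options) {
  options = options || {};
  // keepTextfile defaults to true when the caller does not set it.
  var keepTextfile = Object.prototype.hasOwnProperty.call(options, 'keepTextfile')
    ? options.keepTextfile
    : true;
  // Text comes back if no text file is kept, or if textResult is requested.
  return (!keepTextfile || options.textResult) ? 'text' : 'fileId';
}

console.log(describeResult());                        // fileId
console.log(describeResult({ keepTextfile: false })); // text
console.log(describeResult({ textResult: true }));    // text
console.log(describeResult({ keepPdf: true }));       // fileId
```
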
/**
 * Convert pdf file (blob) to a text file on Drive, using built-in OCR.
 * By default, the text file will be placed in the root folder, with the same
 * name as source pdf (but extension 'txt'). Options:
 *
 *   keepPdf      (boolean, default false) Keep a copy of the original PDF file.
 *   keepGdoc     (boolean, default false) Keep a copy of the OCR Google Doc file.
 *   keepTextfile (boolean, default true)  Keep a copy of the text file.
 *   path         (string, default blank)  Folder path to store file(s) in.
 *   ocrLanguage  (ISO 639-1 code)         Default 'en'.
 *   textResult   (boolean, default false) If true and keepTextfile true, return
 *                                         string of text content. If keepTextfile
 *                                         is false, text content is returned without
 *                                         regard to this option. Otherwise, return
 *                                         id of textfile.
 *
 * @param {blob}    pdfFile    Blob containing pdf file
 * @param {object}  options    (Optional) Object specifying handling details
 *
 * @returns {string}           id of text file (default) or text content
 */
function pdfToText ( pdfFile, options ) {
  // Ensure Advanced Drive Service is enabled.
  // Note: Drive.Files.insert() exists only in v2 of the advanced Drive service.
  try {
    Drive.Files.list();
  }
  catch (e) {
    throw new Error( "To use pdfToText(), first enable 'Drive API' in Resources > Advanced Google Services." );
  }

  // Set default options
  options = options || {};
  options.keepTextfile = options.hasOwnProperty("keepTextfile") ? options.keepTextfile : true;

  // Prepare resource object for file creation
  var parents = [];
  if (options.path) {
    // Drive API v2 expects parent references ({id: ...}), not Folder objects.
    // If the path cannot be resolved, the files go to the root folder.
    var folder = getDriveFolderFromPath (options.path);
    if (folder) {
      parents.push( {id: folder.getId()} );
    }
  }
  var pdfName = pdfFile.getName();
  var resource = {
    title: pdfName,
    mimeType: pdfFile.getContentType(),
    parents: parents
  };

  // Save PDF to Drive, if requested
  if (options.keepPdf) {
    var file = Drive.Files.insert(resource, pdfFile);
  }

  // Save PDF as GDOC (match the extension only, in any letter case)
  resource.title = pdfName.replace(/\.pdf$/i, '.gdoc');
  var insertOpts = {
    ocr: true,
    ocrLanguage: options.ocrLanguage || 'en'
  };
  var gdocFile = Drive.Files.insert(resource, pdfFile, insertOpts);

  // Get text from GDOC
  var gdocDoc = DocumentApp.openById(gdocFile.id);
  var text = gdocDoc.getBody().getText();

  // We're done using the Gdoc. Unless requested to keepGdoc, delete it.
  if (!options.keepGdoc) {
    Drive.Files.remove(gdocFile.id);
  }

  // Save text file, if requested
  if (options.keepTextfile) {
    resource.title = pdfName.replace(/\.pdf$/i, '.txt');
    resource.mimeType = MimeType.PLAIN_TEXT;
    var textBlob = Utilities.newBlob(text, MimeType.PLAIN_TEXT, resource.title);
    var textFile = Drive.Files.insert(resource, textBlob);
  }

  // Return result of conversion
  if (!options.keepTextfile || options.textResult) {
    return text;
  }
  else {
    return textFile.id;
  }
}

// Helper utility from http://ramblings.mcpher.com/Home/excelquirks/gooscript/driveapppathfolder
function getDriveFolderFromPath (path) {
  return (path || "/").split("/").reduce ( function(prev,current) {
    if (prev && current) {
      var fldrs = prev.getFoldersByName(current);
      return fldrs.hasNext() ? fldrs.next() : null;
    }
    else {
      return current ? null : prev;
    }
  },DriveApp.getRootFolder());
}
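
To see how the reduce in getDriveFolderFromPath() walks a path one segment at a time, here is a sketch that runs the same logic against an in-memory stand-in for Drive. Both `makeFolder` (a mock folder) and `folderFromPath` (which takes the root as a parameter instead of calling DriveApp.getRootFolder()) are assumptions made for illustration; neither is part of the gist or of any Drive API.

```javascript
// In-memory stand-in for a Drive Folder (illustration only).
function makeFolder(name, children) {
  children = children || [];
  return {
    getName: function () { return name; },
    // Mimics the iterator returned by Folder.getFoldersByName().
    getFoldersByName: function (wanted) {
      var matches = children.filter(function (c) { return c.getName() === wanted; });
      var i = 0;
      return {
        hasNext: function () { return i < matches.length; },
        next: function () { return matches[i++]; }
      };
    }
  };
}

// Same reduce as getDriveFolderFromPath(), with the root passed in.
function folderFromPath(root, path) {
  return (path || '/').split('/').reduce(function (prev, current) {
    if (prev && current) {
      var fldrs = prev.getFoldersByName(current);
      return fldrs.hasNext() ? fldrs.next() : null;
    }
    // Empty segments (leading or trailing slashes) leave the position unchanged.
    return current ? null : prev;
  }, root);
}

var mockRoot = makeFolder('My Drive', [
  makeFolder('attachments', [makeFolder('today')])
]);

console.log(folderFromPath(mockRoot, 'attachments/today').getName()); // today
console.log(folderFromPath(mockRoot, '/attachments/').getName());     // attachments
console.log(folderFromPath(mockRoot, 'attachments/yesterday'));       // null
console.log(folderFromPath(mockRoot, '').getName());                  // My Drive
```

A `null` result means some path segment was not found, so callers should check for it before using the folder.
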
@ChaIcuWo

ChaIcuWo commented Sep 4, 2015

Generally this works really well for my purposes, but I've had a repeating issue with some PDFs where the text returned is not from the entire document. In the most recent instance this is a 19 page PDF and only the first 10 pages are returned. This repeats even if I randomly re-arrange the pages in the document. Is this a known issue/limitation?

EDIT: I found this https://stackoverflow.com/questions/27303488/using-gas-to-overcome-the-max-file-size-limit-for-google-drive-api-drive-files-i immediately after posting this... Seems we're stuck with this

@bobsquito

I've been using this for a few months but yesterday it just stopped working completely, is anyone else having this problem?

It fails on line
var gdocFile = Drive.Files.insert(resource, pdfFile, insertOpts);

Just says internal error. I've also noticed that I can't right click a pdf in Drive and go to "open with > google docs" as that just errors too. I hope this gets fixed as I use this a lot.

@woetwoet

Same issue here.... Also internal error on the same line.

@woetwoet

I logged it as an issue with google : https://code.google.com/p/google-apps-script-issues/issues/detail?id=6201. As far as I see it is a bug with the google dudes.
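
If an "internal error" from Drive.Files.insert() turns out to be intermittent rather than a service-side bug like the one reported here, retrying with a growing delay is one possible mitigation. This is only a sketch: `withRetries` is a hypothetical helper, not part of the gist, and it cannot help when the service fails on every call. The sleep function is passed in so the example also runs outside Apps Script.

```javascript
// Hypothetical retry wrapper (not part of the gist).
//   fn       - function to call
//   attempts - maximum number of tries
//   sleep    - function(ms) that pauses between tries
function withRetries(fn, attempts, sleep) {
  var delayMs = 1000;
  for (var i = 1; ; i++) {
    try {
      return fn();
    } catch (e) {
      if (i >= attempts) throw e; // out of attempts: surface the last error
      sleep(delayMs);
      delayMs *= 2;               // exponential backoff: 1s, 2s, 4s, ...
    }
  }
}

// Demo with a fake operation that fails twice before succeeding.
var demoCalls = 0;
var demoResult = withRetries(function () {
  demoCalls++;
  if (demoCalls < 3) throw new Error('Internal error');
  return 'converted';
}, 5, function (ms) { /* no real waiting in this demo */ });
console.log(demoResult, demoCalls); // converted 3

// In Apps Script this might be used as:
//   var gdocFile = withRetries(function () {
//     return Drive.Files.insert(resource, pdfFile, insertOpts);
//   }, 3, function (ms) { Utilities.sleep(ms); });
```
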

@dan23njguy

Hi mogsdad,

I currently use a script to download emailed pdf docs and save them to drive. Would you be interested in combining the two scripts or adding on to either? I am interested in engaging someone to code various scripts.

Thanks!
Dan

@arturoliveira

@dan23njguy i currently also use another script to save attachments from gmail to drive and was looking for a way to search text inside pdfs and found this. Would you be interested in working together ?

@2wice2

2wice2 commented Mar 19, 2018

Is it possible to rename PDF from the content of PDF?
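
One way this could work is to derive a name from the extracted text and then rename the file. The sketch below takes the first non-blank line of the text and cleans it up; `fileNameFromText` is a hypothetical helper, not part of the gist, and the choice of "first line" and a 60-character limit are assumptions that will not suit every document.

```javascript
// Hypothetical helper (not part of the gist): derive a file name from
// extracted text by taking the first non-blank line.
function fileNameFromText(text, fallback) {
  var firstLine = (text || '').split(/\r?\n/)
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line.length > 0; })[0];
  if (!firstLine) return fallback;
  var base = firstLine
    .replace(/[\\\/:*?"<>|]/g, ' ') // drop characters that are awkward in file names
    .replace(/\s+/g, ' ')           // collapse runs of whitespace
    .trim()
    .slice(0, 60)                   // keep the name reasonably short
    .trim();
  return base ? base + '.pdf' : fallback;
}

console.log(fileNameFromText('\n  Invoice #1234: ACME Corp \nTotal due...', 'scan.pdf'));
// Invoice #1234 ACME Corp.pdf

// In Apps Script this might be combined with pdfToText() as follows,
// where pdfFileId is a placeholder for the ID of the saved PDF:
//   var text = pdfToText(blob, {keepTextfile: false});
//   DriveApp.getFileById(pdfFileId).setName(fileNameFromText(text, blob.getName()));
```
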

@Armsp0

Armsp0 commented May 24, 2018

I try to debug/run this code and I get the error "TypeError: Cannot call method "getName" of undefined etc"
What is wrong here? obviously I am a novice....
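
This error most likely means pdfToText() was run directly from the script editor, in which case no arguments are passed and pdfFile is undefined. A sketch of a guard that fails with a clearer message, plus a small wrapper that can be run from the editor instead, is shown below. `assertPdfBlob` and `testPdfToText` are hypothetical names, and the file ID is a placeholder.

```javascript
// Hypothetical guard (not part of the gist): fail early with a clearer
// message when pdfToText() is called without a Blob-like argument.
function assertPdfBlob(pdfFile) {
  if (!pdfFile || typeof pdfFile.getName !== 'function') {
    throw new Error('pdfToText() needs a PDF blob as its first argument; ' +
      'call it from another function rather than running it directly.');
  }
  return pdfFile;
}

// A wrapper like this can be selected and run from the script editor.
// Replace the placeholder with the ID of a real PDF in Drive.
//   function testPdfToText() {
//     var blob = DriveApp.getFileById('YOUR_PDF_FILE_ID').getBlob();
//     Logger.log(pdfToText(blob, {keepTextfile: false}));
//   }
```
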

@vladox

vladox commented Apr 10, 2021

I'll recommend using pdftotext from the poppler package

@ddavidio

Hi David, by any chance, is there a license for this gist? I would like to use it in one of my projects, but I want to make sure that I am not infringing on your copyright.

@muhammadammarafzal

Hi, My files are saving in the "My Drive" not in my desired folder, even though I put right address in path : "Attachments/Test" which exists. Can anyone help me to solve this issue?

@appscriptexpert

> Hi, My files are saving in the "My Drive" not in my desired folder, even though I put right address in path : "Attachments/Test" which exists. Can anyone help me to solve this issue?

Hi, Actually it is referring main drive (Drive.files), you need to replace it with "DriveApp.getFolderById('string_id_of_my_folder');"

You may visit us for more at help https://appsscriptexpert.com/

@muhammadammarafzal

> > Hi, My files are saving in the "My Drive" not in my desired folder, even though I put right address in path : "Attachments/Test" which exists. Can anyone help me to solve this issue?
>
> Hi, Actually it is referring main drive (Drive.files), you need to replace it with "DriveApp.getFolderById('string_id_of_my_folder');"
>
> You may visit us for more at help https://appsscriptexpert.com/

It's giving error, "TypeError: DriveApp.getFolderById(...).insert is not a function".
https://script.google.com/d/17hfC56JdTkwL-XoXh_88u3bYysJuqd9ihKLtoO_rgEh-7lkCXPiHe9o6/edit?usp=sharing

@thokoe

thokoe commented Oct 20, 2022

Hi, Is it possible to run the script without having to save the google doc to drive and then delete it.

@appscriptexpert

appscriptexpert commented Oct 20, 2022 via email

@dahse89

dahse89 commented Apr 13, 2023

The Script wasn't working for me, but i found this https://www.labnol.org/extract-text-from-pdf-220422

/*
 * Convert PDF file to text
 * @param {string} fileId - The Google Drive ID of the PDF
 * @param {string} language - The language of the PDF text to use for OCR
 * return {string} - The extracted text of the PDF file
 */

const convertPDFToText = (fileId, language) => {
  fileId = fileId || '18FaqtRcgCozTi0IyQFQbIvdgqaO_UpjW'; // Sample PDF file
  language = language || 'en'; // English

  // Read the PDF file in Google Drive
  const pdfDocument = DriveApp.getFileById(fileId);

  // Use OCR to convert PDF to a temporary Google Document
  // Restrict the response to include file Id and Title fields only
  const { id, title } = Drive.Files.insert(
    {
      title: pdfDocument.getName().replace(/\.pdf$/, ''),
      mimeType: pdfDocument.getMimeType() || 'application/pdf',
    },
    pdfDocument.getBlob(),
    {
      ocr: true,
      ocrLanguage: language,
      fields: 'id,title',
    }
  );

  // Use the Document API to extract text from the Google Document
  const textContent = DocumentApp.openById(id).getBody().getText();

  // Delete the temporary Google Document since it is no longer needed
  DriveApp.getFileById(id).setTrashed(true);

  // (optional) Save the text content to another text file in Google Drive
  const textFile = DriveApp.createFile(`${title}.txt`, textContent, 'text/plain');
  return textContent;
};

@ariessetiyawan

> The Script wasn't working for me, but i found this https://www.labnol.org/extract-text-from-pdf-220422

How about getting images? When I add this to the script:

const ImgContent = DocumentApp.openById(id).getBody().getImages();

I cannot get all the images. The PDF file contains 3 images, but only 2 are detected.

@fernandomora

The way the gist calls Drive.Files.insert is outdated: the parents field now needs a ParentReference, and without one the request always uploads to the root folder.

The if (options.path) block must be replaced by:

  if (options.path) {
    const folder = getDriveFolderFromPath (options.path);
    if (folder) {
      const parentReference = Drive.newParentReference();
      parentReference.id = folder.getId();
      parents.push(parentReference);
    }
  }

@lidia-rbr

I'm using this function that works well:

/**
 * EXTRACT TEXT CONTENT FROM PDF 
 * 
 * @param {string} fileId
 * @param {string} parentFolderId
 * @returns {string} pdfContent
 */
function extractTextFromPDF(fileId, parentFolderId) {

  const destFolder = Drive.Files.get(parentFolderId, { "supportsAllDrives": true });
  const newFile = {
    "fileId": fileId,
    "parents": [
      destFolder
    ]
  };
  const args = {
    "resource": {
      "parents": [
        destFolder
      ],
      "name": "temp",
      "mimeType": "application/vnd.google-apps.document",
    },
    "supportsAllDrives": true
  };

  const newTargetDoc = Drive.Files.copy(newFile, fileId, args);
  const newTargetFile = DocumentApp.openById(newTargetDoc.getId());
  const pdfContent = newTargetFile.getBody().getText();

  return pdfContent;
}
