@ebta
Forked from mbohun/NOTES.md
Last active September 13, 2023 01:44
Converting MS Office files to PDF

(Description of the different solutions / alternatives)

1. Microsoft Windows based solutions

1.1 Microsoft Graph API (Office 365)

This is the current, official, Microsoft endorsed/supported solution ("cloud based")
(2017 - present)

  1. The user uploads their MS Office document (source.doc in our example snippet below) to their Microsoft OneDrive

  2. The user then uses the Microsoft Graph REST API to send an authenticated HTTP GET request to the Convert content endpoint:

    GET /drive/root:/{path to file}:/content?format={format}
    

    setting the following parameters:

    • {path to file} you want to convert
    • {format} the desired output file format (PDF in our case)

    example:

    https://graph.microsoft.com/v1.0/me/drive/root:/source.doc:/content?format=pdf   
    
  3. The Microsoft Graph REST API sends back an HTTP 302 Found response whose Location header contains a pre-authenticated URL of the converted PDF document (ready for download)

  4. The user then downloads the converted PDF document from the URL returned in the previous step

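// accessToken is assumed to hold a valid OAuth 2.0 bearer token for Microsoft Graph.
HttpWebRequest convToPdfRequest =
    (HttpWebRequest)WebRequest.Create("https://graph.microsoft.com/v1.0/me/drive/root:/source.doc:/content?format=pdf");
convToPdfRequest.Headers.Add("Authorization", "Bearer " + accessToken);
// Do not follow the 302 redirect automatically, so the Location header can be read.
convToPdfRequest.AllowAutoRedirect = false;

HttpWebResponse convToPdfResponse =
    (HttpWebResponse)convToPdfRequest.GetResponse();

string pdfDownloadUrl = convToPdfResponse.GetResponseHeader("Location");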
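
The four steps above can also be sketched in Python, one of the languages these notes consider elsewhere. This is a minimal sketch using only the standard library; the helper names `build_convert_url` and `fetch_pdf_download_url` are made up for illustration, and it assumes you already hold a valid Microsoft Graph access token obtained by some other means.

```python
import urllib.error
import urllib.parse
import urllib.request

# Base of the OneDrive path-addressing syntax used in the example URL above.
GRAPH_DRIVE_ROOT = "https://graph.microsoft.com/v1.0/me/drive/root:"


def build_convert_url(item_path, output_format="pdf"):
    """Return the conversion URL for a file path relative to the drive root."""
    quoted_path = urllib.parse.quote(item_path.strip("/"))
    return f"{GRAPH_DRIVE_ROOT}/{quoted_path}:/content?format={output_format}"


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following the redirect so Location can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_pdf_download_url(item_path, access_token):
    """Send the GET request and return the URL from the Location header."""
    request = urllib.request.Request(
        build_convert_url(item_path),
        headers={"Authorization": f"Bearer {access_token}"},
    )
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        response = opener.open(request)
    except urllib.error.HTTPError as error:
        # With redirects disabled, urllib reports the 302 as an HTTPError.
        if error.code == 302:
            return error.headers["Location"]
        raise
    raise RuntimeError(f"expected a 302 redirect, got HTTP {response.status}")
```

A caller would pass the returned URL to something like `urllib.request.urlretrieve(url, "source.pdf")` to complete step 4. Letting the HTTP client follow the redirect automatically would also work and would deliver the PDF bytes directly; the redirect is intercepted here only to mirror the steps as described.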

1.2 Microsoft SharePoint 2010 Word Automation Services

This is the previous, official Microsoft endorsed/supported (server-side) solution
(2010 - 2016)

  1. The user has to purchase SharePoint Server 2010 ("standard edition", or "enterprise edition")
  2. Word Automation Services is a service that installs and runs (by default) with a stand-alone SharePoint Server 2010 installation
    • Microsoft recommends setting the number of worker processes to no more than one less than the number of processors on your server
    • Microsoft recommends that you configure the system for a maximum of 90 document conversions per worker process per minute
    • By default, it starts conversion processes at 15-minute intervals. In scenarios where you want Word Automation Services to use as many resources as possible, setting the interval to one minute may help
  3. Once installed and configured you can start using it:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.SharePoint;
using Microsoft.Office.Word.Server.Conversions;

class Program
{
    static void Main(string[] args)
    {
        string siteUrl = "http://localhost";
        // If you manually installed Word automation services, then replace the name
        // in the following line with the name that you assigned to the service when
        // you installed it.
        string wordAutomationServiceName = "Word Automation Services";
        using (SPSite spSite = new SPSite(siteUrl))
        {
            ConversionJob job = new ConversionJob(wordAutomationServiceName);
            job.UserToken = spSite.UserToken;
            job.Settings.UpdateFields = true;
            job.Settings.OutputFormat = SaveFormat.PDF;
            job.AddFile(siteUrl + "/Shared%20Documents/source.doc",
                siteUrl + "/Shared%20Documents/source.pdf");
            job.Start();
        }
    }
}

NOTE: The official Microsoft description of this solution, including installation, configuration, software development advice with C# examples is here.
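
The sizing guidance in step 2 lends itself to a quick back-of-the-envelope estimate. The sketch below is only arithmetic on the figures quoted above (one worker fewer than the processor count, 90 conversions per worker process per minute). The function names are hypothetical, and the result is an upper bound under those assumptions, not a measured throughput.

```python
def recommended_worker_processes(processor_count):
    """At most one less than the number of processors, but never below one."""
    return max(1, processor_count - 1)


def max_conversions_per_interval(worker_processes, interval_minutes,
                                 per_worker_per_minute=90):
    """Upper bound on conversions started in one scheduling interval."""
    return worker_processes * per_worker_per_minute * interval_minutes


workers = recommended_worker_processes(4)          # 4-processor server
print(workers)                                     # 3
print(max_conversions_per_interval(workers, 15))   # 4050 (default interval)
print(max_conversions_per_interval(workers, 1))    # 270 (one-minute interval)
```

The hourly ceiling is the same in both cases (3 × 90 × 60 = 16200); shortening the interval mainly reduces how long a newly submitted job waits before it is picked up.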

1.3 Microsoft Office over COM

This is an older, constrained solution (perhaps "workaround" is more fitting)
(approx. 2005 - 2010)

Constraints / Limitations

"All current versions of Microsoft Office were designed, tested, and configured to run as end-user products on a client workstation. They assume an interactive desktop and user profile. They do not provide the level of reentrancy or security that is necessary to meet the needs of server-side components that are designed to run unattended.

Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, ASP.NET, DCOM, and NT Services), because Office may exhibit unstable behavior and/or deadlock when Office is run in this environment."
Microsoft's official explanation of the constraints/limitations of this approach is here.

SUMMARY: The constraints/limitations above are the reason why the "MsOfficeToPdfConverter service" will most likely be implemented and run as an interactive-desktop-like Microsoft Windows application, rather than a "real" Microsoft Windows service.

  1. Microsoft Windows env. with Microsoft Office 2007 (or higher) pre-installed
  2. The solution is a script or application written in a language of your choice (PowerShell, .NET/C#, Python/pywin32) that uses the Microsoft COM layer to invoke Microsoft Office (for example MS Word) functionality.
    For each of the MS Office documents you want to convert, the script or application:
    1. opens the MS Office document
    2. saves the document in the desired output format (PDF)
    3. closes the document

Example (This is the original script that was tested to convert the example patient record MS Office files to PDF on a Microsoft Windows 10 env):

# This script converts all the .doc and .docx files in the `$documents_path` dir to .pdf
#
# It needs proper/robust error handling:
#    - https://stackoverflow.com/questions/16534292/basic-powershell-batch-convert-word-docx-to-pdf
#      NOTE: the 2nd post about crashes, and their ("typical") m$ workaround
#

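$documents_path = '.\test_files'

$word_app = New-Object -ComObject Word.Application

try {
    # This filter will find .doc as well as .docx documents
    Get-ChildItem -Path $documents_path -Filter *.doc? | ForEach-Object {
        $document = $word_app.Documents.Open($_.FullName)
        try {
            $pdf_filename = "$($_.DirectoryName)\$($_.BaseName).pdf"
            # 17 is the WdSaveFormat value wdFormatPDF
            $document.SaveAs([ref] $pdf_filename, [ref] 17)
        }
        finally {
            # Close the document even if saving it failed
            $document.Close()
        }
    }
}
finally {
    # Always shut Word down so that no orphaned WINWORD.EXE process is left behind
    $word_app.Quit()
}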
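
The same open / save / close loop can be sketched with Python and pywin32, the other scripting option listed in step 2. This is an untested-on-your-machine sketch, not the script that was used for the conversion above: the helper names are hypothetical, it assumes pywin32 and Microsoft Word are installed, and the COM part runs on Windows only, which is why `win32com` is imported inside the function.

```python
from pathlib import Path

WD_FORMAT_PDF = 17  # WdSaveFormat value wdFormatPDF
WORD_SUFFIXES = {".doc", ".docx"}


def is_word_document(path):
    """True for .doc and .docx files, ignoring the case of the extension."""
    return path.suffix.lower() in WORD_SUFFIXES


def pdf_target(doc_path):
    """Return the output path: same folder and base name, .pdf extension."""
    return doc_path.with_suffix(".pdf")


def convert_folder(folder):
    """Convert every Word document in `folder` to PDF via Word over COM."""
    import win32com.client  # provided by pywin32; Windows only

    word = win32com.client.Dispatch("Word.Application")
    try:
        # Word expects absolute paths, so resolve the folder first.
        for doc_path in sorted(Path(folder).resolve().iterdir()):
            if not is_word_document(doc_path):
                continue
            document = word.Documents.Open(str(doc_path))
            try:
                document.SaveAs(str(pdf_target(doc_path)),
                                FileFormat=WD_FORMAT_PDF)
            finally:
                document.Close()
    finally:
        # Quit Word even if a conversion failed part-way through.
        word.Quit()
```

Calling `convert_folder("test_files")` would process the same directory as the PowerShell script. The constraints quoted above apply equally here: the script drives an interactive Word instance and is not suited to running unattended as a Windows service.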

2. Non-Microsoft Windows based solutions

2.1 Adobe Acrobat DC solution

2.2 Other third-party solutions

REFERENCES
