@amitpatelx
Created July 2, 2020 06:10
How do PDF files work, and why is it hard to convert them into plain text?

How do PDF files work?

PDF files display text correctly wherever they are viewed because they carry their typographic information (the look and position of each individual letter) with them. Fonts used in the document are embedded in the PDF file and are used after distribution to reconstruct the document. The display therefore does not depend on the needed font files being available on the viewing machine, nor on the language of its operating system.

PDF documents present each page as a fixed layout, much like an image, so the ability to change the underlying text is limited. Most PDF files can still be searched, because the file effectively has two layers: a visual layer that is presented on screen and, behind it, usually a text layer that can be matched to the characters displayed.

When the starting point for a PDF file is a set of images or a scanning process, this text layer is not present and the result is an image-only PDF. When the starting point is an editable document, the text layer can be created and the PDF is called 'Normal' or 'Searchable'. The creator of a PDF can require a password to allow access to the text layer.

How does PDF Converter work?

PDF Converter can perform Optical Character Recognition (OCR), the process of extracting text from an image. It does not need OCR to unlock PDF or XPS files that have an accessible text layer. For those files, its job is to capture the page layout and arrange the existing text and other elements correctly on each page of the new document.

OCR is normally used only for input pages without an accessible text layer, or when non-standard character encoding is detected, but you can require it for any conversion under Processing Options in the Converter Assistant.
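
The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `PageInfo`, its fields, and `needs_ocr` are hypothetical names, not part of PDF Converter, and the way the product actually detects non-standard encoding is not described here.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    """Hypothetical summary of one input page (not a PDF Converter type)."""
    has_text_layer: bool             # an accessible text layer exists
    standard_encoding: bool = True   # text layer uses standard character encoding


def needs_ocr(page: PageInfo, force_ocr: bool = False) -> bool:
    """Return True when the page should go through OCR."""
    if force_ocr:                    # user required OCR for this conversion
        return True
    if not page.has_text_layer:      # image-only page: nothing to read directly
        return True
    # A text layer exists, but it is only usable if its encoding is standard.
    return not page.standard_encoding
```

For example, `needs_ocr(PageInfo(has_text_layer=True))` is `False`, while the same page with `force_ocr=True` is sent through OCR anyway.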

Handling Image-only Pages

Pages without a text layer are a special case for conversion. You can decide how the program should handle these pages: convert them with the built-in OCR, transfer them as images to the target document, or skip them. You can require inspection of the first pages (up to ten) in files you open. Optionally, you can set conversion to stop if no text-layer pages are detected.
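
A minimal sketch of these choices, assuming each page is represented only by a flag saying whether it has a text layer. The names `ImagePagePolicy` and `plan_conversion` are hypothetical, and the sketch assumes that the "stop" option looks only at the inspected pages; the product may behave differently.

```python
from enum import Enum
from typing import List, Sequence


class ImagePagePolicy(Enum):
    """What to do with a page that has no text layer (hypothetical names)."""
    OCR = "ocr"               # convert with built-in OCR
    KEEP_AS_IMAGE = "image"   # transfer the page as an image
    SKIP = "skip"             # leave the page out


def plan_conversion(has_text_layer: Sequence[bool],
                    policy: ImagePagePolicy,
                    inspect_first: int = 10,
                    stop_if_no_text: bool = False) -> List[str]:
    """Return one action per page: 'text', 'ocr', 'image' or 'skip'.

    An empty list means conversion was stopped.
    """
    # Inspect only the first pages, never more than ten.
    inspected = has_text_layer[:max(0, min(inspect_first, 10))]
    if stop_if_no_text and inspected and not any(inspected):
        return []  # no text-layer page found among the inspected pages
    return ["text" if has_text else policy.value for has_text in has_text_layer]
```

With a three-page file whose middle page is a scan, `plan_conversion([True, False, True], ImagePagePolicy.OCR)` keeps the text of pages one and three and sends page two to OCR.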

Let's understand the issue of conversion between PDF and Word documents.

A key issue for PDF users and Word users alike is the ability to edit. The PDF user is concerned with preventing it, while the Word user demands it. One side wants to protect the content of a file; the other wants to access it.

If you’re not the creator of either the source document or the PDF itself, could you determine who made changes, when, and what those changes were during the original editing of the Word document?

My answer is no. The PDF preserves only the look and feel of a Word document. It doesn’t preserve anything else, in particular not Word's metadata or recorded changes. There is no hidden Word data in the PDF file and no record of tracked changes, so there is nothing for a PDF converter to recover. Information that you don’t see on the page is lost during PDF creation.

In fact, when you create a PDF file, the actual paragraphs and words you do see aren’t preserved either. All a PDF does is preserve the look and position of each letter individually. That’s why converting a PDF back to Word is complicated: our application has to “figure out” which letters combine to make words and other elements.
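
To make this "figuring out" concrete, here is a small sketch that rebuilds lines and words from individually positioned letters. It is a simplified heuristic, not the converter's actual algorithm: the `Glyph` type, the `glyphs_to_lines` function, and the tolerance values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """One positioned letter (hypothetical type for illustration)."""
    char: str
    x: float       # left edge of the letter
    y: float       # baseline; in PDF coordinates y grows upward
    width: float


def glyphs_to_lines(glyphs: List[Glyph],
                    line_tol: float = 2.0,
                    gap_factor: float = 0.3) -> List[str]:
    """Group letters into text lines and insert spaces at wide gaps."""
    # Walk the letters from the top of the page to the bottom.
    ordered = sorted(glyphs, key=lambda g: -g.y)
    lines: List[List[Glyph]] = []
    for g in ordered:
        # Letters whose baselines are close enough belong to the same line.
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    result: List[str] = []
    for line in lines:
        line.sort(key=lambda g: g.x)          # left to right within a line
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_factor * prev.width:  # wide gap: assume a word break
                text += " "
            text += cur.char
        result.append(text)
    return result
```

Even this toy version has to guess at line tolerance and word spacing. Real documents add columns, tables, rotated text, and varying font sizes, which is why recovering paragraphs from letter positions is difficult.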

If you think about it, a PDF really is a form of “electronic paper.” It has the same effect as a printed page held in your hand: just by looking at it, you can’t tell how many edits or changes were made to reach that final version. What gets lost in translation stays lost.
