@gettalong
Last active October 29, 2017 16:41
Unicode NFC/NFD differences in PDF

Whether decomposed Unicode characters ("combining sequences") are positioned correctly in a PDF depends on the application that writes the PDF.

The basic approach, which most applications use, is to treat each Unicode character of the sequence as if it were a normal character. This leads to incorrectly positioned combining marks, because the glyph width of a combining mark is not suitable for every character it can be combined with.
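To make the "separate characters" point concrete, here is a minimal sketch that lists the code points of a decomposed "Müller". It only uses Ruby's core String methods; the variable names are illustrative and not part of the gist's script.

```ruby
# Decomposed form: "u" (U+0075) followed by U+0308 COMBINING DIAERESIS.
decomposed = "Mu\u0308ller"

# A writer that treats every code point as a normal character sees seven
# separate characters here, one of which is the combining mark.
codepoints = decomposed.codepoints.map { |cp| format("U+%04X", cp) }
puts codepoints.join(" ")
# prints "U+004D U+0075 U+0308 U+006C U+006C U+0065 U+0072"
```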

A better approach is to perform Unicode normalization (see http://unicode.org/reports/tr15/), more specifically Normalization Form C (NFC), which composes characters where possible (in contrast to NFD, which decomposes them). However, this may lead to changes in the meaning of some characters (see Figure 3 in the linked report).
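The following sketch illustrates what NFC does and where it stops helping, using Ruby's built-in String#unicode_normalize. The two extra sample strings ("q" with a diaeresis and the ohm sign) are examples chosen here for illustration and do not appear in the original script.

```ruby
nfd = "Mu\u0308ller"                      # "u" + U+0308, seven code points
nfc = nfd.unicode_normalize(:nfc)         # composes to U+00FC, six code points
puts nfc.length                           # prints 6
puts format("U+%04X", nfc.codepoints[1])  # prints "U+00FC"

# Not every combining sequence has a precomposed form: Unicode has no
# single code point for "q" with diaeresis, so NFC leaves it decomposed.
no_precomposed = "q\u0308".unicode_normalize(:nfc)
puts no_precomposed.length                # prints 2

# NFC can also replace a character with a different one: U+2126 OHM SIGN
# becomes U+03A9 GREEK CAPITAL LETTER OMEGA.
ohm = "\u2126".unicode_normalize(:nfc)
puts format("U+%04X", ohm.codepoints[0])  # prints "U+03A9"
```

So normalizing to NFC before writing text avoids the positioning problem only for sequences that have a precomposed code point, and it may substitute characters along the way.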

The best approach is to use fonts that contain all the information needed to position combining characters correctly. Many modern OpenType fonts include such information in internal structures (like the GPOS table). Note that the application writing the PDF needs to be able to handle this information, because in PDF glyph positioning is done by the writer, not the reader!

The script umlaut.rb uses HexaPDF to create a sample PDF that shows "Müller" twice, once in NFC form and once in NFD form. As can be seen, the output of the NFD form positions the diaeresis incorrectly (as expected, since HexaPDF's Canvas#text method simply outputs the given Unicode string). The font used is Linux Libertine, a free font with many typographic features.

Also see: https://en.wikipedia.org/wiki/Combining_character

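# umlaut.rb
require 'hexapdf'

doc = HexaPDF::Document.new
# Map the font name "sans" to the Linux Libertine font file (relative path).
doc.config['font.map'] = {'sans' => {none: "LinLibertine_Rah.ttf"}}
# Add a single 300x200 page and get its canvas for drawing.
canvas = doc.pages.add(doc.wrap(Type: :Page, MediaBox: [0, 0, 300, 200])).canvas
canvas.font("sans", size: 100)
# Upper line: composed form (NFC), "ü" is a single character.
canvas.text("Müller".unicode_normalize(:nfc), at: [10, 100])
# Lower line: decomposed form (NFD), "u" followed by a combining diaeresis.
canvas.text("Müller".unicode_normalize(:nfd), at: [10, 0])
doc.write('umlaut.pdf')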