@mvidner
Last active April 8, 2019 08:00

Problems with Bidirectional (BiDi) Text

If the whole paragraph contains only right-to-left text, it poses no problem. Problems are much likelier to occur if we mix the text directions.

Here I want to write down what I know, so that the simple parts are easily accessible to beginners and the terms are defined for reference when triaging and fixing bugs.

TODO: show example fixes to the problems (and note that they may look wrong, because of LTR framing)

Symptoms

Here we discuss the symptoms of the problems, as they appear to the untrained eye.

Brackets are Wrong

In left-to-right text, the opening bracket is at the left, C-shaped. In right-to-left text, glyph mirroring is used so that the same Unicode character is placed at the right, D-shaped.

If something goes wrong, you will see brackets turned the wrong way, or in the wrong place.

Possible causes:

  • Wrong Brackets in the Text
  • UBA Too Old (6.2)

Line of Text Misaligned

Most lines of a right-to-left text are correctly aligned to the right, but some are aligned to the left. The misaligned lines start (at the left) with a Latin character.

Possible causes:

  • Directional Context is Wrong

Slashes in File System Paths are Misplaced

A path such as /folder/file.txt appears as folder/file.txt/ instead.

Possible causes:

  • Explicit Formatting Characters are Missing

Causes and Solutions

Finding the cause of a symptom requires knowledge and/or luck. See also Techniques and Tools below.

Directional Context is Wrong

Cause: when the main language of the text is Arabic, all lines should start at the right even if they happen to only contain words in the Latin script. This requires that the rendering engine knows the overall directional context of the text, in this case right-to-left.

It may happen that the context is wrong or missing (defaulting to left-to-right). Most of the Arabic text will still be correctly aligned right but lines that start with a Latin letter will get misaligned left.
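To see why only some lines end up on the wrong side, here is a minimal sketch of the "first strong character" heuristic that the UBA (rules P2 and P3) uses to guess a paragraph's direction when no context is supplied. It assumes that the renderer applies this guess to each line separately, which is not true of every toolkit. The name guess_direction is made up for this example, and the sketch ignores the isolate handling of the full rule.

```python
import unicodedata

# Bidi classes of strong characters (UAX #9): L is left-to-right,
# R and AL (Arabic letter) are right-to-left.
_STRONG_DIRECTION = {"L": "ltr", "R": "rtl", "AL": "rtl"}


def guess_direction(line, default="ltr"):
    """Return the direction of the first strong character in `line`.

    Hypothetical helper: a simplified version of UBA rules P2/P3.
    """
    for char in line:
        direction = _STRONG_DIRECTION.get(unicodedata.bidirectional(char))
        if direction is not None:
            return direction
    # No strong character at all (digits, punctuation, spaces only).
    return default


arabic_first = "\u0645\u0644\u0641 README"  # Arabic word, then Latin word
latin_first = "README \u0645\u0644\u0641"   # Latin word, then Arabic word

print(guess_direction(arabic_first))  # rtl
print(guess_direction(latin_first))   # ltr
```

The second line belongs to the same Arabic text, but the guess makes it left-to-right, so it gets aligned left.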

Solution 1: Fix the code (yourself, or ask your vendor) to correctly propagate directional context.

Solution 2: Explicitly add directional context with HTML markup: <div dir='rtl'>...</div>.
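If the HTML is generated by a program, Solution 2 could look like the following sketch. The function name wrap_rtl is an assumption for illustration, and the sketch assumes the input is plain text that still needs HTML escaping.

```python
from html import escape


def wrap_rtl(text):
    """Wrap plain text in a block element with an explicit RTL direction.

    Hypothetical helper: the text is escaped so that it cannot break
    the surrounding markup.
    """
    return '<div dir="rtl">{}</div>'.format(escape(text))


# A line that starts with a Latin word but belongs to Arabic text.
print(wrap_rtl("README \u0645\u0644\u0641"))
```

With the explicit dir attribute the browser no longer has to guess the direction of the block.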

Explicit Formatting Characters are Missing

Cause: In some cases bidirectional text needs explicit formatting characters to be rendered properly.

Slashes in File System Paths

The slash (/) is a Weak character, so the initial slash in a file system path takes the right-to-left direction of the surrounding text and appears on the wrong side.

Solution: Put a left-to-right mark before the initial slash, or enclose the whole path in a left-to-right embedding.

Incorrect        Correct
RR /LL/LL RR     RR (LRM)/LL/LL RR
RR /LL/LL RR     RR (LRE)/LL/LL(PDF) RR

Where

  • (LRM) is the Left-to-Right Mark (U+200E)
  • (LRE) is the Left-to-Right Embedding (U+202A)
  • (PDF) is the Pop Directional Formatting (U+202C)
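In code, both fixes amount to simple string concatenation. This is a minimal sketch; the helper names mark_path and embed_path are made up for this example, and the Arabic words in the message are placeholders.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def mark_path(path):
    """Put an LRM before the path so the leading slash sits between
    two left-to-right characters."""
    return LRM + path


def embed_path(path):
    """Enclose the whole path in a left-to-right embedding."""
    return LRE + path + PDF


# Insert the protected path into a right-to-left message.
template = "\u0645\u0644\u0641 {} \u0645\u0644\u0641"
message = template.format(embed_path("/folder/file.txt"))

# Show the result with escapes, so the invisible characters are visible.
print(ascii(message))
```

The embedding variant protects the whole path, which may matter if the path itself contains characters other than Latin letters and slashes.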

Wrong Brackets in the Text

Cause: The original text has correct brackets, like (), but the translated text has them wrong, like ((, )(, or )). (The translator was probably misled by the additional context that translation tools often display around the text. BiDi is hard.)

Solution: Fix the translation. Use a translation tool that shows the translated text in its own paragraph(s), without surrounding quotes and the like.
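Such translations can be found mechanically, because in memory an opening bracket is always ( regardless of the text direction. Here is a minimal sketch of a checker; the name brackets_balanced is made up, and the check will also flag legitimate text such as smileys or list markers like "a)".

```python
# Closing bracket -> the opening bracket it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def brackets_balanced(text):
    """Check that brackets pair up in logical (memory) order."""
    stack = []
    for char in text:
        if char in OPENERS:
            stack.append(char)
        elif char in PAIRS:
            # A closer must match the most recent unmatched opener.
            if not stack or stack.pop() != PAIRS[char]:
                return False
    # Any opener left over was never closed.
    return not stack


for sample in ["()", "((", ")(", "))"]:
    print(ascii(sample), brackets_balanced(sample))
```

Only the first sample passes; the other three are the broken forms mentioned above.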

UBA Too Old (6.2)

Cause: Up to version 6.2, the Unicode Bidirectional Algorithm did not have a good solution for plain bracketed LTR text in an RTL context. The text "First outside (then inside)" would get rendered as "(First outside (then inside".

Solution 1: Ask your vendor for a newer version. (Qt has significant fixes in version 5.11.)

Solution 2: If a translation layer is involved, add Explicit Formatting Characters.

Techniques and Tools

BiDi Debugger

https://bidi-debugger.herokuapp.com/ displays each character on its own line, eliminating confusion about which character comes first and whether it is mirrored or not. It also shows the Unicode codepoint number.
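When the web tool is not at hand, a rough local substitute is easy to write. This sketch is my own approximation, not the code of the BiDi Debugger; the name dump_chars is made up. It deliberately prints no actual characters, only their descriptions, so the terminal cannot reorder anything.

```python
import unicodedata


def dump_chars(text):
    """Return one description line per character, in logical order."""
    lines = []
    for char in text:
        lines.append("U+{:04X} mirrored={} {}".format(
            ord(char),
            unicodedata.mirrored(char),          # 1 if the glyph is mirrored in RTL
            unicodedata.name(char, "<no name>"),  # control characters have no name
        ))
    return lines


for line in dump_chars("(\u200e/"):
    print(line)
```

Invisible formatting characters such as the left-to-right mark show up by name, which makes them easy to spot.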

BIDI (UBA) C Reference

https://unicode.org/cldr/utility/bidic.jsp displays the gory details of the Unicode Bidirectional Algorithm.

(An older Java-based version of this tool only implements version 6.2 of the UBA.)

bidi-test (YaST specific)

https://github.com/mvidner/bidi-test

Glossary

Glyph

A visual representation of a character. In unidirectional text, fonts account for most of the differences between glyphs for the same character.

(In Arabic, a character has up to four different glyph forms depending on its position in a word: initial, medial, final, isolated)

In right-to-left text, bracket glyphs get mirrored.

Strong Character

A character that has its own inherent directionality: Latin letters are left-to-right, Arabic letters are right-to-left.

Unicode Bidirectional Algorithm

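A 50-page specification, published as Unicode Standard Annex #9 (UAX #9).

Version 6.3 is a significant improvement over 6.2: it added the handling of paired brackets that 6.2 lacks.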

Weak Character

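A character that gets its directionality from neighboring strong characters.

Example: slash. (Brackets behave similarly for the purposes of this document, although the UBA formally classifies them as neutral rather than weak.)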
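The category of any character can be looked up from its Unicode bidi class. This is a small sketch following the grouping in Table 4 of UAX #9; the name bidi_category is made up. Note that this table files brackets under neutral rather than weak, though both kinds take their direction from the surrounding text.

```python
import unicodedata

# Grouping of bidi classes as in UAX #9, Table 4.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}
EXPLICIT = {"LRE", "LRO", "RLE", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def bidi_category(char):
    """Return 'strong', 'weak', 'neutral', 'explicit' or 'unknown'."""
    bidi_class = unicodedata.bidirectional(char)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    if bidi_class in NEUTRAL:
        return "neutral"
    if bidi_class in EXPLICIT:
        return "explicit"
    # Unassigned code points have an empty bidi class in Python.
    return "unknown"


for char in ["a", "\u0645", "/", "(", "\u202a"]:
    print("U+{:04X}".format(ord(char)), bidi_category(char))
```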
