Problems with Bidirectional (BiDi) Text
If the whole paragraph contains only right-to-left text, it poses no problem. Problems are much likelier to occur if we mix the text directions.
This page records what I know so that the simple parts are easily accessible to beginners, and so that the terms are defined for reference when triaging and fixing bugs.
TODO: show example fixes to the problems (and note that they may look wrong, because of LTR framing)
Here we discuss the symptoms of the problems, as they appear to the untrained eye.
Brackets are wrong
In left-to-right text, the opening bracket is at the left, C-shaped. In right-to-left text, glyph mirroring is used so that the same Unicode character is placed at the right, D-shaped.
- Wrong Brackets in the Text
- UBA too old (6.2)
Line of Text Misaligned
Most lines of a right-to-left text are correctly right-aligned, but some are left-aligned. The misaligned lines start (at the left) with a Latin character.
Slashes in file system paths are misplaced
A path such as /folder/file.txt appears as folder/file.txt/ instead.
Causes and Solutions
Finding the cause of a symptom requires knowledge, luck, or both. See also Techniques and Tools below.
Cause: when the main language of the text is Arabic, all lines should start at the right even if they happen to only contain words in the Latin script. This requires that the rendering engine knows the overall directional context of the text, in this case right-to-left.
It may happen that the context is wrong or missing (defaulting to left-to-right). Most of the Arabic text will still be correctly aligned right but lines that start with a Latin letter will get misaligned left.
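The pattern described above (Arabic lines aligned right, Latin-initial lines aligned left) is what one would expect if the renderer guesses each line's direction from its first strong character. The following is a minimal sketch of that heuristic, assuming this is how the affected renderer behaves; the function name `first_strong_direction` is made up for illustration.

```python
import unicodedata

def first_strong_direction(line, default="ltr"):
    """Guess a line's base direction from its first strong character.

    Bidi classes "L" (left-to-right) and "R"/"AL" (right-to-left) are strong.
    Everything else (digits, spaces, punctuation) is skipped.
    """
    for ch in line:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default  # no strong character found at all

# "\u0645\u0631\u062d\u0628\u0627" is an Arabic word (5 letters).
arabic_word = "\u0645\u0631\u062d\u0628\u0627"

for line in (arabic_word + " Linux", "Linux " + arabic_word, "123 /"):
    print(first_strong_direction(line), ascii(line))
```

A line of an Arabic document that happens to begin with a Latin word is guessed as left-to-right by this heuristic, which is why the directional context has to come from outside the line.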
Solution 1: Fix the code (yourself, or ask your vendor) to correctly propagate directional context.
Solution 2: Explicitly add directional context with HTML markup, for example by setting the dir="rtl" attribute on the enclosing element.
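As an illustration only (this is not the original example from this page), the markup can be produced by a small helper. The helper name `wrap_with_direction` and the choice of a `<p>` element are assumptions; `dir` is the standard HTML attribute with the values `rtl`, `ltr` and `auto`.

```python
from html import escape

def wrap_with_direction(text, direction="rtl"):
    """Wrap text in a paragraph that states its base direction explicitly."""
    if direction not in ("rtl", "ltr", "auto"):
        raise ValueError("direction must be 'rtl', 'ltr' or 'auto'")
    # escape() keeps characters such as < and & from being read as markup.
    return '<p dir="{}">{}</p>'.format(direction, escape(text))

print(wrap_with_direction("Linux \u0645\u0631\u062d\u0628\u0627"))
```

With `dir="rtl"` on the element, the line is laid out right-to-left even though it starts with a Latin word.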
Explicit Formatting Characters are Missing
Cause: In some cases bidirectional text needs explicit formatting characters to be rendered properly.
The slash (/) is a Weak character, so the initial slash in a file system path takes the right-to-left direction of its surroundings and appears on the wrong side.
Solution: Put a left-to-right mark before the initial slash, or enclose the whole path in a left-to-right embedding.
| Without formatting characters | With formatting characters |
|---|---|
| RR /LL/LL RR | RR (LRM)/LL/LL RR |
| RR /LL/LL RR | RR (LRE)/LL/LL(PDF) RR |
- (LRM) is a Left-to-right mark (U+200E)
- (LRE) is Left-to-right embedding (U+202A)
- (PDF) is Pop Directional Formatting (U+202C)
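The two fixes can be sketched in a few lines. The code points are the ones listed above; the function names are made up for illustration, and where exactly such a call belongs in a real code base is not something this sketch can decide.

```python
LRM = "\u200e"  # Left-to-right mark
LRE = "\u202a"  # Left-to-right embedding
PDF = "\u202c"  # Pop Directional Formatting

def prefix_lrm(path):
    """Put a strong left-to-right character before the initial slash."""
    return LRM + path

def embed_ltr(path):
    """Enclose the whole path in a left-to-right embedding."""
    return LRE + path + PDF

def strip_bidi_marks(text):
    """Remove the three formatting characters again."""
    return text.translate({ord(LRM): None, ord(LRE): None, ord(PDF): None})

path = "/folder/file.txt"
label = embed_ltr(path)
print(ascii(label))
```

The formatting characters are invisible but they are part of the string. They belong in text that is displayed to the user, not in the path that is passed to the file system, so a helper like `strip_bidi_marks` is useful if displayed text is ever read back.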
Wrong Brackets in the Text
Cause: The original text has correct brackets, like (), but the translated text has them wrong, like ((, )(, )). (The translator was probably misled by the additional context that translation tools often display around the text. BiDi is hard.)
Solution: Fix the translation. Use a translation tool that shows the translated text in its own paragraph(s), without surrounding quotes and the like.
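Such translations can be found mechanically, because the check works on the stored (logical) order of the characters and does not depend on how the text is displayed. A minimal sketch, with a hypothetical function name and only the three common bracket pairs:

```python
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())

def brackets_balanced(text):
    """Return True if every bracket is opened before it is closed."""
    stack = []
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closing bracket must match the most recent opening one.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack  # anything left open is also an error

for sample in ("()", "((", ")(", "))"):
    print(sample, brackets_balanced(sample))
```

Running this over all translated strings flags the ((, )( and )) cases for a human to fix. It cannot tell whether a balanced translation is otherwise correct.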
UBA Too Old (6.2)
Cause: Up to version 6.2, the Unicode Bidirectional Algorithm did not have a good solution for plain bracketed LTR text in an RTL context. The text "First outside (then inside)" would get rendered as "(First outside (then inside".
Solution 1: Ask your vendor for a newer version. (Qt has significant fixes in version 5.11.)
Solution 2: If a translation layer is involved, add Explicit Formatting Characters.
Techniques and Tools
https://bidi-debugger.herokuapp.com/ displays each character on its own line, eliminating confusion about which character comes first and whether it is mirrored or not. It also shows the Unicode codepoint number.
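When the online tool is not at hand, a rough local substitute can be sketched with Python's standard `unicodedata` module. This is not the tool's implementation, and the function name `describe` is made up. It lists the characters in stored order, one per row, with code point, bidi class, mirroring flag and name.

```python
import unicodedata

def describe(text):
    """Return one row per character, in logical (stored) order."""
    rows = []
    for index, ch in enumerate(text):
        rows.append((
            index,
            "U+{:04X}".format(ord(ch)),
            unicodedata.bidirectional(ch),      # e.g. "L", "AL", "ON", "CS"
            bool(unicodedata.mirrored(ch)),     # True for brackets and the like
            unicodedata.name(ch, "<no name>"),  # control characters have no name
        ))
    return rows

for row in describe("\u200e/a(\u0627)"):
    print(row)
```

The output reflects the Unicode version bundled with the Python interpreter in use, and it shows character properties only. It does not run the Unicode Bidirectional Algorithm.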
BIDI (UBA) C Reference
https://unicode.org/cldr/utility/bidic.jsp displays the gory details of the Unicode Bidirectional Algorithm.
(An older Java-based version of this tool implements only version 6.2 of the UBA.)
bidi-test (YaST specific)
Glyph
A visual representation of a character. In unidirectional text, fonts account for most of the differences between glyphs for the same character.
(In Arabic, a character has up to four different glyph forms depending on its position in a word: initial, medial, final, isolated.)
Glyph mirroring
In right-to-left text, bracket glyphs get mirrored.
Strong character
A character that knows its directionality: Latin letters are left-to-right, Arabic letters are right-to-left.
Unicode Bidirectional Algorithm
Version 6.3 is a significant improvement over 6.2.
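Weak character
A character that gets its directionality from neighboring strong characters.
Example: brackets, slash. (Strictly speaking, the UBA classifies the slash as a weak type and brackets as a neutral type. Both take their direction from context.)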
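The classification can be looked up rather than memorized. The sketch below groups the bidi classes the way the UBA specification (UAX #9) does, which is slightly finer than the strong/weak split used on this page. The function name `strength` and the catch-all label "other" are made up for illustration.

```python
import unicodedata

STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}

def strength(ch):
    """Classify one character as 'strong', 'weak' or 'other'.

    'other' covers the neutral classes (such as ON for brackets) and the
    explicit formatting characters.
    """
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    return "other"

for ch in ("a", "\u0627", "/", "("):
    print(ascii(ch), unicodedata.bidirectional(ch), strength(ch))
```

For the purposes of this page, "weak" and "other" behave alike in the bug symptoms above: neither a slash nor a bracket carries a direction of its own.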