mapmeld/arabic_svg.md

## arabic_svg.md

      
What is going on with Arabic in OSM iD?

The pull request in question
tl;dr: in the OSM iD editor, Chrome and Safari show Arabic road labels incorrectly, so we will override Chrome's rules
I'm happy to discuss Arabic support with you or do a presentation!
Background

Arabic is written right-to-left (same for Persian/Farsi and Urdu, also Hebrew and Divehi alphabets). One consequence of
that is that the layout of the page is mirrored (eg Wikipedia's menu on the left of the page appears on the right).
One exception is numbers: the digits, known as 'Eastern Arabic numerals' (Unicode calls them 'Arabic-Indic digits', with an 'Extended Arabic-Indic' set for Persian and Urdu), are written left-to-right (eg 100 will appear as ١٠٠).
Previously I added the right-to-left layout for OSM iD editor, but it didn't end there...
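As a small illustration of the numerals mentioned above, here is a minimal sketch that converts Western digits to the Arabic-Indic digits starting at U+0660. The function name `toEasternArabicDigits` is made up for this example and is not part of iD. Note that the digit order in the string does not change, which is the left-to-right exception in action.

```javascript
// Replace each Western digit 0-9 with the Arabic-Indic digit at U+0660 + value.
// Hypothetical helper for illustration only.
function toEasternArabicDigits(text) {
  return text.replace(/[0-9]/g, (d) => String.fromCharCode(0x0660 + Number(d)));
}

// The most significant digit stays first in the string, exactly as in "100".
console.log(toEasternArabicDigits("100"));
```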
The SVG Problem

The OSM map looks great. The in-browser editor, iD, uses D3.js which renders everything using SVG elements. To label points and polygons, you have a label floating
on the map, which is OK. All browsers can handle Arabic script in this case.
To label roads and railways, OSM iD has the text label follow the road-line using SVG's textPath element. On Chrome and
Safari (the WebKit-derived browsers) there is a known bug for right-to-left text inside a textPath: the letters
appear disconnected and the words appear in the wrong order. This is not a problem in Firefox, and I don't know about IE/Edge at this point.
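For readers who have not used textPath before, here is a minimal sketch of the markup involved, built as a plain string. This is not how iD creates its labels (iD goes through D3), and `labelAlongPath`, `pathId`, and `pathData` are names invented for this example. It uses `xlink:href`, the SVG 1.1 way to point a textPath at a path.

```javascript
// Escape characters that are special in XML text and attribute values.
function escapeXml(s) {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build SVG markup where `label` follows the path identified by `pathId`.
// Hypothetical helper: all parameter names are assumptions for illustration.
function labelAlongPath(pathId, pathData, label) {
  const id = escapeXml(pathId);
  return (
    '<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">' +
    `<path id="${id}" d="${escapeXml(pathData)}" fill="none" stroke="gray"/>` +
    `<text><textPath xlink:href="#${id}">${escapeXml(label)}</textPath></text>` +
    "</svg>"
  );
}

console.log(labelAlongPath("road1", "M 10 80 Q 95 10 180 80", "Avenue Habib Bourguiba"));
```

Whatever string ends up inside the textPath is what the browser has to shape and order, which is why the fix only needs to touch that text and not the element's real name.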
Milad's Solution

I'm impressed with Milad (@miladkdz) for tackling this problem. This is a bug in Chrome itself, right? But his solution was
to flip the letters and words to solve the left-to-right order, then to replace each letter with the appropriate form
of the letter specified in Unicode.
Initial, medial, and final forms of Arabic letters

OK, consider the Arabic vowel "ee". By itself, the letter looks like a snake: ي
When the same letter starts a word, you mostly see the two dots يـ
When it's in the middle of the word, the two dots are still prominent, but it has more of a peak ـيـ
When it's at the end of the word, part of the snake comes back: ـي
Not all letters have 3-4 different forms - for example, the vowel "ah" can either be ا by itself or ـا connected to the right (it never connects to the left).
Unicode and Arabic letter forms

Typically you write characters from the standard Unicode Arabic character block, and the font / rendering on your computer will figure out
the correct way to join them and display them right-to-left. For example if I paste ي three times, I will see the initial, medial, and final forms: ييي  and the final letter will be at the end of the word (the left).
This is where SVG textPath is messing up in Chrome. Fortunately we can override it using an extended Unicode block of Arabic "presentation forms". There are over 750 variations, but we don't need nearly that many to render OSM labels. Using a presentation form character says HEY use this particular shape, and overrides the usual one. These aren't completely supported by every font, but the initial/medial/final forms work in OSM iD Editor and its preferred fonts, as far as I can tell.
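To make the idea of presentation forms concrete, here is a minimal sketch that picks the form for each letter of a word. It is not the table or the code from the pull request: `FORMS` covers only three letters (alef, beh, yeh) from the Arabic Presentation Forms-B block, and `shapeWord` is a name invented for this example. A real table needs every letter that can show up in a label.

```javascript
// Tiny, deliberately incomplete table: base letter -> presentation forms.
// Alef has no initial/medial form because it never connects to the next letter.
const FORMS = {
  "\u0627": { isolated: "\uFE8D", final: "\uFE8E" }, // alef
  "\u0628": { isolated: "\uFE8F", final: "\uFE90", initial: "\uFE91", medial: "\uFE92" }, // beh
  "\u064A": { isolated: "\uFEF1", final: "\uFEF2", initial: "\uFEF3", medial: "\uFEF4" }, // yeh
};

// True if the letter can connect to the letter that follows it.
function connectsToNext(ch) {
  const forms = FORMS[ch];
  return Boolean(forms && forms.initial);
}

// True if the letter can connect to the letter before it (every letter in the table can).
function connectsToPrev(ch) {
  return Boolean(FORMS[ch]);
}

// Replace each known letter with its positional form; the word stays in logical order.
function shapeWord(word) {
  const chars = Array.from(word);
  return chars
    .map((ch, i) => {
      const forms = FORMS[ch];
      if (!forms) return ch; // unknown character: leave untouched
      const joinedToPrev = i > 0 && connectsToNext(chars[i - 1]);
      const joinedToNext = connectsToNext(ch) && connectsToPrev(chars[i + 1]);
      if (joinedToPrev && joinedToNext) return forms.medial;
      if (joinedToPrev) return forms.final;
      if (joinedToNext) return forms.initial;
      return forms.isolated;
    })
    .join("");
}

// Three yeh letters become initial, medial, final.
console.log(shapeWord("\u064A\u064A\u064A"));
```

The form has to be chosen while the word is still in its original (logical) order, because it depends on which letters come before and after.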
Initial Issues with the Pull Request


The actual name of the element shouldn't be edited... only the text blob being inserted into a textPath
The flip script shouldn't affect Latin labels, or combined Latin/Arabic labels (called "bidi" for bi-directional)
If we are going to replace each Arabic letter with its corresponding form, we need the dictionary to be comprehensive, or we will have some letters out of place. There is no .toLowerCase() for forms.

My Suggestions

Arabic Letters

The table of Arabic characters has to include every letter to work. The gaps in the original table were partly due to Milad focusing on Persian/Farsi and
me checking against Arabic writing in North Africa.
I re-ordered the letters by character-code, and it made it easier to find some gaps. I also added some variants of the vowel "ah",
with markings above or below (آ or أ or إ).  You can find a character code in the JS interpreter using "x".charCodeAt(0) or get the string for a character code using String.fromCharCode(1000).
One missing letter was ػ which does not appear to be used often. Is it used on OSM? Do we need to support it?
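The gap-hunting described above can be sketched with the same two built-ins. This is a minimal illustration, not the script I actually used: `findGaps` and the sample `partial` list are made up for the example, and the range checked is just a small slice of the Arabic block.

```javascript
// Return the characters in [first, last] (inclusive code points) that are
// missing from `letters`. Hypothetical helper for illustration only.
function findGaps(letters, first, last) {
  const present = new Set(letters.map((ch) => ch.charCodeAt(0)));
  const missing = [];
  for (let code = first; code <= last; code++) {
    if (!present.has(code)) missing.push(String.fromCharCode(code));
  }
  return missing;
}

// Sample table with alef, beh, teh, jeem; teh marbuta and theh are absent.
const partial = ["\u0627", "\u0628", "\u062A", "\u062C"];
const gaps = findGaps(partial, 0x0627, 0x062c);

// Print each missing letter with its code point in hex.
gaps.forEach((ch) => console.log(ch, "U+0" + ch.charCodeAt(0).toString(16).toUpperCase()));
```

Not every code point in a range is a letter you need, so each reported gap still has to be checked by hand.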
utilDisplayName

I moved the flip script into a different part of the OSM code, to utilDisplayName where it only affects the SVG textPath.
Fortunately there is still access to the element tags and we can still filter it to roads and railways.
Refactoring for Bidi Text

The initial script would split the name up by whitespace, use a separate function to flip letters within a word, and rearrange the words.
This doesn't work with combined Arabic and Latin (bidi text). I found two ideal examples on OSM: RN 3 طو and Avenue Habib Bourguiba شارع الحبيب بورقيبة.
In the new system, I have a single function, a right-to-left 'buffer', and the eventual output. As I look at each new character:

if it's an RTL character, I add it to the buffer
if it's a space between RTL words, I continue adding the space and words to the buffer (so word order will be changed)
otherwise (LTR or dividing character), I flip the RTL buffer, dump it into output, then add the LTR character directly to output
at the end of the label, I flip and output any leftover letters in the buffer (in all-Arabic labels, this will be the only output)

It appears to be working on these examples, and on the thought experiments that I had about numbers in the middle of Arabic words.
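The buffer approach described above can be sketched roughly like this. It is a simplified illustration and not the code from the pull request: `isRtl` and `flipRtlRuns` are invented names, the character ranges are an approximation (they treat the whole Arabic block, including its digits, as right-to-left), and the replacement of letters with presentation forms is left out so that only the reordering is shown.

```javascript
// Rough RTL test: Hebrew, Arabic, Arabic Supplement, Thaana, Arabic Extended-A,
// and the Hebrew/Arabic presentation-form blocks. An approximation, not full bidi.
const RTL_CHARS = /[\u0590-\u05FF\u0600-\u06FF\u0750-\u077F\u0780-\u07BF\u08A0-\u08FF\uFB1D-\uFDFF\uFE70-\uFEFF]/;

function isRtl(ch) {
  return typeof ch === "string" && RTL_CHARS.test(ch);
}

// Reverse each run of RTL text (including spaces between RTL words) and
// leave everything else in place. Hypothetical helper for illustration only.
function flipRtlRuns(label) {
  const chars = Array.from(label);
  let output = "";
  let buffer = [];
  for (let i = 0; i < chars.length; i++) {
    const ch = chars[i];
    if (isRtl(ch)) {
      buffer.push(ch); // RTL letter: hold it in the buffer
    } else if (ch === " " && buffer.length > 0 && isRtl(chars[i + 1])) {
      buffer.push(ch); // space between two RTL words: keep buffering
    } else {
      output += buffer.reverse().join(""); // flush the flipped RTL run
      buffer = [];
      output += ch; // LTR or dividing character goes straight to output
    }
  }
  // End of the label: flush whatever is left in the buffer.
  return output + buffer.reverse().join("");
}

console.log(flipRtlRuns("RN 3 \u0637\u0648"));
```

Keeping the spaces in the buffer is what reverses the word order along with the letters, so an all-Arabic label comes out of the final flush in one piece.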