@Angles
Last active June 16, 2017 17:58
Unicode bidirectional stuff
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="Content-Style-Type" content="text/css">
<title>UNICODE BIDIRECTIONAL STUFF</title>
<meta name="Generator" content="Cocoa HTML Writer">
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 32.0px 'Helvetica Neue'; color: #efefef; -webkit-text-stroke: #efefef}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 16.0px 'Helvetica Neue'; color: #efefef; -webkit-text-stroke: #efefef}
li.li2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 16.0px 'Helvetica Neue'; color: #efefef; -webkit-text-stroke: #efefef}
span.s1 {font-family: 'Helvetica Neue'; font-weight: bold; font-style: normal; font-size: 32.00pt; font-kerning: none}
span.s2 {font-family: 'Helvetica Neue'; font-weight: bold; font-style: normal; font-size: 16.00pt; font-kerning: none}
span.s3 {font-family: 'Helvetica Neue'; font-weight: normal; font-style: normal; font-size: 16.00pt; font-kerning: none}
span.s4 {font-family: 'Helvetica Neue'; font-weight: normal; font-style: normal; font-size: 16.00pt; text-decoration: underline ; font-kerning: none; color: #0000ee; -webkit-text-stroke: 0px #0000ee}
span.s5 {font-family: 'Helvetica Neue'; font-weight: normal; font-style: normal; font-size: 16.00pt}
span.s6 {font-family: 'Helvetica Neue'; font-weight: normal; font-style: normal; font-size: 16.00pt; color: #0000ee}
span.s7 {font-family: 'Helvetica Neue'; font-weight: normal; font-style: normal; font-size: 16.00pt; text-decoration: underline ; font-kerning: none; -webkit-text-stroke: 0px #0000ee}
ul.ul1 {list-style-type: disc}
ul.ul2 {list-style-type: circle}
a {
/* color: #A2BDE0; */
background-color: darkblue;
/* background-color: darkgreen; */
/* text-decoration: none !important; */
}
body {
margin: 2em;
/* font-family: Verdana, Arial, Tahoma, sans-serif; */
font-size: 1.5em;
/* color: white; */
/* color: #CCCCCC !important; */
/* color: orange !important; */
/* background-color: #c0c0c0; */
/* background-color: #d0d0d0; */
/* background-color: #000000; */
background-color: #262B30;
}
p {
/* top ; right ; bottom ; left */
/* margin: 5.0px 20.0px 20.0px 20.0px; */
margin: 5.0px 20.0px 10.0px 20.0px !important;
}
span, ul {
/* margin: 20.0px 20.0px 0.0px 0.0px; */
/* color: orange !important; */
color: #CCCCCC !important;
}
</style>
</head>
<body>
<p class="p1"><span class="s1">UNICODE BIDIRECTIONAL STUFF</span></p>
<p class="p2"><span class="s2">Topic: Unicode items 200E-F, 202A - 202E</span></p>
<p class="p2"><span class="s2">Scenario</span><span class="s3">: you copy text from a web page and try to simplify it, but unexpected bidirectional Unicode formatting items may be left over. They should be removed to get a clean scrape free of high-order characters. For some reason these can turn up even in an English page that has no bidirectional content and no need for them. (Additional discussion below.)</span></p>
<p class="p2"><span class="s3">Available bidirectional changes are:</span></p>
<ul class="ul1">
<li class="li2"><span class="s5"></span><span class="s3">direction mark:</span></li>
<ul class="ul2">
<li class="li2"><span class="s6"><a href="http://www.fileformat.info/info/unicode/char/200e/index.htm"><span class="s7">u200E</span></a></span><span class="s3"> LRM (left-to-right mark).</span></li>
<li class="li2"><span class="s6"><a href="http://www.fileformat.info/info/unicode/char/200f/index.htm"><span class="s7">u200F</span></a></span><span class="s3"> RLM (right-to-left mark).</span></li>
</ul>
<li class="li2"><span class="s5"></span><span class="s3">embed or override:</span></li>
<ul class="ul2">
<li class="li2"><span class="s6"><a href="http://www.fileformat.info/info/unicode/char/202a/index.htm"><span class="s7">u202A</span></a></span><span class="s3"> LRE (left-to-right embedding), then u202C to end the change.</span></li>
<li class="li2"><span class="s6"><a href="http://www.fileformat.info/info/unicode/char/202b/index.htm"><span class="s7">u202B</span></a></span><span class="s3"> RLE (right-to-left embedding), then u202C to end the change.</span></li>
<li class="li2"><span class="s6"><a href="http://www.fileformat.info/info/unicode/char/202c/index.htm"><span class="s7">u202C</span></a></span><span class="s3"> POP DIRECTIONAL FORMATTING, ends any bidirectional change.</span></li>
<li class="li2"><span class="s6"><a href="http://www.fileformat.info/info/unicode/char/202d/index.htm"><span class="s7">u202D</span></a></span><span class="s3"> LEFT-TO-RIGHT OVERRIDE, then u202C to end it.</span></li>
<li class="li2"><span class="s6"><a href="http://www.fileformat.info/info/unicode/char/202e/index.htm"><span class="s7">u202E</span></a></span><span class="s3"> RIGHT-TO-LEFT OVERRIDE, then u202C to end it.</span></li>
</ul>
</ul>
<p class="p2"><span class="s2">Discussion:</span></p>
<p class="p2"><span class="s3">These are invisible formatting characters, not visible text. The embeds or overrides often turn up in web post comment blocks, at the first line break between the poster's name and the beginning of the comment. They also appear in Google search results pages, at the first line break for a given result and possibly elsewhere in that first line, perhaps associated with some kind of dash there. The LRM is often found in text copied from an iOS contact address field.</span></p>
<p class="p2"><span class="s3">END OF SUMMARY.</span></p>
<p class="p2"><span class="s3">- Bidirectional reference: http://www.w3.org/TR/html4/struct/dirlang.html <a href="http://www.w3.org/TR/html4/struct/dirlang.html"><span class="s4">via W3</span></a></span></p>
<p class="p2"><span class="s3">- fileformat.info Unicode search: http://www.fileformat.info/info/unicode/char/search.htm <a href="http://www.fileformat.info/info/unicode/char/search.htm"><span class="s4">Link</span></a></span></p>
</body>
</html>

UNICODE BIDIRECTIONAL STUFF

Topic: Unicode items 200E-F, 202A - 202E

Scenario: you copy text from a web page and try to simplify it, but unexpected bidirectional Unicode formatting items may be left over. They should be removed to get a clean scrape free of high-order characters. For some reason these can turn up even in an English page that has no bidirectional content and no need for them. (Additional discussion below.)

Available bidirectional items are:

  • direction mark:
    • u200E LRM (left-to-right mark).
    • u200F RLM (right-to-left mark).
  • embed or override:
    • u202A LRE (left-to-right embedding), then u202C to end the change.
    • u202B RLE (right-to-left embedding), then u202C to end the change.
    • u202C POP DIRECTIONAL FORMATTING, ends any bidirectional change.
    • u202D LEFT-TO-RIGHT OVERRIDE, then u202C to end it.
    • u202E RIGHT-TO-LEFT OVERRIDE, then u202C to end it.

Discussion:
These are invisible formatting characters, not visible text. The embeds or overrides often turn up in web post comment blocks, at the first line break between the poster's name and the beginning of the comment. They also appear in Google search results pages, at the first line break for a given result and possibly elsewhere in that first line, perhaps associated with some kind of dash there. The LRM is often found in text copied from an iOS contact address field.
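
To check where these items actually sit in a copied string (for example, right after a poster's name), a small Python sketch can report each hit with its position and official Unicode name. The function name find_bidi_controls and the sample input are assumptions for illustration only.

```python
import unicodedata

# The seven code points from the list above.
BIDI_CODE_POINTS = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}


def find_bidi_controls(text):
    """Return (index, 'U+XXXX', Unicode name) for each listed item found in text."""
    hits = []
    for index, ch in enumerate(text):
        if ord(ch) in BIDI_CODE_POINTS:
            hits.append((index, f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return hits


# Hypothetical input: an address field value with a trailing LRM.
for hit in find_bidi_controls("12 Main St\u200e"):
    print(hit)
```

Since the items are invisible on screen, printing the index and name is usually the quickest way to confirm they are present before stripping them.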

END OF SUMMARY v.2.
