@bgoodr
Created July 23, 2016 20:43
Output of a Python script that uses the Unidecode package to transliterate several web pages to 7-bit ASCII (see the answer at http://stackoverflow.com/questions/38249708/python-library-to-translate-multi-byte-characters-into-7-bit-ascii-in-python/38249916)
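The `simple_wget` script itself is not part of this gist, so the following is only a minimal sketch of the transliteration step it appears to perform, not the author's code. The function name `to_ascii` and the fallback table are hypothetical. The fallback exists only so the sketch runs when the third-party Unidecode package is not installed, and it covers just the punctuation seen in the output below.

```python
import unicodedata

try:
    # Third-party package named in the description (pip install Unidecode).
    from unidecode import unidecode
except ImportError:
    # Assumption: fall back to a stdlib-only approximation if it is missing.
    unidecode = None

# Hypothetical fallback table; only covers punctuation seen in the log.
_FALLBACK = str.maketrans({
    "\u2013": "-",   # EN DASH
    "\u2014": "--",  # EM DASH
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})


def to_ascii(text):
    """Lossy, one-way transliteration of text to 7-bit ASCII."""
    if unidecode is not None:
        return unidecode(text)
    # Stdlib approximation: map known punctuation, strip accents via NFKD,
    # then drop anything that is still outside ASCII.
    decomposed = unicodedata.normalize("NFKD", text.translate(_FALLBACK))
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Sample strings taken from the "old" side of the diffs below.
samples = [
    "$899 – or – $80",
    "it’s easy to type",
    "foo = 'abcdéfg'",
    "comments like “+1” or “thanks”",
]
for sample in samples:
    print(to_ascii(sample))
```

The stdlib fallback silently drops characters it cannot map, which is the behaviour the linked question was trying to avoid, so Unidecode remains the better choice when it is available.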
brentg@wilddog:~/scratch_sandboxes/python/parse_html$ ./simple_wget
url: https://system76.com/laptops/kudu
current encoding: utf-8
ext: old
Wrote https__system76__com__laptops__kudu.old.html
ext: new
Wrote https__system76__com__laptops__kudu.new.html
Executing: diff https__system76__com__laptops__kudu.old.html https__system76__com__laptops__kudu.new.html
243c243
< Starting at <span class="config-price">$899</span> – or – <span class="finance-price">$80</span>
---
> Starting at <span class="config-price">$899</span> - or - <span class="finance-price">$80</span>
369c369
< <p>With a wonderful glow to each key and five levels of brightness, it’s easy to type no matter the light. Experience buttery smooth tracking and gestures with a large, perfectly textured multitouch trackpad.</p>
---
> <p>With a wonderful glow to each key and five levels of brightness, it's easy to type no matter the light. Experience buttery smooth tracking and gestures with a large, perfectly textured multitouch trackpad.</p>
428c428
< <td>2.6 up to 3.5 GHz – 6 MB cache – 4 cores – 8 threads</td>
---
> <td>2.6 up to 3.5 GHz - 6 MB cache - 4 cores - 8 threads</td>
url: http://stackoverflow.com/a/38249916/257924
current encoding: utf-8
ext: old
Wrote http__stackoverflow__com__a__38249916__257924.old.html
ext: new
Wrote http__stackoverflow__com__a__38249916__257924.new.html
Executing: diff http__stackoverflow__com__a__38249916__257924.old.html http__stackoverflow__com__a__38249916__257924.new.html
696c696
< foo = 'abcdéfg'
---
> foo = 'abcdefg'
801c801
< title="Use comments to ask for more information or suggest improvements. Avoid comments like “+1” or “thanks”."
---
> title="Use comments to ask for more information or suggest improvements. Avoid comments like "+1" or "thanks"."
902c902
< <span class="comment-copy">Not specifically what I was looking for since I would like to translate some of the more obvious characters into 7-bit ASCII such as Unicodes <code>EM DASH</code> character &quot;—&quot;, into the 7-bit ASCII dash character &quot;-&quot; (Unicode <code>HYPHEN-MINUS</code> character). Yes, I know the characters are not grammaticalily the same. This is a lossy translation with no desire to reverse the process. See <a href="https://gist.githubusercontent.com/bgoodr/4f83d150b109f657b34b6f587f218e66/raw/390791e9d1db9eeaf7bb1884d2998ccac79a3d86/attempt_at_using_str_encode_2.txt" rel="nofollow">my second gist</a> which uses Python 2.</span>
---
> <span class="comment-copy">Not specifically what I was looking for since I would like to translate some of the more obvious characters into 7-bit ASCII such as Unicodes <code>EM DASH</code> character &quot;--&quot;, into the 7-bit ASCII dash character &quot;-&quot; (Unicode <code>HYPHEN-MINUS</code> character). Yes, I know the characters are not grammaticalily the same. This is a lossy translation with no desire to reverse the process. See <a href="https://gist.githubusercontent.com/bgoodr/4f83d150b109f657b34b6f587f218e66/raw/390791e9d1db9eeaf7bb1884d2998ccac79a3d86/attempt_at_using_str_encode_2.txt" rel="nofollow">my second gist</a> which uses Python 2.</span>
918c918
< title="Use comments to ask for more information or suggest improvements. Avoid comments like “+1” or “thanks”."
---
> title="Use comments to ask for more information or suggest improvements. Avoid comments like "+1" or "thanks"."
1284c1284
< A word for “useful components”?
---
> A word for "useful components"?
url: https://www.peterbe.com/plog/unicode-to-ascii
current encoding: ISO-8859-1
ext: old
Wrote https__www__peterbe__com__plog__unicode-to-ascii.old.html
ext: new
Wrote https__www__peterbe__com__plog__unicode-to-ascii.new.html
Executing: diff https__www__peterbe__com__plog__unicode-to-ascii.old.html https__www__peterbe__com__plog__unicode-to-ascii.new.html
155c155
< <div class="highlight"><pre>&gt;&gt;&gt; title = u&quot;Klüft skräms inför på fédéral électoral große&quot;
---
> <div class="highlight"><pre>&gt;&gt;&gt; title = u&quot;KlA1/4ft skrA$?ms infAPr pAY= fA(c)dA(c)ral A(c)lectoral groAe&quot;
160c160
< <p>But as you can see, a lot of the characters are gone. I'd much rather that a word like "Klüft" is converted to "Kluft" which will be more human readable and still correct. My second attempt was to write a big table of unicode to ascii replacements. </p>
---
> <p>But as you can see, a lot of the characters are gone. I'd much rather that a word like "KlA1/4ft" is converted to "Kluft" which will be more human readable and still correct. My second attempt was to write a big table of unicode to ascii replacements. </p>
176c176
</p>
---
</p>
200c200
< It's been years since I took any German, but wouldn't 'Klüft' more accurately be saved as 'Klueft'? I recall that 'Küchen' and 'Kuchen' are two different words entirely (Kitchen and Cake, respectively).
---
> It's been years since I took any German, but wouldn't 'KlA1/4ft' more accurately be saved as 'Klueft'? I recall that 'KA1/4chen' and 'Kuchen' are two different words entirely (Kitchen and Cake, respectively).
272c272
< 1) Klüft is not a german word, so don't worry too much.<br>2) Why do you want to generate ids from the title? This is potentially insecure as I might find a clever way for entering cross-site-scripting that way.<br>3) If the id should match the title, why does it have to be ascii?
---
> 1) KlA1/4ft is not a german word, so don't worry too much.<br>2) Why do you want to generate ids from the title? This is potentially insecure as I might find a clever way for entering cross-site-scripting that way.<br>3) If the id should match the title, why does it have to be ascii?
319c319
< I can also (by test) say, that it doesn't work with Scandinavian letters (æ, ø and å) -- they get ignored completely.
---
> I can also (by test) say, that it doesn't work with Scandinavian letters (A|, A, and AY=) -- they get ignored completely.
339c339
< &quot;på&quot; became &quot;pa&quot;
---
> &quot;pAY=&quot; became &quot;pa&quot;
359c359
< Well okay, but &quot;Rødgrød med fløde&quot; became &quot;Rdgrd med flde&quot;.
---
> Well okay, but &quot;RA,dgrA,d med flA,de&quot; became &quot;Rdgrd med flde&quot;.
413c413
< Hi, I wrote a script based on your idea. It transforms number, str and unicode to ASCII: <a href="http://www.haypocalc.com/perso/prog/python/any2ascii.py" rel="nofollow">http://www.haypocalc.com/perso/prog/python/any2ascii.py</a><br><br>It takes care of some caracters like &quot;ßø&amp;#322;&quot; (just fill smart_unicode dictionnary ;-)).<br><br>Haypo
---
> Hi, I wrote a script based on your idea. It transforms number, str and unicode to ASCII: <a href="http://www.haypocalc.com/perso/prog/python/any2ascii.py" rel="nofollow">http://www.haypocalc.com/perso/prog/python/any2ascii.py</a><br><br>It takes care of some caracters like &quot;AA,&amp;#322;&quot; (just fill smart_unicode dictionnary ;-)).<br><br>Haypo
532c532
< This is fantastic stuff - I was having trouble parsing film results where, for example, Rashômon was represented as Rashomon. Testing for both the unicode and ascii normalized strings before iterating to the next result really sealed it. Thanks.
---
> This is fantastic stuff - I was having trouble parsing film results where, for example, RashA'mon was represented as Rashomon. Testing for both the unicode and ascii normalized strings before iterating to the next result really sealed it. Thanks.
627c627
< There's now the &quot;unidecode&quot; package that does all the job <a href="http://pypi.python.org/pypi/Unidecode/" rel="nofollow">http://pypi.python.org/pypi/Unidecode/</a><br><br>&gt;&gt;&gt; from unidecode import unidecode<br>&gt;&gt;&gt; utext = u&quot;œuf dür&quot;<br>&gt;&gt;&gt; unidecode(utext)<br>u'oeuf dur'<br>&gt;&gt;&gt; from unicodedata import normalize<br>&gt;&gt;&gt; normalize('NFKD', utext).encode('ascii','ignore')<br>'uf dur'<br><br>A better support for special latin extended characters (French, German) that should tranlitterate to multiple ASCII characters.
---
> There's now the &quot;unidecode&quot; package that does all the job <a href="http://pypi.python.org/pypi/Unidecode/" rel="nofollow">http://pypi.python.org/pypi/Unidecode/</a><br><br>&gt;&gt;&gt; from unidecode import unidecode<br>&gt;&gt;&gt; utext = u&quot;Auf dA1/4r&quot;<br>&gt;&gt;&gt; unidecode(utext)<br>u'oeuf dur'<br>&gt;&gt;&gt; from unicodedata import normalize<br>&gt;&gt;&gt; normalize('NFKD', utext).encode('ascii','ignore')<br>'uf dur'<br><br>A better support for special latin extended characters (French, German) that should tranlitterate to multiple ASCII characters.
url: http://stackoverflow.com/questions/227459/ascii-value-of-a-character-in-python?rq=1#comment35813354_227472
current encoding: utf-8
ext: old
Wrote http__stackoverflow__com__questions__227459__ascii-value-of-a-character-in-python__rq__1__comment35813354_227472.old.html
ext: new
Wrote http__stackoverflow__com__questions__227459__ascii-value-of-a-character-in-python__rq__1__comment35813354_227472.new.html
Executing: diff http__stackoverflow__com__questions__227459__ascii-value-of-a-character-in-python__rq__1__comment35813354_227472.old.html http__stackoverflow__com__questions__227459__ascii-value-of-a-character-in-python__rq__1__comment35813354_227472.new.html
672c672
< <span class="comment-copy">Note that chr also acts as unichr in Python 3. <code>chr(31415) -&gt; &#39;窷&#39;</code></span>
---
> <span class="comment-copy">Note that chr also acts as unichr in Python 3. <code>chr(31415) -&gt; &#39;Liao &#39;</code></span>
699c699
< <span class="comment-copy">@njzk2: it doesn&#39;t use any character encoding it returns a bytestring in Python 2. It is upto you to interpret it as a character e.g., <code>chr(ord(u&#39;й&#39;.encode(&#39;cp1251&#39;))).decode(&#39;cp1251&#39;) == u&#39;й&#39;</code>. In Python 3 (or <code>unichr</code> in Python 2), the input number is interpreted as Unicode codepoint integer ordinal: <code>unichr(0x439) == &#39;\u0439&#39;</code> (the first 256 integers has the same mapping as latin-1: <code>unichr(0xe9) == b&#39;\xe9&#39;.decode(&#39;latin-1&#39;)</code>, the first 128 -- ascii: <code>unichr(0x0a) == b&#39;\x0a&#39;.decode(&#39;ascii&#39;)</code> it is a Unicode thing, not Python).</span>
---
> <span class="comment-copy">@njzk2: it doesn&#39;t use any character encoding it returns a bytestring in Python 2. It is upto you to interpret it as a character e.g., <code>chr(ord(u&#39;i&#39;.encode(&#39;cp1251&#39;))).decode(&#39;cp1251&#39;) == u&#39;i&#39;</code>. In Python 3 (or <code>unichr</code> in Python 2), the input number is interpreted as Unicode codepoint integer ordinal: <code>unichr(0x439) == &#39;\u0439&#39;</code> (the first 256 integers has the same mapping as latin-1: <code>unichr(0xe9) == b&#39;\xe9&#39;.decode(&#39;latin-1&#39;)</code>, the first 128 -- ascii: <code>unichr(0x0a) == b&#39;\x0a&#39;.decode(&#39;ascii&#39;)</code> it is a Unicode thing, not Python).</span>
715c715
< title="Use comments to ask for more information or suggest improvements. Avoid comments like “+1” or “thanks”."
---
> title="Use comments to ask for more information or suggest improvements. Avoid comments like "+1" or "thanks"."
752c752
< <p>Note that ord() doesn't give you the ASCII value per se; it gives you the numeric value of the character in whatever encoding it's in. Therefore the result of ord('ä') can be 228 if you're using Latin-1, or it can raise a TypeError if you're using UTF-8. It can even return the Unicode codepoint instead if you pass it a unicode:</p>
---
> <p>Note that ord() doesn't give you the ASCII value per se; it gives you the numeric value of the character in whatever encoding it's in. Therefore the result of ord('a') can be 228 if you're using Latin-1, or it can raise a TypeError if you're using UTF-8. It can even return the Unicode codepoint instead if you pass it a unicode:</p>
754c754
< <pre><code>&gt;&gt;&gt; ord(u'あ')
---
> <pre><code>&gt;&gt;&gt; ord(u'a')
807c807
< title="Use comments to ask for more information or suggest improvements. Avoid comments like “+1” or “thanks”."
---
> title="Use comments to ask for more information or suggest improvements. Avoid comments like "+1" or "thanks"."
894c894
< title="Use comments to ask for more information or suggest improvements. Avoid comments like “+1” or “thanks”."
---
> title="Use comments to ask for more information or suggest improvements. Avoid comments like "+1" or "thanks"."
1005c1005
< title="Use comments to ask for more information or suggest improvements. Avoid comments like “+1” or “thanks”."
---
> title="Use comments to ask for more information or suggest improvements. Avoid comments like "+1" or "thanks"."
1155c1155
< <a href="/questions/linked/227459">see more linked questions…</a>
---
> <a href="/questions/linked/227459">see more linked questions...</a>
brentg@wilddog:~/scratch_sandboxes/python/parse_html$
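In the peterbe.com run above, the reported encoding is ISO-8859-1 and words such as "Klüft" come out as "KlA1/4ft" rather than "Kluft". One plausible explanation, which is an assumption and not something the log confirms, is that the page bytes were actually UTF-8 but were decoded as ISO-8859-1 before transliteration. The stdlib-only sketch below shows how that mismatch turns "ü" into the two characters "Ã¼", which a transliterator would then render separately.

```python
original = "Kl\u00fcft"  # "Klüft"

# Assumption: the server sent UTF-8 bytes, but they were decoded as ISO-8859-1.
misdecoded = original.encode("utf-8").decode("iso-8859-1")
print(repr(misdecoded))  # 'Kl\u00c3\u00bcft' displayed as 'KlÃ¼ft'

# Reversing the wrong decode and applying the right codec recovers the text,
# so the encoding should be settled before any transliteration step.
repaired = misdecoded.encode("iso-8859-1").decode("utf-8")
print(repaired == original)  # True
```

If this is what happened, decoding the response as UTF-8 before calling Unidecode would likely give "Kluft" for that page, matching the results on the UTF-8 pages.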