Elixir & Haskell: Unicode

Most text nowadays is encoded as Unicode, and yet some programming languages either don't support it natively or require you to jump through hoops to support it. Elixir and Haskell both have good Unicode support, but there are some issues we have to understand in order to manipulate text well.

Introduction to Text Encodings and Numeral System Bases

Computer systems often default to a common character set called ASCII. Before we dig into Unicode, let's understand what a character set is.

A character set is just a set of characters. ASCII has 128 characters, while Unicode defines a code space of 1,114,112 codepoints. Not every character is printable; some are called control characters. These characters format or structure the text; examples include newlines, tabs and nulls.

ASCII characters are mostly English characters. So in order to support foreign languages or mathematical symbols, we need to expand our character count. Unicode tries to contain every possible character available on Earth, and then some.

A character set is enumerated by natural numbers (starting from 0) called codepoints. These codepoints serve as labels for the characters in a character set, so we can refer to a particular character just by its codepoint number, without having to actually draw it.

A text encoding is an isomorphic function that maps the characters in the character set to particular binary/bitstring/bytestring sequences. Everything in modern computers is stored as binary, so every character in a character set is also stored as some sort of binary sequence. An important realisation is that a character in a character set has nothing to do with the font you are using. Fonts merely provide a specific graphical rendering of the canonical character data in the character set.

It is isomorphic because given a particular encoding, it is possible to map the binary sequence to the character labeled by a codepoint, and vice versa.
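
As a small sketch of this isomorphism, encoding a text and decoding it again gives back the original text. This assumes the text and bytestring packages (Data.Text.Encoding is not covered in this gist itself):

-- | A minimal sketch of the encode/decode isomorphism.
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, decodeUtf8)

-- Decoding the encoded bytes gives back the original text (for any valid text).
roundTrip :: T.Text -> Bool
roundTrip t = decodeUtf8 (encodeUtf8 t) == t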

ASCII

Let's see how ASCII works first.

The ASCII text encoding function maps every character in ASCII to a single byte. Remember that a byte is an 8 bit sequence. This means something like:

\0 = 0 = [0,0,0,0,0,0,0,0]
...etc
a = 97 = [0,1,1,0,0,0,0,1]
b = 98 = [0,1,1,0,0,0,1,0]
...etc
\DEL = 127 = [0,1,1,1,1,1,1,1]

As we said before, ASCII has 128 characters. Its codepoints start at 0 and end at 127. Its text encoding function follows plain decimal to binary conversion. For example, converting the decimal (base 10) number 97 to base 2 gives us [1,1,0,0,0,0,1]; to turn it into a byte, we pad it with leading zeros until its length is a multiple of 8, which gives us [0,1,1,0,0,0,0,1]. Although this is true for ASCII, codepoint numbers in other text encodings need not follow this convention.
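
Here is a minimal sketch of that decimal-to-binary-plus-padding idea (the toBits and padToByte names are just illustrative); a fuller version using the digits package appears at the end of this gist:

-- | Convert a non-negative decimal number to its binary digits, then pad the
-- result with leading zeros until its length is a multiple of 8.
toBits :: Int -> [Int]
toBits 0 = [0]
toBits n = go n []
  where
    go 0 acc = acc
    go m acc = go (m `div` 2) (m `mod` 2 : acc)

padToByte :: [Int] -> [Int]
padToByte bits =
  let r = length bits `mod` 8
  in if r == 0 then bits else replicate (8 - r) 0 ++ bits

-- padToByte (toBits 97) == [0,1,1,0,0,0,0,1]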

You may notice that the most significant bit is always 0. This is why ASCII is sometimes called a 7 bit text encoding. When ASCII was first introduced, 8 bits to a byte was not yet a standard. The 8th bit was sometimes used for extensions to ASCII.

The fact that every character in ASCII maps to a single byte (8 bits) means that the memory size of an ASCII string in bytes is equal to its character length. You can therefore acquire the length or size of the string in constant time.
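
A minimal sketch of that point, assuming the bytestring package (Data.ByteString.Char8 treats each byte as one character, so this is only safe for pure ASCII text):

import qualified Data.ByteString.Char8 as BC

-- For pure ASCII text stored as a strict ByteString, the byte count and the
-- character count coincide, and `length` runs in constant time.
asciiLength :: IO ()
asciiLength = print (BC.length (BC.pack "hello ascii")) -- 11 bytes == 11 characters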

This works because from 0 to 255 (the first 256 decimals), a decimal to binary mapping keeps every codepoint inside 1 byte. At 256 and beyond, a decimal to binary mapping can no longer be kept inside 1 byte. Of course ASCII only needs codepoints 0 to 127.

In Haskell and Elixir, ASCII is fully supported. Most characters are writable, and for the non-printable characters, you can use their escape code. See: http://en.wikipedia.org/wiki/ASCII#ASCII_control_code_chart

UTF-8

UTF-8 is an ASCII compatible text encoding that can represent all 1,114,112 Unicode codepoints. This allows us to support virtually any language on Earth. The ASCII compatibility means that the first 128 codepoint to binary mappings are exactly the same as in ASCII.

It is a multibyte, variable width encoding, meaning that a particular character codepoint can be mapped to 1, 2, 3, or 4 bytes. The memory size of a string will therefore be at most 4× the number of characters in it. So if the string size is 4000 bytes, you can be sure the character length is somewhere between 1000 and 4000 characters. You cannot figure out the character length in constant time; you need to iterate through the string to find out the real character length.
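
A small sketch of this, assuming the text and bytestring packages, with characters that encode to 1, 2, 3 and 4 bytes respectively:

import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)

-- Character length versus encoded byte size for a mixed string.
lengths :: IO ()
lengths = do
  let t = T.pack "a\1103\8364\128512" -- 'a', 'я', '€', '😀'
  print (T.length t)                  -- 4 characters
  print (BS.length (encodeUtf8 t))    -- 10 bytes (1 + 2 + 3 + 4)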

Codepoints in UTF-8 do not necessarily follow decimal to binary conversion. They certainly do for the first 128 ASCII compatible character codepoints, from 0 to 127. All of these fit into 1 byte.

Beyond 127, codepoints map to binary sequences that are no longer simple decimal (base 10) to binary (base 2) conversions. For example, codepoint 254 under plain decimal to binary conversion would map to [1,1,1,1,1,1,1,0], which fits in 1 byte. However codepoint 254 in UTF-8 is in fact [1,1,0,0,0,0,1,1] [1,0,1,1,1,1,1,0], which is 2 bytes.
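
A sketch of the UTF-8 rule behind this, for codepoints in the two-byte range (128 to 2047), which are encoded following the bit pattern 110xxxxx 10xxxxxx (the encodeTwoBytes name is just illustrative):

import Data.Bits ((.&.), (.|.), shiftR)
import Data.Word (Word8)

-- Encode a codepoint in the range 128..2047 into two UTF-8 bytes
-- following the pattern 110xxxxx 10xxxxxx.
encodeTwoBytes :: Int -> [Word8]
encodeTwoBytes cp =
  [ fromIntegral (0xC0 .|. (cp `shiftR` 6))   -- high 5 bits
  , fromIntegral (0x80 .|. (cp .&. 0x3F)) ]   -- low 6 bits

-- encodeTwoBytes 254 == [0xC3, 0xBE], i.e. 11000011 10111110 as described above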

This website allows you to quickly convert UTF-8 characters to their isomorphic representations: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi

Haskell & Elixir

Both Haskell and Elixir have excellent support for UTF-8. Let's investigate.

Haskell:

-- | Straight Codepoint Syntax. Use putStrLn, as print is for debugging, not outputting strings.
> putStrLn "\87" -- ^ W
> putStrLn "\1103" -- ^ я
> print "\1103" -- ^ "\1103"

-- | Manipulating codepoint numbers
-- If you print a '\1103' or "\1103", it's not literally \1103. It's an isomorphic debugging 
-- representation of the real binary sequence.
-- `toEnum` is return type polymorphic: it converts an Int into any type in the Enum typeclass.
-- You can then type hint it or constrain it to one of the possible instances.
> toEnum 1103 :: Char -- ^ '\1103'
> fromEnum 'я' :: Int -- ^ 1103
> (succ $ toEnum 35) :: Char -- ^ '$' (you can enumerate codepoints!)
> 
> let x = toEnum 1103
> x :: Char    -- ^ '\1103' (the character я)
> x :: Integer -- ^ 1103 as an Integer
> x :: Int     -- ^ 1103 as an Int
> 
> putStrLn (x:[])
> putStrLn [x]

-- | Using the Data.Char library
> import Data.Char
> 
> chr 1103 -- ^ '\1103' ~ 'я'
> ord 'я'  -- ^ 1103

-- | Haskell supports codepoint, escape code, hexadecimal, octal, escaped abbreviation, 
-- and direct source. 
-- These notations can be used inside of string constructors `""` and character constructors `''` 
-- and raw values.
> let a = "\70"
> let b = "\b"
> let c = "\xFF"
> let d = "\o777"
> let e = "\NUL"
> let f = "ł"
> putStrLn (a ++ b ++ c ++ d ++ e ++ f)

-- | Raw value notation, currently only hex or octal, no binary notation
-- These are stored as integers. That is `511 == 0o777`.
> let g = 0xFF
> let h = 0o777
> print g >> print h

-- | Using Data.Text
-- We can convert ordinary String values into the `Text` type, which is
-- a space efficient, packed, unboxed Unicode text type.
-- `Text` is essentially binary data equipped with a character encoding.
-- This library gives more specific functions to work with Unicode text.
-- See this for working with text or bytestrings: http://blog.ezyang.com/2010/08/strings-in-haskell/
> import Data.Text
> (unpack . pack $ "я") == "я"

Other Resources:

Haskell has an extension called OverloadedStrings; this allows string literals to be written in a polymorphic manner. Ordinarily a string literal has exactly the String type. But with OverloadedStrings, you can do things like:

> :set -XOverloadedStrings
> "hello"
"hello"
it :: Data.String.IsString a => a

This makes literal strings polymorphic over the IsString type class. By default it only has 1 instance, which is the String type. However if you import string handling libraries, such as Data.Text or Data.ByteString, you can type hint a literal string to be of a Text type or a ByteString type.

This makes the literal string constructor "" much more flexible and extensible by third party libraries. The type of the string can be either type hinted or inferred from subsequent operations. It also means you avoid having to write conversions from the native String type whenever you want to do complex string manipulation.
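
For example, a minimal sketch (the names and values here are just illustrative) where the same literal syntax produces both a Text and a ByteString:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import qualified Data.Text as T

-- The same "" literal syntax produces different string types via IsString.
someText :: T.Text
someText = "привет"

someBytes :: BS.ByteString
someBytes = "hello" -- note: the ByteString IsString instance truncates non-ASCII characters

example :: IO ()
example = print (T.length someText) >> print (BS.length someBytes) -- 6 and 5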

Elixir

Elixir represents strings as binary sequences (UTF-8 encoded binaries). It also has character lists: the '' syntax produces a list of codepoints, which is essentially the same as the Haskell String type, since that is in fact a [Char]. There is no individual Char type in Elixir though, so you always work with a character list of integer codepoints. Normal strings exist in binary form, so you have to use functions that work on binary sequences to manipulate them. This gives rise to a pretty cool binary comprehension syntax.

## Character to Codepoint
> ?я               # 1103 (`?` is a unary operator, so you can use it in expressions as well)
> to_char_list "я" # [1103]
> 'я'              # [1103]

## Codepoint to Character
> <<1103::utf8>>                        # "я" (it's also possible to use a variable like `<<var::utf8>>`)
> to_string [1103]                      # "я"
> "\x{44F}"                             # "я" (more like hex to character)
> IO.chardata_to_string [1103]          # "я"
> IO.chardata_to_string [0x44F]         # "я"
> IO.chardata_to_string [0b10001001111] # "я"
> Integer.to_char_list(1103, 16)        # '44F' (converts the base 10 integer 1103 to base 16 hex)
                                        # This is an incredibly useful function, as it allows you to convert a
                                        # base 10 integer to any base. It returns a character list, because
                                        # digits in other bases aren't necessarily decimal digits.

## Raw value notation
## These are stored as integers.
## One unique thing is that Elixir has binary syntax!
> a = 0b1010
> b = 0xFF
> c = 0o777

## Elixir supports direct UTF-8 source, escape codes, escaped abbreviations, hex codes,
## codepoint escapes, and hex code expressions.
## The `\126` codepoint escape doesn't seem to match the decimal UTF-8 codepoint mapping;
## it is probably being treated as an octal escape (as in Erlang), but I'm not sure.
> a = "ł"
> b = "\b"
> c = "\NUL"
> d = "\xFF"
> f = "\126"
> g = "\x{332}"

Binary construction and comprehension syntax is quite special in Elixir and deserves its own section.

> <<73 :: utf8>>
> <<10029 :: utf8>>

The value on the right of :: is a type/size specifier for the value on the left. So 73 above is to be encoded as utf8, which is variable width. But you can also use fixed sizes, such as `<<126 :: 8>>`, which converts 126 into an 8 bit bitstring, giving us a byte representing ~ (this is also its ASCII encoding). The 8 bit size is the default, so you can just use `<<126>>`. However there is a limit: the largest integer that fits in a single byte is 255, so if you try `<<256>>` it doesn't fit into a single byte and gets truncated to its lowest 8 bits, giving `<<0>>`.

We can easily take any binary representation in Elixir, and comprehend its constituent bits.

> for << b::1 <- <<1103::utf8>> >>, do: b 
## [1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

It takes the codepoint 1103 as a UTF-8 character, and then iterates through the binary sequence 1 bit at a time, returning it as a list of bits, 1s and 0s.

It's important to realise that it actually returns a list of base 10 decimal integers. This is because it takes each bit sequence and converts it to base 10. For the above this works like 0 => 0 and 1 => 1.

We can also change the iterated chunk size to 4. The example below iterates through `<<256::16>>` (a 16 bit representation of 256) 4 bits at a time, so 4 bits are considered during each conversion to base 10 decimal. This works like 0,0,0,0 => 0; 0,0,0,1 => 1; 0,0,0,0 => 0; 0,0,0,0 => 0.

> for << b::4 <- <<256::16>> >>, do: b
## [0, 1, 0, 0]

As you can see, this is a pretty expressive way of directly manipulating binary sequences. It means you can easily work with binary protocols or other kinds of character encodings. For more information on these kinds of expressions, see: http://elixir-lang.org/docs/v1.0/elixir/Kernel.SpecialForms.html#%3C%3C%3E%3E/1

Summary

Lastly, for more information on working with Unicode in a software context, see: http://unicodebook.readthedocs.org/en/latest/index.html

Update

Haskell has a language extension -XBinaryLiterals that allows one to write binary literals like 0b1111 or 0B1111. These are desugared into integer literals via fromInteger (they aren't stored as actual binary sequences of course!).

Also for fast binary ASCII case conversions (sketched below):

  • lower - OR 32
  • upper - AND 223
  • flip - XOR 32
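
A minimal sketch combining the two points above, using BinaryLiterals for the masks and Data.Bits for the operations (the function names are just illustrative); this only holds for the ASCII letters A-Z and a-z:

{-# LANGUAGE BinaryLiterals #-}
import Data.Bits ((.&.), (.|.), xor)
import Data.Char (chr, ord)

-- Fast ASCII case conversions via bit masks (valid only for A-Z and a-z).
lowerAscii, upperAscii, flipAscii :: Char -> Char
lowerAscii c = chr (ord c .|. 0b00100000)   -- OR  32
upperAscii c = chr (ord c .&. 0b11011111)   -- AND 223
flipAscii  c = chr (ord c `xor` 0b00100000) -- XOR 32

-- lowerAscii 'A' == 'a', upperAscii 'a' == 'A', flipAscii 'B' == 'b'
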
{-|
This script helps with converting bases between numeral systems.
Use `convertBase` to create custom base conversions.
Use `pad` to pad any bit sequence to a valid byte size (multiple of 8).
Use `byteSize` to tell what the byte size of a padded binary sequence is.
This requires the digits and split packages.
The below is designed for unsigned numbers. It does not work for signed or negative numbers.
-}
{-# LANGUAGE PackageImports #-}

import "digits" Data.Digits (digits, unDigits)
import "split" Data.List.Split (chunksOf)

-- | Convert a list of digits in base `from` into a list of digits in base `to`.
convertBase :: (Integral a) => a -> a -> [a] -> [a]
convertBase from to = digits to . unDigits from

decToBin :: (Integral a) => a -> [a]
decToBin 0 = [0]
decToBin x = convertBase 10 2 [x]

binToDec :: (Integral a) => [a] -> a
binToDec [] = 0
binToDec x  = parseDigits $ convertBase 2 10 x
  where
    -- (+) . (* 10) multiplies the accumulator by 10, then adds the next digit
    parseDigits = foldl ((+) . (* 10)) 0

-- We want this function to give back True ONLY for whole numbers (1, 2, 3, 4, 5...).
-- I like to think of whole numbers as {1, 2, 3...} and natural numbers as {0, 1, 2, 3...},
-- but this is just my preference; they are the positive and the non-negative integers.
isWholeNumber :: RealFrac a => a -> Bool
isWholeNumber x
  | x < 1     = False
  | otherwise = x == fromInteger (round x)

-- Int cannot be divided using `/`, so the length is converted to a Double first.
pad :: (Integral a) => [a] -> [a]
pad [] = []
pad bitString =
  let
    currentLength = length bitString
    factor = fromIntegral currentLength / 8 :: Double
    itDoesNeedPadding = not $ isWholeNumber factor
  in
    if itDoesNeedPadding
      then replicate (ceiling factor * 8 - currentLength) 0 ++ bitString
      else bitString

byteSize :: (Integral a) => [a] -> String
byteSize bitString = show (length $ chunksOf 8 $ pad bitString) ++ " Bytes"