Haskell Character Data

Overview

There are many ways to represent strings. Most languages pick one set of trade-offs and run with it. In Haskell the "default" implementation (at least the one in the Prelude) is a pretty bad choice, but unlike most other languages, really good implementations exist for pretty much every set of trade-offs you might want. This can be a good thing, but it also leads to confusion and frustration when trying to find the right type and work out how to convert between them.

Types

Terminology

  • Packed vs Unpacked -- Contiguous memory vs. (effectively) a linked list. Java strings and normal C strings are packed.
  • Lazy vs Strict -- A strict string must be held in memory all at once; a lazy one can be produced and consumed piece by piece. Java strings and normal C strings are strict.
  • Encoding: UTF-8, UTF-16, ASCII, Binary -- Obvious, but docs sometimes refer to UTF-16 as "unicode" (even though that could mean UTF-16 or UTF-32), and sometimes refer to ASCII as "8-bit".

Data.String

Unpacked list of (UTF-32) characters, i.e. type String = [Char]. Lazy because [] is lazy.

Never use it unless a library forces you to. It is an aesthetically nice representation, but unpacked strings are very rarely useful.

Conversions From x to Data.String

  • Data.Text: Data.Text.unpack
  • Data.Text.Lazy: Data.Text.Lazy.unpack
  • Data.ByteString: Data.ByteString.Char8.unpack or Data.ByteString.UTF8.toString depending on encoding.
  • Data.ByteString.Lazy: Data.ByteString.Lazy.Char8.unpack or Data.ByteString.Lazy.UTF8.toString depending on encoding.

Data.Text

Strict, Packed, UTF-16 strings (UTF-8 internally as of text-2.0). Fast, and easy to work with. If you see Data.Text.Internal leak out (in an error message, say), you are normally dealing with Data.Text.

This is the go-to string representation: fast enough in most cases and easy to work with. It is the closest to a Java string, but it is worth noting that it is safer than the equivalent in that it cannot contain invalid Unicode characters.

Conversions From x to Data.Text

  • String: fromString, or Data.Text.pack
  • Data.Text.Lazy: Data.Text.concat . Data.Text.Lazy.toChunks
  • Data.ByteString: Data.Text.Encoding.decodeUtf8, or Data.Text.Encoding.decode* depending on encoding.
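
The conversions above can be sketched as follows. This is a minimal illustration, assuming only the text and bytestring packages; the function names (fromStr, fromLazy, fromUtf8) are made up for this example. It uses decodeUtf8' rather than decodeUtf8, since the latter throws on invalid input.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Lazy as TL

-- String -> strict Text
fromStr :: String -> T.Text
fromStr = T.pack

-- Lazy Text -> strict Text. This forces every chunk into memory;
-- Data.Text.Lazy.toStrict does the same job.
fromLazy :: TL.Text -> T.Text
fromLazy = T.concat . TL.toChunks

-- UTF-8 encoded bytes -> Text. Invalid input is reported as a Left
-- instead of a runtime exception.
fromUtf8 :: B.ByteString -> Either String T.Text
fromUtf8 bytes = case TE.decodeUtf8' bytes of
  Left err   -> Left (show err)
  Right text -> Right text
```

Whether to fail, or to substitute a replacement character via Data.Text.Encoding.decodeUtf8With, depends on where the bytes come from.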

Data.Text.Lazy

Lazy, Packed, UTF-16 strings. Can sometimes be faster when streaming data, but you have to deal with correctness issues around resource usage, and it is slower than Data.Text in a lot of cases, even when you think it might be faster.

Conversions From x to Data.Text.Lazy

  • String: fromString, or Data.Text.Lazy.pack
  • Data.Text: Data.Text.Lazy.fromChunks . return
  • Data.ByteString.Lazy: Data.Text.Lazy.Encoding.decodeUtf8, or Data.Text.Lazy.Encoding.decode* depending on encoding.

Data.ByteString.Char8

Strict, Packed, ASCII. Nice characteristics, except that it is ASCII, so rarely the right choice. Also note that these are just operations for treating a ByteString as ASCII, not a type of their own.

Conversions From x to Data.ByteString.Char8

  • String: Data.ByteString.Char8.pack
  • Data.Text: As per Data.ByteString.
  • Data.ByteString: This is a Data.ByteString. No conversion.
  • Data.ByteString.Lazy: As per Data.ByteString.
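
To illustrate why the Char8 operations are rarely the right choice for text, here is a small sketch (the names truncated and roundTrip are made up for this example). Data.ByteString.Char8.pack keeps only the low 8 bits of each Char, so anything outside that range is silently mangled.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- '\955' is the Greek letter lambda (U+03BB). Char8.pack keeps only
-- the low 8 bits of the code point, i.e. the single byte 0xBB.
truncated :: B.ByteString
truncated = BC.pack "\955"

-- Unpacking gives back '\187', not the original character.
roundTrip :: String
roundTrip = BC.unpack truncated
```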

Data.ByteString.Lazy.Char8

Lazy, Packed, ASCII. Has the same issues as Data.Text.Lazy, and it is ASCII, so rarely the right choice. Also note that these are just operations for treating a lazy ByteString as ASCII, not a type of their own.

Conversions From x to Data.ByteString.Lazy.Char8

  • String: Data.ByteString.Lazy.Char8.pack
  • Data.Text.Lazy: As per Data.ByteString.Lazy.
  • Data.ByteString: As per Data.ByteString.Lazy.
  • Data.ByteString.Lazy: This is a Data.ByteString.Lazy. No conversion.

Data.ByteString.UTF8

Strict, Packed, UTF-8. The way to go if you want UTF-8. Also note that these are just operations for treating a ByteString as UTF-8, not a type of their own.

Conversions From x to Data.ByteString.UTF8

  • String: Data.ByteString.UTF8.fromString
  • Data.Text: As per Data.ByteString.
  • Data.ByteString: This is a Data.ByteString. No conversion.
  • Data.ByteString.Lazy: As per Data.ByteString.

Data.ByteString.Lazy.UTF8

Lazy, Packed, UTF-8. Same trade-offs as Data.Text.Lazy. Also note that these are just operations for treating a lazy ByteString as UTF-8, not a type of their own.

Conversions From x to Data.ByteString.Lazy.UTF8

  • String: Data.ByteString.Lazy.UTF8.fromString
  • Data.Text.Lazy: As per Data.ByteString.Lazy.
  • Data.ByteString: As per Data.ByteString.Lazy.
  • Data.ByteString.Lazy: This is a Data.ByteString.Lazy. No conversion.

Data.ByteString

Strict, Packed, Binary. Used a lot. The go-to for binary data. Underpins all of the strict Data.ByteString.* modules, and is what Data.Text encodes to and decodes from.

Conversions From x to Data.ByteString

  • String: See either Data.ByteString.Char8.pack or Data.ByteString.UTF8.fromString depending on encoding.
  • Data.Text: Data.Text.Encoding.encodeUtf8, or Data.Text.Encoding.encode* depending on encoding.
  • Data.ByteString.Lazy: Data.ByteString.concat . Data.ByteString.Lazy.toChunks
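
A minimal sketch of these conversions, assuming only the text and bytestring packages (the function names are made up for this example). Going from String via Data.Text is one way to get UTF-8 bytes without the utf8-string package.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Text -> UTF-8 encoded bytes
toUtf8 :: T.Text -> B.ByteString
toUtf8 = TE.encodeUtf8

-- String -> UTF-8 encoded bytes, by way of Text
stringToUtf8 :: String -> B.ByteString
stringToUtf8 = TE.encodeUtf8 . T.pack

-- Lazy ByteString -> strict ByteString. This pulls every chunk into
-- memory; Data.ByteString.Lazy.toStrict does the same job.
strictFromLazy :: BL.ByteString -> B.ByteString
strictFromLazy = B.concat . BL.toChunks
```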

Data.ByteString.Lazy

Lazy, Packed, Binary. Used a lot. All the normal laziness caveats apply. Underpins all of the Data.ByteString.Lazy.* modules, and is what Data.Text.Lazy encodes to and decodes from.

Conversions From x to Data.ByteString.Lazy

  • String: See either Data.ByteString.Lazy.Char8.pack or Data.ByteString.Lazy.UTF8.fromString depending on encoding.
  • Data.Text.Lazy: Data.Text.Lazy.Encoding.encodeUtf8, or Data.Text.Lazy.Encoding.encode* depending on encoding.
  • Data.ByteString: Data.ByteString.Lazy.fromChunks . return
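
The lazy side looks much the same. Another small sketch, with made-up function names, assuming only the text and bytestring packages:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Strict ByteString -> lazy ByteString with a single chunk.
-- `return` here is the list monad's, so it builds a one-element list;
-- Data.ByteString.Lazy.fromStrict does the same job.
lazyFromStrict :: B.ByteString -> BL.ByteString
lazyFromStrict = BL.fromChunks . return

-- Lazy Text -> lazy UTF-8 encoded bytes
lazyUtf8 :: TL.Text -> BL.ByteString
lazyUtf8 = TLE.encodeUtf8
```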

Also-rans

I have never run into an occasion to use these, but I think some libraries use them.

  • Lazy, Unpacked, UTF-8: Codec.Binary.UTF8.Generic / Word8
  • Strict, Packed, ASCII: Data.CompactString.ASCII

Usage

Depends on the context.

Option 1

Most of these data types are designed to be imported qualified with an alias, e.g.

import qualified Data.Text as T

len :: T.Text -> T.Text -> Int
len a b = T.length a + T.length b

You can also use overloaded strings for literals, e.g.

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T

blah :: T.Text
blah = "blah!"
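
As a further sketch of what the extension buys you: with OverloadedStrings, the same literal can be used at any type with an IsString instance, and the type signature picks the representation. The names below are made up for this example. Note that the ByteString instance goes through the Char8 operations, so it is only safe for ASCII literals.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- One literal, three representations.
asText :: T.Text
asText = "blah!"

asLazyText :: TL.Text
asLazyText = "blah!"

asBytes :: B.ByteString
asBytes = "blah!"
```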

Option 2

Hide the Prelude, use overloaded strings, and use Data.Text by default. This is what I do some of the time.

For example, define some custom prelude that re-exports your view of the world.

{-# LANGUAGE NoImplicitPrelude #-}

module CustomPrelude (
  module X
) where

import Prelude as X (Int, Eq, Ord, Show, (+)) -- (+) is needed for the arithmetic in client code
import Data.Text as X
import Control.Applicative as X ((<$>), (<*>), (*>), (<*), pure)
import Control.Monad as X (void, when, unless, liftM)
import Data.Traversable as X (mapM)

Then use with:

{-# LANGUAGE OverloadedStrings, NoImplicitPrelude #-}

import CustomPrelude

len :: Text -> Text -> Int
len a b = length a + length b

blah :: Text
blah = "blah!"