There are lots of representations for strings. Most languages pick one set of trade-offs and run with it. In Haskell the "default" implementation (at least the one in the Prelude) is a pretty bad choice, but unlike in most other languages, (really) good implementations exist for pretty much every way you can twist these things. This can be a good thing, but it also leads to confusion and frustration when trying to find the right types and how to convert between them.
- Packed vs Unpacked -- Effectively contiguous memory vs a linked list. Java strings and normal C strings are packed.
- Lazy vs Strict -- Whether the string can be produced in pieces on demand (lazy) or must be held in memory all at once (strict). Java strings and normal C strings are strict.
- Encoding: UTF-8, UTF-16, ASCII, Binary -- Obvious, but docs sometimes refer to UTF-16 as "unicode" (even though that may mean UTF-16 or UTF-32), and sometimes refer to ASCII as 8-bit (ASCII proper is 7-bit).
String: Unpacked list of (UTF-32) characters, i.e. type String = [Char]. It is lazy because [] is lazy.
Never use it unless you have to in order to deal with some library. It is an aesthetically nice representation, but unpacked strings are very rarely useful.
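To make the "unpacked and lazy" description concrete, here is a minimal sketch. The names shout and firstFive are made up for illustration; the point is only that a String is an ordinary list, so list functions and list laziness apply to it directly.

```haskell
import Data.Char (toUpper)

-- String is a synonym for [Char], so ordinary list functions work on it.
shout :: String -> String
shout = map toUpper

-- Laziness: only the first five characters of the infinite string are forced.
firstFive :: String
firstFive = take 5 (cycle "ab")
```

Every character is a separate list cell, which is what makes the representation convenient to pattern match on and expensive to store.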
Conversions to String from:
- Data.Text: Data.Text.unpack
- Data.Text.Lazy: Data.Text.Lazy.unpack
- Data.ByteString: Data.ByteString.Char8.unpack or Data.ByteString.UTF8.toString, depending on encoding.
- Data.ByteString.Lazy: Data.ByteString.Lazy.Char8.unpack or Data.ByteString.Lazy.UTF8.toString, depending on encoding.
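The conversions to String can be sketched as below. The wrapper names are hypothetical and exist only to show the types involved. Only the Char8 variants are shown for ByteString, since they ship with the bytestring package; they are only sensible when the bytes really are ASCII (or Latin-1).

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Strict Text to String.
fromStrictText :: T.Text -> String
fromStrictText = T.unpack

-- Lazy Text to String.
fromLazyText :: TL.Text -> String
fromLazyText = TL.unpack

-- Strict bytes to String, one Char per byte (no UTF-8 decoding happens here).
fromStrictBytes :: BC.ByteString -> String
fromStrictBytes = BC.unpack

-- Lazy bytes to String, again one Char per byte.
fromLazyBytes :: BLC.ByteString -> String
fromLazyBytes = BLC.unpack
```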
Data.Text: Strict, packed, UTF-16 strings (UTF-8 internally since text-2.0). Fast, and easy to work with. (If you see Data.Text.Internal leak out, you are normally dealing with Data.Text.)
This is the go-to string representation: fast enough in most cases and easy to work with. It is the closest to a Java string, but it is worth noting that it is safer than the equivalent in that it cannot contain invalid Unicode characters.
Conversions to Data.Text from:
- String: fromString, or Data.Text.pack
- Data.Text.Lazy: Data.Text.concat . Data.Text.Lazy.toChunks
- Data.ByteString: Data.Text.Encoding.decodeUtf8, or Data.Text.Encoding.decode* depending on encoding.
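A small sketch of the conversions into strict Text follows. The function names are invented for the example. One swap worth naming: decodeUtf8 throws an exception on invalid input, so this sketch uses decodeUtf8' from the same module, which returns an Either instead.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Lazy as TL

-- String to strict Text.
textFromString :: String -> T.Text
textFromString = T.pack

-- Lazy Text to strict Text by concatenating its chunks
-- (Data.Text.Lazy.toStrict does the same job).
textFromLazy :: TL.Text -> T.Text
textFromLazy = T.concat . TL.toChunks

-- UTF-8 bytes to strict Text, reporting invalid input as a Left.
textFromUtf8 :: B.ByteString -> Either String T.Text
textFromUtf8 bytes = case TE.decodeUtf8' bytes of
  Left err -> Left (show err)
  Right t  -> Right t
```

Whether to fail, replace, or throw on bad bytes is a per-application choice; the Either version just makes the failure case visible in the type.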
Data.Text.Lazy: Lazy, packed, UTF-16 strings. Can sometimes be faster when streaming data, but you have to deal with correctness issues around resource usage, and it is slower than Data.Text in a lot of cases, even when you think it might be faster.
Conversions to Data.Text.Lazy from:
- String: fromString, or Data.Text.Lazy.pack
- Data.Text: Data.Text.Lazy.fromChunks . return
- Data.ByteString.Lazy: Data.Text.Lazy.Encoding.decodeUtf8, or Data.Text.Lazy.Encoding.decode* depending on encoding.
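The fromChunks conversion makes more sense once you see that a lazy Text is, roughly, a lazy list of strict chunks. A minimal sketch, with made-up names:

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Wrap a single strict Text as a one-chunk lazy Text
-- (Data.Text.Lazy.fromStrict does the same job).
lazyFromStrict :: T.Text -> TL.Text
lazyFromStrict = TL.fromChunks . return

-- A lazy Text assembled from two strict chunks.
assembled :: TL.Text
assembled = TL.fromChunks [T.pack "hello, ", T.pack "world"]
```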
Data.ByteString.Char8: Strict, packed, ASCII. Nice characteristics, except that it is ASCII (each Char is truncated to 8 bits), so it is rarely the right choice. Also note that these are just operations for treating a ByteString as ASCII, not a type of its own.
Conversions to Data.ByteString.Char8 from:
- String: Data.ByteString.Char8.pack
- Data.Text.Lazy: As per Data.ByteString.
- Data.ByteString: This is a Data.ByteString. No conversion.
- Data.ByteString.Lazy: As per Data.ByteString.
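Why Char8 is rarely the right choice can be shown in a few lines. This sketch (the names truncated and asciiRoundTrip are illustrative) packs a non-ASCII character and inspects the resulting bytes: Char8.pack keeps only the low 8 bits of each code point.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- U+03BB (Greek lambda) does not fit in a byte; only the low byte 0xBB survives.
truncated :: [Word8]
truncated = B.unpack (BC.pack "\955")

-- Plain ASCII survives a pack/unpack round trip unchanged.
asciiRoundTrip :: String
asciiRoundTrip = BC.unpack (BC.pack "abc")
```

The truncation is silent, so data loss of this kind tends to surface far away from the call that caused it.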
Data.ByteString.Lazy.Char8: Lazy, packed, ASCII. The same issues as Data.Text.Lazy apply, and it is ASCII, so it is rarely the right choice. Also note that these are just operations for treating a lazy ByteString as ASCII, not a type of its own.
Conversions to Data.ByteString.Lazy.Char8 from:
- String: Data.ByteString.Lazy.Char8.pack
- Data.Text.Lazy: As per Data.ByteString.
- Data.ByteString: As per Data.ByteString.Lazy.
- Data.ByteString.Lazy: This is a Data.ByteString.Lazy. No conversion.
Data.ByteString.UTF8: Strict, packed, UTF-8. The way to go if you want UTF-8. Also note that these are just operations for treating a ByteString as UTF-8, not a type of its own.
Conversions to Data.ByteString.UTF8 from:
- String: Data.ByteString.UTF8.fromString
- Data.Text.Lazy: As per Data.ByteString.
- Data.ByteString: This is a Data.ByteString. No conversion.
- Data.ByteString.Lazy: As per Data.ByteString.
Data.ByteString.Lazy.UTF8: Lazy, packed, UTF-8. Same trade-offs as Data.Text.Lazy. Also note that these are just operations for treating a lazy ByteString as UTF-8, not a type of its own.
Conversions to Data.ByteString.Lazy.UTF8 from:
- String: Data.ByteString.Lazy.UTF8.fromString
- Data.Text.Lazy: As per Data.ByteString.
- Data.ByteString: As per Data.ByteString.Lazy.
- Data.ByteString.Lazy: This is a Data.ByteString.Lazy. No conversion.
Data.ByteString: Strict, packed, binary. Used a lot. The go-to for binary data. Underpins Data.Text and all the strict Data.ByteString.* variants.
Conversions to Data.ByteString from:
- String: See either Data.ByteString.Char8.pack or Data.ByteString.UTF8.fromString, depending on encoding.
- Data.Text: Data.Text.Encoding.encodeUtf8, or Data.Text.Encoding.encode* depending on encoding.
- Data.ByteString.Lazy: Data.ByteString.concat . Data.ByteString.Lazy.toChunks
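A short sketch of getting to a strict ByteString, using only the text and bytestring packages. The names utf8Bytes, strictFromLazy and sizes are assumptions made for the example; sizes is there to show that character count and byte count diverge as soon as the text leaves ASCII.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Strict Text to UTF-8 encoded bytes.
utf8Bytes :: T.Text -> B.ByteString
utf8Bytes = TE.encodeUtf8

-- Lazy bytes to strict bytes by concatenating the chunks
-- (Data.ByteString.Lazy.toStrict does the same job).
strictFromLazy :: BL.ByteString -> B.ByteString
strictFromLazy = B.concat . BL.toChunks

-- (number of characters, number of UTF-8 bytes)
sizes :: T.Text -> (Int, Int)
sizes t = (T.length t, B.length (utf8Bytes t))
```

For example, sizes applied to a five-character word containing one "é" gives (5, 6), since that character takes two bytes in UTF-8.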
Data.ByteString.Lazy: Lazy, packed, binary. Used a lot. All the normal laziness caveats apply. Underpins Data.Text.Lazy and all the lazy Data.ByteString.Lazy.* variants.
Conversions to Data.ByteString.Lazy from:
- String: See either Data.ByteString.Lazy.Char8.pack or Data.ByteString.Lazy.UTF8.fromString, depending on encoding.
- Data.Text.Lazy: Data.Text.Lazy.Encoding.encodeUtf8, or Data.Text.Lazy.Encoding.encode* depending on encoding.
- Data.ByteString: Data.ByteString.Lazy.fromChunks . return
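The lazy counterparts look much the same. This is a sketch with hypothetical names, and the round trip is only there to show that encodeUtf8 and decodeUtf8 from Data.Text.Lazy.Encoding are inverses on valid text.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Wrap strict bytes as a one-chunk lazy ByteString
-- (Data.ByteString.Lazy.fromStrict does the same job).
lazyFromStrict :: B.ByteString -> BL.ByteString
lazyFromStrict = BL.fromChunks . return

-- Lazy Text to lazy UTF-8 bytes.
lazyUtf8 :: TL.Text -> BL.ByteString
lazyUtf8 = TLE.encodeUtf8

-- Encode and decode again; valid text comes back unchanged.
roundTrip :: TL.Text -> TL.Text
roundTrip = TLE.decodeUtf8 . TLE.encodeUtf8
```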
I have never run into an occasion to use these, but I think there are some libraries that use them.
- Lazy, Unpacked, UTF-8: Codec.Binary.UTF8.Generic / Word8
- Strict, Packed, ASCII: Data.CompactString.ASCII
Which one to use depends on the context.
Most of these data types are designed to be imported qualified with an alias, e.g.
import qualified Data.Text as T
len :: T.Text -> T.Text -> Int
len a b = T.length a + T.length b
You can also use overloaded strings for literals, e.g.
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
blah :: T.Text
blah = "blah!"
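The same extension covers the other packed types too, since each provides an IsString instance. A minimal sketch (the greeting names are made up); note that the ByteString instance goes through Char8-style packing, so it is safest to keep such literals ASCII.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- The literal's type is picked by the signature, not by the literal itself.
strictGreeting :: T.Text
strictGreeting = "hello"

lazyGreeting :: TL.Text
lazyGreeting = "hello"

-- Bytes: each character is truncated to 8 bits, so keep this ASCII.
byteGreeting :: BC.ByteString
byteGreeting = "hello"
```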
Another option: hide the Prelude, use overloaded strings, and use Data.Text by default. This is what I do some of the time.
For example, define some custom prelude that re-exports your view of the world.
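{-# LANGUAGE NoImplicitPrelude #-}
module CustomPrelude (
    module X
  ) where
-- Re-export a hand-picked slice of the Prelude; (+) is included because the usage example needs it.
import Prelude as X (Int, Eq, Ord, Show, (+))
-- Text and its functions (length, map, ...) take the place of the list-based Prelude versions.
import Data.Text as X
import Control.Applicative as X ((<$>), (<*>), (*>), (<*), pure)
import Control.Monad as X (void, when, unless, liftM)
import Data.Traversable as X (mapM)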
Then use with:
{-# LANGUAGE OverloadedStrings, NoImplicitPrelude #-}
import CustomPrelude
-- length here is Data.Text.length, re-exported by CustomPrelude.
len :: Text -> Text -> Int
len a b = length a + length b
blah :: Text
blah = "blah!"