@hasufell
Last active June 24, 2022 16:57
AFPP and correctness

Correctness issues with current String based filepath handling

Context

On unix, the base library uses getFileSystemEncoding and mkTextEncoding to pick a round-trippable encoding for filepaths. E.g. if your locale returns en_US.UTF-8 you'll get UTF-8//ROUNDTRIP TextEncoding, which is based on PEP 383.

This encoding is then used to encode and decode from CString: https://gitlab.haskell.org/ghc/ghc/-/blob/eb4fb8493823d9b962c26507d01a921dca8b8857/libraries/base/System/Posix/Internals.hs#L175-183

The documentation of mkTextEncoding already warns us:

In theory, this mechanism (<encoding>//ROUNDTRIP) allows arbitrary data to be roundtripped via a String with no loss of data. In practice, there are two limitations to be aware of:

  1. This only stands a chance of working for an encoding which is an ASCII superset, as for security reasons we refuse to escape any bytes smaller than 128. Many encodings of interest are ASCII supersets (in particular, you can assume that the locale encoding is an ASCII superset) but many (such as UTF-16) are not.
  2. If the underlying encoding is not itself roundtrippable, this mechanism can fail. Roundtrippable encodings are those which have an injective mapping into Unicode. Almost all encodings meet this criteria, but some do not. Notably, Shift-JIS (CP932) and Big5 contain several different encodings of the same Unicode codepoint.

The Unicode Consortium has also published a draft about Security and Interoperability Problems with PEP 383.
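The PEP-383 escaping behind `//ROUNDTRIP` can be observed directly: a byte that is invalid in the underlying encoding is decoded to the lone surrogate `0xDC00 + byte`. A minimal sketch using base's `GHC.Foreign` (the byte value is illustrative):

```haskell
import GHC.IO.Encoding (mkTextEncoding)
import qualified GHC.Foreign as GHC
import Foreign.Marshal.Array (withArray)
import Foreign.C.Types (CChar)
import Data.Word (Word8)
import Data.Char (ord)
import Numeric (showHex)

main :: IO ()
main = do
  enc <- mkTextEncoding "UTF-8//ROUNDTRIP"
  -- 0xFF can never occur in valid UTF-8; the roundtrip encoding
  -- escapes it to the lone surrogate U+DC00 + 0xFF = U+DCFF
  let bytes = [0xFF] :: [Word8]
  s <- withArray (map fromIntegral bytes :: [CChar]) $ \p ->
         GHC.peekCStringLen enc (p, length bytes)
  putStrLn (showHex (ord (head s)) "")  -- prints "dcff"
```

This is exactly why such a `String` round-trips back to the original bytes, but is no longer valid Unicode text.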

Failures

The following examples need the library file-io, which makes use of AFPP and uses patched unix/Win32.

These examples outline various failures of the current encoding/decoding approach.

Failure 1: filesystem encoding == CP932, filepath is UTF-8

This works:

> import GHC.IO.Encoding
> getFileSystemEncoding
UTF-8
> Prelude.writeFile "experiments/ʀ�工ߧίß浇Q겿" ""

This doesn't:

> cp932Enc <- mkTextEncoding "CP932//ROUNDTRIP"
> setFileSystemEncoding cp932Enc
> Prelude.writeFile "experiments/ʀ�工ߧίß浇Q겿" ""
*** Exception: experiments/ʀ�工ߧίß浇Q겿: openFile: invalid argument (invalid character)

But with AFP it does:

> cp932Enc <- mkTextEncoding "CP932//ROUNDTRIP"
> setFileSystemEncoding cp932Enc
> import qualified System.File.AbstractFilePath as AFP
> :set -XQuasiQuotes
> AFP.writeFile [afp|experiments/ʀ�工ߧίß浇Q겿|] mempty
> :!ls experiments
ʀ�工ߧίß浇Q겿

Failure 2: filesystem encoding == CP932, filepath is CP932

CP932 contains multiple byte pairs that represent the same Unicode codepoint:

> import GHC.IO.Encoding
> import AbstractFilePath.Encoding
> cp932Enc <- mkTextEncoding "CP932//ROUNDTRIP"
> decodeWith cp932Enc (pack [0x87, 0x90])
Right "\8786"
> decodeWith cp932Enc (pack [0x81, 0xE0])
Right "\8786"

Let's write a file with each byte pair:

> import qualified System.AbstractFilePath as AFP
> import qualified System.File.AbstractFilePath as AFP
> import Data.Char
> :set -XQuasiQuotes
> :set -XOverloadedStrings
> AFP.writeFile ("experiments" AFP.</> AFP.packAFP (AFP.unsafeFromChar . chr <$> [0x81, 0xE0])) "lol1"
> AFP.writeFile ("experiments" AFP.</> AFP.packAFP (AFP.unsafeFromChar . chr <$> [0x87, 0x90])) "lol2"

And let's try to read them with directory:

> import GHC.IO.Encoding
> import System.Directory
> cp932Enc <- mkTextEncoding "CP932//ROUNDTRIP"
> setFileSystemEncoding cp932Enc
> listDirectory "experiments"
["\8786","\8786"]
> listDirectory "experiments" >>= \[fp1, fp2] -> Prelude.readFile ("experiments" System.FilePath.</> fp1) >>= \c1 -> Prelude.readFile ("experiments" System.FilePath.</> fp2) >>= \c2 -> print c1 >> print c2
"lol2"
"lol2"

We can't read the first file...

Using the new unix modules in conjunction with AFP, this works:

> :set -XOverloadedStrings
> import qualified System.File.PlatformFilePath as PFP
> import qualified System.AbstractFilePath.Posix as PFP
> import System.Posix.Directory.PosixFilePath
> ds <- openDirStream "experiments"
> fp <- readDirStream ds
> fp
"."
> fp <- readDirStream ds
> fp
"��"
> PFP.readFile ("experiments" PFP.</> fp)
"lol1"
> fp <- readDirStream ds
> fp
"��"
> PFP.readFile ("experiments" PFP.</> fp)
"lol2"

Failure 3: Comparing strings

If you compare strings (e.g. filepaths returned from base) with strings obtained via other means, the other strings may not use PEP-383 style roundtrip encoding, so equality tests will not behave as expected.
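A minimal sketch of the problem (the byte value is illustrative): a filename containing the Latin-1 byte 0xE9 ("é") comes back from base as the escape surrogate U+DCE9, which does not compare equal to the U+00E9 a user would type:

```haskell
import GHC.IO.Encoding (mkTextEncoding)
import qualified GHC.Foreign as GHC
import Foreign.Marshal.Array (withArray)
import Foreign.C.Types (CChar)
import Data.Word (Word8)

main :: IO ()
main = do
  enc <- mkTextEncoding "UTF-8//ROUNDTRIP"
  -- a lone 0xE9 byte is invalid UTF-8, so it decodes to the
  -- escape character U+DCE9, not to U+00E9 ("é")
  fromFs <- withArray (map fromIntegral [0xE9 :: Word8] :: [CChar]) $ \p ->
              GHC.peekCStringLen enc (p, 1)
  let fromUser = "\xE9"  -- "é" obtained elsewhere, e.g. user input
  print (fromFs == fromUser)  -- False
```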

An example of this in python can be seen here: http://blog.omega-prime.co.uk/2011/03/29/security-implications-of-pep-383/

More interoperability issues when running multiple converters over Unicode strings containing lone surrogates are explained here: https://unicode.org/L2/L2009/09236-pep383-problems.html

Failure 4: intermediate calls to setFileSystemEncoding

Continuing from "Failure 2":

> import GHC.IO.Encoding
> import System.Directory
> cp932Enc <- mkTextEncoding "CP932//ROUNDTRIP"
> setFileSystemEncoding cp932Enc
> fps <- listDirectory "experiments"
> fps
["\8786","\8786"]
> utf8enc <- mkTextEncoding "UTF-8//ROUNDTRIP"
> setFileSystemEncoding utf8enc
> let [fp1, fp2] = fps
> Prelude.readFile ("experiments" System.FilePath.</> fp1)
*** Exception: experiments/≒: openFile: does not exist (No such file or directory)
> Prelude.readFile ("experiments" System.FilePath.</> fp2)
*** Exception: experiments/≒: openFile: does not exist (No such file or directory)

Failure 5: serializing filepaths

When serializing String based filepaths, we lose the underlying encoding that was used to decode them. This is a combination of Failures 3 and 4.
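To illustrate: the String `"\8786"` from Failure 2 was decoded from CP932 bytes `0x87 0x90` (or `0x81 0xE0`), but serializing it, e.g. as UTF-8, yields a third, unrelated byte sequence, so the on-disk bytes are unrecoverable. A minimal sketch:

```haskell
import GHC.IO.Encoding (utf8)
import qualified GHC.Foreign as GHC
import Foreign.Marshal.Array (peekArray)
import Foreign.Ptr (castPtr)
import Data.Word (Word8)

main :: IO ()
main = do
  -- "\8786" (U+2252) came from CP932 bytes 0x87 0x90 or 0x81 0xE0;
  -- its UTF-8 serialization is 0xE2 0x89 0x92 -- neither of the
  -- original byte sequences can be reconstructed from it.
  bytes <- GHC.withCStringLen utf8 "\8786" $ \(p, len) ->
             peekArray len (castPtr p) :: IO [Word8]
  print bytes  -- [226,137,146], i.e. 0xE2 0x89 0x92
```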

Failure 6: re-encoding PEP-383 UTF-8 based filepath

The lone surrogate code points used for escaping are "not quite Unicode": e.g. you can't encode such a filepath as UTF-16, or even roundtrip it through a strict UTF-8 converter:

> import GHC.IO.Encoding
> utf8_RT <- mkTextEncoding "UTF-8//ROUNDTRIP"
> utf8_STRICT <- mkTextEncoding "UTF-8"
> decodeWith utf8_RT (pack [0x87, 0x90]) >>= encodeWith utf8_STRICT
Left Cannot decode input: recoverEncode: invalid argument (invalid character)

> decodeWith utf8_STRICT (pack [0x70, 0x70]) >>= encodeWith utf16le
Right "p\NULp\NUL"
> decodeWith utf8_RT (pack [0x87, 0x90]) >>= encodeWith utf16le
Left Cannot decode input: recoverEncode: invalid argument (invalid character)

Alternatives

WTF-8

This is a private spec used in Rust: https://simonsapin.github.io/wtf-8/

It is not meant to be exposed, but serves only as an internal encoding.
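For illustration, a minimal sketch of a WTF-8 encoder (a hypothetical helper, not Rust's implementation): WTF-8 is identical to UTF-8, except that lone surrogates (U+D800..U+DFFF) are also accepted and encoded with the ordinary 3-byte pattern instead of being rejected:

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- Sketch of generalized-UTF-8 encoding as used by WTF-8: no
-- validity check rejects surrogate code points, so a lone
-- surrogate gets the normal 3-byte encoding.
encodeWtf8 :: Int -> [Word8]
encodeWtf8 c
  | c < 0x80    = [b c]
  | c < 0x800   = [0xC0 .|. b (c `shiftR` 6), cont 0]
  | c < 0x10000 = [0xE0 .|. b (c `shiftR` 12), cont 6, cont 0]
  | otherwise   = [0xF0 .|. b (c `shiftR` 18), cont 12, cont 6, cont 0]
  where
    b :: Int -> Word8
    b = fromIntegral
    cont n = 0x80 .|. (b (c `shiftR` n) .&. 0x3F)

main :: IO ()
main = print (encodeWtf8 0xDCFF)  -- [237,179,191], i.e. ED B3 BF
```

This makes escaped bytes representable without information loss, but the result must never leak out as "UTF-8", which is why the spec keeps it internal.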

Enforcing UTF-8 on unix (with PEP-383)

If UTF-8 is enforced, then PEP-383 encoding is mostly total, and roundtripping system filepaths always works regardless of locale configuration.

However, encoding an arbitrary Haskell Char is not total:

> import GHC.IO.Encoding
> utf8_RT <- mkTextEncoding "UTF-8//ROUNDTRIP"
> encodeWith utf8_RT ([toEnum 0xDFF0, toEnum 0xDFF2])
Left Cannot decode input: recoverEncode: invalid argument (invalid character)

Benefits of AbstractFilePath over alternatives

  1. doesn't use a private (or any) encoding: it passes the bytes to and from the system API as-is
  2. serialization can be well-defined with no ambiguity or need of interpretation on the other end
  3. We don't lose the underlying encoding information
  4. library authors don't have to know about custom language specific encodings, PEP-383 or other things
    • if you process/manipulate filepaths with PEP-383 UTF-8, you have to be aware of the meaning of lone surrogates
    • you have to be aware that it isn't strict UTF-8
  5. forcing UTF-8 for filepath encoding is a breaking change and could cause subtle bugs