String initializers with encoding validation


Proposal: SE-0405 String initializers with encoding validation
Author: Guillaume Lessard
Review Manager: TBD
Status: Pitch
Bugs: rdar://99276048, rdar://99832858
Implementation: (pending)


Note

This proposal was reviewed and accepted. The accepted version of the proposal is here. Below is an earlier version from the pitch phase.
Introduction

We propose adding new String failable initializers that validate encoded input, and return nil when the input contains any invalid elements.
Motivation

The String type guarantees that it represents well-formed Unicode text. When data representing text is received from a file, the network, or some other source, it may be relevant to store it in a String, but that data must be validated first. String already provides a way to transform data to valid Unicode by repairing invalid elements, but such a transformation is often not desirable, especially when dealing with untrusted sources. For example, a JSON decoder cannot transform its input; it must fail if a span representing text contains any invalid UTF-8.
This functionality has not been available directly from the standard library. It is possible to compose it using existing public API, but only at the cost of extra memory copies and allocations. The standard library is uniquely positioned to implement this functionality in a performant way.
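To illustrate the difference, the following sketch contrasts the existing repairing initializer `String(decoding:as:)` with one way of composing validation from existing public API. The helper name `validatedString(_:as:)` is hypothetical and not part of this proposal. It relies on the standard library's `transcode(_:from:to:stoppingOnError:into:)` function and, as noted above, pays for an intermediate copy of the decoded scalars.

```swift
// Repairing: the ill-formed byte 0xFF is silently replaced by U+FFFD.
let illFormed: [UInt8] = [0x43, 0xFF]
let repaired = String(decoding: illFormed, as: UTF8.self)
// repaired == "C\u{FFFD}"

// Validating, composed from existing API (hypothetical helper).
func validatedString<Encoding: Unicode.Encoding>(
  _ codeUnits: some Sequence<Encoding.CodeUnit>, as encoding: Encoding.Type
) -> String? {
  // Intermediate storage: this is the extra copy the proposal wants to avoid.
  var scalars = String.UnicodeScalarView()
  let hadError = transcode(
    codeUnits.makeIterator(), from: encoding, to: UTF32.self,
    stoppingOnError: true
  ) { codeUnit in
    // Successfully transcoded UTF-32 code units are valid Unicode scalars.
    scalars.append(Unicode.Scalar(codeUnit)!)
  }
  return hadError ? nil : String(scalars)
}
```

With this helper, `validatedString(illFormed, as: UTF8.self)` is `nil`, whereas the repairing initializer above produces a string that no longer matches its input.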
Proposed Solution

We will add a new String initializer that can fail, returning nil, when its input is found to be invalid according to the encoding represented by a type parameter that conforms to Unicode.Encoding.
extension String {
  public init?<Encoding: Unicode.Encoding>(
  	validating codeUnits: some Sequence<Encoding.CodeUnit>, as: Encoding.Type
  )
}
For convenience and discoverability, we will also provide initializers that specify the input encoding as part of an argument label:
extension String {
  public init?(validatingFromUTF8 codeUnits: some Sequence<UTF8.CodeUnit>)

  public init?(validatingFromUTF16 codeUnits: some Sequence<UTF16.CodeUnit>)

  public init?(validatingFromUTF32 codeUnits: some Sequence<UTF32.CodeUnit>)
}
These will construct a new String, returning nil when their input is found invalid according to the encoding specified by the label.
When dealing with data obtained from C, it is frequently the case that UTF-8 data is represented by CChar rather than UInt8. We will provide a convenience initializer for this use case, noting that it typically involves contiguous memory, and as such is well-served by explicitly using an abstraction for contiguous memory (UnsafeBufferPointer<CChar>):
extension String {
  public init?(validatingFromUTF8 codeUnits: UnsafeBufferPointer<CChar>)
}
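As a rough illustration of what this use case looks like with existing API, the sketch below views a contiguous `CChar` buffer as `UInt8` code units without copying. It uses `utf8CString` to obtain portable `CChar` data and the repairing initializer `String(decoding:as:)` as a stand-in, since the validating initializer is what this proposal would add.

```swift
// "Café" as a null-terminated buffer of CChar, as a C API might provide it.
let cChars = "Café".utf8CString

let text = cChars.withUnsafeBufferPointer { buffer -> String in
  // Drop the trailing null terminator; it is not part of the payload.
  let payload = UnsafeBufferPointer(rebasing: buffer.dropLast())
  // View the same memory as UInt8 code units, without copying.
  return payload.withMemoryRebound(to: UInt8.self) { bytes in
    String(decoding: bytes, as: UTF8.self)
  }
}
```

The proposed `UnsafeBufferPointer<CChar>` initializer would make the rebinding step unnecessary at the call site.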
String already features a validating initializer for UTF-8 input. It is intended for C interoperability, but its argument label does not convey the expectation that its input is a null-terminated C string. We propose to rename it in order to clarify this:
extension String {
  public init?(validatingCString nullTerminatedUTF8: UnsafePointer<CChar>)

  @available(Swift 5.XLIX, deprecated, renamed:"String.init(validatingCString:)")
  public init?(validatingUTF8 cString: UnsafePointer<CChar>)
}
Detailed Design

We want these new initializers to be performant. As such, their implementation should minimize the number of memory allocations and copies required. We achieve this performance with @inlinable implementations that leverage withContiguousStorageIfAvailable to provide a concrete (internal) code path for the validation cases. The concrete internal initializer itself calls a number of functions internal to the standard library.
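The dispatch pattern described above can be sketched as follows. This is only an approximation under stated assumptions: `validateUTF8(_:)` and `makeValidatedString(fromUTF8:)` are hypothetical names, and the stand-in validator uses the public `transcode` function rather than the standard library's internal routines.

```swift
// Stand-in for the concrete internal code path (hypothetical).
func validateUTF8(_ buffer: UnsafeBufferPointer<UInt8>) -> String? {
  var scalars = String.UnicodeScalarView()
  let hadError = transcode(
    buffer.makeIterator(), from: UTF8.self, to: UTF32.self,
    stoppingOnError: true
  ) { scalars.append(Unicode.Scalar($0)!) }
  return hadError ? nil : String(scalars)
}

func makeValidatedString(fromUTF8 codeUnits: some Sequence<UInt8>) -> String? {
  // Fast path: the sequence exposes contiguous storage, so validate in place.
  // The outer optional is nil when no contiguous storage is available.
  let fastPath: String?? = codeUnits.withContiguousStorageIfAvailable {
    validateUTF8($0)
  }
  if let result = fastPath { return result }

  // Slow path: copy the elements into contiguous storage first.
  return Array(codeUnits).withUnsafeBufferPointer(validateUTF8)
}
```

An `Array` argument takes the fast path, while a lazy sequence such as `bytes.lazy.map { $0 }` falls back to the copying path; both produce the same result.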
extension String {
  /// Create a new `String` by copying and validating the sequence of
  /// code units passed in, according to the specified encoding.
  ///
  /// This initializer does not try to repair ill-formed code unit sequences.
  /// If any are found, the result of the initializer is `nil`.
  ///
  /// The following example calls this initializer with the contents of two
  /// different arrays---first with a well-formed UTF-8 code unit sequence and
  /// then with an ill-formed UTF-16 code unit sequence.
  ///
  ///     let validUTF8: [UInt8] = [67, 97, 102, 195, 169]
  ///     let valid = String(validating: validUTF8, as: UTF8.self)
  ///     print(valid)
  ///     // Prints "Optional("Café")"
  ///
  ///     let invalidUTF16: [UInt16] = [0x41, 0x42, 0xd801]
  ///     let invalid = String(validating: invalidUTF16, as: UTF16.self)
  ///     print(invalid)
  ///     // Prints "nil"
  ///
  /// - Parameters:
  ///   - codeUnits: A sequence of code units that encode a `String`
  ///   - encoding: An implementation of `Unicode.Encoding` that should be used
  ///               to decode `codeUnits`.
  @inlinable
  public init?<Encoding>(
    validating codeUnits: some Sequence<Encoding.CodeUnit>, as: Encoding.Type
  ) where Encoding: Unicode.Encoding
}
extension String {
  /// Create a new `String` by copying and validating the sequence of
  /// UTF-8 code units passed in.
  ///
  /// This initializer does not try to repair ill-formed code unit sequences.
  /// If any are found, the result of the initializer is `nil`.
  ///
  /// The following example calls this initializer with the contents of two
  /// different arrays---first with a well-formed UTF-8 code unit sequence and
  /// then with an ill-formed code unit sequence.
  ///
  ///     let validUTF8: [UInt8] = [67, 97, 102, 195, 169]
  ///     let valid = String.init(validatingFromUTF8: validUTF8)
  ///     print(valid)
  ///     // Prints "Optional("Café")"
  ///
  ///     let invalidUTF8: [UInt8] = [67, 195, 0]
  ///     let invalid = String.init(validatingFromUTF8: invalidUTF8)
  ///     print(invalid)
  ///     // Prints "nil"
  ///
  /// - Parameters:
  ///   - codeUnits: A sequence of code units that encode a `String`
  public init?(validatingFromUTF8 codeUnits: some Sequence<UTF8.CodeUnit>)

  /// Create a new `String` by copying and validating the sequence of
  /// UTF-16 code units passed in.
  ///
  /// This initializer does not try to repair ill-formed code unit sequences.
  /// If any are found, the result of the initializer is `nil`.
  ///
  /// The following example calls this initializer with the contents of two
  /// different arrays---first with a well-formed UTF-16 code unit sequence and
  /// then with an ill-formed code unit sequence.
  ///
  ///     let validUTF16: [UInt16] = [67, 97, 102, 233]
  ///     let valid = String(validatingFromUTF16: validUTF16)
  ///     print(valid)
  ///     // Prints "Optional("Café")"
  ///
  ///     let invalidUTF16: [UInt16] = [0x41, 0x42, 0xd801]
  ///     let invalid = String(validatingFromUTF16: invalidUTF16)
  ///     print(invalid)
  ///     // Prints "nil"
  ///
  /// - Parameters:
  ///   - codeUnits: A sequence of code units that encode a `String`
  public init?(validatingFromUTF16 codeUnits: some Sequence<UTF16.CodeUnit>)

  /// Create a new `String` by copying and validating the sequence of
  /// UTF-32 code units passed in.
  ///
  /// This initializer does not try to repair ill-formed code unit sequences.
  /// If any are found, the result of the initializer is `nil`.
  ///
  /// The following example calls this initializer with the contents of two
  /// different arrays---first with correct UTF-32 code units and then with
  /// a sequence containing an ill-formed code unit.
  ///
  ///     let validUTF32: [UInt32] = [67, 97, 102, 233]
  ///     let valid = String(validatingFromUTF32: validUTF32)
  ///     print(valid)
  ///     // Prints "Optional("Café")"
  ///
  ///     let invalidUTF32: [UInt32] = [0x41, 0x42, 0xd801]
  ///     let invalid = String(validatingFromUTF32: invalidUTF32)
  ///     print(invalid)
  ///     // Prints "nil"
  ///
  /// - Parameters:
  ///   - codeUnits: A sequence of code units that encode a `String`
  public init?(validatingFromUTF32 codeUnits: some Sequence<UTF32.CodeUnit>)
}
extension String {
  /// Create a new `String` by copying and validating the sequence of `CChar`
  /// passed in, by interpreting them as UTF-8 code units.
  ///
  /// This initializer does not try to repair ill-formed code unit sequences.
  /// If any are found, the result of the initializer is `nil`.
  ///
  /// The following example calls this initializer with pointers to the
  /// contents of two different `CChar` arrays---first with a well-formed UTF-8
  /// code unit sequence and then with an ill-formed code unit sequence.
  ///
  ///     let validUTF8: [CChar] = [67, 97, 102, -61, -87]
  ///     validUTF8.withUnsafeBufferPointer {
  ///         let s = String.init(validatingFromUTF8: $0)
  ///         print(s)
  ///     }
  ///     // Prints "Optional("Café")"
  ///
  ///     let invalidUTF8: [CChar] = [67, -61, 0]
  ///     invalidUTF8.withUnsafeBufferPointer {
  ///         let s = String.init(validatingFromUTF8: $0)
  ///         print(s)
  ///     }
  ///     // Prints "nil"
  ///
  /// - Parameters:
  ///   - codeUnits: A buffer of `CChar` values, interpreted as the UTF-8
  ///                code units that encode a `String`
  public init?(validatingFromUTF8 codeUnits: UnsafeBufferPointer<CChar>)
}
extension String {
  /// Create a new string by copying and validating the null-terminated UTF-8
  /// data referenced by the given pointer.
  ///
  /// This initializer does not try to repair ill-formed UTF-8 code unit
  /// sequences. If any are found, the result of the initializer is `nil`.
  ///
  /// The following example calls this initializer with pointers to the
  /// contents of two different `CChar` arrays---first with a well-formed
  /// UTF-8 code unit sequence and then with a sequence that is ill-formed
  /// at its end.
  ///
  ///     let validUTF8: [CChar] = [67, 97, 102, -61, -87, 0]
  ///     validUTF8.withUnsafeBufferPointer { ptr in
  ///         let s = String(validatingCString: ptr.baseAddress!)
  ///         print(s)
  ///     }
  ///     // Prints "Optional("Café")"
  ///
  ///     let invalidUTF8: [CChar] = [67, 97, 102, -61, 0]
  ///     invalidUTF8.withUnsafeBufferPointer { ptr in
  ///         let s = String(validatingCString: ptr.baseAddress!)
  ///         print(s)
  ///     }
  ///     // Prints "nil"
  ///
  /// - Parameter nullTerminatedCodeUnits: A pointer to a null-terminated
  ///   sequence of UTF-8 code units.
  @_silgen_name("sSS14validatingUTF8SSSgSPys4Int8VG_tcfC")
  public init?(validatingCString nullTerminatedCodeUnits: UnsafePointer<CChar>)
  
  @available(Swift 5.XLIX, deprecated, renamed:"String.init(validatingCString:)")
  @_silgen_name("_swift_stdlib_legacy_String_validatingUTF8")
  @_alwaysEmitIntoClient
  public init?(validatingUTF8 cString: UnsafePointer<CChar>)
}
Source Compatibility

This proposal is strictly additive.
ABI Compatibility

This proposal adds new functions to the ABI.
Implications on adoption

This feature requires a new version of the standard library.
Alternatives considered

The validatingUTF8 argument label

The argument label validatingUTF8 seems like it would have been preferable to validatingFromUTF8, but using the former would have been source-breaking. The C string validation initializer takes an UnsafePointer<UInt8>, but it also accepts a [UInt8] via implicit pointer conversion. Any use site that passes a [UInt8] to the C string validation initializer would have changed behaviour upon recompilation, from treating a null character (\0) as the termination of the C string to treating it as a valid character.
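The behavioural difference at stake can be seen with existing API. The sketch below uses the repairing initializers `String(cString:)` and `String(decoding:as:)` as stand-ins, on the assumption that they treat the null character the same way as their validating counterparts would.

```swift
let bytes: [UInt8] = [0x48, 0x69, 0x00, 0x21]  // "Hi", NUL, "!"

// C string interpretation: reading stops at the first null character.
let asCString = bytes.withUnsafeBufferPointer {
  String(cString: $0.baseAddress!)
}

// Sequence interpretation: the null character is an ordinary scalar.
let asSequence = String(decoding: bytes, as: UTF8.self)
```

Here `asCString` is `"Hi"`, while `asSequence` holds all four scalars, including the embedded null character.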
Have the CChar-validating function take a parameter of type some Sequence<CChar>

This would produce a compile-time ambiguity on platforms where CChar is typealiased to UInt8 rather than Int8. Using UnsafeBufferPointer<CChar> as the parameter type will avoid such a compile-time ambiguity.
Future directions

Throw an error containing details of a validation failure

When decoding a byte stream, obtaining the details of a validation failure would be useful in order to diagnose issues. We would like to provide this functionality, but the current input validation functionality is not well-suited for it. This is left as a future improvement.
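As a purely speculative sketch of what such details might include, the code below locates the first ill-formed UTF-8 sequence using the public `Unicode.UTF8.ForwardParser`. The type `UTF8ValidationFailure` and the function `firstUTF8ValidationFailure(in:)` are hypothetical names invented for this illustration; they are not proposed API.

```swift
// Hypothetical error type; not part of this proposal.
struct UTF8ValidationFailure: Error, Equatable {
  var offset: Int  // index of the first ill-formed code unit
  var length: Int  // number of code units the parser reports as ill-formed
}

func firstUTF8ValidationFailure(
  in codeUnits: some Sequence<UInt8>
) -> UTF8ValidationFailure? {
  var parser = Unicode.UTF8.ForwardParser()
  var iterator = codeUnits.makeIterator()
  var offset = 0
  while true {
    switch parser.parseScalar(from: &iterator) {
    case .valid(let encodedScalar):
      // Advance past the code units of the well-formed scalar.
      offset += encodedScalar.count
    case .emptyInput:
      // Reached the end without encountering an error.
      return nil
    case .error(let length):
      return UTF8ValidationFailure(offset: offset, length: length)
    }
  }
}
```

For the input `[67, 195, 0]` used in the examples above, this reports a failure starting at offset 1.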
Acknowledgements

Thanks to Michael Ilseman, Tina Liu and Quinn Quinn for discussions about input validation issues.
SE-0027 by Zachary Waldowski was reviewed in February 2016, covering similar ground. It was rejected at the time because the design of String had not been finalized. The name String.init(validatingCString:) was suggested as part of SE-0027. Lily Ballard later pitched a renaming of String.init(validatingUTF8:), citing consistency with other String API involving C strings.