Add a trim() method to String


Proposal: SE-0TBD
Author(s): Andrew McKnight, Erica Sadun
Status: TBD


Introduction

This proposal adds a new trim() method to the standard library. It removes leading and trailing whitespace using the Regex and Unicode notion of whitespace.
This proposal was first discussed on the Swift Evolution list in the "Surveying how Swift evolves", "String Hygiene", and "Corner-cases in Character classification of whitespace" threads.

Motivation

Surveying Swift utility libraries on GitHub revealed many interesting trends. String trimming was by far the most popular third-party customization for Swift.
These are the top 10 function declarations from extensions on String, in their canonical form from swiftc:

24 trim() -> String
13 substring(from: Int) -> String
12 substring(to: Int) -> String
11 isValidEmail() -> Bool
10 trimmed() -> String
10 toBool() -> Bool?
10 height(withConstrainedWidth width: CGFloat, font: UIFont) -> CGFloat
9 trim()
9 toDouble() -> Double?
9 isNumber() -> Bool

The count for the first of these, 24, is misleading, as trimming also appears in the #5 and #8 spots, including a mutating variation. It appears in many forms further down the list, with 84 methods in just this sample:

24 trim() -> String
10 trimmed() -> String
9 trim()
3 trimPhoneNumberString() -> String
3 trimNewLine() -> String
3 trimForNewLineCharacterSet() -> String
2 trimmedRight(characterSet set: NSCharacterSet = default) -> String
2 trimmedLeft(characterSet set: NSCharacterSet = default) -> String
1 trimmingWhitespacesAndNewlines() -> String
1 trimmedStart(characterSet set: CharacterSet = default) -> String
1 trimmedRight() -> String
1 trimmedLeft() -> String
1 trimmedEnd(characterSet set: CharacterSet = default) -> String
1 trimWhitespace() -> String
1 trimPrefix(prefix: String)
1 trimInside() -> String
1 trimDuplicates() -> String
1 trim(trim: String) -> String
1 trim(_ characters: String) -> String
1 trim(_ characterSet: CharacterSet) -> <>
1 stringByTrimmingTailCharactersInSet(_ set: CharacterSet) -> String
1 sk4TrimSpaceNL() -> String
1 sk4TrimSpace() -> String
1 sk4Trim(str: String) -> String
1 sk4Trim(charSet: NSCharacterSet) -> String
1 prefixTrimmed(prefix: String) -> String
1 omTrim()
1 m_trimmed() -> String
1 m_trim()
1 jjs_trimWhitespaceAndNewline() -> String
1 jjs_trimWhitespace() -> String
1 jjs_trimNewline() -> String
1 jjs_emptyOrStringAndTrim(str: String?) -> String
1 hyb_trimRight(trimNewline: Bool = default) -> String
1 hyb_trimLeft(trimNewline: Bool = default) -> String
1 hyb_trim(trimNewline: Bool = default) -> String

Many developers are solving the same problem in the same way. Trimming is sufficiently universal to justify inclusion in the standard library in Swift 5.

Detailed Design

This proposal trailblazes a new area of community-driven design. As a result, its design has had to account for several challenges, some moving with conventional wisdom and some against it.

Wrapping NSString

This implementation does not wrap NSString's trimmingCharacters(in:) API, ensuring that it can be decoupled from Cocoa and Cocoa Touch for use on other platforms.
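For contrast, a Foundation-backed helper of the kind this design avoids might look like the sketch below. The name foundationTrimmed() is hypothetical and not part of this proposal; trimmingCharacters(in:) and CharacterSet.whitespacesAndNewlines are existing Foundation API.

```swift
import Foundation

extension String {
  // Hypothetical helper: delegates to Foundation instead of
  // implementing trimming in pure Swift, so it requires Foundation
  // on every platform where it is used.
  func foundationTrimmed() -> String {
    return trimmingCharacters(in: .whitespacesAndNewlines)
  }
}

let padded = "  Hello, world\n"
print(padded.foundationTrimmed()) // prints "Hello, world"
```

A pure-Swift implementation avoids the import entirely, which is the decoupling this section describes.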

Character Sets

This implementation uses NSString's categorization of newlines and whitespace, specifically Unicode General Category Z*, U+000A ~ U+000D, and U+0085. This is not a user-facing detail, and the discussion and implementation of Swift-only, standards-based character sets can be resolved at a future time.
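The categorization described here can also be written as a predicate over Unicode scalars. The following is a minimal sketch that assumes Swift 5's Unicode.Scalar.Properties; the function name isWhitespaceOrNewline is hypothetical, and the preliminary implementation in this proposal uses a literal set of characters instead.

```swift
// Hypothetical helper: applies the rule "General Category Z*,
// U+000A ~ U+000D, and U+0085" to a single Unicode scalar.
func isWhitespaceOrNewline(_ scalar: Unicode.Scalar) -> Bool {
  switch scalar.properties.generalCategory {
  case .spaceSeparator, .lineSeparator, .paragraphSeparator:
    // Zs, Zl, and Zp together make up General Category Z*.
    return true
  default:
    break
  }
  // The explicitly listed control characters.
  return (0x000A...0x000D).contains(scalar.value) || scalar.value == 0x0085
}

let spaceMatches = isWhitespaceOrNewline(" ")               // true (Zs)
let ideographicMatches = isWhitespaceOrNewline("\u{3000}")  // true (Zs)
let newlineMatches = isWhitespaceOrNewline("\n")            // true (U+000A)
let letterMatches = isWhitespaceOrNewline("a")              // false
// U+0009 (tab) is a control character outside the listed ranges,
// so this rule, read literally, does not include it.
let tabMatches = isWhitespaceOrNewline("\t")                // false
```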

Return Type

This implementation returns a String rather than a Substring. That is the simplest tool for the job, it best matches the community-sourced problem this proposal is trying to solve, and it follows StringProtocol's prior art: StringProtocol declares func trimmingCharacters(in set: CharacterSet) -> String, which also returns a String.

The API should be as useful as possible and as Swifty as possible, but if it returns substrings, third-party libraries will start implementing var trimmedAsString because the API does not give people the tool they want and need.
Producing a string is neither the most efficient approach nor the most general, but it provides tooling that expresses the task common to an overwhelming number of use cases.
A full trimming API that lets you select direction and exclusion set lies outside the scope of this proposal. That full API might be able to work on any bidirectional collection and any element set. Or it might simply cover StringProtocol, UnicodeScalarView, String, and Substring.
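To illustrate the trade-off, here is a rough sketch of a Substring-returning variant. The name trimmedSlice() is hypothetical, and Character.isWhitespace stands in for the character set used by this proposal, so the two may classify some characters differently.

```swift
extension String {
  // Hypothetical variant: returns a slice that shares storage with
  // the original string instead of allocating a new String.
  func trimmedSlice() -> Substring {
    guard let first = firstIndex(where: { !$0.isWhitespace }),
          let last = lastIndex(where: { !$0.isWhitespace })
    else { return self[endIndex ..< endIndex] }
    return self[first ... last]
  }
}

// A stand-in for any API that accepts String, as most do.
func characterCount(of text: String) -> Int {
  return text.count
}

let input = "  padded  "
let slice = input.trimmedSlice()   // Substring
// characterCount(of: slice) does not compile; the caller must convert.
let owned = String(slice)
let count = characterCount(of: owned)
```

The extra String(...) conversion at each call site is the friction that would likely push libraries toward helpers such as trimmedAsString.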

Enumerations versus Option Sets

Quite a lot of the preliminary discussion of this proposal covered whether it was better to use enumerations or option sets to provide an affordance that allows trimming from one side or the other. This proposal uses an enumeration for the following reasons:

There is no precedent in the standard library for using option set arguments.
Option set call-site vocabulary is overly large for the needs of the API. You can call the function using static members (for example, .start), with set-array notation (for example, [.start]), and raw value initialization.
Raw value initialization, in particular, permits call-sites to use meaningless values that are legal, sanctioning poor call-site hygiene.
The number of customizations will never be more than 2 and call-sites should use either none or one. Calling with no options is preferable to [.start, .end] or even [.end, .start].
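The call-site difference between the two designs can be sketched as follows. TrimOptions is a hypothetical option set written only for comparison and is not part of this proposal; the Trimming enumeration mirrors the cases used by the preliminary implementation, and description(of:) is an illustrative helper.

```swift
// Hypothetical option-set spelling of the trimming direction.
struct TrimOptions: OptionSet {
  let rawValue: Int
  static let start = TrimOptions(rawValue: 1 << 0)
  static let end = TrimOptions(rawValue: 1 << 1)
}

// An option set accepts several spellings for the same request...
let viaMember: TrimOptions = .start
let viaArray: TrimOptions = [.start]
let bothSides: TrimOptions = [.start, .end]
let bothSidesReordered: TrimOptions = [.end, .start]

// ...and a raw value that names no side at all still compiles.
let meaningless = TrimOptions(rawValue: 64)

// Enumeration spelling: exactly one value per request, and the
// compiler checks that every case is handled.
enum Trimming { case start, end, full }

func description(of side: Trimming) -> String {
  switch side {
  case .start: return "leading"
  case .end: return "trailing"
  case .full: return "leading and trailing"
  }
}
```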

Preliminary Implementation

extension String {
  /// The direction from which a string is trimmed, where `full`
  /// (the typical default) trims from the `start` and `end`.
  public enum Trimming { case start, end, full }

  /// Whitespace and newline characters, which are defined as Unicode General
  /// Category Z* (Zl, Zp, Zs), U+000A ~ U+000D, and U+0085.
  public static let whitespaceAndNewlineCharacters: Set<Character> = [
    // [Zl]: Unicode Characters Category 'Separator, Line'
    "\u{2028}", // LINE SEPARATOR
    
    // [Zp]: Unicode Character Category 'Separator, Paragraph'
    "\u{2029}", // PARAGRAPH SEPARATOR
    
    // [Zs]: Unicode Character Category 'Separator, Space'
    "\u{0020}", // SPACE
    "\u{00A0}", // NO-BREAK SPACE
    "\u{1680}", // OGHAM SPACE MARK
    "\u{2000}", // EN QUAD
    "\u{2001}", // EM QUAD
    "\u{2002}", // EN SPACE
    "\u{2003}", // EM SPACE
    "\u{2004}", // THREE-PER-EM SPACE
    "\u{2005}", // FOUR-PER-EM SPACE
    "\u{2006}", // SIX-PER-EM SPACE
    "\u{2007}", // FIGURE SPACE
    "\u{2008}", // PUNCTUATION SPACE
    "\u{2009}", // THIN SPACE
    "\u{200A}", // HAIR SPACE
    "\u{202F}", // NARROW NO-BREAK SPACE
    "\u{205F}", // MEDIUM MATHEMATICAL SPACE
    "\u{3000}", // IDEOGRAPHIC SPACE
    
    // U+000A ~ U+000D, and U+0085, per Foundation documentation
    // for
    "\u{000A}",
    "\u{000B}",
    "\u{000C}",
    "\u{000D}",
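    "\u{0085}",

    // CR LF forms a single grapheme cluster, so it is its own `Character`
    // and does not match the separate U+000D and U+000A entries above.
    "\r\n",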
    ]
  
  /// Returns a new string with whitespace removed from the start,
  /// the end, or both ends of the source string. Whitespace characters
  /// are defined as Unicode General Category Z*,
  /// U+000A ~ U+000D, and U+0085.
  ///
  /// Trimming takes place over the characters of a string,
  /// so Unicode grapheme clusters have already been resolved
  /// by the time each character is tested. For example, the
  /// sequence `\r\n` is tested as a single `Character`.
  ///
  /// - Parameter trim: The direction from which to trim. Legal values
  ///   are `.start`, `.end`, and `.full`. If omitted, the string is
  ///   trimmed from both sides.
  /// - Returns: A string trimmed of its leading whitespace, its
  ///   trailing whitespace, or both, according to `trim`.
  public func trimmed(from trim: String.Trimming = .full) -> String {
    // Ensure that this implementation does not rely on the
    // NSString implementation of trimmingCharacters(in: .whitespacesAndNewlines)
    
    guard !isEmpty else { return self }
    var (trimStart, trimEnd) = (startIndex, index(before: endIndex))

    // When trimming the start, advance to the first non-whitespace
    // character. A string made up entirely of whitespace trims to "".
    if trim != .end {
      guard let start = indices.first(where: {
        !String.whitespaceAndNewlineCharacters.contains(self[$0])
      }) else { return "" }
      trimStart = start
    }

    // When trimming the end, back up to the last non-whitespace character.
    if trim != .start {
      guard let end = indices.last(where: {
        !String.whitespaceAndNewlineCharacters.contains(self[$0])
      }) else { return "" }
      trimEnd = end
    }

    return String(self[trimStart ... trimEnd])
  }
  
  /// Trims a string in place by removing whitespace from the start,
  /// the end, or both ends of the source string. Whitespace characters
  /// are defined as Unicode General Category Z*,
  /// U+000A ~ U+000D, and U+0085.
  ///
  /// - Parameter trim: The direction from which to trim. Legal values
  ///   are `.start`, `.end`, and `.full`. If omitted, the string is
  ///   trimmed from both sides.
  public mutating func trim(from trim: String.Trimming = .full) {
    self = self.trimmed(from: trim)
  }
}
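
Because trimming operates on Character values rather than on Unicode scalars, it is worth checking how line endings cluster. The sketch below relies only on standard String behavior; the two sets are hypothetical and exist only for this illustration.

```swift
// A CR LF pair is two Unicode scalars but one Character.
let crlf = "\r\n"
let characterCount = crlf.count               // 1
let scalarCount = crlf.unicodeScalars.count   // 2

// Hypothetical set that lists CR and LF only as separate entries.
let separateEntries: Set<Character> = ["\r", "\n"]
let matchesSeparately = separateEntries.contains(crlf.first!)  // false

// Hypothetical set that also lists the pair as its own entry.
let withPair: Set<Character> = ["\r", "\n", "\r\n"]
let matchesWithPair = withPair.contains(crlf.first!)           // true
```

A character-based whitespace set therefore needs an explicit entry for the pair if strings with Windows-style line endings are expected to trim cleanly.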

Alternatives Considered

Not adopting this proposal.

Source compatibility

This proposal is strictly additive.

Effect on ABI stability

This proposal does not affect ABI stability.

Effect on API resilience

This proposal does not affect API resilience.