Multiline string literals
- Proposal: SE-NNNN
- Author(s): Brent Royal-Gordon, John Holdsworth, Tyler Cloutier
- Status: Third Draft
- Review manager: TBD
In Swift 2.2, the only means to insert a newline into a string literal is the
\n escape. String literals specified in this way are generally
ugly and unreadable. We propose a multiline string feature emphasizing
code readability which is a straightforward extension of our existing
string literals and appears to be exceptionally easy to implement.
This proposal is the first step in a larger plan to improve how string literals address various challenging use cases. By itself, it is not meant to solve all problems with string escaping or the representation of long string literals. However, it is an important step in that direction, and it does modestly improve on the status quo even for those use cases which we intend to address more directly later.
See the "Motivation" section for the overall goals of this project, and the "Future directions for string literals in general" section for a sketch of how we might achieve those goals.
Third draft notes
Expands the "Motivation" section to discuss the overall string improvement project, including a list of goals, and reworks the "Future directions for string literals in general" section to theme it around the new list of goals and to discuss the sketched designs and some alternatives in more detail.
Added a change to the formal grammar.
Added discussion of John Holdsworth's prototype and the implementation lessons from it.
Fills in co-authors. (I did the proposal drafting, John did the prototyping, and Tyler had the original idea and offered particularly detailed critiques.)
Motivation
Swift's string literals include the minimum viable feature set: they have quoting and escaping features sufficient to embed any Unicode string in a Swift source code file. They also support interpolation, a feature which makes constructing strings from dynamic data much easier. This is a great foundation for Swift string handling.
However, as Swift begins to move into roles beyond native app development, code which needs to generate text becomes a more important use case. Whether you're emitting HTML or XML, writing configuration files, generating source code in Swift or another language, or just showing long textual messages to the user, you need string literals to be more than just "minimum viable". We must move beyond that, making it easy and even pleasant to embed long, complex, and special-character-ridden strings into Swift code.
There are four very important areas where we think string literals need to improve:
- Putting newlines in string literals.
- Putting backslashes in string literals.
- Putting quote marks in string literals.
- Putting very large quantities of text (more than, say, twenty lines) in string literals.
Swift's design principles call for an incremental approach to design, so this proposal only considers the first of these four goals. We intend to address the others in separate but compatible future proposals, with the goal of fixing at least two in Swift 3 and all four by the time Swift 4 is released. A sketch of one possible design for these features is included in the "Future directions for string literals in general" section below.
An aside: Small and large multiline strings
The first entry in the list above, "Putting newlines in string literals", might be thought of as a subset of the fourth, "putting very large quantities of text in string literals". (In fact, all of the entries could perhaps be subsumed by the fourth.) However, we believe they are best addressed as separate goals, using separate features.
When you are embedding enormous string literals in source code, you must put undistorted representation of the string above all other considerations. If the design which best permits the string to be written verbatim is ugly, bulky, unlike other language constructs, disruptive to code readability, error-prone, arbitrary, difficult to parse, or otherwise a wart on the language, that is simply the price we have to pay for that feature.
But it's a different story for short multiline strings. When you are writing a little bit of text, but still more than one line, you don't want to disrupt your code's indentation, add whole lines just for delimiters, insert bizarre or cryptic tokens into your code, or create syntax errors which take ten minutes to trace back to their source. You want a different feature, with different tradeoffs.
It is that feature which we propose.
The goal: Newlines in string literals
Consider a piece of code which generates a small XML string:
let xml = "<?xml version=\"1.0\"?>\n<catalog>\n\t<book id=\"bk101\" empty=\"\">\n\t\t<author>\(author)</author>\n\t</book>\n</catalog>"
The string is practically unreadable, its structure drowned in escapes and
run-together lines; it looks like little more than line noise. We can
improve its readability somewhat by concatenating separate strings for
each line and using real tabs instead of \t escapes:
let xml = "<?xml version=\"1.0\"?>\n" +
          "<catalog>\n" +
          "	<book id=\"bk101\" empty=\"\">\n" +
          "		<author>\(author)</author>\n" +
          "	</book>\n" +
          "</catalog>"
However, this creates a more complex expression for the type checker, and there's still far more punctuation than ought to be necessary. If the most important goal of Swift is making code readable, this kind of code falls far short of that goal.
The example above generates XML, but there are many similar cases where a short fragment of text including newlines must be included in a string literal:
- Generating HTML, a task which will hopefully become more common
- Generating error messages and other user-facing text, in both graphical and command-line interfaces
- Generating configuration files and other "scripty" tasks
- Generating messages for other text-based protocols and formats
- Generating Swift code (common to work around the current lack of metaprogramming features other than generics)
- Generating code in general
We propose that, when Swift is parsing a string literal, if it reaches the end of the line without encountering an end quote, it should look at the next line. If it sees a quote at the beginning (a "continuation quote"), the string literal contains a newline and then continues on that line. Otherwise, the string literal is unterminated and syntactically invalid.
Our sample above could thus be written as:
let xml = "<?xml version=\"1.0\"?>
          "<catalog>
          "    <book id=\"bk101\" empty=\"\">
          "        <author>\(author)</author>
          "    </book>
          "</catalog>"
If the second or subsequent lines had not begun with a quotation mark,
or the trailing quotation mark after the </catalog> tag had not been
included, Swift would have emitted an error.
Rationale
This design is rather unusual, and it's worth pausing for a moment to explain why it has been chosen.
The traditional design for this feature, seen in languages like Perl and Python, simply places one delimiter at the beginning of the literal and another at the end. Individual lines in the literal are not marked in any way.
We think continuation quotes offer several important advantages over the traditional design:
They help the compiler pinpoint errors in string literal delimiting. Traditional multiline strings have a serious weakness: if you forget the closing quote, the compiler has no idea where you wanted the literal to end. It simply continues on until the compiler encounters another quote (or the end of the file). If you're lucky, the text after that quote is not valid code, and the resulting error will at least point you to the next string literal in the file. If you're unlucky, you'll get a seemingly unrelated error several literals later, an unbalanced brace error at the end of the file, or perhaps even code that compiles but does something totally wrong.
(This is not a minor concern. Many popular languages, including C and Swift 2, specifically reject newlines in string literals to prevent this from happening.)
Continuation quotes provide the compiler with redundant information about your intent. If you forget a closing quote, the continuation quotes give the compiler a very good idea of where you meant to put it. The compiler can point you to (or at least very near) the end of the literal, where you want to insert the quote, rather than showing you the beginning of the literal or even some unrelated error later in the file that was caused by the missing quote.
Temporarily unclosed literals don't make editors go haywire. The syntax highlighter has the same trouble parsing half-written, unclosed string literals that the compiler does: It can't tell where the literal is supposed to end and the code should begin. It must either apply heuristics to try to guess where the literal ends, or incorrectly color everything between the opening quote and the next closing quote as a string literal. This can cause the file's coloring to alternate distractingly between "string literal" and "running code".
Continuation quotes give the syntax highlighter enough context to guess at the correct coloration, even when the string isn't complete yet. Lines with a continuation quote are literals; lines without are code. At worst, the syntax highlighter might incorrectly color a few characters at the end of a line, rather than the remainder of the file.
They separate indentation from the string's contents. Traditional multiline strings usually include all of the content between the start and end delimiters, including leading whitespace. This means that it's usually impossible to indent a multiline string, so including one breaks up the flow of the surrounding code, making it less readable. Some languages apply heuristics, either at compile time or through runtime string manipulation functions, to try to remove indentation, but like all heuristics, these are mistake-prone and murky.
Continuation quotes neatly avoid this problem. Whitespace before the continuation quote is indentation used to format the source code; whitespace after the continuation quote is part of the string literal. The interpretation of the code is perfectly clear to both compiler and programmer.
They improve the ability to quickly recognize the literal. Traditional multiline strings don't provide much visual help. To find the end, you must visually scan until you find the matching delimiter, which may be only one or a few characters long. When looking at a random line of source, it can be hard to tell at a glance whether it's code or literal. Syntax highlighting can help with these issues, but it's often unreliable, especially with advanced, idiosyncratic string literal features like multiline strings.
Continuation quotes solve these problems. To find the end of the literal, just scan down the column of continuation characters until they end. To figure out if a given line of source is part of a literal, just see if it starts with a quote mark. The meaning of the source becomes obvious at a glance.
Nevertheless, the traditional design does have a few advantages:
It is simpler. Although continuation quotes are more complex, we believe that the advantages listed above pay for that complexity.
There is no need to edit the intervening lines to add continuation quotes. While the additional effort required to insert continuation quotes is an important downside, we believe that tool support, including both compiler fix-its and perhaps editor support for commands like "Paste as String Literal", can address this issue. These features could also address other issues like escaping. And many editors already support features which permit you to insert the same character at the same column in many adjacent lines—the exact task required to add continuation quotes.
Simple syntax highlighters may not support this syntax. This is true, but simple, generic syntax highlighters generally have terrible trouble with advanced string literal constructs; some struggle with even basic ones. While there are some designs (like Python's """ strings) which trick some syntax highlighters into working with some contents, we think the code formatting and visual recognition gains described above assist code reading more than the loss of finicky partial syntax highlighting compatibility hinders it.
It looks funny—quotes should always be in matched pairs. We aren't aware of another programming language which uses unbalanced quotes in string literals, but there is one very important precedent for this kind of formatting: natural languages. English, for instance, uses a very similar format for quoting multiple lines of dialog by the same speaker. (Nor is this an English-only quirk; Spanish and French, the other two languages I checked, seem to have similar conventions.)
“That seems like an odd way to use punctuation,” Tom said. “What harm would there be in using quotation marks at the end of every paragraph?”
“Oh, that’s not all that complicated,” J.R. answered. “If you closed quotes at the end of every paragraph, then you would need to reidentify the speaker with every subsequent paragraph.
“Say a narrative was describing two or three people engaged in a lengthy conversation. If you closed the quotation marks in the previous paragraph, then a reader wouldn’t be able to easily tell if the previous speaker was extending his point, or if someone else in the room had picked up the conversation. By leaving the previous paragraph’s quote unclosed, the reader knows that the previous speaker is still the one talking.”
“Oh, that makes sense. Thanks!”
In English, omitting the ending quotation mark tells the text's reader that the quote continues on the next line, while including a quotation mark at the beginning of the next line reminds the reader that they're in the middle of a quote.
Similarly, in this proposal, omitting the ending quotation mark tells the code's reader (and compiler) that the string literal continues on the next line, while including a quotation mark at the beginning of the next line reminds the reader (and compiler) that they're in the middle of a string literal.
On balance, we think continuation quotes are the best design for this problem.
When Swift is parsing a string literal and reaches the end of a line without finding an end quote, it examines the next line, applying the following rules:
If the next line begins with a quote mark (skipping horizontal whitespace), then the string literal contains a newline followed by the contents of the string literal starting after the quote mark. (This line may itself have no end quote, in which case the same rules apply to the line which follows.)
If the next line begins with anything else, Swift raises a syntax error for an unterminated string literal.
Formally, the following grammar productions (from The Swift Programming Language):
static-string-literal → '"' quoted-text(opt) '"'
interpolated-string-literal → '"' interpolated-text(opt) '"'
Become something like:
discarded-whitespace → (zero or more horizontal whitespace characters)
static-string-literal → '"' quoted-text(opt) '"' | '"' quoted-text(opt) \n discarded-whitespace static-string-literal
interpolated-string-literal → '"' interpolated-text(opt) '"' | '"' interpolated-text(opt) \n discarded-whitespace interpolated-string-literal
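As a rough illustration of the rules above, the following sketch applies them to the physical lines of a single literal. It is not the prototype's implementation: `parseLiteral`, `scanSegment`, and `LiteralError` are hypothetical names, the first line is assumed to begin at the literal's opening quote, and escape sequences are passed through undecoded rather than interpreted.

```swift
// Hypothetical error type: the index of the first line that should have
// started with a quote mark but did not.
enum LiteralError: Error, Equatable {
    case missingQuote(line: Int)
}

// Scans the text following an opening or continuation quote. Returns the raw
// content up to an unescaped closing quote (if any) and whether one was found.
// Escapes are copied verbatim, not decoded.
func scanSegment(_ text: Substring) -> (content: String, closed: Bool) {
    var content = ""
    var escaped = false
    for ch in text {
        if escaped {
            content.append(ch)
            escaped = false
        } else if ch == "\\" {
            content.append(ch)
            escaped = true
        } else if ch == "\"" {
            return (content, true)
        } else {
            content.append(ch)
        }
    }
    return (content, false)
}

// Applies the continuation-quote rule: each line must begin with a quote
// (after optional horizontal whitespace); each continuation contributes a
// newline followed by its content.
func parseLiteral(_ lines: [String]) throws -> String {
    var result = ""
    for (index, line) in lines.enumerated() {
        let trimmed = line.drop(while: { $0 == " " || $0 == "\t" })
        guard trimmed.first == "\"" else {
            throw LiteralError.missingQuote(line: index)
        }
        if index > 0 { result.append("\n") }
        let (content, closed) = scanSegment(trimmed.dropFirst())
        result += content
        if closed { return result }
    }
    // Ran out of lines before finding a closing quote.
    throw LiteralError.missingQuote(line: lines.count)
}

let sampleLines = [
    "\"<catalog>",
    "    \"    <book/>",
    "    \"</catalog>\"",
]
if let text = try? parseLiteral(sampleLines) {
    print(text)
}
```

Because the error carries the index of the first line that lacks a quote, a diagnostic built on it can point near the intended end of the literal instead of at some later, unrelated token.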
The exact error messages and diagnostics provided for unterminated quotes are left to the implementers to determine, but we believe it should be possible to provide two fix-its which will help users learn the syntax and correct string literal mistakes:
- Insert " at the end of the current line to terminate the quote.
- Insert " at the beginning of the next line (with some indentation heuristics) to continue the quote on the next line.
Prototypes and samples
John Holdsworth's pull request includes prototypes of several of the features we are planning, including this multiline string proposal. The pull request includes a link to a pre-built Swift toolchain containing the prototypes, which you can download and install to try these features for yourself.
The multiline strings prototype seems to indicate that the footprint of this feature will
be very small. The changes for this proposal are confined entirely to
getEncodedStringSegment(). (The string literal modifier feature
impacts other parts of the lexer, but this proposal doesn't require those
changes.) The prototype's diagnostics are primitive, and improving them
would probably increase the impacted code, but it seems likely
that the required changes will be very localized.
A very small ad-hoc set of tests for parts of the prototyped features is available in this gist. It is more demo than comprehensive test suite, but it shows a couple of ways this proposal and the other proposed features might be used, together and separately.
Impact on existing code
Failing to close a string literal before the end of the line is currently a syntax error, so no valid Swift code should be affected by this change.
Future directions for multiline string literals
We could permit comments before encountering a continuation quote to be counted as whitespace, and permit empty lines in the middle of string literals. This would allow you to comment out whole lines in the literal.
We could allow you to put a trailing backslash on a line to indicate that the newline isn't "real" and should be omitted from the literal's contents. Holdsworth's prototype includes this feature.
Future directions for string literals in general
In the "Motivation" section, we identified four goals for improvements in Swift string literals:
- Putting newlines in string literals.
- Putting backslashes in string literals.
- Putting quote marks in string literals.
- Putting very large quantities of text (more than, say, twenty lines) in string literals.
This proposal addresses the first of these goals. Let's sketch some solutions for the other three, as well as related features we might enable along the way.
Please note that these are simply sketches of hypothetical future designs; they may radically change before proposal, and some may never be proposed at all. Many, perhaps most, will not be proposed for Swift 3. We are sketching these designs not to propose and refine these features immediately, but merely to show how we think they might be solved in ways which complement this proposal.
What is most important about these designs is how they all work together with each other, with the current proposal, and with existing Swift string literals. Rather than inventing a new kind of string literal for each feature, we want to, where possible, extend and reuse existing syntax features.
A general mechanism: String literal modifiers
We may introduce the concept of string literal modifiers to alter the interpretation of string literals. These would become the basis for many future string literal features.
A string literal modifier is a cluster of identifier characters which goes before a string literal and adjusts the way it is parsed. Modifiers only alter the interpretation of the text in the literal, not the type of data it produces; for instance, there will never be something like the UTF-8/UTF-16/UTF-32 literal modifiers in C++.
Modifiers can be attached to both single-line and multiline literals, and could also be attached to other literal syntaxes which might be introduced in the future. When used with multiline strings, only the starting quote needs to carry the modifiers, not the continuation quotes.
In one potential design, uppercase modifier characters enable a feature; lowercase characters disable a feature.
Our prototype also includes basic support for string modifiers, although the specific behavior of the modifiers in the prototype doesn't precisely match this sketch.
Goal 2: Backslashes
In the simplest version of this feature, we could add an E literal modifier which enables all backslash-based escaping, including interpolation, double backslash, and backslash-quote. An e string would treat all backslashes literally, while an E string would treat all backslashes as escapes of some sort. E (the current behavior) would be the default. Thus, these would print:
print(e"\\\") => \\\
print(e"C:\Program Files\Microsoft Word") => C:\Program Files\Microsoft Word
print(e"\w+") => \w+
print(e"Interpolation looks like "\(this)") => Interpolation looks like => \(this)
We might also allow you to enable or disable individual features. For example, i might control interpolation, q might control quotes, and b might control other backslashes. Thus:
print(i"Interpolation looks like \"\(this)\"") => Interpolation looks like "\(this)"
print(eI"C:\Program Files\\(programName)") => C:\Program Files\Microsoft Word
print(b"\w+\n") => \w+ =>
These could be proposed separately, with e coming in Swift 3 and more nuanced modifiers potentially waiting for Swift 4.
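To make the interaction between modifier letters concrete, here is a minimal sketch of how a modifier cluster might be resolved into a set of enabled escaping features. Everything in it is an assumption made for illustration: the `EscapeFeatures` type and `features(forModifiers:)` function are hypothetical, the letters are processed left to right, and all features start enabled to mirror today's behavior. Neither the sketch in this section nor the prototype is confirmed to work this way.

```swift
// Hypothetical set of backslash-related features a literal may have enabled.
struct EscapeFeatures: OptionSet {
    let rawValue: Int
    static let interpolation = EscapeFeatures(rawValue: 1 << 0)
    static let quotes = EscapeFeatures(rawValue: 1 << 1)
    static let otherBackslashes = EscapeFeatures(rawValue: 1 << 2)
    static let all: EscapeFeatures = [.interpolation, .quotes, .otherBackslashes]
}

// Resolves a modifier cluster such as "eI" into the enabled features.
// Uppercase letters enable a feature and lowercase letters disable it.
// Returns nil for letters this sketch does not know about.
func features(forModifiers modifiers: String) -> EscapeFeatures? {
    var enabled = EscapeFeatures.all   // assumed default: current behavior
    for letter in modifiers {
        let affected: EscapeFeatures
        switch letter {
        case "e", "E": affected = .all
        case "i", "I": affected = .interpolation
        case "q", "Q": affected = .quotes
        case "b", "B": affected = .otherBackslashes
        default: return nil
        }
        if letter.isUppercase {
            enabled.formUnion(affected)
        } else {
            enabled.subtract(affected)
        }
    }
    return enabled
}

// "eI": disable everything, then re-enable interpolation only.
if let resolved = features(forModifiers: "eI") {
    print(resolved == [.interpolation])
}
```

Processing letters in order is what would let a cluster like eI mean "no escapes except interpolation"; a design in which order did not matter would need a different rule for conflicting letters.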
Goal 3: Quote Marks
One possibility is to allow the user to put any number of _ characters (a valid identifier character with no uppercase equivalent) between the modifier (if any) and the opening quote. The parser would then look for a matching number of _ characters after any closing quote, and if it did not find them, it would treat the " as a character in the string literal. Thus:
print(_"<a href="\(url)">"_) => <a href="http://www.swift.org/">
print(e_"print("Hello, world!\n")"_) => print("Hello, world!\n")
print(b_""[^"\\]*(\\.[^"\\]*)*+""_) => "[^"\\]*(\\.[^"\\]*)*+"
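The matching rule described above can be sketched as a small scanner. This is only an illustration under simplifying assumptions: `contents(ofLiteral:)` is a hypothetical name, the input is assumed to start at the first _ or the opening quote (any modifier letters already removed), and backslash escapes are ignored entirely.

```swift
// Returns the text between the opening quote and the first quote mark that is
// followed by as many underscores as preceded the opening quote.
// Returns nil if the input has no opening quote or no matching close.
func contents(ofLiteral source: String) -> String? {
    let chars = Array(source)
    var index = 0

    // Count the underscores in front of the opening quote.
    while index < chars.count, chars[index] == "_" { index += 1 }
    let underscoreCount = index

    guard index < chars.count, chars[index] == "\"" else { return nil }
    index += 1
    let contentStart = index

    while index < chars.count {
        if chars[index] == "\"" {
            // A quote only closes the literal if enough underscores follow it.
            let tail = chars[(index + 1)...].prefix(underscoreCount)
            if tail.count == underscoreCount && tail.allSatisfy({ $0 == "_" }) {
                return String(chars[contentStart..<index])
            }
        }
        index += 1
    }
    return nil
}

if let inner = contents(ofLiteral: "_\"<a href=\"x\">\"_") {
    print(inner)
}
```

With zero underscores the scanner degenerates to the ordinary rule that the first quote mark ends the literal, which is why this scheme could extend existing literals rather than replace them.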
Here are some other ways this feature could be implemented:
- Use a different identifier-but-not-capitalizable modifier, like
- Use a different single-character ASCII delimiter, like
- Use a different multi-character delimiter, like
- Use a different Unicode delimiter, like smart quotes (“foo”) or French quotes («foo») or Japanese quotes (
- Permit arbitrary delimiters bounded by some specific, known
Goal 4: Very large quantities of text
As discussed in the "Motivation" section, when a string gets long enough, the most important feature of the quoting construct becomes fidelity to the text represented. When your string literal is 50 or 100 lines long, you don't want to mess around with prefixing lines with continuation quotes—you've already disrupted the flow of your program far more than outdenting the literal would, you're not likely to miss the enormous delimiter the feature provides, and the size of the literal will be obviously more than "a screenful" (and more obvious because it's not indented properly). You just want something that makes it as easy as possible to get the text into your source without fuss.
(Note: To keep this brief, I'll be using examples which are shorter than you would normally use this feature with.)
There are two main options here, which might broadly be thought of as "the Python way" (which we will call verbatim strings) and "the Perl way" (heredocs).
Verbatim strings are, quite simply, string literals bounded by """. Between the two """ delimiters, you may put any character:
print("""<?xml version="1.0"?>
<catalog>
    <book id="bk101" empty="">
        <author>\(author)</author>
    </book>
</catalog>""")
print("""It was a dark and stormy \(timeOfDay) when """ + e"""the Swift core team invented the \(interpolation) syntax.""")
A variation on this would require the delimiters to be on separate lines from the string contents:
print("""
<?xml version="1.0"?>
<catalog>
    <book id="bk101" empty="">
        <author>\(author)</author>
    </book>
</catalog>
""")
print("""
It was a dark and stormy \(timeOfDay) when
""" + e"""
the Swift core team invented the \(interpolation) syntax.
""")
A very simple approach with generally similar features would be to introduce a modifier which disabled continuation quotes; in other words, it would turn this proposal's multiline strings into a more traditional version. If you needed an alternate delimiter, you could then use whatever alternate delimiter mechanism we introduce for normal string literals.
print(c_"<?xml version=\"1.0\"?>
<catalog>
    <book id=\"bk101\" empty=\"\">
        <author>\(author)</author>
    </book>
</catalog>"_)
print(c"It was a dark and stormy \(timeOfDay) when " + ec"the Swift core team invented the \(interpolation) syntax.")
Heredocs have you put a placeholder token in one line for a string literal whose contents begin on the next line. Traditionally, heredocs have allowed you to specify an arbitrary string as a delimiter, which must appear on its own line. The traditional syntax for heredocs would look something like:
print(<<"END")
<?xml version=\"1.0\"?>
<catalog>
    <book id=\"bk101\" empty=\"\">
        <author>\(author)</author>
    </book>
</catalog>
END
print(<<"---" + <<e"END")
It was a dark and stormy \(timeOfDay) when
---
the Swift core team invented the \(interpolation) syntax.
END
A more Swift-style syntax might use a #to marker:
print(#to("END"))
<?xml version=\"1.0\"?>
<catalog>
    <book id=\"bk101\" empty=\"\">
        <author>\(author)</author>
    </book>
</catalog>
END
print(#to("---") + e#to("END"))
It was a dark and stormy \(timeOfDay) when
---
the Swift core team invented the \(interpolation) syntax.
END
Or we might even borrow Python's """ delimiter, creating an unholy union of the two languages:
print(""")
<?xml version=\"1.0\"?>
<catalog>
    <book id=\"bk101\" empty=\"\">
        <author>\(author)</author>
    </book>
</catalog>
"""
print(""" + e""")
It was a dark and stormy \(timeOfDay) when
"""
the Swift core team invented the \(interpolation) syntax.
"""
Although heredocs could make a good addition to Swift eventually, there are good reasons to defer them for now. Please see the "Alternatives considered" section for details.
Other potential modifier features
Whitespace normalization: Changes all runs of whitespace in the literal to single space characters; this would allow you to use multiline strings and other spacing purely to improve code formatting.
alert.informativeText = W"\(appName) could not typeset the element “\(title)” because
                        "it includes a link to an element that has been removed from this
                        "book."
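The effect such a modifier would have on a literal's contents can be approximated today with an ordinary function. The sketch below is an illustration only: `normalizingWhitespace` is a hypothetical name, and it assumes "whitespace" means any character for which Swift's `Character.isWhitespace` is true, newlines included.

```swift
// Replaces every run of whitespace characters (spaces, tabs, newlines)
// with a single space. Leading and trailing runs also become one space.
func normalizingWhitespace(_ text: String) -> String {
    var result = ""
    var inWhitespaceRun = false
    for ch in text {
        if ch.isWhitespace {
            // Emit one space for the whole run, on its first character.
            if !inWhitespaceRun {
                result.append(" ")
                inWhitespaceRun = true
            }
        } else {
            result.append(ch)
            inWhitespaceRun = false
        }
    }
    return result
}

let message = "could not typeset the element because\n    it includes a link"
print(normalizingWhitespace(message))
```

Applied to a multiline literal, the newline contributed by each continuation line would collapse into the single space that separates the words on either side of the line break.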
Localization: Passes the string through Foundation's localization APIs; interpolations would be represented as format strings.
alert.informativeText = LW"\(appName) could not typeset the element “\(title)” because
                        "it includes a link to an element that has been removed from this
                        "book."
Comments: Embedding comments in string literals which were not included in their contents might be useful for literals containing regular expressions or other code.
Eventually, user-specified string modifiers could be added to Swift, perhaps as part of a hygienic macro system. It might also become possible to change the default modifiers applied to literals in a particular file or scope.
A note on regular expressions
Members of the core team are interested in regular expressions, but they don't want to just build a literal that wraps PCRE or ICU; rather, they aim to integrate regexes into the pattern matching system and give them a deep, Perl 6-style rethink. This would be a major effort, far beyond the scope of Swift 3.
In the meantime, the e modifier and perhaps other string literal modifiers will make it easier to specify regular expressions in string literals for use with NSRegularExpression and other libraries accessible from Swift.
Alternatives considered
Don't require a continuation quote
The main alternative is to not require a continuation quote at the beginning of each subsequent line, and simply extend the string literal from the starting quote to the ending quote, including all newlines between them. For example:
let xml = "<?xml version=\"1.0\"?>
<catalog>
    <book id=\"bk101\" empty=\"\">
        <author>\(author)</author>
    </book>
</catalog>"
This alternative is extensively discussed in the "Rationale" section above.
Use verbatim strings or heredocs instead
While these constructs have their place, a feature with lighter syntactic weight, better code formatting, and improved diagnostics is more appropriate for shorter multiline strings. See "An aside: Small and large multiline strings" above.
Introduce verbatim strings or heredocs first, or at the same time
Verbatim strings could probably be implemented relatively easily, but heredocs are probably too complex for the Swift 3 timeframe. We don't want to choose between these two approaches merely on the basis that one can be implemented sooner.
Don't require the end quote
Since each line is marked with a continuation quote, in theory, the end quote is redundant; the string could simply end after the last line with a continuation quote.
// Something like:
let xml = M"<?xml version="1.0"?>
          "<catalog>
          "    <book id="bk101" empty="">
          "        <author>\(author)</author>
          "    </book>
          "</catalog>
The M modifier could be left out (which would require quotes on that line to be escaped), or a different character or character sequence could be used. There was a fair bit of bikeshedding on this; in some cases, a single post suggested several syntaxes with slightly different semantics (such as different escaping rules). Some marked the first and/or last line differently from the other lines. What they all have in common is that the beginning of each line is marked in some way, but the end is not, even at the end of the literal.
Because there is no end delimiter—only a start-of-line marker—these designs may not require you to escape quotes; thus, they could potentially obviate the need for an alternate delimiter feature as well. Depending on the design, however, many of them have issues:
In most designs, it is possible to create a single-line string with the feature, but the resulting code tends to be ugly and awkward.
If the last line is marked the same as the others and the user forgets the marker on a line, the compiler has no way to notice, except by diagnosing errors caused by treating a line of a string literal as code. Since some lines of string content will be valid code (such as blank lines or C-style comments), these mistakes may pass unnoticed.
If the last line is marked the same as the others, then commenting out a line of a string literal, inserting a blank line in the middle of a string literal, or just in general inserting some sort of valid Swift code in the middle of a string literal would break the literal in half, once again potentially forming syntactically valid but incorrect Swift code.
Generally, the more these constructs work to avoid the above problems, the uglier and less quote-like they end up looking, and the more complex they will be for the parser.
Finally, all approaches share one fundamental issue.
String literals are expressions, and so they ought to have a syntax which can be nested inside other expressions. Line-oriented features like these don't work well as expressions, because you normally place several expressions on a single line, nesting them inside one another. Thus, these features may be awkward to use in any but the simplest ways.