@GrabYourPitchforks
Last active October 28, 2020 17:31
Improving the developer experience with regard to default string globalization

Note: This is a companion document to dotnet/docs#21249, which discusses the behaviors as they currently exist. It's assumed the reader is generally familiar with that document.

Summary

Since .NET 5 standardized on using ICU instead of NLS for globalization across all supported platforms (see breaking change notice), we've received a few reports of string and globalization APIs not behaving as expected.

These reports generally fall into one of two buckets:

  1. The developer wasn't intending to make their code globalization-aware, and the switch to ICU exposed an unintentional dependency in the developer's code which led to an unwanted behavioral change. See: dotnet/runtime#43736, dotnet/runtime#42234, dotnet/runtime#40922, dotnet/runtime#40258, dotnet/runtime#36177, dotnet/runtime#33997, dotnet/runtime#43802, dotnet/runtime#36891, dotnet/runtime#43772.

  2. The developer intended to use a globalization feature, and the switch from NLS to ICU introduced an unexpected behavior. See: dotnet/runtime#43795, dotnet/runtime#39523.

For scenarios which fall into the second bucket, the runtime offers a compat switch to restore the old behavior. The remainder of this document will focus on the first bucket. This bucket is where most of the reports seem to fall.

To address these, we plan a three-pronged approach: improve documentation in this area, audit existing tutorials and code samples, and change the new project experience to reduce the "pit of failure" surface for .NET developers. We are soliciting the community's feedback on all of these. Please use this issue to discuss.

Improving documentation regarding string APIs

There are currently breaking change and compatibility notices posted at the following locations:

We are additionally tracking through dotnet/docs#21249 improvements to the string docs all-up, including recommendations for which Roslyn analyzer rules to enable and updating the string docs to include a table of the default globalization mode for each of the APIs.

This work alone won't make the experience better, but it should help make information more complete and accessible to developers who are searching for it. This does not solve the problem of "How would somebody know to seek out this information in the first place?" - later sections of this proposal should address those.

Reviewing samples and tutorials for proper string handling patterns

We should review samples and tutorials to ensure they're not ingraining incorrect code patterns in our audience's minds. This is potentially a very large undertaking due to .NET samples being scattered across many different sites, some of which haven't been updated in over a decade.

At the very least, the samples that accompany the API documentation should be clarified so that they avoid performing linguistic operations where ordinal operations are likely more appropriate. A non-exhaustive list is provided below.

A simple search for these patterns will likely produce many false positives. We also shouldn't assume that every such instance of a culture-aware comparison is incorrect. More on this later.

Dev experience changes to reduce the "pit of failure"

The document dotnet/docs#21249 suggests that developers manually enable the Roslyn analyzer rules CA1307 and CA1309 in their code bases. These rules will flag calls to string.IndexOf(string) and other culture-aware APIs, requesting that the developer explicitly pass StringComparison.CurrentCulture to indicate "yes, I really did intend for this to be culture-aware."

This helps, but it requires an active gesture on the part of the developer. Ideally we would instead alert developers to potential problems (or even fix these problems automatically!) without requiring the developer to have first sought help.
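As a sketch of what that active gesture could look like today: assuming the project is built with the .NET 5 SDK analyzers (or references the Microsoft.CodeAnalysis.NetAnalyzers package), an .editorconfig fragment along these lines raises both rules to warnings. The file location and the choice of "warning" as the severity are assumptions made for illustration.

```ini
# .editorconfig at the repository root (location is an assumption)
[*.cs]
# CA1307: Specify StringComparison for clarity
dotnet_diagnostic.CA1307.severity = warning
# CA1309: Use ordinal StringComparison
dotnet_diagnostic.CA1309.severity = warning
```

Teams that want violations to fail the build could use "error" instead of "warning".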

There are several options we can take here, each with its own pros and cons. I'll describe some potential paths in a section below.

A bit of historical context

When .NET Framework was introduced two decades ago (!!!), the killer app was creating rich UI-based applications. .NET Framework introduced WinForms as the successor to VB6's rapid application development model. It also introduced WebForms as a way to create web-based GUIs with similar fidelity to native WinForms apps. End users interface directly with these app models, which led to rich localization and globalization support being woven throughout these app models from a very early stage.

Part of this early work involved ensuring that string instances could unambiguously hold data in any supported language. Historically this had been accomplished by storing the string as a sequence of 8-bit C-style chars (LPSTR), leaving their interpretation up to the active Windows code pages. .NET uses UTF-16 for its string representation, removing the reliance on code pages.

At the same time, since user interaction was such a crucial component of early .NET applications, it was important that applications behave according to the user's expectations. This is especially pronounced in applications that perform searching and collation, such as a personnel system which lists all employees' names alphabetically. The end user might expect ordering to be performed according to the conventions of U.S. English, or of Hungarian, or of Turkish, or of another language, depending on how they've configured their system. (The rules for performing Hungarian or Turkish collation are non-trivial.)

To support these scenarios, the .NET Framework APIs which search for one substring within another string or which compare two strings use the thread's current culture (StringComparison.CurrentCulture) by default. This includes APIs like string.IndexOf(string), string.CompareTo(string), and similar. Contrarily, .NET Framework APIs which search for individual chars within a string use ordinal (StringComparison.Ordinal) searching by default. This includes APIs like string.IndexOf(char) and string.StartsWith(char).

string.Contains is the exception to this rule. It was introduced in .NET Framework 2.0 - after the other string APIs - and does not follow the same convention. For string.Contains, both the string-based and the char-based overloads use ordinal behavior by default.
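To make these defaults concrete, the sketch below spells out the comparison that each overload is described as using. The class and method names (DefaultComparisonDemo, IndexOfLinguistic, IndexOfOrdinal) are made up for illustration. The result of the linguistic search is deliberately not stated, since it depends on the globalization stack and culture data in use.

```csharp
using System;

public static class DefaultComparisonDemo
{
    // Spells out what text.IndexOf(value) does by default:
    // the string-based overload is culture-aware.
    public static int IndexOfLinguistic(string text, string value) =>
        text.IndexOf(value, StringComparison.CurrentCulture);

    // Explicit ordinal search: compares UTF-16 code units only.
    public static int IndexOfOrdinal(string text, string value) =>
        text.IndexOf(value, StringComparison.Ordinal);

    public static void Main()
    {
        string text = "Hello\r\nWorld";

        // Result depends on the globalization stack (ICU vs. NLS).
        Console.WriteLine(IndexOfLinguistic(text, "\n"));

        // Ordinal results are the same everywhere: '\n' is at index 6.
        Console.WriteLine(IndexOfOrdinal(text, "\n"));

        // The char-based overload is ordinal by default.
        Console.WriteLine(text.IndexOf('\n'));

        // Contains is ordinal by default, even for string arguments.
        Console.WriteLine(text.Contains("\n"));
    }
}
```

One consequence is that text.Contains(value) and text.IndexOf(value) >= 0 are not guaranteed to agree, because they use different comparisons by default.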

An important aspect of globalized behavior is that it's not stable across platforms. Language itself is fluid, and conventions change. The globalization data that ships with the operating system encompasses not just language conventions, but also geopolitical concerns such as the default currency symbol, and the OS regularly updates this data. While these updates are not intended to be breaking, they make no guarantee of behavioral compatibility.

While this globalized-by-default behavior might be appropriate for UI-based applications where an end user is interacting directly with the app, it's often not appropriate for other scenarios. Web and backend services usually need to process data in a manner that remains consistent across runs and is not influenced by any linguistic conventions. Command-line tools similarly should usually exhibit consistent behavior regardless of the language of the user who launched the tool. Even within a GUI app running on a user's local machine, any underlying business logic should usually run uninfluenced by the user's culture settings.
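A minimal sketch of the kind of business logic this refers to, assuming a hypothetical check for a "file" URI scheme. The SchemeMatching class and its method names are illustrative only. The fragile variant depends on the casing rules of whichever culture is passed in; under Turkish casing rules, for instance, an uppercase 'I' typically lowercases to a dotless 'ı', so the comparison would be expected to fail.

```csharp
using System;
using System.Globalization;

public static class SchemeMatching
{
    // Fragile: lowercases using a culture's casing rules, then compares.
    // Calling ToLower() with no argument uses the current culture, which
    // is the same pattern with the culture chosen by the environment.
    public static bool IsFileSchemeFragile(string scheme, CultureInfo culture) =>
        scheme.ToLower(culture) == "file";

    // Robust: a case-insensitive ordinal comparison does not consult
    // culture data, so it behaves the same for every user and machine.
    public static bool IsFileScheme(string scheme) =>
        string.Equals(scheme, "file", StringComparison.OrdinalIgnoreCase);

    public static void Main()
    {
        Console.WriteLine(IsFileScheme("FILE"));

        // Outcome depends on the culture data available at run time.
        Console.WriteLine(IsFileSchemeFragile("FILE", new CultureInfo("tr-TR")));
    }
}
```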

Now that .NET has adopted Span<T> as a first-class citizen (and ReadOnlySpan<char> as the convention for a cheap string slice), there are also consistency issues to deal with. All Span<T>-based extension methods (including extension methods that operate on ReadOnlySpan<char>) are ordinal by default, unless an explicit StringComparison has been provided. As developers begin using span-based code more frequently, the risk of mixing and matching linguistic and non-linguistic operations on the same text increases.

string str = GetString();
bool b1 = str.StartsWith("Hello"); // uses 'CurrentCulture' by default

ReadOnlySpan<char> span = str.AsSpan();
bool b2 = span.StartsWith("Hello"); // uses 'Ordinal' by default

This mismatch of expectations could cause developers to introduce latent bugs into their code bases.
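One way to avoid the mismatch, sketched here under the assumption that ordinal behavior is what the caller wants, is to pass an explicit StringComparison on both code paths. The PrefixCheck helper is hypothetical and only illustrates the pattern.

```csharp
using System;

public static class PrefixCheck
{
    // string path: the explicit argument overrides the CurrentCulture default.
    public static bool StartsWithOrdinal(string text, string prefix) =>
        text.StartsWith(prefix, StringComparison.Ordinal);

    // span path: already ordinal by default, but stating the comparison
    // explicitly keeps both overloads visibly in agreement.
    public static bool StartsWithOrdinal(ReadOnlySpan<char> text, ReadOnlySpan<char> prefix) =>
        text.StartsWith(prefix, StringComparison.Ordinal);

    public static void Main()
    {
        string str = "Hello, world";
        ReadOnlySpan<char> span = str.AsSpan();

        // Both calls now use the same comparison semantics.
        Console.WriteLine(StartsWithOrdinal(str, "Hello"));
        Console.WriteLine(StartsWithOrdinal(span, "Hello"));
    }
}
```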

Possible paths forward

Option A: Enable Roslyn analyzer warnings by default

As mentioned earlier, the Roslyn analyzer rules CA1307 and CA1309 are intended to alert developers when they're invoking a string-based API that uses linguistic behavior by default. We can go further and enable these rules by default in applications targeting .NET 6+, producing compiler warnings when these patterns are observed. We can also mark APIs like string.IndexOf(string) as [EditorBrowsable(Never)], effectively hiding them from IntelliSense and guiding developers toward the StringComparison-consuming overloads.

The developer would see the warnings both within the Visual Studio IDE and on the console during compilation.

string str = GetString();
if (str.StartsWith("Hello")) // This line produces warning CA1307.
{
    /* do something */
}

if (str.StartsWith("Hello", StringComparison.CurrentCulture)) // Explicit comparison specified, no warning produced.
{
    /* do something */
}

Pros: The developer is alerted to the problem early, potentially before they even observe the problem in production.

Cons: This may introduce noise in code bases where the developer truly did intend to call globalization-aware APIs, including within enterprise code bases which have been brought forward across several .NET Framework versions. It also risks introducing a very steep learning curve for new .NET developers, who are now confronted with globalization-related issues while still within the first few minutes of writing their first "Hello, world!" application.

Alternative proposal: Enable these rules in all application types except WinForms and WPF. This assumes that calls to methods like string.IndexOf(string) where the user intended the default globalization behavior are very rare outside of WinForms and WPF projects.

Option B: Provide a compatibility switch to change the string defaults

Under this proposal, we provide a switch which forces all string APIs to default to Ordinal unless an explicit StringComparison has been provided. This encompasses string.IndexOf(string), string.Compare, and similar APIs. Globalization-specific APIs like System.Globalization.CompareInfo would be unaffected by this switch.

This switch would be application-wide, just like the existing globalization switches. There would be no facility for individual libraries to control this behavior. Library developers would still need to call APIs which take a StringComparison parameter if they want a strong guarantee on what behavior they'll get. (Library devs may want to enable the Roslyn analyzer rules to help flag non-compliant call sites.)

Defaulting string APIs to Ordinal matches how strings behave in other languages like C/C++, Java, Python, Rust, and others. Interestingly, Silverlight 2 and 3 also shipped with "string defaults to Ordinal" behavior, but this was later reverted with Silverlight 4. This switch would also mean that string.ToUpper and string.ToLower become equivalent to string.ToUpperInvariant and string.ToLowerInvariant.

Underlying this proposal is an assumption that stringy operations should be ordinal unless the call site explicitly requests otherwise. This makes writing globalization-friendly code a deliberate action rather than an automatic behavior. WinForms UI controls like list boxes could still behave in a manner appropriate for their own scenarios.
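As a rough illustration of "deliberate rather than automatic", the sketch below separates an ordinal sort (for storage or keys) from a linguistic sort that is requested explicitly for display. The NameSorting class and its method names are assumptions for this example. The linguistic ordering is not stated, since it depends on the culture data available at run time.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class NameSorting
{
    // Ordinal: stable across machines; suited to keys and identifiers.
    public static List<string> SortForStorage(IEnumerable<string> names)
    {
        var sorted = new List<string>(names);
        sorted.Sort(StringComparer.Ordinal);
        return sorted;
    }

    // Linguistic: requested on purpose, for display to a user of 'culture'.
    public static List<string> SortForDisplay(IEnumerable<string> names, CultureInfo culture)
    {
        var sorted = new List<string>(names);
        sorted.Sort(StringComparer.Create(culture, ignoreCase: false));
        return sorted;
    }

    public static void Main()
    {
        var names = new[] { "cherry", "apple", "Banana" };

        // Ordinal puts uppercase 'B' (U+0042) before lowercase 'a' (U+0061).
        Console.WriteLine(string.Join(",", SortForStorage(names)));

        // Ordering follows the conventions of the requested culture.
        Console.WriteLine(string.Join(",", SortForDisplay(names, CultureInfo.CurrentCulture)));
    }
}
```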

Pros: Provides uniformity across the API surface. Also provides significant performance increases since ordinal operations are considerably faster than linguistic operations.

Cons: This could be a substantial breaking change, especially for large applications which can't audit every line of code within third-party dependencies. It also deviates from documented defaults. This could cause confusion if somebody is following an older tutorial or if somebody really did intend to invoke a linguistic operation.

Option C: Shenanigans with reference assemblies

This is akin to Option B above but is intended to be less breaking to the .NET ecosystem. Here, we introduce no globalization switch, and we don't change any existing runtime behavior. Instead, we make two changes to .NET 6's reference assemblies.

  1. Remove string API overloads that don't take StringComparison.
  2. Change existing string API overloads which take StringComparison to default these parameters to Ordinal.

Consider overloads of string.StartsWith. Here is how the overloads currently appear in the reference assemblies and how they would appear after this proposal.

//
// .NET 5 reference assemblies
//
public sealed class string
{
    public bool StartsWith(char value);
    public bool StartsWith(string value);
    public bool StartsWith(string value, bool ignoreCase, CultureInfo? culture);
    public bool StartsWith(string value, StringComparison comparisonType);
}

//
// .NET 6 proposed reference assemblies
//
public sealed class string
{
    public bool StartsWith(char value);
    // public bool StartsWith(string value); // (REMOVED)
    public bool StartsWith(string value, bool ignoreCase); // (ADDED, to accelerate OrdinalIgnoreCase scenarios)
    public bool StartsWith(string value, bool ignoreCase, CultureInfo? culture);
    public bool StartsWith(string value, StringComparison comparisonType = StringComparison.Ordinal); // default value added
}

The end effect of this is that if a call site reads someString.StartsWith("Hello"), the .NET 6 compiler will no longer bind the call site to string.StartsWith(string). It will instead be bound to string.StartsWith(string, StringComparison) with the value Ordinal burned in at the call site. Existing assemblies which were compiled against .NET 5 or earlier will continue to call the original method, which still exists within the runtime and still has its old behavior.
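The "burned in at the call site" mechanism is ordinary C# optional-parameter binding. The sketch below imitates it with a hypothetical static helper (OrdinalByDefault.StartsWith), since the real string type cannot be altered from user code; it is a stand-in for the proposed signature, not the proposal itself.

```csharp
using System;

public static class OrdinalByDefault
{
    // Hypothetical stand-in for the proposed reference-assembly signature.
    public static bool StartsWith(string source, string value,
        StringComparison comparisonType = StringComparison.Ordinal) =>
        source.StartsWith(value, comparisonType);

    public static void Main()
    {
        // The compiler fills in the default, so this call is emitted as
        // StartsWith("Hello, world", "hello", StringComparison.Ordinal).
        Console.WriteLine(StartsWith("Hello, world", "hello")); // False

        // An explicit argument still overrides the default.
        Console.WriteLine(StartsWith("Hello, world", "hello",
            StringComparison.OrdinalIgnoreCase)); // True
    }
}
```

Because the default value is copied into the caller at compile time, recompiling the caller against a different reference assembly is what changes its behavior; already-compiled callers are unaffected.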

Pros: Provides uniformity across the API surface, while retaining behavioral compatibility for assemblies which don't target .NET 6.

Cons: This feels like an abuse of the reference assembly system. It also means that if you're inspecting code, you need to know its target framework (by cracking open the .csproj!) to deduce what the runtime behavior will be. There may also be issues with dynamic compilation and other scenarios where the runtime assemblies are used directly instead of using reference assemblies.

Option D: Revert back to NLS by default when running on Windows

We flip the switches so that ICU is no longer the default globalization stack when .NET apps run on Windows. This does not back out the "ICU everywhere" feature; apps running on Windows can still opt in to using ICU if desired.

This needn't be exclusive of other options. For example, this can be undertaken jointly with obsoleting APIs which are culture-aware by default. The goal of this proposal is to act as a compat shim rather than to address any latent bugs which might exist in today's callers.

Pros: .NET Framework and .NET Core applications which were built and tested on Windows will continue to work the same way on .NET on Windows.

Cons: Like .NET Core, .NET applications will behave differently across different OSes. Without compile-time alerts, it does not prevent new incorrect call sites from being introduced into the wider .NET ecosystem.
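Since the active globalization stack would then vary by OS and configuration, diagnostic code may want to report which stack is in use. The sketch below is adapted from the check described in the .NET globalization and ICU documentation; the GlobalizationProbe class name is an assumption for this example. For context, .NET 5 on Windows can already be switched to NLS through the System.Globalization.UseNls runtime setting.

```csharp
using System;
using System.Globalization;

public static class GlobalizationProbe
{
    // Returns true when the collation version reported by the invariant
    // culture looks like an ICU version rather than an NLS one.
    public static bool IsIcuInUse()
    {
        SortVersion sortVersion = CultureInfo.InvariantCulture.CompareInfo.Version;
        byte[] bytes = sortVersion.SortId.ToByteArray();

        // Reassemble the first four bytes of the sort ID as a little-endian int.
        int version = bytes[3] << 24 | bytes[2] << 16 | bytes[1] << 8 | bytes[0];

        return version != 0 && version == sortVersion.FullVersion;
    }

    public static void Main()
    {
        // Output depends on the OS and runtime configuration.
        Console.WriteLine(IsIcuInUse() ? "ICU" : "NLS (or invariant mode)");
    }
}
```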

Option E: Do nothing

We take no proactive measures regarding the developer experience. All of our efforts are focused solely on documentation, samples, and similar developer education. Basically, leave the world as it exists today in .NET 5.

Pros: We understand the world as it exists today. Once developers observe a misbehavior in their applications, they can consult our documentation or third-party channels like Stack Overflow to self-assist.

Cons: It leaves the "pit of failure" fairly wide and relies on developers to experience a problem before seeking assistance. This potentially leads to the continued introduction of fragile code into the wider .NET ecosystem.

Follow-up work

If we could answer the following questions, that might help inform our decision on what path to take. This issue does not propose a way to discover the answers to these questions.

  • How often are developers writing UI-layer code vs. business logic or other non-UI code? How can we detect this layering even within a single project?

  • What percentage of calls to string.IndexOf(string) would in practice return different results if we were to flip the default from CurrentCulture to Ordinal?

  • Do we need to address APIs like int.Parse at the same time? Example: Does a proposed 'ordinal by default' switch also mean that int.Parse and decimal.ToString are invariant by default?

  • What other options are missing from the above list?
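For the int.Parse and decimal.ToString question above, the sketch below shows what "invariant by default" would amount to, written with today's explicit overloads. The InvariantNumbers class and its method names are hypothetical.

```csharp
using System;
using System.Globalization;

public static class InvariantNumbers
{
    // Parses using invariant conventions rather than the current culture.
    public static int ParseId(string text) =>
        int.Parse(text, NumberStyles.Integer, CultureInfo.InvariantCulture);

    // Formats using invariant conventions ('.' as the decimal separator).
    public static string FormatAmount(decimal amount) =>
        amount.ToString(CultureInfo.InvariantCulture);

    public static void Main()
    {
        Console.WriteLine(ParseId("-42"));

        // Invariant formatting yields "1234.5" regardless of the current culture.
        Console.WriteLine(FormatAmount(1234.5m));

        // The parameterless overload uses the current culture, so the
        // decimal separator may differ from machine to machine.
        Console.WriteLine(1234.5m.ToString());
    }
}
```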
