This article has moved to the official .NET Docs site.
See https://docs.microsoft.com/dotnet/standard/base-types/character-encoding-introduction.
Utf8String and related concepts are meant for modern internet-facing applications that need to speak "the language of the web" (or I/O in general). Currently, applications spend time transcoding into formats that aren't particularly useful, which wastes CPU cycles and memory.
A naive way to accomplish this would be to represent UTF-8 data as byte[] / Span<byte>, but this leads to a usability pit of failure. Developers would have to rely on situational awareness and code hygiene to know whether a particular byte[] instance represents binary data or UTF-8 textual data. It then becomes very easy to write code like byte[] imageData = ...; imageData.ToUpperInvariant(); which defeats the purpose of using a typed language.
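To illustrate the point about typing, here is a minimal sketch of a wrapper that marks a byte sequence as validated UTF-8 text. The Utf8Text name and all of its members are hypothetical and exist only for this illustration; this is not the proposed Utf8String API. It relies only on the existing System.Text.UTF8Encoding class.

```csharp
using System;
using System.Text;

// Hypothetical wrapper for illustration only; NOT the proposed Utf8String API.
// Its sole job is to mark a byte sequence as "validated UTF-8 text" in the type system.
public readonly struct Utf8Text
{
    // Strict encoding: throws on ill-formed input instead of substituting U+FFFD.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    private readonly byte[] _bytes;

    private Utf8Text(byte[] bytes) => _bytes = bytes;

    // Validates the input and copies it. Throws DecoderFallbackException
    // if the bytes are not well-formed UTF-8.
    public static Utf8Text FromBytes(ReadOnlySpan<byte> utf8)
    {
        Strict.GetCharCount(utf8); // validation only; the count is discarded
        return new Utf8Text(utf8.ToArray());
    }

    public static Utf8Text FromString(string s) => new Utf8Text(Encoding.UTF8.GetBytes(s));

    public ReadOnlySpan<byte> Bytes => _bytes ?? Array.Empty<byte>();

    // Text operations live on the text type, so they cannot be called on a plain byte[].
    // This sketch round-trips through String, which is exactly the transcoding cost
    // a real UTF-8 string type would want to avoid.
    public Utf8Text ToUpperInvariant() => FromString(ToString().ToUpperInvariant());

    public override string ToString() => Encoding.UTF8.GetString(_bytes ?? Array.Empty<byte>());
}
```

With a type like this, imageData.ToUpperInvariant() on a byte[] simply does not compile, while Utf8Text.FromString("abc").ToUpperInvariant() does.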
We want to expose enough functionality to make the Utf8String type usable and desirable by our developer audience, but it's not intended to serve as a
// In a loop, try reading a natural word at a time.
const int CharsPerNuint = sizeof(nuint) / sizeof(char);
for (; inputLength >= CharsPerNuint; pInputBuffer += CharsPerNuint, inputLength -= CharsPerNuint)
{
    nuint utf16Data = Unsafe.ReadUnaligned<nuint>(pInputBuffer);
    utf16Data &= unchecked((nuint)0xFF80_FF80_FF80_FF80ul);
    if (utf16Data == 0)
    {
Utf8Char is analogous to Char: they represent a single UTF-8 code unit and a single UTF-16 code unit, respectively. They are distinct from the integral types Byte and UInt16 in that sequences of the UTF-* code unit types are meant to represent textual data, while sequences of the integral types are meant to represent binary data.
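As a small illustration of the code unit distinction, the sketch below counts how many UTF-16 and UTF-8 code units the same text occupies. The CodeUnitDemo class and its method names are made up for this example; the underlying calls (String.Length and Encoding.UTF8.GetByteCount) are existing .NET APIs.

```csharp
using System;
using System.Text;

// Illustrative helper; the class and method names are assumptions for this sketch.
public static class CodeUnitDemo
{
    // UTF-16 code units: one per System.Char in the string.
    public static int Utf16CodeUnitCount(string s) => s.Length;

    // UTF-8 code units: one per byte of the string's UTF-8 encoding.
    public static int Utf8CodeUnitCount(string s) => Encoding.UTF8.GetByteCount(s);
}
```

For example, "\u00E9" (U+00E9) is one UTF-16 code unit but two UTF-8 code units, and "\U0001F600" is two UTF-16 code units (a surrogate pair) but four UTF-8 code units. Neither count is the number of characters a user would perceive.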
Drawing this distinction is important. With UTF-16 data (String, Char[]), this distinction historically hasn't been a source of confusion. Developers are generally cognizant of the fact that, aside from RPC, most I/O involves some kind of transcoding mechanism. Binary data doesn't come in from disk or the network in a format that can be trivially projected as a textual string; it must go through validation, recombining, and substitution. Similarly, when writing a string to disk or the network, a trivial projection is again impossible. The transcoding step must run in reverse to get the text data int
This tests the performance of MemoryExtensions.ToUpperInvariant(this ReadOnlySpan<char>, Span<char>), String.GetHashCode(), and String.GetHashCode(StringComparison.OrdinalIgnoreCase).
In the table below:
| Method | Toolchain | StringLength | Mean | Error | StdDev | Scaled | ScaledSD |
|---|---|---|---|---|---|---|---|
| ToUpperInvariant | baseline coreclr | 0 | 27.112 ns | 0.7416 ns | 1.1763 ns | 1.00 | 0.00 |
/*
 * !! WARNING !!
 *
 * COMPLETELY UNTESTED CODE
 */
using Microsoft.Win32.SafeHandles;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.ConstrainedExecution;
This document describes the APIs of Memory<T>, IMemoryOwner<T>, and MemoryManager<T> and their relationships to each other.
See also the Memory<T> usage guidelines document for background information.
Memory<T> is the basic type that represents a contiguous buffer. This type is a struct, which means that developers cannot subclass it and override the implementation. The basic implementation of the type is aware of contiguous memory buffers backed by T[] and System.String (in the case of ReadOnlyMemory<char>).
This document describes the relationship between Memory<T> and its related classes (MemoryPool<T>, IMemoryOwner<T>, etc.). It also describes best practices when accepting Memory<T> instances in public API surface. Following these guidelines will help developers write clear, bug-free code.
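The relationship between MemoryPool<T>, IMemoryOwner<T>, and Memory<T> can be sketched roughly as follows. The MemoryOwnerDemo class and FillAndSum method are hypothetical names used only for this example, and the ownership pattern shown (rent, slice, use, dispose) is one reasonable reading of the guidelines rather than a definitive statement of them.

```csharp
using System;
using System.Buffers;

// Illustrative only; the class and method names are assumptions for this sketch.
public static class MemoryOwnerDemo
{
    // Rents a buffer from the shared pool, writes 1..count into it, and returns the sum.
    public static int FillAndSum(int count)
    {
        // The IMemoryOwner<byte> owns the rented buffer.
        // Disposing the owner (at the end of this method) returns the buffer to the pool.
        using IMemoryOwner<byte> owner = MemoryPool<byte>.Shared.Rent(count);

        // Rent may hand back a larger buffer than requested, so slice to the needed size.
        Memory<byte> memory = owner.Memory.Slice(0, count);

        // Memory<T> is a heap-friendly handle; the actual access goes through its Span<T>.
        Span<byte> span = memory.Span;
        for (int i = 0; i < span.Length; i++)
        {
            span[i] = (byte)(i + 1);
        }

        int sum = 0;
        foreach (byte b in span)
        {
            sum += b;
        }
        return sum;
    }
}
```

Calling MemoryOwnerDemo.FillAndSum(4) writes 1, 2, 3, 4 into the rented buffer and returns 10. Neither the Memory<byte> nor the Span<byte> may be used after the owner has been disposed.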
Span<T> is the basic exchange type that represents contiguous buffers. These buffers may be backed by managed memory (such as T[] or System.String). They may also be backed by unmanaged memory (such as via stackalloc or a raw void*). The Span<T> type is not heapable, meaning that it cannot appear as a field in classes, and it cannot be used across yield or await boundaries.
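As a rough illustration of Span<T> as an exchange type, the sketch below passes an array-backed buffer, a stackalloc buffer, and a string-backed buffer through span-based code. The SpanDemo class and its method names are invented for this example.

```csharp
using System;

// Illustrative only; the class and method names are assumptions for this sketch.
public static class SpanDemo
{
    // Works on any contiguous buffer of ints, regardless of what backs it.
    public static int Sum(ReadOnlySpan<int> values)
    {
        int total = 0;
        foreach (int v in values)
        {
            total += v;
        }
        return total;
    }

    public static int SumFromArray()
    {
        int[] array = { 1, 2, 3 };
        return Sum(array); // implicit conversion from T[] to ReadOnlySpan<T>
    }

    public static int SumFromStack()
    {
        // stackalloc assigned directly to a Span<T> needs no unsafe context.
        Span<int> onStack = stackalloc int[3];
        onStack[0] = 4;
        onStack[1] = 5;
        onStack[2] = 6;
        return Sum(onStack); // implicit conversion from Span<T> to ReadOnlySpan<T>
    }

    public static int CountSpaces(string text)
    {
        // A string can only be viewed through a read-only span.
        ReadOnlySpan<char> chars = text.AsSpan();
        int count = 0;
        foreach (char c in chars)
        {
            if (c == ' ') count++;
        }
        return count;
    }
}
```

Because Span<T> is not heapable, none of these spans can be stored in a class field or kept alive across an await; code that needs to do so would hold a Memory<T> instead and obtain the span at the point of use.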
Memory<T> is a wrapper around an object that can generate a Span<T>. For instance, Memory<T> instances can be backed by T[], System.String (readonly), and even SafeHandle instances. Memory<T> cannot be backed by "transient" unmanaged me
The goal of this project is to make a type that mirrors System.String as much as practical. It should be a heapable, immutable, indexable, and pinnable type. The data may contain embedded null characters. When pinned, the pointer should represent a null-terminated UTF-8 string.
We should provide conversions between String and Utf8String, though due to the expense of conversion we should avoid these operations when possible. There are a few ways to avoid these, including:
- Utf8String-based overloads to existing APIs like Console.WriteLine, File.WriteAllText, etc.
- ToUtf8String methods on existing types like Int32.
- Utf8StringBuilder.
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;

namespace ConsoleApp3
{
    class Program
    {
        static void Main(string[] args)