
Levi Broderick GrabYourPitchforks

GrabYourPitchforks / utf8_ldm_design.md
Last active September 14, 2019 17:38
UTF8 design for LDM

Utf8String design overview

Audience and scenarios

Utf8String and related concepts are meant for modern internet-facing applications that need to speak "the language of the web" (or of I/O in general). Currently, applications spend CPU cycles and memory transcoding data into formats that aren't particularly useful to them.

A naive way to accomplish this would be to represent UTF-8 data as byte[] / Span<byte>, but this leads to a usability pit of failure. Developers would then become dependent on situational awareness and code hygiene to be able to know whether a particular byte[] instance is meant to represent binary data or UTF-8 textual data, leading to situations where it's very easy to write code like byte[] imageData = ...; imageData.ToUpperInvariant();. This defeats the purpose of using a typed language.
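The type-safety argument above can be sketched with a small wrapper. This is only an illustration: `Utf8Text`, `FromString`, and `TypedTextDemo` are hypothetical names invented here, not the proposed Utf8String API, and the upper-casing round-trips through `String` purely to keep the sketch short.

```csharp
using System;
using System.Text;

// Hypothetical wrapper type, for illustration only; not the proposed Utf8String API.
public readonly struct Utf8Text
{
    private readonly byte[] _utf8Bytes;

    private Utf8Text(byte[] utf8Bytes) => _utf8Bytes = utf8Bytes;

    // Transcoding from UTF-16 happens once, at an explicit and visible call site.
    public static Utf8Text FromString(string value) =>
        new Utf8Text(Encoding.UTF8.GetBytes(value));

    // Length in UTF-8 code units (bytes).
    public int ByteLength => _utf8Bytes?.Length ?? 0;

    // Text operations exist on the text type only, never on raw byte[].
    // (Round-trips through String here for brevity; a real type would avoid that.)
    public Utf8Text ToUpperInvariant() =>
        FromString(ToString().ToUpperInvariant());

    public override string ToString() =>
        _utf8Bytes is null ? string.Empty : Encoding.UTF8.GetString(_utf8Bytes);
}

public static class TypedTextDemo
{
    // Accepts text only; passing a byte[] of image data is a compile-time error.
    public static string Shout(Utf8Text text) => text.ToUpperInvariant().ToString();
}
```

With this shape, `TypedTextDemo.Shout(Utf8Text.FromString("hello"))` compiles, while `TypedTextDemo.Shout(imageData)` for a `byte[] imageData` does not, which is the distinction the paragraph is asking the type system to enforce.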

We want to expose enough functionality to make the Utf8String type usable and desirable by our developer audience, but it's not intended to serve as a

// In a loop, try reading a natural word at a time.
const int CharsPerNuint = sizeof(nuint) / sizeof(char);
for (; inputLength >= CharsPerNuint; pInputBuffer += CharsPerNuint, inputLength -= CharsPerNuint)
{
    nuint utf16Data = Unsafe.ReadUnaligned<nuint>(pInputBuffer);
    // Keep only the bits that are set when a char falls outside the ASCII range (0x00..0x7F).
    utf16Data &= unchecked((nuint)0xFF80_FF80_FF80_FF80ul);
    if (utf16Data == 0)
    {
GrabYourPitchforks / utf8char_ecosystem.md
Created December 13, 2018 02:31
Utf8Char and the .NET ecosystem

Motivations and driving principles behind the Utf8Char proposal

Utf8Char is analogous to Char: they represent a single UTF-8 code unit and a single UTF-16 code unit, respectively. They are distinct from the integral types Byte and UInt16 in that sequences of the UTF-* code unit types are meant to represent textual data, while sequences of the integral types are meant to represent binary data.
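To make the code-unit distinction concrete, the sketch below counts how many UTF-16 and UTF-8 code units the same text needs. It uses only existing `System.Text.Encoding` APIs; the `CodeUnitDemo` class and its method names are made up for this example.

```csharp
using System;
using System.Text;

// Illustrative helper; the class and method names are assumptions for this example.
public static class CodeUnitDemo
{
    // Number of UTF-16 code units (Char values) in the string.
    public static int Utf16CodeUnits(string text) => text.Length;

    // Number of UTF-8 code units (bytes) needed to encode the same text.
    public static int Utf8CodeUnits(string text) => Encoding.UTF8.GetByteCount(text);
}
```

For example, `"\u00E9"` (é) is one UTF-16 code unit but two UTF-8 code units, and `"\U0001F600"` is two UTF-16 code units (a surrogate pair) but four UTF-8 code units. In both encodings a single code unit is not necessarily a whole character.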

Drawing this distinction is important. With UTF-16 data (String, Char[]), this distinction historically hasn't been a source of confusion. Developers are generally aware that, aside from RPC, most I/O involves some kind of transcoding mechanism. Binary data doesn't come in from disk or the network in a format that can be trivially projected as a textual string; it must go through validation, recombining, and substitution. Similarly, when writing a string to disk or the network, a trivial projection is again impossible. The transcoding step must run in reverse to get the text data int

GrabYourPitchforks / string_comp.md
Last active August 15, 2018 01:01
String performance optimizations

This tests the performance of MemoryExtensions.ToUpperInvariant(this ReadOnlySpan<char>, Span<char>), String.GetHashCode(), and String.GetHashCode(StringComparison.OrdinalIgnoreCase).
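For readers unfamiliar with the APIs being measured, here is a minimal sketch of how each is called. The `StringApiDemo` wrapper and its method names are assumptions for illustration; this is not the benchmark harness used for the numbers in this gist.

```csharp
using System;

// Illustrative wrapper; not the benchmark code behind the measurements.
public static class StringApiDemo
{
    // Upper-cases into a caller-provided buffer instead of allocating a new string per call.
    public static string UpperCopy(string input)
    {
        char[] destination = new char[input.Length];
        // Returns the number of chars written, or -1 if the destination is too small.
        int written = input.AsSpan().ToUpperInvariant(destination);
        if (written < 0) throw new InvalidOperationException("Destination too small.");
        return new string(destination, 0, written);
    }

    // Default (ordinal) hash code.
    public static int OrdinalHash(string input) => input.GetHashCode();

    // Case-insensitive ordinal hash code.
    public static int OrdinalIgnoreCaseHash(string input) =>
        input.GetHashCode(StringComparison.OrdinalIgnoreCase);
}
```

Hash code values are not stable across processes, so only relationships between them are meaningful: `OrdinalIgnoreCaseHash("Hello")` equals `OrdinalIgnoreCaseHash("HELLO")` within a single run.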

In the table below:

  • baseline coreclr = 3.0.0-preview1-26808-05
  • local build (6) = local build from private dev Utf8String branch, 6th rev.
  • local build (7) = local build from private dev Utf8String branch, 7th rev.
| Method | Toolchain | StringLength | Mean | Error | StdDev | Scaled | ScaledSD |
|---|---|---|---|---|---|---|---|
| ToUpperInvariant | baseline coreclr | 0 | 27.112 ns | 0.7416 ns | 1.1763 ns | 1.00 | 0.00 |
GrabYourPitchforks / validating_pool.cs
Created April 2, 2018 20:21
Validating MemoryPool<T>
/*
* !! WARNING !!
*
* COMPLETELY UNTESTED CODE
*/
using Microsoft.Win32.SafeHandles;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.ConstrainedExecution;
GrabYourPitchforks / memory_docs_samples.md
Last active January 20, 2024 13:29
Memory<T> API documentation and samples


This document describes the APIs of Memory<T>, IMemoryOwner<T>, and MemoryManager<T> and their relationships to each other.

See also the Memory<T> usage guidelines document for background information.

First, a brief summary of the basic types

  • Memory<T> is the basic type that represents a contiguous buffer. This type is a struct, which means that developers cannot subclass it and override the implementation. The basic implementation of the type is aware of contiguous memory buffers backed by T[] and System.String (in the case of ReadOnlyMemory<char>).
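As a rough illustration of how these types relate, the sketch below rents a buffer through `MemoryPool<T>.Shared`, which hands back an `IMemoryOwner<T>`, and wraps a string as `ReadOnlyMemory<char>`. The `MemoryOwnerDemo` class and its method names are assumptions made for this example.

```csharp
using System;
using System.Buffers;

// Illustrative helper; the class and method names are assumptions for this example.
public static class MemoryOwnerDemo
{
    public static int SumOfSquares(int count)
    {
        // The IMemoryOwner<T> owns the rented buffer; disposing it returns the buffer to the pool.
        using (IMemoryOwner<int> owner = MemoryPool<int>.Shared.Rent(count))
        {
            // Rent may hand back more than requested, so slice to the size actually needed.
            Memory<int> memory = owner.Memory.Slice(0, count);
            Span<int> span = memory.Span;

            for (int i = 0; i < span.Length; i++)
            {
                span[i] = i * i;
            }

            int sum = 0;
            foreach (int value in span)
            {
                sum += value;
            }
            return sum;
        }
    }

    // A string can back a ReadOnlyMemory<char> without copying.
    public static char FirstChar(string text)
    {
        ReadOnlyMemory<char> memory = text.AsMemory();
        return memory.Span[0];
    }
}
```

Here `SumOfSquares(4)` computes 0 + 1 + 4 + 9 = 14. The buffer must not be touched after the `using` block ends, because the pool may already have handed it to another caller.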
GrabYourPitchforks / memory_guidelines.md
Last active April 21, 2024 07:45
Memory usage guidelines

Memory<T> usage guidelines

This document describes the relationship between Memory<T> and its related classes (MemoryPool<T>, IMemoryOwner<T>, etc.). It also describes best practices when accepting Memory<T> instances in public API surface. Following these guidelines will help developers write clear, bug-free code.

First, a tour of the basic exchange types

  • Span<T> is the basic exchange type that represents contiguous buffers. These buffers may be backed by managed memory (such as T[] or System.String). They may also be backed by unmanaged memory (such as via stackalloc or a raw void*). The Span<T> type is not heapable, meaning that it cannot appear as a field in classes, and it cannot be used across yield or await boundaries.

  • Memory<T> is a wrapper around an object that can generate a Span<T>. For instance, Memory<T> instances can be backed by T[], System.String (readonly), and even SafeHandle instances. Memory<T> cannot be backed by "transient" unmanaged me
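The following sketch shows one method consuming `ReadOnlySpan<int>` regardless of whether the data lives in a managed array or on the stack. `SpanDemo` and its members are hypothetical names used only for illustration.

```csharp
using System;

// Illustrative helper; the class and method names are assumptions for this example.
public static class SpanDemo
{
    // One implementation serves every kind of backing store.
    public static int Sum(ReadOnlySpan<int> values)
    {
        int total = 0;
        foreach (int value in values)
        {
            total += value;
        }
        return total;
    }

    // Backed by managed memory: a slice of a T[] covering elements 2 and 3.
    public static int SumFromArray()
    {
        int[] array = { 1, 2, 3, 4 };
        return Sum(array.AsSpan(1, 2));
    }

    // Backed by stack memory via stackalloc; no heap allocation and no unsafe code needed.
    public static int SumFromStack()
    {
        Span<int> stack = stackalloc int[3];
        stack[0] = 10;
        stack[1] = 20;
        stack[2] = 30;
        return Sum(stack);
    }
}
```

Because `Span<T>` is not heapable, neither local above could be stored in a class field or kept alive across an `await`. That is the case where a `Memory<T>` is passed around instead and `.Span` is taken only at the point of use.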

GrabYourPitchforks / utf8string.md
Created March 23, 2018 20:55
Utf8String design philosophy

Usage, usability, and behaviors

The goal of this project is to make a type that mirrors System.String as much as practical. It should be a heapable, immutable, indexable, and pinnable type. The data may contain embedded null characters. When pinned, the pointer should represent a null-terminated UTF-8 string.

We should provide conversions between String and Utf8String, though due to the expense of conversion we should avoid these operations when possible. There are a few ways to avoid these, including:

  • Adding Utf8String-based overloads to existing APIs like Console.WriteLine, File.WriteAllText, etc.
  • Adding ToUtf8String methods on existing types like Int32.
  • Implementing utility classes like Utf8StringBuilder.
  • Not having implicit or explicit conversion operators that could perform expensive transcodings, but instead having constructor overloads or some other obvious "this may be expensive" mechanism.
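One way to picture the "ToUtf8String on Int32" idea from the list above is to format the integer directly into UTF-8 bytes, with no intermediate `String`. The sketch uses the existing `System.Buffers.Text.Utf8Formatter` API; the `FormatInt32AsUtf8` helper is a hypothetical stand-in and returns a plain `byte[]` only because Utf8String itself is still a proposal.

```csharp
using System;
using System.Buffers.Text;

// Illustrative helper; FormatInt32AsUtf8 is a hypothetical stand-in for a ToUtf8String-style method.
public static class Utf8FormattingDemo
{
    public static byte[] FormatInt32AsUtf8(int value)
    {
        // "-2147483648" is the longest Int32 representation: 11 ASCII bytes.
        Span<byte> buffer = stackalloc byte[11];

        // Writes the decimal digits as UTF-8 directly; no UTF-16 string is created.
        if (!Utf8Formatter.TryFormat(value, buffer, out int bytesWritten))
        {
            throw new InvalidOperationException("Buffer too small.");
        }

        return buffer.Slice(0, bytesWritten).ToArray();
    }
}
```

For instance, `FormatInt32AsUtf8(-42)` yields the three bytes `0x2D 0x34 0x32` ("-42"), which can be written to a socket or file as-is.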
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;
namespace ConsoleApp3
{
    class Program
    {
        static void Main(string[] args)