@rmartinho
Last active October 7, 2015 12:38
Design ideas for ogonek::text

Motivation

ogonek::text is intended as a Unicode-based string class: not a glorified container of characters, like std::basic_string, but an actual piece of Unicode text.

Storage

text is not about storage, so it delegates that to another container. That container can be customized, yielding a range of performance characteristics suited to different situations. One could have a Unicode text array, similar to std::basic_string, or one could have a Unicode text deque, or even a rope.

Encoding

text is to be seen as a sequence of codepoints, not as a sequence of code units. The encoding is not fixed either: the user can customize it. So one can have a Unicode rope on UTF-16, or a Unicode deque on UTF-8.
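As a rough illustration of the two customization points described so far, the following sketch shows a class template parameterized on an encoding and a container. All names here (the utf8, utf16 and utf32 tags, code_unit, storage(), and the aliases) are assumptions made for this example and not ogonek's actual interface. Validation and codepoint iteration are deliberately left out.

```cpp
#include <deque>
#include <string>
#include <utility>

// Hypothetical encoding tags: each one names the code unit type it uses.
struct utf8  { using code_unit = char; };
struct utf16 { using code_unit = char16_t; };
struct utf32 { using code_unit = char32_t; };

// Sketch only: text owns a container of code units and leaves every
// storage decision (contiguity, growth, allocation) to that container.
template <typename Encoding,
          typename Container = std::basic_string<typename Encoding::code_unit>>
class text {
public:
    using encoding_type  = Encoding;
    using container_type = Container;

    text() = default;

    // A real implementation would validate the code units here.
    explicit text(Container units) : storage_(std::move(units)) {}

    // Read-only view of the underlying code units.
    const Container& storage() const { return storage_; }

private:
    Container storage_;
};

// "A Unicode deque on UTF-8" and an array-like text on UTF-16.
using utf8_deque_text  = text<utf8, std::deque<char>>;
using utf16_array_text = text<utf16>;
```

A rope would slot in the same way, provided it exposes a container-like interface over code units.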

Validation

text has strong validity invariants. Attempting to construct an instance from an invalid sequence of code units is an error unless a replacement strategy is provided.
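To make the idea of a replacement strategy concrete, here is a minimal sketch that decodes UTF-16 code units into codepoints. The function name decode_utf16 and the on_error enumeration are hypothetical. The sketch either rejects an unpaired surrogate with an exception or substitutes U+FFFD REPLACEMENT CHARACTER, depending on the policy passed in.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical policy: what to do when the input is ill-formed.
enum class on_error { throw_exception, use_replacement_character };

// Decodes UTF-16 code units into codepoints, checking surrogate pairing.
std::u32string decode_utf16(const std::u16string& units,
                            on_error policy = on_error::throw_exception) {
    std::u32string out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        const char32_t u = units[i];
        const bool is_lead  = u >= 0xD800 && u <= 0xDBFF;
        const bool is_trail = u >= 0xDC00 && u <= 0xDFFF;

        if (!is_lead && !is_trail) {  // ordinary BMP code unit
            out.push_back(u);
            continue;
        }
        if (is_lead && i + 1 < units.size()) {
            const char32_t next = units[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {  // well-formed pair
                out.push_back(0x10000 + ((u - 0xD800) << 10) + (next - 0xDC00));
                ++i;  // the trail surrogate has been consumed
                continue;
            }
        }
        // Unpaired surrogate: the input is not valid UTF-16.
        if (policy == on_error::throw_exception) {
            throw std::invalid_argument("ill-formed UTF-16: unpaired surrogate");
        }
        out.push_back(0xFFFD);  // U+FFFD REPLACEMENT CHARACTER
    }
    return out;
}
```

A constructor of text could run a check of this kind once, so that every later operation can assume well-formed data.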

Iteration

text is a range of codepoints whose functionality depends on the underlying container and encoding. It supports at least forward iteration, and can support stronger iteration categories given the right underlying encoding and container: utf32 with a std::vector would give random-access iteration; utf16 with a std::deque would give bidirectional iteration; and utf7 with a std::deque would give only forward iteration.

Customization hooks

The customization points (container and encoding) are to be template parameters, but it may or may not be desirable to provide type-erased alternatives.

Platform-specific variants

There's a platform-dependent preferred_host_encoding alias for whichever encoding is preferable on the host (utf16 on Windows, and utf8 on Linux), and aliases for the text variants using that encoding.
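Such an alias could be selected with a preprocessor check, as in the sketch below. The encoding tags and the host_code_units alias are hypothetical. Treating every non-Windows platform as UTF-8 is an assumption of this example, since the text above only mentions Windows and Linux.

```cpp
#include <string>

// Hypothetical encoding tags naming their code unit types.
struct utf8  { using code_unit = char; };
struct utf16 { using code_unit = char16_t; };

// Pick the encoding the host platform's APIs prefer.
#if defined(_WIN32)
using preferred_host_encoding = utf16;  // Windows APIs favour UTF-16
#else
using preferred_host_encoding = utf8;   // assumption: UTF-8 everywhere else
#endif

// Storage for host-encoded text; a text alias would be built the same way.
using host_code_units = std::basic_string<preferred_host_encoding::code_unit>;
```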

Direct manipulation

text provides controlled access to the underlying code units. A user who wants to manipulate code units directly can simply work on a container. The user can move the underlying container out of an instance of text, manipulate it, and then move it back in or create a new instance of text from it. This last operation can enforce the validity invariants by rechecking the data.
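The move-out, edit, move-back cycle might look like the following sketch. It uses a simplified, UTF-32-only stand-in class; utf32_text, extract() and storage() are names invented for this example. The constructor rechecks the data, so an edit that breaks validity is caught when the container comes back in.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Simplified stand-in for text, fixed to UTF-32 storage in a std::u32string.
class utf32_text {
public:
    // Taking ownership of raw code units rechecks the validity invariant.
    explicit utf32_text(std::u32string units) : units_(std::move(units)) {
        for (char32_t u : units_) {
            const bool is_surrogate = u >= 0xD800 && u <= 0xDFFF;
            if (u > 0x10FFFF || is_surrogate) {
                throw std::invalid_argument("not a Unicode scalar value");
            }
        }
    }

    // Callable only on an rvalue: hands the container to the caller, who may
    // then edit code units freely. The source object is left moved-from.
    std::u32string extract() && { return std::move(units_); }

    // Read-only access for everyone else.
    const std::u32string& storage() const { return units_; }

private:
    std::u32string units_;
};

// Example round trip: move out, edit the raw code units, move back in.
utf32_text append_code_unit(utf32_text t, char32_t unit) {
    std::u32string raw = std::move(t).extract();
    raw.push_back(unit);                // direct code unit manipulation
    return utf32_text(std::move(raw));  // validity is rechecked here
}
```

Because extract() is rvalue-qualified, the caller has to write std::move explicitly, which makes the handover of the invariant visible at the call site.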

Legacy interoperation

Interoperation with APIs operating on null-terminated arrays of code units can be done using a container that stores such a null-terminated array, like std::basic_string.
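For instance, if the code units live in a std::string, its c_str() member already yields the null-terminated array such an API expects. In the sketch below, legacy_length is a stand-in for an arbitrary C-style function, not a real library call, and the storage is passed directly rather than through text.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for a legacy API that takes a null-terminated array of code units.
std::size_t legacy_length(const char* units) {
    return std::strlen(units);
}

// With std::string as the container, no copy or conversion is needed:
// c_str() exposes the stored code units with a terminating null.
std::size_t call_legacy(const std::string& utf8_storage) {
    return legacy_length(utf8_storage.c_str());
}
```

Note that the legacy function sees code units, not codepoints: a string holding the single codepoint U+0105 in UTF-8 reports a length of two.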

ICU interoperation

Interoperation between ogonek::text and ICU's UnicodeString is intended, but requires further study.
