Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 15 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save alf-p-steinbach/c53794c3711eb74e7558bb514204e755 to your computer and use it in GitHub Desktop.
Save alf-p-steinbach/c53794c3711eb74e7558bb514204e755 to your computer and use it in GitHub Desktop.
Why COW was deemed ungood for std::string

Why COW was deemed ungood for std::string.

COW, short for copy on write, is a way to implement mutable strings so that creating strings and logically copying strings, is reduced to almost nothing; conceptually they become free operations like no-ops.

Basic idea: to share a data buffer among string instances, and only make a copy for a specific instance (the copy on write) when that instance's data is modified. The general cost of this is only an extra indirection for accessing the value of a string, so a COW implementation is highly desirable. And so the original C++ standard, C++98, and its correction C++03, had special support for COW implementations, and e.g. the g++ compiler's std::string implementations used COW.

So why was that support dropped in C++11?

In particular, would the same reason or reasons apply to a reference counted immutable string value class?

As we'll see it does not, it's just a severe mismatch between the std::string design and the ideal COW requirements. But it took a two hour car trip, driving 120 kms on winter roads, for my memory to yet again cough up the relevant scenario where Things Go Wrong™. I guess it's like the “why can't I assign a T** to a T const** question; it's quite counter-intuitive.

Basic COW string theory: the COPOW principle.

A COW string has two possible states: exclusively owning the buffer, or sharing the buffer with other COW strings.

It starts out in state owning. Assignments and copying initializations can make it sharing. Before executing a “write” operation it must ensure that it's in owning state, and a transition from sharing to owning involves copying the buffer contents to a new and for now exclusively owned buffer.

With a string type designed for COW any operation will be either non-modifying, a “read” operation, or directly modifying, a “write” operation, which makes it is easy to determine whether the string must ensure state owning before executing the operation.

With a std::string, however, references, pointers and iterators to mutable contents are handed out with abandon. Even a simple value indexing of a non-const string, s[i], hands out a reference that can be used to modify the string. And so for a non-const std::string every such hand-out-free-access operation can effectively be a “write” operation, and would have to be regarded as such for a COW implementation (if the current C++ standard had permitted a COW implementation, which it doesn't).

I call this the principle of copy on possibility of write, or COPOW for short. It's for strings that aren't designed for COW. For a COW-oriented design applying COPOW reduces to pure COW.

A code example showing how COW works.

To keep the size of the following example down I don't address the issue of constant time initialization from literal, but just show how assignment and copy initialization can be reduced to almost nothing:

#include <cppx-core/_all_.hpp>  // https://github.com/alf-p-steinbach/cppx-core

using C_str = const char*;      // Is also available in cppx.

namespace my
{
    $use_cppx( Raw_array_of_, Size );
    $use_std( begin, end, make_shared, vector, shared_ptr );

    class Cow_string
    {
        using Buffer = vector<char>;

        shared_ptr<Buffer>      m_buffer;
        Size                    m_length;

        void ensure_is_owning()
        {
            if( m_buffer.use_count() > 1 )
            {
                m_buffer = make_shared<Buffer>( *m_buffer );
            }
        }

    public:
        auto c_str() const
            -> C_str
        { return m_buffer->data(); }

        auto length() const
            -> Size
        { return m_length; }

        auto operator[]( const Size i ) const
            -> const char&
        { return (*m_buffer)[i]; }

        auto operator[]( const Size i )
            -> char&
        {
            ensure_is_owning();
            return (*m_buffer)[i];
        }
        
        template< Size n >
        Cow_string( Raw_array_of_<n, const char>& literal ):
            m_buffer( make_shared<Buffer>( literal, literal + n ) ),
            m_length( n - 1 )
        {}
    };
}  // namespace my

Here assignment is the default-generated assignment operator that just assigns the data members m_buffer and m_length, which are a shared_ptr and an integer, and ditto for copy initialization.

And apparently this code abides by the COPOW principle, so it should be safe…

The problem: UB by adding code that just copies.

Consider the following usage code, it's perfectly fine:

auto main() -> int
{
    my::Cow_string s = "Du store Alpakka!";
    const C_str p = s.c_str();
    
    // In this block the contents of `s` are not modified.
    {
        $use_std( ignore );
        const char first_char = s[0];
        ignore = first_char;
    }
    
    $use_std( cout, endl );
    cout << p << endl;
}

This code is fine because the COW string is already in state owning when s[0] is executed on the non-const s. So all that the initialization of first_char does is to copy a char value. Fine.

But if a maintainer innocently just introduces a logical copy of the string value, which is what COW primarily optimizes, and which certainly doesn't change the conceptual value, then mayhem ensues:

auto main() -> int
{
    my::Cow_string s = "Du store Alpakka!";
    const C_str p = s.c_str();
    
    // In this block the contents of `s` are not modified.
    {
        $use_std( ignore );
        my::Cow_string other = s;
        ignore = other;

        const char first_char = s[0];
        ignore = first_char;
    }
    
    $use_std( cout, endl );
    cout << p << endl;      //! Undefined behavior, p is dangling.
}

Uh oh.

Since s here is in state sharing, the COPOW principle makes the s[0] operation copy the shared buffer, to become owning. Then at the end of the block the only remaining owner of the original buffer, the other string, is destroyed, and destroys the buffer. Which leaves the p pointer dangling.

For a custom string type like Cow_string this is a user error. The type is just badly designed, so that it's very easy to inadvertently use it incorrectly. But for a std::string it's formally a bug in the COW implementation, a bug showing that COPOW is just not enough.

For a std::string, if the standard had permitted a COW implementation, to avoid the above calamity it would be necessary to transition to the owned state, incurring an O(n) copying of string data, every place that a reference, pointer or iterator is handed out, regardless of const-ness of the string. One could maybe call that copy on handing out any reference, COHOAR. It greatly reduces the set of cases where COW has an advantage. The C++ standardization committee deemed that cost too high, the remaining advantages of COW too low, to continue supporting COW. So,

  • the C++03 wordings that supported COW were removed;
  • wording was introduced, especially a general O(1) complexity requirement for [] indexing, that disallowed COW; and
  • functionality such as string_view was added, that relies on having pointers to string buffers, and that itself hands out references.

What about threads?

It'a common misconception that COW std::strings would be incompatible with multi-threading, or that making it compatible would make it inefficient, because with COW ordinary copying of a string doesn't yield an actual copy that another thread can access freely.

In order to allow string instances that are used by different threads, to share a buffer, just about every access function, including simple [] indexing, would need to use a mutex.

However, a simple solution is to just not use ordinary copy initialization or assignment to create a string for another thread, but instead a guaranteed content copying operation such as std::string::substr, or initialization from iterator pair. The C++11 standard could have gone in this other direction. It could, in principle, have added to the existing C++03 support for COW strings, noting that COHOAR, not just COPOW, is required, and added a dedicated deep-copy operation or deep-copy support type plus wording about thread safety.

What about reference counted immutable strings?

An immutable string is a string type such as the built in string types in Java, C# or Python, where the available operations don't support string data modification. Strings can still be assigned. One just can't directly change the string data, like changing “H” to ”J“ in “Hello”.

Immutable strings can and in C++ typically will share their string data via reference counting, just as with COW strings. As with COW strings they support amortized constant time initialization from literal, ditto superfast copy assignment and copy initialization, and in addition, if strings don't need to be zero-terminated they support constant time substring operations. They're highly desirable.

So, is the problem shown above, a showstopper for immutable strings in C++?

Happily, no. The problem comes about because std::string hands out references, pointers and iterators that can be used to change the string data without involving std::string code, i.e. without its knowledge. That can't happen with an immutable string.

And figuring out this, whether there was a showstopper, and whether std::string_view (that hands out references) could be used freely in code that would deal with immutable strings, was the reason that I delved into the question of COW std::string again. At one point, long ago, I knew, because I participated in some debates about it, but remembering the problematic use case wasn't easy. It's just not intuitive to me, that adding an operation that just copies, can create UB…

@BjoernAtBosch
Copy link

Nice article!

But I think, the primary reason, why the second example ("uh oh") is failing, is not the CO(PO)W. It's because c_str() is offering direct access to an internal data structure of the Cow_string (same as the c_str() of std::string), which is - as you mentioned - bad design. But there might be reasons for doing such a "trade of" having this function ...

A similar behaviour would occur if the Cow_string class would additionally define a public append function (like std::string):

    Cow_string& append(const Cow_string& appendee)
    {
        ensure_is_owning();
        m_buffer->insert(m_buffer->end(), appendee.m_buffer->begin(), appendee.m_buffer->end());
        return *this;
    }

Using this on a single (Cow_)string object has the same effect:

auto main() -> int
{
    my::Cow_string s = "Du store Alpakka!";
    const C_str p = s.c_str();

    s.append(my::Cow_string(" This is a very long appendee making sure the buffer vector needs being reallocated, throwing away its old native array."));

    cout << p << endl;  // p is most probably invalid now
}

So, reading on the returned char array should be handled with care. The pointer should not be used on some "long term perspective". After calling some modifying function on the original Cow_string object might at least change contents of the char array or even invalidate the pointer completely.

Interestingly enough, the (somewhat outdated) documentation for the std::string::c_str() function on cplusplus.com says:

The pointer returned may be invalidated by further calls to other member functions that modify the object.

Whereat the documentation for that function on cppreference.com is more concrete:

The pointer obtained from c_str() may be invalidated by:

  • Passing a non-const reference to the string to any standard library function, or
  • Calling non-const member functions on the string, excluding operator[], at(), front(), back(), begin(), rbegin(), end() and rend().

I think, that's what you want to show in your example: Calling the non-const index operator should not invalidate the pointer (because no reallocation is required). And taking it this way, the issue is caused by the COPOW.
(I found it a bit surprising, that the compiler is using the non-const index operator here, even thoughin fact only read access is done.)

But: Clients accessing data via such low level functions should do that always with extreme care. They should get the pointer, do whatever they want it for and shall not do anything with the original (string) object during that time. Only after dropping that pointer, they should start working on the original object again.

Best regards.

@dabrahams
Copy link

Great write-up! Tiny nit: no mutexes are needed in order to implement CoW string; it just needs an atomic reference count. That's what Swift does. Of course, the way C++ rampantly exposes unsafe pointers and references to the string's internals and can't optimize a read-only indexing operation on a non-const string probably still makes CoW unreasonable for C++.

@germandiagogomez
Copy link

germandiagogomez commented Jul 13, 2022

@dabrahams

std::as_const(mystr)[i]

Ugly but probably effective 🤣

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment