graninas/cpp_stm_free_tutorial.md

## cpp_stm_free_tutorial.md

      
    Raw
  

              cpp_stm_free_tutorial.md
            
          
    Software Transactional Memory in C++: pure functional approach (Tutorial)

In this article I’ll tell you about my pure functional library for Software Transactional Memory (STM) that I’ve built in C++. I adopted some advanced functional programming concepts that make it composable and convenient to use. Its implementation is rather small and robust, which differentiates the library from competitors. Let’s discuss what STM is and how to use it.

Intro
Links
Software Transactional Memory
STM basics
Composable transactions
Dining Philosophers: The task
Dining Philosophers: Data model
Dining Philosophers: Behavior model
Dining Philosophers: Multithreaded evaluation
Conclusion
Additional materials


Intro

Parallel and concurrent programming is hard. We fall to this opinion once we’ve faced that our multithreading code isn’t working as expected. It was fine for a while, but then something bad is happened. Either the system stuck in a deadlock, or it has returned some invalid data due to race condition, or even crashed: no matter the happening, the system went berserk and started doing wrong things. This is hard to predict, hard to debug and certainly hard to test because multithreading bugs are quite sneaky and we cannot be sure the code is totally free from them.
Software Transactional Memory is introduced to solve many of these issues and provide an easy-to-use concurrency approach. With STM, you build your concurrent data model and then start evaluating it in different threads. STM makes a code running (almost) safe. It removes many problems usually following parallel code as a class. STM code doesn’t contain any synchronization bits and has a well-defined behavior, so it’s much easier to write it correctly. STM can’t completely save you from the problems, but it significantly reduces the complexity of parallel code, so finding problems becomes a walk in a forest. Sounds good, right? Let’s see how STM approaches this goal.
Links


The cpp_stm_free library
Dining Philosophers sample
More STM samples
C++ Russia 2018 talk slides (Eng)
C++ Russia 2018 talk (Rus)

Software Transactional Memory

The idea of STM is similar to transactional databases. Like them, you have some shared mutable data model and can evaluate safe transactional updates over it. Unlike databases, STM is not an external storage, it’s your application’s state. Notably, threaded transactions usually work on different parts of the model and they are isolated from each other. Transactions are able to update the model simultaneously. Well, to some degree; but sometimes it happens so that the transactions want to change the same part of the model. Just allowing them to do this is not a solution: simultaneous writing of the same shared resource is a guaranteed race condition. In the STM terminology, this situation is known as conflict. STM is responsible for solving conflicts in the background. It decides what transaction should commit the result, and what transaction should be restarted. You don’t need to control this behavior, STM has own tools to synchronize the two competing transactions. You’ll just be knowing that the data model stays valid all the time because of atomic updates STM does using the transactions.
Now I’ll show you how to construct concurrent data models and write transactions. I’ll be using my library, but other STM libraries have similar notions so it probably shouldn’t be a problem to catch the idea they all have in common. I’d also say my library has the same interface as Haskell’s one, and you’ll see it’s really easy to learn. So maybe learning it would be like learning Haskell a little bit. I hope, it will inspire you, not scare! You’ll see it’s easy and quite fun.
STM basics

Every implementation of STM brings a way to build any kind of data models you need for your tasks. But you can’t just define a variable and work with it inside a transaction. It won’t be thread-safe because the variable is not atomic. Ok, you’re right there is std::atomic, and under some circumstances, it can be a better solution than STM. After all, if you have just one shared variable (let’s say a counter of type int), involving STM would be like renting a helicopter to visit a coffee shop. It becomes more profitable when your data model is big enough, or when the logic of parallel code tends to become too sophisticated.
We’ll start from simple cases, though. In my library, a basic notion of shared resource is represented by TVar - namely, “transactional variable”. Here:
TVar<int> counter;
Currently, this variable has no any data, it’s undefined. Inside a regular code, you can’t read and write its internal int value, you only can pass the counter variable to some transaction and process it there. Ok, accessing a counter TVar will be a transaction itself because any reading and writing of TVar should be thread-safe. Actually, creating of a TVar is a transaction, too. And it’s impossible to create empty TVar. Note: "undefined" is not "empty". The code above just declares some container for our shared integer resource, but the STM runtime doesn’t know about it yet. Let’s really create a TVar; in my library, this will require one more thing: a notion of context.
Context ctx;
TVar<int> counter = newTVarIO<int>(ctx, 0);
Here, you create an object of the Context class that will hold all the runtime data needed STM to work. Copying of the ctx variable would be a bad idea, because it also keeps a general copy your concurrent model. Every TVar is related to some context. All the transactions are evaluating within some context, and this is how different threads can communicate with each other: through a context shared by reference. You can also share your TVars, they are like avatars of a shared resource inside a particular context, but not the resource itself. (Hint: prefer to share TVars by copy not by reference, it's cheap and more safe).
In the code above, the newTVarIO function works with the context passed. It allocates some memory for the int number and provides you an accessor to it. Avatar, accessor, pointer, - all these words describe the essence of the TVar data structure, more or less. So, what’s next with it? Earlier I said every operation with TVars is transaction, even the creation. Ok, except the newTVarIO function that is a special non-transactional version of newTVar. The full equivalent of the previous code will be the following:
Context ctx;
STML<TVar<int>> transaction = newTVar<int>(0);
TVar<int> counter = atomically(ctx, transaction);
Now you Sherlock see it. The function newTVar creates a transaction. You know it because of the variable name: transaction. You also see the initial value 0 passed to newTVar. Next, from the string STML<TVar<int>> you conclude this crafty type STML holds the transaction, and notably you can specify an arbitrary type for it. The last string says this type parameter is nothing else than the return type of the transaction: that’s what you’ll get if you evaluate it. How? Atomically, of course. Within this particular context. And yes, as far as you know what transaction does (creates a TVar), you declare the counter variable to store the result. Whoa, do you feel deductive?
But in the code above, the counter transaction variable is just created. We probably want to do something with it, for example, increment the counter in different threads. So there should be a way to modify this TVar somehow concurrently, thread-safely. To do this, we should use more functions that work with TVar, and, what is important, we should be able to build a bigger transactions using them.
Composable transactions

There are only three operations you can do with TVars: create one, read its contents and write new contents. Take a look at these functions:
template <typename A>
STML<TVar<A>> newTVar(const A& val);

template <typename A>
STML<A> readTVar(const TVar<A>& tvar);

template <typename A>
STML<Unit> writeTVar(const TVar<A>& tvar, const A& val);
They all return  the parameterized type STML<T> which indicates they are transactions. You may ask what the type Unit is. Nothing interesting, literally: we use it to show we aren’t interested in the result. Indeed, there is nothing useful the writeTVar function could return to us, but still, it’s a transaction, so it has to return STML<something>. In functional programming, the Unit plays the same role as void in C++.
The next step is to understand the idea of composable transactions that all are STML-returning functions. The STML type is rather complex inside, but fortunately you don’t need to know these implementation details. You can just use several useful combinators that allow you to tie two or more transactions together. Let’s start from the most important one: the bind combinator.
template <typename A, typename B>
STML<B> bind(const STML<A> ma, const std::function<STML<B>, A>& f);
As you can see, it takes two arguments: some transaction ma of type STML<A>, and some function f. The ma transaction will be evaluated, and its result will be passed to the function f. Then, the function does something with the result, and returns another transaction with the type STML<B>. Finally, this new STML<B> transaction will be the result of bind. Let’s see how to use it to increment the counter.
Context ctx;
TVar<int> counter = newTVarIO(ctx, 0);

STML<int> read = readTVar<int>(counter);
STML<Unit> increment = bind<int, Unit>(read, [&](int i) {
    STML<Unit> write = wirteTVar<int>(counter, i + 1);
    return write;
}

Unit result = atomically(ctx, increment);
Here, we pass some lambda function to the bind combinator and expect the result of the first transaction as argument. We also pass the counter TVar to access it inside the lambda. Both read and write operations will be evaluated atomically in a single transaction, so the counter won’t be changed by some other thread.
The code above is a bit verbose. Let’s shorten it a bit:
auto increment = bind<int, Unit>(readTVar<int>(counter), [&](int i) {
    return wirteTVar<int>(counter, i + 1);
};
Reading a TVar and then changing it somehow is a common operation. Why not to use a special combinator that is called modifyTVar? It has an understandable definition:
template <typename A>
STML<Unit> modifyTVar(const TVar<A>& tvar, const std::function<A(A)>& f);
Give it a TVar and a modification function, and it will make the rest work. It also returns the “boring” Unit value wrapped into a transaction. But it seems very helpful to return the new value rather than Unit. This means, we need to read the counter right after this modification. We still need stm::bind to tie one transaction to another, but this time we don’t need the Unit value from the first modifying transaction; we’ll just ignore it and then read counter passed as clojure:
STML<int> incrementCounter(const TVar<int>& counter)
{
   STML<Unit> modified = modifyTVar<int>(counter, [](int i) { return i + 1; });

   return bind<Unit, int>(modified,
                         [&](const Unit&){ return readTVar<int>(counter); });
}
It’s very convenient to place your transactions into functions, because you name them by what they do, not by how they are built internally. So here the name is quite self-explanatory, which is good.
Now we need to make this little concurrent model work in a multithreaded environment. Let it be two threads that are competing for the right to increment the counter. No any additional requirements; threads can eval their transactions in any order, we don’t care.
Take a look at the following code which is rather big but hopefully understandable:
struct CounterRt {
   int thread;                // Thread number
   std::mutex& logLock;       // Mutex for logging
   stm::Context& context;     // STM Context
   stm::TVar<int> tCounter;   // Shared counter TVar
};

// Thread worker function
void counterWorker(const CounterRt& rt) {
   for (int i = 0; i < 50; ++i) {
       int counter = stm::atomically(rt.context, incrementCounter(rt.tCounter));
       std::lock_guard g(rt.logLock);
       std::cout << "thread: [" << rt.thread << "] counter: " << counter << std::endl;
   }
}

// main function
int main() {
   stm::Context ctx;
   stm::TVar<int> tCounter = stm::newTVarIO(ctx, 0);

   std::mutex logLock;

   std::vector<std::thread> threads;
   threads.push_back(std::thread(counterWorker, CounterRt {1, logLock, ctx, tCounter}));
   threads.push_back(std::thread(counterWorker, CounterRt {2, logLock, ctx, tCounter}));

   for (auto& t: threads)
       t.join();

   return 0;
}
Each thread worker function receives some structure called CounterRt where we keep all the needed info, including our concurrent model (just a single TVar for counter). As you can see, threads call the transaction atomically and get the result. To avoid output mess, we use std::mutex for synchronization. The main function prepares tCounter and starts the threads. The program will output something like this:
thread: [1] counter: 1
thread: [2] counter: 2
thread: [1] counter: 3
thread: [1] counter: 4
thread: [2] counter: 5

Notice the slips here. The first thread managed to calculate two numbers in a row: 3 and 4, while the second thread was slowpocking somewhere. This happens because our simple model does not have any bits obligating the threads alternate each other. However, it’s possible to create reliable and predictive models with STM. In the next part we’ll be discussing additional tools STM provides (hint: the retry operation that allows you to restart the transaction by some reason), but the counter task looks too boring for this. After reading the article you can return to it and solve like an exercise. And now we’re moving further, to more complex concurrent models.
Dining Philosophers: The task

You’ve probably heard about this famous task. It was introduced to show main problems of concurrent programming, and it is very good to demonstrate STM possibilities. Notably, with STM, the task can be solved elegantly comparing to any other “bare” solutions. The following is a short description of the task:
Five philosophers are sitting at the circle dining table. There are five plates with spaghetti on it, and also five forks arranged so that every philosopher can take two nearest forks in both his hands. Philosophers either can think about their philosophical problems, or they can eat spaghetti. The thinkers are obligated to use two forks for eating. If a philosopher holds just one fork, he can’t eat or think. As a programmer, you need to organize their existence so that every philosopher could eat and think by turn, forever.
The task doesn’t sound that difficult, right? But aware of some pitfalls unseen from the first sight. The root of the problem are forks which can be considered as shared mutable resources for threads (philosophers). Any two neighbour thinkers are competing for the fork between them, and this enables such silly situations like “every philosopher has took the right fork, and they all stuck because no one could take the left fork anymore”. It's a deadlock. Thread starvation problem also can occur in a wrongly developed code, and, ironically, it will result in “starvation” of some philosophers: while part of them eat and think normally, other part can acquire resources hardly ever. So in good solutions all philosophers should pass their think-eat iterations almost equally.
Let’s see how we can do it with STM.
Dining Philosophers: Data model

We need to represent forks in our concurrent model. Any fork can be either Free or Taken. To distinguish different forks, we’ll assign some label to each.
enum class ForkState {
   Free,
   Taken
};

struct Fork {
   std::string name;
   ForkState state;
};
We can put all forks into one vector and then make a shared TVar from it:
TVar<std::vector<Fork>> tForks;
But this structure has one bad property: the only way to update a single fork is to update the whole shared resource (tForks). So when some thread completes its transaction over the TVar, all other threads will be restarted. After this, the next thread will come to update the TVar, and the situation will repeat again. This means, the model will be producing a lot of writing conflicts which is undesirable because it makes the model behave less effectively. Making all the forks to be a separate TVars should be more performant:
using TFork = stm::TVar<Fork>;

TFork tFork1 = stm::newTVarIO(context, Fork {"1", ForkState::Free });
TFork tFork2 = stm::newTVarIO(context, Fork {"2", ForkState::Free });
...
Speaking truthfully, the dilemma of constructing either a coarse-grained or a fine-grained concurrent model is well-known in parallel programming. It’s true that fine-grained models built with traditional approaches require more synchronization work, and therefore they are much more error-prone than coarse-grained. On the other hand, fine-grained models have a higher degree of potential parallelism. Good news that STM frees a developer from manual synchronizing of threads which radically reduces the amount of concurrent bugs, so we just create a better grained models by default.
For the convenience, we’ll introduce one more TVar that will keep a philosopher’s current Activity. The whole structure describing a philosopher will be the following:
enum class Activity {
   Thinking,
   Eating
};

struct TForkPair {
   TFork left;
   TFork right;
};

struct Philosopher {
   std::string name;
   stm::TVar<Activity> activity;
   TForkPair forks;
};
That’s it! The model is ready, now it’s time to breathe life into it.
Dining Philosophers: Behavior model

In construction of our transactions, we’ll start from the smallest ones and then compose them to get a bigger transactions. But let’s first discuss the idea. When the philosopher changes his activity from Thinking to Eating, he tries to take the left and the right fork. If he successfully takes both, he spends some time eating his spaghetti. If any of the forks is taken by a neighbour, our philosopher should put all forks he managed to acquire back on the table and then he should wait for a little amount of time hoping both forks become free.
This is how transaction tryTakeFork can be implemented. Noice the new combinator called withTVar:
STML<Unit> tryTakeFork(const TFork& tFork) {
   return withTVar<Fork, Unit>(tFork, [=](const Fork& fork) {
      if (fork.state == ForkState::Free) {
          return writeTVar<Fork>(tFork, Fork {fork.name, ForkState::Taken});
      }
      else {
          return pure<Unit>(unit);
      }
   });
}
This transaction does obvious things. The withTVar combinator reads the TVar and passes its contents to the lambda function. Then the transaction branches: if the previous fork state was Free, then the new state will be assigned; otherwise, nothing useful will be done, because the pure combinator just returns some uninteresting unit value. We can safely ignore results of this type (Unit), but we still need to bind the STML<Unit> transaction to another transaction if we want it to be a bigger one. We want, actually: let’s write a transaction that tries to take both forks. The sequence combinator will help us with, well, sequencing of some transactions. Check this out:
template <typename A, typename B>
STML<B> sequence(const STML<A> ma, const STML<B>& mb);
The sequence combinator takes the ma transaction, evaluates it, ignores the result, and returns mb. In our case, the A and B types are equal to Unit. Taking of two forks as simple as:
STML<Unit> tryTakeForks(const TForkPair& forks) {
   STML<Unit> lm = tryTakeFork(forks.left);
   STML<Unit> rm = tryTakeFork(forks.right);
   return sequence(lm, rm); // Also you can use bothVoided here.
}
But now we have to stop and understand what this transaction will do. Suppose, both forks were free, then the state of the model will be correct after evaluation of tryTakeForks. However it’s more likely one of the forks was already taken by someone else; it will be just skipped here, and we will never know about it. The model can move to the following state: the first fork is taken by our philosopher, the second fork is taken by its neighbour, and another neighbour becomes unnecessarily blocked. How can we avoid this situation? We could certainly write our transactions so they return additional information what fork was taken, and if only one was, we could put it back. This will work, but there is a better way.
Why not to disallow the invalid state completely? “All or nothing”, - this principle works well with STM models. If any fork of the two was already taken, the transaction should be restarted until the state of this particular TVar is changed. We can consider the transaction will be successful if and only if the state of a TVar is what we expect it to be. And this is when the power of the retry combinator comes to the scene. Consider the definition first:
template <typename A>
STML<A> retry();
This is just a transaction that returns any possible type but instead of doing something useful it restarts the whole transaction. Beware leaving the retry combinator unconditional because it will block your thread forever. The normal use case when you switch by some TVar states and “mark” part of them “illegal”. In our case we only need to prohibit the fork to be in the state Taken when we try to take it:
STML<Unit> tryTakeFork(const TFork& tFork) {
   return withTVar<Fork, Unit>(tFork, [=](const Fork& fork) {
      if (fork.state == ForkState::Free) {
          return writeTVar<Fork>(tFork, Fork {fork.name, ForkState::Taken});
      }
      else {
          return retry<Unit>();
      }
   });
Done. The tryTakeForks transaction stays the same: the STM engine will ensure the state of the model is legal. The transaction won’t proceed further if some of the forks was in the undesired state. But there is one important moment here. STM restarts this transaction a bit later; this means any side effect placed to the transaction will be possibly evaluated many times. Imagine there is some new operator lurking there. Memory can eventually start leaking! Or imagine an file writing operation: it can garbage your file by an unwanted bytes. Or… Ok, you’ve got the idea: transaction should never run any side effects. All the transactions should be pure. Exceptions are also prohibited because it will break the chain of transactions, which is an undefined behavior pretty much. So this is why the presented approach called “purely functional”. You can run your effect outside the STM code basing on the result a transaction returned, and in some cases you can run idempotent operations inside the transactions. But it’s always a good idea to avoid effects at all.
We are about to finish our behavior model. A philosopher’s state should be or should not be changed accordingly, let’s create a transaction for this:
STML<Activity> changeActivity(const Philosopher& philosopher) {

   STML<Activity> mAct = readTVar(philosopher.activity);

   STML<Unit> changed = bind<Activity, Unit>(mAct, [=](Activity oldAct)
       {
           if (oldAct == Activity::Thinking) {
               STML<Unit> taken = tryTakeForks(philosopher.forks);
               return sequence<Unit, Unit>(taken, writeTVar<Activity>(philosopher.activity, Activity::Eating));
           }
           else {
               STML<Unit> freed = putForks(philosopher.forks); // Implement it yourself
               return sequence<Unit, Unit>(freed, writeTVar<Activity>(philosopher.activity, Activity::Thinking));
           }
       });
   return sequence<Unit, Activity>(changed, readTVar<Activity>(philosopher.activity));
}
Notice we don’t need to check whether the forks were actually taken. We just know this situation is impossible with our model, and we can lean on it safely. Also, the model describes how to concurrently access the TVars, but it does not have any explicit synchronization objects here. We are guaranteed it will work as it’s written, thanks to the retry combinator and STM nature. This is a huge step forward comparing to bare parallel programming, do you agree?
Dining Philosophers: Multithreaded evaluation

To complete the article, I have to show the multithreaded evaluation of the model. It’s pretty straightforward, but let me simplify it for better demonstrability. I’ll shrink the number of philosophers to 2, and omit some unimportant stuff. Thread worker function:
struct PRt {
   Context& context;
   Philosopher& p;
};

void philosopherWorker(PRt rt) {
   while (true) {
       stm::atomically(rt.context, changeActivity(rt.p));
       std::this_thread::sleep_for(100ms);
   }
}
Evaluation:
void runPhilosophers() {
   stm::Context context;

   TFork tFork1 = stm::newTVarIO(context, Fork {"1", ForkState::Free }, " Fork 1");
   TFork tFork2 = stm::newTVarIO(context, Fork {"2", ForkState::Free }, " Fork 2");

   Philosophers philosophers =
       { mkPhilosopher(context, "1", tFork1, tFork2) // Does additional stuff
       , mkPhilosopher(context, "2", tFork2, tFork1)
       };

   std::vector<std::thread> threads;
   for (Philosopher& p: philosophers) {
       threads.push_back(std::thread(philosopherWorker, PRt {logLock, context, p}));
   }

   for (auto& t: threads) {
       t.join();
   }
}
The sample code also has one more thread that monitors the state of the model constantly. If you run the code, you’ll see the following output:
Snapshot #1
[1] Thinking, 1=Taken:2=Taken
[2] Eating, 2=Taken:3=Taken
[3] Thinking, 3=Taken:4=Free
[4] Thinking, 4=Free:5=Taken
[5] Eating, 5=Taken:1=Taken
Philosopher 2 changed activity to: Thinking for 3 seconds.
Philosopher 5 changed activity to: Thinking for 3 seconds.
Philosopher 3 changed activity to: Eating for 1 seconds.
Philosopher 1 changed activity to: Eating for 1 seconds.
Snapshot #2
[1] Eating, 1=Taken:2=Taken
[2] Thinking, 2=Taken:3=Taken
[3] Eating, 3=Taken:4=Taken
[4] Thinking, 4=Taken:5=Free
[5] Thinking, 5=Free:1=Taken

It works. It just works.
Conclusion

STM provides you some handy way to build reliable concurrent data models. To be said, STM is a lock-free approach in sense the client code will be free from any synchronization objects or explicit locks. The STM code is quite readable and understandable, and therefore it’s harder to make mistakes there.
Let me revise you to use STM, in short.

First, represent your shared resources as TVars. Forget about granularity problem, just make your resources fine-grained by default.
Second, create behavior model with transactions and convenient combinators binding them in a monadic way. The retry combinator can be used to prohibit certain states of the model. Consider making as smaller transactions as possible because the bigger your transactions the more conflicts your model will have. This impacts performance a lot. Also, avoid side effects and exceptions inside the transactions.
Finally, run the model in as many threads as you want. Great.

However, there are circumstances you should be aware: all the goodness of STM comes with cost.

STM can be less efficient than ad-hoc solutions because it uses some tricky synchronization inside.
Transactions and concurrent model can be written badly. It’s possible to make completely inefficient STM code that will behave as a poor single-threaded code.
Or even worse, the wrongly displaced retry combinator can hang your model forever.

My cpp_stm_free library is not production ready and therefore it has own issues I’d like to resolve in the future:

Experimental, works on model tasks but wasn’t tested in real conditions.
Performance is unknown. I didn’t any benchmarks yet due to lack of time. It works quite fast.


Simple transactions can be evaluated less than 0.1 ms.


Lot of separate atomically evaluated complex transactions can be evaluated for ~30.0 ms.


Lot of concurrently competing complex transactions can be evaluated for ~150.0 ms.


Not optimized. There is a lot of inefficiencies I’m aware about and even more I'm not. Many of them can be (should be) fixed. Since v0.5, the new engine was provided. Now STM works over the Church-encoded Free monad which gave a huge boost for long transactions.
The STM implementation with the Free monad also has additional overhead due to the Free monad’s nature. Since v0.5 it has been replaced by Church-encoded Free monad.
Lacks of side effect handling possibilities.
Lacks of important concurrent structures: TArray, TQueue, TChannel.
Needs more useful transactional combinators.
Needs more debugging / logging possibilities.
Lacks of documentation, but this tutorial should be fine to start from. Hopefully, combinators are self-descriptive, though.
Still can have some multithreading bugs.

Despite of this, I hope the library can be useful. At least it can fill the gap between a completely insane Transactional Memory from the TS 19841:2015 upcoming specification and really big Wyatt-STM. The latter can be your choice actually because it is mature and used in production for years. See the links below.
Additional materials


Software Transactional Memory (Wikipedia)
Wyatt-STM: production-ready STM for C++
The talk about Wyatt-STM by Brett Hall on CppCon 2015
TS 19841:2015 Transactional Memory
Beyound locks: Software Transactional Memory