albertnetymk/STM.md

## STM.md

      
    Raw
  

              STM.md
            
          
    How STM interacts with locks in kappa

I would firstly assume sync block has the same semantics of pthread_mutex_lock. When the execution outside of transactions reaches sync block, it would gain exclusive access to this critical region, and all other executions, either inside or outside transactions, are blocked if they try to enter the critical region.
When the execution inside of transactions reaches sync block, it would gain non-exclusive access to this critical region, which could be obtained by any other executions inside transactions, but executions outside transactions would be blocked.
In other words, the mutex lock inside the transactions becomes a reader lock, and mutex lock outside the transactions becomes a writer lock.
Software transactional memory

static transactions, transactions which access a pre-determined sequence of locations.
deadlock is eliminated by acquiring all the needed locks in a total order
Herlihy’s method

copy the whole data to a newly allocated space, do the update, and switch the pointer to the new location using ll/sc. expensive, and doesn’t support concurrent update
Barnes

read data into local memory, do the update locally, and using k word read-modify-write (kRMW) atomic op to update the shared memory. The kRMW requires “locking” semantics, so in order to get non-blocking property, “cooperative method” is used; whenever a process needs a location already locked by another process, it helps the locking process complete its own operation, and this helping policy is recursive.
Improvement done in this paper:

Two processes, P(a,b), Q(b), where a and b are two data locations. Assumes that Q has already owned b, and P has already owned a. After P has finished helping Q, P would resume its own transaction, which will probably fail, for b may be updated by Q. All the processes waiting for a will first help P, then Q, and again P. Therefore, it’s possible that two process who shares no direct resource could be brought together by a third process, which creates unnecessary contention.
transaction() {
  ret = lock_all_data;
  if ret == success
    commit;
    unlock_all_data;
  else
    unlock_all_data
    if not_helper
      help_the_blocked_transaction
}

The not_helper check removes the recursiveness of the helping policy.
Transparent Support for Wait-Free Transactions

implements non-static STM, where the accessed memory locations are not pre-determined.
Different from previous paper, where exclusive access is acquired at the beginning of the transaction, while in this paper acquiring exclusive access, using MWCAS, happens right before commit. Inconsistent READ values signals that other transactions have commits on the same values, so the current transaction would fail anyway, so transactions would early abort as soon as inconsistent READ is observed.
Simliar techniques to Readers-writers lock is used to avoid aborting two transactions only reading the same variables but still revert read/write conflict.
Exclusive accesses are acquired in total order, and helping policy similar to previous paper is used as well, so the implementation is lock-free.
Using MWCAS to implement wait-free STM; whenever MWCAS fails, the current transaction asks help from the succeeded process. If the number of total process is bounded, then each process is guaranteed to make progress.
Software Transactional Memory for Dynamic-sized Data Structures


We present the first dynamic STM implementation. Prior STM designs required both the memory usage and the transactions to be defined statically in advance.

This statement seems contradicting with the previous paper, for pre-determined access is not required in the previous paper. Dynamically create transaction is not covered in the two previous papers, though.
For each access to the shared variables in the transaction, READ or WRITE must be specified, and this info is used to check conflicts. If any, contention manager (CM) is consulted, and one straightforward implementation of CM could just abort the transaction, which could cause live-lock. Therefore, the problem of avoid live-lock is delegated to CM.
Requirement for CM: any active transaction that asks sufficiently often must eventually get permission to abort a conflicting transaction.
In a linked-list based set implementation, member and insert could have conflicts when they iterate through the list, even if the insertion point is already passed. Therefore, the release operation is introduced so that objects accessed in READ mode could be ruled out before committing so that the likelihood of  conflicting is reduced.
The release operation has to be called explicitly by the programmers, which could be error prone. (Maybe it’s possible to automatically call it, just like GC?)
Language Support for Lightweight Transactions

In the original STM, each shared word has one associated ownership word, so memory requirement doubles; in hash table STM, each memory location is mapped to one ownership entry, and there could be aliases: multiple locations mapped to a single ownership entry. (In extreme case, one-to-one map, the situation degenerates to the word-base STM.) Transaction execution is private until STMCommit; right before commit, the ownership records corresponding to all memory access in this transactions are acquired using MWCAS.
The space requirement goes down by using hash table (ownership table) with the price that two transactions which doesn’t overlap at all could still conflict, for the memory locations access in those two transactions are mapped into the same ownership record.
Sharing read:

acquire all WRITE locations
check all READ location to see if it has been written by other transactions
commit the transaction
Need to ensure no write happens between 2 and 3, so assign READ_PHASE to the state during all READ until 3, and other transactions that write to the same locations would abort it if it’s in READ_PHASE, or abort themselves if already committed.

During MWCAS, if one ownership entry is already acquired by another transaction, there are two options: abort the current transaction, which is blocking, or using the stealing policy below, which is obstruction-free.
If the conflicting transaction is ACTIVE, it would be aborted, which could incur live-lock, so this approach is only obstruction-free. If the conflicting transaction is COMMITTED, its relevant entry would be merged to the current one. The ownership entry would have a counter keeping track of all the transactions that have pending writes. The owner always has the most recent value, and each transaction must use that value, and decrement the counter. Only reaching zero, will the transaction ownership be removed.
Adaptive Software Transactional Memory

STM design space:

when concurrent objects are acquired (eager or lazy acquire)
how they are acquired (per-object or per transaction metadata)
what they look like when not currently acquired (level of indirection)
non-blocking semantics (obstruction-free vs lock-free)

STM generally consists four phases:

Open: bring the object into the transaction
Acquire: assert the ownership of the object. Not a separate operation, part of Open in eager acquire, and part of Commit in lazy acquire
Commit: make all updates visible
Release: relinquish the exclusive access obtained in Acquire.

DSTM: eager, per-object, indirect object reference, obstruction-free
OSTM: lazy, per-transaction, direct object reference, lock-free
write-dominant: DSTM > OSTM
Whenever a new object is opened, OSTM must revalidate all previously opened objects, both READ and WRITE, while DSTM only needs to check objects opened in READ, for it has already acquired the ownership of the object for WRITE. In write-dominant case, this revalidation overhead in DSTM is much smaller than that in OSTM.
read-dominant: OSTM > DSTM
Since DSTM uses indirect object reference, in this case, it tends to induce an extra cache miss.
16 processor SunFire, 1 to 48 threads
write dominant

DSTM, eager ASTM, ASTM > OSTM, lazy ASTM


IntSet : a sorted list of integers, 0..255; threads insert/delete node concurrently.


LFUCache : uses priority queue to simulate cache replacement in web proxy. Each transaction reads a page from zipf distribution, increment frequency count, and rearrange the queue.


read dominant

ASTM, eager ASTM, lazy ASTM, OSTM > DSTM


IntSetRelease : transactions open objects in read mode and release them, while moving down the list. Once found, the nodes to be updated are upgraded to write mode.


RBTree : Search target nodes in read mode, and convert to write mode after identifying the target.


long transaction

ASTM, lazy ASTM, OSTM > eager ASTM, DSTM
RandomGraphList : a random undirected graph as a sorted linked list, with each node has a sorted neighbours list. Each transaction performs insert/delete nodes.
With eager acquire, reader transactions encounter unnecessary contention with writer transactions for objects that would be released early anyway.
Presentation for the above papers. https://docs.google.com/presentation/d/1FN_9b7bqEkyPtUuFadR2-OcloQCdSYew_Ze1LJxTkgo/edit?pli=1#slide=id.p
Advanced contention management for dynamic software transactional memory

Contention manager spectrum


never aborts the enemy transaction: starvation, poor performance due to preemptive scheduling
always aborts the enemy transaction: starvation, live lock
somewhere in the middle: abort the enemy transactions often enough to tolerate preemption and seldom enough to make starvation unlikely.

visible vs invisible reader

invisible reader: save TMObject-> Locator mapping into the private read set, and compare the current locator with the saved locator during validation time.
visible reader: maintain a list of readers for each TMObject, and writers could see what readers are.
Contention management interfaces

notification methods (hooks) for various events: beginning a transaction, successfully/unsuccessfully committing, attempts to access/acquire an object, and successfully acquisition, etc.
request methods for whether a transaction should (re)start and whether enemy transaction should be aborted.
Basic policy


Polite: exponential back off sleep [0, 2, 4 …] time unit, abort the enemy after reaching a threshold


Karma: judge the amount of work that a transaction has done so far when deciding whether to abort it. Use the number of opened objects to approximate the workload. Priority is reset on commit, but not on abort, so that small transaction could build up priority over time. On opening conflict, the enemy transaction would be aborted if the number of attempts of opening the object exceed the priority difference between the enemy and the current transaction. Otherwise, it backs off a fixed period of time.


Eruption: variant of Karma. On opening conflict, adds the transaction’s priority to the enemy so that enemy could finish quickly and the blocked transaction could resume.


Kindergarten: On opening conflict, abort the enemy if it’s present in hit list; if not, add to hit list, and backs off some time. After this, if it’s still blocked, abort itself, and restart.


Timestamp: set timestamp at the beginning of each transaction. Older transaction would abort younger transactions. Younger transactions would mark the conflicting transactions as defunct after waiting for some time, and two consecutive defunct-marking would abort the transaction. Active transactions clear the defunct flag constantly.


New policy


PublishedTimestamps: Timestamp requires a long period to abort an inactive (usually preempted) transaction. Introduces a recency timestamp, which is updated on every notification event. If the enemy’s recency is old enough to exceed its inactivity threshold, it’s aborted. If an aborted transaction restarts, its inactivity threshold double until a upper bound.


Polka: Polite + Karma
The current transaction backs off priority-diff number of intervals, with each interval grows exponentially.


Prioritized Contention Management

Each thread has a base priority BP that ranks its overall importance relative to other threads.
proportional-share management: each thread’s cumulative throughput is proportional to its base priority;

a thread with base priority 3 should ideally complete 50% more work over any given period of time than one with base priority 2


Karma, Eruption and Polka: opening objects would add BP (instead of 1, before) to its priority


Timestamp variants: One transaction retain its timestamp through BP committed transactions. IOW, they are allowed to act as the oldest several times in a row.


Kindergarten: randomize updates to the list of transactions in favor of which a thread has aborted to probability BP_t/(BP_t + BP_e) for a transaction with base priority BP_t and an enemy with base priority BP_e.


Result

For performance test, measure throughput for each benchmark-manager pairing with both visible and invisible read implementation. For fairness test, there are two configurations, one contains 8 threads, with priority [1..8], the other contains 4 threads, with initial priority [1..4], and inverse the priority [4..1] in the half way of the test.
Evaluation platform: SunFire 6800 16 UltraSPARC processor

IntSet: sorted linked-list, each object is acquired for write access
IntSetUpgrade: sorted linked-list, read-only iteration, acquire for write as needed
RBTree: red-black tree, read-only iteration, acquire for write as needed
Stack: transactional push and pop
ArrayCounter: increment transactions increment each counter in the array from the beginning to the end, decrement transactions decrement each counter in the array from end to the beginning.
LFUCache: simulation of HTTP web proxy with LFU. Requests are generated using zipf distribution.

Throughput

write-dominated benchmarks

Stack, IntSet, ArrayCounter
every pair of transaction conflicts, IOW sequential program. At best, the throughput stays the same when #threads increases.
good performance requires one transaction to dominate the others long enough to finish. Karma and Eruption perform well because of the accumulated priority. Polka’s exponential back off reduces cache contention.
LFUCache

It’s write-dominated as well, so the result is similar to previous ones; since it contains more concurrency, the result shows greater “spread”.
RBTree

A typical transaction would access objects in read-only mode until insertion/deletion point, and perform fix-ups that re-balances the tree, working towards the root and acquire objects as needed.
Up to 4 threads, visible-read is better, which could be attributed to validation overhead.
IntSetUpgrade

Each transaction consists of a series of reads followed by one write, so validation incurs a large throughput penalty for invisible-read.
Throughput result summary

No single manager outperforms all others in every benchmark, but Polka achieves decent throughput even when it’s outperformed. The tradeoff between visible and invisible read remains application-specific.
Fairness

Timestamp protocol achieves the almost perfect fairness.
A fundamental limitation to the technique used in this paper is that by relying on contention manager to effect priority, it’s impossible to discriminate between transactions that don’t encounter each other.
A Flexible Framework for Implementing Software Transactional Memory

Define atomic classes


annotating interfaces with @atomic
pass the interface to the constructor of transactional factory, which uses reflection and run-time class to synthesis an anonymous class implementing the interface
create transactional objects from the factory

@atomic
public interface INode {
  int getValue();
  void setValue(int v);
  INode getNext();
  void setNext(INode v);
}

Factory<INode> factory = dstm2.Thread.makeFactory(INode.class);
INode n = factory.create();

The default factory can only synthesis implementation for methods matching setter and getter signature; for other methods, customised factory, which knows how to implement those, should be provided.
T getField();
void setField(T v);

Default Factory class

1. Base Factory

Divide interface methods into properties and the rest, so that child factories’ logic could be simplified
2. Obstruction-free Factory

Similar to DSTM, locator object is created on object opening for write (acquiring). Two variants exist for this factory, visible reader and invisible reader. visible reader is like reader lock, writers must abort them before writing. invisible readers must validate the reader list before committing.
3. Shadow Factory

For every user defined field, another shadow field is created along with it. The shadow fields holds the old version, and update is done on the real field. When an object is opened, it checked the status of previous transaction holding this object, if committed, the real field holds the latest version, it would call backup() to copy each real field to its shadow field; if aborted, the shadow field holds the latest version, it would call restore() to copy from the shadow fields.
McRT-STM: a high performance software transactional memory system for a multi-core runtime

1. non-blocking guarantee:

STM acquires all relevant locks before committing. The transaction would be blocked if a lock is not available in order to maximise throughput.

If a transaction T1 finds that a lock has been acquired, it queries the McRT scheduler, and only if the lock owner T2 is not running does T1 abort T2.

2. reader-writer lock vs read versioning and writer locking

reader-writer lock

atomically increment a counter to keep track of all readers, and when the counter reaches zero, it means all readers release their lock, and it’s free to be acquire by reader or writer. atomically set the upgrade bit to indicate upgrading intention so that no new readers could acquire this lock, and after all current readers release their locks, it acquires the write lock.
read versioning


a checks the lock-word to insure no writer currently.
read the value and its version
validate the version before commit

(It didn’t say revalidating on each read
read versioning > reader/writer lock
because a) read versioning eliminates readers to write to the lock, b) upgrading requires all readers to relinquish their read lock, which means an atomic operation on the lock word with destructive cache effect.
3. write buffering vs undo logging

In undo-base scheme, read after write (RAW) can be handle trivially. In write buffering scheme, lock acquiring is postponed until commit time, which reduces the lock-holding time.
undo logging > write buffering
because: having to search for the most recent value in write buffering
(It’s not very clear why this searching in write buffering can’t be eliminated. I feel it should be feasible to create local aliases for all accessed global resources, word-based, in the transaction, and use the local versions when the transaction is active, and replace the global resources when commit. For the read, local versions are used as well, so no searching is needed, I think.)
4. object vs cache-line conflict detection

McRT memory manager (McRT-Malloc) saves small objects (<8K) into a size-segregated heap, and large objects into another heap. Object-base locking is only supported for small objects, and for the rest (large objects, stack-allocated objects, global variables) cache-line locking is used.
object-based locking:
a) inline-locks: allocate locks inline with objects, good for cache locality, but pay the space price even when no transaction is used.
b) lock-table: allocate a table pointer with objects
cache line is a special case of hash table STM.
No significant difference between different detection algo.
STM overhead:

read barrier: read the version number if not locked and return, otherwise, consult contention manager, wait or retry or abort.
validation: whenever a new object is opened (read), need to validate all previously opened objects.

x = y = 0

atomic {
  // x reads the value before the other transaction, while y reads the updated value
  if (x!=y) { while(1); }
}

atomic {
  x++;
  y++;
}

The performance of McRT-STM and locks is comparable in spam filter in sendmail.
Compiler and Runtime Support for Efficient Software Transactional Memory

The core platform consists of three major components:

Open Runtime Platform (ORP) virtual machine: for Java and Common Language Infrastructure
StarJIT: dynamic compiler for Java and C#
McRT-STM: STM runtime

Language constructs


atomic
retry: block the thread, and restart if the depended variables changes
orelse: if the first block is completed, the second is discarded, if the first block calls retry, the second block is executed.

STM implementation

two-phase locking, versioning read, update in-place and undo log.
supports object and word conflict detection: each object is padded with one extra slot to store the transaction record pointer or a hash initialised at allocation time. The pointer is used for object conflict detection. The hash plus the index of the field is used to index the global transaction record table.
contention manager: exponential backoff
no need for revalidation on each opening in a managed environment, for runtime-exception would be handle by try-catch block. To prevent infinite loop, revalidation is introduced on backward branches.
closed nested transactions: transactions remain isolated until parent commits
allocates a transaction memento in the activation record to mark the beginning of the nested transaction so that partial abort could be implemented.
Code Generation and Optimization for Transactional Memory Constructs in an Unmanaged Language

Language construct

pragma:

tm_atomic:
tm_function: for functions and function pointers used inside a transaction

Implementation


Builds on top of McRT-STM, extending with global timestamp.
Uses cache-line conflict detection.
global timestamp incremented by each transaction on commit or abort

(I feel increment the timestamp after abort is not needed, but it’s possible that I overlooked some details.)
Need to account for stack allocated variables when process undo-logs on abort. The potential problem of a straightforward implementation is shown below, and the solution is to save stack pointer (SP) on starting a transaction, and undo-logs touching addresses between the saved SP and the current SP are skipped.
atomic {
  foo();
  … // aborted and process undo-log
}

int foo() {
  int a[100];
  bar(&a[0]);
}

void bar(int *p) {
  *p = … // generates undo log
  ...
}

Kicking the Tires of Software Transactional Memory: Why the Going Gets Tough

Previous STM focuses:

synthetic workload with small transaction
API-base STM, where each memory access is annotated manually

introduces a compiler-base STM, modified Intel C/C++ with McRT-STM runtime.
benchmark


Game Physics Engine (GPE) performs collision detection by solving ordinary differential equations. This application represents the core computation found in many 3D graphics applications such as 3D games.


Smoothed Particle Hydrodynamics (SPH) performs particle dynamics simulation, a typical application found in computer animated fluid.


Sphinx is a speech recognition program, part of ALPBench.


Genome is a gene sequencing program.


Kmeans is a machine learning algorithm applied to a set of data points.


Vacation models the database transactions of a travel agency.


The last three are from STAMP (Stanford Transactional Applications for Multiprocessing)
GLOCK version is created by acquiring a single global lock for each critical region.
1. false conflict


a false conflictoccurs on an STM when two different addresses are mapped to the same transaction record.


two different address on the same cache line (false sharing).
local-induced: if either of the conflicting address is thread-local.
global-induced: if neither of the conflicting address is thread-local.

Even with false sharing, workloads scales well. Local-induced false conflict could be reduced by avoiding instrumentation of thread-local memory access. Global-induced false conflict could be reduced by using a different hash function.
Originally, only use 4 out of 64 bytes (cache line). Now uses 4 extra bits to index into 16 (2^4) slots in the cache line.
2. over-instrumentation

Hard to distinguish thread-local from regular memory access, so compiler emit more read/write barrier than necessary.
Proposed tm_waiver pragma for functions or blocks so that the body will not be instrumented. Similar to early release, powerful and dangerous.
3. privatization

atomic {
  node = list->head
  list->head = null
}
process(node)

atomic {
  node = list-> head
  if (node) {
    node->data = ...
  }
}

function excise_and_print (int v)
1     begin transaction
2       prev = list_head; curr = prev->next
3       while curr->val != v
4         prev = curr; curr = curr->next
5         if curr == NULL goto 7
6       prev->next = NULL
7     end transaction
8     while curr != NULL
9        print curr->val
10      curr = curr->next

T1 calls excise_and_print(5) and T2 calls excise_and_print(7). Given the list is [1..10], if T1 executes before T2, the correct output should be T1: [5..10], T2: []; if T2 executes before T1, the correct output should be T1: [5..6], T2: [7..10].
However, output like T1: [5..6], T2: [] could be produced if STM uses undo-logs, because there exists a delay between a failed transaction aborts and when its speculative writes are rolled back, so T1 observes aborted in-place writes. Other wrong outputs could be emitted for STM using redo-logs, locator objects, etc.
The bottom line is that STM (weak isolation) can’t help the code not using atomic block.
Software Transactional Memory: Why Is It Only a Research Toy?

STM: no special requirement on hardware, but too much runtime overhead
HTM: high implementation and verification cost; performance degrades after overflowing the hardware constraints
Hybrid: how to mix the above two remains unclear; what hardware primitives should be exposed from STM to provide the transactional semantics.
semantics issues:

interaction with non-transactional codes: access shared data from outside of a transaction, and using locks inside a transaction
Exceptions and serializability—how to handle exceptions and propagate consistent exception information from within a transactional context, and how to guarantee that transactional execution respects a correct ordering of operations. (Don’t understand it clearly, reprint from the original paper.)
interaction with code that can’t be transactionalized
Livelock, or the system guarantee that all transactions make progress even in the presence of conflicts. (Some STM implementation is lock-free, if this issue is talking about livelock.)

STMAP (Stanford Transactional Applications for Multiprocessing) and Lonestar large transaction benchmark
IBM STM
Intel STM
Sun TL2 STM
b+tree: an implementation of database indexing operations on a b-tree data structure for which the data is stored only on the tree leaves
delaunay: implements the Delaunay Mesh Refinement algorithm
genome: Gene sequencing benchmark
kmeans: K-means clustering benchmark
vacation: Travel reservation system benchmark
All three STM perform rather bad. (This re-validation overhead is huge. Maybe we could use sth like reader/writer lock to support sharing read, and detect write conflict eagerly?)
Using a cycle-accurate simulator of PowerPC architecture, it’s possible to capture the overhead at the instruction level. Hugh single-threaded overhead. (I guess they only use IBM STM, for the previous evaluation shows that the three STM are mostly on par.)