Description of concurrent operations by underlying object store (mostly S3) in lakeFS alternatives

How concurrent are lakeFS alternatives?

DeltaLake

Delta Lake concurrency goes through its LogStore. On S3 this amounts to single-writer concurrency (a writer being a whole Spark cluster):

Delta Lake supports concurrent reads from multiple clusters, but concurrent writes to S3 must originate from a single Spark driver in order for Delta Lake to provide transactional guarantees.
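For context: the reason all writes must funnel through one driver is that a check-then-write against plain S3 is not atomic, so two independent writers can both conclude that the next Delta log entry is free and silently overwrite each other. A minimal sketch of that race, with hypothetical bucket and key names (boto3):

```python
# Sketch of why concurrent Delta writers on plain S3 are unsafe without extra
# coordination: the "does the next log entry exist yet?" check and the
# subsequent write are two separate S3 calls, so two writers can interleave.
# Bucket, key, and the JSON body are hypothetical.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-table-bucket"                                   # hypothetical
NEXT_COMMIT = "table/_delta_log/00000000000000000042.json"   # hypothetical next version

def try_commit(actions_json: bytes) -> bool:
    # Step 1: check that nobody has written this version yet.
    try:
        s3.head_object(Bucket=BUCKET, Key=NEXT_COMMIT)
        return False                              # someone else already committed 42
    except ClientError as e:
        if e.response["Error"]["Code"] not in ("404", "NoSuchKey", "NotFound"):
            raise
    # Step 2: write it.  Another writer may have done the same between
    # steps 1 and 2, and the later PUT silently wins -- a lost update.
    s3.put_object(Bucket=BUCKET, Key=NEXT_COMMIT, Body=actions_json)
    return True
```

Funneling every write through a single Spark driver lets Delta serialize steps 1 and 2 in process memory instead of relying on the object store.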

Nessie

Nessie supports (or tries to, depending on the underlying system?) these three isolation levels:

  • Read committed (Iceberg if refresh is allowed, Delta);
  • Repeatable read (delegated somehow through Hive's "HMS Bridge", or Iceberg if refreshes are avoided);
  • Serializable (no examples given).

Documentation about the Nessie commit kernel says:

Nessie’s production commit kernel is optimized to provide high commit throughput against a distributed key value store that provides record-level ACID guarantees. Today, this kernel is built on top of DynamoDB. The commit kernel is the heart of Nessie’s operations and enables it to provide lightweight creation of new tags and branches, merges, and rebases, all with a very high concurrent commit rate.

And everything goes through the DynamoDB refs table:

The refs table will have objects equal to the current number of active tags and branches. This will generally be small (10s-1000s). All commits run through this table and thus the writes and reads of this table should be provisioned based on the amount of read and write operations expected per second. Since the dataset is small, sharding will be unlikely to happen on this table. Scans are regularly done on this table.

The design goal is modest: 300 writes/second.
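A plausible shape for "all commits run through this table" is a conditional update: advance a branch pointer only if it still points at the commit the writer based its work on. A minimal sketch with boto3; the table name, key, and attribute names are assumptions for illustration, not Nessie's actual schema:

```python
# Conditional (compare-and-swap) advance of a branch head in DynamoDB.
# Table and attribute names are hypothetical, not Nessie's real schema.
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
REFS_TABLE = "refs"   # hypothetical

def advance_branch(branch: str, expected_head: str, new_head: str) -> bool:
    """Move `branch` from expected_head to new_head; fail if someone else
    committed first (record-level ACID is all this needs from DynamoDB)."""
    try:
        dynamodb.update_item(
            TableName=REFS_TABLE,
            Key={"ref_name": {"S": branch}},
            UpdateExpression="SET head_hash = :new",
            ConditionExpression="head_hash = :expected",
            ExpressionAttributeValues={
                ":new": {"S": new_head},
                ":expected": {"S": expected_head},
            },
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False   # lost the race; caller rebases and retries
        raise
```

One conditional write per commit against a small, hot table is consistent with the modest 300 writes/second goal above.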

Nessie-on-DeltaLake

Odd...

Nessie is able to interact with Delta Lake by implementing a custom version of Delta's LogStore interface. This ensures that all filesystem changes are recorded by Nessie as commits. The benefit of this approach is that the core ACID primitives are handled by Nessie. The limitations around concurrency that Delta would normally have are removed: any number of readers and writers can simultaneously interact with a Nessie-managed Delta Lake table.
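As an illustration of that claim (and not Nessie's actual code): if the LogStore's "write this log entry, but fail if it already exists" operation is delegated to a transactional commit, a writer that loses a race gets a conflict it can retry instead of silently clobbering another writer's log entry. A hypothetical sketch:

```python
# Hypothetical sketch of a LogStore-like write path that delegates atomicity
# to a transactional ref store ("nessie_commit" is a stand-in, not a real API).
class ConflictError(Exception):
    pass

def nessie_commit(expected_head: str, log_entry_path: str, content: bytes) -> str:
    """Stand-in for an atomic commit against Nessie; raises ConflictError if
    the branch moved since expected_head.  Returns the new head."""
    raise NotImplementedError

def write_log_entry(get_head, prepare_entry, max_retries: int = 5) -> str:
    """Many writers may call this concurrently; conflicts are detected,
    not silently lost, so Delta's single-writer restriction goes away."""
    for _ in range(max_retries):
        head = get_head()                       # current branch head
        path, content = prepare_entry(head)     # next _delta_log entry for that head
        try:
            return nessie_commit(head, path, content)
        except ConflictError:
            continue                            # someone committed first; rebase and retry
    raise ConflictError("too many concurrent commits")
```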

Iceberg

Best (recent) summary is on this issue from 2020-09.

Using Iceberg requires a catalog that can atomically swap the pointer to the current metadata file. This can be done with a compare-and-swap or a lock/unlock API.
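In the compare-and-swap flavor, a commit writes the new metadata file first (which needs no coordination) and only then atomically swaps the table's pointer, failing on conflict so the caller can refresh and retry. A hedged sketch of that protocol; `catalog_cas` is a stand-in for whatever primitive the catalog provides, not a real Iceberg API:

```python
# Sketch of the commit protocol an Iceberg catalog must support: swap the
# pointer to the table's current metadata file atomically.  `catalog_cas` is
# a stand-in for the catalog's primitive (Hive lock, DynamoDB conditional
# write, etc.); it is not a real Iceberg API.
def catalog_cas(table: str, expected_location: str, new_location: str) -> bool:
    """Atomically set the table's metadata pointer to new_location iff it is
    still expected_location.  Returns False on conflict."""
    raise NotImplementedError

def commit(table: str, current_location: str, write_new_metadata) -> str:
    """write_new_metadata(current_location) writes a new metadata.json to the
    object store and returns its path; the pointer swap makes it visible."""
    new_location = write_new_metadata(current_location)
    if catalog_cas(table, current_location, new_location):
        return new_location
    raise RuntimeError("concurrent commit: refresh metadata and retry")
```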

The Iceberg folks work with a Hive metastore, and "[a]nyone could easily build an integration for any catalog". Nessie shows up on that issue and mentions that they can do it for you... with DynamoDB.

There's PR 1608, which adds support for "Glue", but it also "uses DynamoDB for the locking support missing in Glue". It seems to have gone in recently; no idea whether it has already been released. A lock manager on top of DynamoDB is in #2034.
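The locking pattern involved is essentially a conditional put that succeeds only if no lock item exists yet, plus a conditional delete to release. A minimal sketch with boto3; the table and attribute names are assumptions, and a real lock manager also needs a lease/expiry to recover from crashed holders:

```python
# Minimal DynamoDB-backed lock: acquire by creating the lock item only if it
# does not already exist, release by deleting it.  Table/attribute names are
# hypothetical; production code also needs a lease/expiry for crashed holders.
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
LOCK_TABLE = "iceberg_locks"   # hypothetical

def acquire(lock_id: str, owner: str) -> bool:
    try:
        dynamodb.put_item(
            TableName=LOCK_TABLE,
            Item={
                "lock_id": {"S": lock_id},
                "owner": {"S": owner},
                "acquired_at": {"N": str(int(time.time()))},
            },
            ConditionExpression="attribute_not_exists(lock_id)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False   # someone else holds the lock
        raise

def release(lock_id: str, owner: str) -> None:
    # Only the current owner may release, so a stale client can't unlock others.
    dynamodb.delete_item(
        TableName=LOCK_TABLE,
        Key={"lock_id": {"S": lock_id}},
        ConditionExpression="#o = :owner",
        ExpressionAttributeNames={"#o": "owner"},
        ExpressionAttributeValues={":owner": {"S": owner}},
    )
```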
