
WANalytics: Analytics for a Geo-Distributed Data-Intensive World

WANalytics is proposed, a system that pushes computation to edge data centers, automatically optimizing workflow execution plans and replicating data when needed, which delivers substantial gains for three standard benchmarks: TPC-CH, Berkeley Big Data, and BigBench.

Large organizations today operate data centers around the globe where massive amounts of data are produced and consumed by local users. Despite their geographically diverse origin, such data must be analyzed/mined as a whole. We call the problem of supporting rich DAGs of computation across geographically distributed data Wide-Area Big-Data (WABD). To the best of our knowledge, WABD is not supported by currently deployed systems nor sufficiently studied in literature; it is addressed today by continuously copying raw data to a central location for analysis. We observe from production workloads that WABD is important for large organizations, and that centralized solutions incur substantial cross-data center network costs. We argue that these trends will only worsen as the gap between data volumes and transoceanic bandwidth widens. Further, emerging concerns over data sovereignty and privacy may trigger government regulations that can threaten the very viability of centralized solutions. To address WABD we propose WANalytics, a system that pushes computation to edge data centers, automatically optimizing workflow execution plans and replicating data when needed. Our Hadoop-based prototype delivers a 257× reduction in WAN bandwidth on a production workload from Microsoft. We round out our evaluation by also demonstrating substantial gains for three standard benchmarks: TPC-CH, Berkeley Big Data, and BigBench.

Millions of Tiny Databases

Physalia is a transactional key-value store, optimized for use in large-scale cloud control planes, which takes advantage of knowledge of transaction patterns and infrastructure design to offer both high availability and strong consistency to millions of clients.

Starting in 2013, we set out to build a new database to act as the configuration store for a high-performance cloud block storage system (Amazon EBS). This database needs to be not only highly available, durable, and scalable but also strongly consistent. We quickly realized that the constraints on availability imposed by the CAP theorem, and the realities of operating distributed systems, meant that we didn’t want one database. We wanted millions. Physalia is a transactional key-value store, optimized for use in large-scale cloud control planes, which takes advantage of knowledge of transaction patterns and infrastructure design to offer both high availability and strong consistency to millions of clients. Physalia uses its knowledge of datacenter topology to place data where it is most likely to be available. Instead of being highly available for all keys to all clients, Physalia focuses on being extremely available for only the keys it knows each client needs, from the perspective of that client. This paper describes Physalia in context of Amazon EBS, and some other uses within Amazon Web Services. We believe that the same patterns, and approach to design, are widely applicable to distributed systems problems like control planes, configuration management, and service discovery.

Online Migration for Geo-distributed Storage Systems

This work introduces distributed storage overlays, a simple abstraction that represents data as stacked layers in different places that can be readily used to cache data objects, migrate these caches, and migrate the home of data objects.

We consider the problem of migrating user data between data centers. We introduce distributed storage overlays, a simple abstraction that represents data as stacked layers in different places. Overlays can be readily used to cache data objects, migrate these caches, and migrate the home of data objects. We implement overlays as part of a key-value object store called Nomad, designed to span many data centers. Using Nomad, we compare overlays against common migration approaches and show that overlays are more flexible and impose less overhead. To drive migration decisions, we propose policies for predicting the location of future accesses, focusing on a web mail application. We evaluate the migration policies using real traces of user activity from Hotmail.

A modular design for geo-distributed querying: work in progress report

This paper proposes a modular architecture that is flexible and allows query processing systems to make trade-offs according to different use case requirements and describes adaptive mechanisms that make use of this flexibility to enable queries on secondary attributes to dynamically adjust to query and write operation workloads.

Most distributed storage systems provide limited abilities for querying data by attributes other than their primary keys. Supporting efficient search on secondary attributes is challenging as applications pose varying requirements to query processing systems, and no single system design can be suitable for all needs. In this paper, we show how to overcome these challenges in order to extend distributed data stores to support queries on secondary attributes. We propose a modular architecture that is flexible and allows query processing systems to make trade-offs according to different use case requirements. We describe adaptive mechanisms that make use of this flexibility to enable query processing systems to dynamically adjust to query and write operation workloads.

Bandwidth constrained placement in a WAN

A simple algorithm to generate a bandwidth-constrained placement by hierarchically refining an initial per-cache greedy placement is developed, and it is proved that this hierarchical algorithm generates a placement whose expected access time is within a constant factor of the optimal placement's expected access time.

In this paper, we examine the bandwidth-constrained placement problem, focusing on trade-offs appropriate for wide area network (WAN) environments. The goal is to place copies of objects at a collection of distributed caches to minimize expected access times from distributed clients to those objects subject to a maximum bandwidth constraint at each cache. We develop a simple algorithm to generate a bandwidth-constrained placement by hierarchically refining an initial per-cache greedy placement. We prove that this hierarchical algorithm generates a placement whose expected access time is within a constant factor of the optimal placement's expected access time. We then proceed to extend this algorithm to compute close to optimal placement strategies for dynamic environments.
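
To make the starting point concrete, here is a minimal sketch (all inputs and names are assumptions for illustration) of a per-cache greedy placement under a bandwidth budget, the kind of initial placement the paper's hierarchical algorithm then refines:

```python
# Illustrative sketch, not the paper's exact algorithm: each cache greedily
# takes the objects with the largest benefit per unit of bandwidth, subject
# to its bandwidth budget. Hierarchical refinement would reconcile these
# per-cache choices afterwards.

def greedy_placement(caches, objects, benefit, cost, budget):
    """benefit[c][o]: expected access-time saving if cache c holds object o.
    cost[o]: bandwidth consumed by serving object o from a cache.
    budget[c]: maximum bandwidth available at cache c."""
    placement = {c: set() for c in caches}
    for c in caches:
        remaining = budget[c]
        # Rank objects by benefit per unit of bandwidth consumed.
        for o in sorted(objects, key=lambda o: benefit[c][o] / cost[o], reverse=True):
            if cost[o] <= remaining and benefit[c][o] > 0:
                placement[c].add(o)
                remaining -= cost[o]
    return placement

caches = ["c1", "c2"]
objects = ["o1", "o2", "o3"]
benefit = {"c1": {"o1": 10, "o2": 4, "o3": 1}, "c2": {"o1": 2, "o2": 8, "o3": 5}}
cost = {"o1": 5, "o2": 3, "o3": 2}
budget = {"c1": 6, "c2": 5}
print(greedy_placement(caches, objects, benefit, cost, budget))
```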

Surviving Congestion in Geo-Distributed Storage Systems

Vivace provides strong consistency and replicates data across sites for access locality and disaster tolerance, and is designed to cope well with network congestion across sites, which occurs because the bandwidth across sites is smaller than within sites.

We present Vivace, a key-value storage system for web applications that span many geographically-distributed sites. Vivace provides strong consistency and replicates data across sites for access locality and disaster tolerance. Vivace is designed to cope well with network congestion across sites, which occurs because the bandwidth across sites is smaller than within sites. To deal with congestion, Vivace relies on two novel algorithms that prioritize a small amount of critical data to avoid delays due to congestion. We evaluate Vivace to show its feasibility and effectiveness.

Efficient Replica Maintenance for Distributed Storage Systems

The paper proposes the Carbonite replication algorithm for keeping data durable at a low cost and shows that Carbonite is able to keep all data durable and uses 44% more network traffic than a hypothetical system that only responds to permanent failures.

This paper considers replication strategies for storage systems that aggregate the disks of many nodes spread over the Internet. Maintaining replication in such systems can be prohibitively expensive, since every transient network or host failure could potentially lead to copying a server's worth of data over the Internet to maintain replication levels.

The following insights in designing an efficient replication algorithm emerge from the paper's analysis. First, durability can be provided separately from availability; the former is less expensive to ensure and a more useful goal for many wide-area applications. Second, the focus of a durability algorithm must be to create new copies of data objects faster than permanent disk failures destroy the objects; careful choice of policies for what nodes should hold what data can decrease repair time. Third, increasing the number of replicas of each data object does not help a system tolerate a higher disk failure probability, but does help tolerate bursts of failures. Finally, ensuring that the system makes use of replicas that recover after temporary failure is critical to efficiency.

Based on these insights, the paper proposes the Carbonite replication algorithm for keeping data durable at a low cost. A simulation of Carbonite storing 1 TB of data over a 365 day trace of PlanetLab activity shows that Carbonite is able to keep all data durable and uses 44% more network traffic than a hypothetical system that only responds to permanent failures. In comparison, Total Recall and DHash require almost a factor of two more network traffic than this hypothetical system.
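
The second insight above lends itself to a back-of-the-envelope check: an object stays durable only if new copies can be created faster than permanent failures destroy existing ones. The sketch below is an illustrative simplification with assumed rates, not Carbonite's actual model:

```python
# Durability as a race between repair and permanent failure. All rates
# below are illustrative assumptions, not numbers from the paper.

def durable(replicas, repair_bandwidth_bytes_s, object_size_bytes,
            disk_failure_rate_per_s):
    # Expected rate at which copies of this object are permanently lost.
    loss_rate = replicas * disk_failure_rate_per_s
    # Rate at which one new copy can be re-created from a surviving replica.
    repair_rate = repair_bandwidth_bytes_s / object_size_bytes
    return repair_rate > loss_rate

# Example: 3 replicas of a 1 GB object, 10 MB/s of repair bandwidth,
# one permanent disk failure per disk-year.
print(durable(3, 10e6, 1e9, 1 / (365 * 24 * 3600)))  # True
```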

Near-Optimal Latency Versus Cost Tradeoffs in Geo-Distributed Storage

This work shows that the key to addressing this sub-optimality is to 1) allow for erasure coding, not just replication, of data across data centers, and 2) mitigate the resultant increase in read and write latencies by rethinking how to enable consensus across the wide-area network.

By replicating data across sites in multiple geographic regions, web services can maximize availability and minimize latency for their users. However, when sacrificing data consistency is not an option, we show that service providers today have to incur significantly higher cost to meet desired latency goals than the lowest cost theoretically feasible. We show that the key to addressing this sub-optimality is to 1) allow for erasure coding, not just replication, of data across data centers, and 2) mitigate the resultant increase in read and write latencies by rethinking how to enable consensus across the wide-area network. Our extensive evaluation mimicking web service deployments on the Azure cloud service shows that we enable near-optimal latency versus cost tradeoffs.
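
A quick worked comparison shows why erasure coding opens up the cost side of this tradeoff; the code parameters below are illustrative, not the paper's configurations:

```python
# Raw-storage overhead: r-way replication stores r copies of every byte,
# while a (k, m) erasure code stores k data fragments plus m parity
# fragments, any k of which suffice to reconstruct the data.

def storage_overhead_replication(r):
    return float(r)

def storage_overhead_erasure(k, m):
    return (k + m) / k

print(storage_overhead_replication(3))  # 3.0x raw storage, tolerates 2 losses
print(storage_overhead_erasure(6, 3))   # 1.5x raw storage, tolerates 3 losses
```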

Performance Isolation and Fairness for Multi-Tenant Cloud Storage

Pisces achieves per-tenant weighted fair shares of the aggregate resources of the shared service, even when different tenants' partitions are co-located and when demand for different partitions is skewed, time-varying, or bottlenecked by different server resources.

Shared storage services enjoy wide adoption in commercial clouds. But most systems today provide weak performance isolation and fairness between tenants, if at all. Misbehaving or high-demand tenants can overload the shared service and disrupt other well-behaved tenants, leading to unpredictable performance and violating SLAs.

This paper presents Pisces, a system for achieving datacenter-wide per-tenant performance isolation and fairness in shared key-value storage. Today's approaches for multi-tenant resource allocation are based either on per-VM allocations or hard rate limits that assume uniform workloads to achieve high utilization. Pisces achieves per-tenant weighted fair shares (or minimal rates) of the aggregate resources of the shared service, even when different tenants' partitions are co-located and when demand for different partitions is skewed, time-varying, or bottlenecked by different server resources. Pisces does so by decomposing the fair sharing problem into a combination of four complementary mechanisms--partition placement, weight allocation, replica selection, and weighted fair queuing--that operate on different time-scales and combine to provide system-wide max-min fairness.

Our evaluation shows that the Pisces storage prototype achieves nearly ideal (0.99 Min-Max Ratio) weighted fair sharing, strong performance isolation, and robustness to skew and shifts in tenant demand. These properties are achieved with minimal overhead (<3%), even when running at high utilization (more than 400,000 requests/second/server for 10B requests).
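
As a reference point for what the four mechanisms jointly approximate, the sketch below computes weighted max-min fair shares centrally by textbook water-filling; it is not Pisces' actual distributed mechanism, and the tenant names, demands, and weights are assumptions:

```python
# Weighted max-min fairness by water-filling: repeatedly give each active
# tenant its weighted share of the remaining capacity, capping tenants at
# their demand and redistributing what they cannot use.

def weighted_max_min(capacity, demands, weights):
    alloc = {t: 0.0 for t in demands}
    active = set(demands)
    remaining = capacity
    while active and remaining > 1e-9:
        total_w = sum(weights[t] for t in active)
        satisfied = {t for t in active
                     if alloc[t] + remaining * weights[t] / total_w >= demands[t]}
        if not satisfied:
            for t in active:
                alloc[t] += remaining * weights[t] / total_w
            remaining = 0.0
        else:
            for t in satisfied:
                remaining -= demands[t] - alloc[t]
                alloc[t] = demands[t]
                active.remove(t)
    return alloc

# Tenant a is capped by capacity; b and c are capped by their own demand.
print(weighted_max_min(100.0, {"a": 80.0, "b": 30.0, "c": 10.0},
                       {"a": 2, "b": 1, "c": 1}))
```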

Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3

This paper reports the experience applying lightweight formal methods to validate the correctness of ShardStore, a new key-value storage node implementation for the Amazon S3 cloud object storage service.

This paper reports our experience applying lightweight formal methods to validate the correctness of ShardStore, a new key-value storage node implementation for the Amazon S3 cloud object storage service. By "lightweight formal methods" we mean a pragmatic approach to verifying the correctness of a production storage node that is under ongoing feature development by a full-time engineering team. We do not aim to achieve full formal verification, but instead emphasize automation, usability, and the ability to continually ensure correctness as both software and its specification evolve over time. Our approach decomposes correctness into independent properties, each checked by the most appropriate tool, and develops executable reference models as specifications to be checked against the implementation. Our work has prevented 16 issues from reaching production, including subtle crash consistency and concurrency problems, and has been extended by non-formal-methods experts to check new features and properties as ShardStore has evolved.

Transaction chains: achieving serializability with low latency in geo-distributed storage systems

It is shown that it is possible to obtain both serializable transactions and low latency, under two conditions: transactions are known ahead of time, permitting an a priori static analysis of conflicts, and transactions are structured as transaction chains consisting of a sequence of hops.

Currently, users of geo-distributed storage systems face a hard choice between having serializable transactions with high latency, or limited or no transactions with low latency. We show that it is possible to obtain both serializable transactions and low latency, under two conditions. First, transactions are known ahead of time, permitting an a priori static analysis of conflicts. Second, transactions are structured as transaction chains consisting of a sequence of hops, each hop modifying data at one server. To demonstrate this idea, we built Lynx, a geo-distributed storage system that offers transaction chains, secondary indexes, materialized join views, and geo-replication. Lynx uses static analysis to determine if each hop can execute separately while preserving serializability---if so, a client need wait only for the first hop to complete, which occurs quickly. To evaluate Lynx, we built three applications: an auction service, a Twitter-like microblogging site and a social networking site. These applications successfully use chains to achieve low latency operation and good throughput.

Building consistent transactions with inconsistent replication

This paper presents TAPIR -- the Transactional Application Protocol for Inconsistent Replication, the first transaction protocol to use a novel replication protocol, called inconsistent replication, that provides fault tolerance without consistency.

Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google's Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use -- in part because they require costly replication protocols, like Paxos, for fault tolerance. In this paper, we present a new approach that makes transactional storage systems more affordable: we eliminate consistency from the replication protocol while still providing distributed transactions with strong consistency to applications. We present TAPIR -- the Transactional Application Protocol for Inconsistent Replication -- the first transaction protocol to use a novel replication protocol, called inconsistent replication, that provides fault tolerance without consistency. By enforcing strong consistency only in the transaction protocol, TAPIR can commit transactions in a single round-trip and order distributed transactions without centralized coordination. We demonstrate the use of TAPIR in a transactional key-value store, TAPIR-KV. Compared to conventional systems, TAPIR-KV provides better latency and throughput.

Ambry: LinkedIn's Scalable Geo-Distributed Object Store

Ambry is a production-quality system for storing large immutable data (called blobs) designed in a decentralized way and leverages techniques such as logical blob grouping, asynchronous replication, rebalancing mechanisms, zero-cost failure detection, and OS caching.

The infrastructure beneath a worldwide social network has to continually serve billions of variable-sized media objects such as photos, videos, and audio clips. These objects must be stored and served with low latency and high throughput by a system that is geo-distributed, highly scalable, and load-balanced. Existing file systems and object stores face several challenges when serving such large objects. We present Ambry, a production-quality system for storing large immutable data (called blobs). Ambry is designed in a decentralized way and leverages techniques such as logical blob grouping, asynchronous replication, rebalancing mechanisms, zero-cost failure detection, and OS caching. Ambry has been running in LinkedIn's production environment for the past 2 years, serving up to 10K requests per second across more than 400 million users. Our experimental evaluation reveals that Ambry offers high efficiency (utilizing up to 88% of the network bandwidth), low latency (less than 50 ms latency for a 1 MB object), and load balancing (improving imbalance of request rate among disks by 8x-10x).

Sharding the Shards: Managing Datastore Locality at Scale with Akkio

Measurements from the production environment show that Akkio reduces access latencies by up to 50%, cross-datacenter traffic by up to 50%, and storage footprint by up to 40% compared to reasonable alternatives.

Akkio is a locality management service layered between client applications and distributed datastore systems. It determines how and when to migrate data to reduce response times and resource usage. Akkio primarily targets multi-datacenter geo-distributed datastore systems. Its design was motivated by the observation that many of Facebook's frequently accessed datasets have low R/W ratios that are not well served by distributed caches or full replication. Akkio's unit of migration is called a µ-shard . Each µ-shard is designed to contain related data with some degree of access locality. At Facebook, µ-shards have become a first-class abstraction.

Akkio went into production at Facebook in 2014, and it currently manages ∼100 PB of data. Measurements from our production environment show that Akkio reduces access latencies by up to 50%, cross-datacenter traffic by up to 50%, and storage footprint by up to 40% compared to reasonable alternatives. Akkio is scalable: it can support trillions of µ-shards and process many tens of millions of data access requests per second. And it is portable: it currently supports five datastore systems.

Bigtable: A Distributed Storage System for Structured Data

The simple data model provided by Bigtable is described, which gives clients dynamic control over data layout and format, and the design and implementation of Bigtable are described.

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this article, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
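
A minimal sketch of that data model: a sparse, sorted map from (row, column, timestamp) to an uninterpreted value, where reads return the newest version at or before a requested timestamp. The class and the example cells (in the spirit of the paper's webtable example) are illustrative, not Bigtable's API:

```python
# Bigtable's model, reduced to a dictionary: (row, column) -> versions,
# each version a (timestamp, value) pair kept newest-first.

class TinyBigtableModel:
    def __init__(self):
        self.cells = {}

    def put(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest first

    def get(self, row, column, timestamp=None):
        for ts, value in self.cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return ts, value
        return None

t = TinyBigtableModel()
t.put("com.cnn.www", "contents:", 3, "<html>...v3")
t.put("com.cnn.www", "contents:", 5, "<html>...v5")
t.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
print(t.get("com.cnn.www", "contents:"))               # (5, '<html>...v5')
print(t.get("com.cnn.www", "contents:", timestamp=4))  # (3, '<html>...v3')
```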

PNUTS: Yahoo!'s hosted data serving platform

PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees and utilizes automated load-balancing and failover to reduce operational complexity.

We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!'s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.

Megastore: Providing Scalable, Highly Available Storage for Interactive Services

Megastore provides fully serializable ACID semantics within fine-grained partitions of data, which allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between datacenters.

Megastore is a storage system developed to meet the requirements of today’s interactive online services. Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability. We provide fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between datacenters. This paper describes Megastore’s semantics and replication algorithm. It also describes our experience supporting a wide range of Google production services built with Megastore.

Yugong: Geo-Distributed Data and Job Placement at Scale

Yugong is presented — a system that manages data placement and job placement in Alibaba’s geo-distributed DCs, with the objective to minimize cross-DC bandwidth usage.

Companies like Alibaba operate tens of data centers (DCs) across geographically distributed locations. These DCs collectively provide the storage space and computing power for the company, storing EBs of data and serving millions of batch analytics jobs every day. In Alibaba, as our businesses grow, there are more and more cross-DC dependencies caused by jobs reading data from remote DCs. Consequently, the precious wide area network bandwidth becomes a major bottleneck for operating geo-distributed DCs at scale. In this paper, we present Yugong — a system that manages data placement and job placement in Alibaba’s geo-distributed DCs, with the objective to minimize cross-DC bandwidth usage. Yugong uses three methods, namely project placement, table replication, and job outsourcing, to address the issues of high bandwidth consumption across the DCs. We give the details of Yugong’s design and implementation for the three methods, and describe how it cooperates with other systems (e.g., Alibaba’s big data analytics platform and cluster scheduler) to improve the productivity of the DCs. We also report comprehensive performance evaluation results, which validate the design of Yugong and show that significant reduction in cross-DC bandwidth usage has been achieved.

TAO: Facebook's Distributed Data Store for the Social Graph

TAO is a geographically distributed data store that provides efficient and timely access to the social graph for Facebook's demanding workload using a fixed set of queries.

We introduce a simple data model and API tailored for serving the social graph, and TAO, an implementation of this model. TAO is a geographically distributed data store that provides efficient and timely access to the social graph for Facebook's demanding workload using a fixed set of queries. It is deployed at Facebook, replacing memcache for many data types that fit its model. The system runs on thousands of machines, is widely distributed, and provides access to many petabytes of data. TAO can process a billion reads and millions of writes each second.

CockroachDB: The Resilient Geo-Distributed SQL Database

The design of CockroachDB and its novel transaction model that supports consistent geo-distributed transactions on commodity hardware is presented and its distributed SQL layer automatically scales with the size of the database cluster while providing the standard SQL interface that users expect.

We live in an increasingly interconnected world, with many organizations operating across countries or even continents. To serve their global user base, organizations are replacing their legacy DBMSs with cloud-based systems capable of scaling OLTP workloads to millions of users. CockroachDB is a scalable SQL DBMS that was built from the ground up to support these global OLTP workloads while maintaining high availability and strong consistency. Just like its namesake, CockroachDB is resilient to disasters through replication and automatic recovery mechanisms. This paper presents the design of CockroachDB and its novel transaction model that supports consistent geo-distributed transactions on commodity hardware. We describe how CockroachDB replicates and distributes data to achieve fault tolerance and high performance, as well as how its distributed SQL layer automatically scales with the size of the database cluster while providing the standard SQL interface that users expect. Finally, we present a comprehensive performance evaluation and share a couple of case studies of CockroachDB users. We conclude by describing lessons learned while building CockroachDB over the last five years.

Spanner: Google's Globally-Distributed Database

This article describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty, critical to supporting external consistency and a variety of powerful features.

Spanner is Google’s scalable, multiversion, globally distributed, and synchronously replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This article describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: nonblocking reads in the past, lock-free snapshot transactions, and atomic schema changes, across all of Spanner.
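
The sketch below illustrates the commit-wait idea that such a clock-uncertainty API enables: hold back a commit acknowledgment until its timestamp is guaranteed to be in the past at every replica. The tt_now() helper and the epsilon bound are stand-ins, not Spanner's real TrueTime API:

```python
import time

EPSILON = 0.007  # assumed worst-case clock uncertainty, in seconds

def tt_now():
    """Return an uncertainty interval [earliest, latest] around real time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_wait(commit_timestamp):
    """Block until commit_timestamp is definitely in the past: i.e., until
    even the earliest possible current time exceeds it."""
    while tt_now()[0] <= commit_timestamp:
        time.sleep(EPSILON / 2)

# Pick a commit timestamp no earlier than the latest possible "now", then
# wait it out before acknowledging the commit to clients. After the wait,
# no observer anywhere can read a clock value below the timestamp.
s = tt_now()[1]
commit_wait(s)
print("commit visible at timestamp", s)
```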

A comprehensive study of Convergent and Commutative Replicated Data Types

This paper formalises asynchronous object replication, either state based or operation based, and provides a sufficient condition appropriate for each case, and describes several useful CRDTs, including container data types supporting both add and remove operations with clean semantics, and more complex types such as graphs, monotonic DAGs, and sequences.

Eventual consistency aims to ensure that replicas of some mutable shared object converge without foreground synchronisation. Previous approaches to eventual consistency are ad-hoc and error-prone. We study a principled approach: to base the design of shared data types on some simple formal conditions that are sufficient to guarantee eventual consistency. We call these types Convergent or Commutative Replicated Data Types (CRDTs). This paper formalises asynchronous object replication, either state based or operation based, and provides a sufficient condition appropriate for each case. It describes several useful CRDTs, including container data types supporting both add and remove operations with clean semantics, and more complex types such as graphs, monotonic DAGs, and sequences. It discusses some properties needed to implement non-trivial CRDTs.
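
As a concrete instance of the state-based condition, here is a grow-only counter, one of the simplest CRDTs: states form a join-semilattice and merge computes the join (element-wise max), so replicas converge regardless of message ordering or duplication. The replica ids are illustrative:

```python
class GCounter:
    """State-based grow-only counter CRDT (a sketch, not from the paper)."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> total increments at that replica

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # which is exactly the sufficient condition for convergence.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

a, b = GCounter("A"), GCounter("B")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```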

Managing update conflicts in Bayou, a weakly connected replicated storage system

The motivation for and design of these mechanisms for conflict detection and per-write conflict resolution based on client-provided procedures are presented, and the experiences gained with an initial implementation of the system are described.

Bayou is a replicated, weakly consistent storage system designed for a mobile computing environment that includes portable machines with less than ideal network connectivity. To maximize availability, users can read and write any accessible replica. Bayou’s design has focused on supporting application-specific mechanisms to detect and resolve the update conflicts that naturally arise in such a system, ensuring that replicas move towards eventual consistency, and defining a protocol by which the resolution of update conflicts stabilizes. It includes novel methods for conflict detection, called dependency checks, and per-write conflict resolution based on client-provided merge procedures. To guarantee eventual consistency, Bayou servers must be able to rollback the effects of previously executed writes and redo them according to a global serialization order. Furthermore, Bayou permits clients to observe the results of all writes received by a server, including tentative writes whose conflicts have not been ultimately resolved. This paper presents the motivation for and design of these mechanisms and describes the experiences gained with an initial implementation of the system.
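
The sketch below illustrates the dependency-check/merge-procedure pattern on a meeting-room example in the spirit of the paper; the data layout and field names are assumptions for illustration, not Bayou's actual interfaces:

```python
# Each Bayou-style write bundles an update, an application-supplied
# dependency check, and a merge procedure to run if the check fails.

def apply_write(db, write):
    if write["dep_check"](db):
        write["update"](db)       # expected case: dependencies still hold
    else:
        write["merge_proc"](db)   # conflict: run client-provided merge

# Book room 101 at 9am, falling back to 10am if 9am is already taken.
book = {
    "dep_check":  lambda db: ("room101", "9am") not in db,
    "update":     lambda db: db.__setitem__(("room101", "9am"), "alice"),
    "merge_proc": lambda db: db.__setitem__(("room101", "10am"), "alice"),
}

replica = {("room101", "9am"): "bob"}  # a conflicting write arrived first
apply_write(replica, book)
print(replica)  # alice's booking resolved onto the 10am slot
```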

MDCC: multi-data center consistency

The experiments show that MDCC outperforms existing synchronous transactional replication protocols, such as Megastore, by requiring only a single message round-trip in the normal operational case independent of the master-location and by scaling linearly with the number of machines as long as transaction conflict rates permit.

Replicating data across multiple data centers allows using data closer to the client, reducing latency for applications, and increases the availability in the event of a data center failure. MDCC (Multi-Data Center Consistency) is an optimistic commit protocol for geo-replicated transactions, that does not require a master or static partitioning, and is strongly consistent at a cost similar to eventually consistent protocols. MDCC takes advantage of Generalized Paxos for transaction processing and exploits commutative updates with value constraints in a quorum-based system. Our experiments show that MDCC outperforms existing synchronous transactional replication protocols, such as Megastore, by requiring only a single message round-trip in the normal operational case independent of the master-location and by scaling linearly with the number of machines as long as transaction conflict rates permit.

TARDiS: A Branch-and-Merge Approach To Weak Consistency

This paper presents the design, implementation, and evaluation of TARDiS (Transactional Asynchronously Replicated Divergent Store), a transactional key-value store explicitly designed for weakly-consistent systems, and finds that TARDiS reduces coding complexity for applications and that judicious branch-on-conflict can improve their local throughput at each site by two to eight times.

This paper presents the design, implementation, and evaluation of TARDiS (Transactional Asynchronously Replicated Divergent Store), a transactional key-value store explicitly designed for weakly-consistent systems. Reasoning about these systems is hard, as neither causal consistency nor per-object eventual convergence allow applications to deal satisfactorily with write-write conflicts. TARDiS instead exposes as its fundamental abstraction the set of conflicting branches that arise in weakly-consistent systems. To this end, TARDiS introduces a new concurrency control mechanism: branch-on-conflict. On the one hand, TARDiS guarantees that storage will appear sequential to any thread of execution that extends a branch, keeping application logic simple. On the other, TARDiS provides applications, when needed, with the tools and context necessary to merge branches atomically, when and how applications want. Since branch-on-conflict in TARDiS is fast, weakly-consistent applications can benefit from adopting this paradigm not only for operations issued by different sites, but also, when appropriate, for conflicting local operations. We find that TARDiS reduces coding complexity for these applications and that judicious branch-on-conflict can improve their local throughput at each site by two to eight times.

Stronger Semantics for Low-Latency Geo-Replicated Storage

The evaluation shows that the Eiger system achieves low latency, has throughput competitive with eventually-consistent and non-transactional Cassandra, and scales out to large clusters almost linearly (averaging 96% increases up to 128 server clusters).

We present the first scalable, geo-replicated storage system that guarantees low latency, offers a rich data model, and provides "stronger" semantics. Namely, all client requests are satisfied in the local datacenter in which they arise; the system efficiently supports useful data model abstractions such as column families and counter columns; and clients can access data in a causally-consistent fashion with read-only and write-only transactional support, even for keys spread across many servers.

The primary contributions of this work are enabling scalable causal consistency for the complex column-family data model, as well as novel, non-blocking algorithms for both read-only and write-only transactions. Our evaluation shows that our system, Eiger, achieves low latency (single-ms), has throughput competitive with eventually-consistent and non-transactional Cassandra (less than 7% overhead for one of Facebook's real-world workloads), and scales out to large clusters almost linearly (averaging 96% increases up to 128 server clusters).

The potential dangers of causal consistency and an explicit solution

This work exposes causal consistency's serious and inherent scalability limitations due to write propagation requirements and traditional dependency tracking mechanisms, and advocates the use of explicit causality, or application-defined happens-before relations.

Causal consistency is the strongest consistency model that is available in the presence of partitions and provides useful semantics for human-facing distributed services. Here, we expose its serious and inherent scalability limitations due to write propagation requirements and traditional dependency tracking mechanisms. As an alternative to classic potential causality, we advocate the use of explicit causality, or application-defined happens-before relations. Explicit causality, a subset of potential causality, tracks only relevant dependencies and reduces several of the potential dangers of causal consistency.

Don't settle for eventual: scalable causal consistency for wide-area storage with COPS

This paper identifies and defines a consistency model---causal consistency with convergent conflict handling, or causal+---that is the strongest achieved under these constraints and presents the design and implementation of COPS, a key-value store that delivers this consistency model across the wide-area.

Geo-replicated, distributed data stores that support complex online applications, such as social networks, must provide an "always-on" experience where operations always complete with low latency. Today's systems often sacrifice strong consistency to achieve these goals, exposing inconsistencies to their clients and necessitating complex application logic. In this paper, we identify and define a consistency model---causal consistency with convergent conflict handling, or causal+---that is the strongest achieved under these constraints. We present the design and implementation of COPS, a key-value store that delivers this consistency model across the wide-area. A key contribution of COPS is its scalability, which can enforce causal dependencies between keys stored across an entire cluster, rather than a single server like previous systems. The central approach in COPS is tracking and explicitly checking whether causal dependencies between keys are satisfied in the local cluster before exposing writes. Further, in COPS-GT, we introduce get transactions in order to obtain a consistent view of multiple keys without locking or blocking. Our evaluation shows that COPS completes operations in less than a millisecond, provides throughput similar to previous systems when using one server per cluster, and scales well as we increase the number of servers in each cluster. It also shows that COPS-GT provides similar latency, throughput, and scaling to COPS for common workloads.
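
The following sketch illustrates the dependency-checking idea: replicated writes carry explicit dependencies, and a remote cluster buffers each write until its dependencies are visible locally. Version numbers and structures here are simplified assumptions, not COPS's actual protocol format:

```python
class CausalStore:
    def __init__(self):
        self.visible = {}   # key -> (version, value), safe to read
        self.pending = []   # remote writes waiting on their dependencies

    def _deps_satisfied(self, deps):
        return all(self.visible.get(k, (0, None))[0] >= v for k, v in deps)

    def replicate(self, key, version, value, deps):
        """Handle a write arriving from a remote cluster."""
        self.pending.append((key, version, value, deps))
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for w in list(self.pending):
                key, version, value, deps = w
                if self._deps_satisfied(deps):
                    self.visible[key] = (version, value)  # expose the write
                    self.pending.remove(w)
                    progress = True

s = CausalStore()
s.replicate("album", 1, ["photo"], deps=[("photo", 2)])  # buffered for now
s.replicate("photo", 2, "beach.jpg", deps=[])            # unblocks the album
print(s.visible)  # both writes visible, in causal order
```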

Session guarantees for weakly consistent replicated data

Four per-session guarantees are proposed to aid users and applications of weakly consistent replicated data: "read your writes", "monotonic reads", "writes follow reads", and "monotonic writes".

Four per-session guarantees are proposed to aid users and applications of weakly consistent replicated data: "read your writes", "monotonic reads", "writes follow reads", and "monotonic writes". The intent is to present individual applications with a view of the database that is consistent with their own actions, even if they read and write from various, potentially inconsistent servers. The guarantees can be layered on existing systems that employ a read-any/write-any replication scheme while retaining the principal benefits of such a scheme, namely high availability, simplicity, scalability, and support for disconnected operation. These session guarantees were developed in the context of the Bayou project at Xerox PARC in which we are designing and building a replicated storage system to support the needs of mobile computing users who may be only intermittently connected.
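
As an illustration of enforcing one of these guarantees, the sketch below implements a crude "read your writes" check: the session remembers the writes it issued and refuses to read from a server that has not yet received them. The write-id scheme is an assumption; practical implementations use compact version vectors instead of explicit sets:

```python
class Server:
    def __init__(self, name):
        self.name, self.data, self.known_writes, self._seq = name, {}, set(), 0

    def apply_write(self, key, value):
        self._seq += 1
        wid = (self.name, self._seq)
        self.data[key] = value
        self.known_writes.add(wid)
        return wid

class Session:
    def __init__(self):
        self.write_set = set()  # ids of writes issued in this session

    def write(self, server, key, value):
        self.write_set.add(server.apply_write(key, value))

    def read(self, server, key):
        # Read-your-writes: only read from servers that have all our writes.
        if not self.write_set <= server.known_writes:
            raise RuntimeError(f"{server.name} is too stale for this session")
        return server.data.get(key)

s1, s2 = Server("s1"), Server("s2")
sess = Session()
sess.write(s1, "x", 1)
print(sess.read(s1, "x"))  # 1
# sess.read(s2, "x") would raise: s2 has not yet received the session's write.
```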

An Application-Based Adaptive Replica Consistency for Cloud Storage

An application-based adaptive mechanism of replica consistency is proposed in this paper, dividing the consistency of applications into four categories according to their read frequencies and update frequencies and designing corresponding consistency strategies for each.

The intrinsically heterogeneous character of cloud applications makes their consistency requirements differ. Furthermore, the consistency requirement of a given application changes continuously at runtime, so a fixed consistency strategy is not enough. An application-based adaptive mechanism of replica consistency is proposed in this paper. We divide the consistency of applications into four categories according to their read frequencies and update frequencies, and then design corresponding consistency strategies. Applications select the most suitable strategy automatically at runtime to achieve a dynamic balance between consistency, availability, and performance. Evaluation results show that the proposed mechanism decreases the amount of operations significantly while guaranteeing the application’s consistency requirement.
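
A minimal sketch of the four-way classification idea, with thresholds and strategy names invented purely for illustration:

```python
def pick_strategy(reads_per_s, updates_per_s,
                  read_threshold=100.0, update_threshold=100.0):
    """Bucket an application by read and update frequency, then map each
    bucket to a consistency strategy (names are illustrative)."""
    hot_reads = reads_per_s >= read_threshold
    hot_updates = updates_per_s >= update_threshold
    if hot_reads and hot_updates:
        return "strong"           # conflicts are both likely and visible
    if hot_reads:
        return "read-optimized"   # cheap reads, rare invalidations
    if hot_updates:
        return "write-optimized"  # batch updates, synchronize lazily
    return "eventual"             # little traffic either way

print(pick_strategy(reads_per_s=500, updates_per_s=2))  # read-optimized
```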

Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary

This work proposes RedBlue consistency, which enables blue operations to be fast while the remaining red operations are strongly consistent (and slow), and introduces a method that increases the space of potential blue operations by breaking them into separate generator and shadow phases.

Online services distribute and replicate state across geographically diverse data centers and direct user requests to the closest or least loaded site. While effectively ensuring low latency responses, this approach is at odds with maintaining cross-site consistency. We make three contributions to address this tension. First, we propose RedBlue consistency, which enables blue operations to be fast (and eventually consistent) while the remaining red operations are strongly consistent (and slow). Second, to make use of fast operation whenever possible and only resort to strong consistency when needed, we identify conditions delineating when operations can be blue and must be red. Third, we introduce a method that increases the space of potential blue operations by breaking them into separate generator and shadow phases. We built a coordination infrastructure called Gemini that offers RedBlue consistency, and we report on our experience modifying the TPC-W and RUBiS benchmarks and an online social network to use Gemini. Our experimental results show that RedBlue consistency provides substantial performance gains without sacrificing consistency.
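
The generator/shadow split is easiest to see on a bank-account example of the kind used in the paper: interest accrual does not commute as written (a multiplication), but its shadow (adding a concrete amount) does, so it can run as a blue operation. The sketch below is illustrative, not Gemini's API:

```python
# Generators run once at the originating site, side-effect free, and emit
# a shadow operation; only shadows replicate, and commuting shadows can be
# classified blue (fast, eventually consistent).

def deposit_generator(balance, amount):
    return ("add", amount)

def accrue_interest_generator(balance, rate):
    # Decide the concrete amount here, so the shadow is a plain addition.
    return ("add", balance * rate)

def apply_shadow(balance, shadow):
    op, amount = shadow
    assert op == "add"
    return balance + amount

# Two sites generate shadows against the same starting state, then each
# site applies both shadows in a different order; addition commutes.
s1 = deposit_generator(100.0, 20.0)
s2 = accrue_interest_generator(100.0, 0.05)
site_a = apply_shadow(apply_shadow(100.0, s1), s2)
site_b = apply_shadow(apply_shadow(100.0, s2), s1)
assert site_a == site_b == 125.0
```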

The failure and recovery problem for replicated databases

A theory for proving the correctness of algorithms that manage replicated data is presented, an extension of serializability theory, which is applied to three replicated data algorithms: Gifford’s “quorum consensus” algorithm, Eager and Sevcik’s “missing writes” algorithm, and Computer Corporation of America’s “available copies” algorithm.

A replicated database is a distributed database in which some data items are stored redundantly at multiple sites. The main goal is to improve system reliability. By storing critical data at multiple sites, the system can operate even though some sites have failed. However, few distributed database systems support replicated data, because it is difficult to manage as sites fail and recover. A replicated data algorithm has two parts. One is a discipline for reading and writing data item copies. The other is a concurrency control algorithm for synchronizing those operations. The read-write discipline ensures that if one transaction writes logical data item x, and another transaction reads or writes x, there is some physical manifestation of that logical conflict. The concurrency control algorithm synchronizes physical conflicts; it knows nothing about logical conflicts. In a correct replicated data algorithm, the physical manifestation of conflicts must be strong enough so that synchronizing physical conflicts is sufficient for correctness. This paper presents a theory for proving the correctness of algorithms that manage replicated data. The theory is an extension of serializability theory. We apply it to three replicated data algorithms: Gifford's “quorum consensus” algorithm, Eager and Sevcik's “missing writes” algorithm, and Computer Corporation of America's “available copies” algorithm.
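
Gifford's quorum consensus, the first of the three algorithms, reduces to an intersection argument: with N copies, read quorum R, and write quorum W, requiring R + W > N makes every read quorum overlap every write quorum, so reads always see the latest version. A toy sketch with simplified version handling:

```python
import random

class QuorumRegister:
    def __init__(self, n=5, r=3, w=3):
        assert r + w > n      # every read quorum meets every write quorum
        assert 2 * w > n      # write quorums overlap, so versions increase
        self.n, self.r, self.w = n, r, w
        self.copies = [(0, None)] * n  # (version, value) per replica

    def write(self, value):
        quorum = random.sample(range(self.n), self.w)
        version = max(self.copies[i][0] for i in quorum) + 1
        for i in quorum:
            self.copies[i] = (version, value)

    def read(self):
        quorum = random.sample(range(self.n), self.r)
        return max((self.copies[i] for i in quorum), key=lambda c: c[0])[1]

reg = QuorumRegister()
reg.write("v1")
reg.write("v2")
print(reg.read())  # "v2": any read quorum overlaps the latest write quorum
```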

An adaptive data replication algorithm

An algorithm for dynamic replication of an object in distributed systems is presented and it is shown that the algorithm can be combined with the concurrency control and recovery mechanisms of a distributed database management system.

This article addresses the performance of distributed database systems. Specifically, we present an algorithm for dynamic replication of an object in distributed systems. The algorithm is adaptive in the sense that it changes the replication scheme of the object (i.e., the set of processors at which the object is replicated) as changes occur in the read-write pattern of the object (i.e., the number of reads and writes issued by each processor). The algorithm continuously moves the replication scheme towards an optimal one. We show that the algorithm can be combined with the concurrency control and recovery mechanisms of a distributed database management system. The performance of the algorithm is analyzed theoretically and experimentally. On the way we provide a lower bound on the performance of any dynamic replication algorithm.

Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing

The Mesa system is presented, and the performance and scale that it achieves are reported, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes.

Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and systems requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Specifically, Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails. This paper presents the Mesa system and reports the performance and scale that it achieves.

Low Latency Geo-distributed Data Analytics

Iridium is presented, a system for low latency geo-distributed analytics that achieves low query response times by optimizing placement of both data and tasks of the queries and contains a knob to budget WAN usage.

Low latency analytics on geographically distributed datasets (across datacenters, edge clusters) is an upcoming and increasingly important challenge. The dominant approach of aggregating all the data to a single datacenter significantly inflates the timeliness of analytics. At the same time, running queries over geo-distributed inputs using the current intra-DC analytics frameworks also leads to high query response times because these frameworks cannot cope with the relatively low and variable capacity of WAN links. We present Iridium, a system for low latency geo-distributed analytics. Iridium achieves low query response times by optimizing placement of both data and tasks of the queries. The joint data and task placement optimization, however, is intractable. Therefore, Iridium uses an online heuristic to redistribute datasets among the sites prior to queries' arrivals, and places the tasks to reduce network bottlenecks during the query's execution. Finally, it also contains a knob to budget WAN usage. Evaluation across eight worldwide EC2 regions using production queries shows that Iridium speeds up queries by 3× -- 19× and lowers WAN usage by 15% -- 64% compared to existing baselines.

Volley: Automated Data Placement for Geo-Distributed Cloud Services

Volley is evaluated on the month-long Live Mesh trace, and it is found that, compared to a state-of-the-art heuristic, Volley simultaneously reduces datacenter capacity skew by over 2×, reduces inter-datacenter traffic by over 1.8× and reduces 75th percentile user-latency by over 30%.

As cloud services grow to span more and more globally distributed datacenters, there is an increasingly urgent need for automated mechanisms to place application data across these datacenters. This placement must deal with business constraints such as WAN bandwidth costs and datacenter capacity limits, while also minimizing user-perceived latency. The task of placement is further complicated by the issues of shared data, data inter-dependencies, application changes and user mobility. We document these challenges by analyzing month-long traces from Microsoft's Live Messenger and Live Mesh, two large-scale commercial cloud services.

We present Volley, a system that addresses these challenges. Cloud services make use of Volley by submitting logs of datacenter requests. Volley analyzes the logs using an iterative optimization algorithm based on data access patterns and client locations, and outputs migration recommendations back to the cloud service.

To scale to the data volumes of cloud service logs, Volley is designed to work in SCOPE [5], a scalable MapReduce-style platform; this allows Volley to perform over 400 machine-hours worth of computation in less than a day. We evaluate Volley on the month-long Live Mesh trace, and we find that, compared to a state-of-the-art heuristic that places data closest to the primary IP address that accesses it, Volley simultaneously reduces datacenter capacity skew by over 2×, reduces inter-datacenter traffic by over 1.8× and reduces 75th percentile user-latency by over 30%.

Performance Sensitive Replication in Geo-distributed Cloud Datastores

This paper presents models that optimize percentiles of response time under normal operation and under a data-center (DC) failure in quorum-based cloud storage systems, and evaluates their models using real-world traces of Twitter, Wikipedia and Gowalla on a Cassandra cluster deployed in Amazon EC2.

Modern web applications face stringent requirements along many dimensions including latency, scalability, and availability. In response, several geo-distributed cloud data stores have emerged in recent years. Customizing data stores to meet application SLAs is challenging given the scale of applications, and their diverse and dynamic workloads. In this paper, we tackle these challenges in the context of quorum-based systems (e.g. Amazon Dynamo, Cassandra), an important class of cloud storage systems. We present models that optimize percentiles of response time under normal operation and under a data-center (DC) failure. Our models consider factors like the geographic spread of users, DC locations, consistency requirements and inter-DC communication costs. We evaluate our models using real-world traces of three applications: Twitter, Wikipedia and Gowalla on a Cassandra cluster deployed in Amazon EC2. Our results confirm the importance and effectiveness of our models, and highlight the benefits of customizing replication in cloud datastores.

A Self-Configurable Geo-Replicated Cloud Storage System

Tuba is a geo-replicated key-value store that automatically reconfigures its set of replicas while respecting application-defined constraints so that it adapts to changes in clients' locations or request rates.

Reconfiguring a cloud storage system can improve its overall service. Tuba is a geo-replicated key-value store that automatically reconfigures its set of replicas while respecting application-defined constraints so that it adapts to changes in clients' locations or request rates. New replicas may be added, existing replicas moved, replicas upgraded from secondary to primary, and the update propagation between replicas adjusted. Tuba extends a commercial cloud-based service, Microsoft Azure Storage, with broad consistency choices (as in Bayou), consistency-based SLAs (as in Pileus), and a novel replication configuration service. Compared with a system that is statically configured, our evaluation shows that Tuba increases the reads that return strongly consistent data by 63%.

Consistency-based service level agreements for cloud storage

Evaluations running on a worldwide test bed with geo-replicated data show that the Pileus system adapts to varying client-server latencies to provide service that matches or exceeds the best static consistency choice and server selection scheme.

Choosing a cloud storage system and specific operations for reading and writing data requires developers to make decisions that trade off consistency for availability and performance. Applications may be locked into a choice that is not ideal for all clients and changing conditions. Pileus is a replicated key-value store that allows applications to declare their consistency and latency priorities via consistency-based service level agreements (SLAs). It dynamically selects which servers to access in order to deliver the best service given the current configuration and system conditions. In application-specific SLAs, developers can request both strong and eventual consistency as well as intermediate guarantees such as read-my-writes. Evaluations running on a worldwide test bed with geo-replicated data show that the system adapts to varying client-server latencies to provide service that matches or exceeds the best static consistency choice and server selection scheme.
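
A consistency-based SLA can be pictured as a ranked list of subSLAs, each pairing a consistency level with a latency bound and a utility; the client library then serves each read from the best server that can meet the highest-ranked feasible entry. The sketch below is an illustration with assumed names and numbers, not Pileus's actual API:

```python
# SubSLAs in decreasing utility: (consistency, latency bound in s, utility).
SLA = [
    ("strong",         0.150, 1.00),  # prefer strong reads under 150 ms
    ("read-my-writes", 0.150, 0.75),
    ("eventual",       0.500, 0.25),  # last resort: any replica in 500 ms
]

def choose_server(servers, sla):
    """servers: dicts with 'rtt' (seconds) and 'consistency' (the set of
    guarantees the server can currently provide to this client)."""
    for consistency, bound, utility in sla:
        for s in servers:
            if consistency in s["consistency"] and s["rtt"] <= bound:
                return s["name"], consistency, utility
    return None

servers = [
    {"name": "primary-eu", "rtt": 0.220,
     "consistency": {"strong", "read-my-writes", "eventual"}},
    {"name": "replica-us", "rtt": 0.030,
     "consistency": {"read-my-writes", "eventual"}},
]
print(choose_server(servers, SLA))  # ('replica-us', 'read-my-writes', 0.75)
```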

SPANStore: cost-effective geo-replicated storage spanning multiple cloud services

SPANStore is presented, a key-value store that exports a unified view of storage services in geographically distributed data centers that can lower costs by over 10x in several scenarios, in comparison with alternative solutions that either use a single storage provider or replicate every object to every data center from which it is accessed.

By offering storage services in several geographically distributed data centers, cloud computing platforms enable applications to offer low latency access to user data. However, application developers are left to deal with the complexities associated with choosing the storage services at which any object is replicated and maintaining consistency across these replicas. In this paper, we present SPANStore, a key-value store that exports a unified view of storage services in geographically distributed data centers. To minimize an application provider's cost, we combine three key principles. First, SPANStore spans multiple cloud providers to increase the geographical density of data centers and to minimize cost by exploiting pricing discrepancies across providers. Second, by estimating application workload at the right granularity, SPANStore judiciously trades off greater geo-distributed replication necessary to satisfy latency goals with the higher storage and data propagation costs this entails in order to satisfy fault tolerance and consistency requirements. Finally, SPANStore minimizes the use of compute resources to implement tasks such as two-phase locking and data propagation, which are necessary to offer a global view of the storage services that it builds upon. Our evaluation of SPANStore shows that it can lower costs by over 10x in several scenarios, in comparison with alternative solutions that either use a single storage provider or replicate every object to every data center from which it is accessed.
