masahitojp/gist:8430314

## gistfile1.rst

      
    Raw
  

              gistfile1.rst
            
          
    SWIM: スケーラブルな弱一貫性,伝染様式プロセスグループメンバーシッププロトコル[和訳]


update
2014-01-26

version
0.1


原題: SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol
原文: "SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"( PDF)
This article is translated by @masahito. Please contact me if any problem.

Abhinandan Das, Indranil Gupta, Ashish Motivala Dept. of Computer Science, Cornell University Ithaca NY 14853 USA
{asdas,gupta,ashish}@cs.cornell.edu
Abstract

分散P2Pアプリケーションでは、全参加プロセスでのプロセス・グループ・メンバーシップ情報についての弱い一貫性を要求するものもあります。
SWIMは、大規模スケールでのプロセス・グループ用にサービスを提供する総括的なソフトウェア・モジュールです。
SWIMは、従来のハートビードプロトコルがスケーラブルではないことに同期づけられています。ハードビートプロトコルはグループのサイズによってネットワーク負荷の2次的な成長、応答時間の妥協、プロセス衝突についての偽陽性頻度を課します。
本論文では,PCで構成された大きなクラスタでのSWIMサブシステムのデザイン、インプリメンテーションおよびパフォーマンスを報告します。
伝統的なハートビートプロトコルとは異なり、SWIMは、障害検出とメンバーシッププロトコルのメンバーシップ更新散布機能が分離している。
プロセスは効率的なP2Pの定期無作為化プロービングプロトコルを介して監視されます。
各プロセスに障害が発生した最初の検出、およびメンバーごとに予想されるメッセージ負荷と予想時間の両方は、グループサイズによって変化しません。
メンバーシップの変更、プロセスの参加、ドロップアウトや障害に関する情報は、pingメッセージと確認応答に便乗して伝達されます。
結果として、堅牢で高速な感染様式（伝染、または、gossip形式）な散布になります。
これは、システムが誤った故障検出を発見し、修正することができます - SWIMシステム内の誤った故障検出率は、失敗したように、グループのメンバーがそれを宣言する前にプロセスを疑うことができるようにプロトコルを変更することによって削減される。
最後に、プロトコルは、失敗を検知するために決定論的な時間限界を保証します。
SWIMプロトタイプからの実験結果を提示します。
WAN全体の規模への設計の拡張を議論します。
1. Introduction

あなたが環境を通してのんびり泳ぐと、世界の秘密があなたにうつるだろう。
メンバーシップ・プロトコルを利用する既存のミドルウェア・システムの例は信頼できるマルチキャスト¹ ,² , およびepidemicスタイルの情報散布³ ,⁴ ,⁵ を含んでいます。
これらのプロトコルは直近のdisconnectを一致させることが必要な分散データベース⁶ 、パブリッシャー・スクライブシステム、大規模P2Pシステム⁷ などのアプリケーションで使われています。
このような大規模な協調的なゲーム、その他の共同分散アプリケーションなど、他の新興のアプリケーションのパフォーマンスは、アプリケーション内で使用しているメンバーシップ保守プロトコルの信頼性と拡張性に決定的に依存している。
簡単に説明すると、メンバーシッププロトコルは、グループ内の他の非障害なプロセスのローカルに管理されているリストを持つグループのプロセス（「メンバー」）を提供します。
プロトコルは、メンバーシップ·リストが新しいメンバーがグループに参加、または、（自発的にまたは失敗によって）脱落して更新されることを保証します。
メンバーシップリストは、そのアドレス空間で直接アプリケーションが利用できるようにするか、コールバックインタフェースまたはAPIを介して行われます。
アプリケーションは、必要に応じてリストの内容を自由に使用します。例えば、ゴッシプを基にした配布プロトコルは定期的にゴシップの対処となるメンバーを選択するリストを使用します。
メンバーシップ·サブシステムの信頼性とスケーラビリティは、いくつかのパフォーマンス·メトリックを介して測定することができる。
メンバーシップの変更は、その発生後に速やかにグループ内で伝播されなければならない。
メッセージを失ったプロセスは失敗と区別がつかないため⁸ 、基盤となるネットワークの非同期性と信頼性の欠如は、プロセスの障害の誤検出につながる、メッセージ消失の原因となります。
偽陽性のこのレートは低くなることがあります。
最後に、プロトコルは、（中央サーバに依存しない）P2Pであり、ネットワークやプロセス上の低メッセージおよび計算負荷を課す必要があります。
メンバーシップ·プロトコルは、使用するアプリケーションのパフォーマンスに影響を与え、数十プロセスを超えたグループにスケールすること⁹¹⁰ は困難であった。
¹¹ で報告されているように、これらのグループサイズでパフォーマンスが悪化する主な症状は、プロセスの誤った故障検出率、または障害を検出するための時間のどちらかの増加である。
[12] identifies the quadratic increase in the message load imposed by such membership protocols as another symptom of the unscalability of traditional protocols for membership maintenance.
An example of an application that relies heavily on the membership sub-system is the class of virtually synchronous multicast protocols [3].
Traditional implementations of this specification suffer a drastic reduction in performance, and partitioning, at beyond a few dozen members [11].
This paper presents our effort in the SWIM project to implement a membership sub-system that provides stable failure detection time, stable rate of false positives and low message load per group member, thus allowing distributed applications that use it to scale well.
We focus on a weaker variant of group membership, where membership lists at different members need not be consistent across the group at the same (causal) point in time.
Stronger guarantees could be provided by augmenting the membership sub-system, e.g. a virtually-synchronous style membership can be provided through a sequencer process that checkpoints the membership list periodically. However, unlike the weakly consistent problem, strongly consistent specifications might have fundamental scalability limitations � . The design of a distributed membership algorithm � has traditionally been approached through the technique of heartbeating. Each process periodically sends out an incremented heartbeat counter to the outside world. Another process is detected as failed when a heartbeat is not received from it for some time. However, actual implementations of heartbeating suffer from scalability limitations. Sending all heartbeats to a central server leads to hot-spot creation. Sending heartbeats to all members (through either network multicast, or gossiping [16]) leads to a message load on the network and group that grows quadratically with the group size. Heartbeating along a logical ring [9] suffers from unpredictability of failure detection time when there are multiple failures. Unfortunately, as the group size rises, so does the likelihood of simultaneous multiple failures. An extended discussion of reasons behind the inherent unscalability of heartbeat-based membership maintenance mechanisms can be found in [12]. This paper also proposed a randomized distributed failure detector protocol based on members randomly probing each other instead of heartbeat-ing � . Mathematical analysis showed that as the group size is scaled up, the protocol’s properties of (expected) failure detection time, rate of false positives, and message load per member, are all independent of the group size. This is an improvement over all-to-all heartbeating based protocols that have a linear variation (with group size) of either the detection time for failures or the network bandwidth usage at each member (or an increase in the false positive rate). Our work in this article is motivated by a realization from the work of [12] that the unscalability of the popular class of all-to-all heartbeating protocols arises from the implicit decision therein to fuse the two principal functions of the membership problem specification: 1) Membership update Dissemination: propagating membership updates arising from processes joining, leaving or failing, and 2) Failure detection: detecting failures of existing members. The overhead of multicasting heartbeats is eliminated by designing an efficient non-multicast based failure detector, and using the dissemination component only when a membership change occurs. The Membership Dissemination component can be implemented through either hardware multicast or in infection-style. While [12] presented a failure detection protocol and analyzed it theoretically, our work in the current paper looks at incorporating the Membership Dissemination component in to build a working membership sub-system. In addition, the resulting protocol is augmented by mechanisms that reduce the rate of false positives and give stronger deterministic guarantees on failure detection times at individual processes. Our system, called SWIM, provides a membership substrate that:

imposes a constant message load per group member;
detects a process failure in an (expected) constant time at some non-faulty process in the group;
provides a deterministic bound (as a function of group size) on the local time that a non-faulty process takes to detect failure of another process;
propagates membership updates, including information about failures, in infection-style (also gossip-style or epidemic-style [2, 8]); the dissemination latency in the group grows slowly logarithmically) with the number of members;
provides a mechanism to reduce the rate of false positives by “suspecting” a process before “declaring” it as failed within the group.

While (1) and (2) are properties of the failure detection protocol of [12], (3)-(5) represent our subsequent work in the current paper. Experimental results of a prototype implementation of SWIM running on a PC cluster are discussed. The SWIM protocol can also be extended to work over a wide area network (WAN) or virtual private network (VPN), and we touch on this briefly in Section 6. The rest of the paper is organized as follows. Section 2 summarizes previous work in this area, and the basics of scalable failure detection protocols from [12]. Section 3 describes the basic SWIM protocol, and Section 4 the improvements to the protocol. Experimental results from a prototype implementation are presented in Section 5. We conclude in Section 6.
2. Previous Work

In traditional distributed all-to-all heartbeating failure detection algorithms, every group member periodically transmits a “heartbeat” message (with an incremented counter) to all other group members. A member Mj is declared as failed by a non-faulty member Mj when Mj does not receive heartbeats from Mj for some consecutive heartbeat periods. Distributed heartbeating schemes guarantee that a faulty member is always detected as such at any non-faulty member (within a time interval after its failure), since a member that has crashed also stops sending heartbeat messages. However, the accuracy and scalability guarantees of these protocols differ, depending on the actual mechanism used to disseminate the heartbeats. In the simplest implementation, each heartbeat is multicasted to all other group members. This results in a a network load of θ(n²/T) messages per second (even if IP multicast is used), where T is the failure detection time required by the distributed application. van Renesse et al [16] proposed that heartbeats be disseminated via a robust gossipstyle protocol. In this protocol, every t _gossip time units, each member gossips, to a few random targets, a θ(n) �-sized list of the latest known heartbeat counters received from other members. While gossiping reduces the false positive frequency, a new heartbeat count typically takes, on expectation, θ[log (n) ⋅ t_gossip] time units to reach an arbitrary other group member. In order to satisfy the applicationspecified detection time, the protocol generates a network load of θ[n² ⋅ log (n)/t_gossip] ���bytes a second. The use of message batching to solve this is limited by the UDP packet size limit, e.g. 5B heartbeats (IP address and count) of 50 members would already occupy 250 B, while SWIM generates packets that have a size of at most 135 B, regardless of the group size. The quadratic increase in the network load results from the communication of heartbeat notification to all group members. This can be avoided by separating the failure detection operation from that of membership update dissemination. CThis property is called Strong Completeness. Several hierarchical membership systems have been proposed, e.g. Congress [1]. This belongs to a broader class of solutions where each process heartbeats only a subgroup of processes. This class of protocols requires careful configuration and maintenance of the overlay along which membership information flows, and the accuracy of the protocol depends on the robustness of this graph. In comparison, the design of SWIM avoids the overhead of a virtual graph. SWIM’s solution to the above unscalability problems described above is based on (a) designing the failure detection and membership update dissemination components separately, and (b) using a non-heartbeat based strategy for failure detection. Before moving on to describe the SWIM protocol internals, we first lay the foundation for understanding the key characteristics of the efficiency and scalability of distributed failure detector protocols. Several research studies [6, 7, 12, 16], have led to the identification of these basic properties of distributed failure detector protocols (from both theoretical and practical angles), as well as impossibility results related to satisfying them concurrently. The resulting tradeoff is usually determined by the safety and liveness properties required by distributed applications. These properties are [12]: (1) Strong Completeness: crash-failure of any group member is detected by all non-faulty members [6]); (2) Speed of failure detection: the time interval between a member failure and its detection by some non-faulty group member; (3) Accuracy: the rate of false positives of failure detection; (4) Network Message Load, in bytes per second generated by the protocol. [6] proved the impossibility of building a failure detector over an asynchronous network that is both accurate (no false detections) and strongly complete. However, since a typical distributed application relies on Strong Completeness always holding (in order to maintain up to date information in dynamic groups), most failure detectors, including heartbeating-based solutions, guarantee this property while attempting to maintain a low rate of false positives. SWIM takes the same approach. In [12], a simple computation identifies the minimal total network load (bytes per second) required to satisfy specified parameters of false detection rate at each member (denoted PM(τ)), and detection time (τ) in a group of size n. [12] calculates this load as n ⋅ log (PM(τ))/log (p_ml) , where p_ml is the probability of a packet drop within the underlying network. Although this calculation is done under idealized conditions of independent message loss probabilities on each message ( p_ml ), it serves as a good baseline for comparing the scalability of different failure detection protocols. For example, the all-to-all heartbeat protocols discussed in Section 2 have a sub-optimality factor that varies linearly with group size.
3. The Basic SWIM Approach

先に述べたように、SWIMのアプローチは、2つの要素があります:

メンバーの障害を検出する故障検出器の要素と
最近参加またはグループを去った、または失敗しているメンバーに関する情報を発信する発信要素。

基礎的なSWIMプロトコルを記述することで基礎を作ります。
基本的なプロトコルは、（セクション3.1）¹² のランダムなプロービングに基づく障害検出プロトコルを使用し、ネットワークのマルチキャスト（セクション3.2）を介してメンバーシップの更新を発信します。
SWIMプロトコルは、この初期設計を改良することにより、後続のセクション（セクション4）で開発されます。
6. Conclusions and Future Work

SWIM(スケーラブルであり、弱一貫性のあるプロセス·グループ·メンバーシップ·プロトコル)の設計、実装、および性能評価を発表しました。
SWIMプロジェクトは、今日の分散システムの設計者に愛用されているハートビートベースのプロトコルがスケールしないことによって動機付けられている。
SWIMのソリューションは、故障検出と問題のメンバーシップ更新発信コンポーネントの分離に基づいている。
SWIM異常検出器は、ハートビートを回避し、代わりにプロセスのプロービング・ランダム・ピア·ツー·ピアを使用することによってスケーラビリティを達成します。
これは、グループメンバーに一定のオーバーヘッドだけでなく、故障の一定の予想される検出時間を提供しています。
メンバーシップの更新は、故障検出プロトコルによって生成されたパケットに便乗して、感染スタイル（伝染病スタイル）で効率的かつ確実に伝達されます。
障害検出時間とのトレードオフしながら、（仮想具現化番号での）疑惑メカニズムの追加は、偽陽性の頻度を低減します。
プロトコル保証時間への最後の拡張は、各非障害性のあるプロセスで障害を検出することが限界です。
SWIMは、このようにネットワーク内部のコア·ネットワーク要素上の帯域幅の使用を低減する¹³ 、トポロジ情報とのping対象の選択肢を計量することによって、ワイドエリアネットワーク（WAN）、仮想プライベートネットワーク（VPN）に拡張することができる。
我々は現在この機能を評価しています。
本稿では、特定のコンテキストでSWIMの設計と実装について説明しましたが、我々の結果は、より一般的な場合に適用可能である。
SWIMのデザインはプロセスの大規模なグループを対象にしていますが、我々の分析と実験結果は、より多くの中規模のサブグループ内(例えば、Chord, Pastry, Opus などのDHTシステムでのレプリカグループ¹⁴)で使用する場合には、分散ハートビートの代替案は、オーバーヘッドで大きさの減少の順番を与えることができることを示している
疑惑メカニズムは明確な障害検出とメンバーシップの更新コンポーネントのデザインを持つメンバーシップシステムに適用可能である。
今日のインターネットでは大規模な分散アプリケーションの継続的な増殖の程度は、これらのシステム内で実行プロトコルの拡張性と効率的な設計に依存します。
SWIMはこれらのアプリケーションのためのグループ·メンバーシップ·コンポーネントへソリューションを提供しています。
Acknowledgments:

We thank Werner Vogels for help with accessing and using the Galaxy cluster. We also thank Ben Atkin, Ken Birman, Alan Demers, Robbert van Renesse, and the anonymous referees, for reviewing drafts.
References

参考文献
http://www.cs.cornell.edu/gupta/swim


Birman. The process group approach to reliable distributed computing. Comm. of the ACM, 36(12):37–53,1993.


↩

Gupta, K. Birman, and R. van Renesse. Fighting fire with fire: using randomized gossip to combat stochastic scalability limits. To appear in The Journ. of Quality and Reliability Engineering International, 2002.

↩


Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky. Bimodal multicast. ACM Trans. on Computer Systems, 17(2):41–88, 1999.


↩

Demers, D. Greene, C. Hauser, W. Irish, and J. Larson. Epidemic algorithms for replicated database maintenance. In Proc. 6th Annual ACM Symp. Principles of Distributed Computing, pages 1–12. ACM Press, 1987.

↩
A.-M. Kermarrec, L. Massoulie, and A. Ganesh. Reliable probabilistic communication in large-scale information dissemination systems. Technical Report MMSR-TR-2000-105, Microsoft Research, Cambridge, UK, 2000.↩

Petersen, M. Spreitzer, D. Terry, M. Theimer, and A.J.Demers. Flexible update propagation for weakly consistent replication. In Proc. 16th ACM Symp. on Operating Systems Principles, pages 3–6. ACM Press, 1997.

↩

Rowstron and F. Kaashoek, editors. Proc. 1st Intnl.Workshop on Peer-to-Peer Systems. Springer-Verlag, 2002.

↩


Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. Journ. of the ACM, 32(2):374–382, 1985.


↩

Gupta, K. Birman, and R. van Renesse. Fighting fire with fire: using randomized gossip to combat stochastic scalability limits. To appear in The Journ. of Quality and Reliability Engineering International, 2002.

↩

van Renesse, Y. Minsky, and M. Hayden. A gossip-style failure detection service. In Proc. Middleware, pages 55–70,1998.

↩

van Renesse, Y. Minsky, and M. Hayden. A gossip-style failure detection service. In Proc. Middleware, pages 55–70,1998.

↩

Gupta, T. D. Chandra, and G. S. Goldszmidt. On scalable and efficient distributed failure detectors. In Proc. 20th Annual ACM Symp. on Principles of Distributed Computing,pages 170–179. ACM Press, 2001.

↩

van Renesse, Y. Minsky, and M. Hayden. A gossip-style failure detection service. In Proc. Middleware, pages 55–70,1998.

↩

Rowstron and F. Kaashoek, editors. Proc. 1st Intnl.Workshop on Peer-to-Peer Systems. Springer-Verlag, 2002.

↩