HA cluster

HA


SPOF

A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working.

source: wikipedia

Quorum

A quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system; typically this is a majority, floor(N/2) + 1 of N voters, so a five-node cluster keeps quorum with up to two failed nodes.

Arbiter - a voting-only member added to break ties when forming a quorum

source: wikipedia

Split-brain syndrome

A split brain (for example, after a network partition) indicates data or availability inconsistencies originating from the maintenance of two separate data sets with overlap in scope.

source: wikipedia

Cluster computing

A computer cluster consists of a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system.

source: wikipedia

Disaster recovery (DR)

A set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.

source: wikipedia

Failover

The continuation of a service after the failure of one or more of its components.

Load balancing

The distribution of workloads across multiple computing resources.

Load balancing is often used to implement failover.

source: wikipedia

Global Server Load Balancing (GSLB)

  • DNS based
  • Routing Policy
    • Weighted, Latency, Failover set, Geo-location
  • AWS Route53

Wildcard DNS record

  • *.example.com
  • virtual hosts
    • Host HTTP/1.1 header
    • Multiple host names bound to one IP address.
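
As a sketch, assuming a BIND-style zone file for example.com, a single wildcard A record makes every otherwise-unmatched subdomain resolve to one address:

; hypothetical zone fragment for example.com
*.example.com.    3600  IN  A  203.0.113.10
; foo.example.com and bar.example.com both resolve here; the web
; server then picks the virtual host from the HTTP/1.1 Host header.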

HA example


RAID (redundant array of independent disks)

A data storage virtualization technology that combines multiple physical disk drive components into a single logical unit for the purposes of data redundancy, performance improvement, or both.

source: wikipedia
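
As a minimal sketch with Linux mdadm (device names are assumptions), a two-disk RAID 1 mirror:

# create a two-disk RAID 1 mirror from two hypothetical partitions
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
# watch the array and its resync progress
cat /proc/mdstat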

Linux DM Multipath

DM(device mapper)-Multipathing provides input-output (I/O) fail-over and load-balancing by using multipath I/O within Linux for block devices.

source: wikipedia
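
A minimal /etc/multipath.conf sketch (values are illustrative, not tuned recommendations):

defaults {
    user_friendly_names yes        # name devices mpathN instead of by WWID
    path_grouping_policy multibus  # spread I/O across all healthy paths
}

multipath -ll then lists each multipath device with the state of its paths.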

Ceph

Object storage on a single distributed computer cluster. Ceph aims primarily for completely distributed operation without a single point of failure, scalable to the exabyte level.


Link aggregation

Combining (aggregating) multiple network connections in parallel in order to increase throughput beyond what a single connection could sustain, and to provide redundancy in case one of the links should fail.

source: wikipedia
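
A sketch with iproute2, assuming two NICs eth0/eth1 and an LACP-capable switch:

# create an 802.3ad (LACP) bond and enslave two interfaces
ip link add bond0 type bond mode 802.3ad
ip link set eth0 down; ip link set eth0 master bond0
ip link set eth1 down; ip link set eth1 master bond0
ip link set bond0 up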

Distributed Replicated Block Device (DRBD)

Mirrors a block device between hosts over the network; conceptually, RAID 1 over the network.

source: drbd.org
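
A hedged sketch of a two-node DRBD resource (hostnames, disks, and addresses are assumptions):

# /etc/drbd.d/r0.res
resource r0 {
    device    /dev/drbd0;
    disk      /dev/sdb1;      # backing device on each node
    meta-disk internal;
    on node1 { address 10.0.0.1:7789; }
    on node2 { address 10.0.0.2:7789; }
}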

Coordinator


What does a coordinator do?

  • Service Registration
  • Service Discovery
    • DNS support: Consul only (see the dig example after this list)
  • Consistent and durable general-purpose K/V store
  • Leader Election
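
For instance, Consul answers service lookups over DNS (port 8600 by default); assuming a service registered under the name web:

# SRV query returning healthy "web" instances and their ports
dig @127.0.0.1 -p 8600 web.service.consul SRV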

Consul vs Zookeeper vs Etcd

  • Consul
    • ready to use (DNS, health checks); commercially supported
  • Zookeeper
    • Java
    • old but stable
  • Etcd
    • new; just a simple K/V store
    • TTL, atomic operations


Etcd

  • TTL
  • compareAndDelete, compareAndSwap
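
A sketch against the etcd v2 HTTP API contemporary with this note (endpoint and key names are assumptions):

# set a key that expires after 30 seconds
curl -XPUT http://127.0.0.1:2379/v2/keys/leader -d value=node1 -d ttl=30
# compare-and-swap: succeeds only if the current value is still node1
curl -XPUT http://127.0.0.1:2379/v2/keys/leader -d value=node2 -d prevValue=node1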

Zookeeper

  • State is managed in units of znodes.
  • A znode is addressed by a filesystem-like path.
  • A znode can store data.
  • Changes to a znode can be watched.
  • Ephemeral znodes are deleted automatically when the client's connection to Zookeeper is lost:
  • EPHEMERAL
  • EPHEMERAL_SEQUENTIAL : ticketing (ordered, e.g. for leader election)
  • No TTL

var zk = require('zkHelper'),
    options = {
      basePath: '/myapps',
      configPath: '/myapps/config',
      node: require('os').hostname(),
      servers: ['zk0:2181', 'zk1:2181', 'zk2:2181'], // zk ensemble
      clientOptions: { sessionTimeout: 10000, retries: 3 }
    };

zk.init(options, function (err, zkClient) {
  if (err) { throw err; }
  if (zk.isMaster()) {
    console.info('I am master');
  } else {
    console.info('master', zk.getMaster() && zk.getMaster().master);
  }
  var config = zk.getConfig(); // configuration stored under configPath
});

HAproxy


haproxy

HAProxy is free, open source software that provides a high-availability load balancer and proxy server for TCP (L4) and HTTP-based (L7) applications, spreading requests across multiple servers.

  • manpage: fast and reliable http reverse proxy and load balancer

  • LVS (Linux Virtual Server): an L4-only alternative, implemented in kernel space


Reverse / Forward proxy

source: wikipedia

haproxy stat - hatop


hatop


L4 TCP socket

frontend mqtt_fe
    option tcplog
    bind :1883
    mode tcp
    timeout client 90s
    default_backend mqtt_be

backend mqtt_be
    mode tcp
    timeout server 90s
    # health check every 2000ms; 2 successes mark a server up, 3 failures mark it down
    server mqtt.1 10.0.0.11:1883 maxconn 50000 check inter 2000 rise 2 fall 3
    server mqtt.2 10.0.0.22:1883 maxconn 50000 check inter 2000 rise 2 fall 3
    server mqtt.3 10.0.0.13:1883 maxconn 50000 check inter 2000 rise 2 fall 3

L7 websocket/ssl, sticky, redirect

frontend mqttwss_fe
    option httplog
    bind :8083
    bind :8483 ssl crt mqtt.pem
    redirect scheme https if !{ ssl_fc }
    mode http
    default_backend mqttwss_be

backend mqttwss_be
    mode http
    # sticky sessions: insert a SRV cookie that names the chosen server
    cookie SRV insert indirect nocache
    server mqtt.1 10.0.0.11:80 cookie mqtt.1 maxconn 50000 check inter 2000 rise 2 fall 3
    server mqtt.2 10.0.0.22:80 cookie mqtt.2 maxconn 50000 check inter 2000 rise 2 fall 3
    server mqtt.3 10.0.0.13:80 cookie mqtt.3 maxconn 50000 check inter 2000 rise 2 fall 3

wildcard host routing

frontend HttpFrontend
    bind *:80
    mode http
    # match on the beginning of the Host header (case-insensitive)
    acl fooBackend hdr_beg(host) -i foo.
    acl barBackend hdr_beg(host) -i bar.

    use_backend fooBackend if fooBackend
    use_backend barBackend if barBackend

    default_backend bazBackend
<...>

autoscaling

  • Use haproxy instead of AWS ELB
  • Update haproxy to use all instances running in a security group.

update-haproxy.py [-h] --security-group SECURITY_GROUP [SECURITY_GROUP ...] --access-key ACCESS_KEY --secret-key SECRET_KEY [--output OUTPUT] [--template TEMPLATE] [--haproxy HAPROXY] [--pid PID] [--eip EIP] [--health-check-url HEALTH_CHECK_URL]


haproxy command line options

  • -D : run as a daemon
  • -f : config file (/etc/haproxy/haproxy.cfg)
  • -p : pid file in which to write its children's pids
  • -sf pidlist
    • Send the FINISH signal to the pids in pidlist after startup.
    • The processes which receive this signal wait for all sessions to finish before exiting.
  • -st pidlist
    • Send the TERMINATE signal to the pids in pidlist after startup.
    • The processes which receive this signal terminate immediately, closing all active sessions.
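
Together these enable a near-seamless reload; a common pattern (paths are the usual defaults):

# start new workers, then let the old ones finish gracefully
haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
        -sf $(cat /var/run/haproxy.pid)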

haproxy HA

  • use GSLB
  • Active/Standby

performance tuning

  • sysctl tuning (a sketch follows this list)
    • ulimit -a : check per-process limits, open files in particular
  • multi-process
    • ssl offloading
    • dedicate a process to a task
    • dedicate a processor to irq handling
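
A hedged starting point for the sysctl side (values are illustrative, not recommendations):

# /etc/sysctl.d/90-haproxy.conf (illustrative values)
fs.file-max = 2000000                      # system-wide open-file limit
net.core.somaxconn = 65535                 # listen backlog ceiling
net.ipv4.ip_local_port_range = 1024 65000  # more ephemeral ports for outbound connections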

multi-process

global
  nbproc 4            # number of processes

frontend access_http
   bind 0.0.0.0:80
   bind-process 1     # dedicate one process to http
   mode            http
   default_backend backend_nodes

frontend access_https
   bind 0.0.0.0:443 ssl crt /etc/yourdomain.pem 
   bind-process 2 3 4 # dedicate the other processes to https
   mode            http
   option           forwardfor
   option           accept-invalid-http-request
   reqadd         X-Forwarded-Proto:\ https
   default_backend backend_nodes

Firewall


Firewall HA

source: [firewall-ha-with-conntrackd-and-keepalived](http://backreference.org/2013/04/03/firewall-ha-with-conntrackd-and-keepalived/)

Keepalived

  • VRRP(Virtual Router Redundancy Protocol)
  • LVS(Linux Virtual Server)

Keepalived - VRRP

vrrp_instance E1 {
    interface eth0
    state BACKUP
    virtual_router_id 61
    priority 100
    advert_int 1        # advertise every 1sec to multicast: 224.0.0.18
    virtual_ipaddress {
        10.15.7.100/24 dev eth0
        2001:db8:15:7::100/64 dev eth0 
    }
    nopreempt           # a recovered node does not take MASTER back from the current one
    garp_master_delay 1 # 1sec delay for gratuitous ARP after transition to MASTER
}

Conntrackd

Sync {
  Mode FTFW {
    DisableExternalCache Off
  }
  UDP {
    IPv4_address 10.0.0.1
    IPv4_Destination_Address 10.0.0.2
    Port 3780
    Interface eth2
    SndSocketBuffer 1249280
    RcvSocketBuffer 1249280
    Checksum on
  }
}


failover scenario

  • FW1: fails, so VRRP advertisement packets are no longer sent.
  • FW2: takes over the VIP and sends gratuitous ARP packets.
    • The switch updates the MAC address on its port.
    • The nodes update their ARP tables.
  • FW2: promotes its external cache (FW1's conntrack state) into the internal (kernel) cache.

source: [VRRP(Virtual Router Redundancy Protocol) 상세 동작 원리](https://www.slideshare.net/netmanias-ko/netmanias20080324-vrrp-protocoloverview)

Pacemaker/Corosync


Corosync

Provides clustering infrastructure such as membership, messaging, and quorum.


corosync.conf

# quorum configuration
quorum {
  provider: corosync_votequorum
  two_node: 0
}
# totem protocol settings
totem {
  version: 2
  token: 3000       # time (ms) without receiving the token before the node is declared failed
  token_retransmits_before_loss_const: 10
  join: 60
  consensus: 3600   # time (ms) to wait for consensus before starting a new quorum membership round
  ...
}

pacemaker

Pacemaker is open-source high-availability resource manager software, used on computer clusters since 2004. Its preferred API for this purpose is the OCF resource agent API.

  • OCF (Open Cluster Framework)
    • A shell script similar to an LSB (Linux Standard Base) init script.
    • Used to build Resource Agents.
    • Pacemaker acts according to the agent's exit code (see the skeleton after this list).
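
A minimal sketch of a resource agent (myservice is hypothetical; real agents also implement meta-data and validate-all):

#!/bin/sh
# minimal OCF resource agent skeleton
OCF_SUCCESS=0; OCF_ERR_GENERIC=1; OCF_NOT_RUNNING=7

case "$1" in
  start)   myservice start  && exit $OCF_SUCCESS || exit $OCF_ERR_GENERIC ;;
  stop)    myservice stop   && exit $OCF_SUCCESS || exit $OCF_ERR_GENERIC ;;
  monitor) myservice status && exit $OCF_SUCCESS || exit $OCF_NOT_RUNNING ;;
  *)       exit 3 ;;  # OCF_ERR_UNIMPLEMENTED
esac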

Two-node Active/Passive clusters using Pacemaker and DRBD are a cost-effective solution for many High Availability situations.

source: clusterlabs.org

By supporting many nodes, Pacemaker can dramatically reduce hardware costs by allowing several active/passive clusters to be combined and share a common backup node.

source: clusterlabs.org

When shared storage is available, every node can potentially be used for failover. Pacemaker can even run multiple copies of services to spread out the workload.

source: clusterlabs.org

N to N Redundancy

# crm node definitions: <corosync node id>: <node name>
node 2: node-01
node 9: node-02
node 13: node-03

CRM property

property cib-bootstrap-options: \
        dc-version=1.1.12-561c4cf \
        cluster-infrastructure=corosync \
        no-quorum-policy=stop \
        stonith-enabled=false \
        start-failure-is-fatal=false \
        symmetric-cluster=false \
        last-lrm-refresh=1490016772

When 1 of the 3 nodes is cut off

  • The remaining two nodes keep operating normally.
  • The isolated node cannot form a quorum, so all resources on that node stop!
    • no-quorum-policy=stop

Resource - ex) conntrackd + vip

conntrackd: one master on a single node, with slaves on the other nodes. vip public: a single resource on one node.

primitive p_conntrackd ocf:fuel:ns_conntrackd \
	op monitor interval=30 timeout=60 \
	op monitor interval=27 role=Master timeout=60 \
	params bridge=br-mgmt \
	meta migration-threshold=INFINITY failure-timeout=180s

primitive vip__vrouter_pub ocf:fuel:ns_IPaddr2 \
	op monitor interval=5 timeout=20 \
	op start interval=0 timeout=30 \
	op stop interval=0 timeout=30 \
	meta migration-threshold=3 failure-timeout=60 resource-stickiness=1

ms master_p_conntrackd p_conntrackd \
  meta notify=true ordered=false interleave=true clone-node-max=1 master-max=1 master-node-max=1

Resource - ex) conntrackd + vip

vip internal, vip public, and conntrackd run on the same node.

location master_p_conntrackd-on-node-01 master_p_conntrackd 100: node-01
location master_p_conntrackd-on-node-02 master_p_conntrackd 100: node-02
location master_p_conntrackd-on-node-03 master_p_conntrackd 100: node-03

colocation conntrackd-with-pub-vip inf: vip__vrouter_pub:Started master_p_conntrackd:Master

colocation vip__vrouter-with-vip__vrouter_pub inf: vip__vrouter vip__vrouter_pub

Galera

Synchronous multi-master replication for MySQL/MariaDB.


terms

  • wsrep (write-set replication)
  • GTID (Global Transaction ID): {uuid}:{sequence number}
  • State: INIT -> JOINER -> JOINED -> SYNCED
  • SST (State Snapshot Transfer)
  • IST (Incremental State Transfer)
  • IST trigger conditions (see the config sketch after this list):
    • The cluster group's state UUID must match the joiner node's state UUID.
    • All missing write-sets must still be in the donor's write-set cache.
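
A hedged my.cnf sketch of the core wsrep settings (paths, addresses, and names are assumptions):

# my.cnf fragment (illustrative)
[mysqld]
wsrep_on = ON
wsrep_provider = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_address = gcomm://10.0.0.1,10.0.0.2,10.0.0.3
wsrep_cluster_name = my_galera
wsrep_sst_method = xtrabackup-v2   # full-snapshot SST method; IST is preferred when possible

Replication state can then be inspected with SHOW STATUS LIKE 'wsrep_%'.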

...


vs

  • zookeeper, etcd, consul: building blocks for your own coordinator; loosely connected.

  • haproxy: failover and load balancing for microservices.

  • Pacemaker: really only designed to do membership and failure detection at small scale (<50 nodes); tightly connected.


Requirements for HA Clustering

  • Technology: ...
  • Process: documented, with clear ownership.
  • People: skill set, attitudes, leadership, roles & responsibilities.

Thank you
