harvester design

Harvester Cloud Provider Design

Table of Contents

Overview

Executive Summary

Harvester is open-source HCI (hyperconverged infrastructure) software built on Kubernetes. It is convenient for users to set up a Kubernetes cluster on Harvester, for example by using the Rancher node driver. In this sense, it is logical to make Harvester a cloud provider for Kubernetes. That is our motivation for implementing a Kubernetes cloud controller manager (CCM) for Harvester, as other cloud providers such as AWS and OpenStack do.

Currently, users can spin up a Kubernetes cluster through the Harvester node driver, but they can only expose services through NodePort or Ingress. Our motivation is to provide a LoadBalancer type of service for the guest cluster.

Issues:

Requirements

  • Harvester should have a fixed VIP for the external node driver.
  • Harvester should provide load balancers for the guest clusters constructed by Harvester VMs.
  • Develop a CCM for Harvester.

Solutions

  • Use Kube-vip to implement control-plane HA and provide a fixed VIP.
  • There are 2 options to implement load balancers.
    • Leverage Kubernetes service.
    • Use Traefik.
  • Develop an out-of-tree CCM.

We will discuss the technical details below.

Technical Details

Use Kube-vip to implement control-plane HA and provide a fixed VIP

  1. We have to add the VIP to the TLS certificate before setting up Harvester so that users can access the API servers through it. Harvester is a K3s cluster; according to the K3s Server Configuration Reference, we should set the VIP as the value of the --tls-san parameter.
  2. Kube-vip sets the VIP as a floating IP on the specified network interface. The Harvester network controller should take the floating IP into account when setting up the VLAN network.
  3. Kube-vip supports two VIP failover mechanisms, and both have limitations.
    • ARP mode: single-node bottleneck and potentially slow failover.
    • BGP mode: depends on the BGP configuration of external routers.

Harvester Load Balancer Implementation

The following two components are required.

  • Load Balancer: the basic component that forwards traffic to the Kubernetes load balancer service.
  • Entry: we need an entry point to access the service behind the load balancer. Kube-vip is a proper choice.
Option 1: Leverage Kubernetes service

A Kubernetes service is a built-in load balancer for pods, and a VM in Harvester runs inside a pod. Thus, it is simple to provide a load balancer by leveraging a Kubernetes service in Harvester. Every service of type "LoadBalancer" in the guest cluster has a NodePort; we take this NodePort as the target port of a service in Harvester. In this way, a service in Harvester can become a load balancer for the service of the guest cluster.
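
For illustration, here is a minimal sketch (not the final implementation) of how the CCM could create such a Harvester-side service with client-go. The service name, the selector label, and the guestNodePort parameter are assumptions made up for the example.

// Sketch: create a LoadBalancer service in Harvester whose target port is the
// NodePort of the guest cluster's service. Names and labels are illustrative.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
)

func ensureGuestClusterLB(ctx context.Context, client kubernetes.Interface, guestNodePort int32) error {
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "guest-cluster-nginx-lb", // hypothetical name
			Namespace: "default",
		},
		Spec: corev1.ServiceSpec{
			Type: corev1.ServiceTypeLoadBalancer,
			// Assumed label selecting the VM pods of the guest cluster nodes.
			Selector: map[string]string{"harvesterhci.io/cluster": "guest-cluster"},
			Ports: []corev1.ServicePort{{
				Name:       "http",
				Protocol:   corev1.ProtocolTCP,
				Port:       80,
				TargetPort: intstr.FromInt(int(guestNodePort)), // the guest service's NodePort
			}},
		},
	}
	_, err := client.CoreV1().Services(svc.Namespace).Create(ctx, svc, metav1.CreateOptions{})
	return err
}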

image-20210517173310781

Pros:

  • It's simple to configure and we don’t need to modify the source code.

Cons:

  • A health check needs to be added.

  • Poor extensibility because it is limited by the Kubernetes service; for example, QoS is not supported.

  • TLS is not supported.

Option 2: Use Traefik

We can use a reverse proxy such as Traefik as the load balancer component implementation.

image-20210517184017675

Pros:

  • Better forwarding performance compared with Kube-vip.
  • Supports both layer 4 and layer 7 protocols.
  • Supports TLS.
  • Supports dynamic reload.
  • Supports health checks.

Cons:

  • Requires custom development and future maintenance of the LB controller and Traefik configurations.
Load balancer Interface
type LoadBalancer struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              LoadBalancerSpec   `json:"spec,omitempty"`
	Status            LoadBalancerStatus `json:"status,omitempty"`
}

type LoadBalancerSpec struct {
	// +optional
	Description string     `json:"description,omitempty"`
  // +optional
	Listeners   []Listener `json:"listeners"`
}

type Listener struct {
	Name           string          `json:"name"`
	EntryPort      int             `json:"entryPort"`
	Protocol       string          `json:"protocol"`
  // +optional
	BackendServers []BackendServer `json:"backendServers"`
  // +optional
	HealthCheck    HealthCheck     `json:"healthCheck,omitempty"`
  // +optional
  // TODO TLS
  // +optional
  // TODO Middleware
}

type BackendServer struct {
	Address string `json:"address"`
  // +optional
	Weight  int    `json:"weight,omitempty"`
}

type HealthCheck struct {
	Path     string        `json:"path"`
	Port     int           `json:"port"`
	Interval time.Duration `json:"interval"`
	Timeout  time.Duration `json:"timeout"`
}

type LoadBalancerStatus struct {
	// +optional
	Conditions []Condition `json:"conditions,omitempty"`
}
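
As a usage illustration, a LoadBalancer object for an HTTP service of the guest cluster could look like the following; all of the values are hypothetical.

lb := &LoadBalancer{
	ObjectMeta: metav1.ObjectMeta{Name: "nginx-lb", Namespace: "default"},
	Spec: LoadBalancerSpec{
		Description: "load balancer for the guest cluster nginx service",
		Listeners: []Listener{{
			Name:      "http",
			EntryPort: 80,
			Protocol:  "tcp",
			BackendServers: []BackendServer{
				{Address: "172.16.178.178", Weight: 1}, // guest node VM IP (illustrative)
			},
			HealthCheck: HealthCheck{
				Path:     "/healthz",
				Port:     32586, // the guest service's NodePort (illustrative)
				Interval: 5 * time.Second,
				Timeout:  3 * time.Second,
			},
		}},
	},
}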

Harvester Cloud Controller Manager

The Kubernetes website provides a brief development guideline for cloud controller managers. The main task is to implement the Cloud Provider Interface, including the Instances (an instance is called a node in Kubernetes), LoadBalancer, and Routes interfaces, among others. We focus on the Instances and LoadBalancer interfaces.

// Interface is an abstract, pluggable interface for cloud providers.
type Interface interface {
	// Initialize provides the cloud with a kubernetes client builder and may spawn goroutines
	// to perform housekeeping or run custom controllers specific to the cloud provider.
	// Any tasks started here should be cleaned up when the stop channel closes.
	Initialize(clientBuilder ControllerClientBuilder, stop <-chan struct{})
	// LoadBalancer returns a balancer interface. Also returns true if the interface is supported, false otherwise.
	LoadBalancer() (LoadBalancer, bool)
	// Instances returns an instances interface. Also returns true if the interface is supported, false otherwise.
	Instances() (Instances, bool)
	// InstancesV2 is an implementation for instances and should only be implemented by external cloud providers.
	// Implementing InstancesV2 is behaviorally identical to Instances but is optimized to significantly reduce
	// API calls to the cloud provider when registering and syncing nodes. Implementation of this interface will
	// disable calls to the Zones interface. Also returns true if the interface is supported, false otherwise.
	InstancesV2() (InstancesV2, bool)
	// Zones returns a zones interface. Also returns true if the interface is supported, false otherwise.
	// DEPRECATED: Zones is deprecated in favor of retrieving zone/region information from InstancesV2.
	// This interface will not be called if InstancesV2 is enabled.
	Zones() (Zones, bool)
	// Clusters returns a clusters interface.  Also returns true if the interface is supported, false otherwise.
	Clusters() (Clusters, bool)
	// Routes returns a routes interface along with whether the interface is supported.
	Routes() (Routes, bool)
	// ProviderName returns the cloud provider ID.
	ProviderName() string
	// HasClusterID returns true if a ClusterID is required and set
	HasClusterID() bool
}

We focus on the LoadBalancer and InstancesV2 interfaces.
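
A minimal out-of-tree skeleton, sketched below under the assumption that the provider is named "harvester", registers the provider and reports support only for the interfaces we implement:

package harvester

import (
	"io"

	cloudprovider "k8s.io/cloud-provider"
)

const providerName = "harvester" // assumed provider name

type cloudProvider struct {
	loadBalancer cloudprovider.LoadBalancer
	instances    cloudprovider.InstancesV2
}

func init() {
	// Register the provider so the CCM binary can be started with --cloud-provider=harvester.
	cloudprovider.RegisterCloudProvider(providerName, func(config io.Reader) (cloudprovider.Interface, error) {
		return &cloudProvider{}, nil
	})
}

func (c *cloudProvider) Initialize(clientBuilder cloudprovider.ControllerClientBuilder, stop <-chan struct{}) {
}
func (c *cloudProvider) LoadBalancer() (cloudprovider.LoadBalancer, bool) { return c.loadBalancer, true }
func (c *cloudProvider) Instances() (cloudprovider.Instances, bool)       { return nil, false }
func (c *cloudProvider) InstancesV2() (cloudprovider.InstancesV2, bool)   { return c.instances, true }
func (c *cloudProvider) Zones() (cloudprovider.Zones, bool)               { return nil, false }
func (c *cloudProvider) Clusters() (cloudprovider.Clusters, bool)         { return nil, false }
func (c *cloudProvider) Routes() (cloudprovider.Routes, bool)             { return nil, false }
func (c *cloudProvider) ProviderName() string                             { return providerName }
func (c *cloudProvider) HasClusterID() bool                               { return true }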

InstancesV2
type InstancesV2 interface {
	// InstanceExists returns true if the instance for the given node exists according to the cloud provider.
	// Use the node.name or node.spec.providerID field to find the node in the cloud provider.
	InstanceExists(ctx context.Context, node *v1.Node) (bool, error)
	// InstanceShutdown returns true if the instance is shutdown according to the cloud provider.
	// Use the node.name or node.spec.providerID field to find the node in the cloud provider.
	InstanceShutdown(ctx context.Context, node *v1.Node) (bool, error)
	// InstanceMetadata returns the instance's metadata. The values returned in InstanceMetadata are
	// translated into specific fields and labels in the Node object on registration.
	// Implementations should always check node.spec.providerID first when trying to discover the instance
	// for a given node. In cases where node.spec.providerID is empty, implementations can use other
	// properties of the node like its name, labels and annotations.
	InstanceMetadata(ctx context.Context, node *v1.Node) (*InstanceMetadata, error)
}

We can fetch the instance information through the VirtualMachine and VirtualMachineInstance API of Harvester.

We may need the help of the guest agent to get the VM address.
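
A hedged sketch of how the instance metadata could be assembled from a KubeVirt VirtualMachineInstance (whose interface IPs are reported by the guest agent). The provider ID format and the kubevirt.io/api/core/v1 import path are assumptions, not the final design.

import (
	corev1 "k8s.io/api/core/v1"
	cloudprovider "k8s.io/cloud-provider"
	kubevirtv1 "kubevirt.io/api/core/v1"
)

// instanceMetadata maps a VirtualMachineInstance to cloud-provider metadata.
func instanceMetadata(vmi *kubevirtv1.VirtualMachineInstance) *cloudprovider.InstanceMetadata {
	addresses := make([]corev1.NodeAddress, 0, len(vmi.Status.Interfaces))
	for _, iface := range vmi.Status.Interfaces {
		if iface.IP != "" {
			addresses = append(addresses, corev1.NodeAddress{
				Type:    corev1.NodeInternalIP,
				Address: iface.IP,
			})
		}
	}
	return &cloudprovider.InstanceMetadata{
		// Assumed provider ID format: harvester://<namespace>/<vm name>.
		ProviderID:    "harvester://" + vmi.Namespace + "/" + vmi.Name,
		NodeAddresses: addresses,
	}
}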

LoadBalancer

The LoadBalancer API depends on how we implement the load balancer.
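
Whichever option is chosen, the CCM has to satisfy the cloudprovider.LoadBalancer interface. The skeleton below sketches the expected methods; the name derivation and the reconciliation logic are placeholders, not the final implementation.

type loadBalancerManager struct{}

func (l *loadBalancerManager) GetLoadBalancerName(ctx context.Context, clusterName string, service *corev1.Service) string {
	// Derive a unique, stable name from the guest cluster and the service (assumption).
	return "lb-" + clusterName + "-" + service.Namespace + "-" + service.Name
}

func (l *loadBalancerManager) GetLoadBalancer(ctx context.Context, clusterName string, service *corev1.Service) (*corev1.LoadBalancerStatus, bool, error) {
	// Look up the Harvester load balancer created for this service.
	return nil, false, nil
}

func (l *loadBalancerManager) EnsureLoadBalancer(ctx context.Context, clusterName string, service *corev1.Service, nodes []*corev1.Node) (*corev1.LoadBalancerStatus, error) {
	// Create or update the Harvester load balancer; the backends are the guest
	// nodes' addresses plus the service's NodePort.
	return &corev1.LoadBalancerStatus{}, nil
}

func (l *loadBalancerManager) UpdateLoadBalancer(ctx context.Context, clusterName string, service *corev1.Service, nodes []*corev1.Node) error {
	return nil
}

func (l *loadBalancerManager) EnsureLoadBalancerDeleted(ctx context.Context, clusterName string, service *corev1.Service) error {
	return nil
}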

Known Issues and limitations

  • The physical NIC can't be used by the VLAN network and Kube-vip at the same time if Kube-vip gets the VIP from a DHCP server.

Harvester Network Controller Refactor

Defects and problems in version 0.1.0

  1. Relies on connman to manage DHCP for the specified physical NIC
    • Calls the connmanctl command directly, which is hard-coded
    • The operating system must have connman installed and use it to manage the network, which introduces an OS dependency to some extent
  2. After operating on a network interface, uses sleep or polling to get the latest state
    • Different network hardware performs differently, so the sleep time, polling interval, and retry count cannot be set accurately
    • The controller's processing time increases
  3. The physical NIC configuration lives in the single global network-setting
    • All nodes can only configure a physical NIC with the same name; a node without a NIC of that name cannot be configured successfully
    • The harvester network controller pods on every node all write status updates into network-setting, so the status cannot reflect the situation of each individual node

Refactoring Approach

The diagram below shows the architecture of harvester-network-controller version 0.1.0.

image-20201229150123530

After refactoring, the overall architecture is as follows.

image-20201229193641642

There are three main changes, explained below.

  1. Create a CRD to carry network information

    • Add a harvester node CRD containing network and networkStatus
    • Add a harvester node controller responsible for managing the lifecycle of harvester nodes
  2. Watch the state of network interfaces on the node (watch & list)

    • Introduce netlink to monitor the bridge and the physical NICs (including IPs, routes, etc.), repair abnormal states where possible, and record and update the status in the CR
    • Query actively on a schedule and update the status
  3. Change how the bridge obtains its configured IP

    • Add ebtables rules so that the physical NIC can send and receive DHCP packets normally

      ebtables -t broute -A BROUTING -p ipv4 --ip-proto udp --ip-destination-port 67:68 -i ens192 -s <physical NIC mac address> -j DROP
      ebtables -t broute -A BROUTING -p ipv4 --ip-proto udp --ip-destination-port 67:68 -i ens192 -d <physical NIC mac address> -j DROP
    • Watch for IP changes on the physical NIC and reconfigure the bridge IP when they occur (see the sketch after this list)
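
As one possible way to implement the last step (watching the physical NIC's IP and reconfiguring the bridge), here is a minimal sketch using the github.com/vishvananda/netlink library; the reconfigureBridgeIP helper is a hypothetical placeholder.

import (
	"log"

	"github.com/vishvananda/netlink"
)

// Sketch: subscribe to address updates and react to changes on the given NIC.
func watchNICAddr(nicName string) error {
	nic, err := netlink.LinkByName(nicName)
	if err != nil {
		return err
	}

	updates := make(chan netlink.AddrUpdate)
	done := make(chan struct{})
	if err := netlink.AddrSubscribe(updates, done); err != nil {
		return err
	}

	for update := range updates {
		// Ignore events for links other than the watched physical NIC.
		if update.LinkIndex != nic.Attrs().Index {
			continue
		}
		log.Printf("address change on %s: %s (new=%v)", nicName, update.LinkAddress.String(), update.NewAddr)
		// reconfigureBridgeIP("harvester-br0", update) // hypothetical helper
	}
	return nil
}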

CRD

  • Which attributes of the native node need to be inherited?
apiVersion: harvester.cattle.io/v1alpha1
kind: Node
metadata:
  name: harvester-6d7sw
  namespace: harvester-system
  labels:
    harvester.cattle.io/networkConfigTemplate: defaultVlan     # bind to the networkConfigTemplate
spec:
  network:
    type: vlan      # vlan or vxlan
    NIC: eth0
status:
  networkStatus:
    eth0:      # physical NIC name 
      index: 1
      type: physical nic
      mac: xx.xx.xx.xx
      linkStatus: up
      ipv4Address: 172.16.1.13/16
      master: harvester-br0
      routes: ""
      conditions:
        Ready:
          lastProbeTime: ""
          lastTransitionTime: "2020-12-21T09:11:14Z"
          message: ""
          reason: ""
          status: True
          type: Ready
    harvester-br0:    # bridge name
      index: 100
      type: bridge
      mac: xx.xx.xx.xx
      linkStatus: up
      ipv4Address: 172.16.1.13/16
      vlanfilter: true
      promisc: true
      routes:
      - "172.16.0.0/16 dev harvester-br0 proto kernel scope link src 172.16.1.14"
      conditions:
        Ready:
          lastProbeTime: ""
          lastTransitionTime: "2020-12-21T09:11:14Z"
          message: ""
          reason: ""
          status: True
          type: Ready
    # harvester-vtep0:       # reserved for vxlan vtep
    # harvester-vtep1:
    networkNumbers:
      numbers: 
      - 1
      - 2
      conditions:
        Ready:
          lastProbeTime: ""
          lastTransitionTime: "2020-12-21T09:11:14Z"
          message: ""
          reason: ""
          status: True
          type: Ready
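
A sketch of the Go types that might back this CRD, mirroring part of the example above; the exact field names are assumptions derived from the YAML.

type Node struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              NodeSpec   `json:"spec,omitempty"`
	Status            NodeStatus `json:"status,omitempty"`
}

type NodeSpec struct {
	Network Network `json:"network,omitempty"`
}

type Network struct {
	Type string `json:"type"` // vlan or vxlan
	NIC  string `json:"nic"`
}

type NodeStatus struct {
	// NetworkStatus is keyed by link name, e.g. "eth0" or "harvester-br0".
	NetworkStatus map[string]LinkStatus `json:"networkStatus,omitempty"`
}

type LinkStatus struct {
	Index       int                  `json:"index"`
	Type        string               `json:"type"`
	MAC         string               `json:"mac"`
	LinkStatus  string               `json:"linkStatus"`
	IPv4Address string               `json:"ipv4Address"`
	Master      string               `json:"master,omitempty"`
	Routes      []string             `json:"routes,omitempty"`
	Conditions  map[string]Condition `json:"conditions,omitempty"`
}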

netlink demo

package main

import (
	"bytes"
	"encoding/binary"
	"log"
	"syscall"

	"github.com/mdlayher/netlink"
	"github.com/mdlayher/netlink/nlenc"
)

// https://www.smacked.org/docs/netlink.pdf

func main() {
	c, err := netlink.Dial(syscall.NETLINK_ROUTE, &netlink.Config{
		// Subscribe to link, IPv4 address, and IPv4 route multicast groups
		// (group bit = 1 << (RTNLGRP_* - 1)).
		Groups: (1 << (syscall.RTNLGRP_LINK - 1)) | (1 << (syscall.RTNLGRP_IPV4_IFADDR - 1)) | (1 << (syscall.RTNLGRP_IPV4_ROUTE - 1)),
	})
	if err != nil {
		log.Fatalf("failed to dial netlink: %v", err)
	}
	defer c.Close()

	for {
		// Listen for netlink messages triggered by multicast groups
		msgs, err := c.Receive()
		if err != nil {
			log.Fatalf("failed to receive messages: %v", err)
		}

		for _, m := range msgs {
			switch m.Header.Type {
			case syscall.RTM_DELLINK:
				ifInfomsg := syscall.IfInfomsg{}
				buf := bytes.NewBuffer(m.Data[:syscall.SizeofIfInfomsg])
				if err := binary.Read(buf, nlenc.NativeEndian(), &ifInfomsg); err != nil {
					log.Println(err)
					break
				}
				log.Printf("ifInfomsg: %+v", ifInfomsg)
				decodeAttributes(m.Data[syscall.SizeofIfInfomsg:])
			case syscall.RTM_NEWADDR:
			case syscall.RTM_DELADDR:
			case syscall.RTM_NEWROUTE:
			case syscall.RTM_DELROUTE:
			default:
			}
		}
	}
}

func decodeAttributes(b []byte) {
	ad, err := netlink.NewAttributeDecoder(b)
	if err != nil {
		log.Fatal(err)
	}

	for ad.Next() {
		switch ad.Type() {
		case syscall.IFLA_IFNAME:
			log.Printf("ifname: %s", ad.String())
		default:

		}
	}
}

Interaction Design

  1. harvester installer
    • Add a network type dropdown; currently only vlan is supported, and the field may be left empty
    • Add a physical NIC dropdown; when there are multiple NICs, default to a NIC different from the one used by the management network
    • The installer generates YAML from the configuration and applies it to create the harvester node CR
  2. harvester UI
    • The host form supports viewing and configuring the network type and selecting the physical NIC
    • YAML viewing and editing switches to harvester.cattle.io/Node

FAQ

  • Why not design a dedicated CRD for network interfaces (e.g. one CRD resource per NIC, associated with the node via labels) instead of putting them in the node?

    For the current requirements there is no essential difference between the two. Designing a harvester node keeps room to extend it for other features in the future.

  • How does the netlink module handle devices other than the physical NIC and harvester-br0, such as user-configured bridges or bonds?

    They are filtered by name; events generated by user-configured bridges and bonds are simply ignored.

  • If multiple network types can be supported, is it reasonable to let the user choose at install time? Could the management network NIC be used as the VLAN bridge physical NIC by default (so no extra install configuration is needed), with the user able to change the NIC per node in the UI if necessary?

    Currently only the vlan type is supported, so it can be hard-coded in the configuration for now; once different types are supported, we can reconsider where it is most reasonable to configure them based on the design.

    We encourage users to use multiple NICs and to use different NICs for the management network and data networks such as VLANs, so that the management network and data networks are isolated. In general, if the management network goes down, the data network should not be affected. If the management network NIC were used as the VLAN bridge physical NIC by default, users might unconsciously end up using the same NIC for everything.

Reference

Monitoring Linux networking state using netlink

Netlink Library (libnl)

netlink.pdf

Harvester Networking

Overview

Harvester supports two kinds of networks, the [management network](#Management Network) and the [VLAN network](#VLAN Network). Before introducing them, it is necessary to understand KubeVirt VM networking, that is, how a pod and the VM inside it are connected.

image-20210105164142636

Harvester networking diagram

VM Networking

KubeVirt VM networking is implemented by binding mechanisms. As of now, the existing binding mechanisms are bridge, masquerade, and slirp. We only use the first two in Harvester networking: the bridge binding mechanism is applied in the VLAN network, while the masquerade binding mechanism is applied in the management network.

image-20210105215827048

Part of Harvester networking: VM networking

Both mechanisms use a tap device as the NIC of the VM and a bridge to forward packets. However, there are some differences between them.

  • Under the masquerade binding mechanism, both the interface inside the VM and the bridge have a fixed IP address, which means every bridge in different pods and every NIC in different VMs has the same IP address. Under the bridge binding mechanism, the IP address of the bridge is not used, and the IP address of the interface inside the VM can be anything you like.

  • The way packets are forwarded to the VM interface is different. Under the bridge binding mechanism, packets received by the interface in the pod created by the CNI (tentatively called the pod interface) are forwarded to the bridge through the layer 2 network, because the bridge is the master of the pod interface. Under the masquerade binding mechanism, traffic from the pod interface is NATed to the VM interface via iptables. The iptables rules are shown as follows.

    Chain PREROUTING (policy ACCEPT)
    target     prot opt source               destination
    KUBEVIRT_PREINBOUND  all  --  anywhere             anywhere
    Chain KUBEVIRT_PREINBOUND (1 references)
    target     prot opt source               destination
    DNAT       all  --  anywhere             anywhere             to:10.0.2.2
  • There are some issues with the bridge binding mechanism, which is why we adopt the masquerade binding mechanism in the management network.

    • Due to IPv4 address delegation, in bridge mode the pod doesn’t have an IP address configured, which may introduce issues with third-party solutions that may rely on it. For example, Istio may not work in this mode.
    • Live migration is not allowed with a pod network binding of bridge interface type, and some CNI plugins might not allow the use of a custom MAC address for your VM instances.

For more details about the binding mechanisms, please refer to the KubeVirt docs.

Management Network

The management network is easy to understand if you have some basic knowledge of Kubernetes networking, because the two are indeed the same if you ignore the VM networking described above. There are two notable points.

  • The Harvester cluster adopts Flannel as the default CNI.
  • You can access the VM directly with the pod address, but it can only be reached within the cluster unless you expose it externally by other means, because the address belongs to the internal network.

VLAN Network

Let's explain how the VLAN network works with the help of the VLAN network diagram below.

image-20210105185112481

VLAN network diagram

There are four key points of the VLAN network.

  • We create a bridge for every node and enable the VLAN filter to separate multiple VLANs (a setup sketch follows this list).
  • Commonly, we use a veth pair to connect a pod and the bridge.
  • A physical NIC is required to be set as a slave of the bridge so that we can transfer traffic outside the host.
  • The port of the switch connected with the physical NIC should be in trunk mode; otherwise, tagged packets whose VLAN ID differs from the PVID will be dropped.
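
A minimal sketch, using the github.com/vishvananda/netlink library, of creating the per-node bridge with VLAN filtering enabled and enslaving the physical NIC; the names and the lack of handling for pre-existing links are simplifications.

// Sketch: create a VLAN-filtering bridge and set the physical NIC as its slave.
func setupVLANBridge(nicName string) error {
	vlanFiltering := true
	br := &netlink.Bridge{
		LinkAttrs:     netlink.LinkAttrs{Name: "harvester-br0"},
		VlanFiltering: &vlanFiltering,
	}
	if err := netlink.LinkAdd(br); err != nil {
		return err
	}
	if err := netlink.LinkSetUp(br); err != nil {
		return err
	}

	nic, err := netlink.LinkByName(nicName)
	if err != nil {
		return err
	}
	// Enslave the physical NIC so traffic can leave the host.
	return netlink.LinkSetMaster(nic, br)
}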

In order to support DHCP, we set an ebtables rule to allow DHCP reply messages to be received by the slave NIC of the bridge, rather than by the bridge.

ebtables -t broute -A BROUTING -p ipv4 --ip-proto udp --ip-destination-port 68 -i <NIC NAME> -d <NIC mac address> -j DROP

Reference

VLAN filter support on bridge

VMI Networking

Interfaces and Networks

Harvester SNAT Problem

Problem Description

Harvester leverages a Kubernetes service to provide the load balancer for the service running in the Harvester virtual machines. The backend servers of the load balancer are <VM IP>:<service port> pairs.

cw-harv:/home/rancher # kubectl get svc
NAME                        TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)        AGE
default-nginx-lb-db9bdca5   LoadBalancer   10.43.113.238                  80:32586/TCP   4h32m

cw-harv:/home/rancher # kubectl get endpointslices
NAME                       ADDRESSTYPE   PORTS   ENDPOINTS       AGE
default-nginx-lb-db9bdca5  IPv4          80      172.16.178.178   4h33m

However, when the VMs are in the VLAN, we fail to access the service via the cluster IP address if the client and the backend VM are on the same Harvester host. The VLAN network topology is shown below.

image-20220220172154649

harvester:/home/rancher # curl 10.43.113.238
curl: (7) Failed to connect to 10.53.202.161 port 80: Connection timed out

Analysis

Figure out traffic path

Generally, for a network problem we first have to figure out how the traffic flows. In Kubernetes, when we connect to the cluster IP, kube-proxy translates the cluster IP (the destination address) to one of the backend server addresses, 172.16.178.178:80 in our case. Because the destination address 172.16.178.178 is in VLAN 178, the request and response packets go through the gateway. Let's mark the traffic path in the network topology.

image-20220220232132842

Capture packets

We can connect to the VMs between VLANs normally, so we first rule out external network configuration as the cause. To locate the fault, we capture packets on the network interfaces along the traffic path inside the Harvester host.

# eth0
localhost:/home/opensuse # tcpdump -i eth0 host 172.16.178.178 and host 172.16.0.57 -nnevvv
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:58:56.041457 00:0c:29:4c:93:8c > 4c:e9:e4:72:63:9c, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 7172, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.0.57.10598 > 172.16.178.178.80: Flags [S], cksum 0x0b3b (incorrect -> 0x24d3), seq 3653572735, win 64240, options [mss 1460,sackOK,TS val 3130476189 ecr 0,nop,wscale 7], length 0
16:58:56.041889 4c:e9:e4:72:63:9c > a6:5b:3e:50:d4:b2, ethertype 802.1Q (0x8100), length 78: vlan 178, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 63, id 7172, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.0.57.10598 > 172.16.178.178.80: Flags [S], cksum 0x24d3 (correct), seq 3653572735, win 64240, options [mss 1460,sackOK,TS val 3130476189 ecr 0,nop,wscale 7], length 0
16:58:56.043227 a6:5b:3e:50:d4:b2 > 4c:e9:e4:72:63:9c, ethertype 802.1Q (0x8100), length 78: vlan 178, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.178.178.80 > 172.16.0.57.38944: Flags [S.], cksum 0x0598 (correct), seq 4139910355, ack 3653572736, win 65160, options [mss 1460,sackOK,TS val 2775780302 ecr 3130476189,nop,wscale 7], length 0
16:58:56.043347 4c:e9:e4:72:63:9c > 00:0c:29:4c:93:8c, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.178.178.80 > 172.16.0.57.38944: Flags [S.], cksum 0x0598 (correct), seq 4139910355, ack 3653572736, win 65160, options [mss 1460,sackOK,TS val 2775780302 ecr 3130476189,nop,wscale 7], length 0
16:58:57.056295 00:0c:29:4c:93:8c > 4c:e9:e4:72:63:9c, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 7173, offset 0, flags [DF], proto TCP (6), length 60)
# eth1
localhost:/home/opensuse # tcpdump -i eth1 host 172.16.178.178 and host 172.16.0.57 -nnnvve
tcpdump: listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:58:56.041668 00:0c:29:4c:93:8c > 4c:e9:e4:72:63:9c, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 7172, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.0.57.10598 > 172.16.178.178.80: Flags [S], cksum 0x24d3 (correct), seq 3653572735, win 64240, options [mss 1460,sackOK,TS val 3130476189 ecr 0,nop,wscale 7], length 0
16:58:56.041834 4c:e9:e4:72:63:9c > a6:5b:3e:50:d4:b2, ethertype 802.1Q (0x8100), length 78: vlan 178, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 63, id 7172, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.0.57.10598 > 172.16.178.178.80: Flags [S], cksum 0x24d3 (correct), seq 3653572735, win 64240, options [mss 1460,sackOK,TS val 3130476189 ecr 0,nop,wscale 7], length 0
16:58:56.043030 a6:5b:3e:50:d4:b2 > 4c:e9:e4:72:63:9c, ethertype 802.1Q (0x8100), length 78: vlan 178, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.178.178.80 > 172.16.0.57.38944: Flags [S.], cksum 0x0b3b (incorrect -> 0x0598), seq 4139910355, ack 3653572736, win 65160, options [mss 1460,sackOK,TS val 2775780302 ecr 3130476189,nop,wscale 7], length 0
# veth2db2ad9c
localhost:/home/opensuse # tcpdump -i veth2db2ad9c host 172.16.178.178 and host 172.16.0.57 -nnevvv
tcpdump: listening on veth2db2ad9c, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:58:56.041888 4c:e9:e4:72:63:9c > a6:5b:3e:50:d4:b2, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 7172, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.0.57.10598 > 172.16.178.178.80: Flags [S], cksum 0x24d3 (correct), seq 3653572735, win 64240, options [mss 1460,sackOK,TS val 3130476189 ecr 0,nop,wscale 7], length 0
16:58:56.042634 a6:5b:3e:50:d4:b2 > 4c:e9:e4:72:63:9c, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.178.178.80 > 172.16.0.57.10598: Flags [S.], cksum 0x0b3b (incorrect -> 0x7452), seq 4139910355, ack 3653572736, win 65160, options [mss 1460,sackOK,TS val 2775780302 ecr 3130476189,nop,wscale 7], length 0
16:58:57.056409 a6:5b:3e:50:d4:b2 > 4c:e9:e4:72:63:9c, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.178.178.80 > 172.16.0.57.10598: Flags [S.], cksum 0x0b3b (incorrect -> 0x705d), seq 4139910355, ack 3653572736, win 65160, options [mss 1460,sackOK,TS val 2775781315 ecr 3130476189,nop,wscale 7], length 0
# VM eth0
bash-5.0# tcpdump -i net1 host 172.16.178.178 and host 172.16.0.57 -nnnvve
tcpdump: listening on net1, link-type EN10MB (Ethernet), capture size 262144 bytes
08:58:56.041934 4c:e9:e4:72:63:9c > a6:5b:3e:50:d4:b2, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 7172, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.0.57.10598 > 172.16.178.178.80: Flags [S], cksum 0x24d3 (correct), seq 3653572735, win 64240, options [mss 1460,sackOK,TS val 3130476189 ecr 0,nop,wscale 7], length 0
08:58:56.042474 a6:5b:3e:50:d4:b2 > 4c:e9:e4:72:63:9c, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.178.178.80 > 172.16.0.57.10598: Flags [S.], cksum 0x0b3b (incorrect -> 0x7452), seq 4139910355, ack 3653572736, win 65160, options [mss 1460,sackOK,TS val 2775780302 ecr 3130476189,nop,wscale 7], length 0
08:58:57.055888 a6:5b:3e:50:d4:b2 > 4c:e9:e4:72:63:9c, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.178.178.80 > 172.16.0.57.10598: Flags [S.], cksum 0x0b3b (incorrect -> 0x705d), seq 4139910355, ack 3653572736, win 65160, options [mss 1460,sackOK,TS val 2775781315 ecr 3130476189,nop,wscale 7], length 0

image-20220220172249420

It shows that the VM network interface receives the SYN packet from the client normally. However, the destination port of the SYN/ACK response is changed (from 10598 to 38944) after the bridge forwards it from veth2db2ad9c to eth1. Because of the mismatched port, the SYN/ACK packet is dropped, which leads to an incomplete TCP 3-way handshake. According to the conntrack table, the modified port is the source port from the original direction. It is reasonable to guess that a NAT happens when the bridge forwards the SYN/ACK packets.

localhost:/home/opensuse # conntrack -L | grep 172.16.178.178
tcp      6 56 SYN_RECV src=172.16.0.57 dst=10.43.113.238 sport=38944 dport=80 src=172.16.178.178 dst=172.16.0.57 sport=80 dport=10598 mark=0 use=1
conntrack v1.4.5 (conntrack-tools): 262 flow entries have been shown.

As the Kubernetes document describes,

if the plugin connects containers to a Linux bridge, the plugin must set the net/bridge/bridge-nf-call-iptables sysctl to 1 to ensure that the iptables proxy functions correctly.

The net.bridge.bridge-nf-call-iptables setting controls whether or not packets traversing the bridge are sent to iptables for processing. After setting net.bridge.bridge-nf-call-iptables to 0, we found that we could access the service successfully.

There is no doubt that the iptables rules led to this problem. However, we still don't know which rule it is, so we'd better dive into netfilter to find the root cause.

Dive into netfilter NAT

With the help of the pwru tool, we confirmed that the NAT happens in the pre-routing chain.

0xffff9e90d1c55200        [<empty>]      skb_ensure_writable   16542012456363 netns=4026531992 mark=0x0 ifindex=10 proto=8 mtu=1500 len=60 172.16.178.178:80->172.16.0.57:10598(tcp)
0xffff9e90d1c55200        [<empty>] inet_proto_csum_replace4   16542012502740 netns=4026531992 mark=0x0 ifindex=10 proto=8 mtu=1500 len=60 172.16.178.178:80->172.16.0.57:38944(tcp)
# stack of the function skb_ensure_writable
0xffff9e90c12a8d00    [ksoftirqd/5]      skb_ensure_writable   16558140061379 netns=4026531992 mark=0x0 ifindex=10 proto=8 mtu=1500 len=60 172.16.178.178:80->172.16.0.57:10598(tcp)
skb_ensure_writable
l4proto_manip_pkt	[nf_nat]
nf_nat_ipv4_manip_pkt	[nf_nat]
nf_nat_manip_pkt	[nf_nat]
nf_nat_ipv4_pre_routing	[nf_nat]
nf_hook_slow
br_nf_pre_routing	[br_netfilter]
br_handle_frame	[bridge]
__netif_receive_skb_core
__netif_receive_skb_one_core
process_backlog
__napi_poll
net_rx_action
__softirqentry_text_start
run_ksoftirqd
smpboot_thread_fn
kthread
ret_from_fork

# stack of the function inet_proto_csum_replace4
0xffff9e90c12a8d00    [ksoftirqd/5] inet_proto_csum_replace4   16558140095491 netns=4026531992 mark=0x0 ifindex=10 proto=8 mtu=1500 len=60 172.16.178.178:80->172.16.0.57:38944(tcp)
inet_proto_csum_replace4
l4proto_manip_pkt	[nf_nat]
nf_nat_ipv4_manip_pkt	[nf_nat]
nf_nat_manip_pkt	[nf_nat]
nf_nat_ipv4_pre_routing	[nf_nat]
nf_hook_slow
br_nf_pre_routing	[br_netfilter]
br_handle_frame	[bridge]
__netif_receive_skb_core
__netif_receive_skb_one_core
process_backlog
__napi_poll
net_rx_action
__softirqentry_text_start
run_ksoftirqd
smpboot_thread_fn
kthread
ret_from_fork

Diving into the Linux kernel, we see that the NAT hook in the pre-routing chain only changes the destination address, that is, DNAT or de-SNAT.

static const struct nf_hook_ops nf_nat_ipv4_ops[] = {
	{
		.hook		= ipt_do_table,
		.pf		= NFPROTO_IPV4,
		.hooknum	= NF_INET_PRE_ROUTING,
		.priority	= NF_IP_PRI_NAT_DST,
	},
	...
}

image.png

In this problem, it is the de-SNAT that changes the port of the SYN/ACK packet. kube-proxy inserts the SNAT iptables rule in the post-routing chain: packets to the service are marked in the output chain and then SNATed in the post-routing chain.

-A KUBE-SVC-2CMXP7HKUVJN7L6M ! -s 10.42.0.0/16 -d 10.43.220.155/32 -p tcp -m comment --comment "default/nginx cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully

When the packet traverses the bridge, the pre-routing NAT hook does a de-SNAT.

Solution

Because Harvester uses the Canal CNI, which does not depend on the bridge netfilter, we can disable net.bridge.bridge-nf-call-iptables to avoid the unwanted de-SNAT when packets traverse the bridge.
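
A minimal sketch of applying the fix programmatically (equivalently, sysctl -w net.bridge.bridge-nf-call-iptables=0 or an entry in /etc/sysctl.conf):

import "os"

// Sketch: disable bridge-nf-call-iptables so packets traversing the Linux
// bridge are no longer handed to iptables and therefore not de-SNATed.
func disableBridgeNfCallIptables() error {
	return os.WriteFile("/proc/sys/net/bridge/bridge-nf-call-iptables", []byte("0\n"), 0644)
}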

Reference

https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#network-plugin-requirements

Net.bridge.bridge-nf-call and sysctl.conf

Analysis of iptables connection tracking and NAT (IPTABLES的连接跟踪与NAT分析)

Introduction of the Harvester network

Harvester is built on Kubernetes, which uses CNI as the interface between network providers and Kubernetes pod networking. Naturally, we implement the Harvester network based on CNI. Moreover, the Harvester UI integrates the Harvester network to provide a user-friendly way to configure networks for VMs.

As of version 0.2, Harvester supports two kinds of networks:

  • the management network
  • VLAN

Implementation

Management network

Harvester adopts Flannel as the default CNI to implement the management network. It is an internal network, which means the user can only access the VM's management network address from within the cluster nodes or pods.

VLAN

The Harvester network controller leverages the Multus and bridge CNI plugins to implement the VLAN network.

Below is a use case of the VLAN in Harvester.

image-20210412181210146

  • The Harvester network controller uses a bridge for each node and a veth pair for each VM to implement the VLAN. The bridge acts as a switch to forward network traffic from and to the VMs, and the veth pair is like the connected ports between the VMs and the switch.
  • VMs within the same VLAN are able to communicate with each other, while VMs in different VLANs can't.
  • The external switch ports connected with the hosts or other devices (such as the DHCP server) should be set to trunk or hybrid type and permit the specified VLANs.

UI Interaction

  • Enable the VLAN by going to Settings > vlan, enabling it, and entering a valid default physical NIC name for the VLAN. The first physical NIC name of each Harvester node always defaults to eth0. It is recommended to choose a separate NIC for the VLAN, other than the one used for the management network (the one selected during Harvester installation), for better network performance and isolation. (Note: modifying the default VLAN network setting will not change existing configured host networks.)

    image-20210408233858097

  • (Optional) Users can always customize each node's VLAN network configuration by going to the Host > Network tab.

    image-20210408235822223

  • Create a new VLAN network by going to the Advanced > Networks page and clicking the Create button.

    image-20210408232931111

  • Create a VM and add the network configurations.

    • Only the first network card will be enabled by default; the user can choose either the management network or a VLAN network. (Note: you will need to select the Install guest agent option in the Advanced Options tab to get the VLAN network IP address from the Harvester UI.)

      image-20210412175826097

    • Users can choose to add one or multiple network cards; the additional network card configurations can be set via cloud-init network data, e.g.

      version: 1
      config:
        - type: physical
          name: enp1s0 # the name varies depending on the OS image
          subnets:
            - type: dhcp
        - type: physical
          name: enp2s0 
          subnets:
            - type: dhcp
      
@gitlawr commented Dec 28, 2020

  1. I don't understand the architecture diagram. Why would multus-cni get the network setting CRD? multus shouldn't know about the harvester network setting.
  2. Don't use dashes in the CRD kind; apart from network-attachment-definition, I haven't seen it done anywhere else. Use harvester.cattle.io for the API group; harvester.io is not a domain we currently control.
  3. Node.spec and network-config-template.spec are basically the same. How are they used? It would be best to add a sequence diagram starting from user input to explain.
  4. The design of spec.network (especially networkNumbers) looks tied to a specific network implementation; putting it in the node spec seems odd.

@yaocw2020 (Author) commented Dec 28, 2020

4. The design of spec.network (especially networkNumbers) looks tied to a specific network implementation; putting it in the node spec seems odd.

Mainly because what is configured here is the node's network rather than the cluster-wide network. This design also keeps the possibility of configuring different networks on different nodes, for example node one is in VLAN 100 while node two is not. Of course, we don't recommend users do this, as it would cause scheduling problems.

@yaocw2020 (Author)

  • Node.spec and network-config-template.spec are basically the same. How are they used? It would be best to add a sequence diagram starting from user input to explain.

    This was missed; it got lost during the edits. The networkConfigTemplate is bound to the node via a label, which enables batch operations; in real environments, most nodes have the same network configuration.

@gitlawr commented Dec 28, 2020

If the user interaction / UI changes, please also add that to the document.

@futuretea commented Dec 28, 2020

For harvester.cattle.io/v1alpha1/Node, can we add:

  1. status.networkStatus[*].type to indicate the NIC type
  2. status.networkStatus[*].mac to indicate the MAC address

@gitlawr commented Dec 29, 2020

  1. How is the lifecycle of the node CRD managed? Creation and deletion look like they are outside the scope of the network controller.
  2. Why not design a dedicated CRD for network interfaces (e.g. one CRD resource per NIC, associated with the node via labels) instead of putting them in the node?
  3. How does the netlink module handle devices other than the physical NIC and harvester-br0, such as user-configured bridges or bonds?
  4. If multiple network types can be supported, is it reasonable to let the user choose at install time? Could the management network NIC be used as the VLAN bridge physical NIC by default (so no extra install configuration is needed), with the user able to change the NIC per node in the UI if necessary?
  5. How is the NIC selected at install time passed to the network controller in the implementation?

@gitlawr commented Dec 29, 2020

  1. What changes are there to the dependencies and prerequisites required on the host in the new design?

@yaocw2020 (Author) commented Dec 29, 2020

How is the lifecycle of the node CRD managed? Creation and deletion look like they are outside the scope of the network controller.

We need to add an extra, simple node controller that watches the original nodes and creates and deletes the harvester nodes.

Why not design a dedicated CRD for network interfaces (e.g. one CRD resource per NIC, associated with the node via labels) instead of putting them in the node?

For the current requirements there is in fact no essential difference between the two. Designing a harvester node keeps room to extend it for other features in the future.

How does the netlink module handle devices other than the physical NIC and harvester-br0, such as user-configured bridges or bonds?

They are filtered by name; events generated by user-configured bridges and bonds are simply ignored.

If multiple network types can be supported, is it reasonable to let the user choose at install time? Could the management network NIC be used as the VLAN bridge physical NIC by default, with the user able to change the NIC per node in the UI if necessary?

We encourage users to use multiple NICs and to use different NICs for the management network and data networks such as VLANs, so that the management network and data networks are isolated. In general, if the management network goes down, the data network should not be affected. If the management network NIC were used as the VLAN bridge physical NIC by default, users might unconsciously end up using the same NIC for everything.

How is the NIC selected at install time passed to the network controller in the implementation?

The harvester installer generates the harvester node YAML file and applies it to create the CR.

What changes are there to the dependencies and prerequisites required on the host in the new design?

  1. We may need to work around the ebtables bug; it is not yet certain whether configuring ebtables rules with Go will hit the same problem.
  2. The system must support netlink. We only use the most basic netlink features, which have been available since very early Linux versions (2.6), so common kernel versions all support it.

@futuretea

How is networkConfigTemplate used?

@futuretea commented Dec 30, 2020

status.networkStatus[*].conditions should be a list. Are there any conditions other than Ready? What are the conditions and timing for transitions between states?

@futuretea

What does networkNumbers mean?

@yaocw2020 (Author) commented Dec 30, 2020

How is networkConfigTemplate used?

It has been removed.

status.networkStatus[*].conditions should be a list. Are there any conditions other than Ready? What are the conditions and timing for transitions between states?

No other conditions for now.

What does networkNumbers mean?

The VLAN IDs, or the VNIs in the VXLAN case.

@gitlawr commented Dec 30, 2020

Isn't the VLAN ID/VNI associated with a specific interface? Here it looks like a node attribute.

@futuretea commented Jan 7, 2021

KIND:     Host
VERSION:  harvester.cattle.io/v1alpha1

DESCRIPTION:
     <empty>

FIELDS:
   apiVersion   <string>
   kind <string>
   metadata     <Object>
      annotations       <map[string]string>
      clusterName       <string>
      creationTimestamp <string>
      deletionGracePeriodSeconds        <integer>
      deletionTimestamp <string>
      finalizers        <[]string>
      generateName      <string>
      generation        <integer>
      labels    <map[string]string>
      managedFields     <[]Object>
         apiVersion     <string>
         fieldsType     <string>
         fieldsV1       <map[string]>
         manager        <string>
         operation      <string>
         time   <string>
      name      <string>
      namespace <string>
      ownerReferences   <[]Object>
         apiVersion     <string>
         blockOwnerDeletion     <boolean>
         controller     <boolean>
         kind   <string>
         name   <string>
         uid    <string>
      resourceVersion   <string>
      selfLink  <string>
      uid       <string>
   spec <Object>
      description       <string>
      network   <Object>
         nic    <string>
         type   <string>
   status       <Object>
      conditions        <[]Object>
         lastTransitionTime     <string>
         lastUpdateTime <string>
         message        <string>
         reason <string>
         status <string>
         type   <string>
      networkIDs        <[]integer>
      networkStatus     <map[string]Object>
         conditions     <[]Object>
            lastTransitionTime  <string>
            lastUpdateTime      <string>
            message     <string>
            reason      <string>
            status      <string>
            type        <string>
         index  <integer>
         ipv4Address    <string>
         mac    <string>
         master <string>
         promiscuous    <boolean>
         routes <[]string>
         state  <string>
         type   <string>

@yaocw2020 (Author)

/dev/net/tun
