
@anandarajm
Last active August 11, 2020 23:20
Dinesh Dutt book

2 types of Clos

  1. Virtual chassis - Uniform latency
  2. Pod based - 2 types of latency (within a pod vs. between pods)

Virtual chassis are better suited for homogeneous applications (for example, FB). Pod-based Clos are better suited for hyperscale cloud service providers (CSPs).

For a two-tier Clos built from N-port switches:

Number of servers possible in the Clos = N^2/2

Number of switches needed in the Clos = N + N/2
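The Clos formulas above can be checked with a small calculation. This is a minimal sketch that assumes a two-tier (leaf-spine) Clos built from identical N-port switches, with each leaf splitting its ports evenly between servers and spines; the function name and the returned keys are illustrative only.

```python
def two_tier_clos_capacity(ports_per_switch: int) -> dict:
    """Size a two-tier Clos built from identical N-port switches."""
    n = ports_per_switch
    if n <= 0 or n % 2:
        raise ValueError("port count must be a positive even number")
    spines = n // 2              # each leaf uses half of its ports as uplinks
    leaves = n                   # each spine port connects to one leaf
    servers = leaves * (n // 2)  # the other half of each leaf faces servers
    return {
        "leaves": leaves,
        "spines": spines,
        "switches": leaves + spines,  # N + N/2
        "servers": servers,           # N^2 / 2
    }


if __name__ == "__main__":
    print(two_tier_clos_capacity(64))
```

With 64-port switches this gives 2048 servers on 96 switches, matching N^2/2 and N + N/2.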

Cabling verification can be done with the open source Prescriptive Topology Manager (PTM).

Network virtualization

Drawbacks

  • Hashing is done using the outer header. The source port of the outer header is filled with a hash of the inner header flow. LISP/VXLAN follow this.
  • Additional processing at the NIC level
  • Increased MTU
  • Lack of visibility - traceroute
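The outer source port behaviour noted above can be sketched as follows. This is only an illustration: the inner flow is reduced to a 5-tuple, CRC32 stands in for whatever hash a real VTEP uses, and the result is mapped into the dynamic port range 49152-65535 that RFC 7348 recommends for VXLAN. The function name and parameters are hypothetical.

```python
import zlib

PORT_MIN, PORT_MAX = 49152, 65535  # dynamic/private range recommended for VXLAN


def outer_source_port(src_ip: str, dst_ip: str, proto: int,
                      src_port: int, dst_port: int) -> int:
    """Derive the outer UDP source port from the inner flow's 5-tuple."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    flow_hash = zlib.crc32(key)  # stand-in hash, not what any specific VTEP uses
    return PORT_MIN + flow_hash % (PORT_MAX - PORT_MIN + 1)


if __name__ == "__main__":
    # Packets of the same inner flow always map to the same outer source port,
    # so ECMP hashing on the outer header keeps the flow on one path.
    print(outer_source_port("10.1.1.10", "10.1.2.20", 6, 40000, 443))
```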

Namespaces

A process running in the kernel is identified by different types of namespaces. In the case of containers, these namespaces are virtualized.

Types of namespaces:

  • cgroups: /proc/pid/cgroup, /proc/pid/mountinfo, etc. are virtualized so that the process's view is abstracted.
  • Network: Creates the network interfaces and provides the ability to connect to the outside world (interfaces, sockets, routing table, MAC table, etc.).
  • PID: Makes a process within the container think it is running on the kernel like a regular process. By default, a PID 1 is created within the container.
  • User: Controls which users can access or execute processes within the container. A container can have its own root.
  • IPC: Enables communication between processes within the container through standard mechanisms such as POSIX message queues, shared memory and semaphores.
  • Mount: Virtualizes the filesystem mounts, typically enabled using chroot.
  • UTS: Virtualizes the hostname and domain name for the container.

Docker creates interfaces within the network namespace (netns) using virtual Ethernet interfaces (veth). veths are created in pairs, with each end of the pair in a different namespace, to enable communication between containers or with the outside world. For example, communication between 2 namespaces within a host:

NS1 (veth1) ---------- (veth2) NS2
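The NS1/NS2 example above can be reproduced by hand with iproute2. The sketch below only builds the command strings (it does not run them, since that needs root on a Linux host); the namespace and interface names are the illustrative ones from the diagram, not anything Docker itself would pick.

```python
def veth_pair_commands(ns1: str = "ns1", ns2: str = "ns2",
                       if1: str = "veth1", if2: str = "veth2") -> list:
    """Return the iproute2 commands that wire two namespaces with a veth pair."""
    return [
        f"ip netns add {ns1}",
        f"ip netns add {ns2}",
        # A veth is always created as a pair of connected interfaces.
        f"ip link add {if1} type veth peer name {if2}",
        # Move each end of the pair into its own namespace.
        f"ip link set {if1} netns {ns1}",
        f"ip link set {if2} netns {ns2}",
        f"ip netns exec {ns1} ip link set {if1} up",
        f"ip netns exec {ns2} ip link set {if2} up",
    ]


if __name__ == "__main__":
    for cmd in veth_pair_commands():
        print(cmd)
```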

veth modes:

  • No network
  • Host network
  • Single-host network
  • Multihost network

Single host network

Docker creates a bridge called docker0. In single-host network mode, every container on the host gets one veth pair, with one end connected to docker0 and the other end inside the container's network namespace.

Bridge Figure 2

Docker uses a default subnet of 172.17.0.0/16 for the docker0 bridge and assigns 172.17.0.1 to the bridge itself. Any communication from docker0 to the outside world undergoes NAT using iptables.
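The addressing above can be verified with Python's standard ipaddress module. This is a small sketch; the container address 172.17.0.2 is only an example of an address that falls inside the bridge subnet.

```python
import ipaddress

docker0_net = ipaddress.ip_network("172.17.0.0/16")

# The first usable host address is the one assigned to the bridge itself.
bridge_ip = next(docker0_net.hosts())

# Addresses left for containers: everything except network, broadcast and bridge.
container_addresses = docker0_net.num_addresses - 3

# Example container address (illustrative) and whether it sits behind docker0.
container_ip = ipaddress.ip_address("172.17.0.2")
behind_bridge = container_ip in docker0_net

if __name__ == "__main__":
    print(bridge_ip, container_addresses, behind_bridge)
```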

MacVlan Figure 3

Another mode that gives single-host containers connectivity to the outside world is MacVlan. In this mode, each container is assigned a virtual MAC with OUI 02:42:ac and gets an IP address using DHCP, like a regular Ethernet interface. However, communication between a macvlan interface and the host requires hairpinning of traffic. With Docker, communication between macvlan interfaces can happen without hairpinning. There is no NAT in this model.

Multihost container networking

L2 or L3 communication between containers across hosts has implications for IPAM across hosts.

Overlay network

This creates 2 bridges in the host, one of which is used for VTEP communication across hosts. It gives the impression that the containers reside within the same L2 network. The new bridge, in addition to docker0, is called docker_gwbridge.

L3 Routing

Disable NAT and run a routing protocol instance using FRR (OSPF/BGP) to advertise the Docker subnet across hosts. Calico follows this model.

BGP Message types

  • Open
  • Update
  • Keepalive
  • Notification
  • Route Refresh
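On the wire, each of the message types above is identified by a one-byte type code in the fixed 19-byte BGP header (16-byte marker, 2-byte length, 1-byte type). The sketch below uses the type codes from RFC 4271 and RFC 2918; the class and function names are illustrative.

```python
import struct
from enum import IntEnum


class BgpMessageType(IntEnum):
    OPEN = 1
    UPDATE = 2
    NOTIFICATION = 3
    KEEPALIVE = 4
    ROUTE_REFRESH = 5


# Marker (16 bytes, all ones), total length (2 bytes), type (1 byte).
BGP_HEADER = struct.Struct("!16sHB")


def parse_bgp_header(data: bytes):
    """Return (length, message type) from the start of a BGP message."""
    marker, length, msg_type = BGP_HEADER.unpack_from(data)
    if marker != b"\xff" * 16:
        raise ValueError("invalid BGP marker")
    return length, BgpMessageType(msg_type)


if __name__ == "__main__":
    # A KEEPALIVE is just the header: 19 bytes, type 4.
    keepalive = b"\xff" * 16 + struct.pack("!HB", 19, 4)
    print(parse_bgp_header(keepalive))
```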

BGP Timers

Tweaks are needed to adopt BGP in DC compared to how BGP is traditionally used in ISPs

  • Advertisement interval --> DCs prefer this to be 0 seconds instead of the default value of 30
  • Keepalive & hold timers --> Can be reduced to 1 & 3 seconds instead of 60 & 180. BFD can also be enabled.
  • Connect timer --> Can be reduced to 10 seconds from 60 seconds
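As a rough illustration of the tweaks above, the sketch below renders per-neighbor timer lines in FRR's configuration syntax. The peer name swp1 and the helper itself are hypothetical, the lines are meant to sit under a `router bgp` stanza, and the exact syntax should be checked against the FRR version in use.

```python
def dc_bgp_timer_lines(peer: str, advertisement_interval: int = 0,
                       keepalive: int = 1, hold: int = 3,
                       connect: int = 10, bfd: bool = True) -> list:
    """Render FRR-style neighbor lines for data-center BGP timers (seconds)."""
    # Mirrors the 1:3 ratio used in the notes (1 & 3, 60 & 180).
    if hold < 3 * keepalive:
        raise ValueError("hold timer should be at least 3x the keepalive timer")
    lines = [
        f"neighbor {peer} advertisement-interval {advertisement_interval}",
        f"neighbor {peer} timers {keepalive} {hold}",
        f"neighbor {peer} timers connect {connect}",
    ]
    if bfd:
        lines.append(f"neighbor {peer} bfd")
    return lines


if __name__ == "__main__":
    for line in dc_bgp_timer_lines("swp1"):
        print(line)
```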

Unnumbered interfaces

Unnumbered operation on physical interfaces is achieved through the following steps.

  1. Use the IPv6 link-local address (LLA) on an interface as its IP address. FRR sends and expects the BGP connection over this LLA.
  2. IPv6 router advertisements (RA) provide neighbor discovery.
  3. Use the RFC 5549 capability in BGP, i.e. advertise IPv4 NLRI over an IPv6 BGP session with an IPv6 next hop. This capability is called "Extended next hop".
  4. With this capability, the MAC address of the neighbor is learned automatically from the RA message, and packets can be forwarded using just the IPv4 address space.
  5. Show commands replace the IPv6 next hop with an IPv4 next hop from the 169.254.0.0/16 subnet, backed by a static ARP entry.

BGP constructs to support Virtual Network routes

Route Distinguishers

An RD is an eight-byte value that is added to every virtual network address to keep the address globally unique. There are 3 different types of RD. The one used in EVPN is 64 bits long, in the format below. Though a VNI is usually 3 bytes long, only 2 bytes are used for it here: it is assumed that no deployment has more than about 64,000 VNIs and that no silicon supports that many.

Type (2bytes) | Device Loopback (4 bytes) | VNI ID (2 bytes)

Because the device loopback IP is part of the RD, no 2 devices in a virtual network are expected to have the same RD. The RD is encoded as part of the NLRI in the MP_REACH_NLRI and MP_UNREACH_NLRI attributes.
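The RD layout above can be sketched with struct packing. This assumes the IP-address RD type (type value 1), a loopback given as an IPv4 string, and a VNI that fits in 2 bytes as the notes describe; the function names are illustrative.

```python
import ipaddress
import struct

RD_TYPE_IPV4 = 1  # RD type whose administrator field is a 4-byte IPv4 address


def build_rd(loopback: str, vni: int) -> bytes:
    """Pack Type (2 bytes) | Device loopback (4 bytes) | VNI (2 bytes)."""
    if not 0 <= vni <= 0xFFFF:
        raise ValueError("this RD format only has 2 bytes for the VNI")
    loopback_bytes = ipaddress.IPv4Address(loopback).packed
    return struct.pack("!H4sH", RD_TYPE_IPV4, loopback_bytes, vni)


def format_rd(rd: bytes) -> str:
    """Render an 8-byte RD in the usual loopback:number notation."""
    _rd_type, loopback_bytes, vni = struct.unpack("!H4sH", rd)
    return f"{ipaddress.IPv4Address(loopback_bytes)}:{vni}"


if __name__ == "__main__":
    rd = build_rd("10.0.0.1", 100)
    print(rd.hex(), format_rd(rd))
```

Two devices with different loopbacks produce different RDs for the same VNI, which is what keeps the advertised addresses unique.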

Route Target

RT encodes the virtual network the prefix belongs to. The advertising router attaches a specific RT, called the 'export RT'. A BGP speaker receiving the advertisement uses this RT to decide which local vnet to add the routes to; the RT it matches on is called the 'import RT'.

The format of an RT is as follows:

ASN (2 bytes) | A (bit) | Type (3 bit) | Domain ID (4 bit) | Service ID (3 byte)

  • A - Auto or manually derived
  • Type - Vlan (0) or Vxlan (1)
  • Domain ID - Typically 0, used to resolve conflicts in case of any overlap in Vxlan ID in the administrative domain.
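The RT layout above can be sketched as bit packing. This follows the field widths listed in the notes (2-byte ASN, then a 4-byte value holding A, Type, Domain ID and a 3-byte Service ID) and assumes that A = 0 marks an auto-derived value and that the Service ID carries the VNI; the function name and defaults are illustrative.

```python
import struct


def build_auto_rt(asn: int, service_id: int, auto: bool = True,
                  rt_type: int = 1, domain_id: int = 0) -> bytes:
    """Pack ASN (2B) | A (1b) | Type (3b) | Domain ID (4b) | Service ID (3B)."""
    if not 0 <= asn <= 0xFFFF:
        raise ValueError("ASN must fit in 2 bytes")
    if not 0 <= service_id <= 0xFFFFFF:
        raise ValueError("service ID must fit in 3 bytes")
    if not 0 <= rt_type <= 0x7 or not 0 <= domain_id <= 0xF:
        raise ValueError("type is 3 bits and domain ID is 4 bits")
    a_bit = 0 if auto else 1  # assumption: 0 = auto-derived, 1 = manual
    local_admin = (a_bit << 31) | (rt_type << 28) | (domain_id << 24) | service_id
    return struct.pack("!HI", asn, local_admin)


if __name__ == "__main__":
    # ASN 65001, Vxlan type (1), domain 0, VNI 10100.
    print(build_auto_rt(65001, 10100).hex())
```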

FRR supports auto derivation of RT via 'route-target import auto'

EVPN route types

Typically, non-IPv4 routes are advertised via the MP_REACH_NLRI and MP_UNREACH_NLRI attributes. For most AFI/SAFI combinations, the structure and contents carried in the UPDATE message are the same across the AFI/SAFI. This is not the case with EVPN, which needs to advertise MACs, IP prefixes, unicast or multicast prefixes, etc. The EVPN NLRI therefore consists of different route types:

RT1 - Ethernet segment auto discovery -- Supports multihomed endpoints. (MLAG alternative)

RT4 - Designated forwarder -- Ensures only a single VTEP forwards BUM traffic to multihomed endpoints

RT2 - MAC, VNI, IP -- Adv. reachability to a specific MAC address in a vnet & its IP address

RT3 - VNI/VTEP association -- Adv. VTEP's interest in virtual networks

RT5 - IP prefix, VRF -- Adv. IP prefixes and VRF associated with the prefix.

RT6 - Mcast group membership -- Contains information about the mcast groups a VTEP is interested in.
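The route types above map naturally onto an enumeration keyed by the route type number carried in the EVPN NLRI. In this sketch the member names are illustrative labels, and the descriptions are taken from the notes.

```python
from enum import IntEnum


class EvpnRouteType(IntEnum):
    ETHERNET_AUTO_DISCOVERY = 1
    MAC_IP_ADVERTISEMENT = 2
    VNI_VTEP_ASSOCIATION = 3
    ETHERNET_SEGMENT = 4
    IP_PREFIX = 5
    MCAST_GROUP_MEMBERSHIP = 6


PURPOSE = {
    EvpnRouteType.ETHERNET_AUTO_DISCOVERY: "supports multihomed endpoints",
    EvpnRouteType.MAC_IP_ADVERTISEMENT: "reachability to a MAC in a vnet and its IP",
    EvpnRouteType.VNI_VTEP_ASSOCIATION: "VTEP's interest in virtual networks",
    EvpnRouteType.ETHERNET_SEGMENT: "designated forwarder for BUM traffic",
    EvpnRouteType.IP_PREFIX: "IP prefixes and their VRF",
    EvpnRouteType.MCAST_GROUP_MEMBERSHIP: "mcast groups a VTEP is interested in",
}


def describe(route_type: int) -> str:
    """Return a one-line summary such as 'RT5: IP prefixes and their VRF'."""
    rt = EvpnRouteType(route_type)
    return f"RT{rt.value}: {PURPOSE[rt]}"


if __name__ == "__main__":
    for rt in EvpnRouteType:
        print(describe(rt))
```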

Figures 1 to 5 were posted by the author as image attachments in the gist comments (Figure 1: vxlan; Figures 2 to 5: screen captures).
