Network topologies

Are network topologies broken?

I hate hierarchies. Everywhere. Most networks are organized like this:

  • Assume 6 simple subnets
  • Every subnet is connected to a "core switch"

It will look roughly like this:

          ----------        ----------
         | Network1 |      | Network2 |
          ----------        ----------
                     \    /
 ----------           \  /          ----------
| Network3 | -------  Core ------- | Network4 |
 ----------           /  \          ----------
                     /    \
          ----------        ----------
         | Network5 |      | Network6 |
          ----------        ----------

Let's try to get some numbers out of that. The design has 2 possible bottlenecks:

  • The core switch's switching capacity
  • The lines to a subnet

The former is a huge problem, as money can only buy you so much switching capacity. The latter is a problem if one subnet attracts far more traffic than the others, e.g. because it hosts a busy fileserver.

For the purposes of this file we'll treat single-port lines and bundled lines alike.

Let's see what a system without a classic topology looks like. We'll borrow the approach of Kademlia routing and apply it to this network:

  • We'll arrange all networks as a ring (each network has 2 neighbors, 1 hop away)
  • We'll then connect each network to every network that is 2 hops away
  • We'll connect each network to every network that is 4 hops away
  • [..] and so on until we can't find a network that is far enough away (a code sketch of this rule follows the diagram)
          ---------------------------------------------
         |                                             |
       Net1 --- Net2 --- Net3 --- Net4 --- Net5 --- Net6
       | |      |  |     |  |     |  |     |  |     |  |
       | `------|--|-----'  `-----|--|-----'  |     |  |
       |        |  |              |  |        |     |  |
       |        |  `--------------'  `--------|-----'  |
       `--------|-----------------------------'        |
                `--------------------------------------'
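
To make the construction concrete, here's a minimal sketch of the link-generation rule (my own illustration; `topology` is a made-up helper, not part of any routing software):

```python
# Build the ring-plus-chords link set: every network links to the
# networks 1, 2, 4, ... hops away, in both ring directions.
def topology(n):
    links = set()
    for net in range(n):
        d = 1
        while d <= n // 2:          # stop once no network is far enough away
            links.add(frozenset({net, (net + d) % n}))
            links.add(frozenset({net, (net - d) % n}))
            d *= 2
    return links

links = topology(6)
print(sum(0 in link for link in links))  # 4 connections for Net1, as above
```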

Every network ends up with 4 connections, or 2 * floor(log(N) / log(2)) in the case of N networks. This means that doubling the number of networks adds only 2 more connections per subnet, a property that scales far better than the core switch approach, where the core's port count must grow linearly with the number of connected networks.
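
Assuming the sketch above, the growth claim checks out: the degree of a network increases by 2 each time the number of networks doubles.

```python
# Degree of network 0 as the ring doubles in size.
for n in (6, 12, 24, 48):
    degree = sum(0 in link for link in topology(n))
    print(n, degree)  # prints 6 4, 12 6, 24 8, 48 10
```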

What about the bottlenecks? What if one subnet is highly popular? We have to compute the shortest-paths matrix, with full paths; OSPF would usually do that in software. Let's say every network wants to talk to Network1.

  • Network1 is local; this shouldn't be a problem
  • Network2 is directly connected.
  • Network3 is directly connected.
  • Network4 is 2 hops away and can route via (n2,n1), (n3,n1), (n5,n1) or (n6,n1)
  • Network5 is directly connected.
  • Network6 is directly connected.

This is a bit unfortunate, as Network4 will suffer slight starvation without QoS. On the other hand, the traffic from all networks is balanced across the links. The average distance is 1.2 hops, and this factor will only grow logarithmically.
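
The 1.2 figure is easy to verify with a breadth-first search over the sketched topology (reusing the assumed `topology` helper from above):

```python
from collections import deque

# Hop distances from Net1 (node 0) to every other network.
def distances(links, n, source=0):
    neighbors = {net: set() for net in range(n)}
    for link in links:
        a, b = tuple(link)
        neighbors[a].add(b)
        neighbors[b].add(a)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        for nxt in neighbors[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist

dist = distances(topology(6), 6)
hops = [d for net, d in dist.items() if net != 0]
print(sum(hops) / len(hops))  # 1.2: four direct neighbors, one 2-hop network
```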

So as long as the 4 links combined are 1.2 times as powerful as the old link to the core, we're fine. Nice, isn't it?

Anyway, the former worst case is no longer a worst case. We reach maximum speed once we assign 50% of the ports as network interchange ports, simply because the local servers can't saturate more ports than that! The new worst case is maximum-hop communication. For our 6-network case that's 2 hops. This means that if every server talks to a maximum-distance server, every flow occupies 2 interchange links, so we end up with half of the total provided network bandwidth.

Now let's push this a bit. Let's say we'd take an Arista* 7050T-64, which supports up to 64 10 Gbit ports.

(*) I'm using Arista because I can easily look up the switches and specs. Fewer products + a well-organized site == fast lookup. I'd be happy to receive other examples.

Say we'd want to build a network of 6'000 servers. That's fewer than 200 32-port blocks. 200 networks, with 32 ports each, means 64 Tbps of network capacity. Divided by ~4, the network would be capable of 16 Tbps worst-case performance, or ~32 Tbps in the average case. Way better than even an awesome core switch.
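
A quick back-of-the-envelope check of those numbers (all parameters assumed as above: 64 ports per switch, half of them server-facing, 10 Gbps each, a worst-case factor of ~4):

```python
servers = 6000
local_ports = 32                            # server-facing ports per block
blocks = -(-servers // local_ports)         # ceiling division: 188 < 200
raw_tbps = 200 * local_ports * 10 / 1000    # 64 Tbps of server-facing capacity
print(blocks, raw_tbps / 4, raw_tbps / 2)   # 188, 16.0 worst, 32.0 average
```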

Great, right? Well, we can do better. Let's say we'd ship containers of 1000 servers, like Google does. How about using classic core switches as entry points?

The Arista 7500 offers 384 10 Gbit ports. We would again assign 192 ports to local nodes and 192 ports to the network. We would need 6 such machines to connect 1152 servers. Roughly the density of a Google container (is this a coincidence?).

Every "container" accounts for ~6 subnets. Taking again a maximum of a 4.0 degeneration in network speed we'd come up with up to 35 containers.

Or 35 * 6 * 192 = 40'320 servers. Communicating at near-wire speed. All the time. We are talking about a network speed of more than 100 Tbit.
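
Checking that container math (again with assumed parameters: 192 server-facing ports per switch, 6 switches per container, 35 containers, 10 Gbps per port, the same worst-case factor of ~4):

```python
local_ports = 192                 # server-facing ports per Arista 7500
switches_per_container = 6
containers = 35
servers = containers * switches_per_container * local_ports
worst_case_tbps = servers * 10 / 4 / 1000
print(servers, worst_case_tbps)   # 40320 servers, ~100.8 Tbps worst case
```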

Given economics / port costs I'd bet that Google and others use a structure like this. There is no way to beat this with a classic topology.

It's nonsense to even try beating this with smaller networks. Anything with 1'000 servers or more should evaluate this flat structure.

Why am I writing this?

We are currently doubling our servers every year. We've filled the first subnetworks and racks. I'd bet this kind of stuff will be relevant for us within a year. Plus, I oppose classic hierarchies. And I recently read "The Datacenter as a Computer", which outlines a classic hierarchy. I read it and thought "no way, this is in no way optimal!". I thus had to articulate my anti-hierarchical setup.

What's the consequence?

It looks like it's possible to scale a network that feels flat. There is no big speed penalty in talking to a random server, yet all subnets stay small and redundancy is extremely high.

This means that, given a virtual overlay, we'll be able to scale networks to the size of a warehouse while keeping them virtually flat, e.g. with fake IPs / subnets that are translated on the switch backplane via OpenFlow.
