Network topologies

Are network topologies broken?

I hate hierarchies. Everywhere. Most networks are organized like this:

  • Assume 6 simple subnets
  • Every subnet is connected to a "core switch"

It will look roughly like this:

          ----------        ----------
         | Network1 |      | Network2 |
          ----------        ----------
                     \    /
 ----------           \  /          ----------
| Network3 | -------  Core ------- | Network4 |
 ----------           /  \          ----------
                     /    \
          ----------        ----------
         | Network5 |      | Network6 |
          ----------        ----------

Let's try to get some numbers out of that. The design has 2 possible bottlenecks:

  • The core switch's switching capacity
  • The lines to a subnet

The former is a huge problem, as money can only buy you so much switching capacity. The latter is a problem if one subnet attracts far more traffic than the others, e.g. because it hosts a busy fileserver.

For the purposes of this file we'll treat single-port lines and bundled lines alike.

Let's see what a system without a classic topology looks like. We'll borrow the approach of Kademlia routing and apply it to this network:

  • We'll arrange all networks as a ring (each network has 2 neighbors, 1 hop away)
  • We'll then connect each network to every network that is 2 hops away
  • We'll connect each network to every network that is 4 hops away
  • [..] and so on until we can't find a network that is far enough away (a code sketch of this rule follows the diagram)
          ---------------------------------------------
         |                                             |
       Net1 --- Net2 --- Net3 --- Net4 --- Net5 --- Net6
       | |      |  |     |  |     |  |     |  |     |  |
       | `------|--|-----'  `-----|--|-----'  |     |  |
       |        |  |              |  |        |     |  |
       |        |  `--------------'  `--------|-----'  |
       `--------|-----------------------------'        |
                `--------------------------------------'
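
To make the construction concrete, here's a minimal sketch of the link-generation rule (my own illustration; `topology` is a made-up helper, not part of any routing software):

```python
# Build the ring-plus-chords link set: every network links to the
# networks 1, 2, 4, ... hops away, in both ring directions.
def topology(n):
    links = set()
    for net in range(n):
        d = 1
        while d <= n // 2:          # stop once no network is far enough away
            links.add(frozenset({net, (net + d) % n}))
            links.add(frozenset({net, (net - d) % n}))
            d *= 2
    return links

links = topology(6)
print(sum(0 in link for link in links))  # 4 connections for Net1, as above
```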

Every network ends up with 4 connections, or 2 * floor(log(N) / log(2)) in the case of N networks. This means that doubling the number of networks adds only 2 more connections per subnet, a property that scales far better than the core switch approach, where the core's port count must grow linearly with the number of connected networks.
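
Assuming the sketch above, the growth claim checks out: the degree of a network increases by 2 each time the number of networks doubles.

```python
# Degree of network 0 as the ring doubles in size.
for n in (6, 12, 24, 48):
    degree = sum(0 in link for link in topology(n))
    print(n, degree)  # prints 6 4, 12 6, 24 8, 48 10
```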

What about the bottlenecks? What if one subnet is highly popular? We have to compute the shortest-paths matrix, with full paths; OSPF would usually do that in software. Let's say every network wants to talk to Network1.

  • Network1 is local; this shouldn't be a problem
  • Network2 is directly connected.
  • Network3 is directly connected.
  • Network4 is 2 hops away and can route via (n2,n1), (n3,n1), (n5,n1) or (n6,n1)
  • Network5 is directly connected.
  • Network6 is directly connected.

This is a bit unfortunate, as Network4 will suffer slight starvation without QoS. On the other hand, the traffic from all networks is balanced across the links. The average distance is 1.2 hops, and this factor will only grow logarithmically.
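
The 1.2 figure is easy to verify with a breadth-first search over the sketched topology (reusing the assumed `topology` helper from above):

```python
from collections import deque

# Hop distances from Net1 (node 0) to every other network.
def distances(links, n, source=0):
    neighbors = {net: set() for net in range(n)}
    for link in links:
        a, b = tuple(link)
        neighbors[a].add(b)
        neighbors[b].add(a)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        for nxt in neighbors[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist

dist = distances(topology(6), 6)
hops = [d for net, d in dist.items() if net != 0]
print(sum(hops) / len(hops))  # 1.2: four direct neighbors, one 2-hop network
```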

So as long as the 4 links combined are 1.2 times as powerful as the old link to the core, we're fine. Nice, isn't it?

Anyway, the former worst case is no longer a worst case. We reach maximum speed once we assign 50% of the ports as network interchange ports, simply because the local servers can't saturate more ports than that! The new worst case is maximum-hop communication. For our 6-network case that's 2 hops. This means that if every server talks to a maximum-distance server, every flow occupies 2 interchange links, so we end up with half of the total provided network bandwidth.

Now let's push this a bit. Let's say we'd take an Arista* 7050T-64, which supports up to 64 10 Gbit ports.

(*) I'm using Arista because I can easily look up the switches and specs. Fewer products + a well-organized site == fast lookup. I'd be happy to receive other examples.

Say we'd want to build a network of 6'000 servers. That's fewer than 200 32-port blocks. 200 networks, with 32 ports each, means 64 Tbps of network capacity. Divided by ~4, the network would be capable of 16 Tbps worst-case performance, or ~32 Tbps in the average case. Way better than even an awesome core switch.
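
A quick back-of-the-envelope check of those numbers (all parameters assumed as above: 64 ports per switch, half of them server-facing, 10 Gbps each, a worst-case factor of ~4):

```python
servers = 6000
local_ports = 32                            # server-facing ports per block
blocks = -(-servers // local_ports)         # ceiling division: 188 < 200
raw_tbps = 200 * local_ports * 10 / 1000    # 64 Tbps of server-facing capacity
print(blocks, raw_tbps / 4, raw_tbps / 2)   # 188, 16.0 worst, 32.0 average
```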

Great, right? Well, we can do better. Let's say we'd ship containers of 1000 servers, like Google does. How about using classic core switches as entry points?

The Arista 7500 offers 384 10 Gbit ports. We would again assign 192 ports to local nodes and 192 ports to the network. We would need 6 such machines to connect 1152 servers. Roughly the density of a Google container (is this a coincidence?).

Every "container" accounts for ~6 subnets. Taking again a maximum of a 4.0 degeneration in network speed we'd come up with up to 35 containers.

Or 35 * 6 * 192 = 40'320 servers. Communicating at near-wire speed. All the time. We are talking about a network speed of more than 100 Tbit.
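
Checking that container math (again with assumed parameters: 192 server-facing ports per switch, 6 switches per container, 35 containers, 10 Gbps per port, the same worst-case factor of ~4):

```python
local_ports = 192                 # server-facing ports per Arista 7500
switches_per_container = 6
containers = 35
servers = containers * switches_per_container * local_ports
worst_case_tbps = servers * 10 / 4 / 1000
print(servers, worst_case_tbps)   # 40320 servers, ~100.8 Tbps worst case
```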

Given economics / port costs I'd bet that Google and others use a structure like this. There is no way to beat this with a classic topology.

It's nonsense to even try beating this with smaller networks. Anything with 1'000 servers or more should evaluate this flat structure.

Why am I writing this?

We are currently doubling our servers every year. We've filled the first subnetworks and racks. I'd bet this kind of stuff will be relevant for us within a year. Plus, I oppose classic hierarchies. And I recently read "The Datacenter as a Computer", which outlines a classic hierarchy. I read it and thought "no way, this is in no way optimal!". I thus had to articulate my anti-hierarchical setup.

What's the consequence?

It looks like it's possible to scale a network that feels flat. There is no big speed penalty in talking to a random server, yet all subnets stay small and redundancy is extremely high.

This means that, given a virtual overlay, we'll be able to scale networks to the size of a warehouse while keeping them virtually flat, e.g. with fake IPs / subnets that are translated on the switch backplane via OpenFlow.
