Cjdns Packet Prioritization System

The packet prioritization system is designed to allow access to a link to be exercised by its owner, thus allowing bandwidth to be sold and used.

  • A packet SHOULD NOT be dropped unless there is actually not enough bandwidth to send everything.
  • A bandwidth lease MUST be able to send at least its bandwidth limit.
  • Unused bandwidth SHOULD be divided fairly among the leases which are currently sending traffic, in proportion to the relative sizes of the leases.

What follows is a description of how bandwidth leases are used to decide which packet has priority when there is contention and some packet must be dropped.

Background

Every network switch is a series of incoming paths linked to a series of outgoing paths. Data entering the switch follows the incoming path for the interface it arrived on and, based on the address of the packet, is sent down one of the outgoing paths. There is always the potential problem that data arrives on all incoming paths and is all destined for the same outgoing path; unless the bandwidth of the outgoing path is at least the sum of the bandwidth of all incoming paths, the switch must decide which packets will be dropped. In our design, a multiplexer merges data coming from the different incoming paths into one outgoing path that is as fast as the sum of the bandwidth of all incoming paths, but this fast path exists only on-chip and the priority rules are applied in order to make everything fit on the final outgoing link. See this blog series for more information about how network switches work.

Outgoing path

The outgoing path of the switch begins with a multiplexer (mux) which converges traffic coming from multiple interfaces down to a single high speed outgoing path (denoted by the === path). The high speed path is fast enough that it can carry as much traffic as all incoming interfaces can send at the same time. The objective of drop1, drop2 and ringbuf is to limit that traffic to what the actual network link can handle.

incomingA---+++++++
            |     |
incomingB---+     +
            | mux |===< prioritizer >===<drop1>[ ringbuf ]<drop2>--link-->
incomingC---+     +
            |     |
incomingD---+++++++
  • incoming[A-D]: Incoming traffic is tagged with the bandwidth lease on which that traffic should be accounted; see Virtual Switches to understand how that tag is decided.
  • mux: This is a multiplexer which takes traffic from n incoming links and merges it into one high speed outgoing path; the high speed path is fast enough to accommodate as much traffic as all incoming links can send at the same time.
  • prioritizer: The prioritizer reads the bandwidth lease tag and converts it into a priority tag. Supposing you have a switch supporting 8 leases per link, the prioritizer will contain 8 bandwidth meters (implemented as IIR functions). In order to allow network tuning, the alpha for the IIR functions, henceforth prioritizer_alpha, will be configured by the switch owner (or their supernode) and will be public information. When the amount of bandwidth used on a bandwidth lease is exactly the amount of the lease, the prioritizer will tag packets on that lease with priority 1. If the amount of bandwidth used is more, the priority will be fractional, and if it is less, the priority will be greater than 1.
  • drop1 and drop2: Ideally, packets would be placed in a priority queue and the lowest priority packet would be replaced when the queue is full and more traffic comes than can be handled. However, this would require continuously ordering packets by priority, a significant processing overhead. Rather than a queue, we will use a ring buffer, which is the norm in network device queues. A ring buffer can be thought of like a simple queue: you can add to the tail and remove from the head but you cannot alter the middle. Therefore we have one packet dropper in each place where the queue can be easily manipulated, the head and the tail. The packet droppers simply discard any packet with priority lower than min_priority, where min_priority rises as the ring buffer fills. If the ring buffer remains less than or equal to half full, min_priority should remain at zero. The size of the ring buffer will be determined by the hardware maker who designs the switch chip; the mean and the highest value of min_priority in a given period should be published, as well as drop counts for each bandwidth lease. A sketch of the meter and dropper behavior follows this list.
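
The following is a minimal sketch, in C, of how a prioritizer meter and the droppers could behave. The structure and function names (LeaseMeter, prioritizer_tag, should_drop) and the min_priority ramp above half full are illustrative assumptions, not part of cjdns or any particular switch hardware.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative sketch only: one IIR bandwidth meter per lease.
     * rateEstimate is an exponentially weighted moving average of the
     * bytes per second seen on the lease; alpha is prioritizer_alpha. */
    typedef struct {
        double rateEstimate;  /* smoothed bytes per second          */
        double leaseRate;     /* bytes per second sold on the lease */
        double alpha;         /* prioritizer_alpha, 0 < alpha <= 1  */
    } LeaseMeter;

    /* Feed one packet into the meter and return the prioritizer's tag:
     * 1 when usage equals the lease, greater than 1 when under-using,
     * fractional when over-using, as described above. */
    double prioritizer_tag(LeaseMeter* m, uint32_t packetBytes, double dtSeconds)
    {
        double instantRate = (double)packetBytes / dtSeconds;
        m->rateEstimate = m->alpha * instantRate + (1.0 - m->alpha) * m->rateEstimate;
        if (m->rateEstimate <= 0.0) { return 65535.0; /* idle lease: maximum priority */ }
        return m->leaseRate / m->rateEstimate;
    }

    /* drop1/drop2: discard any packet whose priority is below min_priority.
     * min_priority stays at zero while the ring buffer is at most half full;
     * the linear ramp beyond half full is an assumed example policy. */
    bool should_drop(double packetPriority, uint32_t ringFill, uint32_t ringSize)
    {
        double minPriority = 0.0;
        if (ringFill * 2 > ringSize) {
            minPriority = 2.0 * (double)(ringFill * 2 - ringSize) / (double)ringSize;
        }
        return packetPriority < minPriority;
    }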

Packet Priority

The actual packet priority used by the droppers is a 32-bit value made up of two 16-bit elements: the most significant 16 bits are the priority as defined by the prioritizer, and the least significant 16 bits are the packet priority as defined by a priority field in the packet header. The priority field allows bandwidth leaseholders to control how their "virtual networks" are used.

real_priority = priority_from_prioritizer * 2**16 +
    priority_from_packet_header
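
Expressed as a small C helper, assuming both inputs have already been clamped to 16 bits (the encoding of fractional or very large prioritizer values is not specified above):

    #include <stdint.h>

    /* Combine the prioritizer's output with the packet header priority field.
     * The prioritizer value occupies the high bits so that it always
     * dominates the comparison between two packets. */
    uint32_t real_priority(uint16_t priority_from_prioritizer,
                           uint16_t priority_from_packet_header)
    {
        return ((uint32_t)priority_from_prioritizer << 16)
             | priority_from_packet_header;
    }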

Incoming path

The incoming path is where packets are received from an external interface.

                        ++++++++++++++---> outgoingA
                        |            |
                        +            +---> outgoingB
--< priority updater >--|cjdns switch|
                        +            +---> outgoingC
                        |            |
                        ++++++++++++++---> outgoingD

The two main components of the incoming path are the cjdns switch and the priority updater. The cjdns switch is already well documented elsewhere, so we will focus on the priority updater.

All incoming traffic is tagged with the bandwidth lease with which it is associated; if (for example) the switch supports 8 leases per link, the priority updater will have 8 priority meters. While the bandwidth meters in the outgoing path are intended to protect leaseholders from each other, the priority updater is intended to ensure fairness within the virtual network created by a single bandwidth leaseholder. Also unlike bandwidth meters, priority meters measure kilobits TIMES priority per second (Kbp/s). In order to allow a bandwidth leaseholder to interconnect with other leaseholders and customers, there is a configurable called bandwidth_priority_limit. If, for example, a customer has a 10Gb link to a bandwidth leaseholder but only pays for the equivalent of 1Gb of "network access", the leaseholder can configure bandwidth_priority_limit to 1Gbp/s for that link, allowing the customer to use all 10Gb when the network is idle but limiting them to the equivalent of 1Gb/s of network access during times of network contention.

The way bandwidth_priority_limit works is that when each packet comes in, its bandwidth-priority is measured by an IIR function. A number called priority_multiplier is initialized at 1 and decreased toward zero as the measured bandwidth-priority exceeds bandwidth_priority_limit; when bandwidth-priority goes back below bandwidth_priority_limit, priority_multiplier is increased until it reaches 1 again. For each incoming packet, the priority is multiplied by priority_multiplier, thus keeping the total bandwidth-priority per second less than or equal to bandwidth_priority_limit. The leaseholder is able to change the bandwidth_priority_limit, and they are also able to change the alpha used in the IIR function for measuring bandwidth-priority (aka: bandwidth_priority_alpha).
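
The exact update formula is deliberately left open (see Notes below). The sketch that follows shows one plausible shape for it, assuming an IIR meter driven by bandwidth_priority_alpha and a fixed step for adjusting priority_multiplier; the names and the STEP constant are hypothetical.

    #include <stdint.h>

    /* Illustrative only: the spec does not define the exact update rule.
     * bpMeasured is an exponentially weighted estimate of bandwidth times
     * priority per second (Kbp/s) for one lease. */
    typedef struct {
        double bpMeasured;             /* smoothed Kbp/s on this lease       */
        double bandwidthPriorityLimit; /* bandwidth_priority_limit, in Kbp/s */
        double alpha;                  /* bandwidth_priority_alpha           */
        double priorityMultiplier;     /* starts at 1.0, decays toward 0     */
    } PriorityMeter;

    #define STEP 0.01  /* assumed tuning constant, not from the spec */

    /* Update the meter for one packet and return the adjusted priority. */
    double priority_update(PriorityMeter* pm, double packetKilobits,
                           double dtSeconds, double packetPriority)
    {
        double instant = packetKilobits * packetPriority / dtSeconds;
        pm->bpMeasured = pm->alpha * instant + (1.0 - pm->alpha) * pm->bpMeasured;

        if (pm->bpMeasured > pm->bandwidthPriorityLimit) {
            pm->priorityMultiplier -= STEP;   /* over the limit: back off */
            if (pm->priorityMultiplier < 0.0) { pm->priorityMultiplier = 0.0; }
        } else {
            pm->priorityMultiplier += STEP;   /* under the limit: recover */
            if (pm->priorityMultiplier > 1.0) { pm->priorityMultiplier = 1.0; }
        }
        return packetPriority * pm->priorityMultiplier;
    }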

Free tier

Every interface will have a "free tier" lease for which the priority decided by the prioritizer is always zero. The priority updater for the free tier will function as normal, but its bandwidth_priority_alpha and bandwidth_priority_limit will be configurable only by the switch owner.

Virtual Switches

In order to determine which bandwidth lease a packet will be sent on, we introduce the idea of the virtual switch. Each physical switch device will support some number of virtual switches, determined by the maker of the hardware. Virtual switches will be a resource which can be leased to bandwidth leaseholders so that they can use their bandwidth leases to build networks.

An owner of a virtual switch will be able to authorize associating their switch with a given lease, while the holder of a bandwidth lease will likewise be able to authorize associating their lease with a given virtual switch. When both sides authorize each other, the lease will be associated with the virtual switch. When a lease and a virtual switch are associated:

  • Any incoming traffic on that lease will go to that virtual switch, and
  • Any outgoing traffic from that switch that is destined for that link will use that bandwidth lease

Any traffic destined for an interface for which there is no associated lease will use the free tier.
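
A rough sketch of the association state this implies, with hypothetical structure names, sizes, and a placeholder FREE_TIER identifier; the hardware would hold one such table per virtual switch:

    #include <stdint.h>

    #define MAX_INTERFACES 64   /* hardware defined; value is an assumption */
    #define MAX_LEASES     64   /* hardware defined; value is an assumption */
    #define FREE_TIER      0    /* hypothetical identifier for the free tier lease */

    /* Incoming direction: each lease is associated with at most one
     * virtual switch that its traffic is delivered to. */
    typedef struct {
        uint16_t vswitchForLease[MAX_LEASES];
    } LeaseAssociations;

    /* Outgoing direction: for each interface, the lease (if any) that both
     * the virtual switch owner and the leaseholder have authorized. */
    typedef struct {
        uint16_t leaseForInterface[MAX_INTERFACES];
    } VirtualSwitch;

    /* Traffic leaving a virtual switch toward `iface` uses the associated
     * lease; if none was mutually authorized, it falls back to the free tier. */
    uint16_t outgoing_lease(const VirtualSwitch* vs, uint16_t iface)
    {
        uint16_t lease = vs->leaseForInterface[iface];
        return lease ? lease : FREE_TIER;
    }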

Configurables and data-points

  • Hardware defined:
    • Ring buffer size
    • Max leases per interface
    • Max virtual switches
  • Data points (read-only):
    • Per hard interface:
      • Highest min_priority in time-period
      • Average min_priority over time-period
      • Packets lost in transit over time-period
    • Per lease:
      • Bytes sent in time-period
      • Bytes dropped in time-period
      • Highest priority_multiplier in time-period
      • Average priority_multiplier in time-period
    • Per virtual switch:
      • Bytes sent in time-period (to each interface)
  • Configurables:
    • Global (defined by switch device owner):
      • prioritizer_alpha
      • Free tier bandwidth_priority_limit
      • Free tier bandwidth_priority_alpha
    • Per lease (defined by leaseholder):
      • bandwidth_priority_limit (Kbp/s)
      • bandwidth_priority_alpha
      • Associated virtual switch ID
    • Per virtual switch (defined by virtual switch leaseholder):
      • Associated lease ID (for each interface)
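
Gathered into one illustrative set of C structures (field names and MAX_INTERFACES are assumptions; the grouping follows the list above):

    #include <stdint.h>

    #define MAX_INTERFACES 64   /* hardware defined; value is an assumption */

    /* Global configurables, set by the switch device owner. */
    typedef struct {
        double prioritizerAlpha;                /* prioritizer_alpha                          */
        double freeTierBandwidthPriorityLimit;  /* free tier bandwidth_priority_limit, Kbp/s  */
        double freeTierBandwidthPriorityAlpha;  /* free tier bandwidth_priority_alpha         */
    } GlobalConfig;

    /* Per-lease configurables, set by the leaseholder. */
    typedef struct {
        double   bandwidthPriorityLimit;        /* Kbp/s */
        double   bandwidthPriorityAlpha;
        uint16_t associatedVirtualSwitchId;
    } LeaseConfig;

    /* Per-virtual-switch configurables, set by the virtual switch leaseholder. */
    typedef struct {
        uint16_t associatedLeaseId[MAX_INTERFACES];
    } VirtualSwitchConfig;

    /* Read-only data points published per hard interface over a time period. */
    typedef struct {
        uint32_t highestMinPriority;
        uint32_t averageMinPriority;
        uint64_t packetsLostInTransit;
    } InterfaceStats;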

Notes

  • We need to be careful of the possibility of oscillations, with ring buffers filling and emptying and min_priority rising and falling.
    • Perhaps we should introduce a configurable amount of randomized jitter in the IIR functions, in order to frustrate attempts to create oscillations in the network.
  • "priority_multiplier is initialized at 1 and decreased toward zero as the measured bandwidth-priority exceeds bandwidth_priority_limit" the actual formula is not defined
  • We need to find a good alpha to define for the free tier (and to use as a default for leases)
  • Metering bandwidth is a complicated problem.
    • Electricity is simple: you get paid to send it to the grid and you pay to receive it.
    • With bandwidth, you are potentially "using" the network whether you're sending or receiving, so any system which pays people to (e.g.) receive packets is gameable by someone who can trick others into sending them packets (e.g. by downloading big files all day).
    • The way bandwidth is metered on the internet depends on the relationship of network operators.
      • If two network operators have a customer/provider relationship, the customer pays for all traffic sent and received (in reality they don't pay per packet, they pay per month for a fixed size link and all traffic that is sent and received goes through that same link).
      • Suppose you and I have equipment in the same datacenter: we both pay our upstream providers for "internet access", but in fact your customers most often want to send traffic to my customers and vice-versa, so we might establish a "peer" relationship.
      • In a peer relationship, the two operators exchange traffic without either one paying the other anything. However, each is only allowed to use a peering interface to send traffic which is destined for the peer's customers. Sending traffic which needs to go on to the peer's provider is effectively stealing bandwidth and can be cause for being disconnected.
      • A flaw in the BGP system is that operators must be roughly the same size for peering to make sense. Suppose I'm a small ISP with 200 customers in one town and you're a large ISP with 5000 customers all across the country: we could peer in the one datacenter where I have hardware, but then all of the long distance transport between our customers would be paid for by you. We can hope to solve this problem by making it possible for large ISPs to segment their networks such that they appear as multiple small ISPs; then you can peer with me for free, but only within the same city, and you can then offer me discounted access to your customers in all other cities, a relationship which is neither entirely provider/customer nor entirely peering.
  • We should consider that one side of a link is the "owner".
    • This would be the person who paid for the hardware to get from pointA to pointB.
    • We can define this as the node which initiates the crypto session or the one who receives it.
    • If we define it as the initiator, then when Alice connects to Bob's wifi, Alice owns linkA->B and thus can hand it over to her supernode which allows her snode to manage it, which is what we want.
  • Providers will likely want to set up two parallel virtual networks: one confined to the links on which they already own leases (their local network) and the other interconnected with their provider. When peering, they will simply link the local virtual net with their peer in order to prevent the peer from using them as a route to their provider. There may be better traffic engineering solutions which rely on packet priority, but this is a simple approach which mirrors how ISPs operate today.