@ywkaras
Last active August 2, 2022 23:04
Next Hop Lookup
Some thoughts…
A general way to model next hop addresses is as an entity stack: entity/sub-entity/sub-sub-entity, and so on. I think the primary data for a next hop address would simply be an up/down flag. A next hop would be down if any entity in its address was down.
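A minimal sketch of that model, assuming a hypothetical `EntityStatusTable` keyed by the entity path (none of these names are an existing API):

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical registry of per-entity up/down flags.  An entity is keyed by
// its full path in the stack, e.g. "example.com", "example.com/443".
class EntityStatusTable {
public:
  void set_down(const std::string &entity_path, bool down) { _down[entity_path] = down; }
  bool is_down(const std::string &entity_path) const {
    auto it = _down.find(entity_path);
    return it != _down.end() && it->second;
  }

private:
  std::unordered_map<std::string, bool> _down;
};

// A next hop address is an entity stack: entity, sub-entity, sub-sub-entity, ...
// The next hop is down if any prefix of its stack is down.
class NextHopAddress {
public:
  explicit NextHopAddress(std::vector<std::string> entities) : _entities(std::move(entities)) {}

  bool is_down(const EntityStatusTable &table) const {
    std::string path;
    for (const auto &e : _entities) {
      path += (path.empty() ? "" : "/") + e;
      if (table.is_down(path)) {
        return true;
      }
    }
    return false;
  }

private:
  std::vector<std::string> _entities;
};
```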
The next hop logic could also accept a continuation to be scheduled when a next hop makes a down-to-up transition.
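A sketch of how that registration might look, with `std::function` standing in for an ATS-style continuation and the scheduling collapsed into a direct call (all names here are hypothetical):

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for scheduling a continuation: a plain callback.
using UpCallback = std::function<void(const std::string &entity_path)>;

class EntityStatusNotifier {
public:
  // Register a callback to run when the given entity next comes back up.
  void on_up(const std::string &entity_path, UpCallback cb) {
    _waiters[entity_path].push_back(std::move(cb));
  }

  // Called by whatever detects failures and recoveries.
  void mark(const std::string &entity_path, bool down) {
    bool was_down = _down[entity_path];
    _down[entity_path] = down;
    if (was_down && !down) {
      // Down-to-up transition: schedule (here, directly invoke) the waiters.
      for (auto &cb : _waiters[entity_path]) {
        cb(entity_path);
      }
      _waiters[entity_path].clear();
    }
  }

private:
  std::unordered_map<std::string, bool> _down;
  std::unordered_map<std::string, std::vector<UpCallback>> _waiters;
};
```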
If DNS lookup were done before next hop lookup, then next hop addresses would naturally be IP-address/IP-port.
If DNS lookup were done as part of next hop lookup, then next hop addresses would naturally be hostname/IP-port. (A down hostname would be one for which DNS lookup failed, or whose IP address is down.) A big disadvantage is that the lookup results would have to be delivered by scheduling a continuation supplied with the lookup request. Another issue arises if the DNS lookup returns a load-sharing/protection group of IP addresses; if next hop lookup is going to handle that, it might make sense for it to handle strategies.yaml (or something equivalent) as well.
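A sketch of what the asynchronous interface might look like if DNS were folded into the lookup; the names and the `std::function` continuation are placeholders, and the stub body only illustrates the call shape:

```cpp
#include <functional>
#include <string>

// Hypothetical result of a next hop lookup that resolves DNS internally.
struct NextHopResult {
  bool up = false;        // false if DNS failed or the resolved address is down
  std::string ip_address; // one member of a possible load-sharing/protection group
  int port = 0;
};

using LookupContinuation = std::function<void(const NextHopResult &)>;

// The caller cannot get the result back directly; it supplies a continuation
// that is scheduled once DNS resolution (and any strategies.yaml-style group
// selection) completes.
void next_hop_lookup(const std::string &hostname, int port, LookupContinuation cont) {
  // A real implementation would issue an asynchronous DNS query here and
  // schedule `cont` from its completion handler.  This stub just reports down.
  cont(NextHopResult{false, "", port});
}
```

A caller would then pass a lambda that either connects to `ip_address`/`port` or treats the hostname as down.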
For next hops over TLS, there would be value in appending /SNI-name to the address entity stack, since just that SNI name could be failing (presumably due to a bad cert or some other config error).
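For example (illustrative only), the entity stack for a TLS next hop might be built like this, so a status table could mark just the SNI level down:

```cpp
#include <string>
#include <vector>

// Append the SNI name as one more entity, so a failure tied only to that SNI
// (e.g. a bad cert) can be marked down without taking out the whole host/port.
std::vector<std::string> tls_next_hop_entities(const std::string &hostname, int port,
                                               const std::string &sni_name)
{
  return {hostname, std::to_string(port), sni_name}; // hostname / IP-port / SNI-name
}
```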
The SNI case is an example of a long-term failure, versus a transient error. We could consider making this distinction in failures in our up/down logic. However, when I worked on wireline L1/2 equipment, the prevailing idea was: don't try to optimize the retry strategy to mitigate a flaky physical layer. (We were siloed from the wireless people, so maybe it's different in that case.) It's rare that a physical layer is so flaky (but still marginally usable) that it can't be cleaned up by Forward Error Correction. The more common transient error is next hop queue tail drops due to congestion (and there are no great ways to deal with those).