Reading material for Operations & Datacenter engineers and managers

Check out these projects, papers and blog posts if you're working on geo-redundant datacenters, or even if you only need to have your software hosted in one. It's good to know what you're in for.

  Collected these for a colleague; they have been super useful over
  the past 15+ years and will most likely help and/or entertain you.
  May be extended in the future.
  -- azet (@azet.org)

load balancing

DNS geo & anycast

tcp/udp at the edge

Good general overview before you dive into any particular project: https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/
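
The Unimog post explains how an edge load balancer keeps long-lived flows pinned to the same backend while the backend set changes underneath it. As a toy illustration of that property only - not Cloudflare's actual algorithm - here is a minimal rendezvous-hashing sketch in Python (the flow key and backend addresses are hypothetical):

```python
# Minimal rendezvous (highest-random-weight) hashing sketch -- illustrative only.
# It shows the property edge load balancers rely on: when a backend is added or
# removed, only the flows that hashed to that backend get remapped.
import hashlib

def pick_backend(flow_key: str, backends: list[str]) -> str:
    """Return the backend with the highest hash score for this flow."""
    def score(backend: str) -> int:
        digest = hashlib.sha256(f"{backend}|{flow_key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(backends, key=score)

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
flow = "198.51.100.7:54321->203.0.113.10:443/tcp"   # hypothetical 5-tuple
print(pick_backend(flow, backends))
# Removing one backend only remaps the flows that were pinned to it:
print(pick_backend(flow, backends[:-1]))
```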

open source load balancers / designs & projects (f5 is nothing compared to most of these):

dynamic route based fail-over, load balancing- & sharing at the border

Datacenter level considerations:

Introductory: https://www.youtube.com/watch?v=KAFDI8j_h00
[Watch me! A very informative and amusing LISA2014 talk on Facebook/Meta's switch from classical datacenter practice to the Open Compute Platform, given a few years back, right when the first wave of big-impact changes started to hit. It's very interesting to see how much has changed since, and even more so how much changed between the speaker joining the company and giving the talk, with prior experience running Microsoft and telco hyperscale datacenters.]

Compliance

https://www.youtube.com/watch?v=Ow9s2c7zYXc

Rack/Power considerations

Note on OCP - Open Compute Platform vs. pick-and-choose Datacenter & Vendor Design:

This information and these recommendations are based on latest-generation OCP designs and vendor-specific implementations of the base specifications. The reason for linking to and mentioning so many of them is their universal applicability and advanced open design: at the time of writing (late 2023) I'm not aware of any other open design specification that covers the entire datacenter infrastructure - universal management APIs (Redfish, REST) from compute nodes to PSUs and backup battery packs, open storage management (Swordfish) for block storage servers, NVMe & SSD flash JBOD arrays etc., and SAS/SATA/SMBus/Modbus/OpenCAPI/NVLink/PCIe interconnects & switching. It includes open specifications plus independent third-party as well as established industry-vendor implementations (i.e. finished products you can buy today at very competitive prices) using open hardware, software and firmware (providing manual & automated remote update capabilities for every component), with open-source Baseboard Management Controllers (OpenBMC) used throughout. While other hyperscale providers use similar designs, most of them are not open, or only partially so. Open designs related to HPC installations are very application-specific and may not fit what you're looking for in a datacenter, so very few are included here.

The OCP Open Rack version 3 spec includes the classical 21" OU (OpenUnit) rack width for racks, compute and power, and the form factors are interchangeable - i.e. a rack can hold both types of components - with vendors implementing both OU and 19" versions of compute, storage, power, BBUs (backup battery units), etc. All in all this offers the best and most well-understood, tested and researched modular open specification for datacenters, from a few racks up to hyperscale size. It also offers the best power/energy/thermal efficiency of any open design; as such it isn't just nicely green, re-using a lot of recyclables for a low carbon footprint - in effect it saves the most money in energy consumption, cooling and re-use of thermal heat. If you're building a datacenter from the ground up, rather than racking up a cage in an existing one, OCP is a very interesting solution you absolutely need to look into. If you're in a modern Tier 3-4 datacenter and own a cage or room, there's still the possibility to use 19" racks and get built-in rack redundancy and low-cost hardware with a unified management interface and API.

In most vendor implementations every component offers multiple layers of redundancy and easy field-replaceable units (FRUs) which can be hot-swapped (while in operation), either without any special tooling/hardware or without tools at all (a customer option: e.g. standard for Meta's OCP ORv3 racks; Uber uses ORv3 racks with EIA 19"-width components in their datacenters; Microsoft has gone a similar way with their Project Olympus implementation for their hyperscale datacenter use cases; while Google uses both variants as well as versions with the 48V busbar in a different location within the rack, among other modifications, for integration into existing DCF designs). Power and backup power modules may be $N+1$ redundant by default; optionally you can have $2 \times N+1$ PSUs and BBUs per rack.
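
Since the OCP ecosystem standardizes on Redfish for out-of-band management, here is a minimal sketch of polling power telemetry from a BMC. Treat it as illustrative only: the hostname and credentials are hypothetical, and the exact resource layout (e.g. the older Power resource vs. the newer PowerSubsystem) varies by vendor and firmware generation.

```python
# Sketch: read per-chassis power draw from a Redfish (DMTF) BMC endpoint.
# Illustrative only -- host, credentials and resource layout are assumptions.
import requests

BMC = "https://bmc.rack01.example.net"   # hypothetical BMC address
AUTH = ("admin", "changeme")             # use proper credentials/session auth in practice

def get(path: str) -> dict:
    r = requests.get(f"{BMC}{path}", auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()
    return r.json()

# Walk the standard collection: Chassis collection -> each chassis' Power resource.
for member in get("/redfish/v1/Chassis")["Members"]:
    chassis = get(member["@odata.id"])
    power = get(chassis["Power"]["@odata.id"])   # older schema; newer BMCs expose PowerSubsystem
    for ctl in power.get("PowerControl", []):
        print(chassis["Id"], ctl.get("PowerConsumedWatts"), "W")
```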

OCP Open Rack V3 (ORv3):
Rack & specific implementations:
48V Busbar, Power-cables and connectors:
Power system: PSUs & Battery Backup:
Older research & development on 48V use in data-center applications:

Thermals: Cooling (HVAC) - i.e. energy and heat transfer

Power Usage Effectiveness

PUE stands for Power Usage Effectiveness. It's a value that tells you how well an entire datacenter was built - and sometimes when it was built. As progress goes on, it's not uncommon to see hyperscale datacenters or HPC sites with PUE values close to 1, which means almost no waste.

The calculation of power usage effectiveness: PUE = total facility energy / IT equipment energy.
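
A quick worked example with made-up numbers:

```python
# Worked example with hypothetical numbers: a facility drawing 1.5 MW in total
# while its IT equipment draws 1.25 MW has PUE = 1.5 / 1.25 = 1.2,
# i.e. 20% overhead for cooling, power conversion, lighting, etc.
total_facility_kw = 1500.0
it_equipment_kw = 1250.0
pue = total_facility_kw / it_equipment_kw
print(f"PUE = {pue:.2f}")   # -> PUE = 1.20
```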

In-depth article: https://www.datacenterknowledge.com/sustainability/what-data-center-pue-defining-power-usage-effectiveness

My two cents on the issue:
[It has become essential to think about waste energy, like over-cooled DC rooms. Today many run cold aisles at 30-40°C instead of the 20-25°C of the past; this makes a huge impact on energy cost and efficiency, and it actually lowers power consumption when you have 80-95% efficient PSUs, fans and all the other kinds of equipment you see everywhere in a datacenter, over and over again, running continuously. Cooling in datacenters is more than just hot and cold aisles - it's becoming more and more frequent for bigger datacenter providers and cloud companies to use cold-air intake. If you have a DC in Norway, that's really good in winter! Others use geothermal sources or waterways to cool. In past HPC engagements I've seen bigger sites use the thermal output of their TOP100 machine to heat the adjacent buildings on campus. It's a concept known as "green computing", but I never liked the term, because for the most part cloud providers like AWS will do ANYTHING to save $$$: "if it's good, that's a nice thing to market; if it's bad, well, let's just see it doesn't become a legal issue". But credit where credit is due - most of these really smart energy-efficiency schemes and green-computing initiatives are run by engineers who actually care.]

Sustainability, Re-usability & Carbon-footprint

Background information on carbon emissions and sustainability issues:
Open Compute Project DCF related:
Cloud Provider self-reporting:

DCF Security & Logging

Silent Data Corruption / HW Faults at scale

"Planet Scale" Considerations & Solutions (multidisciplinary)

Software Stack for Distributed Cloud Services

Cluster Scheduling / management

Distributed Data storage & Databases

Networking

Network and Interconnect / Peering / Border & Edge

https://engineering.fb.com/2014/11/14/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/
[Spine/leaf topologies, now the de-facto standard among all major enterprise industry vendors]
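
For a rough feel of the arithmetic behind such two-tier fabrics, here is a back-of-the-envelope sketch with hypothetical port counts and speeds (not Facebook's actual fabric dimensions):

```python
# Back-of-the-envelope sketch for a two-tier leaf-spine (folded Clos) fabric --
# all numbers are hypothetical, just to show the arithmetic behind such designs.
def leaf_spine(leaf_count, spine_count, server_ports_per_leaf,
               server_port_gbps, uplink_gbps):
    # Each leaf connects once to every spine, so spine port count bounds the number
    # of leaves and the per-leaf uplink count bounds the number of spines.
    downlink = server_ports_per_leaf * server_port_gbps   # per-leaf server-facing bandwidth
    uplink = spine_count * uplink_gbps                    # per-leaf fabric-facing bandwidth
    return {
        "servers": leaf_count * server_ports_per_leaf,
        "oversubscription": downlink / uplink,            # 1.0 == non-blocking
        "leaf_spine_links": leaf_count * spine_count,
    }

# e.g. 16 leaves x 4 spines, 48 x 25G server ports and 4 x 100G uplinks per leaf:
print(leaf_spine(16, 4, 48, 25, 100))   # 768 servers, 3:1 oversubscribed, 64 fabric links
```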

Scaling L3 Switching to a unified Open Compute Standard on 100, 200, 400, 800G:

Running a globally-distributed Network, WAN and data-center interconnects at planetary scale:

Net.: Resilience, Congestion Control, QoS/CoS:

Net.: IPv4 "mobility" & exhaustion:

Net.: Topologies:

Net.: Change Management / Life Cycle / Zero Trust

Net.: Storage over Ethernet

NVMe over TCP (SNIA):

Net.: Open Source Network OS (NOS)

(as well as parts of Arista EOS, Juniper JunOS, etc.)

Net.: Open Optical Transport:

Time management / precision timing in the datacenter & network

Open-hardware & source generic solutions for DCs:

Grandmaster Clocks: GPS/GNSS/Glonass/Galileo, Atomic Clocks

Signal Distribution (typically PPS, e.g. 1PPS or 10 MHz, supported by Routers/Switches and Telco Gear)

Issues with Timekeeping: the infamous leap second (don't forget about the leap day)

[Support fault-tolerant, highly available NTP and PTP. Realistically, you need to support both if you're not just running a few rented racks. Many routers, switches and telco gear (4G, 5G, ...) require PTP; NTP is for servers and VMs.]

https://developers.google.com/time/smear
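
The page above describes a 24-hour linear leap smear. A simplified sketch of the idea (illustrative only, not Google's production implementation) for the real 2016-12-31 positive leap second:

```python
# Sketch of a 24-hour linear leap-second smear: during a noon-to-noon UTC window
# around a positive leap second, clocks are slewed so they absorb the extra second
# gradually and never have to display 23:59:60.
from datetime import datetime, timedelta, timezone

LEAP = datetime(2017, 1, 1, 0, 0, tzinfo=timezone.utc)   # end of the 2016-12-31 leap second
WINDOW = timedelta(hours=24)
START = LEAP - WINDOW / 2                                # smear runs noon to noon UTC

def smear_fraction(t: datetime) -> float:
    """Fraction of the leap second absorbed by instant t (0.0 before, 1.0 after the window)."""
    if t <= START:
        return 0.0
    if t >= START + WINDOW:
        return 1.0
    return (t - START) / WINDOW   # linear ramp from 0 to 1 over 24 hours

print(smear_fraction(LEAP))                         # 0.5 -- halfway through the smear
print(smear_fraction(START + timedelta(hours=18)))  # 0.75
```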

Hardware Security Modules + Key Storage

Storage Key Management

https://www.youtube.com/watch?v=ZoTg-wwZ6Yw

Post mortems of outages you can learn from:

(most importantly: write a good post mortem in the first place; provide a concise timeline, response times, things that went well & things that didn't. Provide a central place for customers and affected parties to call in - i.e. a "war room" - so your engineers can do their work without their mobiles ringing every 20 seconds)

https://github.com/aphyr/partitions-post/blob/master/README.markdown
(this is a true gem that was passed around among SREs & software engineers some years back. Partitions do exist.)

https://github.com/danluu/post-mortems

Planning for failure: chaos engineering

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

--- https://principlesofchaos.org
