Reading material for Operations & Datacenter engineers and managers

Check out these projects, papers and blog posts if you're working on geo-redundant datacenters, or even if you only need to have your software hosted in one. It's good to know what you're in for.

  Collected these for a colleague; they have been super useful over
  the past 15+ years and will most likely help and/or entertain you.
  May be extended in the future.
  -- azet (@azet.org)

load balancing

DNS geo & anycast

tcp/udp at the edge

Good general overview before you dive into any particular project: https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/
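
The Unimog post explains how an edge load balancer keeps long-lived flows pinned to the same backend while the backend set changes underneath it. As a toy illustration of that property only - not Cloudflare's actual algorithm - here is a minimal rendezvous-hashing sketch in Python (the flow key and backend addresses are hypothetical):

```python
# Minimal rendezvous (highest-random-weight) hashing sketch -- illustrative only.
# It shows the property edge load balancers rely on: when a backend is added or
# removed, only the flows that hashed to that backend get remapped.
import hashlib

def pick_backend(flow_key: str, backends: list[str]) -> str:
    """Return the backend with the highest hash score for this flow."""
    def score(backend: str) -> int:
        digest = hashlib.sha256(f"{backend}|{flow_key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(backends, key=score)

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
flow = "198.51.100.7:54321->203.0.113.10:443/tcp"   # hypothetical 5-tuple
print(pick_backend(flow, backends))
# Removing one backend only remaps the flows that were pinned to it:
print(pick_backend(flow, backends[:-1]))
```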

open source load balancers / designs & projects (f5 is nothing compared to most of these):

dynamic route based fail-over, load balancing- & sharing at the border

Datacenter level considerations:

Introductory: https://www.youtube.com/watch?v=KAFDI8j_h00
[Watch me! A very informative and amusing LISA2014 talk on Facebook/Meta's switch from classical datacenter practice to the Open Compute Platform, given a few years back, right when the first wave of big-impact changes started to hit. It's very interesting to see how much has changed since, and even more so how much changed between the speaker joining the company and giving the talk, with prior experience running Microsoft and telco hyperscale datacenters.]

Compliance

https://www.youtube.com/watch?v=Ow9s2c7zYXc

Rack/Power considerations

Note on OCP - Open Compute Platform vs. pick-and-choose Datacenter & Vendor Design:

This information and these recommendations are based on latest-generation OCP designs and vendor-specific implementations of the base specifications. The reason for linking to and mentioning so many of them is their universal applicability and advanced open design: at the time of writing (late 2023) I'm not aware of any other open design specification that covers the entire datacenter infrastructure - universal management APIs (Redfish, REST) from compute nodes to PSUs and backup battery packs, open storage management (Swordfish) for block storage servers, NVMe & SSD flash JBOD arrays etc., and SAS/SATA/SMBus/Modbus/OpenCAPI/NVLink/PCIe interconnects & switching. It includes open specifications plus independent third-party as well as established industry-vendor implementations (i.e. finished products you can buy today at very competitive prices) using open hardware, software and firmware (providing manual & automated remote update capabilities for every component), with open-source Baseboard Management Controllers (OpenBMC) used throughout. While other hyperscale providers use similar designs, most of them are not open, or only partially so. Open designs related to HPC installations are very application-specific and may not fit what you're looking for in a datacenter, so very few are included here.

The OCP Open Rack version 3 spec includes the classical 21" OU (OpenUnit) rack width for racks, compute and power, and the form factors are interchangeable - i.e. a rack can hold both types of components - with vendors implementing both OU and 19" versions of compute, storage, power, BBUs (backup battery units), etc. All in all this offers the best and most well-understood, tested and researched modular open specification for datacenters, from a few racks up to hyperscale size. It also offers the best power/energy/thermal efficiency of any open design; as such it isn't just nicely green, re-using a lot of recyclables for a low carbon footprint - in effect it saves the most money in energy consumption, cooling and re-use of thermal heat. If you're building a datacenter from the ground up, rather than racking up a cage in an existing one, OCP is a very interesting solution you absolutely need to look into. If you're in a modern Tier 3-4 datacenter and own a cage or room, there's still the possibility to use 19" racks and get built-in rack redundancy and low-cost hardware with a unified management interface and API.

In most vendor implementations every component offers multiple layers of redundancy and easy field-replaceable units (FRUs) which can be hot-swapped (while in operation), either without any special tooling/hardware or without tools at all (a customer option: e.g. standard for Meta's OCP ORv3 racks; Uber uses ORv3 racks with EIA 19"-width components in their datacenters; Microsoft has gone a similar way with their Project Olympus implementation for their hyperscale datacenter use cases; while Google uses both variants as well as versions with the 48V busbar in a different location within the rack, among other modifications, for integration into existing DCF designs). Power and backup power modules may be $N+1$ redundant by default; optionally you can have $2 \times N+1$ PSUs and BBUs per rack.
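
Since the OCP ecosystem standardizes on Redfish for out-of-band management, here is a minimal sketch of polling power telemetry from a BMC. Treat it as illustrative only: the hostname and credentials are hypothetical, and the exact resource layout (e.g. the older Power resource vs. the newer PowerSubsystem) varies by vendor and firmware generation.

```python
# Sketch: read per-chassis power draw from a Redfish (DMTF) BMC endpoint.
# Illustrative only -- host, credentials and resource layout are assumptions.
import requests

BMC = "https://bmc.rack01.example.net"   # hypothetical BMC address
AUTH = ("admin", "changeme")             # use proper credentials/session auth in practice

def get(path: str) -> dict:
    r = requests.get(f"{BMC}{path}", auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()
    return r.json()

# Walk the standard collection: Chassis collection -> each chassis' Power resource.
for member in get("/redfish/v1/Chassis")["Members"]:
    chassis = get(member["@odata.id"])
    power = get(chassis["Power"]["@odata.id"])   # older schema; newer BMCs expose PowerSubsystem
    for ctl in power.get("PowerControl", []):
        print(chassis["Id"], ctl.get("PowerConsumedWatts"), "W")
```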

OCP Open Rack V3 (ORv3):
Rack & specific implementations:
48V Busbar, Power-cables and connectors:
Power system: PSUs & Battery Backup:
Older research & development on 48V use in data-center applications:

Thermals: Cooling (HVAC) - i.e. energy and heat transfer

Power Usage Effectiveness

PUE stands for Power Usage Effectiveness. It's a value that tells you how well an entire datacenter was built - and sometimes when it was built. As progress goes on, it's not uncommon to see hyperscale datacenters or HPC sites with PUE values close to 1, which means almost no waste.

The calculation of power usage effectiveness: PUE = total facility energy / IT equipment energy.
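
A quick worked example with made-up numbers:

```python
# Worked example with hypothetical numbers: a facility drawing 1.5 MW in total
# while its IT equipment draws 1.25 MW has PUE = 1.5 / 1.25 = 1.2,
# i.e. 20% overhead for cooling, power conversion, lighting, etc.
total_facility_kw = 1500.0
it_equipment_kw = 1250.0
pue = total_facility_kw / it_equipment_kw
print(f"PUE = {pue:.2f}")   # -> PUE = 1.20
```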

In-depth article: https://www.datacenterknowledge.com/sustainability/what-data-center-pue-defining-power-usage-effectiveness

My two cents on the issue:
[It has become essential to think about waste energy, like over-cooled DC rooms. Today many run cold aisles at 30-40°C instead of the 20-25°C of the past; this makes a huge impact on energy cost and efficiency, and it actually lowers power consumption when you have 80-95% efficient PSUs, fans and all the other kinds of equipment you see everywhere in a datacenter, over and over again, running continuously. Cooling in datacenters is more than just hot and cold aisles - it's becoming more and more frequent for bigger datacenter providers and cloud companies to use cold-air intake. If you have a DC in Norway, that's really good in winter! Others use geothermal sources or waterways to cool. In past HPC engagements I've seen bigger sites use the thermal output of their TOP100 machine to heat the adjacent buildings on campus. It's a concept known as "green computing", but I never liked the term, because for the most part cloud providers like AWS will do ANYTHING to save $$$: "if it's good, that's a nice thing to market; if it's bad, well, let's just see it doesn't become a legal issue". But credit where credit is due - most of these really smart energy-efficiency schemes and green-computing initiatives are run by engineers who actually care.]

Sustainability, Re-usability & Carbon-footprint

Background information on carbon emissions and sustainability issues:
Open Compute Project DCF related:
Cloud Provider self-reporting:

DCF Security & Logging

Silent Data Corruption / HW Faults at scale

"Planet Scale" Considerations & Solutions (multidisciplinary)

Software Stack for Distributed Cloud Services

Cluster Scheduling / management

Distributed Data storage & Databases

Networking

Network and Interconnect / Peering / Border & Edge

https://engineering.fb.com/2014/11/14/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/
[Spine/leaf topologies, now the de-facto standard among all major enterprise industry vendors]
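
For a rough feel of the arithmetic behind such two-tier fabrics, here is a back-of-the-envelope sketch with hypothetical port counts and speeds (not Facebook's actual fabric dimensions):

```python
# Back-of-the-envelope sketch for a two-tier leaf-spine (folded Clos) fabric --
# all numbers are hypothetical, just to show the arithmetic behind such designs.
def leaf_spine(leaf_count, spine_count, server_ports_per_leaf,
               server_port_gbps, uplink_gbps):
    # Each leaf connects once to every spine, so spine port count bounds the number
    # of leaves and the per-leaf uplink count bounds the number of spines.
    downlink = server_ports_per_leaf * server_port_gbps   # per-leaf server-facing bandwidth
    uplink = spine_count * uplink_gbps                    # per-leaf fabric-facing bandwidth
    return {
        "servers": leaf_count * server_ports_per_leaf,
        "oversubscription": downlink / uplink,            # 1.0 == non-blocking
        "leaf_spine_links": leaf_count * spine_count,
    }

# e.g. 16 leaves x 4 spines, 48 x 25G server ports and 4 x 100G uplinks per leaf:
print(leaf_spine(16, 4, 48, 25, 100))   # 768 servers, 3:1 oversubscribed, 64 fabric links
```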

Scaling L3 Switching to a unified Open Compute Standard on 100, 200, 400, 800G:

Running a globally-distributed Network, WAN and data-center interconnects at planetary scale:

Net.: Resilience, Congestion Control, QoS/CoS:

Net.: IPv4 "mobility" & exhaustion:

Net.: Topologies:

Net.: Change Management / Life Cycle / Zero Trust

Net.: Storage over Ethernet

NVMe over TCP (SNIA):

Net.: Open Source Network OS (NOS)

(as well as parts of Arista EOS, Juniper JunOS, etc.)

Net.: Open Optical Transport:

Time management / precision timing in the datacenter & network

Open-hardware & source generic solutions for DCs:

Grandmaster Clocks: GPS/GNSS/Glonass/Galileo, Atomic Clocks

Signal Distribution (typically PPS, e.g. 1PPS or 10 MHz, supported by Routers/Switches and Telco Gear)

Issues with Timekeeping: the infamous leap second (don't forget about the leap day)

[Support fault-tolerant, highly available NTP and PTP. Realistically, you need to support both if you're not just running a few rented racks. Many routers, switches and telco gear (4G, 5G, ...) require PTP; NTP is for servers and VMs.]

https://developers.google.com/time/smear
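
The page above describes a 24-hour linear leap smear. A simplified sketch of the idea (illustrative only, not Google's production implementation) for the real 2016-12-31 positive leap second:

```python
# Sketch of a 24-hour linear leap-second smear: during a noon-to-noon UTC window
# around a positive leap second, clocks are slewed so they absorb the extra second
# gradually and never have to display 23:59:60.
from datetime import datetime, timedelta, timezone

LEAP = datetime(2017, 1, 1, 0, 0, tzinfo=timezone.utc)   # end of the 2016-12-31 leap second
WINDOW = timedelta(hours=24)
START = LEAP - WINDOW / 2                                # smear runs noon to noon UTC

def smear_fraction(t: datetime) -> float:
    """Fraction of the leap second absorbed by instant t (0.0 before, 1.0 after the window)."""
    if t <= START:
        return 0.0
    if t >= START + WINDOW:
        return 1.0
    return (t - START) / WINDOW   # linear ramp from 0 to 1 over 24 hours

print(smear_fraction(LEAP))                         # 0.5 -- halfway through the smear
print(smear_fraction(START + timedelta(hours=18)))  # 0.75
```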

Hardware Security Modules + Key Storage

Storage Key Management

https://www.youtube.com/watch?v=ZoTg-wwZ6Yw

Post mortems of outages you can learn from:

(most importantly: write a good post mortem in the first place; provide a concise timeline, response times, things that went well & things that didn't. Provide a central place for customers and affected parties to call in - i.e. a "war room" - so your engineers can do their work without their mobiles ringing every 20 seconds)

https://github.com/aphyr/partitions-post/blob/master/README.markdown
(this is a true gem that was passed around among SREs & software engineers some years back. Partitions do exist.)

https://github.com/danluu/post-mortems

Planning for failure: chaos engineering

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

--- https://principlesofchaos.org
