Check out these projects, papers and blog posts if you're working on geo-redundant datacenters, or even if you just need to have your software hosted in one. It's good to know what you're in for.
Collected these for a colleague; they have been super useful over
the past 15+ years and will most likely help and/or entertain you.
May be extended in the future.
-- azet (@azet.org)
- https://dnsdist.org
- https://yetiops.net/posts/anycast-bgp/
- https://blog.cloudflare.com/a-brief-anycast-primer/ & https://www.cloudflare.com/en-gb/learning/dns/what-is-anycast-dns/
Good general overview before you dive into any particular project: https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/
- https://traefik.io/traefik/ - https://github.com/traefik/traefik
- https://github.blog/2016-09-22-introducing-glb/
- https://engineering.fb.com/2018/05/22/open-source/open-sourcing-katran-a-scalable-network-load-balancer/
- https://github.com/google/seesaw + https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44824.pdf
- https://vincent.bernat.ch/en/blog/2018-multi-tier-loadbalancer
- https://blog.cloudflare.com/cloudflares-architecture-eliminating-single-p/
- https://vincent.bernat.ch/en/blog/2013-exabgp-highavailability - https://github.com/Exa-Networks/exabgp (a minimal healthcheck sketch follows this list)
- https://youtu.be/jJTqGFs4LNo?si=CnjAJChNNSA45MaW
- https://metallb.universe.tf/concepts/bgp/
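If you end up doing BGP anycast/failover yourself, e.g. with ExaBGP (linked above), the core pattern is simple: every site announces the same service prefix and withdraws it when the local service fails its health check, so traffic drains to the remaining sites. Below is a minimal sketch of such a health-check process speaking ExaBGP's text API; the prefix, health URL and timings are made-up placeholders, and in production you'd more likely use ExaBGP's own healthcheck tooling.

```python
#!/usr/bin/env python3
# Hypothetical ExaBGP API process (a sketch, not ExaBGP's bundled healthcheck):
# announce an anycast service prefix while the local service answers,
# withdraw it as soon as it stops answering.
import sys
import time
import urllib.request

SERVICE_PREFIX = "203.0.113.53/32"            # example anycast address served by this node
HEALTH_URL = "http://127.0.0.1:8080/healthz"  # hypothetical local health endpoint
announced = False

def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

while True:
    up = healthy()
    if up and not announced:
        # ExaBGP reads these commands from the configured process' stdout
        sys.stdout.write(f"announce route {SERVICE_PREFIX} next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not up and announced:
        sys.stdout.write(f"withdraw route {SERVICE_PREFIX} next-hop self\n")
        sys.stdout.flush()
        announced = False
    time.sleep(5)
```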
Introductory: https://www.youtube.com/watch?v=KAFDI8j_h00
[watch me! very informative and amusing talk on Facebook/Meta switching from classical data-center designs to the Open Compute Project, given at LISA2014 - a few years back, right when the first wave of big-impact changes started to hit. It's very interesting to see how much has changed since, but even more so how much changed between the speaker joining the company (with previous experience running Microsoft and telco hyperscale datacenters) and giving the talk]
https://www.youtube.com/watch?v=Ow9s2c7zYXc
These infos/recommendations are based on the latest-generation OCP designs and vendor-specific implementations of the base specs. The reason for linking to and mentioning so many of them is their universal applicability and advanced open design: at the time of writing (late 2023) I'm not aware of any other open design specification that covers the entire data-center infrastructure, with universal management APIs (Redfish - REST) from compute nodes to PSUs and backup battery packs, open storage management (Swordfish) for block storage servers, NVMe & SSD flash JBOD arrays etc., and SAS/SATA/SMBus/ModBus/OpenCAPI/NVLink/PCIe interconnects & switching. It includes open specifications and independent third-party as well as established industry vendor implementations (i.e. finished products you can buy today at very competitive prices) using open hardware, software and firmware (providing manual & automated remote update capabilities for every component) and open-source Baseboard Management Controllers (OpenBMC) throughout. While other hyperscale providers do use similar designs, most of them are not open, or only partially. Open designs for HPC installations are very application-specific and may not fit what you're looking for in a data-center, so very few are included.

The OCP Rack version 3 spec includes the classical OU (OpenUnit) 21" rack width for racks, compute and power, interchangeably - i.e. a rack can hold both types of components - and vendors implement both OU and 19" variants for compute, storage, power, BBU (backup battery unit), etc. All in all this offers the best and most well understood, tested and researched modular, open specification for data-centers, from a few racks up to hyperscale size. It also offers the best power/energy/thermal efficiency of any open design, so it isn't just nicely green and re-uses a lot of recyclables for a low carbon footprint - in effect it saves the most money on energy consumption, cooling and re-use of thermal heat. If you're building a data-center from the ground up, rather than racking up a cage in an existing one, OCP is a very interesting solution you absolutely need to look into. If you're in a modern Tier 3-4 datacenter and own a cage or room, there's still the possibility to use 19" racks and get built-in rack redundancy plus low-cost hardware providing a unified management interface and API.

In most vendor implementations every component offers multiple redundancy and easy field replaceable units (FRUs) which can be hot-swapped (while in operation), either without any special tooling/hardware or without tools at all (customer option: e.g. standard for Meta OCP ORv3 racks; Uber uses ORv3 racks with EIA 19"-width components in their datacenters; Microsoft has gone a similar way with their Project Olympus implementation for their hyperscale datacenter use-cases; while Google uses both variants as well as versions with the 48V busbar in a different location within the racks, among other modifications for integration into existing DCF designs). Power and backup power modules may be N+1 redundant by default; optionally you can have 2×(N+1) PSUs and BBUs per rack.
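Since Redfish is just HTTPS + JSON, poking at any of these components' BMCs looks the same regardless of vendor. Here's a minimal sketch assuming a reachable BMC exposing the standard DMTF Chassis collection with the older Power/Thermal resources (host, credentials and endpoints are assumptions; newer Redfish versions add PowerSubsystem/ThermalSubsystem):

```python
#!/usr/bin/env python3
# Minimal Redfish walk of a BMC: list chassis, then pull PSU health and fan readings.
# BMC address and credentials are placeholders; verify TLS properly outside a lab.
import requests

BMC = "https://bmc.example.net"       # hypothetical BMC address
session = requests.Session()
session.auth = ("admin", "changeme")  # placeholder credentials
session.verify = False                # lab-only shortcut

# The service root and Chassis collection are standard DMTF Redfish resources.
chassis = session.get(f"{BMC}/redfish/v1/Chassis").json()
for member in chassis.get("Members", []):
    path = member["@odata.id"]
    power = session.get(f"{BMC}{path}/Power").json()
    thermal = session.get(f"{BMC}{path}/Thermal").json()
    for psu in power.get("PowerSupplies", []):
        print(path, psu.get("Name"), psu.get("Status", {}).get("Health"))
    for fan in thermal.get("Fans", []):
        print(path, fan.get("Name"), fan.get("Reading"), fan.get("ReadingUnits"))
```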
- Open Rack 48V Busbar and Connectors
- ORV3 48V system power architecture & HSC design consideration
- Google/Microsoft 48V Onboard Power Delivery Specifications
- ORV3 AC WHIP Power Cable Development
- Open Rack V3 Power System Overview
- Requirements/Considerations of Next Generation of ORv3 PSU and Power Shelves
('HPR Spec' = 83% power increase: shelf 18->33kW, each rectifier 3.3->5.5kW)
- ORV3 BBU Shelf Technical Highlights
- Design Challenges of a BBU Module and Shelf Solution
- Deep Dive on Open Rack V3 Power Shelves
- Open Rack V3 Monitoring and Control of Power/Battery Systems
- 19" Version/'Uber Universal 48V Power Shelf'
- Efficiency Improvements and Other Developments in ORv3 Power Solutions
- High Power (100KW) Open Rack Architecture
- classical DC HVAC: https://www.youtube.com/watch?v=xv-i4RQLswo
- Modular/OCP/Cloud Provider scale cooling: https://www.youtube.com/watch?v=fSzQyybedTs
PUE stands for Power Usage Effectiveness; it's a metric that tells you how well an entire datacenter was built - and, as efficiency work continues after it was built, how well it is being run. It's not uncommon to see hyperscale datacenters or HPC sites with PUE values close to 1, which means almost no overhead: nearly all of the facility's power goes to the IT equipment rather than to cooling, power conversion and the like.
PUE is calculated as total facility energy divided by IT equipment energy: PUE = total facility energy / IT equipment energy.
In-depth article: https://www.datacenterknowledge.com/sustainability/what-data-center-pue-defining-power-usage-effectiveness
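A quick sketch with made-up numbers to make the formula concrete:

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power over IT equipment power."""
    return total_facility_kw / it_equipment_kw

# Made-up example: facility draws 1500 kW, of which 1200 kW reaches the IT equipment.
print(pue(1500, 1200))  # 1.25 -> 25% overhead for cooling, power conversion, lighting, ...
```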
My two cents on the issue:
[It has become essential to think about waste energy, like over-cooled DC rooms. Today many run cold aisles at 30-40°C instead of the 20-25°C that was common in the past; this makes a huge impact on energy cost and efficiency, and it does actually lower power consumption if you have 80-95% efficient PSUs, fans and all the other kinds of equipment you see everywhere in a datacenter, over and over again, running continuously. Cooling in data-centers is more than just hot and cold aisles - it's becoming more and more frequent for bigger datacenter providers and cloud companies to use cold-air intake. If you have a DC in Norway that's really good in winter! Others use geothermal or waterways to cool. In past HPC engagements I've seen bigger sites use the thermal output of their TOP100 machine to heat the adjacent buildings on campus. It's a concept known as "green computing", but I never liked the term because for the most part cloud providers like AWS will do ANYTHING to save $$$ - "if it's good, that's a nice thing to market; if it's bad, well, let's just see it doesn't become a legal issue" - but credit where credit is due, most of these really smart energy efficiency schemes and green computing initiatives are run by engineers that do actually care]
- Carbon Footprint of Data Centers & Data Storage Per Country
- Measuring greenhouse gas emissions in data centres: the environmental impact of cloud computing
- IEA: Data Centres and Data Transmission Networks
- DCD: Meta report shows the company causes far more emissions than it can cover with renewable energy
- DCD: Even by cheating, Amazon can't look green
- AWS News Blog: New – Customer Carbon Footprint Tool
- Google Blog: Our commitment to climate-conscious data center cooling
- Google Blog: How Google's data centers help Europe meet its sustainability goals
- Microsoft Datacenters: Powering sustainable transformation
- Meta: 'Our Path to Net Zero'
- Microsoft Corporate Responsibility: 2022 Environmental Sustainability Report
- Data Center Facility - Sustainability Metrics
- Designing Datacenter Hardware for Environmental Sustainability
- Analyzing Building Elements Relating to Data Center Construction Emissions
- PANEL: Sustainability in Action & Future Direction for the Community
- https://sustainability.fb.com/data-centers/
- https://www.microsoft.com/en-us/sustainability/azure + https://azure.microsoft.com/en-us/explore/global-infrastructure/sustainability
- https://sustainability.google/operating-sustainably/ + https://www.google.com/about/datacenters/cleanenergy/
- https://cloud.google.com/sustainability
- https://www.opencompute.org/wiki/Hardware_Management/Hardware_Fault_Management
- Academic Background & Measurements:
- Health Check & Testing Frameworks:
- https://github.com/amd/Open-Field-Health-Check
- https://github.com/opendcdiag/opendcdiag
- https://github.com/google/openhtf
- https://github.com/NVIDIA/DCGM
- https://docs.openshift.com/container-platform/4.14/machine_management/deploying-machine-health-checks.html
- https://www.supermicro.com/en/solutions/management-software/superdoctor
- Stanford: Building Software Systems At Google and Lessons Learned
(early view of Google's software stack & improvements for a growing Hyperscale Cloud Service Provider [1998-2010])
- Spanner: Google's Globally Distributed Database - Talk - Video
- Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service - Talk - Video
https://engineering.fb.com/2014/11/14/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/
[Spine/Leaf Topologies, now de-facto standard among all major enterprise industry vendors]
- First gen: https://engineering.fb.com/2019/03/14/data-center-engineering/f16-minipack/
- Second gen: https://www.youtube.com/watch?v=eIA_0GnwVBo (with Vendor support and interest by Arista among others)
- https://research.google/pubs/b4-and-after-managing-hierarchy-partitioning-and-asymmetry-for-availability-and-scale-in-googles-software-defined-wan/
- https://research.google/pubs/jupiter-evolving-transforming-googles-datacenter-network-via-optical-circuit-switches-and-software-defined-networking/
- https://research.google/pubs/taking-the-edge-off-with-espresso-scale-reliability-and-programmability-for-global-internet-peering/ - https://dl.acm.org/doi/pdf/10.1145/3098822.3098854
- https://research.google/pubs/design-acceptance-and-capacity-of-subsea-open-cables/
- https://dl.acm.org/doi/10.1145/3603269.3604867
- https://storage.googleapis.com/gweb-research2023-media/pubtools/pdf/579082c35dd0a00578a334033e0162f328e8213d.pdf
- https://www.usenix.org/system/files/nsdi23-wang-weitao.pdf
- https://www.youtube.com/watch?v=NP2IrQrHZrY
- https://www.usenix.org/conference/atc22/presentation/xu
- https://arxiv.org/abs/2012.14219 (https://www.youtube.com/watch?v=fSIcPuuI6kk)
- https://research.google/pubs/technology-driven-highly-scalable-dragonfly-topology/
- https://www.usenix.org/system/files/nsdi22-paper-gibson.pdf
- https://research.google/pubs/minimal-rewiring-efficient-live-expansion-for-clos-data-center-networks/
- https://www.usenix.org/conference/atc23/presentation/al-fares
- https://www.usenix.org/publications/loginonline/beyondcorp-and-long-tail-zero-trust
- https://conferences.sigcomm.org/hotnets/2023/papers/hotnets23_mogul.pdf
- https://storage.googleapis.com/gweb-research2023-media/pubtools/pdf/e63dac4b63a6a0e66d388990028fd221a3d4da2f.pdf
- What NVMe™/TCP Means for Networked Storage
- NVMe/TCP is Here for All of Your Hyperscale Storage Needs
- NVMe/TCP: Performance, Deployment and Automation
- Scaling NVMe over IP Fabric Security
- https://sonicfoundation.dev/
- https://opencomputeproject.github.io/onie/
- https://frrouting.org/
- SwitchV: Automated SDN Switch Validation with P4 Model (Talk - Video)
(as well as parts of Arista EOS, Juniper JunOS, etc.)
- OCPUS18 – Open Optical Packet Transponder Leveraging OCP Networking Technology
- Open optical communication systems at a hyperscale operator
- Open and disaggregated optical transport networks for data center interconnects
Open-hardware & open-source generic solutions for DCs:
- https://www.meinbergglobal.com/english/products/grandmaster-clocks.htm
- https://store.timebeat.app/products/g0kk-1-mini
- https://safran-navigation-timing.com/product/securesync-time-and-frequency-reference-system/
- https://www.acquitek.com/product/meridian-2-2/
- https://safran-navigation-timing.com/product/gnssource-2500/
Signal Distribution (typically a pulse or frequency reference, e.g. 1PPS or 10 MHz, supported by routers/switches and telco gear)
- https://www.meinbergglobal.com/english/products/redundant-frequency-sine-distribution-unit.htm
- https://novuspower.com/catalog/item/10-channel-frequency-reference-distribution-amplifier/
- https://rts.as/product/rts-pps-distribution-panel/
- https://www.acquitek.com/product/csda-1-2/
[Support fault-tolerant, highly-available NTP and PTP. Realistically you need to support both if you're not just running a few rented racks. Many routers, switches and telco gear (4G, 5G, ...) require PTP, while NTP is for servers and VMs]
https://developers.google.com/time/smear
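The linked page describes a 24-hour linear smear (noon to noon around the leap second) so clients never see 23:59:60. Below is a rough sketch of that idea - not Google's implementation, and the date/window handling is simplified:

```python
from datetime import datetime, timezone

# Rough sketch of a 24-hour linear leap smear for an inserted leap second.
# Window and date are illustrative (the 2016-12-31 leap second); real smearing
# happens inside the NTP servers, not via client-side arithmetic like this.
WINDOW = 86400.0  # seconds: smear runs from noon before the leap to noon after
SMEAR_START = datetime(2016, 12, 31, 12, 0, tzinfo=timezone.utc).timestamp()

def smear_offset(unix_time: float) -> float:
    """Seconds the smeared clock lags a non-smeared one at `unix_time`.

    Ramps linearly from 0 at the start of the window to one full second at the
    end, i.e. the clock runs ~11.6 ppm slow instead of jumping at 23:59:60.
    """
    elapsed = unix_time - SMEAR_START
    return min(max(elapsed / WINDOW, 0.0), 1.0)

# Halfway through the window the smeared clock is half a second behind.
print(smear_offset(SMEAR_START + WINDOW / 2))  # 0.5
```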
- https://cloud.google.com/kms/docs
- https://aws.amazon.com/kms/
- https://github.com/usbarmory/usbarmory
https://www.youtube.com/watch?v=ZoTg-wwZ6Yw
(most importantly: write a good post mortem in the first place; provide a concise timeline, response times, things that went well & things that didn't. Provide a central place for customers and affected parties to call in - i.e. a "war room" - so your engineers can do their work without their mobile ringing every 20 seconds)
https://github.com/aphyr/partitions-post/blob/master/README.markdown
(this is a true gem that was passed around among SRE & software engineers some years back. partitions do exist.)
https://github.com/danluu/post-mortems
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
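If you want to get a feel for the discipline, the smallest possible experiment is: state a steady-state hypothesis, inject exactly one fault, and check the hypothesis still holds. A toy sketch along those lines follows - the service URL is a placeholder, and killing a random Docker container stands in for whatever fault actually makes sense in your stack:

```python
#!/usr/bin/env python3
# Toy chaos experiment: hypothesis = "the service keeps answering while one
# random container is down". URL and docker-based fault injection are
# placeholders; real experiments need blast-radius limits and rollback.
import random
import subprocess
import urllib.request

SERVICE_URL = "http://service.example.internal/healthz"  # hypothetical endpoint

def service_ok() -> bool:
    try:
        with urllib.request.urlopen(SERVICE_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

# Steady state must hold before injecting anything.
assert service_ok(), "no point injecting faults into an already-broken system"

# Fault injection: kill one random running container.
containers = subprocess.run(["docker", "ps", "-q"], capture_output=True,
                            text=True, check=True).stdout.split()
victim = random.choice(containers)
subprocess.run(["docker", "kill", victim], check=True)
print(f"killed container {victim}, re-checking steady state ...")

# Hypothesis check: the system as a whole should still be serving.
print("hypothesis holds" if service_ok() else "hypothesis violated - investigate")
```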