@nponeccop
Last active January 12, 2016 17:10
Proposed Components

Minimal Cloud Management Interface

We assume an OpenVZ scenario, so chroot-level isolation is used.

The images are self-contained: they include enough metadata for starting and configuration. No incremental imaging or persistence is necessary for a minimal production-ready run. Stopping can delete the chroot as well, and packed images can be deleted right after unpacking.

No provision for per-deployment configuration is required for initial operations. If the same image must be deployed with slightly different configurations, prepare several distinct images instead.

A rate-limited infinite auto-restart is a reasonable default and is enough for initial production operations.

Unless minimalism is desired, bsdtar/libarchive can be used for unified support of many image formats.

cloud start

cloud start 1.2.3.4:5678 bar/foo.txz deploys the image file bar/foo.txz to the host at 1.2.3.4:5678, placing it under /containers/foo.txz.

cloud stop

cloud stop 1.2.3.4:5678 foo.txz stops the container under /containers/foo.txz at 1.2.3.4:5678.
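A minimal sketch of this two-command interface in Haskell (my own illustration; the names Cmd, parseCmd and targetPath are not from the proposal):

```haskell
-- Hypothetical sketch: parsing the two cloud commands and computing the
-- server-side path an image ends up under.
data Cmd = Start String FilePath | Stop String FilePath
  deriving (Eq, Show)

-- argv after the program name, e.g. ["start", "1.2.3.4:5678", "bar/foo.txz"]
parseCmd :: [String] -> Maybe Cmd
parseCmd ["start", addr, img] = Just (Start addr img)
parseCmd ["stop",  addr, img] = Just (Stop  addr img)
parseCmd _                    = Nothing

-- every image is stored under /containers/<basename of the image file>
targetPath :: FilePath -> FilePath
targetPath img = "/containers/" ++ reverse (takeWhile (/= '/') (reverse img))
```

For example, targetPath "bar/foo.txz" yields "/containers/foo.txz".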

Iteration 1

  • client-side /bin/cloud script - accepts the start/stop commands above
  • server-side Erlang daemon - accepts connections from /bin/cloud, receives the images and writes them to files, then runs the /bin/cloud-runner helper script with two arguments: start/stop and the basename of the image file as passed to the cloud command.
  • authentication and transport security are not required - ssh or tor can provide secure TCP tunnels.

Container runtime

A private single-tenant cloud spanning VPSes and public clouds:

  • scripted re-deployment/updates
  • scripted resource elasticity
  • secure monitoring
  • load balancing and fault tolerance (optional)

CoreOS replacement

Supports a wider range of VPSes and applications:

  • more secure
  • runs on OpenVZ (in a limited chroot mode)
  • runs via PyGrub
  • runs via kexec
  • runs within a single partition
  • installs off CentOS/Debian/Ubuntu
  • smaller images (optional)
  • smaller system footprint (optional)

Docker Build replacement

  • builds images without build dependencies (optional)
  • compare images and generate deltas (optional)

Docker Registry replacement

The Registry ceases to exist as a separate exposed component; it is integrated into the Fleet master instead.

  • images are pushed from registry to the slave
  • there is no way to hack a registry by connecting to it and gain access to all images
  • there is no way to know the registry location
  • multiple architectures (optional)

Flannel Replacement

  • a Darknet Connector - allows hiding the master's IPs from the slaves
  • non-star topology (optional)

Etcd Removal

Etcd ceases to exist as a separate exposed component; it is integrated into the Fleet master instead.

  • etcd is not exposed at all, not even over a darknet - it lives fully within the master
  • since it lives only inside the master, it can legitimately be a single point of failure

Fleet replacement

  • master connects to slaves over a darknet only when changes are necessary
  • better control over which nodes support which images ("metadata" and "global units" in fleet terms)
  • slaves are designed not to break during partitions from the master but to keep their current configuration

Locksmith removal

  • updates are actively controlled by Fleet Master

Monitoring (optional)

  • as slaves cannot talk to etcd anymore, all coordination is performed via application-specific protocols
  • however, overall node health reports independent of application are beneficial
  • a separate component that reports image status and resource consumption
  • could record metrics every minute but be fetched by the Master only once an hour
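The record-per-minute, fetch-per-hour scheme can be sketched in Haskell (my own illustration; the names Window, record and fetch are made up):

```haskell
-- Hypothetical sketch: a bounded metrics window kept by the monitoring
-- component between the master's hourly fetches.
type Window a = [a]  -- newest sample first

-- append one per-minute sample, dropping the oldest beyond the capacity
record :: Int -> a -> Window a -> Window a
record cap sample w = take cap (sample : w)

-- the master's hourly fetch drains the window, oldest sample first
fetch :: Window a -> ([a], Window a)
fetch w = (reverse w, [])
```

The bound keeps the slave's memory use constant even if the master stays partitioned away for a long time.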

Message Queue

It is more a high-level middleware for distributed DAG traversal than a message queue.

CiteSeerX Analyzer example

You develop a web site that lets users enter a CiteSeerX documentId and see statistics about its reference DAG - e.g. the total number of unique articles it refers to, recursively.

To achieve that, when the user enters a documentId, you use the CiteSeerX API to fetch the ids the document refers to, and so on recursively.

To avoid blocking, you:

  1. use a network of geographically distributed worker servers
  2. cache the results so each id is queried only once

So you consume a realtime sequence of documentIds entered by users and produce a realtime sequence of total reference counts. As replies generally arrive out of order, the operation is essentially

runner :: [a] -> [(a, b)]

Log Analyzer

You develop a log analyzer. The log contains host names. You resolve the host names and use a GeoIP database to highlight host names resolving to IPs from Iraq.

As the log is huge, you don't want to overload your ISP's DNS servers, overload the origin DNS servers, or flood your ISP's network with UDP packets (which is generally bad, as UDP has no backpressure). So you:

  1. use a network of geographically distributed worker servers
  2. cache the results so each hostname is queried only once

Note that DNS lookup is recursive, and you need to cache intermediate results too.

Web Crawler

Same idea: recursive queries, distributed slaves, caching.

Semi-formalized algorithm in Haskell

The code below has no caching; caching is basically memoization of worker:

worker :: Ord a => a -> [(a, b)]

runner :: Ord a => [a] -> M.Map a b
runner = M.fromList . deepWorker

deepWorker :: Ord a => [a] -> [(a, b)]
deepWorker [] = []
deepWorker (h:t) = worker h ++ deepWorker t
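Adding the caching, a runnable toy instance (my own, specialized to the CiteSeerX example: worker here returns just the ids a document cites, and the refs graph is made up):

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

-- a made-up reference DAG: documentId -> ids it cites
refs :: M.Map Int [Int]
refs = M.fromList [(1, [2, 3]), (2, [3]), (3, [])]

-- the worker query for one id (purely local here; remote in the real system)
worker :: Int -> [Int]
worker k = M.findWithDefault [] k refs

-- cached traversal: each id is queried at most once
deepWorker :: [Int] -> S.Set Int -> S.Set Int
deepWorker [] seen = seen
deepWorker (h:t) seen
  | h `S.member` seen = deepWorker t seen
  | otherwise         = deepWorker (worker h ++ t) (S.insert h seen)

-- total number of unique articles document d refers to, recursively
totalRefs :: Int -> Int
totalRefs d = S.size (deepWorker (worker d) S.empty)
```

Here totalRefs 1 is 2 (articles 2 and 3); in the real system worker would be a call dispatched to a worker server and the Set a shared cache.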

PID 1 reimplementation

TL;DR

A custom statically built PID 1 will work everywhere, including the lowest-grade OpenVZ VPSes. So we can implement our cloud controller as PID 1 (i.e. as an /sbin/init or systemd replacement) and avoid the usual bloat. The system management interface will be exposed as a Tor hidden service and used to push container images, start containers, and perform the other container management operations mentioned in [containers.md].

We can remove all or almost all of the usual UNIX management tool set from the production system. Ideally the system will consist of a single /sbin/init executable, a state database holding configuration data and cryptographic key material, and a set of container file systems.

The containers will be running as chroots (the only level of sub-container isolation available on OpenVZ) or at higher isolation levels if the VPS provider allows that. For example, Xen and KVM readily allow Docker, and some KVM providers allow fully nested KVMs (i.e. different kernels) within.

Why PID 1

I consider the following minimal scenario for my application: a C++ application and two node.js applications deployed from images to an OpenVZ host.

PID 1 is the lowest-level facility under our control there. So we can either leave it as is and run under it, or replace it.

If we leave it, it's basically SysVInit, which is bloated and shell-based. Many OpenVZ providers run very old (albeit patched for vulnerabilities) kernels that cannot run systemd.

If we only utilize PID 1 and inittab, it is not really much different from having our own PID 1.

Note that if we need it, we can have an almost normal shell using busybox, or even busybox+dropbear, in a chroot container deployed the same way as the application containers.

File system contents

We need to store:

  • /sbin/init
  • /chroots/app1/*
  • /chroots/app2/*
  • /chroots/app3/*
  • Tor hidden service private key
  • a few essential /proc, /dev, /sys and /etc pieces

/sbin/init can be self-contained (e.g. with a statically built syscall interface using ghc-musl). One open question is whether we can have a reasonable feature set without relying on shared .so libraries implemented in C.

A minimal node.js container seems to be around 5 MB even with normal glibc inside the chroot, so the total disk footprint will never approach the 300+ MB of an empty CoreOS (which is actually around 1 GB for various reasons). And tinycore relies heavily on kernel functionality we lack (e.g. compressed file systems) to stay small.

Target Metrics

Disk, CPU, memory and bandwidth usage are non-issues today. The smallest VPS you can get for $3 has about 256 MB RAM, 5 GB HDD and 300 GB of traffic, and even the bloated OSes use about 64-96 MB of RAM, i.e. a 25-37% overhead on the tiniest systems, which is acceptable.

So we should focus on reducing the important metrics instead:

  • Overall system complexity and manageability
  • Deployment Time
  • Reproducible deployment
  • Hosting costs
  • Elasticity
  • Disaster recovery

Storage Subsystem

First we need to show, via microbenchmarks, that the existing storage subsystems are junk built by incompetents, and that better characteristics are achievable already at the PoC level. My workload is mostly analytics, so we can focus on gets and ignore puts for now. That is, show that there are no adequate databases for read-only operation.

Hypothesized problems:

  • sequential access to storage devices
  • sequential index traversal
  • high kernel overhead
  • insufficient buffering
  • high storage overhead compared to text files with manual packing
  • insufficient IO coordination between parallel independent tasks

Sequential access to storage devices

The idea here is that a map-reduce-style traversal will beat sequential processing in a loop, even on a single drive and even on an SSD, because NCQ compensates for latency.

Two benchmarks:

  • show that IOPS grows with queue depth across a wide class of drives
  • show that the deeper queue is achievable when working with a key-value store (i.e. that typical passes admit parallelization)

Sequential index traversal

A multi-get is faster than a sequence of gets.

Again, two benchmarks:

  • show that a multi-get can be faster than both sequential and parallel individual gets, thanks to different access algorithms
  • show that typical passes reduce to multi-gets
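The multi-get claim can be sketched in Haskell (my own illustration, not from this document): against sorted data, a batch of sorted keys can be answered in one merge-like pass instead of one independent lookup per key.

```haskell
-- Hypothetical sketch: answer an ascending batch of keys with a single pass
-- over an ascending (key, value) list.
multiGet :: Ord k => [k] -> [(k, v)] -> [(k, v)]
multiGet [] _  = []
multiGet _  [] = []
multiGet kk@(k:ks) ss@((k', v) : rest)
  | k' < k    = multiGet kk rest          -- skip stored keys below the query
  | k' == k   = (k, v) : multiGet ks rest -- hit: emit and advance both sides
  | otherwise = multiGet ks ss            -- query key absent: drop it
```

For example, multiGet [2, 5] [(1, "a"), (2, "b"), (5, "c")] returns [(2, "b"), (5, "c")] while touching each stored entry at most once.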

High kernel overhead

The idea here: even if our task is disk-bound, reducing CPU utilization still makes sense, because a CPU-bound task can be run in parallel. That is, we don't have dedicated machines but rather a kind of cluster running microservices, so free resources are not necessarily idle ones.

Applied to databases, this means that shaving off CPU percentage points and context switches is worthwhile even when it doesn't affect a pass's wall-clock time. On Windows, for example, IOCP, larger buffers and all kinds of scatter-gather IO help here. Besides, system software is written once and used many times.

Benchmark:

  • show that FILE*-in-a-thread-pool sucks (on old Windows kernels and libc it definitely did)