
@scyto
Last active November 26, 2024 22:13
My Docker Swarm Architecture

This (and related gists) captures how I created my Docker Swarm architecture. It is intended mostly as my own notes in case I need to re-create anything later! As such, expect some typos and possibly even an error...

Installation Step-by-Step

Each major task has its own gist; this is to help with long-term maintainability.

  1. Install Debian VM for each Docker host
  2. Install Docker
  3. Configure Docker Swarm
  4. Install Portainer
  5. Install keepalived
  6. GlusterFS disk prep, install & config
  7. GlusterFS plugin for Docker (optional)
  8. Example stack templates

More Details on What and Why

Design goals:

  • ensure every container stays running if any one of the following fails (a VM, a hypervisor, a Docker service)
  • remove the chance of black-holed requests (i.e. eliminate the use of DNS round robin to address the services)
  • enable replicated state so any container can start on any single swarm node, fail over between nodes, and still see the data it needs
  • provide a safe replicated shared volume across all nodes, so state is accessible from every node and databases like MariaDB (which will eventually corrupt if placed on NFS or CIFS/SMB shares across the network) can live on it
  • make it easy to back up with my Synology (this model lets me back up easily using Active Backup for Business)

Current state 9/30/2024

  • all still working
  • all running now on top of my Proxmox cluster (see here)
  • only issue is the fragile GlusterFS plugin, which needs me to re-enable it when the interface on the hosts becomes available again (e.g. after upgrading my UniFi switch)

Current state 8/26/2023

  • all seems to be functioning nearly a year later
  • I switched fully from a native nginx container to NPM (Nginx Proxy Manager)
  • I eliminated NFS and iSCSI and moved all containers with state onto GlusterFS, including things with databases like WordPress
  • I plan to move the VMs from Hyper-V to my new Proxmox cluster

Architecture

[architecture diagram]

Design Assumptions

  • I wanted to continue to use Docker, docker-compose, Docker Swarm & Portainer due to existing skills
  • I have no interest at this time in k8s (I don't use it at work and never will)
  • Start simple, even if that means I do things I shouldn't (this is just a home network)
  • This is small; the containers include an nginx reverse proxy, oauth2-proxy, a WordPress site + database, mqtt, upoller, and Cloudflare DDNS - so bear in mind this isn't designed for huge throughput or scale, it's designed for some resiliency.
  • I want to deploy all services (containers) with stack templates and possibly contribute back to the Portainer template repo
  • The clustered file system must support databases (like MariaDB) stored on it

Design Decisions

  • Debian for my Docker host VMs - I seem to gel with Debian, and it (and other Debian derivatives) seems to play nice with most containers
  • I will only use package versions included in the Debian distro (bullseye stable)
  • I chose GlusterFS as my clustered, replicated file system
  • Gluster volumes will be deployed in dispersed mode
  • I mapped separate VHDs into the Docker hosts, one for the OS and one for Gluster - this is to prevent the risk of infinite boot loops
  • My Gluster service will be installed on the Docker host VMs. Best practice dictates they should be separate VMs for scale, but as all VMs share the same host CPU this really gives no benefit. If this turns out to be a bad decision I will change it.
  • I won't tear down my current NFS and iSCSI mapped volumes (not shown) until GlusterFS has been shown to run OK and survive reboots etc.

A note on Docker Swarm and state (assumes you know Docker already)

Docker containers are ephemeral and generally lose all their data when they are stopped. For most containers there is some level of configuration state you need to pass in (variables, files, folders of data). Similarly, many containers want to persist data state (databases, files, etc.).

On single-node Docker most people map a directory or file on the host into the container as a volume or bind mount. We also see the following more advanced techniques used:

  1. mounting a shared CIFS or NFS volume at boot time on the Docker hosts
  2. defining a CIFS volume and mapping it into the container at runtime (this avoids editing fstab on the host)
  3. the same as above but with NFS
  4. using configs - if you have just a single, read-only config file that needs to be read, it can be defined as a swarm config

In a swarm, where you want a container to be able to run on any node, you need to find a way to make the data available on all nodes in a safe, effective way.

If you have a simple container that only needs environment variables to be configured, you can set those directly when you deploy the Portainer template as a Portainer stack. See this Cloudflare dynamic DNS updater as an example.
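
As a rough illustration only (the image name and variable names below are placeholders, not the exact stack from that gist), an environment-variable-only service in a swarm stack file looks something like this:

    version: "3.8"
    services:
      ddns:
        image: example/cloudflare-ddns:latest   # placeholder image name
        environment:
          - CF_API_TOKEN=changeme               # placeholder variable names
          - CF_DOMAIN=example.com
        deploy:
          mode: replicated
          replicas: 1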

  • Only #4 offers a safe way to make this happen (the 'config' is available to all nodes) - but it is super restrictive and doesn't help with containers that need to store more state and read/write that state. See this Mosquitto MQTT example.
  • #1 can work, and you can mount the shares on multiple nodes via fstab. However, databases typically cannot be placed on these shares and will ultimately corrupt. You also have to be careful to only have one container writing to any given file to avoid potential issues.
  • #2 and #3 have the advantage that the share is not generally mounted on the host OS but mounted on demand by the container; this avoids all the tedious mucking about in fstab. You do need to use the volumes UI in Portainer for this - see the sketch after this list.
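
For reference, a minimal sketch of #3 - an NFS volume declared in the stack file so each node mounts the share on demand when a task lands on it (the server address, export path and service image below are placeholders):

    version: "3.8"
    services:
      web:
        image: nginx:alpine                     # placeholder service
        volumes:
          - webdata:/usr/share/nginx/html
    volumes:
      webdata:
        driver: local
        driver_opts:
          type: nfs
          o: "addr=192.168.1.10,rw,nfsvers=4"   # placeholder NAS address
          device: ":/volume1/docker/webdata"    # placeholder export path
    # for #2 (CIFS) the driver_opts change to type: cifs, a //server/share
    # device, and username/password in the o: options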

And for most folks, NFS/CIFS shares are not replicated for high availability.

This is why, in this architecture, I have chosen to see if I can overcome these limitations using GlusterFS.
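
To make the idea concrete, here is a minimal sketch (not a copy of any of my real stacks), assuming the Gluster volume is mounted at /mnt/glusterfs (a placeholder path) on every swarm node - the GlusterFS gist covers getting it mounted:

    version: "3.8"
    services:
      wordpress:
        image: wordpress:latest
        volumes:
          # /mnt/glusterfs is an assumed mount point; because the gluster
          # volume is mounted at the same path on every node, this bind
          # mount works whichever node the task is scheduled on
          - /mnt/glusterfs/wordpress:/var/www/html
        deploy:
          mode: replicated
          replicas: 1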

@ociotec

ociotec commented Sep 3, 2022

Hi, did you consider CephFS as a shared storage solution?

@scyto

scyto commented Sep 3, 2022

Hi, did you consider CephFS as a shared storage solution?

yes, I hadn't used either; I looked for 1) the simplest tutorial I could find and 2) something that would let me have one VHD disk per node that I could dedicate to the clustered storage

I chose GlusterFS just because it looked simpler - I might look at Ceph down the line; for now this has been rock solid (famous last words, I am sure, lol)

@BritHefty

This is quite literally the same path I'm learning my way through. I had recommendations to use Ceph but didn't have raw disks to throw at it; Gluster worked as it slots into the existing filesystem. keepalived is the MVP for handling inbound requests to anywhere in the cluster. Thank you for putting this together.

@scyto

scyto commented Mar 15, 2023

You are welcome! How did your build-out go in the end?

@v-bulynkin

Thank you for the description! If you use Hyper-V, why not use a VHD Set instead of GlusterFS? Will you extend your cluster outside the hypervisor?

@scyto

scyto commented Jun 13, 2023

Thank you for the description! If you use Hyper-V, why not use a VHD Set instead of GlusterFS? Will you extend your cluster outside the hypervisor?

I have 3 NUC nodes as my Hyper-V machines. They are not Windows clustered. I want no single point of failure, so I created one VHD per NUC for the Gluster volume. If there is a better way to create a replicated shared volume across 3 physical machines I am definitely interested. (I don't want the shared volume to be stored on a single NAS.)

@v-bulynkin

Thanks, I see. I've come across many articles and posts which do not recommend using GlusterFS due to poor performance/instability. They recommend Ceph, etc.
Have you measured the performance? Is it stable enough?

@ociotec

ociotec commented Jun 13, 2023

A happy combo is using CephFS mounted on the Docker Swarm nodes to share data in RW mode.

Ceph & CephFS are quite overwhelming to configure, but Proxmox VE (replacing the VMware hypervisor) lets you set them up & monitor them easily.

Indeed, Proxmox “replaces” vSphere, vCenter, vSAN (Ceph & CephFS), Veeam (backups)…

@scyto

scyto commented Aug 16, 2023

A happy combo is using CephFS mounted on the Docker Swarm nodes to share data in RW mode.

Ceph & CephFS are quite overwhelming to configure, but Proxmox VE (replacing the VMware hypervisor) lets you set them up & monitor them easily.

Indeed, Proxmox “replaces” vSphere, vCenter, vSAN (Ceph & CephFS), Veeam (backups)…

I will give it a go. I just got 3 new NUCs for Proxmox (so I don't have to touch my running cluster) - wow, it seems fragile-AF (networking); basic comms don't seem to work properly on anything but the default bridge.... I am on my 3rd reinstall of the entire cluster + Ceph to ensure I build up enough understanding to know whether it is me that's the issue or v8 just isn't baked enough yet...

@scyto

scyto commented Aug 27, 2023

A happy combo is using CephFS mounted on the Docker Swarm nodes to share data in RW mode.

Ceph & CephFS are quite overwhelming to configure, but Proxmox VE (replacing the VMware hypervisor) lets you set them up & monitor them easily.

Indeed, Proxmox “replaces” vSphere, vCenter, vSAN (Ceph & CephFS), Veeam (backups)…

I got this far with my Proxmox cluster.
How do I get Docker in a VM on Proxmox to see my CephFS?

@ociotec

ociotec commented Aug 27, 2023

Try this guide https://drupal.star.bnl.gov/STAR/blog/mpoat/how-mount-cephfs

@scyto

scyto commented Sep 8, 2023

Try this guide https://drupal.star.bnl.gov/STAR/blog/mpoat/how-mount-cephfs
Thanks

I have Ceph up and running on the Proxmox host nodes and mounted; I store VM disks on it.
I have a CephFS pool and assume this would be the ideal place to store the ~50GB of bind mounts for each container.

I am struggling to understand, but let me try: are you saying the Debian Docker VMs should have Ceph installed on them and then mount CephFS over the network?

AKA:
Docker VM1 uses automount to mount a Ceph network target on host1
Docker VM2 uses automount to mount a Ceph network target on host2
etc.

and then I pin VM1 to host1 (never let it migrate), VM2 to host2 (etc.)

Lastly, I don't get why I would use automount and not do this directly via fstab on the Debian Docker VM?
https://docs.ceph.com/en/nautilus/cephfs/fstab/

@ociotec

ociotec commented Sep 9, 2023

Try this guide https://drupal.star.bnl.gov/STAR/blog/mpoat/how-mount-cephfs
Thanks

I have Ceph up and running on the Proxmox host nodes and mounted; I store VM disks on it. I have a CephFS pool and assume this would be the ideal place to store the ~50GB of bind mounts for each container.

I am struggling to understand, but let me try: are you saying the Debian Docker VMs should have Ceph installed on them and then mount CephFS over the network?

AKA: Docker VM1 uses automount to mount a Ceph network target on host1, Docker VM2 uses automount to mount a Ceph network target on host2, etc.

and then I pin VM1 to host1 (never let it migrate), VM2 to host2 (etc.)

Lastly, I don't get why I would use automount and not do this directly via fstab on the Debian Docker VM? https://docs.ceph.com/en/nautilus/cephfs/fstab/

No, you should mount (with autofs or fstab) CephFS directly, not via a host-mounted path… going via the host is not redundant and you add another unneeded piece in the middle.

@Drallas

Drallas commented Sep 21, 2023

Thanks for sharing all this.

I noticed that the shepherd link (to update swarm images) doesn't work (404), though the Gist can be found in your library.

@scyto

scyto commented Sep 22, 2023

@Drallas typo in URL fixed, thanks for flagging.

And I am glad it was useful to someone else - I do this mainly for myself; making it public means it has to be higher quality than my normal notes and means it's actually useful 2 years later when I need to remember WTF I did :-)

@scyto

scyto commented Sep 22, 2023

No, you should mount (with autofs or fstab) CephFS directly, not via a host-mounted path… going via the host is not redundant and you add another unneeded piece in the middle.

TBH you have lost me, and so did that guide. I always use fstab to mount volumes, so I'm not sure what you are telling me to do.
Also, if you use something like ceph -fstype=ceph,name=cephfs,secretfile=/etc/ceph/client.cephfs,noatime cephmon01.starp.bnl.gov:6789:/ isn't this a single point of failure - what happens if cephmon01 goes down?

Maybe I am not clear enough on my architecture, let me try.

Proxmox setup:
I have 3 Proxmox nodes (pve1, pve2, pve3)
each node runs VMs on top of that
each node is a Ceph host (this is a homelab)

Current swarm:
3 VMs (running on Hyper-V today), each with an OS VHD and a Gluster VHD
each VM participates in a dispersed Gluster volume
as such, a file written by any one node is replicated to each of the other nodes

It is these VMs I want to move to Proxmox.

Based on your suggestion I am exploring Ceph to provide the same for the Docker VMs;
the question is how the Docker VMs would access Ceph.

The only way I see to do that is to have an fstab entry like ceph -fstype=ceph,name=cephfs,secretfile=/etc/ceph/client.cephfs,noatime pve1.mydomain.com:6789:/ - but what if the Ceph OSD / monitor / daemon goes down on pve1? How can the VM on node pve1 still access Ceph via one of the other nodes?

TBH it seems far simpler to either:

  1. expose a CephFS up into the VM using virtioFS
  2. continue using Gluster inside the VM on its own virtual disk - this approach has worked perfectly since I built this....

What am I missing?

@Drallas

Drallas commented Sep 22, 2023

@Drallas typo in URL fixed, thanks for flagging.

And I am glad it was useful to someone else - I do this mainly for myself; making it public means it has to be higher quality than my normal notes and means it's actually useful 2 years later when I need to remember WTF I did :-)

Your Gists are very helpful to me, especially as a reference to validate my own assumptions.
It's nice to discover 'better ways', like exposing a CephFS up into the VM using virtioFS, which I read about in the previous comment. If this works reliably I probably don't need LXC containers anymore for Docker Swarm and can move that into proper VMs.

BTW, I mainly use Notion for the 'remember WTF I did' part 😀.

But you inspired me to take some of those, like Write-up: Docker Swarm in LXC Containers and share them too.

@Drallas

Drallas commented Sep 22, 2023

@scyto 'expose a cephFS up into the VM using virtioFS'

Do you have a write-up / notes on that? Or perhaps some useful links.

Found some:
https://blog.domainmess.org/post/virtiofs/
https://forum.proxmox.com/threads/virtiofsd-in-pve-8-0-x.130531/

But still trying to figure it out.

@alkajazz

How do you handle read/write databases with GlusterFS?

@scyto

scyto commented Sep 23, 2023

How do you handle read/write databases with GlusterFS?

This is my strategy:

  • I used dispersed, not replicated, volumes.
  • I only ever allow a service with a database to start on one node.
  • I only ever allow a service that accesses a database to start on one node.
  • I only ever have one service access a given database (this is fine in a home lab - if I wanted more services to access the same DB I would implement a true HA replicated DB service using the native HA/replication the DB offers).

In terms of implementation, any service where I need this constraint gets this in its service definition:

    deploy:
      mode: replicated
      replicas: 1    
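      # replicated mode with replicas: 1 means swarm only ever schedules one
      # task for this service at a time, so only a single container touches
      # the database files; if its node dies the task is rescheduled on
      # another node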

I went with Gluster because every database person says storing a database on NFS or CIFS and accessing it over the network will eventually result in corruption. So I didn't want to store the databases on my Synology, and I also wanted to protect the databases from the loss of one node.

I have only seen an issue once (touch wood), and that was with the Nginx Proxy Manager MariaDB - it happened when I hard-killed a node by accident, so it wasn't a Gluster issue (I didn't need to fix Gluster) and it was fixed by running normal MariaDB repair commands.

I have no complaints with Gluster - it's certainly much simpler to set up than Ceph and less fragile when messing with networks. As the Docker swarm is currently being moved from Hyper-V to Proxmox I will revisit the choice of file system... If you are interested you can watch my progress in real time as I note it here: https://gist.github.com/scyto/042c8c41b23bd5ddb31d1e4e38156dff

@scyto

scyto commented Sep 23, 2023

Your Gists are very helpful to me, especially as a reference to validate my own assumptions.

Glad they are of use to folks; the more we all share and see different ways of doing things, the more we all learn IMO.

I only assert 'I did it this way', never 'my way is the right way' - but I do like to learn the root pros/cons of the way I did it, to figure out if there are better ways OR if the different ways are just a matter of preference (too many arguments on the interwebs are people arguing about preferences - which is insane to me; that's like arguing whether strawberry or chocolate ice cream is best, missing the point that all ice cream is awesome [for the record it's strawberry, of course]).

And really I write these like this and make them public because it forces me to make a half-decent record of what I did. For example, I recently came back to the Gluster setup when I started to think about how to expand it to a 4th node, and I realized that because it is a dispersed volume I can't - but because I wrote it up, it was easy to figure out how to create a new 4-node dispersed volume and migrate...

I wish I had written up my steps for running Windows Hello for Business at home (no one else in the world should do this); it was one of the hardest things I have ever set up in 30 years of doing systems stuff.

@Drallas

Drallas commented Sep 24, 2023

And really I write these like this and make them public because it forces me to make a half-decent record of what I did.

Yup, making the notes public forces one to think. Often it’s helpful for others too, comparing notes and discovering different ways to do the same thing.

I find ‘how to do stuff’ is often poorly documented or ‘all over the place’; I'm writing it down to help myself, and to help others avoid wasting precious time.

We should not waste our time on something trivial that someone else could have properly documented while they were at it!

Instead we should use our (collective) time to solve unsolved problems, so that we get even better software and platforms to play with, or to go out and eat ice cream! 😋

@scyto

scyto commented Oct 3, 2023

Do you have a write-up / notes on that? Or perhaps some useful links.

no, but this guy called Drallas did and does

:-p

@Drallas

Drallas commented Oct 3, 2023

Do you have a write-up / notes on that? Or perhaps some useful links.

no, but this guy called Drallas did and does

:-p

😊👊🏻
