S3 Storage Goodness in K8S

In my storage quests, I finally decided I want to lazily use S3 for ReadWriteMany and to do some experiments with it.

There are a few options, but to save you some time if you just want to know what I landed on: I like csi-s3.

S3FS Mounted in Pod Containers

Well... this works great! The only problem was that it needed security privileges for mounting. That would be terrible if a container with this power got compromised, so I immediately moved on to getting this a layer away from being managed in-pod.
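
For reference, this is roughly the shape of the pod spec involved; a sketch only, with a placeholder image, bucket, endpoint, and credentials. The securityContext is the part that bothered me.

apiVersion: v1
kind: Pod
metadata:
  name: s3fs-in-pod
spec:
  containers:
    - name: app
      # placeholder image; assumes s3fs-fuse is baked in or installed at startup
      image: example.com/app-with-s3fs:latest
      command: ["sh", "-c"]
      args:
        - |
          echo "ACCESS_KEY:SECRET_KEY" > /etc/passwd-s3fs && chmod 600 /etc/passwd-s3fs
          mkdir -p /mnt/s3
          s3fs example-bucket /mnt/s3 -o url=https://s3.example.com
          exec sleep infinity
      securityContext:
        # the uncomfortable part: FUSE mounting needs privileges
        # (privileged, or at minimum SYS_ADMIN plus access to /dev/fuse)
        privileged: true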

NFS Provisioner with Mounted S3

My initial plan was to just use the nfs-subdir-external-provisioner on top of a multi-replica, S3-backed deployment of NFS Ganesha.

When running time echo hi > /mnt-path/hello.txt against s3fs directly and against NFS Ganesha, I found roughly 0.5 seconds before NFS Ganesha completed its work, whereas s3fs directly was responsive enough that time reported 0.000. That alone was a big turn-off for me.
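
The test itself was nothing fancier than this (the mount paths here are placeholders; the timings are just what I observed on my setup):

time echo hi > /mnt/s3fs/hello.txt       # s3fs directly: time reported 0.000
time echo hi > /mnt/ganesha/hello.txt    # through NFS Ganesha: roughly half a second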

So I moved on to trying the in-kernel NFS implementation. Admittedly, I have no clue why, but this defeated me. I couldn't win, and this is something I've done professionally for Fortune 500 companies for half a decade on RHEL-based systems. This experiment never made it past testing in plain Docker containers.

I had showmount -e showing my exports, and I even had them wide open to the world with a wildcard. Any time I went to mount -t nfs ..., mount would just hang. After spending hours trying different formulas and seeing how other people implemented NFS in Alpine, Ubuntu, and CentOS, I restarted Docker one last time to rid myself of the hung processes and hung up my hat on this.
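
For what it's worth, this was the loop I was stuck in; the host and export path below are placeholders:

showmount -e nfs-server.example.com                      # exports listed fine, even wide open with a wildcard
mount -t nfs nfs-server.example.com:/export /mnt/test    # ...and this would just hang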

A highly available NFS share on top of S3 lost all appeal to me at this point. There's still block volume and DRBD testing I want to do here later, though.

Datashim.io

I must confess before continuing: I am affiliated with IBM at the time of writing this. However, that doesn't change my opinion on datashim.

The first time I saw datashim.io it looked appealing, but I wasn't interested in using S3 at the time. It looks like it can mount Apache Hive as well.

In my testing, it worked as well as s3fs did inside the container as far as writes go. It also took away the need for a privileged container.

The only downsides I found were:

  • it doesn't support symlinks, which is a deal-breaker for my own needs
  • there's an additional CRD called Dataset that you use to make your PersistentVolumeClaims (there's a sketch of one below).
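
For reference, a Dataset looks roughly like this; I'm going from the datashim docs here, so treat the apiVersion and field names as approximate and check the project for your version. Datashim then produces a PVC with the same name for pods to mount.

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
spec:
  local:
    type: "COS"                        # S3 / Cloud Object Storage style dataset
    endpoint: "https://s3.example.com" # placeholder endpoint
    bucket: "example-bucket"           # placeholder bucket
    accessKeyID: "ACCESS_KEY"
    secretAccessKey: "SECRET_KEY"
    region: ""                         # can be left empty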

Overall though, it does work great, and I put a bit of trust in a bigger name like IBM for stuff like this.

So I moved on to the idea of looking for an S3-specific CSI or, if one didn't exist, figuring out how to write my own.

CSI-S3

Thankfully, someone out there was already on point and made a CSI for S3: https://github.com/ctrox/csi-s3

Also, this is a really simple CSI if you need some example code to work from when making your own.

This is the bachelor chow I'm about to consume. It provides 4 different ways to mount S3 buckets, including my favorite pal s3fs.

The first problem I ran into was that the storageclass example in README.md is incomplete; I found the example here to be complete, though.

For my development purposes, I have formulated local-s3.yaml, which is an all-inclusive local development kit using minio for S3. For it to work, the cluster nodes must be able to resolve cluster DNS. On random providers, using resolve-host-patcher should work. CSI-S3 will create a bucket per PVC, and the reclaimPolicy in this manifest is Delete, so the backing bucket goes away when you delete a PVC.
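
Spinning it up is just a matter of applying the manifest; this assumes the csi-s3 driver itself is already deployed in the cluster:

kubectl apply -f local-s3.yaml
kubectl -n kube-system get pods -l app=minio   # wait for minio to come up
kubectl get pvc paper-tiger-delete-me          # should go Bound once a bucket gets provisioned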

On a live cluster, I would store a secret with the command below and use live-storage-class.yaml. The reclaimPolicy is Retain, and it will create each PVC's volume inside a bucket called lantern.

kubectl -n kube-system create secret generic csi-s3-secret \
    --from-literal="accessKeyID=..." \
    --from-literal="secretAccessKey=..." \
    --from-literal="endpoint=https://nyc3.digitaloceanspaces.com" \
    --from-literal="region="

Backups

With S3 buckets, backups should be pretty easy to accomplish. For places like DigitalOcean, I plan to just run a job that uses the secret for the CSI. It's fine for this to be a privileged container, so I can just build an Alpine utility container like so for the job.

FROM alpine:3
RUN apk add -U --no-cache bash curl && \
    apk add -U --no-cache s3fs-fuse kubectl helm --repository=http://dl-cdn.alpinelinux.org/alpine/edge/testing/
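
The job itself would look something like the sketch below. I haven't built it yet, so the image reference is a placeholder for the container above, and where the tarball ultimately gets shipped is left open.

apiVersion: batch/v1
kind: Job
metadata:
  name: s3-backup
  namespace: kube-system
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: backup
          # placeholder reference to the alpine utility image built above
          image: registry.example.com/s3-utils:latest
          securityContext:
            privileged: true            # fine here; needed for the FUSE mount inside the job
          env:
            - name: ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: csi-s3-secret
                  key: accessKeyID
            - name: SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: csi-s3-secret
                  key: secretAccessKey
          command: ["bash", "-c"]
          args:
            - |
              echo "${ACCESS_KEY_ID}:${SECRET_ACCESS_KEY}" > /etc/passwd-s3fs
              chmod 600 /etc/passwd-s3fs
              mkdir -p /mnt/lantern
              s3fs lantern /mnt/lantern -o url=https://nyc3.digitaloceanspaces.com
              tar czf /tmp/lantern-backup.tar.gz -C /mnt/lantern .
              # ship /tmp/lantern-backup.tar.gz somewhere durable here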

Considerations

The reliability of reading files will depend on the underlying S3 storage's consistency guarantees.

There doesn't appear to be a way to pass s3fs its caching flag. If needed, that functionality will have to be patched in. Ref: ctrox/csi-s3/pkg/mounter/s3fs.go
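
For context, the flag I mean is s3fs's local cache option; something like the line below (bucket, mount point, and endpoint are placeholders) would presumably need to be wired into the options that mounter builds.

# s3fs's -o use_cache enables a local file cache backed by the given directory
s3fs example-bucket /mnt/s3 -o url=https://s3.example.com -o use_cache=/tmp/s3fs-cache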

It looks like S3FS has implemented SlowDown handling.

While I have never been able to break s3fs, I'm sure there's a way. There's always a way when you have people using your systems in the wild.

live-storage-class.yaml

---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: s3
provisioner: ch.ctrox.csi.s3-driver
reclaimPolicy: Retain
parameters:
  # if we don't set a bucket, it will create pvc named buckets
  bucket: lantern
  mounter: s3fs
  csi.storage.k8s.io/provisioner-secret-name: csi-s3-secret
  csi.storage.k8s.io/provisioner-secret-namespace: kube-system
  csi.storage.k8s.io/controller-publish-secret-name: csi-s3-secret
  csi.storage.k8s.io/controller-publish-secret-namespace: kube-system
  csi.storage.k8s.io/node-stage-secret-name: csi-s3-secret
  csi.storage.k8s.io/node-stage-secret-namespace: kube-system
  csi.storage.k8s.io/node-publish-secret-name: csi-s3-secret
  csi.storage.k8s.io/node-publish-secret-namespace: kube-system
local-s3.yaml

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: kube-system
spec:
  selector:
    app: minio
  ports:
    - name: minio
      protocol: TCP
      port: 9000
      targetPort: 9000
    - name: minio-webconsole
      protocol: TCP
      port: 9001
      targetPort: 9001
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  labels:
    app: minio
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio:latest
          ports:
            - containerPort: 9000
            - containerPort: 9001
          args:
            - server
            - /data
            - --console-address
            - :9001
          volumeMounts:
            - mountPath: /data
              name: minio
      volumes:
        - name: minio
          persistentVolumeClaim:
            claimName: minio
---
apiVersion: v1
kind: Secret
metadata:
  creationTimestamp: null
  name: csi-s3-secret
  namespace: kube-system
data:
  # id is minioadmin
  accessKeyID: bWluaW9hZG1pbg==
  # key is minioadmin
  secretAccessKey: bWluaW9hZG1pbg==
  # endpoint is http://minio.kube-system.svc.cluster.local:9000
  endpoint: aHR0cDovL21pbmlvLmt1YmUtc3lzdGVtLnN2Yy5jbHVzdGVyLmxvY2FsOjkwMDA=
  # just leave blank, we're not using aws ;)
  region: ""
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: s3
provisioner: ch.ctrox.csi.s3-driver
reclaimPolicy: Delete
parameters:
  mounter: s3fs
  csi.storage.k8s.io/provisioner-secret-name: csi-s3-secret
  csi.storage.k8s.io/provisioner-secret-namespace: kube-system
  csi.storage.k8s.io/controller-publish-secret-name: csi-s3-secret
  csi.storage.k8s.io/controller-publish-secret-namespace: kube-system
  csi.storage.k8s.io/node-stage-secret-name: csi-s3-secret
  csi.storage.k8s.io/node-stage-secret-namespace: kube-system
  csi.storage.k8s.io/node-publish-secret-name: csi-s3-secret
  csi.storage.k8s.io/node-publish-secret-namespace: kube-system
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: paper-tiger-delete-me
spec:
  storageClassName: s3
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          volumeMounts:
            - mountPath: /var/lib/www/html
              name: paper-tiger-delete-me
      volumes:
        - name: paper-tiger-delete-me
          persistentVolumeClaim:
            claimName: paper-tiger-delete-me
            readOnly: false