S3 Storage Goodness in K8S

In my storage quests, I finally decided I want to lazily use S3 for ReadWriteMany and to do some experiments with it.

There are a few options, but to save you some time if you just want to know what I landed on: I like csi-s3.

S3FS Mounted in Pod Containers

Well... this works great! The only problem was that it needed security privileges for mounting. That would be terrible if a container with this power got compromised, so I immediately moved on to getting this a layer away from being managed in-pod.
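
For reference, this is roughly the shape of the pod spec involved; a sketch only, with a placeholder image, bucket, endpoint, and credentials. The securityContext is the part that bothered me.

apiVersion: v1
kind: Pod
metadata:
  name: s3fs-in-pod
spec:
  containers:
    - name: app
      # placeholder image; assumes s3fs-fuse is baked in or installed at startup
      image: example.com/app-with-s3fs:latest
      command: ["sh", "-c"]
      args:
        - |
          echo "ACCESS_KEY:SECRET_KEY" > /etc/passwd-s3fs && chmod 600 /etc/passwd-s3fs
          mkdir -p /mnt/s3
          s3fs example-bucket /mnt/s3 -o url=https://s3.example.com
          exec sleep infinity
      securityContext:
        # the uncomfortable part: FUSE mounting needs privileges
        # (privileged, or at minimum SYS_ADMIN plus access to /dev/fuse)
        privileged: true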

NFS Provisioner with Mounted S3

My initial plan was to just use the nfs-subdir-external-provisioner on top of a multi-replica, S3-backed deployment of NFS Ganesha.

When running time echo hi > /mnt-path/hello.txt against s3fs directly and against NFS Ganesha, I found roughly 0.5 seconds before NFS Ganesha completed its work, whereas s3fs directly was responsive enough that time reported 0.000. That alone was a big turn-off for me.
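
The test itself was nothing fancier than this (the mount paths here are placeholders; the timings are just what I observed on my setup):

time echo hi > /mnt/s3fs/hello.txt       # s3fs directly: time reported 0.000
time echo hi > /mnt/ganesha/hello.txt    # through NFS Ganesha: roughly half a second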

So I moved on to trying the in-kernel NFS implementation. Admittedly, I have no clue why, but this defeated me. I couldn't win, and this is something I've done professionally for Fortune 500 companies for half a decade on RHEL-based systems. This experiment never made it past testing in plain Docker containers.

I had showmount -e showing my exports, and I even had them wide open to the world with a wildcard. Any time I went to mount -t nfs ..., mount would just hang. After spending hours trying different formulas and seeing how other people implemented NFS in Alpine, Ubuntu, and CentOS, I restarted Docker one last time to rid myself of the hung processes and hung up my hat on this.
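
For what it's worth, this was the loop I was stuck in; the host and export path below are placeholders:

showmount -e nfs-server.example.com                      # exports listed fine, even wide open with a wildcard
mount -t nfs nfs-server.example.com:/export /mnt/test    # ...and this would just hang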

A highly available NFS share on top of S3 lost all appeal to me at this point. There's still block volume and DRBD testing I want to do here later, though.

Datashim.io

I must confess before continuing: I am affiliated with IBM at the time of writing this. However, that doesn't change my opinion on datashim.

The first time I saw datashim.io it looked appealing, but I wasn't interested in using S3 at the time. It looks like it can mount Apache Hive as well.

In my testing, it worked as well as s3fs did inside the container as far as writes go. It also took away the need for a privileged container.

The only downsides I found were:

  • it doesn't support symlinks, which is a deal-breaker for my own needs
  • there's an additional CRD called Dataset that you use to make your PersistentVolumeClaims (there's a sketch of one below).
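
For reference, a Dataset looks roughly like this; I'm going from the datashim docs here, so treat the apiVersion and field names as approximate and check the project for your version. Datashim then produces a PVC with the same name for pods to mount.

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
spec:
  local:
    type: "COS"                        # S3 / Cloud Object Storage style dataset
    endpoint: "https://s3.example.com" # placeholder endpoint
    bucket: "example-bucket"           # placeholder bucket
    accessKeyID: "ACCESS_KEY"
    secretAccessKey: "SECRET_KEY"
    region: ""                         # can be left empty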

Overall though, it does work great, and I put a bit of trust in a bigger name like IBM for stuff like this.

So I moved on to the idea of looking for an S3-specific CSI or, if one didn't exist, figuring out how to write my own.

CSI-S3

Thankfully, someone out there was already on point and made a CSI for S3: https://github.com/ctrox/csi-s3

Also, this is a really simple CSI if you need some example code to work from when making your own.

This is the bachelor chow I'm about to consume. It provides 4 different ways to mount S3 buckets, including my favorite pal s3fs.

The first problem I ran into was that the storageclass example in README.md is incomplete; I found the example here to be complete, though.

For my development purposes, I have formulated local-s3.yaml, which is an all-inclusive local development kit using minio for S3. For it to work, the cluster nodes must be able to resolve cluster DNS. On random providers, using resolve-host-patcher should work. CSI-S3 will create a bucket per PVC, and the reclaimPolicy in this manifest is Delete, so the backing bucket goes away when you delete a PVC.
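
Spinning it up is just a matter of applying the manifest; this assumes the csi-s3 driver itself is already deployed in the cluster:

kubectl apply -f local-s3.yaml
kubectl -n kube-system get pods -l app=minio   # wait for minio to come up
kubectl get pvc paper-tiger-delete-me          # should go Bound once a bucket gets provisioned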

On a live cluster, I would store a secret with the command below and use live-storage-class.yaml. The reclaimPolicy is Retain, and it will create each PVC's volume inside a bucket called lantern.

kubectl -n kube-system create secret generic csi-s3-secret \
    --from-literal="accessKeyID=..." \
    --from-literal="secretAccessKey=..." \
    --from-literal="endpoint=https://nyc3.digitaloceanspaces.com" \
    --from-literal="region="

Backups

With S3 buckets, backups should be pretty easy to accomplish. For places like DigitalOcean, I plan to just run a job that uses the secret for the CSI. It's fine for this to be a privileged container, so I can just build an Alpine utility container like so for the job.

FROM alpine:3
RUN apk add -U --no-cache bash curl && \
    apk add -U --no-cache s3fs-fuse kubectl helm --repository=http://dl-cdn.alpinelinux.org/alpine/edge/testing/
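
The job itself would look something like the sketch below. I haven't built it yet, so the image reference is a placeholder for the container above, and where the tarball ultimately gets shipped is left open.

apiVersion: batch/v1
kind: Job
metadata:
  name: s3-backup
  namespace: kube-system
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: backup
          # placeholder reference to the alpine utility image built above
          image: registry.example.com/s3-utils:latest
          securityContext:
            privileged: true            # fine here; needed for the FUSE mount inside the job
          env:
            - name: ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: csi-s3-secret
                  key: accessKeyID
            - name: SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: csi-s3-secret
                  key: secretAccessKey
          command: ["bash", "-c"]
          args:
            - |
              echo "${ACCESS_KEY_ID}:${SECRET_ACCESS_KEY}" > /etc/passwd-s3fs
              chmod 600 /etc/passwd-s3fs
              mkdir -p /mnt/lantern
              s3fs lantern /mnt/lantern -o url=https://nyc3.digitaloceanspaces.com
              tar czf /tmp/lantern-backup.tar.gz -C /mnt/lantern .
              # ship /tmp/lantern-backup.tar.gz somewhere durable here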

Considerations

The reliability of reading files will depend on the underlying S3 storage's consistency guarantees.

There doesn't appear to be a way to pass s3fs its caching flag. If needed, that functionality will have to be patched in. Ref: ctrox/csi-s3/pkg/mounter/s3fs.go
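
For context, the flag I mean is s3fs's local cache option; something like the line below (bucket, mount point, and endpoint are placeholders) would presumably need to be wired into the options that mounter builds.

# s3fs's -o use_cache enables a local file cache backed by the given directory
s3fs example-bucket /mnt/s3 -o url=https://s3.example.com -o use_cache=/tmp/s3fs-cache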

It looks like S3FS has implemented SlowDown handling.

While I have never been able to break s3fs, I'm sure there's a way. There's always a way when you have people using your systems in the wild.

live-storage-class.yaml

---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: s3
provisioner: ch.ctrox.csi.s3-driver
reclaimPolicy: Retain
parameters:
  # if we don't set a bucket, it will create pvc named buckets
  bucket: lantern
  mounter: s3fs
  csi.storage.k8s.io/provisioner-secret-name: csi-s3-secret
  csi.storage.k8s.io/provisioner-secret-namespace: kube-system
  csi.storage.k8s.io/controller-publish-secret-name: csi-s3-secret
  csi.storage.k8s.io/controller-publish-secret-namespace: kube-system
  csi.storage.k8s.io/node-stage-secret-name: csi-s3-secret
  csi.storage.k8s.io/node-stage-secret-namespace: kube-system
  csi.storage.k8s.io/node-publish-secret-name: csi-s3-secret
  csi.storage.k8s.io/node-publish-secret-namespace: kube-system
local-s3.yaml

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: kube-system
spec:
  selector:
    app: minio
  ports:
    - name: minio
      protocol: TCP
      port: 9000
      targetPort: 9000
    - name: minio-webconsole
      protocol: TCP
      port: 9001
      targetPort: 9001
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  labels:
    app: minio
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio:latest
          ports:
            - containerPort: 9000
            - containerPort: 9001
          args:
            - server
            - /data
            - --console-address
            - :9001
          volumeMounts:
            - mountPath: /data
              name: minio
      volumes:
        - name: minio
          persistentVolumeClaim:
            claimName: minio
---
apiVersion: v1
kind: Secret
metadata:
  creationTimestamp: null
  name: csi-s3-secret
  namespace: kube-system
data:
  # id is minioadmin
  accessKeyID: bWluaW9hZG1pbg==
  # key is minioadmin
  secretAccessKey: bWluaW9hZG1pbg==
  # endpoint is http://minio.kube-system.svc.cluster.local:9000
  endpoint: aHR0cDovL21pbmlvLmt1YmUtc3lzdGVtLnN2Yy5jbHVzdGVyLmxvY2FsOjkwMDA=
  # just leave blank, we're not using aws ;)
  region: ""
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: s3
provisioner: ch.ctrox.csi.s3-driver
reclaimPolicy: Delete
parameters:
  mounter: s3fs
  csi.storage.k8s.io/provisioner-secret-name: csi-s3-secret
  csi.storage.k8s.io/provisioner-secret-namespace: kube-system
  csi.storage.k8s.io/controller-publish-secret-name: csi-s3-secret
  csi.storage.k8s.io/controller-publish-secret-namespace: kube-system
  csi.storage.k8s.io/node-stage-secret-name: csi-s3-secret
  csi.storage.k8s.io/node-stage-secret-namespace: kube-system
  csi.storage.k8s.io/node-publish-secret-name: csi-s3-secret
  csi.storage.k8s.io/node-publish-secret-namespace: kube-system
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: paper-tiger-delete-me
spec:
  storageClassName: s3
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          volumeMounts:
            - mountPath: /var/lib/www/html
              name: paper-tiger-delete-me
      volumes:
        - name: paper-tiger-delete-me
          persistentVolumeClaim:
            claimName: paper-tiger-delete-me
            readOnly: false