
Partially reconstruct manifests in the event you lost them

Here is how you can partially reconstruct deleted manifests so that your provider keeps monitoring them and withdrawing from their leases, and so that users can re-send their deployment manifest (without redeploying) to restore it and their ingresses (URIs), avoiding a Pod restart and therefore keeping the Pod's ephemeral data.

You can also reconstruct the manifests even when the deployments are gone (but the bid/lease is still active/open). This requires the tenants to re-submit their manifests.

1. Partially reconstruct missing manifests

This reconstruction is based on the existing namespaces; there is no way to entirely reconstruct the manifests unless they were manually backed up.

NOTE: see the comment section below if you do not even have the ns (namespace) and still want to recover an active/open bid/lease.

MANIFEST_TEMPLATE='{"apiVersion":"akash.network/v2beta2","kind":"Manifest","metadata":{"generation":2,"labels":{"akash.network":"true","akash.network/lease.id.dseq":"$dseq","akash.network/lease.id.gseq":"$gseq","akash.network/lease.id.oseq":"$oseq","akash.network/lease.id.owner":"$owner","akash.network/lease.id.provider":"$provider","akash.network/namespace":"$ns"},"name":"$ns","namespace":"lease"},"spec":{"lease_id":{"dseq":"$dseq","gseq":$gseq,"oseq":$oseq,"owner":"$owner","provider":"$provider"}}}'

kubectl get ns -A -l akash.network=true -o json \
  | jq --arg lid 'akash.network/lease.id'  -r '.items[].metadata.labels | select(.[$lid+".dseq"] != null) | [.[$lid+".owner", $lid+".dseq", $lid+".gseq", $lid+".oseq", $lid+".provider"]] | @tsv' \
    | while read owner dseq gseq oseq provider; do \
        ns=$(akash provider show-cluster-ns --owner $owner --dseq $dseq --gseq $gseq --oseq $oseq --provider $provider)
        echo "$MANIFEST_TEMPLATE" | owner=$owner dseq=$dseq gseq=$gseq oseq=$oseq provider=$provider ns=$ns envsubst | kubectl create -f -
      done
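
If you want a sanity check, list the recreated Manifest objects (assuming the provider CRD's plural resource name is manifests; they live in the lease namespace, as per the template above):

kubectl -n lease get manifests.akash.network --show-labels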

2. Restart akash-provider pod

kubectl -n akash-services delete pods -l app=akash-provider

Now akash-provider should keep track of the partially reconstructed manifests, withdraw from the leases, check deployment status, etc.
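
To verify, you can tail the provider logs once the new pod is up (a minimal sketch reusing the app=akash-provider label from above; adjust the selector to your setup):

kubectl -n akash-services logs -l app=akash-provider --tail=100 -f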

The drawbacks

  • The provider will report this kind of error on start, only once. It is related to the reconstructed deployments.
E[2022-06-01|12:16:16.242] cleaning stale resources                     module=provider-cluster-kube err="values: Invalid value: []string{}: for 'in', 'notin' operators, values set can't be empty"
  • Clients will have to re-send their deployment manifest if they want to regain the ability to akash provider lease-shell into their deployment and to restore the nginx ingresses (URIs) to it.
    • akash provider send-manifest will do (a sketch is shown below). Once re-sent, the cleaning stale resources error seen above will also disappear.
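
A rough sketch of the client-side command (the flag names here are illustrative; check akash provider send-manifest --help on the client version in use):

akash provider send-manifest deploy.yaml \
  --dseq $dseq \
  --provider $provider \
  --from <tenant-key-name> \
  --node $AKASH_NODE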


arno01 commented Sep 8, 2023

post-mainnet6, bid/lease based recovery

provider-services 0.4.6

Two leases disappeared because of this error:

E[2023-09-08|00:48:41.045] execution error                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/9964422/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud state=deploy-active err="kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>))"

This left no namespaces behind, which the pre-mainnet6 recovery above relies on (kubectl get ns ...).

And yet, it is possible to recover the deployments when the bids/leases/deployments still have active/open status.
The only caveat: the client has to re-submit the manifests.

The reason one would want to recover is usually to keep the hostnames (URIs).

Recovering

One can recover the deployments when bids/leases/deployments have active/open status.

1) prepare manifest templates

You need to match the number of services in the template to the deployment, otherwise you will get this kind of error (a quick way to check the service count is shown right after these log lines):

E[2023-09-08|13:08:46.234] deploying workload                           module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/9965092/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud err="kube-builder: reservation does not have SchedulerParams for ResourcesID (2)"
E[2023-09-08|13:08:46.234] execution error                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/9965092/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud state=deploy-active err="kube-builder: reservation does not have SchedulerParams for ResourcesID (2)"
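
A quick way to check how many services a deployment has before picking a template (this is the same query the recovery script in step 3 uses):

provider-services query deployment get --dseq $dseq --owner $owner -o json \
  | jq -r '.groups[0].group_spec.resources | length'
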
  • manifest template for one service:
MANIFEST_TEMPLATE_SVC[1]='{"apiVersion":"akash.network/v2beta2","kind":"Manifest","metadata":{"generation":1,"labels":{"akash.network":"true","akash.network/lease.id.dseq":"$dseq","akash.network/lease.id.gseq":"$gseq","akash.network/lease.id.oseq":"$oseq","akash.network/lease.id.owner":"$owner","akash.network/lease.id.provider":"$provider","akash.network/namespace":"$ns"},"name":"$ns","namespace":"lease"},"spec":{"group":{"name":"$group_name","services":[{"count":1,"expose":[{"endpoint_sequence_number":0,"external_port":80,"global":true,"http_options":{"max_body_size":1048576,"next_cases":["error","timeout"],"next_tries":3,"read_timeout":60000,"send_timeout":60000},"port":80,"proto":"TCP"}],"image":"bsord/tetris","name":"app","resources":{"cpu":{"units":1000},"gpu":{"units":0},"id":1,"memory":{"size":"536870912"},"storage":[{"name":"default","size":"536870912"}]}}]},"lease_id":{"dseq":"$dseq","gseq":$gseq,"oseq":$oseq,"owner":"$owner","provider":"$provider"}}}'
MANIFEST_TEMPLATE_SVC[1]='{
  "apiVersion": "akash.network/v2beta2",
  "kind": "Manifest",
  "metadata": {
    "generation": 1,
    "labels": {
      "akash.network": "true",
      "akash.network/lease.id.dseq": "$dseq",
      "akash.network/lease.id.gseq": "$gseq",
      "akash.network/lease.id.oseq": "$oseq",
      "akash.network/lease.id.owner": "$owner",
      "akash.network/lease.id.provider": "$provider",
      "akash.network/namespace": "$ns"
    },
    "name": "$ns",
    "namespace": "lease"
  },
  "spec": {
    "group": {
        "name": "$group_name",
        "services": [
            {
                "count": 1,
                "expose": [
                    {
                        "endpoint_sequence_number": 0,
                        "external_port": 80,
                        "global": true,
                        "http_options": {
                            "max_body_size": 1048576,
                            "next_cases": [
                                "error",
                                "timeout"
                            ],
                            "next_tries": 3,
                            "read_timeout": 60000,
                            "send_timeout": 60000
                        },
                        "port": 80,
                        "proto": "TCP"
                    }
                ],
                "image": "bsord/tetris",
                "name": "app",
                "resources": {
                    "cpu": {
                        "units": 1000
                    },
                    "gpu": {
                        "units": 0
                    },
                    "id": 1,
                    "memory": {
                        "size": "536870912"
                    },
                    "storage": [
                        {
                            "name": "default",
                            "size": "536870912"
                        }
                    ]
                }
            }
        ]
    },
    "lease_id": {
      "dseq": "$dseq",
      "gseq": $gseq,
      "oseq": $oseq,
      "owner": "$owner",
      "provider": "$provider"
    }
  }
}'
  • manifest template for two services:
MANIFEST_TEMPLATE_SVC[2]='{"apiVersion":"akash.network/v2beta2","kind":"Manifest","metadata":{"generation":1,"labels":{"akash.network":"true","akash.network/lease.id.dseq":"$dseq","akash.network/lease.id.gseq":"$gseq","akash.network/lease.id.oseq":"$oseq","akash.network/lease.id.owner":"$owner","akash.network/lease.id.provider":"$provider","akash.network/namespace":"$ns"},"name":"$ns","namespace":"lease"},"spec":{"group":{"name":"$group_name","services":[{"count":1,"expose":[{"endpoint_sequence_number":0,"external_port":80,"global":true,"http_options":{"max_body_size":1048576,"next_cases":["error","timeout"],"next_tries":3,"read_timeout":60000,"send_timeout":60000},"port":80,"proto":"TCP"}],"image":"bsord/tetris","name":"app","resources":{"cpu":{"units":1000},"gpu":{"units":0},"id":1,"memory":{"size":"536870912"},"storage":[{"name":"default","size":"536870912"}]}},{"count":1,"expose":[{"endpoint_sequence_number":0,"external_port":80,"global":true,"http_options":{"max_body_size":1048576,"next_cases":["error","timeout"],"next_tries":3,"read_timeout":60000,"send_timeout":60000},"port":80,"proto":"TCP"}],"image":"bsord/tetris","name":"app2","resources":{"cpu":{"units":1000},"gpu":{"units":0},"id":2,"memory":{"size":"536870912"},"storage":[{"name":"default","size":"536870912"}]}}]},"lease_id":{"dseq":"$dseq","gseq":$gseq,"oseq":$oseq,"owner":"$owner","provider":"$provider"}}}'
MANIFEST_TEMPLATE_SVC[2]='{
  "apiVersion": "akash.network/v2beta2",
  "kind": "Manifest",
  "metadata": {
    "generation": 1,
    "labels": {
      "akash.network": "true",
      "akash.network/lease.id.dseq": "$dseq",
      "akash.network/lease.id.gseq": "$gseq",
      "akash.network/lease.id.oseq": "$oseq",
      "akash.network/lease.id.owner": "$owner",
      "akash.network/lease.id.provider": "$provider",
      "akash.network/namespace": "$ns"
    },
    "name": "$ns",
    "namespace": "lease"
  },
  "spec": {
    "group": {
        "name": "$group_name",
        "services": [
          {
            "count": 1,
            "expose": [
              {
                "endpoint_sequence_number": 0,
                "external_port": 80,
                "global": true,
                "http_options": {
                  "max_body_size": 1048576,
                  "next_cases": [
                    "error",
                    "timeout"
                  ],
                  "next_tries": 3,
                  "read_timeout": 60000,
                  "send_timeout": 60000
                },
                "port": 80,
                "proto": "TCP"
              }
            ],
            "image": "bsord/tetris",
            "name": "app",
            "resources": {
              "cpu": {
                "units": 1000
              },
              "gpu": {
                "units": 0
              },
              "id": 1,
              "memory": {
                "size": "536870912"
              },
              "storage": [
                {
                  "name": "default",
                  "size": "536870912"
                }
              ]
            }
          },
          {
            "count": 1,
            "expose": [
              {
                "endpoint_sequence_number": 0,
                "external_port": 80,
                "global": true,
                "http_options": {
                  "max_body_size": 1048576,
                  "next_cases": [
                    "error",
                    "timeout"
                  ],
                  "next_tries": 3,
                  "read_timeout": 60000,
                  "send_timeout": 60000
                },
                "port": 80,
                "proto": "TCP"
              }
            ],
            "image": "bsord/tetris",
            "name": "app2",
            "resources": {
              "cpu": {
                "units": 1000
              },
              "gpu": {
                "units": 0
              },
              "id": 2,
              "memory": {
                "size": "536870912"
              },
              "storage": [
                {
                  "name": "default",
                  "size": "536870912"
                }
              ]
            }
          }
        ]
    },
    "lease_id": {
      "dseq": "$dseq",
      "gseq": $gseq,
      "oseq": $oseq,
      "owner": "$owner",
      "provider": "$provider"
    }
  }
}'

2) scale down Akash Provider

NOTE: It is possible that you will hit the err="kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>))" error if you scale down the Akash Provider before you reconstruct the missing manifests (!). If that happens, try skipping this step and bounce the akash-provider after the next step instead. More details in akash-network/support#121

kubectl -n akash-services scale statefulsets akash-provider --replicas=0
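
To confirm the provider pod is gone before moving on (same label selector as used earlier):

kubectl -n akash-services get pods -l app=akash-provider   # expect "No resources found"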

3) Reconstruct the missing manifests, namespaces and basic (skeleton) deployments

export AKASH_NODE="http://$(kubectl -n akash-services get ep akash-node-1 -o jsonpath='{.subsets[0].addresses[0].ip}'):26657"
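
A quick sanity check that AKASH_NODE points at a reachable Tendermint RPC endpoint (the standard /status RPC; this should print the latest block height):

curl -s "$AKASH_NODE/status" | jq -r '.result.sync_info.latest_block_height'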

provider-services query market lease list --state active --provider akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk -o json \
  | jq -r '.leases[].lease.lease_id | [.owner,.dseq,.gseq,.oseq,.provider] | @tsv' | \
    while read owner dseq gseq oseq provider; do \
      ns=$(provider-services show-cluster-ns --owner $owner --dseq $dseq --gseq $gseq --oseq $oseq --provider $provider)
      kubectl get ns $ns >/dev/null 2>&1
      if [[ $? -ne 0 ]]; then
        group_name=$(provider-services query deployment get --dseq $dseq --owner $owner -o json | jq -r '.groups[0].group_spec.name')
        services_number=$(provider-services query deployment get --dseq $dseq --owner $owner -o json | jq -r '.groups[0].group_spec.resources | length')
        echo "DEBUG: owner=$owner dseq=$dseq gseq=$gseq oseq=$oseq provider=$provider ns=$ns group_name=$group_name services_number=$services_number"
        echo "${MANIFEST_TEMPLATE_SVC[$services_number]}" | owner=$owner dseq=$dseq gseq=$gseq oseq=$oseq provider=$provider ns=$ns group_name=$group_name envsubst | kubectl create -f -
        kubectl create ns $ns
        kubectl label ns $ns akash.network=true akash.network/lease.id.dseq=$dseq akash.network/lease.id.gseq=$gseq akash.network/lease.id.oseq=$oseq akash.network/lease.id.owner=$owner akash.network/lease.id.provider=$provider akash.network/namespace=$ns
        echo
      fi
    done
  • 9964422 staging-console-proxy.akash.network nbggm2ss0lcc56p7br9mhbv2p8.ingress.hurricane.akash.pub
DEBUG: owner=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90 dseq=9964422 gseq=1 oseq=1 provider=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk ns=ug61v2c6t6uateu77rblsmgbos7r0g2062ubagjeg30rc group_name=dcloud services_number=1
manifest.akash.network/ug61v2c6t6uateu77rblsmgbos7r0g2062ubagjeg30rc created
namespace/ug61v2c6t6uateu77rblsmgbos7r0g2062ubagjeg30rc created
namespace/ug61v2c6t6uateu77rblsmgbos7r0g2062ubagjeg30rc labeled
  • 9965092 staging-console.akash.network 4ugffb94mpbnh6ulugc4vkefps.ingress.hurricane.akash.pub
DEBUG: owner=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90 dseq=9965092 gseq=1 oseq=1 provider=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk ns=rt9a4lkes3k7cj6lu99aqjq1a8kgo8a32b4qqsns4jc8e group_name=dcloud services_number=2
manifest.akash.network/rt9a4lkes3k7cj6lu99aqjq1a8kgo8a32b4qqsns4jc8e created
namespace/rt9a4lkes3k7cj6lu99aqjq1a8kgo8a32b4qqsns4jc8e created
namespace/rt9a4lkes3k7cj6lu99aqjq1a8kgo8a32b4qqsns4jc8e labeled
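
You can double-check what was recreated by listing the labeled namespaces and Manifest objects (assuming the CRD's plural resource name is manifests):

kubectl get ns -l akash.network=true --show-labels
kubectl -n lease get manifests.akash.network -l akash.network=true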

4) scale Akash Provider back up

kubectl -n akash-services scale statefulsets akash-provider --replicas=1
  • 9964422 staging-console-proxy.akash.network nbggm2ss0lcc56p7br9mhbv2p8.ingress.hurricane.akash.pub
D[2023-09-08|13:04:28.811] deploy complete                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/9964422/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
  • 9965092 staging-console.akash.network 4ugffb94mpbnh6ulugc4vkefps.ingress.hurricane.akash.pub
D[2023-09-08|13:37:38.956] deploy complete                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/9965092/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud

5) re-send the manifests
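
On the client side this is roughly the following (flag names are illustrative; check provider-services send-manifest --help for your version). The provider logs should then show manifest received followed by deploy complete:

provider-services send-manifest deploy.yaml \
  --dseq $dseq \
  --provider $provider \
  --from <tenant-key-name>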

I[2023-09-08|13:05:44.524] manifest received                            module=provider-cluster cmp=provider cmp=service lease=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/9964422/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
...
D[2023-09-08|13:05:44.855] deploy complete                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/9964422/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
I[2023-09-08|13:38:11.926] manifest received                            module=provider-cluster cmp=provider cmp=service lease=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/9965092/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
...
D[2023-09-08|13:38:12.488] deploy complete                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/9965092/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud

6) remove the old "fake" (bsord/tetris) placeholder deployments:

root@control-01:~# ns=ug61v2c6t6uateu77rblsmgbos7r0g2062ubagjeg30rc
root@control-01:~# kubectl -n $ns get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                     IMAGE
app-6d4cdc6d95-vz556     bsord/tetris
proxy-5f4dbcfd79-5jm8n   ghcr.io/akash-network/console-proxy:v0.1.303
root@control-01:~# kubectl -n $ns get deployment
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
app     1/1     1            1           14m
proxy   1/1     1            1           13m
root@control-01:~# kubectl -n $ns delete deployment app
deployment.apps "app" deleted
root@control-01:~# ns=rt9a4lkes3k7cj6lu99aqjq1a8kgo8a32b4qqsns4jc8e
root@control-01:~# kubectl -n $ns get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                     IMAGE
app-6d668cbff-jnzgh      bsord/tetris
app2-85db97649f-sxdqk    bsord/tetris
nginx-79db85769b-fpgm8   beevelop/nginx-basic-auth
web-5b5f9549cc-6whhj     ghcr.io/akash-network/console:v0.1.303
root@control-01:~# kubectl -n $ns get deployment
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
app     1/1     1            1           85s
app2    1/1     1            1           85s
nginx   1/1     1            1           52s
web     1/1     1            1           51s
root@control-01:~# kubectl -n $ns delete deployment app
deployment.apps "app" deleted
root@control-01:~# kubectl -n $ns delete deployment app2
deployment.apps "app2" deleted
