@mlushpenko · Last active August 29, 2020
[Banner image: full architecture diagram]

A way to achieve multitenancy on kubernetes with Istio, External Authentication Server and OpenID Connect (Part 1 - Authentication)

Before we dive into any technical details, it makes sense to note that multitenancy is a complex topic and often understood differently, depending on the task you are trying to achieve and the person you are talking to. To set the stage for this article, let me explain what I mean by multitenancy.

At HAL24K we provide our clients with AI-based decision support on a daily basis. We combine data science services and solutions with Dimension, our SaaS-based data science platform. When you subscribe to our platform, you can choose which platform modules are relevant to your business: data processing (dataflow), model training (datalab), dashboards and more. You can give all your users access to those modules or only specific modules. This kind of multitenancy can be represented in a standard matrix permissions structure as depicted by the diagram below.

[Diagram: matrix permissions structure mapping tenant users to platform modules]

I will focus on application-level multitenancy on kubernetes (how people access applications in the browser), not on kubernetes-level multitenancy (how people perform tasks on the kubernetes cluster).

High-level solution overview

To ground our story in a concrete example, we will show how JupyterHub can be used in a multitenant way on kubernetes. We picked JupyterHub because our data scientists use it a lot and because it integrates with kubernetes to spin up Jupyter notebooks as pods, so we don't need to write our own management software for that. If you run some other application as independent instances (without a management layer), I will show how multitenancy can be achieved with the same tooling as well. Take a look at the following diagram:

[Diagram: Alice and Bob from different companies routed through authentication to their own Jupyter notebook instances]

We have Alice and Bob, who work at different companies and both use our platform. When either of them tries to access JupyterHub, they have to go through authentication, after which their request is routed to their own instance of Jupyter notebook (or whichever other application they request and have access to; JupyterHub is just an example). This is possible due to three core components:

  • OpenID Connect (OIDC): used to verify the identity of the end-user and obtain the user permissions described above
  • External Authentication Server (EAS): a service that will perform the actual authentication and can use various authentication schemes (OIDC in our case but it has many more)
  • Istio: a service mesh used for routing user requests to the appropriate backends, securing traffic inside kubernetes and enforcing policies. Put another way: it checks which application the user wants to access, checks their permissions and allows or denies the request.

You can also see that each tenant has a separate namespace, which is needed for resource management, billing, security and routing (discussed later in more detail).

Technical overview

Let's talk a bit more about how the actual separation between tenants, their users and platform modules happens. First, we separate tenants and platform modules via DNS using the convention module_name.tenant_name.example.com, as a tenant may have only some modules enabled and we don't want to provision resources that are not in use.

We are using Istio Gateway and VirtualService resources to route traffic to the specific module. The Gateway specifies hostnames, ports and TLS certificates for incoming requests, while the VirtualService handles URL paths, request methods and destination backends. A sketch of such a pair follows below.
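As a rough illustration (not our exact manifests; hostnames, namespaces and backend names are placeholders), a Gateway/VirtualService pair for one tenant module could look like this:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: tenanta-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway   # bind to the default ingress gateway
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        credentialName: tenanta-tls-cert   # TLS certificate for this tenant's hosts
      hosts:
        - "datalab.tenanta.example.com"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: datalab
  namespace: tenanta
spec:
  hosts:
    - "datalab.tenanta.example.com"
  gateways:
    - istio-system/tenanta-gateway   # <namespace>/<gateway> reference
  http:
    - route:
        - destination:
            host: jupyterhub   # the module's service in the tenant namespace
            port:
              number: 8000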

Before requests can reach any endpoint inside the cluster, we have to make sure they are authenticated, so we added an authorization (authz) Envoy filter to the Istio gateway. Envoy filters allow extending the proxy's functionality with custom logic. In our case, incoming requests are redirected from the gateway to the EAS service, which in turn talks to our identity provider using tenant-specific client credentials.
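To give a flavour of that filter, here is a heavily simplified sketch. It assumes an Istio 1.x-era EnvoyFilter with Envoy's v2 ext_authz HTTP filter and an EAS service reachable at eas.eas.svc.cluster.local:8080; the filter schema, cluster naming and the wiring of the EAS verify endpoint all vary between versions, so treat this as a shape rather than a drop-in manifest:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: eas-ext-authz
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: GATEWAY
        listener:
          filterChain:
            filter:
              name: envoy.http_connection_manager
      patch:
        operation: INSERT_BEFORE   # run before the router filter
        value:
          name: envoy.ext_authz
          config:
            http_service:
              server_uri:
                uri: http://eas.eas.svc.cluster.local:8080
                cluster: outbound|8080||eas.eas.svc.cluster.local
                timeout: 5s
              # forward the cookies/headers EAS needs to authenticate the user;
              # the EAS verify path and config_token reference are omitted here
              authorization_request:
                allowed_headers:
                  patterns:
                    - exact: cookie
                    - exact: authorization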

One great thing about EAS is that with a single instance we can configure multiple OIDC client connections (one per tenant). In EAS, all information about the connection to the identity provider is embedded in a config_token that has to be generated in advance and then provided to your reverse proxy (via the EnvoyFilter in our case). Embedding those tokens in the URL makes the config ugly and you may hit URL length limits. Luckily for us, EAS has a notion of server_side tokens, which stores tokens on the backend and puts only a reference to each token in the proxy configuration. Nevertheless, we didn't want to configure that reference manually for each tenant, so my colleague edited the EAS code a bit to fetch token references dynamically based on a domain-name regex, and we are now discussing how this feature can be added to the main repository. More specifically, if the domain name matches *.tenantA.example.com, we fetch the tenantA config_token from the backend and use it to communicate with our identity provider. EAS is a great project and its sole maintainer Travis Hansen is super responsive, so I encourage you to check it out.

Now, after the authentication process is done, we have to check whether a given user is allowed to access the requested module. Information about the user is provided via claims in the OIDC id_token, which is encoded as a JSON Web Token (JWT). If I decode such a token for my user, it looks similar to this stripped-down version:

{
  "nbf": 1568720155,
  "exp": 1568723755,
  "name": "lushpenko",
  "email": "maksym.lushpenko@hal24k.com",
  "current_tenant": "tenantA",
  "permissions": [
    "datalab",
    "dataflow"
  ]
}

The important bits are current_tenant, name and permissions: this info should be enough to decide whether to allow or deny a user request. Our EAS token configuration in generate-config-token.js for each tenant looks as follows:

eas: {
  plugins: [
    {
      type: "oidc",
      issuer: {
        discover_url: "https://example.com/.well-known/openid-configuration",
      },
      client: {
        client_id: "tenantA",
        client_secret: "tenantSecret",
      },
      scopes: ["openid", "profile", "email", "login_info"], // must include openid
      redirect_uri: "https://auth.example.com:445/oauth/callback",
      features: {
        authorization_token: "id_token",
      },
      assertions: {
        exp: true,
        /**
         * assert the 'not before' attribute of the token(s)
         */
        nbf: false,
        iss: true,
        userinfo: [],
        id_token: [],
      },
      cookie: {
        domain: "example.com", // defaults to request domain, could do SSO with a more generic domain
      },
    },
  ], // list of plugin definitions, refer to PLUGINS.md for details
}

For those familiar with OIDC: we use the authorization code flow. Things to notice:

  • login_info is our custom scope, which signals our single sign-on (SSO) server to provide information about the login session
  • redirect_uri has a specific host and port (https://auth.example.com:445) because the EAS service should not be protected with authentication itself, and we already have an EnvoyFilter that enforces authentication for all https traffic on port 443
  • authorization_token is set to id_token, so we get all the relevant info in the Authorization header, which will be used later when deciding whether to allow the request
  • nbf: false is set because there is a time difference between our Identity Provider service and the EAS pod, so the not-before validation fails if we set it to true and authentication doesn't work. It's not ideal, but I've run into the same issue at multiple companies already, so there is a good chance you may hit it as well.
  • The cookie domain is set to example.com, so we can have SSO for all applications on subdomains like module_name.tenant_name.example.com

This setup completes the authentication flow and allows us to deploy simple applications per tenant that are shared between the tenant's users. An example architecture is shown below:

[Diagram: per-tenant architecture with shared applications after authentication]

Next week, we will publish the second part of this blog, in which I will explore how to limit user access within a single tenant to ensure users can only access their copy of the application.

A way to achieve multitenancy on kubernetes with Istio, External Authentication Server and OpenID Connect (Part 2 - Authorization)

In the previous blog post, I discussed how an External Authentication Server is used at HAL24K, together with OIDC and some parts of Istio, to authenticate users and direct them to a shared application within a specific kubernetes namespace. The blog ended with the receipt of an id_token from our Identity Provider. Today, I will look at how the id_token allows us to decide whether a given user has access to a specific application instance. As a reminder, this token contains information about the current_tenant, the name of the user and their permissions to access platform modules. The token itself is passed by the EAS service to our application via the Authorization header.

The id_token that we get from the Authorization header is still encoded as a JSON Web Token, so we need a component to parse it. Istio has a concept of End User Authentication, which basically works by extracting the JWT from the Authorization header (other custom headers are possible as well), validating it against the Identity Provider (OIDC) and parsing it into the request.auth object that can be used by other Istio components. A sketch of such an authentication policy follows below.
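As a sketch, assuming the Istio 1.x authentication.istio.io/v1alpha1 API (the issuer, JWKS URL and target service are placeholders):

apiVersion: authentication.istio.io/v1alpha1
kind: Policy
metadata:
  name: end-user-auth
  namespace: tenanta
spec:
  targets:
    - name: jupyterhub
  origins:
    - jwt:
        issuer: "https://example.com"
        jwksUri: "https://example.com/.well-known/jwks.json"
  # bind the request principal to the JWT, so its claims become
  # available to authorization as request.auth.claims
  principalBinding: USE_ORIGIN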

Now, having all the information about the user in the proper format, we have to decide whether to allow or deny the request. Istio is quite a complex and powerful piece of software and can make such decisions with its Authorization functionality, which works on both HTTP and TCP services. It is as impressive as it is flexible: ServiceRole allows you to specify which service you want to protect inside the cluster and exactly how (methods and paths), while ServiceRoleBinding specifies who can use a given ServiceRole, and that's actually the place where we decide to allow or deny the request based on the request.auth claims described above. So, essentially, for every module inside the tenant namespace, we define a ServiceRole, and then in the ServiceRoleBinding we check whether the current user is part of the tenant and has the appropriate permissions. A minimal sketch of such a pair follows.
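This sketch assumes the Istio 1.x rbac.istio.io/v1alpha1 API; the service name and claim values are placeholders matching the token example above:

apiVersion: rbac.istio.io/v1alpha1
kind: ServiceRole
metadata:
  name: datalab-access
  namespace: tenanta
spec:
  rules:
    - services: ["jupyterhub.tenanta.svc.cluster.local"]
      methods: ["GET", "POST"]
---
apiVersion: rbac.istio.io/v1alpha1
kind: ServiceRoleBinding
metadata:
  name: datalab-access-binding
  namespace: tenanta
spec:
  subjects:
    # allow only users of this tenant who hold the datalab permission
    - properties:
        request.auth.claims[current_tenant]: "tenantA"
        request.auth.claims[permissions]: "datalab"
  roleRef:
    kind: ServiceRole
    name: datalab-access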

Multi-user configuration

So far, what I’ve covered ensures that the user can log in and lets us decide whether the user, as part of a given tenant, is allowed to access a specific application. But even though we know the user has access to JupyterHub in general, we still have to make sure that access is limited to their specific Jupyter notebook. That's where the real difference in setup between our example application (JupyterHub) and a generic solution becomes visible. I will first consider JupyterHub, as I referred to it as our main example, and then take a look at the generic setup for any application.

JupyterHub

I mentioned in my first blog post (in the high-level solution overview) that JupyterHub has some integration with kubernetes. Kubernetes-specific tasks in JupyterHub are handled by kubespawner: it manages the life cycle of single-user Jupyter notebooks. To make kubespawner work in conjunction with EAS (basically, to make JupyterHub aware of our authenticated users), my colleague used this project, a remote-user authenticator for JupyterHub. With the following two lines of code in the jupyterhub_config.py file:

# RemoteUserAuthenticator comes from the remote-user authenticator project
# referenced above (e.g. jhub_remote_user_authenticator.remote_user_auth); import it first
c.JupyterHub.authenticator_class = RemoteUserAuthenticator
c.RemoteUserAuthenticator.header_name = "X-User-Id"

we are able to propagate the current user's info to JupyterHub via the X-User-Id header. The only question is: how can we pass this custom header and populate it with the actual username? Istio to the rescue, again. We are using an Istio policy rule to append the X-User-Id header based on the request.auth.claims["name"] value; this becomes possible only after enabling Istio Policy Enforcement, which is handled by the component called Mixer. This completes multi-user separation within a single tenant for JupyterHub.
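A sketch of such a rule, assuming the Mixer-based config.istio.io/v1alpha2 API with request header operations (the match expression is illustrative):

apiVersion: config.istio.io/v1alpha2
kind: rule
metadata:
  name: append-user-id-header
  namespace: istio-system
spec:
  # only for requests that carry authenticated user claims
  match: request.auth.claims["name"] != ""
  requestHeaderOperations:
    - name: X-User-Id
      values:
        - request.auth.claims["name"]

The complete diagram of this setup is shown below: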

[Diagram: JupyterHub multi-user setup with an Istio Rule appending the X-User-Id header]

Generic solution

To make new applications multitenant on kubernetes without writing your own authentication/authorization, you need a way to manage each application instance's access and routing outside of the application itself (in the previous example, JupyterHub manages access and routing for single-user notebooks). This kind of access management can be done via path-based or even header-based request routing. That's actually what JupyterHub does under the hood, but I will show how to do it with Istio.

Header-based routing can be done by matching a specific header to the username and then routing the request to a user-specific instance like jupyter-maksym, as sketched below. The drawback here is that you still need Istio's policy functionality to set that header, which involves extra work and complexity.
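For illustration, the relevant part of a VirtualService route could look like this (the header and destination names mirror the examples above):

# excerpt from a VirtualService: match on the user header set by the policy rule
http:
  - match:
      - headers:
          x-user-id:
            exact: maksym
    route:
      - destination:
          host: jupyter-maksym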

Path-based routing can be done by creating a VirtualService that rewrites a request from http://module_name.tenant_name.example.com/maksym to http://module_name-maksym, which will reach the module_name-maksym pod in the tenant_name namespace. A sketch follows below.
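A minimal sketch of such a VirtualService, keeping the placeholder naming from above (real hostnames and namespaces cannot contain underscores, so substitute your actual module and tenant names):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: module-name-per-user
  namespace: tenant_name
spec:
  hosts:
    - module_name.tenant_name.example.com
  gateways:
    - istio-system/tenant-gateway
  http:
    - match:
        - uri:
            prefix: /maksym
      rewrite:
        uri: /
      route:
        - destination:
            host: module_name-maksym   # the per-user service in the tenant namespace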

You may say that anyone could change the URL path to another user's and get access to an application instance they don't have permission for, but that's where Istio ServiceRole and ServiceRoleBinding come in, to make sure that only the user maksym has access to the module_name-maksym service.

The drawback of both approaches (header-based and path-based routing) is that you have to pre-create pods and roles for each user upfront. But that's something you have to do anyway if your application doesn't have a management layer and kubernetes integration as JupyterHub does. To clarify this point: when users go to JupyterHub and launch their notebook, JupyterHub takes care of pod creation and routing. For an application without such a management layer, several Istio components have to be in place before the user hits the application URL in the browser. One solution could be a webhook that creates those components when the user is granted the relevant permissions in your identity provider database. This is what a generic setup could look like for the path-based routing approach:


[Diagram: generic path-based routing setup with per-user Istio roles, bindings and auth policies]

As you can see, it is very similar to the previous diagram, but without the Istio Rule. Another difference is that the Istio roles, bindings and auth policies are user-specific, rather than a single policy/role/binding per platform module.

I assume that by applying user-specific labels to all application instances belonging to a user (e.g. the pods datalab-maksym and dataflow-maksym labelled with user=maksym), you could reduce the number of ServiceRole and AuthPolicy resources, but I haven't tested this myself.

Final remarks

I included a very information-dense diagram in the article banner while describing simplified versions of it throughout the blog. Knowing Istio’s complexity, I would be happy to spend more time walking through that diagram and posting full code snippets of the Istio configuration. Please let me know in the comments if you would like to see a more detailed JupyterHub setup and I can follow up on that in the next blog post.

Special thanks to Samuel Hessel, Tim Stokman and Travis Hansen for their collaboration in making this work.
