New to kPow 79 are temporary policies and a whole suite of new admin features!
Temporary policies allow admins to assign access control policies for a fixed duration. A common use case is granting a user TOPIC_INSPECT access to read data from a topic for an hour while resolving an issue in a production environment.
This blog post introduces temporary policies with an all-too-common real-world scenario.
You wake up one morning to a dreaded sight: a poison message has taken down one of your services! Your team decides the simplest solution is to skip the message by incrementing your consumer group's offset for the topic.
Now here's the problem: access to production is limited, and for such a simple action (incrementing the offset) a team member generally has to jump through the hoops of configuring the VPN, connecting to the jumpbox, and making sure they execute the right combination of bash commands against the Kafka cluster.
Such operations always feel unnecessarily time-consuming, brittle, and frustrating in a time-critical moment when you need to restore a production service. On top of that, the jumpbox generally has full access to the Kafka cluster, and there is no audit log recording the actions performed.
Temporary policies in combination with kPow's existing Role-based access control and powerful mutation actions aim to improve this experience and give teams the confidence they need to easily effect change in a secured environment like production when things go wrong.
In this example, two roles come from our identity provider: devs and owners.
We will make anyone with the role owners an admin for kPow and give them GROUP_EDIT access to the production cluster. The devs role will be implicitly denied if we don't configure any access policies for it.
Our example RBAC yaml file might look something like:
admin_roles:
  - "owners"
authorized_roles:
  - "owners"
  - "devs"
policies:
  - actions:
      - GROUP_EDIT
    effect: Allow
    resource:
      - "*"
    role: "owners"
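To make the implicit deny concrete, here is a minimal Python sketch of how policies like these could be evaluated. It is an illustration only, not kPow's implementation: the `is_allowed` helper and the dictionary layout are hypothetical and simply mirror the RBAC file above, and the sketch only models Allow policies.

```python
# Hypothetical in-memory copy of the policies from the RBAC file above.
POLICIES = [
    {"actions": ["GROUP_EDIT"], "effect": "Allow", "resource": ["*"], "role": "owners"},
]


def is_allowed(role, action, policies=POLICIES):
    """Return True only if some policy explicitly allows the action for the role."""
    for policy in policies:
        if (
            policy["role"] == role
            and action in policy["actions"]
            and policy["effect"] == "Allow"
        ):
            return True
    # No matching Allow policy was found, so the action is implicitly denied.
    return False


print(is_allowed("owners", "GROUP_EDIT"))  # True
print(is_allowed("devs", "GROUP_EDIT"))    # False
```

The devs role is authorized to log in, but because no policy names it, every action it attempts falls through to the final `return False`.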
Today is the unfortunate day when your team has to make a change to the production cluster.
Your team lead has been briefed on the plan and has decided to grant the devs role Allow access for GROUP_EDIT on the cluster.
This has been done through the Temporary Policies section of kPow's settings.
Once the policy is created, all team members are notified through Slack that it now exists.
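Conceptually, a temporary policy behaves like an ordinary policy with an expiry time attached. The Python sketch below illustrates that idea under stated assumptions: the `TemporaryPolicy` class, its field names, the one-hour duration, and the timestamps are all hypothetical and do not describe kPow's internal model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TemporaryPolicy:
    # Hypothetical structure for illustration only.
    role: str
    action: str
    effect: str
    expires_at: datetime

    def is_active(self, now: datetime) -> bool:
        # The policy stops applying as soon as the expiry time is reached.
        return now < self.expires_at


def is_allowed(role, action, temporary_policies, now):
    """Allow only while a matching, unexpired Allow policy exists."""
    return any(
        p.role == role
        and p.action == action
        and p.effect == "Allow"
        and p.is_active(now)
        for p in temporary_policies
    )


created = datetime(2021, 1, 1, 9, 0, tzinfo=timezone.utc)  # arbitrary example time
grant = TemporaryPolicy("devs", "GROUP_EDIT", "Allow", created + timedelta(hours=1))

print(is_allowed("devs", "GROUP_EDIT", [grant], created + timedelta(minutes=30)))  # True
print(is_allowed("devs", "GROUP_EDIT", [grant], created + timedelta(hours=2)))     # False
```

Once the expiry passes, the devs role falls back to being implicitly denied without anyone having to remember to revoke the grant.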
Now a member of your team has been tasked with the job of incrementing the offset of the consumer group for the problematic topic.
The dev looks at the application logs and notices that partition 3 of topic tx_trade1 contains the poison message. The erroring consumer group is named trade_b2.
The developer then opens kPow, navigates to the "Workflows" tab, and selects the consumer group.
From within the consumer group view, the dev selects the partition and selects "Skip Offset".
This will schedule the mutation. The offset will be incremented once someone on the team scales down the trade_b2 service.
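The reason for the wait is that Kafka's own tooling expects a consumer group to be inactive before its offsets are reset. The sketch below models the scheduled skip in plain Python, purely as an illustration: `apply_skip_offset`, the `active_members` count, and the committed offset 1042 are invented for this example and are not kPow's API or real data.

```python
def apply_skip_offset(committed, topic, partition, active_members):
    """Return (offsets, applied); the skip is applied only once the group is idle."""
    if active_members > 0:
        # The group still has live consumers: leave the offsets untouched
        # and keep the mutation pending.
        return committed, False
    updated = dict(committed)
    key = (topic, partition)
    updated[key] = updated[key] + 1  # step past the poison message
    return updated, True


offsets = {("tx_trade1", 3): 1042}  # 1042 is a made-up committed offset

# While trade_b2 is still running, nothing changes.
offsets, applied = apply_skip_offset(offsets, "tx_trade1", 3, active_members=2)
print(applied, offsets[("tx_trade1", 3)])  # False 1042

# After the service is scaled down, the offset moves forward by one.
offsets, applied = apply_skip_offset(offsets, "tx_trade1", 3, active_members=0)
print(applied, offsets[("tx_trade1", 3)])  # True 1043
```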
After the production incident, you can use kPow to aid in your incident's post-mortem.
All actions that were undertaken were persisted in kPow's audit log topic for data governance.
You can see a full recorded history of all actions that were taken to restore the production service.
Inspecting the audit log message reveals the offset that was skipped.
You can use kPow's data inspect functionality to view the poison message to help investigate why that message took down the consumer group.