Bad boy(s) of DevOps: Azure

Thursday, December 7, 2023

Green IT: Kubernetes to Optimize Systems for Efficiency

Green IT does not mean increased maintenance costs or lower performance. It just means that the system and infrastructure must be created carefully.

A cleaner version of this post is on the NorthCode site.

Background

The web store is probably the most common kind of interactive website. Our customer has one of the biggest web stores in Finland. It is international, with customers all over the world, so there is no such thing as a quiet day. There are only busy days and even busier days like Black Friday. The infrastructure is in Azure, so when we talk about Kubernetes we mean AKS.

When we started, the customer didn’t have any autoscaling. If the site started to slow down, more virtual machines were added. Nobody knew how much CPU or memory the application actually used, so the virtual machines were quite big and expensive. In the end there were 14 virtual machines to make sure that even Black Friday had enough computing power. That is an expensive way to keep the system running during slower times.

Planning and action

What are the steps to lower the expenses and CO2 emissions? The first one is the boring stuff: going through how the application is built. Luckily their application is stateless, which means that the shopping cart is not lost if the load balancer decides to route the traffic to another virtual machine. The application was also already containerized. One container per virtual machine is sometimes a good strategy.

The next step was to investigate how the instances were actually used. A good rule of thumb is that at least 60% of memory and CPU should be in use at peak times. There is one good way to find out: performance testing. We pushed as much traffic at the application as it could stand and kept going until it crashed. The result was a bit depressing. The maximum memory use was under 10%. CPU usage wasn’t much better: most of the time the application used less than 50% of a single core. So the 4-core instances with 64 GB of memory didn’t improve the performance.
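To make those numbers concrete, here is a rough sketch of how many application containers would fit on one of the old virtual machines. The per-container budgets (500 millicores and roughly 10% of the memory) are just the upper bounds from the test above, and the function name is made up for this example. It also ignores what the operating system and Kubernetes itself reserve, so treat it as a back-of-the-envelope estimate only.

```python
def containers_per_vm(vm_millicores, vm_memory_mib,
                      container_millicores, container_memory_mib):
    """Return how many containers fit on one VM, limited by the scarcer resource."""
    by_cpu = vm_millicores // container_millicores
    by_memory = vm_memory_mib // container_memory_mib
    return min(by_cpu, by_memory)


# Old VM size from the post: 4 cores and 64 GB of memory.
# Assumed per-container budget: under 50% of one core, under 10% of the memory.
fit = containers_per_vm(
    vm_millicores=4000,
    vm_memory_mib=64 * 1024,
    container_millicores=500,
    container_memory_mib=6553,
)
print(f"Containers per VM: {fit}")
```

With these assumptions the CPU is the limiting resource, and a single big virtual machine could have hosted several copies of the application instead of one.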

At this point we had some idea of how the application behaves. It needs autoscaling, and more than one container could run on a single virtual machine. That makes it a good candidate for Kubernetes: it is stateless, it doesn’t need a whole virtual machine, and the test environment must be flexible.

Implementation

The first step is to create a Kubernetes cluster for testing and see how well the application works there. In testing you don’t have to think about reliability, so a single system node is enough. The workload runs on agent nodes, and the agent node pool autoscales: when there are more pods to deploy than the current agent node pool can run, new instances are created. For reliability and security reasons the system node pool is reserved for the core components of Kubernetes.

When we were sure that the Kubernetes installation worked, we set up the production AKS cluster. We already knew the CPU and memory usage, so it was easy to set proper limits in the Kubernetes specs. The system node pool must be sized so that a single failure doesn’t take down the whole cluster; three system nodes is a good number for that. We also calculated the usage during a normal day. For that load the agent node pool could consist of a single instance, much lighter than those in the original setup. But that would be a single point of failure and would risk the stability of the system, so for normal times we set the pool to 2 virtual machines.

Autoscaling was tested (again) with the performance tests. It scaled up the pods automatically as it was supposed to, and it also scaled up the agent node pool. After the heavy traffic stopped, it scaled the pods and the agent node pool back down.
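The post doesn't say which autoscaler did the pod scaling, so the following is only a sketch that assumes the standard Kubernetes Horizontal Pod Autoscaler. Its documented rule is that the desired replica count is the current count multiplied by the ratio of the measured metric to the target metric, rounded up. The 60% CPU target below is a hypothetical value borrowed from the rule of thumb earlier, not the customer's real setting.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric):
    """HPA scaling rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)


# Traffic spike: 2 pods running at 90% CPU against an assumed 60% target.
scale_up = desired_replicas(current_replicas=2, current_metric=90, target_metric=60)

# Traffic is gone: 6 pods idling at 20% CPU against the same target.
scale_down = desired_replicas(current_replicas=6, current_metric=20, target_metric=60)

print(scale_up, scale_down)
```

The real autoscaler also applies a tolerance and stabilization windows before acting, which this sketch leaves out.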

Did this have any other impact? Yes. Testing improved. The CI pipeline was built so that each pull request created its own test environment. The feedback cycle got shorter and there were fewer issues in production. Developers, testers and marketing were happy. We were also able to start experimenting with the architecture. We managed to improve the caching, which meant fewer hits to the CPU-intensive backend operations. Traffic to the 3rd party APIs was reduced as well.

And some numbers

Started with 14 extra large virtual machines (~7000€/month).

Ended with 10 - 16 medium VMs (~1300€/month).
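For anyone who wants the saving spelled out, the arithmetic on the two approximate monthly figures above goes like this. The variable names are only for this example.

```python
before_eur_per_month = 7000  # 14 extra large virtual machines
after_eur_per_month = 1300   # 10 - 16 medium virtual machines

monthly_saving = before_eur_per_month - after_eur_per_month
saving_percent = round(100 * monthly_saving / before_eur_per_month, 1)
yearly_saving = 12 * monthly_saving

print(f"Saving: {monthly_saving} EUR/month ({saving_percent}%), {yearly_saving} EUR/year")
```

Since the inputs are rounded estimates, the result is an estimate too: roughly four fifths of the original bill.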

Conclusion

So “Green IT” does not mean “Expensive IT”. It means better utilized IT. That usually also means more cost consciousness.

Thursday, August 26, 2021

Azure RBAC in use

Azure identity and access management is the dragon. He sits on a pile of gold. You have to beat him to win the gold, or in this case to get your Azure secured but still easy to use for developers and DevOps guys. Here are some ideas on how to beat the beast.

The first and most important thing is to forget AD and Azure AD when you think about Azure RBAC. AAD stores some of the identities; it is actually the identity provider for the Azure RBAC users and groups. It does not store the RBAC rules. RBAC is the authorization method for Azure.

Now that it is clear what AAD is not, we can go deeper into Azure RBAC.

Let’s start with an example:

az role assignment create --role "User Access Administrator" \
    --assignee testuser_1@myazuredomain.onmicrosoft.com \
    --scope /subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/test-group

The parts in this RBAC role assignment are:

Assignee - who gets the role. This can be a user, a group or a service principal. It is recommended to assign roles to user groups instead of individual users.
Role - the list of access rights which the assignee gets. Azure has built-in roles, which can be used with all AAD subscriptions. Custom roles are also possible, but they require a Premium P1 or P2 AAD subscription.
Scope - the ‘path’ to the resources which this role covers for this assignee.

The role applies to everything under the scope path. The previous role assignment allows testuser_1 to modify the access rights of all resources under the resource group test-group.

If the resource structure is the following:

Subscription 11111111-2222-3333-4444-555555555555
    Resource group: test-group
        Virtual network: test-network
            Subnet: test-subnet
    Resource group: another-group
        Virtual network: another-network

The role assignment covers the resources test-group, test-network and test-subnet. It doesn’t allow the user to do any user administration in the resource group another-group.
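The "everything under the path" rule can be pictured as a prefix check on resource IDs. This is only a sketch of the idea, not how Azure evaluates access internally, and the function name is made up for this example. It compares whole path segments so that a group named test-group-2 would not accidentally match test-group, and it ignores case because Azure resource IDs are case-insensitive.

```python
def in_scope(scope, resource_id):
    """Return True if resource_id is the scope itself or lies under it."""
    scope_parts = scope.strip("/").lower().split("/")
    resource_parts = resource_id.strip("/").lower().split("/")
    return resource_parts[:len(scope_parts)] == scope_parts


SUBSCRIPTION = "/subscriptions/11111111-2222-3333-4444-555555555555"
scope = SUBSCRIPTION + "/resourceGroups/test-group"

subnet = (SUBSCRIPTION + "/resourceGroups/test-group/providers/Microsoft.Network"
          "/virtualNetworks/test-network/subnets/test-subnet")
other_network = (SUBSCRIPTION + "/resourceGroups/another-group/providers"
                 "/Microsoft.Network/virtualNetworks/another-network")

print(in_scope(scope, subnet))         # covered by the role assignment
print(in_scope(scope, other_network))  # outside the scope
```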

A user with the role "User Access Administrator" does not have any administrator access to AAD itself. He cannot change users' passwords and he can’t create users in AAD. However, AAD has an option (enabled by default) to allow guest invites; it can be disabled from the AAD User Settings. The user can also create new service principals within the scope where he is the User Access Administrator.

Examples

Creating the service principal with the scope:

az ad sp create-for-rbac --name testServicePrincipal \
    --scope /subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/test-group

Adding the role for the service principal:

az role assignment create --role "Network Contributor" \
    --assignee aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
    --scope /subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/test-group/providers/Microsoft.Network/virtualNetworks/test-network

Attempts to create the role assignment or service principal outside the user’s scope will fail.

Friday, August 14, 2020

Kubernetes (and Azure AKS) RBAC description

Part of Kubernetes security is using RBAC for authorization. There are plenty of short articles about it, but I didn’t find any good and complete “how to” instructions. I hope this will be one. If you want me to clarify something, please add it to the comments. This is written from the Azure AKS point of view, with AKS integrated with AAD, but many things are the same in other clusters.

Here’s a description of the parts a Kubernetes role and role binding have:

In this terminology the “Role” describes what the bound identity can do. The identity can be a user, a group or a service account. Roles can be bound to multiple identities. Let’s look at things backwards and start from the role binding.

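kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
 name: <name for the binding>
 namespace: <namespace of the role> # the binding applies only in this namespace
subjects:
 - kind: Group
   name: <AAD group ID> # object ID of the AAD group
   apiGroup: rbac.authorization.k8s.io
roleRef:
 kind: Role
 name: <name of the role>
 apiGroup: rbac.authorization.k8s.io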

A role binding describes which identity can use the role. For humans the identity is a group or a single user. The service account is for pods which have to access the API server. User is a single user (like testuser_1@youaaddomain.onmicrosoft.com). Binding a role to a single user is useful only if you have very few users (less than two); with more than one user it becomes complex and time consuming to maintain. With the AKS AAD integration the Group name is the object ID of the AAD group, e.g. 6ec5b8f7-823c-491c-97d6-977ae68afbf3.

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
 namespace: <mandatory for Role>
 name: <name of the role>
rules:
 - apiGroups:
     - ""
     - <some other API group>
   resources:
     - <resource>
     - <another resource>
   verbs:
     - <verb 1>
     - <verb 2>
 - apiGroups: # Another block - there can be any number of objects
     - <some other API group>
   resources:
     - <resource>
     - <another resource>
   verbs:
     - <verb 1>
     - <verb 2>

The verbs are the actions which are allowed. Resources have the following verbs: create, get, list, watch, update, patch, delete and deletecollection. In addition to those there are several special verbs:

use verb for the podsecuritypolicies in the policy API group
bind and escalate verbs on roles and clusterroles resources in the rbac.authorization.k8s.io API group
impersonate verb on user

You have to read the API documentation to see what exactly each verb does for each resource.

The API group is the group the resource belongs to, and it must be listed in the apiGroups part. The empty string means the core API group; all others must be named explicitly. When a resource is looked up, Kubernetes checks all API groups which have been defined for this rule. If you have defined the ‘*’ resource, any resource from the defined API groups matches the rule.
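Here is a small sketch of how that matching can be pictured. It is a simplified model written for this post, not the real Kubernetes authorizer: it ignores resourceNames, subresources and non-resource URLs, and the function names and example rules are made up. A request is allowed if any single rule matches its API group, resource and verb, where ‘*’ matches anything.

```python
def rule_allows(rule, api_group, resource, verb):
    """Check one rule; every part (API group, resource, verb) must match."""
    def matches(allowed, value):
        return "*" in allowed or value in allowed

    return (matches(rule["apiGroups"], api_group)
            and matches(rule["resources"], resource)
            and matches(rule["verbs"], verb))


def role_allows(rules, api_group, resource, verb):
    """A role allows the request if at least one of its rules allows it."""
    return any(rule_allows(rule, api_group, resource, verb) for rule in rules)


# Hypothetical rules: read pods in the core group, get anything in "apps".
rules = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"], "resources": ["*"], "verbs": ["get"]},
]

print(role_allows(rules, "", "pods", "list"))            # core group, allowed verb
print(role_allows(rules, "", "pods", "delete"))          # verb not listed
print(role_allows(rules, "apps", "deployments", "get"))  # '*' resource in apps
print(role_allows(rules, "", "deployments", "get"))      # wrong API group
```

The last call shows why the API group matters: deployments live in the apps group, so asking for them in the core group matches nothing.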

The resources are divided into two separate groups: namespace resources and cluster resources. For example, a pod is a namespace resource while a node is a cluster resource. The ClusterRole is the only object which can allow access to cluster resources. A ClusterRole can also allow access to namespace resources; in that case the access covers the resource in all namespaces. The binding is done with the ClusterRoleBinding.

If access to namespace resources should be limited to a specific namespace, the Role is used. The Role defines which namespace it applies to. The binding is done with the RoleBinding.

I've created the Kubernetes RBAC Matrix for better readability.