Thursday, December 7, 2023

Green IT: Kubernetes to Optimize Systems for Efficiency

Green IT does not mean increased maintenance costs or lower performance. It just means that the system and the infrastructure must be designed and built carefully.

A cleaner version of this post can be found on the NorthCode site.

Background

The web store is probably the most common kind of interactive website. Our customer has one of the biggest web stores in Finland. It's international, with customers all over the world, so there is no such thing as a quiet day. There are only busy days and even busier days like Black Friday. The infrastructure is in Azure, so when we talk about Kubernetes we mean AKS.

When we started, the customer didn't have any autoscaling. If the site started to slow down, more virtual machines were added. Nobody knew how much CPU or memory the application actually used, so the virtual machines were quite big and expensive. In the end there were 14 virtual machines, to make sure that even Black Friday had enough computing power. That's an expensive way to keep the system running during the quieter times.

Planning and action

What are the steps to get the expenses and CO2 emissions down? The first one is the boring stuff: going through how the application is built. Luckily this application is stateless, which means that if the load balancer routes the traffic to another virtual machine, the shopping cart is not lost. The application was also already packaged as a container. One container per virtual machine is sometimes a good strategy.

The next step was to investigate how the instances were actually used. A good rule of thumb is that at least 60% of memory and CPU should be in use at peak times. There is one good way to investigate this: performance testing. Run as much traffic against the system as it can stand, until the application crashes. The result was a bit depressing. The maximum memory use was under 10%, and CPU usage wasn't much better: most of the time the application used less than 50% of a single core. So the 4-core instances with 64 GB of memory did nothing to improve performance.

At this point we had some idea of how the application behaves: it needs autoscaling, and it can run more than one container on a single virtual machine. That makes it a good candidate for Kubernetes. It is stateless, it doesn't need a whole virtual machine, and the test environment must be flexible.

Implementation

The first step is to create a Kubernetes cluster for testing and see how well the application works there. In testing you don't have to think about reliability, so a single system node is enough. The workload runs on agent nodes, and the agent node pool autoscales: when there are more pods to deploy than the current agent node pool can run, new instances are created. For reliability and security reasons the system node pool is reserved for the core components of Kubernetes.
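A hedged sketch of what adding such an autoscaling agent (user) node pool could look like with the Azure CLI; the resource group, cluster and node pool names below are made up:

az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name agentpool \
  --mode User \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 5

The cluster autoscaler then adds and removes virtual machines between those limits based on how many pods are waiting to be scheduled.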

When we were sure that the Kubernetes installation worked, we set up the production AKS cluster. We already knew the CPU and memory usage, so it was easy to set proper requests and limits in the Kubernetes specs. The system node pool must be sized so that a single failure doesn't take down the whole cluster; three system nodes is a good amount for that. We also calculated the usage during a normal day. The agent node pool could have contained a single, much lighter instance than the original setup, but that would have been a single point of failure and a risk to the stability of the system, so for normal traffic we set it to two virtual machines.
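A minimal sketch of the kind of container requests and limits this measuring leads to; the numbers below are illustrative, not the customer's actual figures:

# Fragment of a container spec in the Deployment
resources:
  requests:
    cpu: "250m"      # what the scheduler reserves per pod
    memory: "512Mi"
  limits:
    cpu: "500m"      # hard ceiling before throttling
    memory: "1Gi"    # hard ceiling before the pod is OOM-killed

With realistic requests the scheduler can pack several pods onto one small virtual machine instead of wasting a 4-core, 64 GB instance on a single container.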

Autoscaling was tested (again) with the performance tests. It scaled up the pods automatically, as it was supposed to, and it also scaled up the size of the agent node pool. After the excessive traffic stopped, it scaled the pods and the agent node pool back down.
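Pod-level scaling like this is typically handled by a HorizontalPodAutoscaler. A minimal sketch, assuming a Deployment called webstore and CPU-based scaling; the numbers are illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webstore
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webstore
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60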

Did this have any other impact? Yes. Testing improved: the CI pipeline was built so that each pull request created its own test environment, which shortened the feedback cycle. Issues in production were reduced. Developers, testers and marketing were happy. We were also able to start experimenting with the architecture. We managed to improve the caching, which meant fewer hits to the CPU-intensive backend operations, and the traffic to 3rd party APIs was also reduced.

And some numbers

We started with 14 extra large virtual machines, costing roughly 7000 €/month.

We ended with 10-16 medium VMs, costing roughly 1300 €/month.

Conclusion

So "Green IT" does not mean "Expensive IT". It means better utilized IT, and that usually also means more cost consciousness.


Thursday, February 10, 2022

JavaScript/TypeScript Promises

Summary - two rules: always return the Promise, and always resolve or reject the Promise. If that is not enough, continue reading.

I've tried to understand how the asynchronous side of TypeScript and JavaScript works and how to avoid the problems related to it. My background is in development of concurrent systems: I'm used to locks, semaphores and all the other machinery that matters in concurrent programming. But now I have to survive with asynchronous JavaScript.

Most often my use case is that I read data from somewhere, process it, and then do some new tricks with that data. So it's much like a pipeline: you can't process the data before you've read it, and you can't do the new tricks with the data before the processing has finished. You also have to be sure that you don't exit before all the data is handled.

The best way to manage this situation is to use the Promise class. Using it requires some understanding. I tried to google tutorials, but I always got lost. The rest of this blog article is for myself, but I hope it will help others too.

The basic structure of a promise chain is:

promise.then(...).then(...).then(...).catch(...).finally(...)

To get that chain working properly we have to take a look at real code. The code is simple: I have an array of numbers, and I want each "processor" to wait that many seconds. First we write to the console when we enter the processor, then when we come out of it, and at the end we print "Goodbye" for the developer.

 1  const array1 = [9, 4, 6, 7, 6, 8];
 2
 3  async function testfunc(test: string, tst: number) {
 4    console.log(`In ${test}`);
 5    return new Promise((res, rej) => {
 6      setTimeout(res, tst * 1000);
 7    }).then(() => {
 8      console.log(`Out ${test}`);
 9      return new Promise((res) => {
10        res(1);
11      });
12    });
13  }
14
15  const testidata = array1.map((item, index) =>
16    testfunc(`Number ${index}`, item)
17  );
18
19  Promise.all(testidata).then(() => {
20    console.log("Goodbye");
21  });

There are two rules for creating a promise chain that really works.

Rule number one: the Promise must ALWAYS be resolved or rejected. The resolve and reject functions are the parameters of the executor function given to the Promise constructor. You can see this on lines 5 and 9.

Rule number two: always return the Promise if you want the promise chain to continue. This is shown on line 9.

Promise.all() resolves when every promise in the list has resolved, and rejects as soon as any of them rejects or throws. Promise.allSettled() waits until every promise has settled, whether it resolved or rejected. It's easy to test the other cases if you just remember the two rules above.
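As a small illustration of that difference (the values and delays below are made up), compare the two:

const ok = new Promise<string>((res) => setTimeout(() => res("ok"), 100));
const bad = new Promise<string>((_res, rej) =>
  setTimeout(() => rej(new Error("boom")), 200)
);

// Promise.all rejects as soon as one promise rejects.
Promise.all([ok, bad])
  .then((values) => console.log("all:", values))
  .catch((err) => console.log("all failed:", err.message));

// Promise.allSettled waits for every promise and never rejects.
Promise.allSettled([ok, bad]).then((results) =>
  results.forEach((r) => console.log("settled:", r.status))
);

Promise.all never reaches its then branch here, while Promise.allSettled reports the status of both promises.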

Hopefully this helps others too. I also hope I'll end up on this page the next time I start wondering how Promises work.


Thursday, August 26, 2021

Azure RBAC in use

Azure identity and access management is the dragon: it sits on a pile of gold, and you have to beat it to get the gold. Or, in our case, to get your Azure secured while keeping it easy to use for developers and DevOps people. Here are some ideas on how to beat the beast.

The first and most important thing is to forget AD and Azure AD when you think about Azure RBAC. AAD stores some of the identities; it's actually the identity provider for the Azure RBAC users and groups. It does not store the RBAC roles or role assignments. RBAC is the authorization method for Azure.

After we have cleared our understanding of what AAD is not, we can go deeper into Azure RBAC.

Let’s start with the example:

az role assignment create --role "User Access Administrator" \
  --assignee testuser_1@myazuredomain.onmicrosoft.com \
  --scope /subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/test-group

The parts in this RBAC role assignment are:

  • Assignee - who gets the role. This can be a user, a group or a service principal. It's recommended to assign roles to groups instead of individual users.
  • Role - a named list of access rights which the assignee gets. Azure has built-in roles which can be used with all AAD subscriptions. Custom roles are also possible, but they require a Premium P1 or P2 AAD subscription.
  • Scope - the 'path' to the resources which this role assignment covers.

The scope is the path to the resources, and the role applies to everything under that path. The role assignment above allows testuser_1 to modify the access rights of all resources under the resource group test-group.
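Because the recommendation is to assign roles to groups rather than to individual users, the same assignment could be made to an AAD group instead. A hedged sketch (the group name is made up, and the object ID placeholder must be replaced with the real one):

az ad group create --display-name test-group-admins --mail-nickname test-group-admins

az role assignment create --role "User Access Administrator" \
  --assignee-object-id <object ID of test-group-admins> \
  --assignee-principal-type Group \
  --scope /subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/test-group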

If the resource structure is the following:

  • Subscription 11111111-2222-3333-4444-555555555555
    • Resource group: test-group
      • Virtual network: test-network
        • Subnet: test-subnet
    • Resource group: another-group
      • Virtual network: another-network

the role assignment covers the resources test-group, test-network and test-subnet. It doesn't allow the user to do any user administration in the resource group another-group.

If the user has the role "User Access Administrator", he does not get any administrator access to AAD itself: he cannot change user passwords or create new users in AAD. But AAD has an option (enabled by default) that allows guest invites; it can be disabled from the AAD User Settings. The user can also create new service principals within the scope where he is User Access Administrator.

Examples

Creating a service principal with a scope:

az ad sp create-for-rbac --name testServicePrincipal \
  --scope /subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/test-group

Adding a role for the service principal:

az role assignment create --role "Network Contributor" \
  --assignee aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
  --scope /subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/test-group/providers/Microsoft.Network/virtualNetworks/test-network

Attempts to create a role assignment or a service principal outside the user's scope will fail.


Friday, January 15, 2021

Kubernetes Service Account debugging notes

Kubernetes and RBAC are horrible monsters, and debugging them is a time-consuming activity. Here are a few hints on how I do it.

First you have to find the secret which stores the token of the service account. This happens with the command:

kubectl get sa <service account name> -n <name space> \
-o=jsonpath='{.secrets[*].name}'

I'm using the Helm Data Tool to create a proper Kubernetes configuration file. It needs the access token and the server certificate, and also the URL of the Kubernetes API server. The ca.crt and token files must be in the same directory; this example creates them in the directory ./tmp.

The next step is to extract the access token and the certificate. First the certificate:

kubectl get secret my-secret-12345 -n ingress \
-o=jsonpath="{.data['ca\.crt']}" | base64 -d > tmp/ca.crt

Then the access token:

kubectl get secret my-secret-12345 -n ingress \
-o=jsonpath='{.data.token}' | base64 -d >tmp/token

If we're now in the directory ~/helm-data-tool, and kubeconfig-creator.sh is in the bin directory, the Kubernetes configuration file is created with the command:

bin/kubeconfig-creator.sh -b tmp -h https://my-api:443 >sa-kubeconfig

kubectl has a global parameter --kubeconfig, and you can give sa-kubeconfig to it. After that you can test your API calls, e.g. to check if the service account has global access to list the roles:

kubectl get role -A --kubeconfig=sa-kubeconfig 
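Another quick check, which doesn't need a separate kubeconfig at all, is kubectl auth can-i with impersonation. This is a hedged sketch and assumes your own credentials are allowed to impersonate service accounts:

kubectl auth can-i list roles \
  --as=system:serviceaccount:ingress:<service account name> -A

It answers "yes" or "no" without you having to actually perform the API call as the service account.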

Helm does not support setting the configuration from the command line very well, but those commands which do support it have the option --kubeconfig:

helm upgrade -i --kubeconfig sa-kubeconfig …

These are my personal notes, but I hope you find them useful too. If you have your own hints on how to debug the Kubernetes configuration, please let me know.

Thursday, November 12, 2020

Microservices for better performance

I'm starting to be a fan of API-based communication and content loading. In this blog post I shortly describe why.

Let's take a blog page which is a bit like this page. It has the following components:

  • Menu
  • Content (this text+title)
  • Comments

Let’s first look at the life cycles of these parts:

  • Comments - they change whenever someone sends a comment, so on a popular blog they change quite often. Each blog entry has its own comments.
  • Menu - it changes when new content is published or titles are updated. The menu is practically the same on every page.
  • Content - every page has its own content and it doesn't change very often after it has been published. In most cases it doesn't change at all (well, maybe some typo fixes, but not much more than that).

First we have the traditional architecture which e.g. WordPress uses. It doesn't have any API: it just constructs the whole page on the server and returns it, so every page load fetches the menu, the content and the comments. You can't cache any of this data easily without risking that people miss new comments. And if you think it's possible to create the cache and then invalidate it whenever something changes, the process is quite complex. In pseudo code:

  • If the menu changes -> invalidate all pages which have the menu. This is a loop, and the invalidation process must know which pages have the menu.
  • If the content changes -> invalidate the content of that page.
  • If there is a new comment -> invalidate the content of that page.

The menu changes are expensive: after one, all page loads hit the backend for a while.

What if we create API-based communication instead? The 'static' web page is a bit of HTML without any content, plus JavaScript and CSS files. The APIs are Menu, Content and Comments. Below is the architecture picture of the system. The user's cache can be e.g. the internal cache of the browser or the proxy of an Internet Service Provider.



There's a good chance that the Content API never has to hit the real storage again after the content has been loaded for the first time. The content TTL in the local cache can be "forever", because we can easily invalidate it ourselves. The story for the remote caches is different: there the TTL can be e.g. 30 seconds, so the user's cache doesn't store the data for long, but instead of hitting our Content service it hits our local cache.

When the data in the Menu changes, we don't have to run a complex invalidation loop: there is only one call, which invalidates the cached menu for all pages. This simplifies the rules a lot. The rule for the local cache can be "forever", but for the users' caches it can be e.g. 30 seconds or even shorter.

The caching of the Comments API depends on its features. If it gives the user the possibility to modify or delete his comment, then this API cannot be cached for a logged-in user. There can be more complex rules for caching the Comments API: user logged in -> never cache; anonymous user -> always cache, but invalidate when a new comment is written.

A good microservice architecture can improve performance through good caching policies. The APIs have their own life cycles, and the caching rules should follow them. In many cases it's enough that the component sets proper caching headers, but to have different caching rules for the local cache and the user's cache, the application must be able to set those separately.
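As a hedged sketch of what those different rules could look like with standard Cache-Control headers (the values are illustrative, not measured):

# Content API: user's cache may keep it briefly, shared/local cache much longer
Cache-Control: public, max-age=30, s-maxage=86400

# Menu API: same idea, the shared cache is purged explicitly when the menu changes
Cache-Control: public, max-age=30, s-maxage=86400

# Comments API for a logged-in user: never cache
Cache-Control: private, no-store

max-age controls the user's cache, while s-maxage applies only to shared caches such as our own proxy, which is exactly the split described above.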

P.S. Good caching also lowers infrastructure costs and increases the reliability of the system.


Friday, August 14, 2020

Kubernetes (and Azure AKS) RBAC description


Part of Kubernetes security is using RBAC for authorization. There are plenty of short articles about it, but I didn't find any good and complete "how to" instructions. I hope this will be one; if you want me to clarify something, please add it to the comments. This is written from the Azure AKS point of view, with AKS integrated with AAD, but many things are the same in other clusters too.

Here's a description of what parts a Kubernetes role and role binding have.

In this terminology a "Role" describes what the bound identity can do. The identity can be a user, a group or a service account, and a role can be bound to multiple identities. Let's look at things backwards and start from the role binding.

kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: <name for the binding>
subjects:
- kind: Group
  name: <AAD group ID>
roleRef:
  kind: Role
  name: <name of the role>
  apiGroup: rbac.authorization.k8s.io


The role binding describes which identities can use the role. For humans the identity is a group or a single user; the service account is for those pods which have to access the apiserver. A user is a single user (like testuser_1@youaaddomain.onmicrosoft.com). Binding roles to single users is useful only if you have very few users; with more than a couple of users it becomes complex and time consuming to maintain. With AKS AAD integration the Group is the object ID of the AAD group, e.g. 6ec5b8f7-823c-491c-97d6-977ae68afbf3.

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: <mandatory for Role>
  name: <name of the role>
rules:
- apiGroups:
  - ""
  - <some other API group>
  resources:
  - <resource>
  - <another resource>
  verbs:
  - <verb 1>
  - <verb 2>
- apiGroups: # Another block - there can be any number of these rule objects
  - <some other API group>
  resources:
  - <resource>
  - <another resource>
  verbs:
  - <verb 1>
  - <verb 2>

The verbs are the actions which are allowed. The resources have the following verbs: create, get, list, watch, update, patch, delete, deletecollection. In addition to those there are several special verbs:
  • the use verb for podsecuritypolicies in the policy API group
  • the bind and escalate verbs on roles and clusterroles resources in the rbac.authorization.k8s.io API group
  • the impersonate verb on users
You have to read the API documentation to see what exactly each verb does for each resource.

The API group is the group which the resource belongs to. If the resource is a member of an API group, that group must be mentioned in the apiGroups part. The empty string means the core group; all others must be named explicitly. When a resource is looked up, Kubernetes checks all the API groups which have been defined for this rule. If you have defined '*' as the resource, any resource from the defined API groups matches this rule.

The resources are divided into two separate groups: namespaced resources and cluster resources. For example, a pod is a namespaced resource while a node is a cluster resource. The ClusterRole is the only object which can allow access to cluster resources. A ClusterRole can also allow access to namespaced resources; in that case the access applies to the resource in all namespaces. The binding is done with a ClusterRoleBinding.

If the intention is to give access to namespaced resources in a specific namespace, the Role is used. The Role defines which namespace it applies to, and the binding is done with a RoleBinding.
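As a concrete but hypothetical example, a role which lets an AAD group read pods and their logs in a single namespace could look roughly like this (the namespace, names and group ID are made up):

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: test
  name: pod-reader
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - pods/log
  verbs:
  - get
  - list
  - watch
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: test
  name: pod-reader-binding
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: 6ec5b8f7-823c-491c-97d6-977ae68afbf3
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io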

I've created the Kubernetes RBAC Matrix for better readability.

Wednesday, February 27, 2019

MFA, cross account roles and command line



One primary #AWS account #security tool is #IAM roles. My practice is that a user without MFA can't do anything: I force the user to assume a role before she can do anything. This can be a real pain if you have to manage multiple accounts. Terraform also has some "issues" with MFA, so assuming the role and setting the credentials into environment variables is the simplest solution.
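The MFA requirement itself lives in the trust policy of the role being assumed. A hedged sketch of such a policy (the account ID below is made up):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "Bool": { "aws:MultiFactorAuthPresent": "true" }
      }
    }
  ]
}

Without a valid MFA session the AssumeRole call is simply denied.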

The best tool to manage this chaos is awsume. Before using it you have to set up your credentials properly in the shared credentials files.

To ~/.aws/credentials I set up the "main account":

[mainaccount]
aws_access_key_id = <accesskey> 
aws_secret_access_key = <secret key> 

The IAM policy does not require MFA for this profile yet, but it doesn't allow many actions either. Actually, if MFA is not used, this account is only allowed to set up the virtual MFA device and change the console password. (But I'll write another post about that later…)

In ~/.aws/config I have:

[profile dev-website-admin]
role_arn = arn:aws:iam::1234566543321:role/Admin
source_profile = mainaccount
mfa_serial = arn:aws:iam::1234566543324:mfa/myaccountt

Now the credentials are properly set. To assume the role with awsume you only need:
awsume dev-website-admin

It sets the proper temporary credentials and asks for the MFA token if it's needed.
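To verify which identity the shell is now using, a quick check (assuming the AWS CLI is installed) is:

aws sts get-caller-identity

It should show the assumed role instead of the plain IAM user.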