Thursday, December 7, 2023

Green IT: Kubernetes to Optimize Systems for Efficiency

Green IT does not mean increased maintenance costs or lower performance. It just means that the system and infrastructure must be designed carefully.

A cleaner version of this post is available on the NorthCode site.

Background

The web store is probably the most common kind of interactive website. Our customer runs one of the biggest web stores in Finland. It's international, with customers all over the world, so there is no such thing as a quiet day: only busy days, and even busier ones like Black Friday. The infrastructure is on Azure, so when we talk about Kubernetes we mean AKS.

When we started, the customer didn't have any autoscaling. If the site started to slow down, more virtual machines were added. Nobody knew how much CPU or memory the application actually used, so the virtual machines in use were quite big and expensive. In the end there were 14 virtual machines, to make sure that even Black Friday had enough computing power. That's an expensive way to keep the system running during the quieter times.

Planning and action

What are the steps to get expenses and CO2 emissions down? The first is the boring stuff: going through how the application is built. Luckily, this application is stateless, which means that if the load balancer decides to route traffic to another virtual machine, the customer doesn't lose their shopping cart. The application was also already in a container, running one container per virtual machine. That is sometimes a good strategy, but as we'll see, not here.

The next step was to investigate how the instances were actually used. A good rule of thumb is that at least 60% of memory and CPU should be in use at peak times. There is one good way to investigate this: performance testing. We ran as much traffic against the application as it could stand, until it crashed. The result was a bit depressing. Maximum memory use was under 10%, and CPU usage wasn't much better: most of the time the application used less than 50% of a single core. So 4-core instances with 64 GB of memory did nothing to improve performance.
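Any load generator works for this kind of test. As a sketch, using the open-source `hey` tool (the URL, duration and concurrency here are made up for illustration):

```shell
# Sustain load against the store front page: 200 concurrent workers
# for 10 minutes, then read the printed latency and error summary.
# Push the concurrency up between runs until the application breaks.
hey -z 10m -c 200 https://shop.example.com/

# While the test runs, watch CPU and memory on the virtual machine,
# e.g. with top/vmstat on the instance or via Azure Monitor metrics,
# to see how much of the provisioned capacity is actually used.
```

Comparing the observed peak usage against the instance size is what revealed the under-10% memory figure above.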

At this point we had some idea of how the application behaves: it needs autoscaling, and it could run more than one container on a single virtual machine. That makes it a good candidate for Kubernetes. It is stateless, it doesn't need a whole virtual machine, and the test environment must be flexible.

Implementation

The first step was to create a Kubernetes cluster for testing and see how well the application works there. For testing you don't have to think about reliability, so a single system node is enough. The workload runs on agent nodes, and the agent node pool autoscales: when there are more pods to deploy than the current agent node pool can run, new instances are created. For reliability and security reasons, the system node pool is reserved for the core components of Kubernetes.
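A test setup along those lines can be sketched with the Azure CLI. The resource group, cluster name, VM sizes and pool bounds below are illustrative, not the customer's actual values:

```shell
# Test cluster with a single small system node -- fine for testing,
# not for production.
az aks create \
  --resource-group webstore-test \
  --name webstore-test-aks \
  --nodepool-name system \
  --node-count 1 \
  --node-vm-size Standard_B2ms

# Separate autoscaling agent (user) pool for the workload itself;
# new nodes are added when pending pods no longer fit.
az aks nodepool add \
  --resource-group webstore-test \
  --cluster-name webstore-test-aks \
  --name agents \
  --mode User \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 5

# Keep application pods off the system pool so it only runs
# Kubernetes core components.
az aks nodepool update \
  --resource-group webstore-test \
  --cluster-name webstore-test-aks \
  --name system \
  --node-taints CriticalAddonsOnly=true:NoSchedule
```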

When we were sure that the Kubernetes installation worked, we set up the production AKS cluster. We already knew the CPU and memory usage, so it was easy to set proper limits in the Kubernetes specs. The system node pool must be sized so that a single failure doesn't take down the whole cluster; three system nodes is a good amount for that. We calculated the usage during a normal day. The agent node pool could have contained a single, much lighter instance than the original setup, but that would have been a single point of failure and a risk to the stability of the system. So for normal traffic we set it to two virtual machines.
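With measured usage in hand, the deployment spec can state its needs explicitly. A minimal sketch of what such limits might look like (the name, image and exact numbers are hypothetical; the point is that requests and limits come from measured data, not guesses):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webstore                 # hypothetical application name
spec:
  replicas: 2                    # two pods so a single failure doesn't take the site down
  selector:
    matchLabels:
      app: webstore
  template:
    metadata:
      labels:
        app: webstore
    spec:
      containers:
        - name: webstore
          image: example.azurecr.io/webstore:1.0   # placeholder image
          resources:
            requests:
              cpu: 500m          # the app stayed under half a core in the tests
              memory: 512Mi      # well above the measured peak
            limits:
              cpu: "1"
              memory: 1Gi
```

With requests set like this, the scheduler can pack several pods onto one medium VM instead of dedicating an oversized VM to each container.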

Autoscaling was tested (again) with the performance tests. It scaled up the pods automatically, as it was supposed to, and it also scaled up the size of the agent node pool. After the excess traffic stopped, it scaled the pods and the agent node pool back down.
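The pod-level scaling described above is the job of a HorizontalPodAutoscaler. A minimal sketch (the name, bounds and target are illustrative) could look like:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webstore                 # hypothetical, matches the deployment it scales
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webstore
  minReplicas: 2                 # never below two, for availability
  maxReplicas: 20                # Black Friday headroom
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # in line with the ~60% rule of thumb
```

When added pods no longer fit on the existing nodes, the cluster autoscaler on the agent node pool grows the pool, which is the two-level scale-up the tests exercised.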

Did this have any other impact? Yes. Testing improved: the CI pipeline was built so that each pull request created its own test environment, which shortened the feedback cycle. Issues in production decreased, and developers, testers and marketing were happy. We were also able to start experimenting with the architecture. We managed to improve the caching, which meant fewer hits to the CPU-intensive backend operations, and traffic to third-party APIs was reduced as well.
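One common way to build per-pull-request environments is a namespace per pull request. A hypothetical CI step, assuming plain kubectl and manifests in a `k8s/` directory (not the customer's actual pipeline):

```shell
# The CI system provides the pull request number in practice.
PR_NUMBER=123

# Isolated environment for this pull request.
kubectl create namespace "pr-${PR_NUMBER}"

# Deploy the application manifests into the PR namespace.
kubectl apply --namespace "pr-${PR_NUMBER}" -f k8s/

# ...run the test suite against the PR environment...

# Tear everything down when the pull request is closed.
kubectl delete namespace "pr-${PR_NUMBER}"
```

Because the agent node pool autoscales, these short-lived environments only cost capacity while they exist.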

And some numbers

Started with 14 extra-large virtual machines (~7,000 €/month).

Ended with 10–16 medium VMs (~1,300 €/month).

Conclusion

So “Green IT” does not mean “Expensive IT”. It means better-utilized IT, and that usually also means more cost consciousness.