December 31, 2024
Highlights from Milestone M26 (Dec 16 - Dec 29, 2024)
While users are billed based on their dynos' CPU and memory consumption, NeetoDeploy itself is billed for the EC2 machines those dynos run on. It is therefore imperative that we pack dynos onto EC2 machines so that each machine is utilized as fully as possible.
By default, the Kubernetes scheduler distributes pods across nodes using the "LeastAllocated" scoring strategy, which favors the nodes with the most free resources and so spreads load evenly across all nodes. For us, this leads to suboptimal utilization: a large number of nodes end up only partially used. Our goal is to switch to the "MostAllocated" strategy, which fills each node as far as possible before placing pods on a new one, thereby reducing our overall node count and associated costs. However, this change is difficult on EKS (managed Kubernetes), where the configuration of the default scheduler is restricted.
To work around this limitation, we deployed a custom kube-scheduler within our cluster, configured with the MostAllocated strategy. Initial testing was successful, but we ran into compatibility issues with Cluster Autoscaler (CA), which exclusively supports pods scheduled by the default scheduler. Our cluster currently uses two node management solutions: CA for add-on services (PostgreSQL, Redis, Elasticsearch) and Karpenter for general dyno workloads. Through testing, we found that Karpenter is scheduler-agnostic and handles pods regardless of the scheduler used. Consequently, we plan to migrate all node management to Karpenter, which will let us roll out our custom scheduling strategy.
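To make the difference between the two scoring strategies concrete, here is a minimal simulation. It is only a sketch: it scores nodes on CPU alone, uses a fixed pool of identical nodes, and the names `Node`, `schedule`, and `nodes_in_use` are hypothetical. The real kube-scheduler also weighs memory and runs many other plugins.

```python
from dataclasses import dataclass


@dataclass
class Node:
    cpu_capacity: int  # millicores
    cpu_used: int = 0

    def fits(self, request):
        return self.cpu_used + request <= self.cpu_capacity

    def utilization_after(self, request):
        # Fraction of the node's CPU in use if the pod were placed here.
        return (self.cpu_used + request) / self.cpu_capacity


def schedule(pod_requests, node_count, capacity, strategy):
    """Place each pod on the best-scoring node and return the nodes."""
    nodes = [Node(capacity) for _ in range(node_count)]
    for request in pod_requests:
        candidates = [n for n in nodes if n.fits(request)]
        if not candidates:
            raise RuntimeError("no node can fit this pod")
        if strategy == "LeastAllocated":
            # Spread: prefer the node that stays emptiest.
            best = min(candidates, key=lambda n: n.utilization_after(request))
        elif strategy == "MostAllocated":
            # Bin-pack: prefer the node that ends up fullest.
            best = max(candidates, key=lambda n: n.utilization_after(request))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        best.cpu_used += request
    return nodes


def nodes_in_use(nodes):
    return sum(1 for n in nodes if n.cpu_used > 0)


if __name__ == "__main__":
    pods = [500] * 8  # eight pods requesting 500 millicores each
    for strategy in ("LeastAllocated", "MostAllocated"):
        used = nodes_in_use(schedule(pods, 4, 4000, strategy))
        print(strategy, "uses", used, "node(s)")
```

With eight 500-millicore pods and four 4000-millicore nodes, the spreading strategy touches all four nodes while the packing strategy fits everything on one, which is the effect that lets an autoscaler remove the idle machines.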
The scheduler migration has been thoroughly tested in our development environment. To ensure a smooth transition, we will execute the rollout in planned phases over the coming weeks.
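One way a phased rollout of this kind could work is per-workload opt-in: a pod is handled by a second scheduler only when its spec names that scheduler in `schedulerName`. The sketch below builds the two pieces as Python dictionaries that could be serialized into manifests. The scheduler name `bin-packing-scheduler` and the helper names are hypothetical, and this is not necessarily the configuration we run in production; the `KubeSchedulerConfiguration` fields follow the upstream Kubernetes resource bin packing documentation.

```python
import json

# Hypothetical name for the second scheduler running in the cluster.
SCHEDULER_NAME = "bin-packing-scheduler"


def scheduler_config(scheduler_name):
    """Build a scheduler profile that scores nodes with MostAllocated."""
    return {
        "apiVersion": "kubescheduler.config.k8s.io/v1",
        "kind": "KubeSchedulerConfiguration",
        "profiles": [
            {
                "schedulerName": scheduler_name,
                "pluginConfig": [
                    {
                        "name": "NodeResourcesFit",
                        "args": {
                            "scoringStrategy": {
                                "type": "MostAllocated",
                                # Equal weights are an assumption.
                                "resources": [
                                    {"name": "cpu", "weight": 1},
                                    {"name": "memory", "weight": 1},
                                ],
                            }
                        },
                    }
                ],
            }
        ],
    }


def opt_in(pod_spec, scheduler_name):
    """Return a copy of a pod spec that requests the custom scheduler."""
    updated = dict(pod_spec)
    updated["schedulerName"] = scheduler_name
    return updated


if __name__ == "__main__":
    print(json.dumps(scheduler_config(SCHEDULER_NAME), indent=2))
    print(json.dumps(opt_in({"containers": []}, SCHEDULER_NAME), indent=2))
```

Pods that do not set `schedulerName` keep using the default scheduler, so workloads can be moved over one group at a time.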
CastAI (https://cast.ai/) implemented for cost monitoring and optimization
Robusta (https://home.robusta.dev/) integrated for resource usage monitoring, anomaly detection, and error state management
We use the free tier of both services.
Successfully implemented pgBackRest for PostgreSQL database backups.
Integrated Velero for cluster configuration backup and recovery.
These implementations complete our disaster recovery infrastructure.
We have implemented a dynamic resource allocation system that lets users specify the exact CPU and RAM their dynos need. This replaces the previous fixed-plan model.
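As an illustration of what fine-grained allocation means at the Kubernetes level, the sketch below turns a user's CPU and RAM choice into a container `resources` block. The function name `dyno_resources` is hypothetical, and setting limits equal to requests is an assumption made for the example rather than a description of how NeetoDeploy configures dynos.

```python
def dyno_resources(cpu_millicores, memory_mib):
    """Translate a dyno's CPU and RAM choice into Kubernetes quantities."""
    if cpu_millicores <= 0 or memory_mib <= 0:
        raise ValueError("CPU and memory must be positive")
    quantities = {
        "cpu": f"{cpu_millicores}m",    # millicores, e.g. 250m = 0.25 vCPU
        "memory": f"{memory_mib}Mi",    # mebibytes
    }
    # Assumption: requests and limits are identical.
    return {"requests": dict(quantities), "limits": dict(quantities)}


if __name__ == "__main__":
    print(dyno_resources(250, 768))
```

Because the values are arbitrary rather than picked from a short list of plans, an application that needs 250 millicores and 768 MiB can request exactly that instead of rounding up to the next plan.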
Leveraging Robusta's analytical insights and our application metrics, combined with our new fine-grained resource allocation system, we conducted a comprehensive audit of all Neeto applications. This resulted in optimized resource allocation across our infrastructure, significantly improving overall resource usage efficiency.
AWS advertises Graviton as having better price performance, and benchmarks also find that Graviton performs better for the price. This is partly due to the lower power consumption of Arm-based processors and partly because Amazon owns Graviton and can price it with more competitive margins.
Intel and AMD processors use the x86 architecture, while Graviton uses 64-bit Arm Neoverse cores. The architecture affects both software compatibility and performance. In particular, some software is not built or supported for Arm, so Arm-based instances can run into compatibility issues.
We migrated the machines that run our add-ons to Graviton-based EC2 instances, which reduced our costs without any performance impact. Our add-ons (PostgreSQL, Redis, and Elasticsearch) all support Arm. However, since Arm support is not guaranteed for every piece of software and every library that user applications may depend on, we will continue to use AMD64 machines for our general-purpose dyno deployments.
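The split described above can be expressed with the well-known `kubernetes.io/arch` node label, whose values include `arm64` and `amd64`. The following is a minimal sketch of that idea; the helper name `node_selector` and the workload identifiers are hypothetical, and a real setup might express the same constraint through the node provisioner's configuration instead.

```python
# Add-ons that the post states are Arm-compatible.
ARM_READY_ADDONS = {"postgresql", "redis", "elasticsearch"}


def node_selector(workload):
    """Pick a CPU architecture for a workload via a pod nodeSelector."""
    if workload.lower() in ARM_READY_ADDONS:
        arch = "arm64"   # Graviton-based instances
    else:
        arch = "amd64"   # user dynos stay on x86 for compatibility
    return {"kubernetes.io/arch": arch}


if __name__ == "__main__":
    for name in ("redis", "user-dyno"):
        print(name, node_selector(name))
```

Defaulting unknown workloads to `amd64` keeps the safe choice as the fallback, so only software that is known to support Arm is ever placed on Graviton nodes.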