Milestone M26 - EC2 optimization, cluster observability and more

December 31, 2024

Highlights from Milestone M26 (Dec 16 - Dec 29, 2024)

EC2 Instance Optimization Strategy

While users are billed based on their dynos' CPU and memory consumption, NeetoDeploy itself is billed for EC2 machines. So it is imperative that we pack dynos onto EC2 machines in a way that uses each machine as fully as possible.

By default, the Kubernetes scheduler distributes pods across nodes using the "LeastAllocated" scoring strategy, which favors the nodes with the most free resources and so spreads load evenly across the cluster. For us, this leads to suboptimal utilization: many nodes end up only partially used. Our goal is to switch to the "MostAllocated" strategy, which fills each node as far as possible before placing pods on a new one, thereby reducing our overall node count and associated costs. However, this change is difficult on EKS (managed Kubernetes), where the default scheduler's configuration cannot be modified.

To work around this limitation, we deployed a custom kube-scheduler within our cluster, configured with the MostAllocated strategy. While initial testing proved successful, we encountered compatibility issues with Cluster Autoscaler (CA), which exclusively supports pods scheduled by the default scheduler. Our cluster currently uses two node management solutions: CA for add-on services (PostgreSQL, Redis, ElasticSearch) and Karpenter for general dyno workloads. Through testing, we found that Karpenter is scheduler-agnostic and handles pods correctly regardless of which scheduler placed them. Consequently, we plan to migrate all node management to Karpenter, which will let us roll out our custom scheduling strategy.
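The difference between the two strategies can be illustrated with a small simulation. This is a minimal sketch, not the actual kube-scheduler scoring code: the node size, the pod size, and the averaged CPU/memory score are simplifying assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical node size; real EC2 instance sizes will differ.
    cpu_capacity: float = 4.0
    mem_capacity: float = 16.0
    cpu_used: float = 0.0
    mem_used: float = 0.0

    def fits(self, cpu: float, mem: float) -> bool:
        return (self.cpu_used + cpu <= self.cpu_capacity
                and self.mem_used + mem <= self.mem_capacity)

    def utilization_after(self, cpu: float, mem: float) -> float:
        # Average of CPU and memory utilization if the pod were placed here.
        cpu_frac = (self.cpu_used + cpu) / self.cpu_capacity
        mem_frac = (self.mem_used + mem) / self.mem_capacity
        return (cpu_frac + mem_frac) / 2


def score(node: Node, cpu: float, mem: float, strategy: str) -> float:
    util = node.utilization_after(cpu, mem)
    if strategy == "MostAllocated":
        return util * 100        # fuller nodes score higher (bin packing)
    if strategy == "LeastAllocated":
        return (1 - util) * 100  # emptier nodes score higher (spreading)
    raise ValueError(f"unknown strategy: {strategy}")


def schedule(pods, node_count: int, strategy: str):
    """Place each (cpu, mem) pod on the highest-scoring node that fits it."""
    nodes = [Node() for _ in range(node_count)]
    for cpu, mem in pods:
        candidates = [n for n in nodes if n.fits(cpu, mem)]
        if not candidates:
            raise RuntimeError("no node can fit this pod")
        best = max(candidates, key=lambda n: score(n, cpu, mem, strategy))
        best.cpu_used += cpu
        best.mem_used += mem
    return nodes


def nodes_in_use(nodes) -> int:
    return sum(1 for n in nodes if n.cpu_used > 0)


# Eight identical pods, each needing 0.5 CPU and 2 GiB of memory.
pods = [(0.5, 2.0)] * 8
spread = schedule(pods, node_count=4, strategy="LeastAllocated")
packed = schedule(pods, node_count=4, strategy="MostAllocated")
print(nodes_in_use(spread), nodes_in_use(packed))
```

In this toy setup, spreading leaves all four nodes partially used, while packing fits every pod onto a single node, so the other three could be removed by the node autoscaler.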

The scheduler migration has been thoroughly tested in our development environment. To ensure a smooth transition, we will execute the rollout in planned phases over the coming weeks.

Cluster Observability

CastAI (https://cast.ai/) implemented for cost monitoring and optimization
Robusta (https://home.robusta.dev/) integrated for resource usage monitoring, anomaly detection, and error state management

We use the free tier of both services.

Disaster Recovery

Successfully implemented pgBackRest for PostgreSQL database backups.
Integrated Velero for cluster configuration backup and recovery.

These implementations complete our disaster recovery infrastructure.

Fine-grained Resource Allocation System

We have implemented a dynamic resource allocation system that lets users specify the exact CPU and RAM their dynos need. This replaces the previous model of fixed plans.
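As a rough illustration of how a user-specified CPU and RAM pair could map onto Kubernetes resource settings, here is a minimal sketch. The function name `to_k8s_resources` is hypothetical, and setting limits equal to requests is an assumption made for the example, not a description of how NeetoDeploy actually configures dynos.

```python
def to_k8s_resources(cpu_cores: float, memory_mib: int) -> dict:
    """Convert a CPU/RAM pair into a Kubernetes-style resources mapping.

    Hypothetical helper: limits are set equal to requests purely as an
    assumption for this sketch.
    """
    if cpu_cores <= 0 or memory_mib <= 0:
        raise ValueError("CPU and memory must be positive")
    millicores = round(cpu_cores * 1000)  # Kubernetes expresses CPU in millicores
    quantity = {"cpu": f"{millicores}m", "memory": f"{memory_mib}Mi"}
    return {"requests": quantity, "limits": dict(quantity)}


# A dyno that needs a quarter of a core and 512 MiB of RAM.
resources = to_k8s_resources(cpu_cores=0.25, memory_mib=512)
print(resources)
```

Specifying resources at this granularity lets a dyno request exactly what it needs instead of rounding up to the nearest fixed plan.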

Resource Optimization Initiative

Leveraging Robusta's analytical insights and our application metrics, combined with our new fine-grained resource allocation system, we conducted a comprehensive audit of all Neeto applications. This resulted in optimized resource allocation across our infrastructure, significantly improving overall resource usage efficiency.

Switched to Machines with ARM-based Graviton Processors

AWS advertises Graviton as offering better price performance, and benchmarks also show Graviton performing well for its price. This is partly due to the lower power consumption of Arm-based processors and partly because Amazon owns Graviton, which allows more competitive margins.

Intel and AMD processors use the x86 architecture, while Graviton uses 64-bit Arm Neoverse cores. The architecture affects both software compatibility and performance. In particular, some software is not built or supported for Arm and may not run on Graviton.

We migrated the machines that run our add-ons to Graviton-based EC2 instances, which gave us a drop in cost without any performance impact. Our add-ons (PostgreSQL, Redis, and ElasticSearch) all support ARM. However, since ARM support is not guaranteed for every piece of software and every library a user application may depend on, we will continue to use AMD64 machines for our general-purpose dyno deployments.
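One common way to express this split in Kubernetes is a node selector on the well-known `kubernetes.io/arch` node label. The sketch below is illustrative only: the `ARM_READY_ADDONS` set and the `node_selector` helper are hypothetical names, and the post does not say how NeetoDeploy actually pins workloads to an architecture.

```python
# Add-ons the post lists as supporting ARM (names lowercased for matching).
ARM_READY_ADDONS = {"postgresql", "redis", "elasticsearch"}


def node_selector(workload: str) -> dict:
    """Pick a CPU architecture for a workload (hypothetical helper).

    Known ARM-ready add-ons go to arm64 (Graviton) nodes; everything else,
    including user dynos, stays on amd64 nodes.
    """
    arch = "arm64" if workload.lower() in ARM_READY_ADDONS else "amd64"
    return {"kubernetes.io/arch": arch}


print(node_selector("Redis"))      # add-on: scheduled on Graviton nodes
print(node_selector("rails-app"))  # user dyno: stays on AMD64 nodes
```

Defaulting unknown workloads to amd64 keeps user applications safe even when one of their dependencies has no ARM build.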
