In cloud computing, efficiency and cost-effectiveness are paramount. At SourceFuse, we use Amazon Elastic Kubernetes Service (EKS) clusters to host multiple development and test environments for our client workloads. We faced the challenge of reducing the cost of running these EKS clusters while maintaining high performance standards for non-mission-critical workloads. In this blog, we walk you through the steps we took to identify the issues and implement solutions, resulting in substantial cost savings and improved resource management.
Addressing the High EKS Costs
a) Leveraging Spot EC2 Instances: Our approach to cost optimization revolved around the strategic use of AWS Spot EC2 instances, driven by a priority-based expander for the Kubernetes Cluster Autoscaler (CA). The goal was to strike the right balance between Spot and On-Demand EC2 instances. We first identified workloads that were non-mission-critical and could therefore benefit from the significant cost savings of Spot instances. In the CA's scaling decisions we then gave Spot instances a higher priority than On-Demand instances, and ranked them further by the instance type required, for example m5.xlarge or c5.xlarge. This ensured that whenever Spot capacity was available it was used first, which reduced compute costs considerably during periods of increased demand.
From experience, we know that Spot instances come with the inherent risk of price fluctuations and capacity constraints. To address this, we configured the CA to fall back automatically to provisioning On-Demand instances whenever Spot capacity became scarce. This safeguard kept our applications responsive and accessible even when Spot instances were unexpectedly unavailable. The screenshot below depicts the cost reduction achieved in EC2 instances after implementing our Spot instance strategy.
With the Spot instance strategy in place, we observed a 74.66% reduction in EC2 instance costs.
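As a rough illustration of the priority-based expander described above, the Python sketch below builds the ConfigMap that the Cluster Autoscaler reads when it is started with `--expander=priority`. The ConfigMap name and the `priorities` key follow the upstream expander convention; the node-group name patterns, the priority numbers, and the `kube-system` namespace are assumptions for this example, not our actual configuration. The `pick_group` helper is only a toy model of the fallback behaviour, not the autoscaler's real logic.

```python
import json
import re

# Hypothetical node-group name patterns. A higher number means higher priority,
# so Spot groups are tried before the On-Demand group.
PRIORITIES = {
    50: [r".*spot-m5-xlarge.*"],
    40: [r".*spot-c5-xlarge.*"],
    10: [r".*on-demand.*"],
}


def render_priorities(priorities):
    """Render the priority map as the YAML text stored under the 'priorities' key."""
    lines = []
    for priority in sorted(priorities, reverse=True):
        lines.append(f"{priority}:")
        lines.extend(f"  - {pattern}" for pattern in priorities[priority])
    return "\n".join(lines) + "\n"


def priority_expander_configmap(priorities, namespace="kube-system"):
    """Build the ConfigMap manifest as a plain dict (namespace is an assumption)."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "cluster-autoscaler-priority-expander",
            "namespace": namespace,
        },
        "data": {"priorities": render_priorities(priorities)},
    }


def pick_group(priorities, candidate_groups):
    """Toy model: return a candidate from the highest-priority tier that matches."""
    for priority in sorted(priorities, reverse=True):
        for group in candidate_groups:
            if any(re.search(pattern, group) for pattern in priorities[priority]):
                return group
    return None


if __name__ == "__main__":
    # JSON is valid YAML, so the output can be piped to `kubectl apply -f -`.
    print(json.dumps(priority_expander_configmap(PRIORITIES), indent=2))
```

In this toy model, dropping the Spot groups from `candidate_groups` (as if they had no capacity) makes `pick_group` return the On-Demand group, which mirrors the fallback described above.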
Our next challenge was the escalating EKS network cost. These expenses were driven primarily by data transfer charges and by costly components such as NAT Gateways. The screenshot below shows the filter applied to the EC2-Other category:
We knew we needed a targeted strategy to optimize costs without compromising performance. Our approach was as follows:
b) Leveraging AWS VPC Endpoints: We began optimizing data transfer costs by using AWS VPC Endpoints to call AWS services. This keeps traffic to those services on the Amazon network instead of sending it through NAT Gateways and over the public internet, which avoids the associated data processing and transfer charges. For example, our workloads that frequently interacted with S3 saw significant cost reductions using VPC endpoints.
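A minimal sketch of creating such an endpoint for S3 is shown below, assuming a gateway-type endpoint (the text above does not say which endpoint type we used). The VPC ID, route table ID, and region are placeholders. The parameter names match the EC2 `CreateVpcEndpoint` API as exposed by boto3, which is imported lazily so that the parameter builder can be inspected without AWS credentials.

```python
def s3_gateway_endpoint_params(vpc_id, region, route_table_ids):
    """Build keyword arguments for EC2 CreateVpcEndpoint (gateway endpoint for S3)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Routes to S3 are added to these route tables (the private subnets' tables).
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(vpc_id, region, route_table_ids):
    """Create the endpoint and return its ID. Requires boto3 and AWS credentials."""
    import boto3  # third-party AWS SDK, imported here so the builder stays dependency-free

    ec2 = boto3.client("ec2", region_name=region)
    params = s3_gateway_endpoint_params(vpc_id, region, route_table_ids)
    response = ec2.create_vpc_endpoint(**params)
    return response["VpcEndpoint"]["VpcEndpointId"]


if __name__ == "__main__":
    # Placeholder identifiers; replace with real values before calling the create function.
    print(s3_gateway_endpoint_params("vpc-0123456789abcdef0", "us-east-1",
                                     ["rtb-0123456789abcdef0"]))
```

Gateway endpoints for S3 carry no additional charge, which is why they are a common first step when NAT Gateway traffic is dominated by S3 access.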
c) Implementing NodeLocal DNSCache: NodeLocal DNSCache runs a DNS caching agent on every node, so pods resolve names against a local cache instead of sending each query across the cluster. By incorporating it, we minimized the data transfer incurred by DNS resolution and saw a noticeable reduction in data transfer costs for our web applications that required extensive DNS lookups.
d) Utilizing Istio for Zone Awareness: Our applications frequently spanned multiple Availability Zones (AZs), incurring cross-AZ data transfer costs. To address this, we adopted Istio for zone awareness, routing traffic within the same AZ whenever possible. This significantly reduced unnecessary cross-AZ calls among pods, leading to substantial data transfer cost savings.
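The zone-aware routing described above is typically expressed in Istio as a DestinationRule with locality load balancing. The sketch below builds one as a Python dict. The rule name, namespace, and host are hypothetical, and the outlier-detection thresholds are illustrative values rather than our settings. Outlier detection is included because Istio only applies locality-aware routing to a service when it is configured.

```python
import json


def locality_destination_rule(name, namespace, host):
    """Build an Istio DestinationRule that prefers endpoints in the caller's zone."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "host": host,
            "trafficPolicy": {
                "loadBalancer": {"localityLbSetting": {"enabled": True}},
                # Illustrative thresholds: eject an endpoint after 5 consecutive
                # 5xx responses so traffic can fail over to another zone.
                "outlierDetection": {
                    "consecutive5xxErrors": 5,
                    "interval": "30s",
                    "baseEjectionTime": "1m",
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical service; JSON output can be piped to `kubectl apply -f -`.
    rule = locality_destination_rule(
        "orders-zone-aware", "dev", "orders.dev.svc.cluster.local"
    )
    print(json.dumps(rule, indent=2))
```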
Together, these three steps reduced our network costs by around 90%:
With network costs under control, we turned to idle compute:
e) Automating Non-Working Hours’ Shutdowns: To further optimize cost, we deployed kube-green, a straightforward Kubernetes add-on that automatically shuts down non-essential workloads during non-working hours. This drastically reduced idle compute hours, translating into a considerable reduction in expenses. For example, our staging environments now automatically scale down during nights and weekends.
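kube-green is driven by a `SleepInfo` resource created in each namespace that should sleep. The sketch below builds one as a Python dict. The resource name, namespace, times, and time zone are assumptions chosen to illustrate a nights-and-weekends schedule, not our actual settings.

```python
import json


def sleep_info(name, namespace, sleep_at="20:00", wake_up_at="08:00",
               weekdays="1-5", time_zone="UTC"):
    """Build a kube-green SleepInfo manifest for one namespace."""
    return {
        "apiVersion": "kube-green.com/v1alpha1",
        "kind": "SleepInfo",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Sleep and wake-up fire only on these days (1-5 is Monday to Friday),
            # so workloads put to sleep on Friday evening stay down until Monday.
            "weekdays": weekdays,
            "sleepAt": sleep_at,
            "wakeUpAt": wake_up_at,
            "timeZone": time_zone,
            # Also suspend CronJobs while the namespace is asleep.
            "suspendCronJobs": True,
        },
    }


if __name__ == "__main__":
    # Hypothetical staging namespace; JSON output can be piped to `kubectl apply -f -`.
    print(json.dumps(sleep_info("nights-and-weekends", "staging"), indent=2))
```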
Streamlining Resource Management
a) Workload Separation: We segregated workloads into separate AWS accounts, allowing for better organization, resource isolation, and streamlined management. This move significantly simplified resource tracking and ensured better governance.
b) Custodian Policies for Each Account: To enforce governance and cost control across various AWS accounts, we implemented customized Custodian policies for each account. These policies effectively enforced resource usage guidelines and compliance measures, ensuring a cost-optimized and secure environment.
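To give a sense of what such a policy can look like, the sketch below assumes that "Custodian" refers to Cloud Custodian (c7n) and builds a single policy as a Python dict that mirrors the YAML layout policies are normally written in. The policy name and the `Owner` tag are hypothetical. The policy is only an example of the kind of rule that supports cost control, not one of the policies we actually deployed.

```python
import json


def stop_untagged_instances_policy(required_tag="Owner"):
    """Build a Cloud Custodian policy document that stops running EC2 instances
    that are missing a required tag."""
    return {
        "policies": [
            {
                "name": "ec2-stop-instances-missing-owner-tag",  # hypothetical name
                "resource": "aws.ec2",
                "filters": [
                    {"State.Name": "running"},
                    {f"tag:{required_tag}": "absent"},
                ],
                "actions": ["stop"],
            }
        ]
    }


if __name__ == "__main__":
    # Print the policy document; in practice it would be saved to a policy file
    # and evaluated per account with the `custodian run` command.
    print(json.dumps(stop_untagged_instances_policy(), indent=2))
```

Keeping one policy file per account makes it possible to apply stricter rules to development and test accounts than to accounts hosting more sensitive workloads.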
By strategically addressing the high EKS costs and streamlining resource management, we successfully reduced expenses. As a result, we saw a marked change in our EC2 instance costs:
Total EC2 instance costs before implementation (May): $6,672.55
Total EC2 instance costs after implementation (June): $1,692.72
This represents a 74.66% reduction in EC2 instance costs after implementing the Spot instance strategy, alongside a saving of around 90% in network costs and improved overall efficiency. Embracing AWS features, employing Kubernetes add-ons, and leveraging Spot instances allowed us to strike the right balance between performance and cost-effectiveness.
At SourceFuse, we are committed to optimizing our cloud infrastructure continually. This success story stands as a testament to our dedication to excellence in cost management, ensuring that we make the most out of the AWS cloud while delivering top-notch services to our customers.
With the power of AWS and the expertise of our Cloud Architect team, we are confident in our ability to take on future challenges, achieving even greater cost savings and operational efficiency.