Amazon EKS as a Production Battleship

The Opportunity

The customer at the center of this case study is a well-known global brand in the maritime industry. It provides a comprehensive range of ship management services to cargo ship owners across the globe, with a professional workforce both ashore and at sea. The company runs a web-based platform with the following notable features:

  • A B2B service used across the industry.
  • Real-time monitoring of critical metrics for ship/vessel owners, including crew details, maintenance schedules, and ship performance.
  • Tracking of more than 600 ships/vessels.

The customer appointed SourceFuse to assess and remediate its current architecture and application environment. After gaining access to the account and completing a detailed discovery, SourceFuse presented a report with the following highlights:

  • The system had gone through several organic revisions to reach its present state, which is stable and reliable.
  • The company wanted to improve overall application visibility, both to improve the developer experience and to make debugging easier.
  • It was keen to revisit its current security posture to further reduce cybersecurity risk.
  • Along the way to its current state, the platform had come to use a variety of AWS compute services, such as EC2, Lambda, and containers on ECS.

The Solution

As part of a phased remediation, SourceFuse followed a proven approach and migrated the customer’s compute to Amazon EKS in a highly reliable and available configuration. At the same time, SourceFuse enhanced security using AWS security best practices.

While Kubernetes provides the flexibility and controls that enable advanced users to manage their own clusters on AWS, EKS offers a managed solution that simplifies deployment, management, and integration with other AWS native services. These features make it an attractive choice for organizations that want standard Kubernetes capabilities while minimizing operational overhead and benefiting from the broader AWS ecosystem.

Drawing on its deep experience with EKS deployments, SourceFuse delivered a well-architected, secure, and reliable architecture along the following lines:

  • Automated Cluster Rollout and Updates: The entire cluster and its configuration parameters were managed through the Terraform CLI. Terraform offers a wide range of providers for automating infrastructure provisioning. We extended open-source Terraform to create ARC-IaC by SourceFuse, an enterprise-grade offering that embeds AWS best practices and can be configured to custom requirements.
Block Diagram — Infra Deployment
  • Automated Integration and Deployment: The customer needed to deploy a wide variety of applications on the cluster. The codebase and IaC are managed on Bitbucket, so a reliable infrastructure management tool was required. SourceFuse implemented ArgoCD to apply a declarative delivery methodology across the EKS cluster. We also used Jenkins to set up a continuous testing pipeline.
ArgoCD — EKS Continuous Deployment Tool
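To make the declarative delivery model concrete, here is a minimal Python sketch of the kind of Application manifest that ArgoCD reconciles against a Git repository. The application name, repository URL, path, and namespace are hypothetical placeholders rather than the customer's actual configuration, and the automated sync options shown are an assumption about how such a setup might be configured.

```python
import json


def argocd_application(name, repo_url, path, dest_namespace, revision="main"):
    """Build an Argo CD Application manifest as a plain dict.

    All argument values are illustrative; nothing here reflects the
    customer's real repositories or namespaces.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Application objects conventionally live in the namespace
        # where Argo CD itself is installed (assumed here: "argocd").
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            # Git is the source of truth for the desired state.
            "source": {
                "repoURL": repo_url,
                "targetRevision": revision,
                "path": path,
            },
            # Deploy into the same cluster that runs Argo CD.
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": dest_namespace,
            },
            # Assumption: drift is corrected automatically and resources
            # removed from Git are pruned from the cluster.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    app = argocd_application(
        name="vessel-api",  # hypothetical application name
        repo_url="https://bitbucket.org/example-org/fleet-apps.git",  # hypothetical
        path="apps/vessel-api",
        dest_namespace="tenant-a",
    )
    # kubectl accepts JSON as well as YAML, so the output can be applied as-is.
    print(json.dumps(app, indent=2))
```

With a manifest like this in place, a change merged to the tracked branch becomes the new desired state, and the controller converges the cluster toward it without a separate deploy step.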
  • Multi-Tenancy with Resource Isolation: Multi-tenancy isolation does not come by default in EKS but was achieved using the following practices:

a) Namespaces – Each tenant shares the cluster, with its workloads confined to a designated set of namespaces. Shared cluster resources, such as the control plane (scheduler and API server) and node CPU and memory, remain available to all tenants.

To keep tenant workloads isolated, each namespace carries role bindings, resource quotas, and network policies. Role bindings ensure controlled access within the namespace, resource quotas cap what each tenant can consume, and network policies isolate network traffic within and across tenants, bolstering overall security.

In essence, this approach strikes a balance between resource sharing and workload segregation. By limiting workloads to their allocated namespaces, tenants can use cluster resources effectively, while the components embedded in each namespace provide access control, usage limits, and network traffic isolation.

b) Role Based Access Control (RBAC) – RBAC was implemented using ClusterRoles and Roles which define the actions a user can perform within a cluster or namespace, respectively. SourceFuse assigned these roles to Kubernetes subjects (users, groups, or service accounts) with role bindings and cluster role bindings.

c) Resource Quotas – Resource quotas were configured at a very granular level, setting usage limits for each tenant and preventing resource depletion. When a Pod runs on a node with sufficient available resources, a container is permitted to use more of a resource than it requested. It cannot, however, exceed its defined resource limit.
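The three practices above can be sketched together as one set of per-tenant objects. The following Python sketch generates a Namespace, a ResourceQuota, a RoleBinding, and a NetworkPolicy for a single tenant. The tenant name, group name, and quota values are hypothetical, and binding to the built-in `edit` ClusterRole is an assumption made for illustration, not a description of the roles SourceFuse actually defined.

```python
import json


def tenant_manifests(tenant, cpu_requests="4", cpu_limits="8",
                     mem_requests="8Gi", mem_limits="16Gi"):
    """Return the per-tenant isolation objects as a list of dicts.

    Quota values and naming conventions are illustrative assumptions.
    """
    ns = f"tenant-{tenant}"
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": ns, "labels": {"tenant": tenant}},
    }
    # Caps the sum of requests and limits across all Pods in the namespace.
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota", "namespace": ns},
        "spec": {"hard": {
            "requests.cpu": cpu_requests,
            "requests.memory": mem_requests,
            "limits.cpu": cpu_limits,
            "limits.memory": mem_limits,
        }},
    }
    # Grants a (hypothetical) tenant group edit rights in this namespace only.
    role_binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "tenant-edit", "namespace": ns},
        "subjects": [{
            "kind": "Group",
            "name": f"{tenant}-developers",
            "apiGroup": "rbac.authorization.k8s.io",
        }],
        "roleRef": {
            "kind": "ClusterRole",
            "name": "edit",  # built-in ClusterRole, scoped by the binding
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }
    # Selects every Pod in the namespace and admits ingress only from
    # Pods in the same namespace, blocking cross-tenant traffic.
    network_policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace-only", "namespace": ns},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }
    return [namespace, quota, role_binding, network_policy]


if __name__ == "__main__":
    # Wrap the objects in a v1 List so they can be applied in one call.
    bundle = {"apiVersion": "v1", "kind": "List",
              "items": tenant_manifests("acme")}  # "acme" is a placeholder
    print(json.dumps(bundle, indent=2))
```

Note that a NetworkPolicy only takes effect when the cluster runs a network plugin that enforces it, so the policy object alone is not sufficient.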

  • Monitoring and Observability: A cloud-agnostic OBF stack was rolled out to give the customer maximum visibility into the workload. The tools stitched together to form the complete stack included:

a) Prometheus and Grafana, deployed using Helm charts and ArgoCD. On top of this, swagger-stats was implemented to expose API metrics in Prometheus format, so the same set of tools can be used for API monitoring and alerting.

b) A combination of Fluent Bit, Logstash, and OpenSearch, used to collect all application and pod logs from the EKS cluster and push them to OpenSearch, creating a centralized, searchable log store.
Centralized Log Analytics
  • Security and Compliance:

a) Runtime Security – We restricted what containers are allowed to do at runtime. Seccomp is a Linux kernel feature that we used to restrict the system calls our application can make. By blocking system calls the application does not need, we limit the blast radius of a compromise.

b) Image Scanning – We enabled the scan-on-push feature for all private registry repositories, so container images are scanned as they are pushed. Additionally, each container image is scanned at least once a day. Amazon ECR uses the Common Vulnerabilities and Exposures (CVEs) database from the open-source Clair project and provides a list of scan findings.
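The seccomp restriction described under Runtime Security can be illustrated with a short Python sketch that builds an allow-list profile and the Pod security context that references it. The syscall list and profile path are hypothetical and far too small for a real application; they only show the shape of a profile that denies everything not explicitly allowed.

```python
import json


def seccomp_profile(allowed_syscalls):
    """Build a deny-by-default seccomp profile.

    Any syscall not listed fails with an error instead of executing.
    """
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{
            "names": sorted(set(allowed_syscalls)),
            "action": "SCMP_ACT_ALLOW",
        }],
    }


def pod_security_context(profile_path=None):
    """Return a securityContext fragment selecting a seccomp profile.

    With no path, fall back to the container runtime's default profile;
    otherwise reference a profile file stored on the node.
    """
    if profile_path is None:
        return {"seccompProfile": {"type": "RuntimeDefault"}}
    return {"seccompProfile": {"type": "Localhost",
                               "localhostProfile": profile_path}}


if __name__ == "__main__":
    # Illustrative subset only; a real workload needs many more syscalls.
    profile = seccomp_profile(["read", "write", "exit_group", "futex", "read"])
    print(json.dumps(profile, indent=2))
    # "profiles/app.json" is a hypothetical path relative to the
    # kubelet's seccomp directory on the node.
    print(json.dumps(pod_security_context("profiles/app.json"), indent=2))
```

Starting from the runtime default profile and tightening toward an application-specific allow-list is a common way to adopt this incrementally, since an overly strict profile fails at runtime rather than at deploy time.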

All of the above capabilities, none of which are available out of the box, were put in place by SourceFuse, enabling the production workload on EKS to meet AWS best practices for containerized workloads.

About The Customer

Headquartered in Hong Kong SAR, China, the company operates on a global scale with 27 offices in 12 countries. Its client base spans over 100 world-class ship owners, including Fortune 500 companies from China, Greece, India, Japan, Korea, the Netherlands, Norway, Turkey, and the USA, among others. Today, it is one of the largest independent third-party ship management companies, managing over 650 vessels of diverse types.
