Inspiration
As workloads grow and cloud costs skyrocket, designing an efficient and scalable Kubernetes infrastructure becomes more critical than ever. In this blog, I share how I evolved a basic EKS setup into a production-grade, cost-optimized, and resilient Kubernetes cluster using AWS EKS, Terraform, Karpenter, and Spot Instances. My goal was to reduce compute costs, simplify node scaling, and ensure applications could reliably run at scale. This implementation is ideal for startups, mid-sized teams, or individual developers aiming to balance performance and budget without compromising on automation.
What it does
The project provisions a fully production-ready Amazon EKS cluster using infrastructure as code (IaC) via Terraform. It automates the creation of networking (VPC), IAM roles and policies, and EKS cluster resources, and it integrates Karpenter for intelligent node lifecycle management. Karpenter dynamically provisions nodes based on workload requirements and balances capacity between Spot and On-Demand EC2 instances. The project also includes observability components such as CloudWatch, Prometheus, and Fluent Bit for centralized metrics and logging. The result is a highly scalable Kubernetes platform with automatic healing, scaling, and cost awareness baked in.
How we built it
- I used a modular Terraform setup with clearly separated modules for VPC, EKS, IAM, and Karpenter to ensure reusability and cleaner state management.
- The VPC module provisioned a multi-AZ network with public/private subnets, route tables, and NAT Gateways.
- The EKS module used the terraform-aws-eks community module to deploy the control plane and managed node groups, along with the necessary IAM roles and Kubernetes config maps.
- I installed essential EKS add-ons including:
  - VPC CNI plugin (for pod networking)
  - CoreDNS (for internal DNS resolution)
  - kube-proxy (for networking rules)
  - Amazon EBS CSI driver (for dynamic volume provisioning)
- I created Karpenter provisioners using YAML, defining capacity types (Spot + On-Demand), taints, tolerations, and consolidation policies.
- These provisioner specs were applied to the cluster via Terraform using kubernetes_manifest resources.
- IRSA (IAM Roles for Service Accounts) was enabled to let pods securely access AWS resources like S3 or CloudWatch without exposing credentials.
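
To make the steps above more concrete, here is a minimal sketch of how the EKS module, the add-ons, IRSA, and a Karpenter provisioner might be wired together in Terraform. It is not the project's actual code: the module version pin, cluster name, Kubernetes version, instance type, taint, and CPU limit are all assumptions chosen for illustration. The input names follow the 19.x line of the terraform-aws-eks module, and the provisioner uses the `karpenter.sh/v1alpha5` API, so other versions of either may need different fields.

```hcl
# Illustrative sketch only. All names, versions, and sizes are assumptions.

variable "vpc_id" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0" # assumed pin; input names differ in other major versions

  cluster_name    = "demo-eks" # hypothetical name
  cluster_version = "1.27"     # assumed version
  vpc_id          = var.vpc_id
  subnet_ids      = var.private_subnet_ids

  # Creates the OIDC provider that IRSA relies on.
  enable_irsa = true

  # The four add-ons listed above, installed as EKS managed add-ons.
  # In practice the EBS CSI driver also needs an IAM role for its
  # service account; that is left out of this sketch.
  cluster_addons = {
    vpc-cni            = { most_recent = true }
    coredns            = { most_recent = true }
    kube-proxy         = { most_recent = true }
    aws-ebs-csi-driver = { most_recent = true }
  }

  # A small managed node group to host Karpenter and system pods.
  eks_managed_node_groups = {
    system = {
      instance_types = ["t3.medium"] # assumed instance type
      min_size       = 2
      max_size       = 3
      desired_size   = 2
    }
  }
}

# Karpenter provisioner applied through Terraform. Assumes a configured
# kubernetes provider and that the Karpenter CRDs are already installed.
resource "kubernetes_manifest" "karpenter_provisioner" {
  manifest = {
    apiVersion = "karpenter.sh/v1alpha5"
    kind       = "Provisioner"
    metadata   = { name = "default" }
    spec = {
      # Allow both Spot and On-Demand capacity.
      requirements = [
        {
          key      = "karpenter.sh/capacity-type"
          operator = "In"
          values   = ["spot", "on-demand"]
        }
      ]
      # Hypothetical taint; pods need a matching toleration to land here.
      taints = [
        { key = "workload-type", value = "batch", effect = "NoSchedule" }
      ]
      # Let Karpenter remove or replace underutilized nodes.
      consolidation = { enabled = true }
      # Cap the total CPU this provisioner may launch (assumed value).
      limits = { resources = { cpu = "200" } }
      # Points at an AWSNodeTemplate named "default" (not shown here).
      providerRef = { name = "default" }
    }
  }
}
```

One thing to keep in mind with this approach: `kubernetes_manifest` talks to the cluster API during `terraform plan`, so the cluster and the Karpenter CRDs have to exist before this resource can be planned. Splitting the cluster and the manifests into separate applies is one way to handle that.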
Challenges we ran into
The first major challenge was getting Karpenter to work reliably with Spot Instances, especially around provisioning and lifecycle edge cases. Spot interruptions required me to design workloads that could tolerate disruptions and fail over gracefully. IAM was another source of complexity: the permissions Karpenter needs to interact with EC2 APIs had to be carefully scoped. I also faced issues with DNS throttling during cluster scale-ups, which I mitigated by tweaking the VPC’s DNS settings and adding limits on kube-dns. Managing Terraform state across multiple modules and keeping Helm-based components consistent with the cluster state also required careful planning.
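
As an illustration of the IAM scoping mentioned above, the following is a rough sketch of a Karpenter controller policy in Terraform. It is not the policy used in this project and is not complete. The split into statements, the tag condition key, and the `node_role_arn` variable are assumptions; the general idea is that read and launch actions stay broad while destructive actions are limited to instances Karpenter itself tagged.

```hcl
# Illustrative sketch only. Statement layout and names are assumptions,
# and a real controller policy will likely need further actions.

variable "node_role_arn" {
  type        = string
  description = "Hypothetical: ARN of the IAM role assumed by Karpenter-launched nodes"
}

data "aws_iam_policy_document" "karpenter_controller" {
  # Read-only lookups and the calls needed to launch capacity.
  statement {
    sid    = "AllowProvisioning"
    effect = "Allow"
    actions = [
      "ec2:CreateFleet",
      "ec2:CreateLaunchTemplate",
      "ec2:CreateTags",
      "ec2:RunInstances",
      "ec2:DescribeAvailabilityZones",
      "ec2:DescribeImages",
      "ec2:DescribeInstances",
      "ec2:DescribeInstanceTypeOfferings",
      "ec2:DescribeInstanceTypes",
      "ec2:DescribeLaunchTemplates",
      "ec2:DescribeSecurityGroups",
      "ec2:DescribeSpotPriceHistory",
      "ec2:DescribeSubnets",
      "pricing:GetProducts",
      "ssm:GetParameter",
    ]
    resources = ["*"]
  }

  # Destructive actions only on resources carrying a Karpenter tag.
  # The tag key below matches the v1alpha5 provisioner naming and is
  # an assumption about how nodes are tagged in a given setup.
  statement {
    sid       = "TerminateOnlyKarpenterNodes"
    effect    = "Allow"
    actions   = ["ec2:TerminateInstances", "ec2:DeleteLaunchTemplate"]
    resources = ["*"]

    condition {
      test     = "StringLike"
      variable = "ec2:ResourceTag/karpenter.sh/provisioner-name"
      values   = ["*"]
    }
  }

  # Allow passing only the node role, not arbitrary roles.
  statement {
    sid       = "PassNodeRole"
    effect    = "Allow"
    actions   = ["iam:PassRole"]
    resources = [var.node_role_arn]
  }
}

resource "aws_iam_policy" "karpenter_controller" {
  name   = "karpenter-controller-sketch" # hypothetical name
  policy = data.aws_iam_policy_document.karpenter_controller.json
}
```

The resulting policy would then be attached to the IAM role that the Karpenter service account assumes through IRSA, so the controller pods get these permissions without any static credentials.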
Accomplishments that we're proud of
One of the biggest wins was achieving a 60%+ reduction in monthly compute costs by offloading workloads to Spot Instances and letting Karpenter make real-time provisioning decisions. I also eliminated all manual scaling operations; everything is now dynamic and event-driven. Perhaps most importantly, the entire infrastructure is defined and reproducible via Terraform, which makes it easy to audit, share, and re-deploy with confidence. I’m proud of the resilience and modularity built into this project.
What we learned
This project was a deep dive into modern cloud infrastructure, and I came away with a solid understanding of EKS internals, Kubernetes scheduling, Spot Instance management, and Terraform best practices. I learned how crucial observability is in cloud-native environments, especially when relying on ephemeral compute. I also developed a deeper appreciation for the trade-offs involved in optimizing for cost without compromising on reliability. Most importantly, I learned how to design infrastructure as code in a modular, reusable, and production-friendly way.
What's next for Production-Ready EKS Using Terraform
Moving forward, I plan to integrate GitOps workflows using ArgoCD for automated application deployment. I also want to implement Velero for EKS cluster state backups and explore multi-tenant patterns using Kubernetes namespaces with RBAC for security isolation. Finally, I aim to open-source the Terraform modules and provide a ready-to-use boilerplate repo so others can easily replicate this architecture in their own environments.
Built With
- amazon-web-services
- eks
- kubernetes
- terraform
- vpc