Problem Statement

Migrating 250 applications running on outdated Kubernetes clusters presented a significant challenge: it required upgrading ~127 clusters and moving workloads to modern infrastructure. The migration involved deploying application workloads onto GoCloud, a new Kubernetes-based platform, and using Fabric (described in a later section) for traffic shaping. The effort aimed to improve security, scalability, and operational efficiency with zero downtime or disruption.

Motivations & Business Drivers

The decision to migrate stemmed from multiple critical factors that impacted the reliability and scalability of the existing infrastructure:

Security & Compliance: Outdated Kubernetes clusters carried security vulnerabilities, lacked vendor support, and risked non-compliance with industry regulations.
Maintainability & Operational Overhead: Managing fragmented and outdated clusters increased operational complexity, requiring excessive manual interventions and workarounds.
Performance & Scalability: The legacy infrastructure struggled to meet the growing demand for efficient scaling and traffic management, affecting application performance.
Standardization & Efficiency: Transitioning to GoCloud and Fabric aimed to provide a unified infrastructure, enabling streamlined operations, better traffic control, and automation capabilities.

This migration was not just an upgrade but a strategic move towards a modern, resilient, and scalable platform.

Engineering Challenges & Solutions

Why So Many Kubernetes Clusters

Running 127 distinct Kubernetes clusters raises questions, but the number grew out of conscious choices made in the past (I won't go into the details here). The key point is that there was no governance over how application teams set up and maintained Kubernetes clusters. That seems fine in the short term: as an application developer interested in infrastructure management, I may be confident that I can oversee both product development and infrastructure deployment. Such choices, however, need long-term planning. Every team goes through transitions from time to time. A developer may be comfortable managing both the infrastructure and the application, but it is not reasonable to expect the same of whoever replaces that developer when they leave the team.

A clearer division of ownership was required, along with clarity on whether an application developer should concentrate primarily on the features and scalability of the application or must also manage infrastructure and deployments on top of the application and business requirements. The organization must establish component ownership and governance. I believe this was not examined thoroughly, which is why the organization ended up with almost 127 Kubernetes clusters. What’s funny is that the infrastructure team managed only about 40 of them.

New Deployment Infrastructure - GoCloud

Our team identified the gaps in the previous approach to managing container deployments and started working on a multi-tenant Kubernetes cluster platform. The distinguishing features of the new platform included:

Namespace Isolation: Kubernetes namespaces create logical partitions within a cluster, providing resource isolation and segregation (no noisy neighbours 🙅‍♂️). Think of them like Linux kernel namespaces, a feature that isolates resources from each other. Each tenant in GoCloud has its own namespace, ensuring isolation between resources such as pods, services, and configurations.
Resource Quotas: Kubernetes resource quotas limit the amount of CPU, memory, and other resources that each tenant can consume within its namespace. This prevents one tenant from monopolizing cluster resources and, at the same time, lets us plan capacity more accurately.
Access Control: Namespaces provide a mechanism for access control and permissions. In GoCloud clusters, we define role-based access control (RBAC) policies and personas to grant users or groups different levels of access to resources within a namespace. This enables fine-grained control over who can access and modify resources within a specific cluster and within a namespace, enhancing security and isolation.
Namespace Labelling: All the applications and namespaces have standard labels in GoCloud clusters to categorize and identify tenants or applications. This helps with resource management, monitoring, billing and automation.
Cluster Autoscaling: GoCloud clusters support cluster autoscaling policies so that they can handle the resource demands of multiple tenants efficiently. GoCloud also supports the HorizontalPodAutoscaler (HPA), which automatically scales workload resources such as Deployments to match application demand in the cluster.
Observability: GoCloud clusters ship with Barito as the default logging tool and Prometheus as the monitoring tool, so logging and monitoring are managed centrally to track the performance and health of the cluster and to detect and respond to security incidents. Applications can run Grafana Agent as a sidecar to push application metrics.
Billing and Cost Allocation: Cost allocation in GoCloud clusters is achieved with the use of labels to attribute cluster costs to individual tenants or teams based on resource consumption.
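
To make the namespace, quota, labelling, and autoscaling features above more concrete, here is a minimal Python sketch. It is not GoCloud's actual tooling: the function names, the label keys ("tenant", "team"), and the choice to set limits at twice the requests are assumptions made purely for illustration. The manifest fields follow the standard Kubernetes Namespace and ResourceQuota objects, and the replica calculation is the core formula from the Kubernetes HPA documentation, without the tolerance and stabilization behaviour a real autoscaler applies.

```python
import math


def tenant_manifests(tenant, team, cpu_cores, memory_gib):
    """Build Namespace and ResourceQuota manifests (as dicts) for one tenant.

    The label keys and the 2x limit-to-request ratio are illustrative
    assumptions, not GoCloud's real conventions.
    """
    labels = {"tenant": tenant, "team": team}  # hypothetical standard labels
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": tenant, "labels": labels},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {
            "name": f"{tenant}-quota",
            "namespace": tenant,
            "labels": labels,
        },
        "spec": {
            "hard": {
                # Caps on the sum of requests/limits across the namespace.
                "requests.cpu": str(cpu_cores),
                "requests.memory": f"{memory_gib}Gi",
                "limits.cpu": str(cpu_cores * 2),
                "limits.memory": f"{memory_gib * 2}Gi",
            }
        },
    }
    return [namespace, quota]


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas, max_replicas):
    """Core HPA formula: ceil(current * currentMetric / targetMetric),
    clamped to the configured replica bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))
```

For example, a Deployment running 4 replicas at 90% average CPU against a 60% target would be scaled towards 6 replicas, as long as 6 falls within its minimum and maximum bounds. Because every object carries the same labels, usage can later be grouped by tenant or team for monitoring and cost allocation.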

Implementing Fabric for Traffic Management

Fabric is a multi-tenant traffic-shaping layer that acts as an edge proxy for services inside a VPC. It enables cluster fungibility by providing advanced traffic control for deployed workloads, and it can be used for North-South as well as East-West traffic. Built on Gloo Edge, Fabric operates as an abstracted proxy and serves as a routing layer between the API Gateway (Kong) and the application deployment infrastructure. It solves the challenges of managing ingress traffic into application architectures, and it is not limited to Kubernetes: backend services can be discovered whether they run or are registered in Kubernetes, on VMs, and so on.

Fabric is an internal traffic management system built to bridge this gap, with the following goals:

Simplify service discovery
High performance L4/L7 request routing
Improve service reliability and resiliency
- multiple backends with rich traffic control mechanism
- fast and reliable traffic-draining control
Expedite migration out of legacy network
Better security stance with TLS encryption
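
The "multiple backends" and traffic-draining goals above can be pictured as weighted routing between an old and a new backend. The sketch below is an illustration in Python under stated assumptions, not Fabric's or Gloo Edge's API: the backend names and the shift_plan and pick_backend helpers are hypothetical.

```python
import random


def shift_plan(steps):
    """Yield weight maps that move traffic from the legacy cluster to
    GoCloud in equal increments. Backend names are hypothetical."""
    for i in range(steps + 1):
        new_weight = round(100 * i / steps)
        yield {"legacy-cluster": 100 - new_weight, "gocloud": new_weight}


def pick_backend(weights, rng=random):
    """Pick a backend with probability proportional to its weight.

    A backend with weight 0 receives no new requests, which is how this
    sketch models draining.
    """
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    for weights in shift_plan(4):
        print(weights)
```

With four steps, the plan moves through 0%, 25%, 50%, 75%, and 100% of traffic on the new backend. Rolling back is the same operation in reverse: restore the previous weight map and the legacy backend starts receiving requests again.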

So far, this post has covered the problems and how we ended up building a new infrastructure platform with the desired features. The next section covers the operational challenges and the process we established to migrate (or upgrade) ~127 clusters 👇

Migrating 250 Applications

Moving 250 applications to the new infrastructure posed multiple challenges, requiring both technical and operational strategies for a smooth transition:

Setting Up Standardized Migration Processes

A well-defined migration process was crucial to ensure consistency of application deployment and minimize risks. We created structured processes around resource sizing of the namespace as well as playbooks for onboarding an application workload covering application dependencies, rollout strategies, and rollback plans.
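
As an illustration of what namespace resource sizing can look like, the following Python sketch sums the per-replica requests of the workloads moving into a namespace and adds headroom. The Workload fields, the size_namespace helper, and the 30% default headroom are assumptions for this example, not the actual process or numbers we used.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    replicas: int
    cpu_millicores: int  # per-replica CPU request
    memory_mib: int      # per-replica memory request


def size_namespace(workloads, headroom_percent=30):
    """Estimate namespace quota values from workload requests.

    headroom_percent is a hypothetical buffer for rollouts and spikes.
    Integer arithmetic keeps the rounding-up exact.
    """
    cpu = sum(w.replicas * w.cpu_millicores for w in workloads)
    mem = sum(w.replicas * w.memory_mib for w in workloads)
    factor = 100 + headroom_percent
    return {
        "requests.cpu": f"{(cpu * factor + 99) // 100}m",
        "requests.memory": f"{(mem * factor + 99) // 100}Mi",
    }


if __name__ == "__main__":
    apps = [
        Workload("api", replicas=4, cpu_millicores=500, memory_mib=1024),
        Workload("worker", replicas=2, cpu_millicores=250, memory_mib=512),
    ]
    print(size_namespace(apps))
```

Keeping the calculation in a small, repeatable function is what makes sizing consistent across teams: each onboarding starts from the same inputs and the same headroom rule, rather than from per-team estimates.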

Aligning Stakeholders Across Teams

Since multiple teams were involved, clear communication and collaboration were essential. Regular cross-team syncs and status updates helped align priorities and mitigate bottlenecks.

Overcoming Resistance to Change

Many teams were hesitant to prioritize migration over their business-as-usual (BAU) tasks. Conducting awareness sessions, demonstrating the benefits of the new infrastructure, and providing hands-on migration support helped address concerns.

Building Trust in the New Infrastructure

To gain stakeholder confidence, we onboarded one of the most crucial applications, the driver allocations service, by pairing with the owning team, and we created early adopter programs. Demonstrating successful transitions helped teams trust GoCloud and Fabric as reliable platforms. I cannot emphasize enough how important this was.

Handling Diverse Tech Stacks & Dependencies

Applications were built on varied frameworks and libraries, requiring compatibility checks.

CI/CD Pipeline Adjustments

We created well-defined documentation on how to reconfigure Continuous Integration and Deployment pipelines to align with GoCloud’s deployment framework, ensuring automation and smooth rollouts.

Post-Migration Support & Continuous Improvement

Establishing feedback loops and ongoing monitoring reassured stakeholders that any issues would be promptly addressed, further strengthening trust in the new infrastructure.

Time Sensitivity & Execution Strategy

Given the urgency imposed by external deadlines and security vulnerabilities, a structured execution strategy was critical. The migration was executed broadly in the following phases:

Pilot Phase: A small subset of applications was migrated first in complete collaboration with application owners to identify potential risks and refine the migration process.
Early Adopters: Key teams volunteered to migrate their applications, validating stability before a wider rollout.
Bulk Migration: By this phase, documentation was in place and the migration processes were established, so the majority of applications were migrated in parallel workstreams with minimal impact on production workloads.
Post-Migration Stabilization: Monitoring, fine-tuning, and addressing any residual issues ensured a smooth transition.

Key Takeaways & Lessons Learned

Early Buy-in from Teams: Engaging stakeholders early reduced resistance and streamlined decision-making.
Gaining Trust: Application owners had been managing their own deployments, so understanding their requirements, building a nearly complete product that met the major ones, and helping them through the migration were all important to gaining their trust. Once trust was established, word of mouth brought other application owners into the migration.
Automation is Key: Automating migration workflows significantly reduced errors and manual workload.
Gradual Migration Approach: Phased rollouts helped mitigate risks and provided learnings for subsequent migrations.
Continuous Monitoring & Feedback Loops: Collecting feedback during the migration let us improve the process, and post-migration monitoring allowed for immediate issue resolution and optimization.

Conclusion

The successful migration of 250 applications to GoCloud resulted in improved scalability, security, and operational efficiency. The adoption of Fabric for traffic shaping further enhanced resilience and reliability. This migration not only modernized the infrastructure but also set a strong foundation for future growth.
