How VMware Tanzu CloudHealth migrated from self-managed Kafka to Amazon MSK


This is a post co-written with Rivlin Pereira and Vaibhav Pandey from Tanzu CloudHealth (VMware by Broadcom).

VMware Tanzu CloudHealth is the cloud cost management platform of choice for more than 20,000 organizations worldwide, who rely on it to optimize and govern their largest and most complex multi-cloud environments. In this post, we discuss how the VMware Tanzu CloudHealth DevOps team migrated their self-managed Apache Kafka workloads (running version 2.0) to Amazon Managed Streaming for Apache Kafka (Amazon MSK) running version 2.6.2. We cover the system architectures, deployment pipelines, topic creation, observability, access control, and topic migration, along with the issues we faced with the existing infrastructure, how and why we migrated to the new Kafka setup, and some lessons learned.

Kafka cluster overview

In the fast-evolving landscape of distributed systems, VMware Tanzu CloudHealth's next-generation microservices platform relies on Kafka as its messaging backbone. For us, Kafka's high-performance distributed log system excels at handling massive data streams, making it indispensable for seamless communication. Serving as a distributed log system, Kafka efficiently captures and stores diverse logs, from HTTP server access logs to security event audit logs.

Kafka's versatility shines in supporting key messaging patterns, treating messages as basic logs or structured key-value stores. Dynamic partitioning and consistent ordering ensure efficient message organization. The unwavering reliability of Kafka aligns with our commitment to data integrity.
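
To make the partitioning and ordering point concrete, the following minimal sketch (illustrative only; the topic name, broker address, and payloads are hypothetical) uses the kafka-python client to publish keyed messages. Records that share a key hash to the same partition, so Kafka preserves their relative order for consumers of that partition.

```python
# Illustrative sketch of the keyed-message pattern described above,
# using the kafka-python client. Topic, broker, and payloads are hypothetical.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker-1:9092"],
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Messages that share a key land on the same partition, so Kafka preserves
# their relative order for any consumer of that partition.
producer.send("http-access-logs", key="web-server-42",
              value={"path": "/index.html", "status": 200})
producer.send("http-access-logs", key="web-server-42",
              value={"path": "/login", "status": 302})
producer.flush()
```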

The integration of Ruby services with Kafka is streamlined through the Karafka library, which acts as a higher-level wrapper. Our other language stack services use similar wrappers. Kafka's robust debugging features and administrative commands play a pivotal role in ensuring smooth operations and infrastructure health.
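
Our Karafka consumers are Ruby, but the shape of such a wrapper is easy to show in Python. The sketch below is a hypothetical analogue rather than our production code: it hides the client setup and hands each deserialized message to a handler, which is roughly the convenience Karafka provides.

```python
# Hypothetical Python analogue of a higher-level consumer wrapper:
# hide the client setup and hand each deserialized message to a handler.
import json

from kafka import KafkaConsumer

def consume(topic, group_id, handler, brokers=("broker-1:9092",)):
    """Poll `topic` as part of `group_id`, invoking `handler` per message."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=list(brokers),
        group_id=group_id,
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        enable_auto_commit=True,
    )
    for message in consumer:
        handler(message.key, message.value)

# Example usage (the handler is any callable taking a key and a value):
# consume("security-audit-logs", "audit-indexer", index_audit_event)
```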

Kafka as an architectural pillar

In VMware Tanzu CloudHealth's next-generation microservices platform, Kafka emerges as a critical architectural pillar. Its ability to handle high data rates, support diverse messaging patterns, and guarantee message delivery aligns seamlessly with our operational needs. As we continue to innovate and scale, Kafka remains a steadfast companion, enabling us to build a resilient and efficient infrastructure.

Why we migrated to Amazon MSK

For us, migrating to Amazon MSK came down to three key decision points:

  • Simplified technical operations – Running Kafka on self-managed infrastructure was an operational overhead for us. We hadn't updated Kafka version 2.0.0 in a while, and Kafka brokers were going down in production, causing issues with topics going offline. We also had to run scripts manually to increase replication factors and rebalance leaders, which was extra manual effort.
  • Deprecated legacy pipelines and simplified permissions – We were looking to move away from our existing pipelines written in Ansible to create Kafka topics on the cluster. We also had a cumbersome process for giving team members access to Kafka machines in staging and production, and we wanted to simplify this.
  • Cost, patching, and support – Because Apache ZooKeeper is fully managed and patched by AWS, moving to Amazon MSK was going to save us time and money. In addition, we discovered that running the same type of brokers on Amazon MSK was cheaper than running them on Amazon Elastic Compute Cloud (Amazon EC2). Combined with the fact that AWS applies security patches to the brokers for us, migrating to Amazon MSK was an easy decision. This also meant that the team was freed up to work on other important things. Finally, getting enterprise support from AWS was also significant in our final decision to move to a managed solution.

How we migrated to Amazon MSK

With the key drivers identified, we moved ahead with a proposed design to migrate the existing self-managed Kafka to Amazon MSK. We performed the following pre-migration steps before the actual implementation:

  • Assessment:
    • Performed a meticulous assessment of the existing EC2 Kafka cluster, understanding its configurations and dependencies
    • Verified Kafka version compatibility with Amazon MSK
  • Amazon MSK setup with Terraform
  • Network configuration:
    • Ensured seamless network connectivity between the EC2 Kafka and MSK clusters, fine-tuning security groups and firewall settings (a basic reachability check like the sketch after this list helped validate this)
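
As an illustration of the network validation step, the following sketch fetches the MSK bootstrap brokers with boto3 and confirms each one is reachable from a client host. The cluster ARN, region, and the choice of the TLS listener are placeholder assumptions.

```python
# Sketch of a reachability smoke test against the new MSK cluster.
# The cluster ARN and region are placeholders; adjust the listener key
# if you use plaintext or SASL endpoints instead of TLS.
import socket

import boto3

def check_msk_connectivity(cluster_arn, region, timeout=5):
    msk = boto3.client("kafka", region_name=region)
    brokers = msk.get_bootstrap_brokers(ClusterArn=cluster_arn)
    for endpoint in brokers["BootstrapBrokerStringTls"].split(","):
        host, port = endpoint.rsplit(":", 1)
        # Raises on timeout or refusal, e.g. a security group blocking 9094.
        with socket.create_connection((host, int(port)), timeout=timeout):
            print(f"OK: {host}:{port} is reachable")

check_msk_connectivity(
    "arn:aws:kafka:us-east-1:111122223333:cluster/demo-cluster/UUID", "us-east-1"
)
```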

After the pre-migration steps, we implemented the following for the new design:

  • Automated deployment, upgrade, and topic creation pipelines for MSK clusters:
    • In the new setup, we wanted automated deployments and upgrades of the MSK clusters in a repeatable fashion using an IaC tool. Therefore, we created custom Terraform modules for MSK cluster deployments as well as upgrades. These modules were called from a Jenkins pipeline for automated deployments and upgrades of the MSK clusters. For Kafka topic creation, we had been using an Ansible-based homegrown pipeline, which wasn't stable and led to a lot of complaints from dev teams. As a result, we evaluated options for deployments to Kubernetes clusters and used the Strimzi Topic Operator to create topics on MSK clusters. Topic creation was automated using Jenkins pipelines, which dev teams could use as self-service (see the topic manifest sketch after this list).
  • Better observability for clusters:
    • The old Kafka clusters didn't have good observability; we only had alerts on Kafka broker disk size. With Amazon MSK, we took advantage of open monitoring using Prometheus. We stood up a standalone Prometheus server that scraped metrics from the MSK clusters and sent them to our internal observability tool. Thanks to the improved observability, we were able to set up robust alerting for Amazon MSK, which wasn't possible with our old setup (see the monitoring sketch after this list).
  • Improved COGS and better compute infrastructure:
    • For our old Kafka infrastructure, we had to pay for managing Kafka and ZooKeeper instances, plus any additional broker storage costs and data transfer costs. With the move to Amazon MSK, because ZooKeeper is fully managed by AWS, we only have to pay for Kafka nodes, broker storage, and data transfer costs. As a result, in the final Amazon MSK setup for production, we saved not only on infrastructure costs but also on operational costs.
  • Simplified operations and enhanced security:
    • With the move to Amazon MSK, we didn't have to manage any ZooKeeper instances. Broker security patching was also taken care of for us by AWS.
    • Cluster upgrades became simpler with the move to Amazon MSK; it's a straightforward process to initiate from the Amazon MSK console.
    • With Amazon MSK, we got automatic scaling of broker storage out of the box. As a result, we didn't have to worry about brokers running out of disk space, leading to additional stability of the MSK cluster.
    • We also got additional security for the cluster because Amazon MSK supports encryption at rest by default, and various options for encryption in transit are also available. For more information, refer to Data protection in Amazon Managed Streaming for Apache Kafka.
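
For illustration, here is a sketch of the kind of Strimzi KafkaTopic manifest our self-service Jenkins pipeline generates for the Topic Operator to reconcile. The topic name, sizing, and cluster label are hypothetical values, not our production defaults.

```python
# Sketch of the KafkaTopic custom resource a self-service pipeline can render
# for the Strimzi Topic Operator to reconcile into a topic on the MSK cluster.
import yaml  # PyYAML

def kafka_topic_manifest(name, partitions=12, replicas=3, retention_ms=604800000):
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",
        "kind": "KafkaTopic",
        "metadata": {
            "name": name,
            # The Topic Operator only manages topics labeled with its cluster.
            "labels": {"strimzi.io/cluster": "msk-cluster"},
        },
        "spec": {
            "partitions": partitions,
            "replicas": replicas,
            "config": {"retention.ms": retention_ms},
        },
    }

# The pipeline applies the rendered manifest with kubectl (or a k8s client).
print(yaml.dump(kafka_topic_manifest("billing-events")))
```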
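
And here is a sketch of how the open monitoring endpoints can be spot-checked by hand; the broker hostname is a placeholder. With open monitoring enabled, each MSK broker serves Prometheus metrics from the JMX exporter on port 11001 and the node exporter on port 11002, which is what a standalone Prometheus server scrapes.

```python
# Sketch of a by-hand check of the open monitoring endpoints that the
# standalone Prometheus server scrapes. The broker hostname is a placeholder.
import requests

BROKER = "b-1.demo-cluster.abc123.c2.kafka.us-east-1.amazonaws.com"

# With open monitoring enabled, each broker exposes the JMX exporter on
# port 11001 and the node exporter on port 11002.
for port, exporter in ((11001, "JMX exporter"), (11002, "node exporter")):
    resp = requests.get(f"http://{BROKER}:{port}/metrics", timeout=10)
    resp.raise_for_status()
    print(f"{exporter}: scraped {len(resp.text.splitlines())} metric lines")
```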

During our pre-migration steps, we validated the setup in the staging environment before moving ahead with production.

Kafka topic migration strategy

With the MSK cluster setup complete, we performed a data migration of Kafka topics from the old cluster running on Amazon EC2 to the new MSK cluster. To achieve this, we performed the following steps:

  • Set up MirrorMaker with Terraform – We used Terraform to orchestrate the deployment of a MirrorMaker cluster consisting of 15 nodes, demonstrating scalability and flexibility by adjusting the number of nodes to the migration's concurrent replication needs (a configuration sketch follows this list).
  • Implement a concurrent replication strategy – We implemented a concurrent replication strategy with 15 MirrorMaker nodes to expedite the migration process. Our Terraform-driven approach contributed to cost optimization by efficiently managing resources during the migration and ensured the reliability and consistency of the MSK and MirrorMaker clusters. It also showed how the chosen setup accelerated data transfer, optimizing both time and resources.
  • Migrate data – We successfully migrated 2 TB of data in a remarkably short timeframe, minimizing downtime and showcasing the efficiency of the concurrent replication strategy.
  • Set up post-migration monitoring – We implemented robust monitoring and alerting during the migration, which contributed to a smooth process by identifying and addressing issues promptly.
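
For reference, the following sketch shows roughly the kind of MirrorMaker 2 configuration a Terraform module can render onto each node. The cluster aliases, endpoints, and tuning values are placeholders rather than our production settings.

```python
# Sketch: render a MirrorMaker 2 properties file onto a node. All names,
# endpoints, and tuning values below are illustrative placeholders.
MM2_PROPERTIES = """\
clusters = ec2kafka, msk
ec2kafka.bootstrap.servers = kafka-old-1:9092,kafka-old-2:9092
msk.bootstrap.servers = b-1.demo-cluster.abc123.c2.kafka.us-east-1.amazonaws.com:9094

# Replicate all topics from the old cluster to MSK.
ec2kafka->msk.enabled = true
ec2kafka->msk.topics = .*

# Emit checkpoints and offset syncs so consumer positions can be translated
# on the destination (the behavior discussed under Challenge 1). Group offset
# syncing requires MirrorMaker 2 from Kafka 2.7 or later.
emit.checkpoints.enabled = true
ec2kafka->msk.sync.group.offsets.enabled = true

# Per-node parallelism; 15 nodes ran this concurrently during the migration.
tasks.max = 8
"""

with open("mm2.properties", "w") as f:
    f.write(MM2_PROPERTIES)

# Each node then launches the dedicated cluster with:
#   bin/connect-mirror-maker.sh mm2.properties
```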

The following diagram illustrates the architecture after the topic migration was complete.
[Diagram: MirrorMaker setup]

Challenges and lessons learned

Embarking on a migration journey, especially with large datasets, often comes with unforeseen challenges. In this section, we delve into the challenges encountered during the migration of topics from EC2 Kafka to Amazon MSK using MirrorMaker, and share the insights and solutions that shaped the success of our migration.

Challenge 1: Offset discrepancies

One of the challenges we encountered was a mismatch in topic offsets between the source and destination clusters, even with offset synchronization enabled in MirrorMaker. The lesson learned here was that offset values don't necessarily need to be identical as long as offset sync is enabled, which makes sure consumers of the topics have the correct position to read the data from.

We addressed this problem by using a custom tool to run checks on consumer groups, confirming that the translated offsets were either smaller or caught up, indicating synchronization as expected by MirrorMaker.
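
The check itself is simple to express. Below is a minimal sketch of such a consumer group test (the group ID and broker endpoints are placeholders; our internal tool also handled authentication and reporting): for each partition, the destination group's offset should be caught up to or behind the source's, rather than byte-for-byte identical.

```python
# Minimal sketch of the consumer group check described above. The group ID
# and broker endpoints are placeholders.
from kafka import KafkaAdminClient

def compare_group_offsets(group_id, source_brokers, dest_brokers):
    src = KafkaAdminClient(bootstrap_servers=source_brokers)
    dst = KafkaAdminClient(bootstrap_servers=dest_brokers)
    src_offsets = src.list_consumer_group_offsets(group_id)
    dst_offsets = dst.list_consumer_group_offsets(group_id)
    for tp, src_meta in src_offsets.items():
        dst_meta = dst_offsets.get(tp)
        if dst_meta is None:
            print(f"{tp}: no translated offset on destination yet")
        elif dst_meta.offset <= src_meta.offset:
            # Translated offsets may lag or match, but should never lead.
            print(f"{tp}: OK (dest {dst_meta.offset} <= source {src_meta.offset})")
        else:
            print(f"{tp}: destination ahead of source, investigate")

compare_group_offsets(
    "billing-consumer",
    source_brokers=["kafka-old-1:9092"],
    dest_brokers=["b-1.demo-cluster.abc123.c2.kafka.us-east-1.amazonaws.com:9094"],
)
```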

Challenge 2: Slow data migration

The migration process hit a bottleneck: data transfer was slower than anticipated, especially with a substantial 2 TB dataset. Despite a 20-node MirrorMaker cluster, the speed was insufficient.

To overcome this, the team strategically grouped MirrorMaker nodes based on unique port numbers. Clusters of five MirrorMaker nodes, each with a distinct port, significantly boosted throughput, allowing us to migrate data within hours instead of days.

Challenge 3: Lack of detailed process documentation

Navigating the uncharted territory of migrating large datasets using MirrorMaker highlighted the absence of detailed documentation for such scenarios.

Through trial and error, the team crafted an IaC module using Terraform. This module streamlined the entire cluster creation process with optimized settings, enabling a seamless start to the migration within minutes.

Final setup and next steps

As a result of the move to Amazon MSK, our final setup after topic migration looked like the following diagram.
[Diagram: final MSK architecture]
We are considering further enhancements to this setup as next steps.

Conclusion

In this post, we discussed how VMware Tanzu CloudHealth migrated their existing Amazon EC2-based Kafka infrastructure to Amazon MSK. We walked you through the new architecture, the deployment and topic creation pipelines, the improvements to observability and access control, the topic migration challenges, and the issues we faced with the existing infrastructure, along with how and why we migrated to the new Amazon MSK setup. We also covered the advantages Amazon MSK gave us, the final architecture we achieved with this migration, and lessons learned.

For us, the interplay of offset synchronization, strategic node grouping, and IaC proved pivotal in overcoming obstacles and ensuring a successful migration from Amazon EC2 Kafka to Amazon MSK. This post serves as a testament to the power of adaptability and innovation in migration challenges, offering insights for others navigating a similar path.

If you're running self-managed Kafka on AWS, we encourage you to try the managed Kafka offering, Amazon MSK.


About the Authors

Rivlin Pereira is a Staff DevOps Engineer in the VMware Tanzu Division. He is very passionate about Kubernetes and works on the CloudHealth Platform, building and operating cloud solutions that are scalable, reliable, and cost effective.

Vaibhav Pandey, a Staff Software Engineer at Broadcom, is a key contributor to the development of cloud computing solutions. Specializing in architecting and engineering data storage layers, he is passionate about building and scaling SaaS applications for optimal performance.

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He has helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Todd McGrath is a data streaming specialist at Amazon Web Services where he advises customers on their streaming strategies, integration, architecture, and solutions. On the personal side, he enjoys watching and supporting his 3 kids in their preferred activities as well as following his own pursuits such as fishing, pickleball, ice hockey, and happy hour with friends and family on pontoon boats. Connect with him on LinkedIn.

Satya Pattanaik is a Sr. Solutions Architect at AWS. He has been helping ISVs build scalable and resilient applications on AWS Cloud. Prior to joining AWS, he played a significant role in Enterprise segments with their growth and success. Outside of work, he spends time learning "how to cook a flavorful BBQ" and trying out new recipes.
