Nexthink scales to trillions of events per day with Amazon MSK


Real-time data streaming and event processing present scalability and management challenges. AWS offers a broad selection of managed real-time data streaming services to effortlessly run these workloads at any scale.

In this post, Nexthink shares how Amazon Managed Streaming for Apache Kafka (Amazon MSK) empowered them to achieve massive scale in event processing. Experiencing business hyper-growth, Nexthink migrated to AWS to overcome the scaling limitations of on-premises solutions. With Amazon MSK, Nexthink now seamlessly processes trillions of events per day, reaching over 5 GB per second of aggregated throughput.

In the following sections, Nexthink introduces their product and the need for scalability. They then highlight the challenges of their legacy on-premises application and present their transition to a cloud-native software as a service (SaaS) architecture powered by Amazon MSK. Finally, Nexthink details the benefits achieved by adopting Amazon MSK.

Nexthink’s need to scale

Nexthink is the leader in digital employee experience (DeX). The company is shaping the future of work by providing IT leaders and C-levels with insights into employees’ daily technology experiences at the device and application level. This allows IT to evolve from reactive problem-solving to proactive optimization.

The Nexthink Infinity platform combines analytics, monitoring, automation, and more to manage the employee digital experience. By collecting device and application events, processing them in real time, and storing them, our platform analyzes data to solve problems and improve experiences for over 15 million employees across five continents.

In just 3 years, Nexthink’s business grew tenfold, and with the introduction of more real-time data our application had to scale from processing 200 MB per second to 5 GB per second and trillions of events daily. To enable this growth, we modernized our application from an on-premises single-tenant monolith to a cloud-native scalable SaaS solution powered by Amazon MSK.

The next sections detail our modernization journey, including the challenges we faced and the benefits we realized with our new cloud-native, AWS-based architecture.

The on-premises solution and its challenges

Let’s first explore our previous on-premises solution, Nexthink V6, before examining how Amazon MSK addressed its challenges. The following diagram illustrates its architecture.

Nexthink v6

V6 was made up of two monolithic, single-tenant Java and C++ applications that were tightly coupled. The portal was a backend-for-frontend Java application, and the core engine was an in-house C++ in-memory database application that also handled device connections, data ingestion, aggregation, and querying. By bundling all these functions together, the engine became difficult to manage and improve.

V6 also lacked scalability. It initially supported 10,000 devices, but some new tenants had over 300,000 devices. We reacted by deploying multiple V6 engines per tenant, increasing complexity and cost, hampering user experience, and delaying time to market. This also led to longer proof of concept and onboarding cycles, which hurt the business.

Furthermore, the absence of a streaming platform like Kafka created dependencies between teams through tight HTTP/gRPC coupling. Additionally, teams couldn’t access real-time events before ingestion into the database, limiting feature development. We also lacked a data buffer, risking potential data loss during outages. Such constraints impeded innovation and increased risks.

In summary, although the V6 system served its initial purpose, reinventing it with cloud-native technologies became imperative to enhance scalability and reliability and to foster innovation by our engineering and product teams.

Transitioning to a cloud-native architecture with Amazon MSK

To achieve our modernization goals, after thorough research and iterations, we implemented an event-driven microservices design on Amazon Elastic Kubernetes Service (Amazon EKS), using Kafka on Amazon MSK for distributed event storage and streaming.

Our transition from the V6 on-premises solution to the cloud-native platform was phased over four iterations:

  • Phase 1 – We lifted and shifted from on premises to virtual machines in the cloud, reducing operational complexities and accelerating proof of concept cycles while transparently migrating customers.
  • Phase 2 – We extended the cloud architecture by implementing new product features with microservices and self-managed Kafka on Kubernetes. However, operating Kafka clusters ourselves proved overly difficult, leading us to Phase 3.
  • Phase 3 – We switched from self-managed Kafka to Amazon MSK, improving stability and reducing operational costs. We realized that managing Kafka wasn’t our core competency or differentiator, and the overhead was high. Amazon MSK enabled us to focus on our core application, freeing us from the burden of undifferentiated Kafka management.
  • Phase 4 – Finally, we eliminated all legacy components, completing the transition to a fully cloud-native SaaS platform. This journey of learning and transformation took 3 years.

Today, after our successful transition, we use Amazon MSK for two key functions:

  • Real-time data ingestion and processing of trillions of daily events from over 15 million devices worldwide, as illustrated in the following figure.

Nexthink Architecture Ingestion

  • Enabling an event-driven system that decouples data producers and consumers, as depicted in the following figure.

Nexthink Architecture Event Driven

To further enhance our scalability and resilience, we adopted a cell-based architecture using the wide availability of Amazon MSK across AWS Regions. We currently operate over 10 cells, each representing an independent regional deployment of our SaaS solution. This cell-based approach minimizes the area of impact in case of issues, addresses data residency requirements, and enables horizontal scaling across AWS Regions, as illustrated in the following figure.

Nexthink Architecture Cells
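
The cell-based approach can be pictured as a routing table that pins each tenant to exactly one regional cell. The following Python sketch is only an illustration under stated assumptions, not Nexthink's implementation: the `Cell` type, the cell names, the Regions, and the tenant identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One independent regional deployment (hypothetical fields)."""
    name: str
    region: str


# Hypothetical cells and tenant assignments, for illustration only.
CELLS = {
    "eu-1": Cell("eu-1", "eu-central-1"),
    "us-1": Cell("us-1", "us-east-1"),
}
TENANT_TO_CELL = {"tenant-a": "eu-1", "tenant-b": "us-1", "tenant-c": "eu-1"}


def cell_for(tenant_id: str) -> Cell:
    """Return the single cell that owns all data for a tenant."""
    return CELLS[TENANT_TO_CELL[tenant_id]]


def impacted_tenants(failed_cell: str) -> list:
    """Tenants affected by an incident in one cell; all others are untouched."""
    return sorted(t for t, c in TENANT_TO_CELL.items() if c == failed_cell)
```

Because a tenant maps to a single cell, its data stays in that cell's Region, and an incident in one cell affects only the tenants assigned to it.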

Benefits of Amazon MSK

Amazon MSK has been crucial in enabling our event-driven design. In this section, we outline the main benefits we gained from its adoption.

Improved data resilience

In our new architecture, data from devices is pushed directly to Kafka topics in Amazon MSK, which provides high availability and resilience. This makes sure that events can be safely received and stored at any time. Our services consuming this data inherit the same resilience from Amazon MSK. If our backend ingestion services face disruptions, no event is lost, because Kafka retains all published messages. When our services resume, they seamlessly continue processing from where they left off, thanks to Kafka’s delivery semantics, which allow processing messages exactly-once, at-least-once, or at-most-once based on application needs.
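
To make "continue processing from where they left off" concrete, here is a small in-memory Python sketch of at-least-once consumption: the consumer commits an offset only after the record has been handled, so a crash leaves that record to be read again on restart. `InMemoryLog` and `consume` are hypothetical stand-ins for a single topic partition and a consumer loop, not the Kafka client API.

```python
class InMemoryLog:
    """Toy stand-in for one topic partition: an append-only list."""

    def __init__(self):
        self.records = []
        self.committed = {}  # consumer group -> next offset to read

    def append(self, record):
        self.records.append(record)

    def poll(self, group):
        """Return (offset, record) pairs not yet committed by this group."""
        start = self.committed.get(group, 0)
        return list(enumerate(self.records[start:], start=start))

    def commit(self, group, offset):
        """Mark everything up to and including `offset` as processed."""
        self.committed[group] = offset + 1


def consume(log, group, handler):
    """At-least-once: commit only after the handler succeeds."""
    for offset, record in log.poll(group):
        handler(record)            # may raise; the offset then stays uncommitted
        log.commit(group, offset)
```

If the handler fails partway through, the next call to `consume` starts again at the first uncommitted record, so nothing is skipped. The trade-off is that a record can be handled more than once, which is why at-least-once consumers are usually written to be idempotent.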

Amazon MSK enables us to tailor the data retention duration to our specific requirements, ranging from seconds to unlimited duration. This flexibility grants uninterrupted data availability to our application, which wasn’t possible with our previous architecture. Furthermore, to safeguard data integrity in the event of processing errors or corruption, Kafka enabled us to implement a data replay mechanism, ensuring data consistency and reliability.
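
Retention and replay can be illustrated with a toy log that drops records older than a configurable window and lets a reader re-read everything still retained from a chosen point in time. This is a simplified sketch under stated assumptions: `RetainedLog` and its methods are hypothetical, and real Kafka topics express retention through topic configuration such as `retention.ms` rather than through code like this.

```python
from collections import deque


class RetainedLog:
    """Toy log keeping records for `retention_s` seconds (None = unlimited)."""

    def __init__(self, retention_s=None):
        self.retention_s = retention_s
        self.records = deque()  # (timestamp, value) pairs, oldest first

    def append(self, ts, value):
        self.records.append((ts, value))
        self._expire(now=ts)

    def _expire(self, now):
        """Drop records that have aged out of the retention window."""
        if self.retention_s is None:
            return
        while self.records and now - self.records[0][0] > self.retention_s:
            self.records.popleft()

    def replay_from(self, ts):
        """Re-read every retained record with timestamp >= ts."""
        return [v for t, v in self.records if t >= ts]
```

A replay after a processing bug then amounts to calling `replay_from` with the time the bug was introduced, provided the retention window is long enough to still hold those records.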

Organizational scaling

By adopting an event-driven architecture with Amazon MSK, we decomposed our monolithic application into loosely coupled, stateless microservices communicating asynchronously through Kafka topics. This approach enabled our engineering organization to scale rapidly from just 4–5 teams in 2019 to over 40 teams and approximately 350 engineers today.

The loose coupling between event publishers and subscribers empowered teams to focus on distinct domains, such as data ingestion, identity services, and data lakes. Teams could develop solutions independently within their domains, communicating through Kafka topics without tight coupling. This architecture accelerated feature development by minimizing the risk of new features impacting existing ones. Teams could efficiently consume events published by others, offering new capabilities more rapidly while reducing cross-team dependencies.

The following figure illustrates the seamless workflow of adding new domains to our system.

Adding domains

Furthermore, the event-driven design allowed teams to build stateless services that could seamlessly auto scale based on MSK metrics like messages per second. This event-driven scalability eliminated the need for extensive capacity planning and manual scaling efforts, freeing up development time.
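
Scaling on a throughput metric reduces to a small calculation: divide the observed message rate by what one replica can handle, round up, and clamp to configured bounds. The Python sketch below shows that arithmetic only; the function name, the per-replica capacity, and the bounds are assumptions for illustration, not values from Nexthink's setup.

```python
import math


def desired_replicas(messages_per_s, per_replica_capacity,
                     min_replicas=1, max_replicas=50):
    """Replicas needed so each handles at most `per_replica_capacity` msg/s."""
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    needed = math.ceil(messages_per_s / per_replica_capacity)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, needed))
```

For example, `desired_replicas(12_000, 5_000)` returns 3. For a Kafka consumer group, the useful upper bound is the topic's partition count, because extra consumers beyond that number sit idle.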

By using an event-driven microservices architecture on Amazon MSK, we achieved organizational agility, enhanced scalability, and accelerated innovation while minimizing operational overhead.

Seamless infrastructure scaling

Nexthink’s business grew tenfold in 3 years, and many new capabilities were added to the product, leading to a substantial increase in traffic from 200 MB per second to 5 GB per second. This exponential data growth was enabled by the robust scalability of Amazon MSK. Achieving such scale with an on-premises solution would have been challenging and expensive, if not infeasible.

Attempting to self-manage Kafka imposed unnecessary operational overhead without providing business value. Running it with just 5% of today’s traffic was already complex and required two engineers. At today’s volumes, we estimated needing 6–10 dedicated staff, increasing costs and diverting resources away from core priorities.

Real-time capabilities

By channeling all our data through Amazon MSK, we enabled real-time processing of events. This unlocked capabilities like real-time alerts, event-driven triggers, and webhooks that were previously unattainable. As such, Amazon MSK was instrumental in facilitating our event-driven architecture and powering impactful innovations.

Secure data access

Transitioning to our new architecture, we met our security and data integrity goals. With Kafka ACLs, we enforced strict access controls, allowing consumers and producers to only interact with authorized topics. We based these granular data access controls on criteria like data type, domain, and team.
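
The effect of topic-level access control can be sketched as a deny-by-default check over a list of grants. The Python example below models the idea only and is not how Kafka evaluates ACLs internally: the `Acl` type, the service principals, and the topic names are hypothetical, and prefix matching is used here as one plausible way to group topics by domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acl:
    principal: str     # identity of the producing or consuming service
    operation: str     # "READ" or "WRITE"
    topic_prefix: str  # grant applies to topics starting with this prefix


# Hypothetical grants, for illustration only.
ACLS = [
    Acl("svc-ingestion", "WRITE", "ingestion.events."),
    Acl("svc-alerts", "READ", "ingestion.events."),
]


def is_allowed(principal, operation, topic, acls=ACLS):
    """Deny by default; allow only when a matching grant exists."""
    return any(
        a.principal == principal
        and a.operation == operation
        and topic.startswith(a.topic_prefix)
        for a in acls
    )
```

With these grants, the alerts service can read ingestion events but cannot write to them, and any service without a grant is refused.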

To securely scale decentralized management of topics, we introduced proprietary Kubernetes Custom Resource Definitions (CRDs). These CRDs enabled teams to independently manage their own topics, settings, and ACLs without compromising security.

Amazon MSK encryption made sure that the data remained encrypted at rest and in transit. We also introduced a Bring Your Own Key (BYOK) option, allowing application-level encryption with customer keys for all single-tenant and multi-tenant topics.

Enhanced observability

Amazon MSK gave us great visibility into our data flows. The out-of-the-box Amazon CloudWatch metrics let us see the amount and types of data flowing through each topic and cluster. This helped us quantify the usage of our product features by tracking data volumes at the topic level. The Amazon MSK operational metrics enabled effortless monitoring and right-sizing of clusters and brokers. Overall, the rich observability of Amazon MSK facilitated data-driven decisions about architecture and product features.

Conclusion

Nexthink’s journey from an on-premises monolith to a cloud SaaS was streamlined by using Amazon MSK, a fully managed Kafka service. Amazon MSK allowed us to scale seamlessly while benefiting from enterprise-grade reliability and security. By offloading Kafka management to AWS, we could stay focused on our core business and innovate faster.

Going forward, we plan to further improve performance, costs, and scalability by adopting Amazon MSK capabilities such as tiered storage and AWS Graviton-based EC2 instance types.

We’re also working closely with the Amazon MSK team to prepare for upcoming service features. Rapidly adopting new capabilities will help us remain at the forefront of innovation while continuing to grow our business.

To learn more about how Nexthink uses AWS to serve its global customer base, explore the Nexthink on AWS case study. Additionally, discover other customer success stories with Amazon MSK by visiting the Amazon MSK blog category.


About the Authors

Moe Haidar is a principal engineer and special projects lead in the CTO office at Nexthink. He has been involved with AWS since 2018 and is a key contributor to the cloud transformation of the Nexthink platform to AWS. His focus is on product and technology incubation and architecture, but he also loves doing hands-on work to keep his knowledge of technologies sharp and up to date. He still contributes heavily to the code base and loves to tackle complex problems.
Simone Pomata is a Senior Solutions Architect at AWS. He has worked enthusiastically in the tech industry for more than 10 years. At AWS, he helps customers succeed in building new technologies every day.
Magdalena Gargas is a Solutions Architect passionate about technology and solving customer challenges. At AWS, she works mostly with software companies, helping them innovate in the cloud. She participates in industry events, sharing insights and contributing to the advancement of the containerization field.
