Today, we’re announcing the general availability of Amazon DocumentDB (with MongoDB compatibility) zero-ETL integration with Amazon OpenSearch Service.
Amazon DocumentDB provides native text search and vector search capabilities. With Amazon OpenSearch Service, you can perform advanced search analytics, such as fuzzy search, synonym search, cross-collection search, and multilingual search, on Amazon DocumentDB data.
Zero-ETL integration simplifies your architecture for advanced search analytics. It frees you from the undifferentiated heavy lifting and the costs associated with building and managing data pipeline architecture and data synchronization between the two services.
In this post, we show you how to configure zero-ETL integration of Amazon DocumentDB with OpenSearch Service using Amazon OpenSearch Ingestion. It involves performing a full load of Amazon DocumentDB data and continuously streaming the latest data to Amazon OpenSearch Service using change streams. For other ingestion methods, see the documentation.
Solution overview
At a high level, this solution involves the following steps:
- Enable change streams on the Amazon DocumentDB collections.
- Create the OpenSearch Ingestion pipeline.
- Load sample data on the Amazon DocumentDB cluster.
- Verify the data in OpenSearch Service.
Prerequisites
To implement this solution, you need the following prerequisites:
Zero-ETL performs an initial full load of your collection by doing a collection scan on the primary instance of your Amazon DocumentDB cluster. This may take several minutes to complete depending on the size of the data, and you may notice elevated resource consumption on your cluster.
Enable change streams on the Amazon DocumentDB collections
Amazon DocumentDB change stream events comprise a time-ordered sequence of data changes due to inserts, updates, and deletes on your data. We use these change stream events to transmit data changes from the Amazon DocumentDB cluster to the OpenSearch Service domain.
Change streams are disabled by default; you can enable them at the individual collection level, database level, or cluster level. To enable change streams on your collections, complete the following steps:
- Connect to Amazon DocumentDB using mongo shell.
- Enable change streams on your collection with the modifyChangeStreams admin command. For this post, we use the Amazon DocumentDB database inventory and collection product.
If you have more than one collection for which you want to stream data into OpenSearch Service, enable change streams for each collection. If you want to enable it at the database or cluster level, see Enabling Change Streams.
We recommend enabling change streams for only the required collections.
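As a minimal sketch of the enable step, the following Python builds the admin command document for the inventory database and product collection. In the mongo shell, the same document would be passed to db.adminCommand. The helper name build_modify_change_streams_command is hypothetical, and sending the command through a driver such as pymongo is an assumption, so that call is shown only as a comment.

```python
from typing import Any, Dict


def build_modify_change_streams_command(
    database: str, collection: str, enable: bool = True
) -> Dict[str, Any]:
    """Build the admin command document that toggles change streams.

    The command name (modifyChangeStreams) must be the first key; Python
    dicts preserve insertion order, so the order below is what gets sent.
    """
    if not database or not collection:
        raise ValueError("database and collection must be non-empty")
    return {
        "modifyChangeStreams": 1,
        "database": database,
        "collection": collection,
        "enable": enable,
    }


# Database and collection names used in this post.
command = build_modify_change_streams_command("inventory", "product")
print(command)

# Assumption: with a connected pymongo client, the command would be run
# against the admin database, for example:
#   client.admin.command(command)
```

To stream more than one collection, build and run one command per collection.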
Create an OpenSearch Ingestion pipeline
OpenSearch Ingestion is a fully managed data collector that delivers real-time log and trace data to OpenSearch Service domains. OpenSearch Ingestion is powered by the open source data collector Data Prepper. Data Prepper is part of the open source OpenSearch project.
With OpenSearch Ingestion, you can filter, enrich, transform, and deliver your data for downstream analysis and visualization. OpenSearch Ingestion is serverless, so you don’t need to worry about scaling your infrastructure, operating your ingestion fleet, and patching or updating the software.
For a comprehensive overview of OpenSearch Ingestion, visit Amazon OpenSearch Ingestion, and for more information about the Data Prepper open source project, visit Data Prepper.
To create an OpenSearch Ingestion pipeline, complete the following steps:
- On the OpenSearch Service console, choose Pipelines in the navigation pane.
- Choose Create pipeline.
- For Pipeline name, enter a name (for example, zeroetl-docdb-to-opensearch).
- Set up pipeline capacity for compute resources to automatically scale your pipeline based on the current ingestion workload.
- Enter the minimum and maximum Ingestion OpenSearch Compute Units (OCUs). In this example, we use the default pipeline capacity settings of minimum 1 Ingestion OCU and maximum 4 Ingestion OCUs.
Each OCU is a combination of approximately 8 GB of memory and 2 vCPUs that can handle an estimated 8 GiB per hour. OpenSearch Ingestion supports up to 96 OCUs, and it automatically scales up and down based on your ingest workload demand.
- Choose the configuration blueprint and under Use case in the navigation pane, choose ZeroETL.
- Select Zero-ETL with DocumentDB to build the pipeline configuration.
This pipeline is a combination of a source part from the Amazon DocumentDB settings and a sink part for OpenSearch Service.
You must set multiple AWS Identity and Access Management (IAM) roles (sts_role_arn) with the necessary permissions to read data from the Amazon DocumentDB database and collection and write to an OpenSearch Service domain. This role is then assumed by OpenSearch Ingestion pipelines to make sure the right security posture is always maintained when moving the data from source to destination. To learn more, see Setting up roles and users in Amazon OpenSearch Ingestion.
You need one OpenSearch Ingestion pipeline per Amazon DocumentDB collection.
Provide the following parameters from the blueprint:
- Amazon DocumentDB endpoint – Provide your Amazon DocumentDB cluster endpoint.
- Amazon DocumentDB collection – Provide your Amazon DocumentDB database name and collection name in the format dbname.collection within the collections section. For example, inventory.product.
- s3_bucket – Provide your S3 bucket name along with the AWS Region and S3 prefix. This will be used temporarily to hold the data from Amazon DocumentDB for data synchronization.
- OpenSearch hosts – Provide the OpenSearch Service domain endpoint for the host and provide the preferred index name to store the data.
- secret_id – Provide the ARN for the secret for the Amazon DocumentDB cluster along with its Region.
- sts_role_arn – Provide the ARN for the IAM role that has permissions for the Amazon DocumentDB cluster, S3 bucket, and OpenSearch Service domain.
To learn more, see Creating Amazon OpenSearch Ingestion pipelines.
- After entering all the required values, validate the pipeline configuration for any errors.
- When designing a production workload, deploy your pipeline within a VPC. Choose your VPC, subnets, and security groups. Also select Attach to VPC and choose the corresponding VPC CIDR range.
The security group inbound rule should have access to the Amazon DocumentDB port. For more information, refer to Securing Amazon OpenSearch Ingestion pipelines within a VPC.
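Before entering values in the console, it can help to sanity-check a few of them locally. The following is a rough sketch: it checks the dbname.collection format and the OCU range, and estimates throughput from the 8 GiB per hour per OCU figure given above. The PipelineSettings class and its field names are hypothetical and are not part of the blueprint or of any AWS API.

```python
import re
from dataclasses import dataclass
from typing import Tuple

GIB_PER_OCU_HOUR = 8     # estimated throughput per OCU, as stated in this post
MAX_SUPPORTED_OCUS = 96  # upper limit stated in this post


@dataclass(frozen=True)
class PipelineSettings:
    """Hypothetical local holder for a few pipeline inputs."""

    pipeline_name: str
    collection: str   # expected format: dbname.collection
    min_ocu: int = 1  # default minimum used in this post
    max_ocu: int = 4  # default maximum used in this post

    def validate(self) -> None:
        # Database part has no dots; everything after the first dot is the
        # collection name.
        if not re.fullmatch(r"[^.\s]+\.\S+", self.collection):
            raise ValueError(
                f"collection must look like dbname.collection: {self.collection!r}"
            )
        if not 1 <= self.min_ocu <= self.max_ocu <= MAX_SUPPORTED_OCUS:
            raise ValueError("OCU range must satisfy 1 <= min <= max <= 96")

    def throughput_range_gib_per_hour(self) -> Tuple[int, int]:
        """Estimated (minimum, maximum) ingest throughput in GiB per hour."""
        return (
            self.min_ocu * GIB_PER_OCU_HOUR,
            self.max_ocu * GIB_PER_OCU_HOUR,
        )


settings = PipelineSettings("zeroetl-docdb-to-opensearch", "inventory.product")
settings.validate()
print(settings.throughput_range_gib_per_hour())  # (8, 32)
```

With the default capacity of 1 to 4 OCUs, the estimate is roughly 8 to 32 GiB per hour. Treat this as a sizing hint, not a guarantee.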
Load sample data on the Amazon DocumentDB cluster
Complete the following steps to load the sample data:
- Connect to your Amazon DocumentDB cluster.
- Insert some documents into the product collection in the inventory database. For creating and updating documents on Amazon DocumentDB, refer to Working with Documents.
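As an illustration of the load step, the sketch below prepares a small batch of product documents. The document shape is an assumption: only the item Ultra GelPen and the OnHand and MinOnHand fields are named in this post, so the Item key, the second item, and all quantities are hypothetical. Inserting through pymongo is likewise an assumption and appears only as a comment.

```python
import json
from typing import Any, Dict, List


def build_sample_products() -> List[Dict[str, Any]]:
    """Return hypothetical sample documents for the product collection."""
    return [
        {"Item": "Ultra GelPen", "OnHand": 100, "MinOnHand": 35},
        {"Item": "Example Notebook", "OnHand": 250, "MinOnHand": 50},
    ]


def validate_product(doc: Dict[str, Any]) -> None:
    """Check that a document has the fields used later in this post."""
    for field in ("Item", "OnHand", "MinOnHand"):
        if field not in doc:
            raise ValueError(f"missing field: {field}")
    if doc["OnHand"] < 0 or doc["MinOnHand"] < 0:
        raise ValueError("stock quantities must be non-negative")


sample_products = build_sample_products()
for product in sample_products:
    validate_product(product)

print(json.dumps(sample_products, indent=2))

# Assumption: with a connected pymongo client, the batch would be inserted as:
#   client["inventory"]["product"].insert_many(sample_products)
```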
Verify the data in OpenSearch Service
You can use the OpenSearch Dashboards dev console to search for the synchronized items within a few seconds. For more information, see Creating and searching for documents in Amazon OpenSearch Service.
To verify the change data capture (CDC), update the OnHand and MinOnHand fields for the existing document item Ultra GelPen in the product collection.
Then verify the CDC for the update to the document for the item Ultra GelPen on the OpenSearch Service index.
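The CDC check can be sketched as two small payloads: an update for the Ultra GelPen document and a match query to look it up in the index afterwards. The Item key, the top-level placement of OnHand and MinOnHand, and the new quantities are assumptions for illustration. The helper names are hypothetical.

```python
import json
from typing import Any, Dict, Tuple


def build_stock_update(
    item: str, on_hand: int, min_on_hand: int
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (filter, update) documents for a single-document update."""
    filter_doc = {"Item": item}
    update_doc = {"$set": {"OnHand": on_hand, "MinOnHand": min_on_hand}}
    return filter_doc, update_doc


def build_item_search(item: str) -> Dict[str, Any]:
    """Return an OpenSearch query DSL body that matches on the item name."""
    return {"query": {"match": {"Item": item}}}


# Hypothetical new stock levels for the existing item.
filter_doc, update_doc = build_stock_update("Ultra GelPen", 300, 100)
search_body = build_item_search("Ultra GelPen")

print(json.dumps(update_doc))
print(json.dumps(search_body))

# Assumption: with a connected pymongo client, the update would be applied as:
#   client["inventory"]["product"].update_one(filter_doc, update_doc)
# The search body can be pasted into the dev console under a request line
# such as: GET <your-index-name>/_search
```

If the pipeline is streaming correctly, the search result should show the updated OnHand and MinOnHand values shortly after the update runs.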
Monitor the CDC pipeline
You can monitor the state of the pipelines by checking the status of the pipeline on the OpenSearch Service console. Additionally, you can use Amazon CloudWatch to provide real-time metrics and logs, which lets you set up alerts in case of a breach of user-defined thresholds.
Clean up
Make sure you clean up unwanted AWS resources created during this post in order to prevent additional billing for these resources. Follow these steps to clean up your AWS account:
- On the OpenSearch Service console, choose Domains under Managed clusters in the navigation pane.
- Select the domain you want to delete and choose Delete.
- Choose Pipelines under Ingestion in the navigation pane.
- Select the pipeline you want to delete and on the Actions menu, choose Delete.
- On the Amazon S3 console, select the S3 bucket and choose Delete.
Conclusion
In this post, you learned how to enable zero-ETL integration between Amazon DocumentDB change data streams and OpenSearch Service. To learn more about zero-ETL integrations available with other data sources, see Working with Amazon OpenSearch Ingestion pipeline integrations.
About the Authors
Praveen Kadipikonda is a Senior Analytics Specialist Solutions Architect at AWS based out of Dallas. He helps customers build efficient, performant, and scalable analytic solutions. He has worked with building databases and data warehouse solutions for over 15 years.
Kaarthiik Thota is a Senior Amazon DocumentDB Specialist Solutions Architect at AWS based out of London. He is passionate about database technologies and enjoys helping customers solve problems and modernize applications using NoSQL databases. Before joining AWS, he worked extensively with relational databases, NoSQL databases, and business intelligence technologies for over 15 years.
Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.