Perform reindexing in Amazon OpenSearch Serverless using Amazon OpenSearch Ingestion


Amazon OpenSearch Serverless is a serverless deployment option for Amazon OpenSearch Service that makes it easy to run search and analytics workloads without managing infrastructure. Customers using OpenSearch Serverless often need to copy documents between two indexes within the same collection or across different collections. This primarily arises from two scenarios:

  • Reindexing – You periodically need to update or modify index mappings due to evolving data needs or schema changes
  • Disaster recovery – Although OpenSearch Serverless data is inherently durable, you may want to copy data across AWS Regions for added redundancy and resiliency

Amazon OpenSearch Ingestion recently launched a feature supporting OpenSearch as a source. OpenSearch Ingestion, a fully managed, serverless data collector, facilitates real-time ingestion of log, metric, and trace data into OpenSearch Service domains and OpenSearch Serverless collections. We can use this feature to address these two scenarios by reading the data from an OpenSearch Serverless collection. This capability allows you to effortlessly copy data between indexes, making data management tasks more streamlined and eliminating the need for custom code.

In this post, we outline the steps to copy data between two indexes in the same OpenSearch Serverless collection using the new OpenSearch source feature of OpenSearch Ingestion. This is particularly useful for reindexing operations where you want to change your data schema. OpenSearch Serverless and OpenSearch Ingestion are both serverless services that enable you to seamlessly handle your data workflows, providing optimal performance and scalability.

Solution overview

The following diagram shows the flow of copying documents from the source index to the destination index using an OpenSearch Ingestion pipeline.

Implementing the solution consists of the following steps:

  1. Create an AWS Identity and Access Management (IAM) role to use as an OpenSearch Ingestion pipeline role.
  2. Update the data access policy attached to the OpenSearch Serverless collection.
  3. Create an OpenSearch Ingestion pipeline that simply copies data from one index to another, or create an index template using the OpenSearch Ingestion pipeline to define explicit mapping, and then copy the data from the source index to the destination index with the defined mapping applied.

Prerequisites

To get started, you must have an active OpenSearch Serverless collection with an index that you want to reindex (copy). Refer to Creating collections to learn more about creating a collection.
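
If you don't have test data yet, the following is a minimal sketch for creating a sample document in the source index used throughout this walkthrough, run from OpenSearch Dashboards Dev Tools (the index name and fields are illustrative; OpenSearch auto-generates the document ID):

# Index a sample document into the source index
POST /logs-2024.03.01/_doc
{
  "Data": "sample log entry",
  "Type": "application",
  "LargeDouble": 1234.5678
}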

When the collection is ready, note the following details:

  • The endpoint of the OpenSearch Serverless collection
  • The name of the index from which the documents need to be copied
  • If the collection is defined as a VPC collection, note down the name of the network policy attached to the collection

You use these details in the ingestion pipeline configuration.

Create an IAM role to use as a pipeline role

An OpenSearch Ingestion pipeline needs certain permissions to pull data from the source and write to its sink. For this walkthrough, both the source and sink are the same, but if the source and sink collections are different, modify the policy accordingly.

Complete the following steps:

  1. Create an IAM policy (opensearch-ingestion-pipeline-policy) that provides permission to read and send data to the OpenSearch Serverless collection. The following is a sample policy with least privileges (modify {account-id}, {region}, {collection-id}, and {collection-name} accordingly):
    {
        "Version": "2012-10-17",
        "Statement": [{
                "Action": [
                    "aoss:BatchGetCollection",
                    "aoss:APIAccessAll"
                ],
                "Effect": "Allow",
                "Resource": "arn:aws:aoss:{region}:{account-id}:collection/{collection-id}"
            },
            {
                "Action": [
                    "aoss:CreateSecurityPolicy",
                    "aoss:GetSecurityPolicy",
                    "aoss:UpdateSecurityPolicy"
                ],
                "Effect": "Allow",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "aoss:collection": "{collection-name}"
                    }
                }
            }
        ]
    }

  2. Create an IAM role (opensearch-ingestion-pipeline-role) that the OpenSearch Ingestion pipeline will assume. While creating the role, use the policy you created (opensearch-ingestion-pipeline-policy). The role should have the following trust relationship (modify {account-id} and {region} accordingly):
    {
        "Model": "2012-10-17",
        "Assertion": [{
            "Effect": "Allow",
            "Principal": {
                "Service": "osis-pipelines.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "{account-id}"
                },
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:osis:{region}:{account-id}:pipeline/*"
                }
            }
        }]
    }

  3. Record the ARN of the newly created IAM role (arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role).

Update the data access policy attached to the OpenSearch Serverless collection

After you create the IAM role, you must update the data access policy attached to the OpenSearch Serverless collection. Data access policies control access to the OpenSearch operations that OpenSearch Serverless supports, such as PUT <index> or GET _cat/indices. To perform the update, complete the following steps:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose Collections.
  2. From the list of collections, choose your OpenSearch Serverless collection.
  3. On the Overview tab, in the Data access section, choose the associated policy.
  4. Choose Edit.
  5. Edit the policy in the JSON editor to add the following JSON rule block to the existing JSON (modify {account-id} and {collection-name} accordingly):
    {
        "Guidelines": [{
            "Resource": [
                "index/{collection-name}/*"
            ],
            "Permission": [
                "aoss:CreateIndex",
                "aoss:UpdateIndex",
                "aoss:DescribeIndex",
                "aoss:ReadDocument",
                "aoss:WriteDocument"
            ],
            "ResourceType": "index"
        }],
        "Principal": [
            "arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role"
        ],
        "Description": "Present entry to OpenSearch Ingestion Pipeline Function"
    }

You can also use the Visual Editor method to choose Add another rule and add the preceding permissions for arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role.

  6. Choose Save.

Now you have successfully allowed the OpenSearch Ingestion role to perform OpenSearch operations against the OpenSearch Serverless collection.

Create and configure the OpenSearch Ingestion pipeline to copy the data from one index to another

Complete the following steps:

  1. On the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
  2. Choose Create a pipeline.
  3. In Choose Blueprint, select OpenSearchDataMigrationPipeline.
  4. For Pipeline name, enter a name (for example, sample-ingestion-pipeline).
  5. For Pipeline capacity, you can define the minimum and maximum capacity to scale the resources. For this walkthrough, you can use the default value of 2 Ingestion OCUs for Min capacity and 4 Ingestion OCUs for Max capacity. However, you can also choose different values, because OpenSearch Ingestion automatically scales your pipeline capacity according to your estimated workload, based on the minimum and maximum Ingestion OpenSearch Compute Units (Ingestion OCUs) that you specify.
  6. Update the following information for the source (a consolidated sample configuration follows these steps):
    1. Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless collection that was copied as part of the prerequisites.
    2. Uncomment include and index_name_regex, and specify the name of the index that will act as the source (in this demo, we're using logs-2024.03.01).
    3. Uncomment region under aws and specify the AWS Region where your OpenSearch Serverless collection is (for example, us-east-1).
    4. Uncomment sts_role_arn under aws and specify the role that has permission to read data from the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role). This is the same role that was added in the data access policy of the collection.
    5. Update the serverless flag to true.
    6. If the OpenSearch Serverless collection has VPC access, uncomment serverless_options and network_policy_name and specify the name of the network policy used for the collection.
    7. Uncomment scheduling, interval, index_read_count, and start_time and modify these parameters accordingly.
      Using these parameters makes sure the OpenSearch Ingestion pipeline processes the indexes multiple times (to pick up new documents).
      Note – If the collection specified in the sink is of the Time series or Vector search type, you can keep the scheduling, interval, index_read_count, and start_time parameters commented.
  7. Update the following information for the sink:
    1. Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless collection.
    2. Uncomment sts_role_arn under aws and specify the role that has permission to write data into the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role). This is the same role that was added in the data access policy of the collection.
    3. Update the serverless flag to true.
    4. If the OpenSearch Serverless collection has VPC access, uncomment serverless_options and network_policy_name and specify the name of the network policy used for the collection.
    5. Update the value for index and provide the index name to which you want to transfer the documents (for example, new-logs-2024.03.01).
    6. For document_id, you can get the ID from the document metadata in the source and use the same in the target.
      However, it is important to note that custom document IDs are only supported for the Search type of collection. If your collection is of the Time series or Vector search type, you should comment out the document_id line.
    7. (Optional) The values for the bucket, region, and sts_role_arn keys within the dlq section can be modified to capture any failed requests in an S3 bucket.
      Note – Additional permission needs to be granted to opensearch-ingestion-pipeline-role if you configure a DLQ. Refer to Writing to a dead-letter queue for the changes required.
      For this walkthrough, you will not set up a DLQ, so you can remove the entire dlq block.
  8. Choose Validate pipeline to validate the pipeline configuration.
  9. For Network settings, choose your preferred setting:
    1. Choose VPC access and select your VPC, subnet, and security group to set up the access privately. Choose this option if the OpenSearch Serverless collection has VPC access. AWS recommends using a VPC endpoint for all production workloads.
    2. Choose Public to use public access. For this walkthrough, we select Public because the collection is also accessible from a public network.
  10. For Log Publishing Option, you can either create a new Amazon CloudWatch group or use an existing CloudWatch group to write the ingestion logs. This provides access to information about errors and warnings raised during the operation, which can help during troubleshooting. For this walkthrough, choose Create new group.
  11. Choose Next, and verify the details you specified for your pipeline settings.
  12. Choose Create pipeline.
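
For reference, the following is a minimal sketch of what the edited pipeline configuration might look like after these steps. The endpoint, Region, role ARN, and schedule values are placeholders from this walkthrough; verify them against the actual blueprint before use:

version: "2"
opensearch-migration-pipeline:
  source:
    opensearch:
      # Endpoint of the OpenSearch Serverless collection noted in the prerequisites
      hosts: [ "https://{collection-id}.{region}.aoss.amazonaws.com" ]
      indices:
        include:
          # Source index to copy documents from
          - index_name_regex: "logs-2024.03.01"
      # Re-read the source index on a schedule to pick up new documents
      scheduling:
        interval: "PT2H"
        index_read_count: 3
        start_time: "2024-03-01T00:00:00"
      aws:
        region: "us-east-1"
        # Role added to the collection's data access policy
        sts_role_arn: "arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role"
        serverless: true
  sink:
    - opensearch:
        hosts: [ "https://{collection-id}.{region}.aoss.amazonaws.com" ]
        index: "new-logs-2024.03.01"
        # Preserve the source document IDs (Search collections only)
        document_id: "${getMetadata(\"opensearch-document_id\")}"
        aws:
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role"
          serverless: true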

It’s going to take a few minutes to create the ingestion pipeline. After the pipeline is created, you will note the paperwork within the vacation spot index, specified within the sink (for instance, new-logs-2024.03.01). After all of the paperwork are copied, you possibly can validate the variety of paperwork through the use of the depend API.

When the process is complete, you have the option to stop or delete the pipeline. If you choose to keep the pipeline running, it will continue to copy new documents from the source index according to the defined schedule, if specified.

In this walkthrough, the endpoint defined in the hosts parameter under the source and sink of the pipeline configuration belonged to the same collection, which was of the Search type. If the collections are different, you must modify the permissions for the IAM role (opensearch-ingestion-pipeline-role) to allow access to both collections. Additionally, make sure you update the data access policy for both collections to grant access to the OpenSearch Ingestion pipeline.
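
For example, the first statement of opensearch-ingestion-pipeline-policy could list both collection ARNs ({source-collection-id} and {sink-collection-id} are placeholders for your own collection IDs):

    {
        "Action": [
            "aoss:BatchGetCollection",
            "aoss:APIAccessAll"
        ],
        "Effect": "Allow",
        "Resource": [
            "arn:aws:aoss:{region}:{account-id}:collection/{source-collection-id}",
            "arn:aws:aoss:{region}:{account-id}:collection/{sink-collection-id}"
        ]
    }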

Create an index template using the OpenSearch Ingestion pipeline to define mapping

In OpenSearch, you can define how documents and their fields are stored and indexed by creating a mapping. The mapping specifies the list of fields for a document. Every field in the document has a field type, which defines the type of data the field contains. OpenSearch Service dynamically maps data types in each incoming document if an explicit mapping is not defined. However, you can use the template_type parameter with the index-template value and template_content with the JSON content of the index template in the pipeline configuration to define explicit mapping rules. You also need to define the index_type parameter with the value custom.

The following code shows an example of the sink portion of the pipeline and the usage of index_type, template_type, and template_content:

sink:
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "<<OpenSearch-Serverless-Collection-Endpoint>>" ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role"
          # Provide the region of the domain.
          region: "us-east-1"
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
          serverless: true
          # serverless_options:
            # Specify a name here to create or update the network policy for the serverless collection
            # network_policy_name: "network-policy-name"
        # This will make it so each document in the source cluster will be written to the same index in the destination cluster
        index: "new-logs-2024.03.01"
        index_type: custom
        template_type: index-template
        template_content: >
          {
            "template" : {
              "mappings" : {
                "properties" : {
                  "Data" : {
                    "type" : "text"
                  },
                  "EncodedColors" : {
                    "type" : "binary"
                  },
                  "Type" : {
                    "type" : "keyword"
                  },
                  "LargeDouble" : {
                    "type" : "double"
                  }
                }
              }
            }
          }
        # This will make it so each document in the source cluster will be written with the same document_id in the destination cluster
        document_id: "${getMetadata(\"opensearch-document_id\")}"
        # Enable the 'distribution_version' setting if the AWS OpenSearch Service domain is of version Elasticsearch 6.x
        # distribution_version: "es6"
        # Enable and switch the 'enable_request_compression' flag if the default compression setting is changed in the domain. See https://docs.aws.amazon.com/opensearch-service/latest/developerguide/gzip.html
        # enable_request_compression: true/false
        # Enable the S3 DLQ to capture any failed requests in an S3 bucket
        # dlq:
          # s3:
            # Provide an S3 bucket
            # bucket: "<<your-dlq-bucket-name>>"
            # Provide a key path prefix for the failed requests
            # key_path_prefix: "<<logs/dlq>>"
            # Provide the region of the bucket.
            # region: "<<us-east-1>>"
            # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
            # sts_role_arn: "<<arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role>>"

Alternatively, you can create the index with the mapping in the collection first, before you start the pipeline.
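
For example, the following is a sketch of pre-creating the destination index with the same explicit mapping from OpenSearch Dashboards Dev Tools (the index and field names match the preceding template example):

# Create the destination index with an explicit mapping before starting the pipeline
PUT /new-logs-2024.03.01
{
  "mappings": {
    "properties": {
      "Data":          { "type": "text" },
      "EncodedColors": { "type": "binary" },
      "Type":          { "type": "keyword" },
      "LargeDouble":   { "type": "double" }
    }
  }
}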

If you want to create a template using an OpenSearch Ingestion pipeline, you must provide the aoss:UpdateCollectionItems and aoss:DescribeCollectionItems permissions for the collection in the data access policy for the pipeline role (opensearch-ingestion-pipeline-role). The updated JSON block for the rule would look like the following:

{
    "Rules": [
      {
        "Resource": [
          "collection/{collection-name}"
        ],
        "Permission": [
          "aoss:UpdateCollectionItems",
          "aoss:DescribeCollectionItems"
        ],
        "ResourceType": "collection"
      },
      {
        "Resource": [
          "index/{collection-name}/*"
        ],
        "Permission": [
          "aoss:CreateIndex",
          "aoss:UpdateIndex",
          "aoss:DescribeIndex",
          "aoss:ReadDocument",
          "aoss:WriteDocument"
        ],
        "ResourceType": "index"
      }
    ],
    "Principal": [
      "arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role"
    ],
    "Description": "Provide access to OpenSearch Ingestion Pipeline Role"
  }

Conclusion

In this post, we showed how to use an OpenSearch Ingestion pipeline to copy data from one index to another in an OpenSearch Serverless collection. OpenSearch Ingestion also allows you to perform transformation of data using various processors. AWS offers various resources for you to quickly start building pipelines using OpenSearch Ingestion. You can use various built-in pipeline integrations to quickly ingest data from Amazon DynamoDB, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Security Lake, Fluent Bit, and many more. You can use OpenSearch Ingestion blueprints to build data pipelines with minimal configuration changes.


About the Authors

Utkarsh Agarwal is a Cloud Support Engineer in the Support Engineering team at Amazon Web Services. He specializes in Amazon OpenSearch Service. He provides guidance and technical assistance to customers, enabling them to build scalable, highly available, and secure solutions in the AWS Cloud. In his free time, he enjoys watching movies, TV series, and of course, cricket. Lately, he has also been attempting to master the art of cooking in his free time – the taste buds are excited, but the kitchen might disagree.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.
