Use multiple bookmark keys in AWS Glue JDBC jobs


AWS Glue is a serverless data integration service that you can use to catalog data and prepare it for analytics. With AWS Glue, you can discover your data, develop scripts to transform sources into targets, and schedule and run extract, transform, and load (ETL) jobs in a serverless environment. AWS Glue jobs are responsible for running the data processing logic.

One important feature of AWS Glue jobs is the ability to use bookmark keys to process data incrementally. When an AWS Glue job runs, it reads data from a data source and processes it. One or more columns from the source table can be specified as bookmark keys. The key columns should have sequentially increasing or decreasing values without gaps. These values mark the last processed record in a batch, and the next run of the job resumes from that point. This lets you process large amounts of data incrementally. Without job bookmark keys, AWS Glue jobs would have to reprocess all the data during every run, which can be time-consuming and costly. With bookmark keys, AWS Glue jobs resume processing from where they left off, saving time and reducing costs.

This post explains how to use multiple columns as job bookmark keys in an AWS Glue job with a JDBC connection to the source data store. It also demonstrates how to parameterize the bookmark key columns and table names in the AWS Glue job connection options.
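As a rough illustration of what those connection options look like, the following sketch builds the bookmark-related options as a plain Python dictionary. The option names jobBookmarkKeys and jobBookmarkKeysSortOrder are AWS Glue JDBC connection options. The helper name build_bookmark_options and the database, table, and context names in the trailing comment are hypothetical and are not taken from the scripts used in this post.

```python
from typing import Dict, List


def build_bookmark_options(keys: List[str], sort_order: str = "asc") -> Dict[str, object]:
    """Build the bookmark part of the JDBC connection options (hypothetical helper)."""
    if not keys:
        raise ValueError("at least one bookmark key column is required")
    if sort_order not in ("asc", "desc"):
        raise ValueError("sort_order must be 'asc' or 'desc'")
    return {
        "jobBookmarkKeys": list(keys),           # one or more column names
        "jobBookmarkKeysSortOrder": sort_order,  # how key values are compared
    }


options = build_bookmark_options(["product_id", "version"])
print(options)

# Inside a Glue job, options like these could be passed when reading the table,
# for example (all names below are placeholders, assumed for illustration):
#
# frame = glueContext.create_dynamic_frame.from_catalog(
#     database="my_database",
#     table_name="my_table",
#     additional_options=options,
#     transformation_ctx="read_product",
# )
```

Keeping the options in a small helper makes it straightforward to feed the key columns in from job parameters later.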

This post is aimed at architects and data engineers who design and build ETL pipelines on AWS. You’re expected to have a basic understanding of the AWS Management Console, AWS Glue, Amazon Relational Database Service (Amazon RDS), and Amazon CloudWatch logs.

Solution overview

To implement this solution, we complete the following steps:

  1. Create an Amazon RDS for PostgreSQL instance.
  2. Create two tables and insert sample data.
  3. Create and run an AWS Glue job to extract data from the RDS for PostgreSQL DB instance using multiple job bookmark keys.
  4. Create and run a parameterized AWS Glue job to extract data from different tables with separate bookmark keys.

The following diagram illustrates the components of this solution.

Deploy the solution

For this solution, we provide an AWS CloudFormation template that sets up the services included in the architecture, to enable repeatable deployments. This template creates the following resources:

  • An RDS for PostgreSQL instance
  • An Amazon Simple Storage Service (Amazon S3) bucket to store the data extracted from the RDS for PostgreSQL instance
  • An AWS Identity and Access Management (IAM) role for AWS Glue
  • Two AWS Glue jobs with job bookmarks enabled to incrementally extract data from the RDS for PostgreSQL instance

To deploy the solution, complete the following steps:

  1. Launch the CloudFormation stack.
  2. Enter a stack name.
  3. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  4. Choose Create stack.
  5. Wait until the creation of the stack is complete, as shown on the AWS CloudFormation console.
  6. When the stack is complete, copy the AWS Glue scripts to the S3 bucket job-bookmark-keys-demo-<accountid>.
  7. Open AWS CloudShell.
  8. Run the following commands, replacing <accountid> with your AWS account ID:
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2907/glue/scenario_1_job.py s3://job-bookmark-keys-demo-<accountid>/scenario_1_job.py
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2907/glue/scenario_2_job.py s3://job-bookmark-keys-demo-<accountid>/scenario_2_job.py

Add sample data and run AWS Glue jobs

In this section, we connect to the RDS for PostgreSQL instance through AWS Lambda and create two tables. We also insert sample data into both tables.

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose the function LambdaRDSDDLExecute.
  3. Choose Test and choose Invoke for the Lambda function to insert the data.


The two tables product and address are created with sample data, as shown in the following screenshot.

Run the multiple_job_bookmark_keys AWS Glue job

We run the multiple_job_bookmark_keys AWS Glue job twice to extract data from the product table of the RDS for PostgreSQL instance. In the first run, all the existing records are extracted. Then we insert new records and run the job again. The job should extract only the newly inserted records in the second run.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job multiple_job_bookmark_keys.
  3. Choose Run to run the job and choose the Runs tab to monitor the job progress.
  4. Choose the Output logs link under CloudWatch logs after the job is complete.
  5. Choose the log stream in the next window to see the output logs printed.

    The AWS Glue job extracted all records from the source table product. It keeps track of the last combination of values in the columns product_id and version. Next, we run another Lambda function to insert a new record. The product_id 45 already exists, but the inserted record has a new version of 2, making the combination sequentially increasing.
  6. Run the LambdaRDSDDLExecute_incremental Lambda function to insert the new record in the product table.
  7. Run the AWS Glue job multiple_job_bookmark_keys again after you insert the record and wait for it to succeed.
  8. Choose the Output logs link under CloudWatch logs.
  9. Choose the log stream in the next window to see only the newly inserted record printed.

The job extracts only those records whose key combination is greater than that of the previously extracted records.
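One way to picture this behavior is to treat the two columns as a single compound key and compare it with the last value seen. The following sketch is a conceptual model in plain Python, not AWS Glue's internal implementation. The sample rows are made up for illustration (only product_id 45 and version 2 come from the walkthrough above), and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, int]


def compound_key(row: Dict[str, int]) -> Key:
    """Combine the two bookmark columns into one comparable key."""
    return (row["product_id"], row["version"])


def extract_incremental(
    rows: List[Dict[str, int]], bookmark: Optional[Key]
) -> Tuple[List[Dict[str, int]], Optional[Key]]:
    """Return rows whose compound key is greater than the bookmark, plus the new bookmark."""
    new_rows = [r for r in rows if bookmark is None or compound_key(r) > bookmark]
    latest = max((compound_key(r) for r in new_rows), default=bookmark)
    return new_rows, latest


# Illustrative sample data (assumed, not the actual contents of the product table).
table = [{"product_id": 44, "version": 1}, {"product_id": 45, "version": 1}]

# First run: no bookmark yet, so every row is extracted.
first_batch, bookmark = extract_incremental(table, None)

# A new version of an existing product is inserted, then the job runs again.
table.append({"product_id": 45, "version": 2})
second_batch, bookmark = extract_incremental(table, bookmark)

print(second_batch)  # [{'product_id': 45, 'version': 2}]
```

In this model the pair (45, 2) sorts after (45, 1), so the second run picks up only the new record even though the product_id value was already seen.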

Run the parameterised_job_bookmark_keys AWS Glue job

We now run the parameterized AWS Glue job that takes the table name and bookmark key column as parameters. We run this job to extract data from different tables while maintaining separate bookmarks.

The first run is for the address table with bookmarkkey set to address_id. These values are already populated in the job parameters.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job parameterised_job_bookmark_keys.
  3. Choose Run to run the job and choose the Runs tab to monitor the job progress.
  4. Choose the Output logs link under CloudWatch logs after the job is complete.
  5. Choose the log stream in the next window to see all records from the address table printed.
  6. On the Actions menu, choose Run with parameters.
  7. Expand the Job parameters section.
  8. Change the job parameter values as follows:
    • Key --bookmarkkey with value product_id
    • Key --table_name with value product
    • The S3 bucket name is unchanged (job-bookmark-keys-demo-<accountid>)
  9. Choose Run job to run the job and choose the Runs tab to monitor the job progress.
  10. Choose the Output logs link under CloudWatch logs after the job is complete.
  11. Choose the log stream to see all the records from the product table printed.
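The same run with parameters can also be started programmatically. The following is a minimal sketch, assuming boto3 is installed and credentials are configured. The job name and the parameter keys --table_name and --bookmarkkey come from the steps above, and the helper names are hypothetical. The S3 bucket parameter is not set here because its key is not named in this post, on the assumption that it stays at the value defined on the job.

```python
from typing import Dict

JOB_NAME = "parameterised_job_bookmark_keys"


def build_run_arguments(table_name: str, bookmark_key: str) -> Dict[str, str]:
    """Build the per-run job arguments (hypothetical helper)."""
    return {"--table_name": table_name, "--bookmarkkey": bookmark_key}


def start_run(table_name: str, bookmark_key: str) -> str:
    """Start the Glue job with the given parameters and return the run ID."""
    import boto3  # imported lazily so the rest of the sketch runs without AWS access

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=JOB_NAME,
        Arguments=build_run_arguments(table_name, bookmark_key),
    )
    return response["JobRunId"]


arguments = build_run_arguments("product", "product_id")
print(arguments)

# To start the run for real (requires AWS credentials and the deployed stack):
# run_id = start_run("product", "product_id")
```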

The job maintains separate bookmarks for each of the tables when extracting the data from the source data store. This is achieved by adding the table name to the job name and transformation contexts in the AWS Glue job script.
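The naming idea can be sketched as follows. This is an illustration only: the exact naming scheme used in the job script is not shown in this post, so the function name bookmark_scope, the suffix format, and the context names are assumptions.

```python
from typing import Dict


def bookmark_scope(base_job_name: str, table_name: str) -> Dict[str, str]:
    """Derive table-specific names so each table keeps its own bookmark state (assumed scheme)."""
    return {
        "job_name": f"{base_job_name}_{table_name}",
        "read_ctx": f"read_{table_name}",
        "write_ctx": f"write_{table_name}",
    }


address_scope = bookmark_scope("parameterised_job_bookmark_keys", "address")
product_scope = bookmark_scope("parameterised_job_bookmark_keys", "product")
print(address_scope["job_name"])  # parameterised_job_bookmark_keys_address
print(product_scope["job_name"])  # parameterised_job_bookmark_keys_product

# In a Glue script, names like these could be used as follows (sketch, not the
# script from this post):
#
# job.init(scope["job_name"], args)
# frame = glueContext.create_dynamic_frame.from_catalog(
#     ..., transformation_ctx=scope["read_ctx"])
```

Because the derived names differ per table, a run for the address table does not advance or reuse the bookmark recorded for the product table.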

Clean up

To avoid incurring future charges, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Select the bucket with job-bookmark-keys in its name.
  3. Choose Empty to delete all the files and folders in it.
  4. On the CloudFormation console, choose Stacks in the navigation pane.
  5. Select the stack you created to deploy the solution and choose Delete.

Conclusion

This post demonstrated passing more than one column of a table as jobBookmarkKeys in a JDBC connection to an AWS Glue job. It also explained how you can use a parameterized AWS Glue job to extract data from multiple tables while keeping their respective bookmarks. As a next step, you can test the incremental data extract by changing data in the source tables.


About the Authors

Durga Prasad is a Sr Lead Consultant enabling customers to build their Data Analytics solutions on AWS. He’s a coffee lover and enjoys playing badminton.

Murali Reddy is a Lead Consultant at Amazon Web Services (AWS), helping customers build and implement data analytics solutions. When he’s not working, Murali is an avid bike rider and loves exploring new places.
