Scale AWS Glue jobs by optimizing IP address consumption and expanding network capacity using a private NAT gateway


As businesses grow, the demand for IP addresses within the corporate network often exceeds the supply. An organization's network is usually designed with some anticipation of future requirements, but as enterprises evolve, their information technology (IT) needs can surpass the originally designed network. Companies may then find themselves challenged to manage a limited pool of IP addresses.

For data engineering workloads, when AWS Glue is used in such a constrained network configuration, your team may sometimes face hurdles running many jobs concurrently. This happens because you may not have enough IP addresses to support the required connections to databases. To overcome this shortage, the team may obtain more IP addresses from your corporate network pool. These obtained IP addresses can be unique (non-overlapping) or overlapping, when the IP addresses are reused elsewhere in your corporate network.

When you use overlapping IP addresses, you need additional network management to establish connectivity. Networking solutions can include options such as private Network Address Translation (NAT) gateways, AWS PrivateLink, or self-managed NAT appliances to translate IP addresses.

In this post, we discuss two strategies to scale AWS Glue jobs:

  1. Optimizing IP address consumption by right-sizing Data Processing Units (DPUs), using the AWS Glue Auto Scaling feature, and fine-tuning the jobs.
  2. Expanding the network capacity using an additional non-routable Classless Inter-Domain Routing (CIDR) range with a private NAT gateway.

Before we dive deep into these solutions, let's understand how AWS Glue uses Elastic Network Interfaces (ENIs) to establish connectivity. To enable access to data stores inside a VPC, you need to create an AWS Glue connection that is attached to your VPC. When an AWS Glue job runs in your VPC, the job creates an ENI inside the configured VPC for each data connection, and that ENI uses an IP address in the specified VPC. These ENIs are short-lived and active only until the job is complete.
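For reference, the following is a minimal boto3 sketch of creating such a VPC-attached connection; the connection name, JDBC URL, subnet, security group, and Availability Zone are hypothetical placeholders, not values from this walkthrough:

import boto3

glue = boto3.client("glue")

# Create a JDBC connection attached to a VPC subnet. Glue places one ENI per
# data connection into this subnet when a job using the connection runs.
glue.create_connection(
    ConnectionInput={
        "Name": "mysql-vpc-connection",  # hypothetical name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://db.example.internal:3306/srcdb",
            "USERNAME": "admin",
            "PASSWORD": "<your-password>",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)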

Now let's look at the first solution, which explains how to optimize AWS Glue IP address consumption.

Strategies for efficient IP address consumption

In AWS Glue, the number of workers a job uses determines the count of IP addresses used from your VPC subnet. This is because each worker requires one IP address that maps to one ENI. When you don't have a sufficient CIDR range allocated to the AWS Glue subnet, you may observe IP address exhaustion errors. The following are some best practices to optimize AWS Glue IP address consumption:

  • Right-sizing the job's DPUs – AWS Glue is a distributed processing engine. It works efficiently when it can run tasks in parallel. If a job has more than the required DPUs, it doesn't always run faster. So, finding the right number of DPUs will make sure you use IP addresses optimally. By building observability into the system and analyzing job performance, you can get insights into ENI consumption trends and then configure the appropriate capacity on the job for the right size. For more details, refer to Monitoring for DPU capacity planning. The Spark UI is a helpful tool to monitor AWS Glue jobs' worker utilization. For more details, refer to Monitoring jobs using the Apache Spark web UI.
  • AWS Glue Auto Scaling – It's often difficult to predict a job's capacity requirements upfront. Enabling the AWS Glue Auto Scaling feature offloads some of this responsibility to AWS. At runtime, based on the workload requirements, the job automatically scales worker nodes up to the defined maximum configuration. If there is no additional need, AWS Glue doesn't overprovision workers, thereby saving resources and reducing cost. The Auto Scaling feature is available in AWS Glue 3.0 and later (see the sketch after this list). For more information, refer to Introducing AWS Glue Auto Scaling: Automatically resize serverless computing resources for lower cost with optimized Apache Spark.
  • Job-level optimization – Identify job-level optimizations by using AWS Glue job metrics, and apply best practices from Best practices for performance tuning AWS Glue for Apache Spark jobs.
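As a concrete illustration of the Auto Scaling option, the following is a minimal boto3 sketch of defining a job with a worker ceiling and the --enable-auto-scaling job parameter; the job name, role, and script location are hypothetical:

import boto3

glue = boto3.client("glue")

# Define a job with an upper bound on workers. With auto scaling enabled,
# Glue provisions only the workers (and therefore ENIs and IP addresses)
# that the workload actually needs, up to NumberOfWorkers.
glue.create_job(
    Name="sample-autoscaling-job",  # hypothetical job name
    Role="GlueJobRole",             # hypothetical IAM role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=10,             # ceiling, not fixed capacity
    DefaultArguments={"--enable-auto-scaling": "true"},
)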

Next, let's look at the second solution, which elaborates on network capacity expansion.

Solutions for network size (IP address) expansion

In this section, we discuss two possible solutions to expand network size in more detail.

Expand VPC CIDR ranges with routable addresses

One solution is to add more private IPv4 CIDR ranges from RFC 1918 to your VPC. Theoretically, each AWS account can be assigned some or all of these IP address CIDRs. Your IP Address Management (IPAM) team often manages the allocation of IP addresses that each business unit can use from RFC 1918 to avoid overlapping IP addresses across multiple AWS accounts or business units. If your current routable IP address quota allocated by the IPAM team isn't sufficient, you can request more.

If your IPAM team issues you an additional non-overlapping CIDR range, you can either add it as a secondary CIDR to your existing VPC or create a new VPC with it. If you're planning to create a new VPC, you can interconnect the VPCs via VPC peering or AWS Transit Gateway.
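The secondary-CIDR route is a one-line API call; the following is a minimal boto3 sketch, with a placeholder VPC ID and an example range standing in for whatever your IPAM team issues:

import boto3

ec2 = boto3.client("ec2")

# Attach an additional non-overlapping range, issued by the IPAM team,
# as a secondary CIDR on the existing VPC.
ec2.associate_vpc_cidr_block(
    VpcId="vpc-0123456789abcdef0",  # placeholder VPC ID
    CidrBlock="172.34.0.0/24",      # example range from your IPAM team
)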

If this additional capacity is sufficient to run all your jobs within the defined timeframe, then it's a simple and cost-effective solution. Otherwise, you can consider adopting overlapping IP addresses with a private NAT gateway, as described in the following section. With that solution, you must use Transit Gateway to connect the VPCs, because VPC peering isn't possible when two VPCs have overlapping CIDR ranges.

Configure a non-routable CIDR with a private NAT gateway

As described in the AWS whitepaper Building a Scalable and Secure Multi-VPC AWS Network Infrastructure, you can expand your network capacity by creating a non-routable IP address subnet and using a private NAT gateway that is placed in a routable IP address space (non-overlapping) to route traffic. A private NAT gateway translates and routes traffic between non-routable and routable IP addresses. The following diagram demonstrates the solution with respect to AWS Glue.

High level architecture

As you can see in the preceding diagram, VPC A (ETL) has two CIDR ranges attached. The smaller CIDR range, 172.33.0.0/24, is routable because it is not reused anywhere, whereas the larger CIDR range, 100.64.0.0/16, is non-routable because it is reused in the database VPC.

In VPC B (Database), we have hosted two databases in the routable subnets 172.30.0.0/26 and 172.30.0.64/26. These two subnets are in two separate Availability Zones for high availability. We also have two additional unused subnets, 100.64.0.0/24 and 100.64.1.0/24, to simulate a non-routable setup.

You can choose the size of the non-routable CIDR range based on your capacity requirements. Because you can reuse IP addresses, you can create a very large subnet as needed. For example, a CIDR mask of /16 gives you approximately 65,000 IPv4 addresses. Work with your network engineering team to size the subnets.
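To sanity-check that sizing arithmetic, you can compute the address count with Python's standard ipaddress module:

import ipaddress

# A /16 from the 100.64.0.0/10 shared address space (RFC 6598) yields
# 65,536 addresses (2^16). AWS reserves five addresses per subnet, so the
# usable count per subnet is slightly lower.
net = ipaddress.ip_network("100.64.0.0/16")
print(net.num_addresses)  # 65536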

In short, you can configure AWS Glue jobs to use both routable and non-routable subnets in your VPC to maximize the available IP address pool.

Now let's understand how Glue ENIs that are in a non-routable subnet communicate with data sources in another VPC.

Call flow

The data flow for the use case demonstrated here is as follows (referring to the numbered steps in the preceding figure):

  1. When an AWS Glue job needs to access a data source, it first uses the AWS Glue connection on the job and creates the ENIs in the non-routable subnet 100.64.0.0/24 in VPC A. Then AWS Glue uses the database connection configuration and attempts to connect to the database in VPC B, 172.30.0.0/24.
  2. As per the route table VPCA-Non-Routable-RouteTable, the destination 172.30.0.0/24 is configured for a private NAT gateway. The request is sent to the NAT gateway, which then translates the source IP address from a non-routable to a routable IP address. Traffic is then sent to the transit gateway attachment in VPC A because it's associated with the VPCA-Routable-RouteTable route table in VPC A. (See the route configuration sketch after this list.)
  3. Transit Gateway uses the 172.30.0.0/24 route and sends the traffic to the VPC B transit gateway attachment.
  4. The transit gateway ENI in VPC B uses VPC B's local route to connect to the database endpoint and query the data.
  5. When the query is complete, the response is sent back to VPC A. The response traffic is routed to the transit gateway attachment in VPC B, then Transit Gateway uses the 172.33.0.0/24 route and sends the traffic to the VPC A transit gateway attachment.
  6. The transit gateway ENI in VPC A uses the local route to forward the traffic to the private NAT gateway, which translates the destination IP address to that of the ENIs in the non-routable subnet.
  7. Finally, the AWS Glue job receives the data and continues processing.
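The following boto3 sketch shows the shape of this routing configuration: a private NAT gateway in the routable subnet of VPC A, plus the two routes from steps 2 and 3. All resource IDs are placeholders; in this walkthrough, the equivalent resources are created for you by the CloudFormation stack.

import boto3

ec2 = boto3.client("ec2")

# Private NAT gateway in VPC A's routable subnet (no Elastic IP needed;
# it translates non-routable source IPs to its own routable address).
natgw = ec2.create_nat_gateway(
    SubnetId="subnet-0aaaaaaaaaaaaaaaa",  # placeholder: routable subnet in VPC A
    ConnectivityType="private",
)["NatGateway"]

# NAT gateway creation is asynchronous; wait until it is available
# before referencing it in a route.
ec2.get_waiter("nat_gateway_available").wait(
    NatGatewayIds=[natgw["NatGatewayId"]]
)

# Non-routable route table: send database-bound traffic (VPC B's routable
# CIDR) through the private NAT gateway for source translation.
ec2.create_route(
    RouteTableId="rtb-0bbbbbbbbbbbbbbbb",  # placeholder: VPCA-Non-Routable-RouteTable
    DestinationCidrBlock="172.30.0.0/24",
    NatGatewayId=natgw["NatGatewayId"],
)

# Routable route table: forward the translated traffic to the transit
# gateway that interconnects VPC A and VPC B.
ec2.create_route(
    RouteTableId="rtb-0ccccccccccccccccc",  # placeholder: VPCA-Routable-RouteTable
    DestinationCidrBlock="172.30.0.0/24",
    TransitGatewayId="tgw-0123456789abcdef0",
)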

The private NAT gateway solution is an option if you need extra IP addresses and can't obtain them from a routable network in your organization. As with every additional service, there is an additional cost incurred, and this trade-off may be necessary to meet your goals. Refer to the NAT gateway pricing section on the Amazon VPC pricing page for more information.

Prerequisites

To complete the walkthrough of the private NAT gateway solution, you need the following:

Deploy the solution

To implement the solution, complete the following steps:

  1. Sign in to your AWS Management Console.
  2. Deploy the solution by choosing Launch stack. This stack defaults to us-east-1; you can select your desired Region.
  3. Choose Next, then specify the stack details. You can retain the input parameters at the prepopulated default values or change them as needed.
  4. For DatabaseUserPassword, enter an alphanumeric password of your choice and make sure to note it down for later use.
  5. For S3BucketName, enter a unique Amazon Simple Storage Service (Amazon S3) bucket name. This bucket stores the AWS Glue job script that will be copied from an AWS public code repository.
Stack details
  6. Choose Next.
  7. Leave the default values and choose Next again.
  8. Review the details, acknowledge the creation of IAM resources, and choose Submit to start the deployment.

You can monitor the events to see resources being created on the AWS CloudFormation console. It may take around 20 minutes for the stack resources to be created.

After the stack creation is complete, go to the Outputs tab on the AWS CloudFormation console and note the following values for later use:

  • DBSource
  • DBTarget
  • SourceCrawler
  • TargetCrawler

Connect to an AWS Cloud9 instance

Next, we need to prepare the source and target Amazon RDS for MySQL tables using an AWS Cloud9 instance. Complete the following steps:

  1. On the AWS Cloud9 console page, locate the aws-glue-cloud9 environment.
  2. In the Cloud9 IDE column, choose Open to launch your AWS Cloud9 instance in a new web browser.

Prepare the source MySQL table

Complete the following steps to prepare your source table:

  1. From the AWS Cloud9 terminal, install the MySQL client using the following command: sudo yum update -y && sudo yum install -y mysql
  2. Connect to the source database using the following command. Replace the source hostname with the DBSource value you captured earlier. When prompted, enter the database password that you specified during the stack creation. mysql -h <Source Hostname> -P 3306 -u admin -p
  3. Run the following scripts to create the source emp table and load the test data:
    -- Connect to the source database
    USE srcdb;
    -- Drop the emp table if it exists
    DROP TABLE IF EXISTS emp;
    -- Create the emp table
    CREATE TABLE emp (empid INT AUTO_INCREMENT,
                      ename VARCHAR(100) NOT NULL,
                      edept VARCHAR(100) NOT NULL,
                      PRIMARY KEY (empid));
    -- Create a stored procedure to load sample records into the emp table
    DELIMITER $$
    CREATE PROCEDURE sp_load_emp_source_data()
    BEGIN
    DECLARE empid INT;
    DECLARE ename VARCHAR(100);
    DECLARE edept VARCHAR(50);
    DECLARE cnt INT DEFAULT 1; -- Initialize counter to 1 to auto-increment the PK
    DECLARE rec_count INT DEFAULT 1000; -- Number of sample records to load
    TRUNCATE TABLE emp; -- Truncate the emp table
    WHILE cnt <= rec_count DO -- Loop and load the required number of sample records
    SET ename = CONCAT('Employee_', FLOOR(RAND() * 100) + 1); -- Generate a random employee name
    SET edept = CONCAT('Dept_', FLOOR(RAND() * 100) + 1); -- Generate a random employee department
    -- Insert a record with an auto-incrementing empid
    INSERT INTO emp (ename, edept) VALUES (ename, edept);
    -- Increment the counter for the next record
    SET cnt = cnt + 1;
    END WHILE;
    COMMIT;
    END$$
    DELIMITER ;
    -- Call the stored procedure to load sample records into the emp table
    CALL sp_load_emp_source_data();
  4. Check the source emp table's record count using the following SQL query (you need this in a later step for verification): select count(*) from emp;
  5. Run the following command to exit the MySQL client utility and return to the AWS Cloud9 instance's terminal: quit;

Prepare the target MySQL table

Complete the following steps to prepare the target table:

  1. Connect to the target database using the following command. Replace the target hostname with the DBTarget value you captured earlier. When prompted, enter the database password that you specified during the stack creation. mysql -h <Target Hostname> -P 3306 -u admin -p
  2. Run the following scripts to create the target emp table. This table will be loaded by the AWS Glue job in a subsequent step.
    -- Connect to the target database
    USE targetdb;
    -- Drop the emp table if it exists
    DROP TABLE IF EXISTS emp;
    -- Create the emp table
    CREATE TABLE emp (empid INT AUTO_INCREMENT,
                      ename VARCHAR(100) NOT NULL,
                      edept VARCHAR(100) NOT NULL,
                      PRIMARY KEY (empid)
    );

Verify the networking setup (optional)

The following steps are helpful to understand the NAT gateway, route table, and transit gateway configurations of the private NAT gateway solution. These components were created during the CloudFormation stack creation.

  1. On the Amazon VPC console page, navigate to the Virtual private cloud section and locate NAT gateways.
  2. Search for the NAT gateway named Glue-OverlappingCIDR-NATGW and explore it further. As you can see in the following screenshot, the NAT gateway was created in VPC A (ETL) in the routable subnet.
NAT Gateway setup
  3. In the navigation pane, navigate to Route tables under the Virtual private cloud section.
  4. Search for VPCA-Non-Routable-RouteTable and explore it further. You can see that the route table is configured to translate traffic from the overlapping CIDR using the NAT gateway.
Route table setup
  5. In the navigation pane, navigate to the Transit gateways section and choose Transit gateway attachments. Enter VPC- in the search box and locate the two newly created transit gateway attachments.
  6. You can explore these attachments further to learn their configurations.

Run the AWS Glue crawlers

Complete the following steps to run the AWS Glue crawlers that are required to catalog the source and target emp tables. This is a prerequisite step for running the AWS Glue job.

  1. On the AWS Glue console page, under the Data Catalog section in the navigation pane, choose Crawlers.
  2. Locate the source and target crawlers that you noted earlier.
  3. Select these crawlers and choose Run to create the respective AWS Glue Data Catalog tables (or start them programmatically, as shown in the sketch after this list).
  4. You can monitor the AWS Glue crawlers for successful completion. It may take around 3–4 minutes for both crawlers to complete. When they're done, the last run status changes to Succeeded, and you can also see that two AWS Glue Data Catalog tables were created from this run.
Crawler run successful
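If you'd rather script this step, the following is a minimal boto3 sketch that starts both crawlers and waits for them to finish; substitute the crawler names from the CloudFormation outputs you noted earlier:

import boto3
import time

glue = boto3.client("glue")

# Crawler names come from the SourceCrawler and TargetCrawler stack outputs.
crawlers = ["<SourceCrawler>", "<TargetCrawler>"]

# Start both crawlers, then poll until each returns to the READY state.
for name in crawlers:
    glue.start_crawler(Name=name)

while any(
    glue.get_crawler(Name=name)["Crawler"]["State"] != "READY"
    for name in crawlers
):
    time.sleep(30)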

Run the AWS Glue ETL job

After you set up the tables and complete the prerequisite steps, you are now ready to run the AWS Glue job that you created using the CloudFormation template. This job connects to the source RDS for MySQL database, extracts the data from the source emp table, and loads it into the target RDS for MySQL database using the private NAT gateway solution. To run the AWS Glue job, complete the following steps:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Choose the job glue-private-nat-job.
  3. Choose Run to start it (or start it programmatically, as shown in the following sketch).
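For an automated run, the following is a minimal boto3 sketch that starts the job and checks its status:

import boto3

glue = boto3.client("glue")

# Start the job created by the CloudFormation template and capture the
# run ID so the status can be polled with get_job_run.
run = glue.start_job_run(JobName="glue-private-nat-job")
state = glue.get_job_run(
    JobName="glue-private-nat-job",
    RunId=run["JobRunId"],
)["JobRun"]["JobRunState"]
print(state)  # e.g., RUNNING, then SUCCEEDED on completion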

The following is the PySpark script for this ETL job:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node AWS Glue Data Catalog (read from source)
AWSGlueDataCatalog_node = glueContext.create_dynamic_frame.from_catalog(
    database="glue_cat_db_source",
    table_name="srcdb_emp",
    transformation_ctx="AWSGlueDataCatalog_node",
)

# Script generated for node Change Schema
ChangeSchema_node = ApplyMapping.apply(
    frame=AWSGlueDataCatalog_node,
    mappings=[
        ("empid", "int", "empid", "int"),
        ("ename", "string", "ename", "string"),
        ("edept", "string", "edept", "string"),
    ],
    transformation_ctx="ChangeSchema_node",
)

# Script generated for node AWS Glue Data Catalog (write to target)
AWSGlueDataCatalog_node = glueContext.write_dynamic_frame.from_catalog(
    frame=ChangeSchema_node,
    database="glue_cat_db_target",
    table_name="targetdb_emp",
    transformation_ctx="AWSGlueDataCatalog_node",
)

job.commit()

Based on the job's DPU configuration, AWS Glue creates a set of ENIs in the non-routable subnet that is configured on the AWS Glue connection. You can monitor these ENIs on the Network Interfaces page of the Amazon Elastic Compute Cloud (Amazon EC2) console.
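To list these ENIs from code instead of the console, a simple boto3 sketch such as the following works; the subnet ID is a placeholder for the non-routable subnet in VPC A:

import boto3

ec2 = boto3.client("ec2")

# List the ENIs in the non-routable subnet; while the job runs, each
# in-use interface corresponds to one Glue worker's IP address.
enis = ec2.describe_network_interfaces(
    Filters=[{"Name": "subnet-id", "Values": ["subnet-0ddddddddddddddddd"]}]
)["NetworkInterfaces"]
print(f"{len(enis)} ENIs in the subnet")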

The following screenshot shows the 10 ENIs that were created for the job run to match the requested number of workers configured in the job parameters. As expected, the ENIs were created in the non-routable subnet of VPC A, enabling scalability of IP addresses. After the job is complete, these ENIs are automatically released by AWS Glue.
Execution ENIs

While the AWS Glue job is running, you can monitor its status. Upon successful completion, the job's status changes to Succeeded.
Job successful completion

Verify the results

After the AWS Glue job is complete, connect to the target MySQL database and verify that the target record count matches the source. You can use the following SQL query in the AWS Cloud9 terminal:

USE targetdb;
SELECT count(*) FROM emp;

Finally, exit the MySQL client utility using the following command and return to the AWS Cloud9 terminal: quit;

You can now confirm that AWS Glue has successfully completed a job that loads data into a target database using IP addresses from a non-routable subnet. This concludes the end-to-end testing of the private NAT gateway solution.

Clean up

To avoid incurring future charges, delete the resources created via the CloudFormation stack by completing the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the stack AWSGluePrivateNATStack.
  3. Choose Delete to delete the stack. When prompted, confirm the stack deletion.

Conclusion

In this post, we demonstrated how you can scale AWS Glue jobs by optimizing IP address consumption and expanding your network capacity using a private NAT gateway solution. This two-fold approach helps you get unblocked in an environment that has IP address capacity constraints. The options discussed in the AWS Glue IP address optimization section are complementary to the IP address expansion solutions, and you can iteratively build on them to mature your data platform.

Learn more about AWS Glue job optimization techniques from Monitor and optimize cost on AWS Glue for Apache Spark and Best practices to scale Apache Spark jobs and partition data with AWS Glue.


About the authors

Sushanth Kothapally is a Solutions Architect at Amazon Web Services supporting Automotive and Manufacturing customers. He is passionate about designing technology solutions to meet business goals and has a keen interest in serverless and event-driven architectures.

Senthil Kamala Rathinam is a Solutions Architect at Amazon Web Services specializing in Data and Analytics. He is passionate about helping customers design and build modern data platforms. In his free time, Senthil likes to spend time with his family and play badminton.
