Building Your First ETL Pipeline with Bash


Image by Author | Midjourney & Canva

 

Introduction

 

ETL, or Extract, Transform, Load, is an essential data engineering process that involves extracting data from various sources, converting it into a workable form, and loading it into a destination such as a database. ETL pipelines automate this process, ensuring that data is handled consistently and efficiently, which provides a framework for tasks like data analysis, reporting, and machine learning, and ensures the data is clean, reliable, and ready to use.

Bash, short for Bourne-Again Shell (also known as the Unix shell), is a powerful tool for building ETL pipelines, thanks to its simplicity, flexibility, and very wide applicability, which makes it an excellent option for beginners and seasoned pros alike. Bash scripts can automate tasks, move files around, and talk to other tools on the command line, all of which makes it a good fit for ETL work. Moreover, Bash is ubiquitous on Unix-like systems (Linux, BSD, macOS, and so on), so it is ready to use on most such systems with no extra work on your part.

This article is intended for beginner and practitioner data scientists and data engineers who want to build their first ETL pipeline. It assumes a basic understanding of the command line and aims to provide a practical guide to creating an ETL pipeline using Bash.

The goal of this article is to guide readers through the process of building a basic ETL pipeline using Bash. By the end of the article, readers will have a working understanding of how to implement an ETL pipeline that extracts data from a source, transforms it, and loads it into a destination database.

 

Setting Up Your Environment

 

Before we begin, ensure you have the following:

  • A Unix-based system (Linux or macOS)
  • Bash shell (usually pre-installed on Unix systems)
  • Basic understanding of command-line operations

For our ETL pipeline, we will need these specific command-line tools:

  • curl: to fetch data from APIs and remote servers
  • jq: to parse and transform JSON data
  • awk: to process CSV and other column-oriented text
  • sed: to perform stream editing and text clean-up
  • sqlite3: to load data into a SQLite database

You can install them using your system's package manager. On a Debian-based system, you can use apt-get:

sudo apt-get install curl jq awk sed sqlite3

 

On macOS, you can use Homebrew (note that awk, sed, curl, and sqlite3 already ship with macOS, so in practice you may only need jq):

brew install curl jq awk sed sqlite3

 

Let's set up a dedicated directory for our ETL project. Open your terminal and run:

mkdir ~/etl_project
cd ~/etl_project

 

This creates a new directory called etl_project and navigates into it.

 

Extracting Data

 

Data can come from various sources such as APIs, CSV files, or databases. For this tutorial, we will demonstrate extracting data from a public API and a CSV file.

Let's use curl to fetch data from a public API. For example, we will extract data from a mock API that provides sample data.

# Fetching data from a public API
curl -o data.json "https://api.example.com/data"

 

This command will download the data and save it as data.json.

We can also use curl to download a CSV file from a remote server.

# Downloading a CSV file
curl -o data.csv "https://example.com/data.csv"

 

This will save the CSV file as data.csv in our working directory.
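Before moving on, it can be worth a quick sanity check on what was actually downloaded. A minimal sketch, assuming the two files above were fetched successfully:

# Peek at the downloaded files to confirm the extraction worked
head -n 5 data.csv                # show the first five rows of the CSV
jq '.' data.json | head -n 20     # pretty-print the start of the JSON payload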

 

Transforming Data

 

Data transformation is necessary to convert raw data into a format suitable for analysis or storage. This can involve parsing JSON, filtering CSV files, or cleaning text data.

jq is a powerful tool for working with JSON data. Let's use it to extract specific fields from our JSON file.

# Parsing and extracting specific fields from JSON
jq '.data[] | {id, name, value}' data.json > transformed_data.json

 

This command extracts the id, name, and value fields from each entry in the JSON data and saves the result in transformed_data.json.
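Since our destination later in this tutorial is a CSV import into SQLite, it can also be handy to have jq emit CSV directly. A minimal sketch, assuming each record has the id, name, and value fields used above, and writing to a separate file (transformed_data_from_json.csv, a name chosen here) so it does not collide with the CSV pipeline below:

# Converting the extracted JSON records to CSV, one row per record
jq -r '.data[] | [.id, .name, .value] | @csv' data.json > transformed_data_from_json.csv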

awk is a versatile tool for processing CSV files. We will use it to extract specific columns from our CSV file.

# Extracting specific columns from CSV (keep the output comma-separated)
awk -F, -v OFS=',' '{print $1, $3}' data.csv > transformed_data.csv

 

This command extracts the first and third columns from data.csv and saves them, comma-separated, in transformed_data.csv.
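Real-world CSV files often have a header row, or rows you want to filter out, and awk handles both easily. A minimal sketch, assuming data.csv has a header row and that we only want rows whose third column is non-empty:

# Skip the header row (NR > 1) and drop rows with an empty third column
awk -F, -v OFS=',' 'NR > 1 && $3 != "" {print $1, $3}' data.csv > transformed_data.csv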

sed is a stream editor for filtering and transforming text. We can use it to perform text replacements and clean up our data.

# Replacing text in a file
sed 's/old_text/new_text/g' transformed_data.csv

 

This command replaces every occurrence of old_text with new_text in transformed_data.csv and writes the result to standard output; the file itself is left unchanged unless you redirect the output or edit in place.
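To modify the file itself, use sed's in-place option; note that the syntax differs slightly between GNU sed (Linux) and BSD sed (macOS):

# GNU sed (Linux): edit the file in place
sed -i 's/old_text/new_text/g' transformed_data.csv

# BSD sed (macOS): an explicit (here empty) backup suffix is required
sed -i '' 's/old_text/new_text/g' transformed_data.csv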

 

Loading Data

 

Common destinations for loading data include databases and files. For this tutorial, we will use SQLite, a commonly used lightweight database.

First, let's create a new SQLite database and a table to hold our data.

# Creating a new SQLite database and table
sqlite3 etl_database.db "CREATE TABLE data (id INTEGER PRIMARY KEY, name TEXT, value REAL);"

 

This command creates a database file named etl_database.db and a table named data with three columns.

Next, we will insert our transformed data into the SQLite database.

# Inserting data into the SQLite database
sqlite3 etl_database.db <<EOF
.mode csv
.import transformed_data.csv data
EOF

 

This block of commands sets the mode to CSV and imports transformed_data.csv into the data table.
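Note that .import loads the file verbatim: the number of columns in transformed_data.csv must match the data table, and a header row, if present, would be imported as a data row. A minimal sketch for stripping a header before importing:

# Drop the header row, if the CSV has one, before importing
tail -n +2 transformed_data.csv > transformed_data_noheader.csv

sqlite3 etl_database.db <<EOF
.mode csv
.import transformed_data_noheader.csv data
EOF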

We can verify that the data has been inserted correctly by querying the database.

# Querying the database
sqlite3 etl_database.db "SELECT * FROM data;"

 

This command retrieves all rows from the data table and displays them.
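Putting the three stages together, the whole pipeline can live in a single script that you run in one go. The following is a minimal sketch built from the commands in this tutorial; the URLs are the same placeholders as above, the script name etl.sh is just a suggestion, and the awk step keeps three columns so the import matches the data table's schema:

#!/usr/bin/env bash
# etl.sh -- minimal extract-transform-load sketch
set -euo pipefail

# Extract: fetch the raw JSON and CSV sources (placeholder URLs)
curl -fsSL -o data.json "https://api.example.com/data"
curl -fsSL -o data.csv  "https://example.com/data.csv"

# Transform: pull selected fields out of each source
jq '.data[] | {id, name, value}' data.json > transformed_data.json
awk -F, -v OFS=',' '{print $1, $2, $3}' data.csv > transformed_data.csv

# Load: create the table if needed and import the transformed CSV
sqlite3 etl_database.db "CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY, name TEXT, value REAL);"
sqlite3 etl_database.db <<EOF
.mode csv
.import transformed_data.csv data
EOF

echo "ETL run finished: $(sqlite3 etl_database.db 'SELECT COUNT(*) FROM data;') rows loaded."

Save this as etl.sh in the project directory and run it with bash etl.sh (or chmod +x etl.sh && ./etl.sh).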

 

Final Thoughts

 

We have covered the following steps while building our ETL pipeline with Bash:

  1. Environment setup and tool installation
  2. Data extraction from a public API and a CSV file with curl
  3. Data transformation using jq, awk, and sed
  4. Data loading into an SQLite database with sqlite3

Bash is a good choice for ETL thanks to its simplicity, flexibility, automation capabilities, and interoperability with other CLI tools.

For further exploration, consider incorporating error handling, scheduling the pipeline via cron, or learning more advanced Bash concepts; the sketch below shows a starting point for the first two. You may also want to explore alternative transformation tools and techniques to broaden your pipeline skill set.
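As a concrete starting point, set -euo pipefail makes a script stop at the first failure, and a crontab entry can run it on a schedule. A minimal sketch, assuming the pipeline script from the previous section is saved as ~/etl_project/etl.sh:

# At the top of etl.sh: exit on errors, unset variables, and failed pipeline stages
set -euo pipefail

# Example crontab entry (add it with `crontab -e`): run the pipeline daily at 02:00
# and append all output to a log file
# 0 2 * * * $HOME/etl_project/etl.sh >> $HOME/etl_project/etl.log 2>&1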

Try out your own ETL projects, putting what you have learned to the test in more elaborate scenarios. With any luck, the fundamentals covered here will be a jumping-off point for more complex data engineering tasks.
 
 

Matthew Mayo (@mattmayo13) holds a Master's degree in computer science and a graduate diploma in data mining. As Managing Editor, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.


