Simple ETL 101

The basic know-how for a data role is the ability to perform an ETL (extract, transform, load) process. It does not matter which tool you use, be it Spark or Excel. So, how about we journey through a simple ETL process using Spark and Scala?

Extract

I will be demonstrating this using the Boston housing dataset. Spark supports many file formats, and the most common for Spark is Parquet; here, however, we are using a CSV file.

[Screenshot: file extract]
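The original screenshot is gone, so here is a minimal sketch of what the extract step could look like; the file name and options are my assumptions, not the original code.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SimpleETL")
  .master("local[*]")
  .getOrCreate()

// Read the CSV without treating the first line as a header, so the real
// column names land in the first data row (handled in the Transform step).
// Every column therefore arrives as a plain string.
val housingDF = spark.read
  .option("header", "false")
  .csv("boston_housing.csv")
```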

This assumes you have some experience with Spark; please refer to A Newbie’s Guide to Big Data for details.

Before transformation begins, it helps to know what the data looks like, much like taking a quick look with a pivot table in Excel. Displaying the first three rows of the new DataFrame, we can see that the columns are not properly named.

[Screenshot: DataFrame view]
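A quick peek, assuming the housingDF DataFrame from the extract sketch above:

```scala
// With no header option set, Spark assigns generic names (_c0, _c1, ...),
// which is why the columns are not properly named yet.
housingDF.show(3)
```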

Transform

Three steps are taken to transform the data: .first() stores the first row (which holds the real column names), .alias() renames the columns, and .filter() takes out the first row.

[Screenshots: transform code and transformed output]
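In code, those three steps might look something like this sketch; it assumes the housingDF from above and reconstructs the logic from the description, not from the original screenshots.

```scala
import org.apache.spark.sql.functions.col

// 1. .first() stores the first row, which holds the real column names.
val header = housingDF.first()

// 2. .alias() renames each generic column (_c0, _c1, ...) to the
//    corresponding name taken from the header row.
val renamed = housingDF.select(
  housingDF.columns.zipWithIndex.map { case (c, i) =>
    col(c).alias(header.getString(i))
  }: _*
)

// 3. .filter() takes out the first row, leaving only the data rows.
val firstCol = header.getString(0)
val transformedDF = renamed.filter(col(firstCol) =!= firstCol)
```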

Load

Once the transformation is complete, we can load the transformed data for further analysis.

[Screenshot: load code]
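Again, the screenshot is gone; here is a minimal sketch of the load step, assuming a Parquet sink and an illustrative output path:

```scala
// Write the cleaned DataFrame out for further analysis.
// Parquet is Spark's preferred format, as noted in the Extract section.
transformedDF.write
  .mode("overwrite")
  .parquet("boston_housing_clean.parquet")
```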

This really is just meant to give you a simple introduction to ETL. So, the next time you are asked about it, just know it's not all that complex.

After loading the data, what next? Is it large? Do you want to partition it? One option is sketched below.
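A partitioned write is one way to handle a large dataset; CHAS (the Charles River dummy variable in the Boston housing data) is just an illustrative low-cardinality column to partition on.

```scala
// Partition the output on a low-cardinality column so downstream reads
// can skip files they do not need.
transformedDF.write
  .mode("overwrite")
  .partitionBy("CHAS")
  .parquet("boston_housing_partitioned.parquet")
```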

PS: Refer to my GitHub for the complete code. Let me know what you think.

 
