Data Warehousing in AWS part 2

Muhammad Ariq Naufal
Published in Analytics Vidhya
5 min read · Feb 22, 2021


Hello again! I will continue my previous article about Data Warehousing in AWS. So let's begin 😊

Before you design your database and load data into Redshift, you must first identify where the data will come from. There are many steps before you load the data into Redshift.

Data comes in different formats: some data is structured, and some is unstructured. Amazon provides many types of services to help you load your data.

In the previous article we talked about the data mart. A data mart has some similarities with a data lake, but they are different things. So what is a data lake?

Data lakes are primarily driven by the need to collect as much data as possible without prior processing, understanding of need, or aggregation, so that the data is flexible for later analysis.

So what is the difference? A data mart serves a specific business area with data that has already been processed and aggregated, while a data lake keeps the raw data so it stays flexible for later analysis.

In this article I will only talk about how to load data into Redshift from Amazon S3.

Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers. Amazon S3 provides a simple web-services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. So you can use Amazon S3 as your data lake in AWS and load data from it into other services such as Amazon Redshift.

The integration between Amazon S3 and Amazon Redshift enables you to:

  • Load data directly from your bucket into your cluster.
  • Load multiple data files in parallel.
  • Unload data from your cluster to a bucket.

So you can load your data from S3 into Redshift using a simple COPY command.
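
For example, a minimal COPY from an S3 prefix might look like this; the table, bucket, and IAM role here are placeholders you would replace with your own:

    -- Load every file under the given S3 prefix into the sales table, in parallel
    COPY sales
    FROM 's3://my-data-lake/sales/2021/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';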

AWS also lets you do some ETL processing before your data is loaded into Redshift, using AWS Glue.

You can run a serverless, event-driven ETL pipeline to Redshift with AWS Glue and AWS Lambda. As soon as new data becomes available in Amazon S3, you can run your AWS Glue ETL jobs by calling the Glue APIs from a Lambda function. You can also register the new dataset in the AWS Glue Data Catalog as part of your ETL jobs.

There are three main components in AWS Glue: the Data Catalog (a central metadata repository), an ETL engine that generates Python or Scala code, and a flexible scheduler that handles job runs, dependencies, and retries.

Preparing Data

Preparing your data is important. Here are some tips for preparing data to load into Amazon Redshift; you can see the detailed information here.

You can also use a manifest file if you split your data into multiple files. A manifest file is a JSON-formatted text file that lists each of the files to load when you run the COPY command.
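
A manifest is just a JSON file that points at the S3 objects to load; for example (bucket and file names are made up):

    {
      "entries": [
        {"url": "s3://my-data-lake/sales/part-0000.gz", "mandatory": true},
        {"url": "s3://my-data-lake/sales/part-0001.gz", "mandatory": true}
      ]
    }

You then point COPY at the manifest and add the MANIFEST keyword:

    COPY sales
    FROM 's3://my-data-lake/manifests/sales.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    GZIP
    MANIFEST;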

There are also some options for data that needs special treatment:

  • If your data is character-delimited, use the DELIMITER option to tell COPY which character separates the fields.
  • If your data is laid out in fixed-width columns, use the FIXEDWIDTH option to give each column's width.
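
Hedged sketches of both cases, with placeholder table, bucket, and role names:

    -- Character-delimited data (here: pipe-delimited)
    COPY event
    FROM 's3://my-data-lake/event/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    DELIMITER '|';

    -- Fixed-width data: each column gets a label and a width in characters
    COPY venue
    FROM 's3://my-data-lake/venue_fw/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FIXEDWIDTH 'venueid:3,venuename:25,venuecity:12,venuestate:2,venueseats:6';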

Also, Redshift does not enforce primary and foreign key constraints. You can still declare them, and the query planner uses them as hints, but keeping the data consistent is up to you.
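
A small sketch of what that means in practice (table and column names are made up):

    -- The constraints below are informational only: Redshift will not reject
    -- duplicate order_id values or orders pointing at missing customers.
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers (customer_id),
        order_date  DATE
    );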

Loading Data Using the COPY Command

The COPY command is the recommended way to load data from data sources into Redshift. You can see detailed information about the COPY command here.

The required COPY parameters are the target table name, the data source (the FROM clause), and authorization to access the data (for example, an IAM role).
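
In skeleton form (angle brackets mark the parts you fill in):

    COPY <target_table>
    FROM '<data_source>'        -- e.g. an S3 prefix, a manifest file, or a single object
    IAM_ROLE '<iam_role_arn>';  -- authorization to read the source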

Loading Data from a CSV File

CSV is a common format used for importing and exporting data in spreadsheets. With CSV files, you can:

  • Skip header rows by using the IGNOREHEADER option and specifying the number of rows to ignore.
  • Use a column list (or a JSONPaths file) to specify the mapping if you don't want to load all columns.
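
For example, a COPY command for loading a CSV file with one header row might look like this (table, bucket, and role names are placeholders):

    COPY sales
    FROM 's3://my-data-lake/sales/sales.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV               -- parse as CSV, including quoted fields
    IGNOREHEADER 1;   -- skip the header row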

Loading Data from a JSON File

Redshift uses a JSONPaths file to parse JSON-formatted source data.

  • The JSONPaths file contains one or more JSONPath expressions.
  • Each JSONPath expression corresponds to a column in the target table in Redshift.

You can specify the column order using a column list in the COPY command.
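
As a sketch, a JSONPaths file that maps two fields might look like this (the field and table names are made up):

    {
      "jsonpaths": [
        "$.customer.id",
        "$.order_total"
      ]
    }

and the COPY command points at it with the JSON option, using a column list to match the order of the expressions:

    COPY orders (customer_id, order_total)
    FROM 's3://my-data-lake/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    JSON 's3://my-data-lake/jsonpaths/orders_jsonpaths.json';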

There are many data conversion parameters in the COPY command; you can find the detailed information here.
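
A few commonly used ones, shown as an illustrative sketch:

    COPY orders
    FROM 's3://my-data-lake/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    DATEFORMAT 'YYYY-MM-DD'   -- how date strings are written in the source
    TIMEFORMAT 'auto'         -- let Redshift detect timestamp formats
    NULL AS 'NULL'            -- treat this literal string as a null value
    BLANKSASNULL              -- load blank fields as NULL
    EMPTYASNULL;              -- load empty strings as NULL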

Maintaining Tables

Maintaining tables is important to make sure they keep performing correctly. You can use the ANALYZE and VACUUM commands to help you maintain them.

The ANALYZE command obtains a sample of rows, does some calculations, and saves the resulting column statistics for the query planner.

Use the VACUUM command to free up storage and improve performance, because Redshift doesn't automatically reclaim space from deleted rows.
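
For example (the table name is a placeholder):

    -- Refresh optimizer statistics for one table; omit the name to analyze the whole database
    ANALYZE sales;

    -- Reclaim space from deleted rows and re-sort the table
    VACUUM sales;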

If you have problems validating data or troubleshooting data loads, the following system tables help you validate data and troubleshoot issues during a load:

  • STL_LOAD_ERRORS contains errors that occurred during a specific load.
  • STL_FILE_SCAN provides load times for specific files.
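
For example, a quick look at the most recent load errors (the column choice here is just illustrative):

    -- Which file failed, at which line and column, and why
    SELECT starttime, filename, line_number, colname, err_reason
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 10;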

Finally, here are some best-practice tips for loading data:

  • Use a single COPY command per target table and let it load multiple files in parallel, rather than running many small COPYs.
  • Split large datasets into multiple files of roughly equal size, ideally a multiple of the number of slices in your cluster.
  • Compress your files (for example with gzip) to reduce transfer time.
  • Prefer COPY over many single-row INSERT statements.
  • Run ANALYZE and VACUUM after large loads.

So that's how to load data into a data warehouse in AWS. I hope you find this article helpful. Later, you can connect Redshift to Amazon QuickSight to create visualizations of your data. Thank you 😂

Source:

  • Amazon Training Course
