Efficient Redshift Data Upload using Node.js

What is Amazon Redshift?

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze huge amounts of data using SQL-based BI tools.

Using Amazon Redshift

When building a large-scale, petabyte-class system, one has to consider not only how to retrieve data from the database but also how to load data into it effectively. While the Amazon Redshift documentation is very extensive and has all the information you need, sometimes it is, well, too extensive. So I tried to summarize our experience and share the methodology that has worked nicely for us so far. I would, however, recommend reading the Amazon Redshift Best Practices for Loading Data article, which summarizes the do’s and don’ts nicely.

Being a mobile advertising service based on Real-Time Bidding, we need to examine, process and store huge amounts of data. On average, we store and process about 500k ad requests every second (and growing). This post will demonstrate how we manage to upload 500k ad requests per second into Amazon Redshift efficiently using Node.js.

TL;DR

  1. Collect your data in CSV files and upload them to Amazon S3.
  2. Use the COPY command to load those CSV files into Redshift. DO NOT use INSERTs. This is not MySQL.

Many people who move to petabyte-scale databases from standard SQL databases (like MySQL) make the basic mistake of using the same methodology they already know:

INSERT INTO table_name (field1, field2,...fieldN )
                       VALUES
                       (value1, value2,...valueN);

While this works perfectly fine on MySQL, you quickly discover that on Amazon Redshift it is painfully slow and simply the wrong approach.

Instead, Amazon Redshift provides the COPY command for loading data files quickly into the database. The basic COPY syntax looks like this:

COPY table_name FROM data_source CREDENTIALS 'aws_access_credentials';

In a nutshell, the COPY command loads data into Redshift in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB, or multiple data sources on remote hosts. The COPY process loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well.

Storing your Data in Amazon S3

Step 1: Create an array that will hold your data in memory on each server.
Step 2: Add items to the array.
Step 3: Convert the array to CSV.
Step 4: Store the CSV file in S3 whenever you want (every minute, when the number of items reaches a certain threshold, etc.).

var csv = require('csv');       // the 'csv' package (csv-stringify) from npm
var AWS = require('aws-sdk');   // official AWS SDK for Node.js
var s3 = new AWS.S3();          // assumes AWS credentials are configured in the environment
/*
the rest of your code
*/

var file_name = 'items_' + this_date + '_' + this_hour + '_' + this_minute + '_' + Date.now() + '.txt';

// Convert the in-memory array to a pipe-delimited CSV string
csv.stringify(data_array, {
    delimiter: '|'
}, function (err, data) {

    var params = {
        Bucket: 'my-bucket',
        Key: file_name,
        Body: data
    };

    // 10 MB parts, uploaded one part at a time
    var options = {partSize: 10 * 1024 * 1024, queueSize: 1};

    s3.upload(params, options, function (s3err, result) {
        if (s3err) {
            //handle the error
        }
        else {
            //Upload to S3 successful. You can clear the in-memory array
        }
    });
});

For us, the best practice is to upload data to S3 every minute; a separate process then collects the files from Amazon S3 and copies the data into the database. The file naming should follow a pattern that can later be used to collect the files and load them into Redshift. We also add a Unix timestamp at the end of the file name to distinguish between data uploaded from different servers or Node workers.
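To make the in-memory buffering and the once-a-minute flush concrete, here is a minimal sketch. The buffer threshold, the flushToS3 helper (which would wrap the csv.stringify and S3 upload code above) and the one-minute interval are illustrative assumptions, not our exact production code.

// Minimal sketch of the buffering and periodic flush described above.
// 'flushToS3' is a hypothetical helper wrapping the csv.stringify + s3.upload code;
// names and the flush policy are assumptions.

var buffer = [];                       // Step 1: in-memory array per server/worker
var MAX_ITEMS = 100000;                // illustrative threshold

function addItem(item) {               // Step 2: add items as ad requests arrive
    buffer.push(item);
    if (buffer.length >= MAX_ITEMS) {
        flush();                       // flush early if the buffer grows too large
    }
}

function flush() {
    if (buffer.length === 0) return;
    var batch = buffer;
    buffer = [];                       // swap buffers so new items keep accumulating
    flushToS3(batch);                  // Steps 3-4: stringify to CSV and upload to S3
}

setInterval(flush, 60 * 1000);         // flush once a minute, as described above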

Utilizing Parallel Uploading with the COPY Command

Now that we have multiple CSV files stored on Amazon S3, we can load them into Redshift:

var pg = require('pg');   // node-postgres; Redshift speaks the PostgreSQL protocol
// conString is your Redshift connection string, e.g. 'postgres://user:password@host:5439/db'

var prefix = 'items_' + this_date + '_' + this_hour + '_' + this_minute + '_';

pg.connect(conString, function (err, client, release) {
    if (err) {
        console.error(err);
        release();
    } else {

        // COPY every S3 object whose key starts with the prefix, in parallel
        var pg_query = "copy table_name from 's3://my-bucket/" + prefix + "' credentials 'aws-credentials' escape delimiter as '|' MAXERROR 10;";

        client.query(pg_query, function (err1, pgres) {
            //query completed, we can release the connection back to the pool
            release();
            if (err1) {
                console.error(err1);
            } else {
                //copy successful. The source files on S3 can now be deleted
            }
        });

    }
});


OK, some explaining is due:

The prefix must match the beginning of the file names you stored earlier; Redshift will copy the data from all the files matching that prefix.

‘MAXERROR 10’ sets how many records are allowed to fail within a single load before the whole COPY fails. It is good practice to keep ‘MAXERROR 0’ in development, and to allow some errors in production while monitoring Redshift for load errors (as suggested below).
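One simple way to apply this advice is to pick the MAXERROR value per environment. The sketch below assumes the environment is distinguished via NODE_ENV; that choice, and the table and bucket names, are illustrative, not something Redshift prescribes.

// Hypothetical sketch: choose MAXERROR per environment
var maxError = process.env.NODE_ENV === 'production' ? 10 : 0;

var pg_query = "copy table_name from 's3://my-bucket/" + prefix + "' " +
               "credentials 'aws-credentials' " +
               "escape delimiter as '|' MAXERROR " + maxError + ";";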

Monitoring Results and Looking for Errors

When data is copied into the database, Redshift records load errors in a table called ‘pg_catalog.stl_load_errors’. Any failed rows will be listed there. It is good practice to go over this table from time to time and check whether there were errors in the upload process.
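For reference, here is a minimal sketch of such a check using the same node-postgres client as above. The columns selected (starttime, filename, line_number, colname, err_reason) are standard stl_load_errors columns, but the query itself is just an illustration, not our exact monitoring code.

var pg = require('pg');   // same node-postgres client used for the COPY command

pg.connect(conString, function (err, client, release) {
    if (err) {
        console.error(err);
        return release();
    }

    // List the most recent load errors recorded by Redshift
    var errQuery =
        "SELECT starttime, filename, line_number, colname, err_reason " +
        "FROM pg_catalog.stl_load_errors " +
        "ORDER BY starttime DESC LIMIT 20;";

    client.query(errQuery, function (err1, res) {
        release();
        if (err1) {
            console.error(err1);
        } else {
            res.rows.forEach(function (row) {
                console.log(row.filename, row.line_number, row.err_reason);
            });
        }
    });
});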

Enjoy Amazon Redshift. We do.


Rafi Ton
