How To Import JSON From Amazon S3 to PostgreSQL on RDS

And Include in an Event-driven Workflow

Mike
Level Up Coding


Pololu Valley, photo by me

Introduction

I’ve been working on a project where I was required to import partitioned JSON files in an Amazon S3 bucket into a PostgreSQL database on Amazon RDS. I had worked on another project where I did something similar — taking JSON ingested with Kinesis Data Streams into S3 and performing ETL to populate staging tables in an Amazon Redshift cluster — so I figured targeting a different storage service on the same cloud platform, and a storage technology (PostgreSQL) that originally served as the basis for the other (Redshift), would be fairly straightforward. After all, I can use the same database adapter to connect to either storage solution with Python, so maybe I can just replace some connection string parameter values and keep moving. Uh, not so fast.

Background

Redshift is a proprietary Amazon technology, as is S3, and Amazon often provides a direct path for integration and communication between its own technologies. PostgreSQL, on the other hand, is an open-source relational database management system that Amazon has adapted for its cloud platform and offered as a branded, managed service, much like other open-source technologies such as Apache Hadoop, Kafka, and Cassandra. I had to do some research, along with trial and error, to accomplish the aforementioned task of importing JSON from S3 to PostgreSQL on RDS. Along the way, I noticed others wanting to do the same thing but couldn't find a direct solution online, so I pieced one together from various resources, and I thought I'd share what worked in my case.

CSV

I also needed to import CSV to PostgreSQL and made another incorrect assumption — that I could use a COPY command to move data from either file type and just change the value of the FORMAT parameter option accordingly. I mean, it kinda works that way with Redshift, so why not? Wrong again. This post is primarily about importing JSON from S3, but I’ll briefly mention what’s required for CSV as well, as this was helpful in leading to a solution for JSON.

For CSV, we need to be on PostgreSQL version 11.1 or later and install the aws_s3 and aws_commons extensions on our database. Both can be installed with a single command executed while connected to your PostgreSQL database on RDS with pgAdmin or another SQL client tool.

CREATE EXTENSION aws_s3 CASCADE;

Then, the following is an example of SQL that can be used to import CSV data from S3:

SELECT aws_s3.table_import_from_s3(
    'my_table',
    'optional_column_list',
    '(FORMAT CSV, HEADER true)',
    'my_bucket',
    '/path/my_file.csv',
    'us-west-2'
);

This statement requires that your database table have columns corresponding to the columns specified in 'optional_column_list'. If you pass an empty string for that parameter, the query expects a table column for each delimited value in the CSV file.

JSON

For JSON, one approach would be to transform the file into CSV and use the aws_s3.table_import_from_s3 function to do the import, as above. In my case, I decided to embrace the schema-less nature of JSON and take advantage of the JSONB data type introduced with Postgres 9.4. The first step is to create a table with a JSONB field.

CREATE TABLE IF NOT EXISTS json_table (
    id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    ingested_at timestamp DEFAULT CURRENT_TIMESTAMP,
    data jsonb NOT NULL
);

I will import IoT sensor data from the project mentioned above.

Here’s the Python code to do the JSON import from S3 to PostgreSQL on RDS.

The code is fairly straightforward and self-explanatory: we download the JSON file from S3 and execute a COPY statement to load each record into the jsonb column.
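A minimal sketch of that approach might look like the following. The bucket, key, table name, and connection parameters are placeholders, it assumes boto3 and psycopg2 are installed, and it assumes the file contains one JSON object per line:

```python
import json
from io import StringIO


def prepare_copy_line(record):
    """Serialize a record to one line of JSON, doubling backslashes so
    COPY's text format doesn't reinterpret JSON escape sequences."""
    return json.dumps(record).replace("\\", "\\\\")


def import_json_from_s3(bucket, key, conn_params):
    # boto3/psycopg2 are assumed dependencies; imported here so the
    # helper above stays importable without them.
    import boto3
    import psycopg2

    # Download the JSON file from S3.
    obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    body = obj["Body"].read().decode("utf-8")

    # Build an in-memory buffer for COPY, one JSON document per line.
    buf = StringIO()
    for line in body.splitlines():
        if line.strip():
            buf.write(prepare_copy_line(json.loads(line)) + "\n")
    buf.seek(0)

    # COPY the buffer into the jsonb column of the table created above.
    with psycopg2.connect(**conn_params) as conn:
        with conn.cursor() as cur:
            cur.copy_expert("COPY json_table (data) FROM STDIN", buf)
```

The backslash doubling matters because COPY's default text format treats a backslash as an escape character, so a JSON string containing `\n` would otherwise be corrupted on the way in.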

Now, we can take advantage of the many JSONB operators and functions that PostgreSQL has to offer.
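For example, with sensor readings stored in the data column, queries along these lines become possible (the sensor_id and temperature keys are hypothetical):

```sql
-- ->> extracts a field as text; @> tests JSONB containment
SELECT data->>'sensor_id' AS sensor_id,
       (data->>'temperature')::numeric AS temperature
FROM json_table
WHERE data @> '{"sensor_id": "s-001"}';
```

The @> containment operator can also use a GIN index on the jsonb column, which helps as the table grows.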

For my use case, I'm importing data from S3 to PostgreSQL as part of a larger ETL workflow, so I implemented the import functionality in a Lambda function on AWS. I also added an SNS notification that gets published each time the function is invoked, as well as a CloudFormation template for packaging and deploying the function and all dependencies to AWS. Full source code is on GitHub.
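As a rough sketch of how such a handler might be wired up, assuming the Lambda is triggered by an S3 event notification (the topic ARN and event shape are placeholders, not the actual project code):

```python
import json


def build_notification(bucket, key, status):
    """Compose the SNS message body; the structure here is illustrative."""
    return json.dumps({"bucket": bucket, "key": key, "status": status})


def lambda_handler(event, context):
    # boto3 is available by default in the Lambda Python runtime.
    import boto3

    # Hypothetical: pull the bucket/key from an S3 event trigger.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # ... perform the S3 -> PostgreSQL import here ...

    # Publish a notification on each invocation; the ARN is a placeholder.
    boto3.client("sns").publish(
        TopicArn="arn:aws:sns:us-west-2:123456789012:json-import",
        Message=build_notification(bucket, key, "imported"),
    )
    return {"statusCode": 200}
```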

Conclusion

This is an example of how to import JSON data from S3 to PostgreSQL on RDS. If you need your data to comply with a specific schema, JSONB offers some compelling features for accomplishing that. Perhaps we’ll cover that in a future post. Thanks for reading!
