Apache Parquet: How to Be a Hero with the open source columnar data format on the Google and Amaz...

(Original source by Thomas Spicer)

Get All the Benefits of Apache Parquet File Format for Google Cloud, Amazon Athena and Redshift Spectrum

You have read about Google Cloud (BigQuery, Dataproc…), Amazon Redshift Spectrum and AWS Athena, and you are looking to take advantage of one or two of them. Before you jump into the deep end, however, you will want to familiarize yourself with the opportunity of using the Apache Parquet file format instead of plain text, CSV or TSV files. Parquet is a columnar storage format, which allows systems like Amazon Athena to query data column by column rather than scanning a flat file like CSV.

If you are not thinking about how to optimize for these new query service models, you could be throwing money out the window.

What Is Apache Parquet?

Apache Parquet format is a columnar storage format with the following characteristics:

Apache Parquet is column-oriented and designed for efficient columnar storage of data, in contrast to row-based formats like CSV

Apache Parquet is built from the ground up with complex nested data structures in mind

Apache Parquet is built to support very efficient compression and encoding schemes

Apache Parquet lowers storage costs for data files and maximizes the effectiveness of querying data with serverless technologies like Amazon Athena, Redshift Spectrum and Google Dataproc.

Apache Parquet is a self-describing data format that embeds the schema, or structure, within the data itself. This results in a file that is optimized for query performance and minimal I/O. Parquet also supports very efficient compression and encoding schemes. The great thing is that it is licensed under the Apache Software Foundation and available to any project.
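To make the "self-describing" point concrete, here is a minimal sketch using the pyarrow library (the file name, column names and values are made up for illustration):

import pyarrow as pa
import pyarrow.parquet as pq

# Build a tiny table; the columns and values here are purely illustrative.
table = pa.table({
    "order_id": [1, 2, 3],
    "amount": [19.99, 4.50, 102.00],
    "country": ["US", "DE", "JP"],
})

# Write it out with Snappy compression (pyarrow's default codec).
pq.write_table(table, "orders.parquet", compression="snappy")

# The schema travels with the file, so any reader can recover column
# names and types without a separate data dictionary.
print(pq.read_schema("orders.parquet"))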

Parquet and The Rise of Cloud Warehouses & Interactive Query Services

The rise of interactive query services like AWS Athena and Amazon Redshift Spectrum makes it easy to analyze data in storage systems like Amazon S3 using standard SQL. Data warehouses like Google BigQuery and the Google Dataproc platform can also ingest data in a variety of formats.

However, the data format you select can have significant implications on performance and cost, especially if you are looking at machine learning, AI or other complex operations. We will walk you through a few examples of those considerations.

Parquet vs CSV

CSV is simple and ubiquitous. Many tools like Excel, Google Sheets and a host of others can generate CSV files. You can even create them with your favorite text editor. We all love CSV files, but even that love has a cost, especially if CSV is the default format in your data processing pipelines.

AWS Athena and Amazon Redshift Spectrum charge you by the amount of data scanned per query. (Many other services also charge based on data scanned, so this is not unique to AWS.)

Google and Amazon charge you for the amount of data stored in Google Cloud Storage and Amazon S3

Google Dataproc charges are time-based

Defaulting to CSV has both technical and financial consequences (and not good ones). You will learn to love Apache Parquet just as much as your trusty CSV.

Example 1: A 1 TB CSV File

The following demonstrates the efficiency and effectiveness of using a Parquet file vs CSV.

By converting your CSV data to Parquet’s columnar format, then compressing and partitioning it, you save money and reap the rewards of better performance. The following table compares the savings realized by converting data from CSV to Parquet.
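As a rough sketch of that conversion step with pandas and pyarrow (the paths and the partition column are placeholders, not part of the example):

import pandas as pd

# Read the raw CSV, then rewrite it as compressed, partitioned Parquet so
# query engines can prune whole directories instead of scanning everything.
# Reading and writing s3:// paths requires the s3fs package.
df = pd.read_csv("s3://my-bucket/raw/events.csv")        # placeholder path
df.to_parquet(
    "s3://my-bucket/curated/events/",                    # placeholder path
    engine="pyarrow",
    compression="snappy",
    partition_cols=["event_date"],                       # placeholder column
)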


Think about this: if over the course of a year you stuck with uncompressed 1 TB CSV files as the foundation of your queries, your costs would be about $2,000 USD. Using Parquet files, your total cost would be about $3.65 USD. I know you love your CSV files, but do you love them THAT much?

Also, if time is money, your analysts could be spending close to 5 minutes waiting for a query to complete simply because you use raw CSV. If you are paying someone $150 an hour and they run that query once a day for a year, they will spend about 30 hours simply waiting for queries to complete. That is roughly $4,500 in unproductive “wait” time. Total wait time for the Apache Parquet user? About 42 minutes, or roughly $100.
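A quick sketch of the arithmetic behind those wait-time numbers (the 5-minute, 42-minute and $150 figures come straight from the paragraph above):

# Back-of-the-envelope cost of analyst time spent waiting on queries.
hourly_rate = 150                          # USD per analyst hour
days_per_year = 365

csv_wait_hours = 5 * days_per_year / 60    # 5 minutes per day -> ~30 hours a year
parquet_wait_hours = 42 / 60               # ~42 minutes total -> ~0.7 hours a year

print(f"CSV wait cost:     ${csv_wait_hours * hourly_rate:,.0f}")      # ~$4,560
print(f"Parquet wait cost: ${parquet_wait_hours * hourly_rate:,.0f}")  # ~$105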

Example 2: Parquet, CSV and Your Redshift Data Warehouse

Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries against data in Amazon S3. This can be an effective strategy for teams that want to split their data, keeping some of it resident within Redshift and the rest on S3. For example, let’s assume you have about 4 TB of data in a historical_purchase table in Redshift. Since it is not accessed frequently, offloading it to S3 makes sense: it frees up space in Redshift while your team still has access via Spectrum. The big question then becomes what format to store that 4 TB historical_purchase table in. CSV? How about Parquet?
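If you go the Parquet route, the Spectrum side of the offload might look roughly like the following sketch, which uses psycopg2 to run plain SQL against the cluster. The connection details, external schema name, S3 location and column definitions are all assumptions for illustration, and Spectrum also requires an external schema backed by an IAM role, which is not shown here:

import psycopg2

# Connection details are placeholders for a real Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="...",
)
conn.autocommit = True  # external-table DDL cannot run inside a transaction block

# Define an external table over the Parquet files sitting in S3; the schema
# "spectrum" is assumed to have been created with CREATE EXTERNAL SCHEMA.
ddl = """
CREATE EXTERNAL TABLE spectrum.historical_purchase (
    purchase_id   BIGINT,
    customer_id   BIGINT,
    amount        DOUBLE PRECISION,
    purchased_at  TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-bucket/historical_purchase/'
"""

with conn.cursor() as cur:
    cur.execute(ddl)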

Our historical_purchase table has four equally sized columns, stored in Amazon S3 in three formats: uncompressed CSV, GZIP-compressed CSV and Parquet.


Uncompressed CSV File

The uncompressed CSV file has a total size of 4 TB. Running a query to get data from a single column of the table requires Redshift Spectrum to scan the entire 4 TB file. As a result, this query would cost $20.

GZIP CSV File

If you compress your CSV file using GZIP, the file size is reduced to 1 TB. Great savings! However, Redshift Spectrum still has to scan the entire file. The good news is that your CSV file is four times smaller than the uncompressed one, so you pay one-fourth of what you did before. This query would cost $5.

Parquet File

If you compress your file and convert it to Apache Parquet, you end up with 1 TB of data in S3. However, because Parquet is columnar, Redshift Spectrum can read only the column that is relevant to the query being run, so it needs to scan just one-quarter of the data. This query would cost only $1.25.

If you run this query once a day for a year, using uncompressed CSV files would cost $7,300. Even the compressed CSV queries would cost over $1,800. Using the Apache Parquet format, however, would cost about $460. Still in love with your CSV files?
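The per-query and annual numbers above follow directly from the amount of data scanned; a minimal sketch of the math, assuming the published $5-per-terabyte-scanned rate for Redshift Spectrum and Athena:

# Query cost = terabytes scanned x price per TB scanned.
PRICE_PER_TB = 5.00   # USD per TB scanned
DAYS_PER_YEAR = 365

scenarios = {
    "Uncompressed CSV": 4.00,   # must scan the full 4 TB file
    "GZIP CSV":         1.00,   # must scan the full 1 TB compressed file
    "Parquet":          0.25,   # scans a single column, ~1/4 of 1 TB
}

for name, tb_scanned in scenarios.items():
    per_query = tb_scanned * PRICE_PER_TB
    print(f"{name:17} ${per_query:5.2f} per query  ${per_query * DAYS_PER_YEAR:8.2f} per year")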

Summary

The trend toward “serverless”, interactive query services and pre-built data processing suites is progressing rapidly, creating new opportunities for teams to move faster with smaller investments. Athena and Spectrum make it easy to analyze data in Amazon S3 using standard SQL, and Google supports loading Parquet files into both BigQuery and Dataproc.
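For the BigQuery side, here is a minimal sketch of loading Parquet from Cloud Storage using the google-cloud-bigquery client library; the bucket, dataset and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Parquet is self-describing, so BigQuery can infer the table schema
# directly from the files.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.parquet",   # placeholder Cloud Storage URI
    "my_project.my_dataset.events",      # placeholder destination table
    job_config=job_config,
)
load_job.result()  # block until the load job finishes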

When you pay only for the queries you run, or for resources like CPU and storage, it is important to optimize the data those systems rely on.

By the way, we have launched a zero-administration data processing framework for Amazon Redshift Spectrum and Amazon Athena which includes automated database/table creation, Parquet file conversion, partitioning and more. See the announcements for details:

Amazon Redshift Spectrum Automated — 60 Second Setup, Zero Administration And Automatic…

Announcing fully managed support for our zero-administration Amazon Redshift Spectrum data pipeline service. (blog.openbridge.com)

AWS Athena Automated — 60 Second Setup, Zero Administration And Automatic Optimization

We are excited to announce the release of our zero-administration AWS Athena data pipeline service. (blog.openbridge.com)

Also, take a look at our post about AWS Redshift Spectrum and AWS Athena. Using Apache Parquet can benefit both!

How is AWS Redshift Spectrum different than AWS Athena?

This question has come up a few times, and most of the discussion is centered around the technical differences. Rather… (blog.openbridge.com)

Did we miss anything? Do you have any questions about how to transform your CSV files to Apache Parquet? If you want help streamlining your data to Google Cloud, AWS Athena, AWS Redshift Spectrum or other data technologies, feel free to leave a comment or contact us at [email protected]. You can also visit us at https://www.openbridge.com to learn how we are helping other companies with their data efforts.
