
Using Snowflake to Query Datasets from GCS

Overview

I have recently taken an interest in the land of data engineering and the processes surrounding data collection and transformation. There is something oddly satisfying about figuring out which combination of jq commands you can mash together, with a few sed commands sprinkled on top, to eventually return a syntactically correct JSON file where all the data is parsed correctly. It’s sometimes frustrating, but it’s like solving a puzzle, and it’s quite satisfying when you finally get the desired output. As I find time to explore the world of data, I will continue to share what I learn on this topic.

The diagram below illustrates the storage integration we will eventually set up. The integration will allow Snowflake to ingest datasets from Google Cloud Storage. Before we do that, we need to understand the key components that make up the Snowflake platform to avoid any confusion.

Snowflake is a cloud-native data platform that removes much of the operational overhead of managing infrastructure, storage, and the backend in-house. Snowflake can be deployed on AWS, Azure, or GCP. The stack deployment is wholly abstracted from the end user, and once you sign up, you can start interacting with Snowflake. Below are Snowflake deep dives I found for each cloud platform, where the companies go into more depth on how Snowflake operates on each; much of it is similar apart from the platform-specific services that support Snowflake.

Warehouses

One of the contributing factors to Snowflake’s popularity is that compute costs are only incurred when a query is made. This is the role of the warehouse. Warehouses have fixed compute sizes and costs associated with them, and it’s easy to create new warehouses and limit who can use them, and which databases and tables they can touch, through roles. Thinking about this in practice, you might want dedicated warehouses for queries against production data, while any queries made outside production get their own warehouses with the permissions necessary. This ensures that production compute is reserved for production operations.
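
As a rough sketch of that pattern (prod_wh and prod_role are hypothetical names, not part of this walkthrough), a dedicated warehouse can be created and handed to a role like so:

create warehouse if not exists prod_wh with warehouse_size = 'xsmall' auto_suspend = 60 initially_suspended = true;
-- only sessions running with prod_role (or a role above it) can use this warehouse
grant usage on warehouse prod_wh to role prod_role;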

The walkthrough for this project will use an X-Small warehouse, but feel free to indulge and use a larger one. After all, no credit card is required for the 30-day free trial, so there is no possibility of getting a large bill later. For a company to let me try out their product without a credit card is admirable. While the trial does not require a credit card, you have a fixed allocation of credits, so use them wisely. If you want to execute complex SQL queries that will likely run slowly on an X-Small warehouse, you can switch to a larger warehouse size.
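
Resizing is a single statement if you decide you need more horsepower mid-project. A minimal sketch, assuming the redditwh warehouse we create later in this walkthrough:

alter warehouse redditwh set warehouse_size = 'xlarge';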

Billing

Snowflake charges customers for the compute and storage consumed. This billing model is attractive because the services are billed based on usage; no fixed costs for storage and compute are incurred. A small service cost is subtracted from your credits each month for providing the Snowflake service, and it can vary based on the region and cloud platform used to host Snowflake. For a more detailed breakdown of the associated costs for each cloud provider, go here.

The billing model is a direct extension of the cloud platform Snowflake is hosted on. Snowflake uses a credit-based billing model. In most cases, 1 Snowflake credit is equivalent to $2.00 USD, though this varies based on the cloud platform and region used to host Snowflake. You can calculate the pricing here. Snowflake does not mark up any of the costs related to compute and storage.

Warehouse Pricing

Below is a table from the Snowflake documentation regarding credit usage for warehouses. For every hour a warehouse runs, the cost is the number of credits consumed multiplied by the price of a credit, which in many cases is $2.00 USD. Snowflake bills per second (with a 60-second minimum each time a warehouse starts or resumes), so when you run short batch jobs on large clusters, you are billed only for the seconds used. This translates into cost savings when comparing this model against owning a large compute capacity that incurs costs all the time. Let’s look at a couple of examples of how billing is calculated:

  • If I wanted to run a 4X-Large warehouse for 1 hour, the credits used would be 128 * 1 = 128, which is $256.00 USD.
  • If I wanted to run a 4X-Large warehouse for 30 seconds, the credits used would be 0.0356 * 30 = 1.068, which is about $2.00 USD.
| Warehouse Size | Servers / Cluster | Credits / Hour | Credits / Second |
| --- | --- | --- | --- |
| X-Small | 1 | 1 | 0.0003 |
| Small | 2 | 2 | 0.0006 |
| Medium | 4 | 4 | 0.0011 |
| Large | 8 | 8 | 0.0022 |
| X-Large | 16 | 16 | 0.0044 |
| 2X-Large | 32 | 32 | 0.0089 |
| 3X-Large | 64 | 64 | 0.0178 |
| 4X-Large | 128 | 128 | 0.0356 |
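
If you want to see how these theoretical numbers translate into actual consumption on your account, one option, shown here as a sketch (it requires the ACCOUNTADMIN role, and account usage data lags by up to a few hours), is to query the warehouse metering view:

select warehouse_name, sum(credits_used) as credits_used
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd('day', -7, current_timestamp())
group by warehouse_name;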

Getting Started

  1. Go to Snowflake and sign up for a free trial. Snowflake will give you a 30-day trial with no credit card required. You will be asked to choose a cloud platform to host the Snowflake environment. Since we use GCS as the data source, select Google Cloud Platform.
  2. We now need to source a dataset. We will use Pushshift to download a dataset that contains 69 million Reddit user accounts. The download is about 1 GB and can be found here.
    • The downloaded file is a CSV, and it is compressed using gzip.
    • This dataset has already been cleaned, so we don’t need to perform any transformation before loading it into Snowflake.
  3. We can now upload the dataset to GCS. Use the commands below to create a new GCS bucket and upload the dataset to it. We append the project number to the bucket name to avoid conflicts with another bucket that might have the same name.

PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects list --filter="${PROJECT_ID}" --format="value(PROJECT_NUMBER)")

gsutil mb gs://reddit_accounts_${PROJECT_NUMBER}
gsutil cp ~/Downloads/69M_reddit_accounts.csv.gz gs://reddit_accounts_${PROJECT_NUMBER}/

*Note: Make sure you are logged into the Google Cloud SDK. Run `gcloud auth login` to authenticate to GCP.*

When the upload has completed, confirm the file is in the correct location in the bucket.

Setup Snowflake

We are now ready to configure Snowflake and ingest the dataset from GCS. Head to the link for your Snowflake environment. The link is unique for each customer and is provided to you after signing up.

The first thing we need to do is elevate our Snowflake role so we can create a database, schema, and table. Run the following command in the worksheet to switch to the ACCOUNTADMIN role. To run a block of code in a worksheet, highlight the code, right-click, and select Run.

use role accountadmin;

We can now create our database, warehouse, and schema. The warehouse is the actual compute that executes queries against our dataset. We don’t need anything excessive for our use case.

create database if not exists redditdb;
create warehouse if not exists redditwh with warehouse_size = 'xsmall' auto_suspend = 60 
initially_suspended = true;
use schema redditdb.public;
use database redditdb;   

For our dataset to import properly, we need to structure our table with the correct column names and data types. To determine the required columns, I peeked at the CSV and matched the names for my table.

CREATE TABLE REDDIT_ACCOUNTS (
   id number,
   name varchar(100),
   created_utc number,
   updated_on number,
   comment_karma number,
   link_karma number
);

Snowflake has a streamlined method for granting Snowflake access to your object storage. The commands below create a storage integration for GCS and generate a service account that we will add to our IAM policy after completing this step.

CREATE STORAGE INTEGRATION gcp_storage
   TYPE = EXTERNAL_STAGE
   STORAGE_PROVIDER = GCS
   ENABLED = TRUE
   STORAGE_ALLOWED_LOCATIONS = ('*');
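
The wildcard above lets the integration reach any location the service account is later granted access to. If you would rather scope it tightly, a variant (assuming the bucket created earlier; replace <PROJECT_NUMBER> with your project number) might look like:

CREATE STORAGE INTEGRATION gcp_storage
   TYPE = EXTERNAL_STAGE
   STORAGE_PROVIDER = GCS
   ENABLED = TRUE
   STORAGE_ALLOWED_LOCATIONS = ('gcs://reddit_accounts_<PROJECT_NUMBER>/');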

After the above completes, run the following to describe the integration we just created. This will show the service account address that needs to be granted permissions in IAM. Save the service account email for the next section.

DESC STORAGE INTEGRATION gcp_storage;
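
The value we care about is the STORAGE_GCP_SERVICE_ACCOUNT property. If you prefer to pull just that value out of the DESC output, here is an optional sketch (the DESC output column names are lowercase, hence the quoted identifiers):

select "property_value"
from table(result_scan(last_query_id()))
where "property" = 'STORAGE_GCP_SERVICE_ACCOUNT';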

Create/Assign an IAM Role

We will create a role with only the necessary permissions required to read objects from a storage account.

Create a custom role using the instructions below.

  • Log in to the Google Cloud Platform Console as a project editor.
  • From the home dashboard, choose IAM & admin » Roles.
  • Click Create Role.
  • Enter a name and description for the custom role.
  • Click Add Permissions.
  • Filter the list of permissions using the Storage Admin role and add the following from the list: storage.buckets.get, storage.objects.get, storage.objects.list

Assign the custom role to the Snowflake service account.

  • Head over to Storage in the GCP console.
  • Select a bucket to assign the roles.
  • Click SHOW INFO PANEL in the upper-right corner. The information panel for the bucket slides out.
  • In the Add members field, paste the service account we retrieved from Snowflake.
  • Click Save.

Import the Dataset from GCS

Grant usage on the storage integration to the built-in sysadmin role, then switch to that role for the remaining steps.

grant usage on integration gcp_storage to sysadmin;
use role sysadmin;

For the data to be processed, we need to specify how it is structured and identify the type of compression so Snowflake can decompress the data using the correct algorithm. The key attributes to consider are the type, compression, and field_delimiter; these values are specific to the source dataset, while the other options are choices driven by the use case.

create or replace file format reddit_accounts_csv
   type = csv
   field_delimiter = ','
   skip_header = 1
   null_if = ('NULL', 'null')
   empty_field_as_null = true
   compression = gzip;

Create a Snowflake stage that ties together the file format and storage integration components.

Replace PROJECT_NUMBER with your project number.

create stage my_gcs_stage
  url = 'gcs://reddit_accounts_<PROJECT_NUMBER>'
  storage_integration = gcp_storage
  file_format = reddit_accounts_csv;   
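
Before copying anything, it is worth confirming that Snowflake can actually see the file through the stage. If the storage integration and IAM role are wired up correctly, listing the stage should return the compressed CSV:

list @my_gcs_stage;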

Finally, copy the data into the REDDIT_ACCOUNTS table.

copy into REDDIT_ACCOUNTS
  from @my_gcs_stage;

It takes around 2 minutes to import 69,382,538 rows into our table from a compressed CSV file. This is also running on the smallest data warehouse size. This puts in perspective the power of today’s data warehouse technologies.
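
As a quick sanity check, a simple count should match the number of rows reported by the COPY command:

SELECT COUNT(*) FROM REDDIT_ACCOUNTS;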

Query the Data

Return all Reddit usernames that contain "greg" in them.

SELECT * FROM REDDIT_ACCOUNTS
  WHERE NAME LIKE '%greg%'
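
A side note on matching: LIKE is case-sensitive in Snowflake, so usernames containing Greg with a capital G are not matched. If you want a case-insensitive match, ILIKE is a drop-in alternative (a sketch only; the counts below use LIKE as written):

SELECT * FROM REDDIT_ACCOUNTS
  WHERE NAME ILIKE '%greg%'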

Return the count of all Reddit usernames with "greg" in them.

SELECT COUNT(NAME) FROM REDDIT_ACCOUNTS
  WHERE NAME LIKE '%greg%'

| Row | COUNT(NAME) |
| --- | --- |
| 1 | 25350 |

Return the count of all Reddit usernames with "trump" in them.

SELECT COUNT(NAME) FROM REDDIT_ACCOUNTS
  WHERE NAME LIKE '%trump%'

| Row | COUNT(NAME) |
| --- | --- |
| 1 | 10643 |

See who has the most comment karma out of all 69 million Reddit accounts.

SELECT * FROM REDDIT_ACCOUNTS
  ORDER BY COMMENT_KARMA DESC

| Row | NAME | COMMENT_KARMA |
| --- | --- | --- |
| 1 | TooShiftyForYou | 13076606 |
| 2 | Poem_for_your_sprog | 4480894 |
| 3 | dick-nipples | 3747915 |

I would expect nothing less from a query on Reddit data. We are playing with fire folks…

Final Thoughts

The platform is intuitive. Snowflake is similar to BigQuery in that the compute is separated from where the datasets are stored. This decoupling makes it possible to scale compute independently of the data, and it allows many warehouses to operate independently of each other while interacting with a common location for the data. The ability to automatically suspend and resize warehouses when needed keeps costs low, and per-second billing can be a huge advantage when running complex queries that might take a long time on traditional low-capacity databases. And if you have five bucks to spare, you can run a query on a 4X-Large warehouse for a little over a minute, which is 128 servers in a cluster executing your query. Powerful stuff.

Regarding working with Snowflake, there is a learning curve with its ANSI SQL dialect, along with configuring the storage integrations, file formats, and stages, but the Internet will guide you. I spent a fair amount of time on Stack Overflow when working through the storage integration and IAM assignment setup. I think we all end up on Stack Overflow one way or another, and the Snowflake community is active there.

Importing data into a warehouse is only the tip of the iceberg. After a dataset is loaded into Snowflake or another data warehouse, platforms such as Apache Superset can visualize the data. Superset supports database connections to Snowflake and other cloud data warehouses, notably BigQuery.

If you found this content interesting, I encourage you to explore other available datasets. Our World In Data provides a Covid-19 dataset in CSV and JSON format that is updated daily. The dataset largely aggregates data from Johns Hopkins University, the government of the United Kingdom, and other institutions. OWID also has some interesting articles and publications on its website driven by data analysis. In my spare time, I am currently working on additional content using Apache Superset with BigQuery, so stay tuned. If you have any questions about the content, don’t hesitate to reach out.

Source Code

Repository

References

Getting Started — Snowflake Documentation

Overview of Warehouses — Snowflake Documentation

Configuring an Integration for Google Cloud Storage — Snowflake Documentation

Snowflake Pricing Information

Snowflake Community

Snowflake Stack Overflow Community