How to transform data in Snowflake | Use Snowflake Dynamic Tables like a pro

 

Hello Data Pros, and welcome back to another exciting episode of our Snowflake learning series!

In our previous video, we discussed various options for ingesting data into Snowflake, including Tasks, Snowpipe, Snowpipe Streaming, and the Snowflake Connector for Kafka. Today, we’re diving into the world of data transformations, with a special focus on Dynamic Tables.


Let’s jump right in!

 

-----Follow the link in the description for the SQL scripts used in this lab -----

---------------------------------------------------------------------------
---------------- Setting Up Foundational Objects for the Lab---------------
---------------------------------------------------------------------------

-- Create the database
CREATE DATABASE IF NOT EXISTS my_database;

-- Use the created database
USE DATABASE my_database;

-- Create the schema
CREATE SCHEMA IF NOT EXISTS my_schema;

-- Use the created schema
USE SCHEMA my_schema;

-- Create the 'customers' table
CREATE TABLE IF NOT EXISTS customers (
    customer_id INT,
    name STRING,
    email STRING,
    phone STRING,
    age INT
);

-- Create the 'orders' table
CREATE TABLE IF NOT EXISTS orders (
    order_id INT,
    customer_id INT,
    order_amount DECIMAL(10, 2),
    order_date DATE
);

-- Insert sample records into the 'customers' table
INSERT INTO my_schema.customers (customer_id, name, email, phone, age) VALUES
(1, '  Alice Smith  ', 'ALICE.SMITH@EXAMPLE.COM', '(123) 456-7890', 29),
(2, 'Bob Johnson   ', 'bob.johnson@example.com', '9876543210', 35),
(3, '  Carol Williams', 'carol.williams@example.com', '555-123456', 42),
(4, ' David Brown ', 'david.brown@example.com', '111-222-3333', 31),
(5, 'Eva Green', 'eva.green@example.com', '444-555-6666', 27),
(6, 'Frank Harris  ', 'frank.harris@example.com', '777-888-9999', 40),
(7, 'Gina White ', 'gina.white@example.com', '222-333-4444', 33),
(8, '  Henry Black', 'henry.black@example.com', '555-666-7777', 39),
(9, 'Ivy Gray', 'ivy.gray@example.com', '888-999-0000', 25),
(10, 'Jack Lee  ', 'jack.lee@example.com', '999-000-1111', 45);

-- Insert sample records into the 'orders' table
INSERT INTO my_schema.orders (order_id, customer_id, order_amount, order_date) VALUES
(1001, 1, 250.75, '2024-01-15'),
(1002, 2, 150.00, '2024-06-22'),
(1003, 1, 75.25, '2024-07-30'),
(1004, 3, 120.00, '2024-08-12'),
(1005, 4, 300.50, '2024-02-14'),
(1006, 5, 180.00, '2024-03-21'),
(1007, 6, 220.75, '2024-04-25'),
(1008, 7, 90.00, '2024-05-30'),
(1009, 8, 160.50, '2024-06-15'),
(1010, 9, 110.25, '2024-07-05'),
(1011, 10, 250.00, '2024-08-09'),
(1012, 1, 50.00, '2024-08-18'),
(1013, 2, 75.50, '2024-09-12'),
(1014, 3, 130.00, '2024-10-01'),
(1015, 4, 160.00, '2024-10-21'),
(1016, 5, 200.00, '2024-11-02'),
(1017, 6, 85.75, '2024-11-15'),
(1018, 7, 95.50, '2024-12-01'),
(1019, 8, 140.00, '2024-12-14'),
(1020, 9, 220.00, '2024-12-25'),
(1021, 1, 200.00, '2023-12-15'),
(1022, 2, 150.00, '2023-11-20'),
(1023, 3, 300.00, '2023-10-10'),
(1024, 4, 400.00, '2023-09-05'),
(1025, 5, 250.00, '2023-08-12'),
(1026, 6, 100.00, '2023-07-01');


---------------------------------------------------------------------------
----------------------- DATA TRANSFORMATIONS MAIN LAB ---------------------
---------------------------------------------------------------------------


-- Note: this is a single SQL statement that performs the transformation in multiple steps using common table expressions (CTEs)

-- Define a dynamic table with the transformation logic
CREATE OR REPLACE DYNAMIC TABLE customer_sales_summary (
    customer_id INT,
    name STRING,
    email STRING,
    phone STRING,
    total_sales DECIMAL(10, 2)
)
TARGET_LAG = '1 min'
WAREHOUSE = compute_wh
REFRESH_MODE = INCREMENTAL
INITIALIZE = ON_CREATE
AS

-- Data Cleansing: Remove leading/trailing spaces and non-numeric characters
WITH cleaned_customers AS (
    SELECT customer_id,
           TRIM(name) AS cleaned_name,  
           -- Trim leading and trailing spaces from the name field
           
           REGEXP_REPLACE(phone, '[^0-9]', '') AS cleaned_phone,
           -- Remove all non-numeric characters from the phone field
           
           email,
           age
    FROM customers
),

-- Data Standardization: Convert emails to lowercase and format phone numbers
standardized_customers AS (
    SELECT customer_id,
           cleaned_name AS name,
           LOWER(email) AS email,
           -- Convert all email addresses to lowercase
           
           CONCAT('(', SUBSTR(cleaned_phone, 1, 3), ') ', SUBSTR(cleaned_phone, 4, 3), '-', SUBSTR(cleaned_phone, 7, 4)) AS phone,
           -- Format phone numbers into (XXX) XXX-XXXX format (assumes a 10-digit number)
           
           age
    FROM cleaned_customers
),

-- Data Filtering: Select orders placed in 2024 or later
filtered_orders AS (
    SELECT order_id,
           customer_id,
           order_amount
           -- Exclude order_date as it's not needed in the final output
    FROM orders
    WHERE YEAR(order_date) >= 2024
           -- Filter the orders to include only those placed in 2024 or later
),

-- Joining Multiple Datasets: Combine customer data with filtered orders
customer_orders AS (
    SELECT c.customer_id,
           c.name,
           c.email,
           c.phone,
           o.order_amount
    FROM standardized_customers c
    JOIN filtered_orders o ON c.customer_id = o.customer_id
           -- Join the standardized customer data with filtered order data
),

-- Data Aggregation: Calculate total sales per customer
aggregated_data AS (
    SELECT customer_id,
           name,
           email,
           phone,
           SUM(order_amount) AS total_sales
           -- Calculate the total sales amount per customer
    FROM customer_orders
    GROUP BY customer_id, name, email, phone
           -- Group by customer_id, name, email, and phone
)

-- Final Output: This is the result stored in the dynamic table
SELECT customer_id, name, email, phone, total_sales
FROM aggregated_data;


SELECT * FROM customer_sales_summary;

   
ALTER DYNAMIC TABLE customer_sales_summary SET TARGET_LAG = '1 minute';             -- Near real-time data load to the target table
ALTER DYNAMIC TABLE customer_sales_summary SET TARGET_LAG = '6 hours';              -- Batch data load to the target table


ALTER DYNAMIC TABLE customer_sales_summary SUSPEND;     -- Pause automatic refreshes
ALTER DYNAMIC TABLE customer_sales_summary RESUME;      -- Resume automatic refreshes
ALTER DYNAMIC TABLE customer_sales_summary REFRESH;     -- Trigger a manual refresh



Data transformation is the process of converting one dataset into another, usually to meet the requirements of a destination platform or downstream applications. This process can include activities like data cleansing, standardization, filtering, joining multiple datasets, aggregating data, and more.

There are several ways to transform data in Snowflake. You can use Snowflake’s native features, like streams and tasks, or take advantage of Dynamic Tables.

 

 

If you prefer external tools, dbt is a popular choice, where transformation logic is written in SQL. Another option is Matillion, which allows you to design your transformations in a visual interface.

If you're interested in dbt, we have an excellent playlist on our channel—please check the description for the link.

 

In this video, we’ll be covering only Snowflake’s native features. On that note, let's start with streams and tasks.

 

 

First, create a stream on your source table to capture changes. This stream tracks any inserts, updates, or deletes that have occurred since the last data load into the target table.

Next, create the target table.

Then, set up a task to process these changes. The task will pull records from the stream and load them into the target table. This approach is imperative, meaning you need to manually write detailed code to specify each step of the data-loading process.
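For reference, here's a minimal sketch of that streams-and-tasks setup, shown for a simple insert-only load of the orders table. The object names (orders_stream, orders_target, load_orders_task) are illustrative, not part of the lab above.

-- Capture new rows on the source table (insert-only stream)
CREATE OR REPLACE STREAM orders_stream ON TABLE orders APPEND_ONLY = TRUE;

-- Create the target table with the same structure
CREATE OR REPLACE TABLE orders_target LIKE orders;

-- Task that loads captured changes into the target every minute,
-- but only when the stream actually has data
CREATE OR REPLACE TASK load_orders_task
  WAREHOUSE = compute_wh
  SCHEDULE = '1 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  INSERT INTO orders_target
  SELECT order_id, customer_id, order_amount, order_date
  FROM orders_stream;

-- Tasks are created in a suspended state; resume to start the schedule
ALTER TASK load_orders_task RESUME;

Even in this simplified insert-only case, you already manage three objects; handling updates and deletes would require a MERGE statement and noticeably more code.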

 

 

Additionally, managing this setup requires maintaining multiple objects, including the table, stream, and tasks.

Dynamic tables address these issues by simplifying the data transformation process. With dynamic tables, you don't need to set up multiple objects or write complex code. Instead, you follow a declarative approach: you describe the result you want to achieve, without specifying each step to get there. Snowflake handles the details internally and transforms your data as needed.

Both approaches deliver the same results, but the amount of code required can differ significantly!

A key question that often arises is whether SQL is truly capable of performing complex data transformations.

Let’s take a closer look at this dynamic table SQL!

It might seem lengthy at first, but it's actually quite simple to understand.

The length is mainly because we wanted to demonstrate nearly all major types of data transformations.

But remember, if you were to use tasks and streams for this, you'd end up writing 2 to 3 times more code than this.

Let’s understand these data transformations!

Imagine we have two source tables: customers and orders. Our goal is to create and periodically refresh a customer-level sales summary table that will be consumed by our downstream reporting application.

First, the SQL trims any leading or trailing spaces from the NAME field, and removes non-numeric characters from the PHONE field. This step ensures the data is clean and ready for further processing.

Next, we move on to data standardization. All email addresses are converted to lowercase, a common practice to prevent inconsistencies. The phone numbers are also formatted into a standard, easy-to-read format.

Then, the SQL filters the orders table to include only those placed in 2024 or later, ensuring that only the most recent records make it into the final output.

After filtering, the SQL joins multiple datasets, combining the standardized customer data with the filtered order data, linking each customer with their corresponding orders.

Finally, we perform data aggregation, where the SQL calculates the total sales per customer, giving us a clear picture of how much each customer has spent.

The final output of this dynamic table is a clean, standardized, filtered, joined, and aggregated dataset, perfectly ready for BI reporting.

What’s truly remarkable is that this dynamic table automatically and incrementally refreshes the target table based on the changes in the underlying source tables.

All of this is achieved in a single concise script, thanks to the declarative approach of dynamic tables!

Let’s jump right into the UI and see this in action.

I’ve already set up some foundational objects for this lab and loaded the source tables with initial data. As always, you can find a link in the description to copy all the SQL used here.

Now, let’s move on to the main part of this lab: creating the dynamic table with all the transformations we’ve just discussed.

In a dynamic table, data refresh is fully automated. You simply set the TARGET_LAG, and Snowflake checks the source data at optimal intervals and refreshes the table as needed.

You can also set the REFRESH_MODE to INCREMENTAL or FULL, depending on your needs.

With the INITIALIZE option, you can populate the dynamic table as soon as it's created, or wait for the first scheduled refresh based on your TARGET_LAG configuration.

Let’s try to query the dynamic table.

Our fully transformed customer sales summary table is ready for consumption!

Before we close, I’d like to quickly highlight a few more points.

Unlike on some other data platforms, you don't need separate code or technologies for batch and real-time processing. You can easily switch between them with a simple ALTER statement.

Using the UI, you can monitor the dynamic table’s refresh history. In addition, you’ll also get this impressive lineage graph!
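If you prefer SQL over the UI, the refresh history can also be queried with Snowflake's DYNAMIC_TABLE_REFRESH_HISTORY table function. This is a hedged sketch; the selected column names are assumptions based on the function's documented output, so adjust them to what your account returns.

SELECT name, state, refresh_start_time, refresh_end_time
FROM TABLE(
    INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(
        NAME => 'MY_DATABASE.MY_SCHEMA.CUSTOMER_SALES_SUMMARY'
    )
)
ORDER BY refresh_start_time DESC;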

That's all for today! Please stay tuned for our next video where we’ll explore more advanced Snowflake features!

If you found this video helpful, please give it a thumbs up, and hit that subscribe button to stay up-to-date with the latest technologies!

Thanks for watching!

 




