How to transform data in Snowflake | Use Snowflake Dynamic Tables like a pro
Hello Data Pros, and welcome back to another exciting episode of our Snowflake learning series!
In our previous video, we discussed various options for ingesting data into Snowflake, including Tasks, Snowpipe, Snowpipe Streaming, and the Snowflake Connector for Kafka. Today, we’re diving into the world of data transformations, with a special focus on Dynamic Tables.
Let’s jump right in!
Data transformation is the process of converting one dataset into another, usually to meet the requirements of a destination platform or downstream applications. This process can include activities like data cleansing, standardization, filtering, joining multiple datasets, aggregating data, and more.
There are several ways to transform data in Snowflake. Natively, you can use streams and tasks, or take advantage of Dynamic Tables.
If you prefer external tools, dbt is a popular choice, where transformation logic is written in SQL. Another option is Matillion, which allows you to design your transformations in a visual interface.
If you're interested in dbt, we have an excellent playlist on our channel—please check the description for the link.
In this video, we’ll be covering only Snowflake’s native features. On that note, let's start with streams and tasks.
First, create a stream on your source table to capture changes. The stream tracks any inserts, updates, or deletes that have occurred since the last load into the target table.
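Here's a minimal sketch of what that might look like, assuming a source table named raw_orders (all object names here are illustrative):

  CREATE OR REPLACE STREAM raw_orders_stream
    ON TABLE raw_orders;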
Next, create the target table.
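For example, a simple target table matching the source columns (again, illustrative):

  CREATE OR REPLACE TABLE orders_target (
    order_id   NUMBER,
    amount     NUMBER(10,2),
    order_date DATE
  );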
Then, set up a task to process these changes. The task will pull records from the stream and load them into the target table. This approach is imperative, meaning you need to manually write detailed code to specify each step of the data-loading process.
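A sketch of such a task, assuming the objects above; the schedule and warehouse name are illustrative:

  CREATE OR REPLACE TASK load_orders_task
    WAREHOUSE = my_wh
    SCHEDULE  = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('RAW_ORDERS_STREAM')
  AS
    INSERT INTO orders_target (order_id, amount, order_date)
    SELECT order_id, amount, order_date
    FROM raw_orders_stream
    WHERE METADATA$ACTION = 'INSERT';  -- handling updates and deletes would need a MERGE

  ALTER TASK load_orders_task RESUME;  -- tasks are created in a suspended state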
Additionally, managing this setup requires maintaining multiple objects, including the table, stream, and tasks.
Dynamic tables address these issues by simplifying the data transformation process. With dynamic tables, you don't need to set up multiple objects or write complex code. Instead, you follow a declarative approach: you describe the result you want, without specifying each step to get there. Snowflake handles the details internally and transforms your data as needed.
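For comparison, here's a minimal sketch of the same pipeline as a dynamic table; this one statement replaces the target table, stream, and task above (the TARGET_LAG value is illustrative):

  CREATE OR REPLACE DYNAMIC TABLE orders_target_dt
    TARGET_LAG = '5 minutes'
    WAREHOUSE  = my_wh
  AS
    SELECT order_id, amount, order_date
    FROM raw_orders;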
Both approaches deliver the same results, but the amount of code required can differ significantly!
One key question that often arises is whether SQL is truly capable of performing data transformations!
Let’s take a closer look at this dynamic table SQL, sketched a little further below!
It might seem lengthy at first, but it's actually quite simple to understand.
The length is mainly because we wanted to demonstrate nearly all major types of data transformations.
But remember, if you were to use tasks and streams for this, you'd end up writing two to three times as much code.
Let’s understand these data transformations!
Imagine we have two source tables: customers and orders. Our goal is to create, and periodically refresh, a customer-level sales summary table that will be consumed by our downstream reporting application.
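Here's a sketch of what such a dynamic table might look like; the table and column names, cutoff date, and lag are assumptions for illustration:

  CREATE OR REPLACE DYNAMIC TABLE customer_sales_summary
    TARGET_LAG = '1 hour'
    WAREHOUSE  = my_wh
  AS
    SELECT
      c.customer_id,
      TRIM(c.name)   AS customer_name,   -- cleansing: strip leading/trailing spaces
      LOWER(c.email) AS email,           -- standardization: lowercase emails
      -- cleansing + standardization: strip non-numeric characters, then
      -- format as (XXX) XXX-XXXX (assumes 10-digit numbers)
      CONCAT('(', SUBSTR(REGEXP_REPLACE(c.phone, '[^0-9]', ''), 1, 3), ') ',
             SUBSTR(REGEXP_REPLACE(c.phone, '[^0-9]', ''), 4, 3), '-',
             SUBSTR(REGEXP_REPLACE(c.phone, '[^0-9]', ''), 7, 4)) AS phone,
      SUM(o.order_amount) AS total_sales -- aggregation: total sales per customer
    FROM customers c
    JOIN orders o                        -- joining the two datasets
      ON o.customer_id = c.customer_id
    WHERE o.order_date > '2024-12-31'    -- filtering: only orders placed after 2024
    GROUP BY c.customer_id, c.name, c.email, c.phone;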
First, the SQL trims any leading or trailing spaces from the NAME field, and removes non-numeric characters from the PHONE field. This step ensures the data is clean and ready for further processing.
Next, we move on to data standardization. All email addresses are converted to lowercase, a common practice to prevent inconsistencies. The phone numbers are also formatted into a standard, easy-to-read format.
Then, the SQL filters the orders table to include only those placed after 2024, ensuring that only the most recent records make it into the final output.
After filtering, the SQL joins multiple datasets, combining the standardized customer data with the filtered order data, linking each customer with their corresponding orders.
Finally, we perform data aggregation, where the SQL calculates the total sales per customer, giving us a clear picture of how much each customer has spent.
The final output of this dynamic table is a clean, standardized, filtered, joined, and aggregated dataset, perfectly ready for BI reporting.
What’s truly remarkable is that this dynamic table automatically and incrementally refreshes the target table based on the changes in the underlying source tables.
All of this is achieved in a single concise script, thanks to the declarative approach of dynamic tables!
Let’s jump right into the UI and see this in action.
I’ve already set up some foundational objects for this lab and loaded the source tables with initial data. As always, you can find a link in the description to copy all the SQL used here.
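If you'd like to follow along, here's an illustrative version of that setup; the table definitions and sample rows are assumptions, not the exact lab script:

  CREATE OR REPLACE TABLE customers (
    customer_id NUMBER,
    name        STRING,
    email       STRING,
    phone       STRING
  );

  CREATE OR REPLACE TABLE orders (
    order_id     NUMBER,
    customer_id  NUMBER,
    order_amount NUMBER(10,2),
    order_date   DATE
  );

  INSERT INTO customers VALUES
    (1, '  Alice Smith ', 'Alice.Smith@Example.com', '(555) 123-4567'),
    (2, 'Bob Jones',      'BOB@example.com',         '555.987.6543');

  INSERT INTO orders VALUES
    (100, 1, 250.00, '2025-01-15'),
    (101, 2,  99.99, '2025-02-03');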
Now, let’s move on to the main part of this lab: creating the dynamic table with all the transformations we’ve just discussed.
In a dynamic table, data refresh is fully automatic. You just set the TARGET_LAG parameter, and Snowflake checks the source data at optimal intervals and refreshes as necessary to stay within that lag.
You can also set the REFRESH_MODE to INCREMENTAL or FULL, depending on your needs.
And with the INITIALIZE parameter, you can populate the dynamic table as soon as it's created, or wait for the first scheduled refresh based on your TARGET_LAG configuration.
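Putting those options together, here's a sketch (the values are illustrative, and the query is simplified for brevity):

  CREATE OR REPLACE DYNAMIC TABLE customer_sales_summary
    TARGET_LAG   = '10 minutes'   -- maximum staleness Snowflake will tolerate
    REFRESH_MODE = INCREMENTAL    -- or FULL; AUTO lets Snowflake choose
    INITIALIZE   = ON_CREATE      -- or ON_SCHEDULE to wait for the first refresh window
    WAREHOUSE    = my_wh
  AS
    SELECT customer_id, SUM(order_amount) AS total_sales
    FROM orders
    GROUP BY customer_id;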
Let’s try to query the dynamic table.
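For example:

  SELECT *
  FROM customer_sales_summary
  ORDER BY total_sales DESC;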
Our fully transformed customer sales summary table is ready for consumption!
Before we close, I’d like to quickly highlight a few more points.
Unlike on many other data platforms, you don't need separate code or technologies for batch and near-real-time processing. You can easily switch between them with a simple ALTER statement.
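For example, assuming our table from earlier (the lag values are illustrative):

  ALTER DYNAMIC TABLE customer_sales_summary SET TARGET_LAG = '1 minute';  -- near real time
  ALTER DYNAMIC TABLE customer_sales_summary SET TARGET_LAG = '24 hours';  -- batch-style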
Using the UI, you can monitor the dynamic table’s refresh history. You also get this impressive lineage graph!
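If you prefer SQL over the UI, the same refresh history is available through a table function (the table name here is illustrative):

  SELECT *
  FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(
    NAME => 'CUSTOMER_SALES_SUMMARY'));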
That's all for today! Please stay tuned for our next video where we’ll explore more advanced Snowflake features!
If you found this video helpful, please give it a thumbs up, and hit that subscribe button to stay up-to-date with the latest technologies!
Thanks for watching!