Snowpipe & Snowpipe Streaming | Snowflake Pipe Example | Kafka Connector
Hello Data Pros, and welcome back to another exciting episode of our Snowflake learning series!
In our previous video, we covered Snowflake Tasks, and demonstrated how you can automate processes within Snowflake.
Today, we will explore Snowpipe, a powerful Snowflake feature that offers near real-time data ingestion.
Before we dive into Snowpipe, let’s understand the three primary methods for ingesting data into Snowflake.
The COPY command is the traditional way of ingesting data into Snowflake. Generally, data from applications, IoT devices, and other sources is first ingested into cloud storage locations like an S3 bucket by external tools. Later, the data is copied from these cloud storage locations to Snowflake.
You can manually trigger the COPY command or create a task to run it automatically at predefined intervals. This is what we’ve demonstrated in our previous video.
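To make that pattern concrete, here is a minimal sketch of a scheduled COPY, assuming placeholder object names (orders_raw, my_s3_stage, and my_wh are illustrative, not the exact objects from the previous video):

```sql
-- A task that runs a COPY command every five minutes (all object names are placeholders).
CREATE OR REPLACE TASK load_orders_task
  WAREHOUSE = my_wh
  SCHEDULE  = '5 MINUTE'
AS
  COPY INTO orders_raw
  FROM @my_s3_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Tasks are created in a suspended state; resume the task to start the schedule.
ALTER TASK load_orders_task RESUME;
```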
Snowpipe operates a little differently. Instead of specifying a schedule, you set up an event notification from your cloud storage. Whenever a file is uploaded, a notification is sent to Snowflake, which triggers the pipe and its associated COPY statement. We'll demonstrate this later in the video.
Snowpipe Streaming is a totally different approach, where streaming rows are continuously loaded directly into Snowflake tables from your external applications.
You don’t need to set up any tasks or pipes in Snowflake. Instead, within your client application that generates or processes the streaming data, you import the Snowpipe Streaming Ingest SDK, create a client connection to your Snowflake account, and establish a streaming channel to insert data directly into a Snowflake table. This is just pseudocode to illustrate the concept. For a complete working sample, please refer to the link in the description.
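As a rough sketch of what that client code can look like with the Snowflake Ingest Java SDK: the connection properties, channel, and table names below are placeholders, not the exact sample from the description link.

```java
// Minimal sketch of Snowpipe Streaming ingestion (snowflake-ingest-java SDK).
// All connection properties and object names are placeholders.
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import net.snowflake.ingest.streaming.OpenChannelRequest;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestChannel;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClient;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClientFactory;

public class StreamingIngestSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("url", "https://<account>.snowflakecomputing.com"); // placeholder account URL
        props.put("user", "<user>");                                  // placeholder user
        props.put("private_key", "<private-key>");                    // key-pair authentication
        props.put("role", "<role>");                                  // placeholder role

        // 1. Create a client connection to your Snowflake account.
        try (SnowflakeStreamingIngestClient client =
                 SnowflakeStreamingIngestClientFactory.builder("MY_CLIENT")
                                                      .setProperties(props)
                                                      .build()) {

            // 2. Establish a streaming channel against the target table.
            SnowflakeStreamingIngestChannel channel = client.openChannel(
                OpenChannelRequest.builder("MY_CHANNEL")
                                  .setDBName("MY_DB")
                                  .setSchemaName("PUBLIC")
                                  .setTableName("ORDERS_STREAMING")
                                  .setOnErrorOption(OpenChannelRequest.OnErrorOption.CONTINUE)
                                  .build());

            // 3. Insert rows directly into the table as your application produces them.
            Map<String, Object> row = new HashMap<>();
            row.put("ORDER_ID", 1001);
            row.put("AMOUNT", 49.99);
            channel.insertRow(row, "offset-1"); // offset token used for de-duplication tracking

            channel.close().get();
        }
    }
}
```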
Additionally, Snowpipe Streaming supports ingesting rows from Apache Kafka topics directly into Snowflake tables.
Kafka is an event streaming platform designed for building real-time data pipelines between systems and applications.
However, if you need to connect Kafka to non-streaming platforms like databases or file stores, you must write complex code, similar to the code sample we’ve discussed previously.
To simplify this process, and to provide a scalable solution, Kafka includes a component called Kafka Connect. This enables easy integration between Kafka and external systems such as databases and file stores.
Focusing on ingesting streaming data into Snowflake: on the system where Kafka Connect is running, you'll need to set up a configuration file similar to the one you see here.
Internally, the SnowflakeSinkConnector package uses these configurations to interact with the Snowpipe Streaming API, eliminating the need for complex coding on the user’s end.
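For illustration, a standalone connector properties file could look roughly like this; every value shown (account URL, credentials, topic, database, schema, and table mapping) is a placeholder.

```properties
# Representative Kafka Connect properties for the Snowflake sink connector.
# Account, credentials, topic, and table names are placeholders.
name=snowflake_sink_connector
connector.class=com.snowflake.kafka.connector.SnowflakeSinkConnector
tasks.max=4
topics=orders_topic
snowflake.url.name=<account>.snowflakecomputing.com:443
snowflake.user.name=<user>
snowflake.private.key=<private-key>
snowflake.role.name=<role>
snowflake.database.name=MY_DB
snowflake.schema.name=PUBLIC
snowflake.topic2table.map=orders_topic:ORDERS_STREAMING
# Use the Snowpipe Streaming API rather than file-based Snowpipe.
snowflake.ingestion.method=SNOWPIPE_STREAMING
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
```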
People often confuse Snowflake Streams with Snowpipe Streaming, but they serve different purposes. Snowflake Streams are schema-level objects used for Change Data Capture, whereas Snowpipe Streaming is a framework for ingesting streaming data into Snowflake.
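For contrast, here is what a Snowflake Stream looks like; the table and stream names are placeholders.

```sql
-- A stream records inserts, updates, and deletes on its source table for CDC.
CREATE OR REPLACE STREAM orders_changes ON TABLE orders_raw;

-- Querying the stream returns only the rows that changed since it was last consumed.
SELECT * FROM orders_changes;
```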
Moving on to compute resources: a COPY statement run through a task can use a user-defined warehouse, but if you remove the warehouse parameter, the task defaults to serverless Snowflake compute.
In contrast, both Snowpipe and Snowpipe Streaming always use serverless compute.
Next, let's talk about latency, the key differentiator. The COPY statement typically has a latency of one minute or more, depending on the schedule. Snowpipe, using event-based triggers, improves latency to around 30 seconds.
On the other hand, Snowpipe Streaming offers the lowest latency of all three methods, around 5 seconds or even less, as it bypasses intermediate cloud file staging.
COPY is best for bulk and batch data loading use cases.
Snowpipe handles near real-time data ingestion. It’s designed for scenarios where you want to automatically load data once it arrives in your cloud storage.
Snowpipe Streaming is used for real-time data ingestion. It’s perfect for continuous streaming of data.
With this understanding, let’s jump into the UI, and see Snowpipe in action!
I’ve already set up some foundational objects required for this lab.
As always, you can find a link in the description to copy all the SQL scripts used here.
This command creates a Snowpipe object.
As you can see, we do not use any schedule or warehouse parameters. Instead, we simply set AUTO_INGEST to TRUE.
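For reference, a pipe of this shape looks like the following; the pipe, table, and stage names are placeholders rather than the exact ones used in the lab.

```sql
-- No schedule, no warehouse: just AUTO_INGEST = TRUE and the COPY statement to run.
CREATE OR REPLACE PIPE orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO orders_raw
  FROM @my_s3_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```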
Once the pipe is created, please run SHOW PIPES.
The most important thing to look for is the notification channel. Snowflake automatically creates a message queue within its internal cloud account, and that queue's resource identifier is shown here in the notification_channel column.
We will need to set up event notifications to this specific queue, so please copy it.
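If you prefer to grab just that value, you can also pull it out of the SHOW PIPES result; the column names here come from the SHOW output.

```sql
SHOW PIPES;

-- Optionally, select only the pipe name and its notification channel from the last SHOW output.
SELECT "name", "notification_channel"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
```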
Log in to your AWS console.
Though we’re doing this on AWS, the concept remains the same across all clouds.
Navigate to the S3 bucket from which you want to ingest files into the Snowflake table.
Select properties.
Scroll down and locate the Event Notifications section.
Click on "Create Event Notification."
Give it any name you wish.
You can limit notifications to a specific folder or file extension, but I’m not setting this up for now.
Select the event types for which you want to receive notifications. I’ll choose "All object create events."
Next, choose SQS Queue and select "Enter SQS queue ARN". Paste the ARN you copied earlier from the SHOW PIPES output.
Save the changes.
That’s it! We’ve successfully created an event notification for the pipe object.
From now on, any file you upload to the S3 bucket will generate an event notification, which will subsequently trigger the pipe execution.
Let’s validate the setup.
As you can see, there are no records in the table yet.
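A quick way to check this yourself, assuming placeholder table and pipe names, is:

```sql
-- Confirm the target table is still empty and the pipe is in a running state.
SELECT COUNT(*) FROM orders_raw;
SELECT SYSTEM$PIPE_STATUS('orders_pipe');
```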
Now, I’m going to upload a new file to the S3 bucket.
Please remember the latency we discussed: Snowpipe generally has a latency of approximately 30 seconds from the time a file is uploaded to cloud storage.
I’m not fast-forwarding so you can observe the latency.
Yes, we now have the records.
Snowflake keeps a record of processed files, so even if you upload another file, only the records from the new file will be loaded into the table.
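If you want to see that load history yourself, the COPY_HISTORY table function shows which files have been processed; the table name here is a placeholder.

```sql
-- Files loaded into the target table over the past hour, including files loaded by the pipe.
SELECT file_name, last_load_time, row_count, status
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
       TABLE_NAME => 'ORDERS_RAW',
       START_TIME => DATEADD(HOUR, -1, CURRENT_TIMESTAMP())));
```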
That's all for today! Please stay tuned for our next video where we’ll explore more advanced Snowflake features!
If you found this video helpful, please give it a thumbs up, and hit that subscribe button to stay up-to-date with the latest technologies!
Thanks for watching!