Snowflake Advanced table types - External | Iceberg | Dynamic | Hybrid | Event | Directory

Hello Data Pros, welcome back to another exciting episode of our Snowflake learning series!

In our previous video, we dove deep into the three main Snowflake table types: permanent, temporary, and transient. Today, we'll move forward and learn about six additional table types available within the Snowflake platform, along with their specific use cases!

Snowflake External Table
Snowflake Iceberg Table
Snowflake Dynamic Table
Snowflake Hybrid Table
Snowflake Event Table
Snowflake Directory Table


--  External Tables, Iceberg Tables, Dynamic tables, Hybrid Tables, Event Tables, Directory Tables  --
--             All commands/statements can be found via the link in the video description           --

-- <<<<<<<<<<<<< External Tables >>>>>>>>>>>>> --

CREATE OR REPLACE STAGE oms_ext_stg
  URL='s3://s3explore/'
  CREDENTIALS=(AWS_KEY_ID='SDFEDU5ADSAF34YEDASDF0' AWS_SECRET_KEY='hxR5q/k2Gp8l3vTnDUkk6jDWH9EAmGk4VZcOy7');
--You can refer to our previous videos to learn more about creating stages.

CREATE OR REPLACE FILE FORMAT my_csv_format
    TYPE = 'CSV'
    SKIP_HEADER = 1;

CREATE OR REPLACE EXTERNAL TABLE my_external_table (
    employeeid VARCHAR AS (VALUE:c1::VARCHAR),
    firstname VARCHAR AS (VALUE:c2::VARCHAR),
    lastname VARCHAR AS (VALUE:c3::VARCHAR),
    email VARCHAR AS (VALUE:c4::VARCHAR)
)
WITH LOCATION = @oms_ext_stg/employees
FILE_FORMAT = (FORMAT_NAME = my_csv_format);

select  METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, Value, employeeid, firstname, lastname, email from my_external_table;

-- Alternative for external tables
SELECT METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, $1, $2, $3, $4
FROM @oms_ext_stg/employees
(FILE_FORMAT => 'my_csv_format');

-- Note: You cannot create an external table on top of internal stages; however, you can directly query internal stages as above
-- Advantages of External tables over direct stage queries:
    -- Performance: External tables are faster than direct stage queries due to the metadata they collect and store.
    -- Management: Easier to manage and automate refreshes to reflect changes in the data.
    -- Functionality: Supports advanced features like partitioning and materialized views for further optimization (see the partitioning sketch after this list).
    -- Security: Leverages Snowflake's robust table access control mechanisms for enhanced data security.
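
-- A minimal partitioning sketch (illustrative assumption: data files are organized under
-- date-named subfolders such as employees/2024-01-31/, so the partition column can be
-- derived from the METADATA$FILENAME pseudocolumn):
CREATE OR REPLACE EXTERNAL TABLE my_partitioned_ext_table (
    file_date DATE AS TO_DATE(SPLIT_PART(METADATA$FILENAME, '/', 2), 'YYYY-MM-DD'),
    employeeid VARCHAR AS (VALUE:c1::VARCHAR),
    firstname VARCHAR AS (VALUE:c2::VARCHAR),
    lastname VARCHAR AS (VALUE:c3::VARCHAR),
    email VARCHAR AS (VALUE:c4::VARCHAR)
)
PARTITION BY (file_date)
WITH LOCATION = @oms_ext_stg/employees
FILE_FORMAT = (FORMAT_NAME = my_csv_format);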


-- <<<<<<<<<<<<< Iceberg Tables >>>>>>>>>>>>> --

-- Create an External Volume to hold Parquet and Iceberg data
create or replace external volume my_ext_vol
STORAGE_LOCATIONS =
(
    (
        NAME = 'my-s3-us-east-1'
        STORAGE_PROVIDER = 'S3'
        STORAGE_BASE_URL = 's3://my-s3-bucket/data/snowflake_extvol/'
        STORAGE_AWS_ROLE_ARN = '****'
    )
);

-- Create an Iceberg Table using my External Volume
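-- Note: depending on your catalog configuration, CATALOG = 'SNOWFLAKE' and a BASE_LOCATION
-- may also need to be specified for a Snowflake-managed Iceberg table.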
create or replace iceberg table my_iceberg_table
    with EXTERNAL_VOLUME = 'my_ext_vol'
    as select id, date, first_name, last_name, address, region, order_number, invoice_amount from sales;
   

-- <<<<<<<<<<<<< Dynamic Tables >>>>>>>>>>>>> --

-- WITHOUT DYNAMIC TABLE -- [Source > Transform > Target]
-- Source Table Creation
-- Please assume frequent data loading into the source table by another process.
CREATE OR REPLACE TABLE raw
(var VARIANT);

-- Target Table Creation
CREATE OR REPLACE TABLE names
(id INT,
first_name STRING,
last_name STRING);

-- Stream Creation
CREATE OR REPLACE STREAM rawstream1
ON TABLE raw;

-- Task Creation
CREATE OR REPLACE TASK raw_to_names
WAREHOUSE = mywh
SCHEDULE = '1 minute'
WHEN
SYSTEM$STREAM_HAS_DATA('rawstream1')
AS
MERGE INTO names n
USING (
SELECT var:id id, var:fname fname,
var:lname lname FROM rawstream1
) r1 ON n.id = TO_NUMBER(r1.id)
WHEN MATCHED AND metadata$action = 'DELETE' THEN
DELETE
WHEN MATCHED AND metadata$action = 'INSERT' THEN
UPDATE SET n.first_name = r1.fname, n.last_name = r1.lname
WHEN NOT MATCHED AND metadata$action = 'INSERT' THEN
INSERT (id, first_name, last_name)
VALUES (r1.id, r1.fname, r1.lname);

-- WITH DYNAMIC TABLE -- [Source > Transform > Target]
-- Source Table Creation
-- Please assume frequent data loading into the source table by another process.
CREATE OR REPLACE TABLE raw
(var VARIANT);

-- Dynamic Target Table Creation
CREATE OR REPLACE DYNAMIC TABLE names
TARGET_LAG = '1 minute'
WAREHOUSE = mywh
AS
SELECT var:id::int id, var:fname::string first_name,
var:lname::string last_name FROM raw;
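
-- Optional housekeeping (a brief sketch): dynamic tables refresh automatically based on TARGET_LAG,
-- but refreshes can also be triggered or paused manually.
ALTER DYNAMIC TABLE names REFRESH;
ALTER DYNAMIC TABLE names SUSPEND;
ALTER DYNAMIC TABLE names RESUME;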

-- <<<<<<<<<<<<< Hybrid Tables >>>>>>>>>>>>> --

-- Create hybrid table for Product Inventory
CREATE OR REPLACE HYBRID TABLE ProductInventory (
  product_id INT PRIMARY KEY,
  product_name VARCHAR(100) NOT NULL,
  category VARCHAR(50) NOT NULL,
  quantity INT NOT NULL,
  price DECIMAL(10, 2) NOT NULL
);

-- Insert data into the Product Inventory table
INSERT INTO ProductInventory VALUES(1001, 'Laptop', 'Electronics', 10, 999.99);
INSERT INTO ProductInventory VALUES(1002, 'Smartphone', 'Electronics', 20, 599.99);
INSERT INTO ProductInventory VALUES(1003, 'Printer', 'Office Supplies', 5, 299.99);
INSERT INTO ProductInventory VALUES(1004, 'Chair', 'Furniture', 30, 49.99);

-- Update quantity of a product
UPDATE ProductInventory SET quantity = 25 WHERE product_id = 1002;

-- Delete a product from inventory
DELETE FROM ProductInventory WHERE product_id = 1004;
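
-- A hedged sketch: hybrid tables also support secondary indexes to speed up lookups
-- on non-primary-key columns (idx_category is an illustrative name).
CREATE INDEX idx_category ON ProductInventory(category);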

-- Note: Hybrid tables are currently not available to trial accounts.


-- <<<<<<<<<<<<< Event Tables >>>>>>>>>>>>> --

-- Create Event Table
CREATE OR REPLACE EVENT TABLE SLEEK_OMS.L1_LANDING.oms_event_table;

-- Associate Event Table with Account
ALTER ACCOUNT SET EVENT_TABLE = SLEEK_OMS.L1_LANDING.oms_event_table;

-- Push logs (info, warnings, errors) / events
ALTER SESSION SET LOG_LEVEL = INFO;
CREATE OR REPLACE FUNCTION log_trace_data()
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = 3.8
HANDLER = 'run'
AS $$
import logging
logger = logging.getLogger("tutorial_logger")

def run():
  logger.info("Logging from Python function.")
  return "SUCCESS"
$$;
SELECT log_trace_data();

-- Display/Verify the contents of the Event table (a little delay/lag is expected)
select * from SLEEK_OMS.L1_LANDING.oms_event_table;
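
-- A brief sketch of filtering on the predefined event table columns (log records only):
SELECT TIMESTAMP, RECORD['severity_text'] AS severity, VALUE AS message
FROM SLEEK_OMS.L1_LANDING.oms_event_table
WHERE RECORD_TYPE = 'LOG'
ORDER BY TIMESTAMP DESC;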


-- <<<<<<<<<<<<< Directory Tables >>>>>>>>>>>>> --
-- To enable a directory table for a stage
CREATE OR REPLACE STAGE my_int_stg
DIRECTORY = (ENABLE = TRUE);
-- OR
ALTER STAGE my_int_stg SET DIRECTORY = ( ENABLE = TRUE );


-- PUT needs to be executed in SnowSQL. If you haven't set SnowSQL up yet, please refer to the initial videos in this playlist.
PUT 'file://C:/Users/mamba/Desktop/csvfiles/dates.csv' @my_int_stg;
PUT 'file://C:/Users/mamba/Desktop/csvfiles/customers.csv' @my_int_stg;
PUT 'file://C:/Users/mamba/Desktop/csvfiles/employees.csv' @my_int_stg;
PUT 'file://C:/Users/mamba/Desktop/csvfiles/stores.csv' @my_int_stg;

-- TO REFRESH directory table metadata
ALTER STAGE my_int_stg REFRESH;

-- To query directory table contents
SELECT * FROM DIRECTORY( @my_int_stg );
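
-- A brief sketch using the predefined directory table columns (e.g., only CSV files, newest first):
SELECT RELATIVE_PATH, SIZE, LAST_MODIFIED, FILE_URL
FROM DIRECTORY( @my_int_stg )
WHERE RELATIVE_PATH ILIKE '%.csv'
ORDER BY LAST_MODIFIED DESC;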





Let’s jump right in, and start with external tables!

External tables are a Snowflake feature that lets you query raw data files stored in an external stage as if they were native tables within Snowflake.

These tables are read-only, meaning you cannot perform insert, update, or delete operations on them. However, they support select queries and can be used in joins with other Snowflake native tables.
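For instance, here's a brief, hedged sketch of such a join (the employee_salaries table is hypothetical, purely for illustration):

SELECT e.employeeid, e.firstname, e.lastname, s.salary
FROM my_external_table e
JOIN employee_salaries s ON e.employeeid = s.employeeid;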

External tables are relatively slow compared to native Snowflake tables. To enhance query performance, consider creating a materialized view on top of the external table. External tables are best suited for use cases such as cost-effectively querying rarely used data in your cloud storage, or analyzing data in your data lake without moving it into Snowflake.
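For example, here's a minimal sketch of such a materialized view over the external table defined earlier (note that materialized views require Enterprise Edition or higher):

CREATE OR REPLACE MATERIALIZED VIEW mv_employees AS
SELECT employeeid, firstname, lastname, email
FROM my_external_table;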

Let’s get into Snowsight and see how we create external tables!

All these statements can be found via the link in the video description!

When querying data, you have the option to include three additional columns: The name of the file from which the record originated, the row number, and the complete row values in JSON format.

You might wonder why external tables are necessary when you can directly query data from external stages.

Well, use direct querying for one-time analysis, or when you need to query files in an internal stage.

On the other hand, opt for external tables for repeated queries or when you require better performance, manageability, or security.

Up Next, we’re gonna explore Iceberg Tables!

Database systems generally store their table data in their own proprietary formats. For instance, Amazon Redshift uses its own format, while Snowflake uses micro-partitions for data storage. Accessing and processing this underlying data is typically limited to the respective compute engine: Redshift data cannot be processed using Snowflake's compute engine, and vice versa.

However, Apache Iceberg is an open table format, enabling data accessibility and management across various computing engines like Spark, Trino, PrestoDB, Flink, Hive, and even Snowflake.

In Snowflake, you can create Iceberg tables pointing to a data volume or external cloud storage where Iceberg table data is located. Here's a sample DDL for creating an Iceberg table in Snowflake. Unlike external tables, Iceberg tables generally support all table operations, including insert, update, and delete.
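For instance, here's a brief sketch of DML against the Iceberg table created above (the predicates are illustrative only):

UPDATE my_iceberg_table SET region = UPPER(region);
DELETE FROM my_iceberg_table WHERE invoice_amount <= 0;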

This diagram illustrates how Iceberg table data can be accessed and managed by both Snowflake and Spark compute engines.

Moving forward, let's explore Dynamic tables.

Dynamic tables are truly fascinating; they serve an entirely different purpose! They work as a smart replacement for traditional data engineering pipelines!

In the past, data engineers relied on Streams and Tasks to define various aspects of their data transformation pipeline. This process typically involved creating individual streams for each source table to capture data changes. Subsequently, tasks were created to run at specific schedules, processing the new data received,  and merging it with the target table.

We'll be covering streams and tasks more extensively in another video. However, for our current discussion, just be aware that this process is relatively complex, involving multiple steps, and can be time-consuming in terms of data engineering development.

Snowflake addresses these challenges with Dynamic tables!

Dynamic tables eliminate the need to create streams and tasks. With a simple CTAS-style (CREATE TABLE AS SELECT) statement, you define the end state of the target table, and Snowflake smartly handles the rest.

Here is an example that illustrates a data engineering pipeline that gets data from a single source table, transforms the data, and loads it into a target table.

First, we set up this pipeline using Streams and Tasks. Please take a look at the number of code lines required.

Next, we set up the same data pipeline using a dynamic table! As clearly evident, with Dynamic Tables, data pipeline development becomes much simpler to set up, and easier to manage.

Shifting gears, let's talk about Hybrid Tables!

As we're aware, traditional Snowflake tables store data in columnar storage, optimized primarily for read-heavy OLAP workloads. However, they are not well suited for OLTP processing, particularly in scenarios requiring concurrent and frequent small updates.

Snowflake introduced Hybrid Tables to address these limitations!  

Unlike regular tables, hybrid tables use a row-based store as their primary data store, ensuring excellent OLTP performance. Additionally, data from the primary store is internally synchronized with a secondary columnar store and cache. This secondary columnar storage also delivers decent OLAP performance. This dual functionality is why they are called hybrid tables. Because they unite OLTP and OLAP workloads on a single platform, they are also referred to as Unistore tables.

In addition to row-based storage, hybrid tables also support many essential features typically required for transactional tables, such as row-level locking, referential integrity, UNIQUE constraints, and indexes.
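As a hedged sketch of those features, the following hypothetical OrderLine table references the ProductInventory table from the demo script:

CREATE OR REPLACE HYBRID TABLE OrderLine (
  order_line_id INT PRIMARY KEY,
  order_number VARCHAR(20) UNIQUE,
  product_id INT NOT NULL,
  quantity INT NOT NULL,
  FOREIGN KEY (product_id) REFERENCES ProductInventory(product_id),
  INDEX idx_product_id (product_id)
);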

Here is the sample DDL statement for creating a hybrid table named Product Inventory. 

Up Next, we'll delve into Event Tables!

These tables are specially designed to capture and store logs and events, generated by your Stored Procedures and User Defined Functions. These events could be errors encountered during data processing, warnings, informational messages, as well as debug and trace events.

Setting up an Event Table for logging involves three steps. 

First, we create the Event Table. It's important to note that when creating an Event Table, we do not specify the column names, as it already has a predefined set of columns.

Next, we associate the Event Table with the Snowflake account. 

Finally, we start pushing messages to the Event Table from Stored Procedures and User Defined Functions as required.

Let's proceed to verify the contents of our event table. Yes, it has received the info log that we pushed from the Python function.

While not within the scope of our current discussion, based on the entries in this Event Table, you can even configure timely alerts and emails.

We've now reached the final type of table in Snowflake: Directory Tables! These are special tables that store file-level metadata about the data files in a particular stage.

We typically create other tables using explicit CREATE TABLE statements; in contrast, a directory table is added to a stage as part of the CREATE STAGE statement, or later with the ALTER STAGE command.

Similar to event tables, you don't need to specify column names for directory tables, as they come with predefined columns, including the file size, a timestamp of when the file was last modified, and a downloadable Snowflake file URL.

A directory table has no grantable privileges of its own! To query a directory table, you'll need to be part of a role that has either the READ or USAGE privilege on the stage itself.

Directory tables are useful for querying a list of files on a stage, and sharing unstructured data in a stage with other users using a secure view.
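For example, here's a minimal sketch of such a secure view (the view name is illustrative):

CREATE OR REPLACE SECURE VIEW stage_files_v AS
SELECT RELATIVE_PATH, SIZE, LAST_MODIFIED, FILE_URL
FROM DIRECTORY(@my_int_stg);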

Both external and internal stages support directory tables. Interestingly, directory tables do not have names of their own; instead, you use the stage name inside the DIRECTORY() function to access them and obtain a list of the files available in the stage!

That's all for today! Please stay tuned for our next video where we’ll explore more advanced Snowflake features!

If you found this video helpful, please give it a thumbs up, and hit that subscribe button to stay up-to-date with the latest technologies!

We also welcome your questions or thoughts in the discussion section below!

Thanks for watching!

 
