Coding and Database Best Principles

Today we will delve into some basics and best practices to follow for any database design as well as for coding. The key topics I would like to emphasize today are the following:

  1. Coding best practices
  2. Data model design principles
  3. Query writing practices

Before I move on to the topics, I would like to thank my manager and mentor at my previous organization, Kevin Owens, one of the best architects I have known, from whom I learned a great deal about design and the principles that follow.

Coding Practices

Have you ever encountered a query or code that is not formatted properly, or written code in a hurry that does not follow standards, and later been asked to revisit it for a bug fix or enhancement? What went through your mind at that point? Probably something like, "Yikes, that's one crazy query/piece of code; it will take me a while just to understand it before I can fix or enhance it."

Over the years I have come across such queries and code, and they gave me a hard time to read and understand. This is precisely why organizations define coding best practices and ask teams to follow them. In the earlier days there were no tools to format code or intelligent IDEs to suggest better approaches; with technology evolving, we now get better suggestions as we write the code.

Database languages such as Oracle PL/SQL and T-SQL do not enforce strict formatting guidelines, whereas languages like Python do, yet we still tend to ignore them and follow some incorrect practices. So, let's dig into some examples and explore some good practices.

Now let’s say we see a query like this below.

Unformatted Query

What do we observe from the above?

  1. The query is not formatted for readability.
  2. Mixed case is used for fields, tables, and schemas.
  3. Aliases such as a, b, c are used, which do not signify the underlying table names.
  4. Overall, the query lacks readability.

Now let's look at the formatted query below and analyze which best practices were applied (a sketch of such a query also follows the list).

Formatted Query

  1. The query is much more readable, with proper indentation.
  2. Each table is aliased properly; for example, a table is given a meaningful alias such as Tbl.
  3. The alias names are readable and signify what the underlying tables mean.
  4. Keywords are capitalized and joins are properly aligned.
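Since the screenshots do not carry over here, the sketch below shows roughly what a formatted query along these lines looks like; the Customers and Orders tables and their columns are illustrative, not taken from the original screenshots.

SELECT Cust.Customer_Id,
       Cust.Customer_Name,
       Ord.Order_Id,
       Ord.Order_Amount
  FROM Customers Cust
 INNER JOIN Orders Ord
    ON Ord.Customer_Id = Cust.Customer_Id
 WHERE Cust.Customer_State = 'MI'
 ORDER BY Ord.Order_Amount DESC;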

Now let’s look at a simple Trigger that is not formatted properly

Unformatted Trigger

What do we observe here?

  1. No proper formatting of the code
  2. Using some variables that have no meaning
  3. No proper documentation to understand why looping is done.

Now, let’s look at the properly formatted code

Formatted Trigger

We see the following (a sketch of a trigger along these lines follows the list):

  1. Properly formatted and indented code.
  2. Variable names are declared with proper meaning.
  3. Necessary documentation has been provided for code readability.
  4. BEGIN and END blocks are tagged accordingly.
  5. Tables are properly aliased.
  6. The order of the conditions is followed properly.
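Again, since the screenshot does not carry over, here is a minimal sketch of what a formatted trigger along these lines might look like in Oracle PL/SQL; the Customers_Audit table and its columns are hypothetical.

-- Trigger : Trg_Customers_Audit
-- Purpose : Records every insert or update on CUSTOMERS into CUSTOMERS_AUDIT.
CREATE OR REPLACE TRIGGER Trg_Customers_Audit
AFTER INSERT OR UPDATE ON Customers
FOR EACH ROW
DECLARE
    l_Action_Type VARCHAR2(10);
BEGIN
    -- Capture which kind of change fired the trigger
    IF INSERTING THEN
        l_Action_Type := 'INSERT';
    ELSE
        l_Action_Type := 'UPDATE';
    END IF;

    -- Write one audit row per changed customer record
    INSERT INTO Customers_Audit (Customer_Id, Action_Type, Action_Date)
    VALUES (:NEW.Customer_Id, l_Action_Type, SYSDATE);
END Trg_Customers_Audit;
/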

So, based on the above, let's list some of the best practices at the coding level; these can be applied to almost any language (a sketch pulling several of them together follows the list).

  1. Follow proper indentation in the code.
  2. Avoid single-character variable names, as they do not signify any meaning. Use a fully qualified name instead; for example, declare a counter as Cntr or v_Cntr (where v => local variable).
  3. Give a brief one- or two-line description of what the program or query does, if needed (be brief).
  4. Initial-capitalize variable names for better readability.
  5. Write all keywords in capitals.
  6. Default (initialize) variables as a best practice.
  7. Remove any unnecessary variables that are not used in the code.
  8. Initial-capitalize function/procedure/trigger names.
  9. Declare any parameter variables in the function/procedure starting with "p_".
  10. If the language supports parameter direction, signify the direction explicitly in the procedure/function. For example, if we are passing Customer ID as input, call the parameter "P_In_Cust_Id"; if we are returning it, call it "P_Out_Cust_Id"; if it is bi-directional, call it "P_In_Out_Cust_Id".
  11. If a variable has multiple scopes, use the scope prefix accordingly, such as l_<Variable> (l -> local).
  12. Name procedures either sp_<Meaningful Procedure Name> or up_<Meaningful Procedure Name>, and similarly name functions f_<Meaningful Function Name>.
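Putting several of these conventions together, here is a minimal sketch of a procedure in Oracle PL/SQL; the Customers table is the one used later in this post, while the procedure itself is purely illustrative.

-- Procedure : up_Get_Customer_Name
-- Purpose   : Returns the customer name for a given customer id.
CREATE OR REPLACE PROCEDURE up_Get_Customer_Name
(
    p_In_Cust_Id    IN  NUMBER,       -- Input parameter, prefixed with p_In_
    p_Out_Cust_Name OUT VARCHAR2      -- Output parameter, prefixed with p_Out_
)
AS
    l_Cntr NUMBER := 0;               -- Local variable, prefixed with l_ and defaulted
BEGIN
    -- Check whether the customer exists before fetching the name
    SELECT COUNT(*)
      INTO l_Cntr
      FROM Customers Cust
     WHERE Cust.Customer_Id = p_In_Cust_Id;

    IF l_Cntr > 0 THEN
        SELECT Cust.Customer_Name
          INTO p_Out_Cust_Name
          FROM Customers Cust
         WHERE Cust.Customer_Id = p_In_Cust_Id;
    ELSE
        p_Out_Cust_Name := NULL;
    END IF;
END up_Get_Customer_Name;
/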

Data Model Design Principles

Every application stores its data in one form or another; traditionally, that has meant relational databases like Oracle, MySQL, or PostgreSQL. With evolving data patterns and business requirements, business and technology teams increasingly look towards NoSQL databases such as MongoDB. No matter which database we choose, the design should follow some basic principles when defining tables and other objects.

Let's focus first on relational databases (predominantly used for reasons like flexible SQL queries, analytics, etc.). When we talk about relational databases, we tend to talk about normalization of the data, which means organizing your data in the database. We commonly come across these normal forms:

  1. 1NF: First normal form. Every attribute must hold a single value (no multi-valued attributes), so you can see repetition of data in the table. For example, if an employee has two phone numbers you will see two records for the same employee ID, causing duplication.
  2. 2NF: Second normal form. Non-key attributes must depend on the whole unique key (the primary key), so a given key returns just one record, not multiple. If an employee has, say, two addresses, we split the data into two tables: one with the employee and his/her details (which do not repeat) and one with the employee and address.
  3. 3NF: Third normal form. This reduces data duplication further and is used to achieve data integrity. For instance, if two banking customers share the same address, you would see the same address repeated twice in the customer table for two unique customers. If that address changes, you end up updating two rows. To avoid this, store the address in another table with its own primary key and tag that ID in the customer table. Then an address update touches only one table, keeping data integrity in view.

There is also BCNF (Boyce-Codd normal form), a stricter version of 3NF in which every determinant must be a candidate key.
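As a minimal sketch of the 3NF address example above, the customer's address moves into its own table and the customer row carries only the address key; the table and column names here are illustrative.

CREATE TABLE ADDRESSES
(
    ADDRESS_ID     NUMBER,
    ADDRESS_LINE_1 VARCHAR2(1000 CHAR),
    ADDRESS_CITY   VARCHAR2(100 CHAR),
    ADDRESS_STATE  VARCHAR2(10 CHAR),
    ADDRESS_ZIP    VARCHAR2(20 CHAR),
    CONSTRAINT PK_ADDRESS_ID PRIMARY KEY (ADDRESS_ID)
);

CREATE TABLE BANK_CUSTOMERS
(
    CUSTOMER_ID   NUMBER,
    CUSTOMER_NAME VARCHAR2(100 CHAR),
    ADDRESS_ID    NUMBER,  -- Shared addresses are stored once and referenced here
    CONSTRAINT PK_BANK_CUSTOMER_ID PRIMARY KEY (CUSTOMER_ID),
    CONSTRAINT FK_BANK_CUSTOMER_ADDRESS FOREIGN KEY (ADDRESS_ID) REFERENCES ADDRESSES (ADDRESS_ID)
);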

Some regularly discussed data model methodologies are:

  • Flat Model — single, two-dimensional array of data elements
  • Hierarchical Model — records containing fields and sets defining a parent/child hierarchy
  • Network Model — similar to hierarchical model allowing one-to-many relationships using a junction ‘link’ table mapping
  • Relational Model — collection of predicates over a finite set of predicate variables, defined with constraints on the possible values and combinations of values
  • Star Schema Model — normalized fact and dimension tables removing low cardinality attributes for data aggregations
  • Data Vault Model — records long term historical data from multiple data sources using hub, satellite, and link tables

Now, whether we create data structures in a relational or a non-relational (NoSQL) database, there are some principles and practices to follow.

  • Let's first focus on tables and fields.
    1. Give a meaningful name to the table and avoid any acronyms. Example
      1. Good: CUSTOMERS/CUST_MASTER
      2. Avoid: CUST (Table that holds Customer details)
    2. Give meaningful, self-explanatory names to fields (but avoid overly long names). Example:
      1. Good: CUSTOMER_ID
      2. Bad: CUST_ID (Field that holds Customer Identification Number)
    3. Follow some of the below best practices when naming fields.
      1. For any identification field, suffix the field with _ID (Example: for an order identifier, use ORDER_ID).
      2. For any field that holds a date, suffix the field with _DATE/_DT. Example: Employee Start Date: EMPLOYEE_START_DATE/EMPLOYEE_START_DT, or Record End Date: TERMINATED_DATE/TERMINATED_DT.
        1. Please note: If the database limits field-name length, go with _DT to keep the name meaningful.
      3. For any field that holds a timestamp, suffix the field with _TIMESTAMP/_TS. Example: Updated Time Stamp: UPDATED_TIMESTAMP/UPDATED_TS.
        1. Please note: If the database limits field-name length, go with _TS to keep the name meaningful.
      4. For any field that holds a Boolean, start the field with IS_/HAS_. Example: Has Employee Mobile: HAS_MOBILE.
      5. For any field that holds an amount, suffix the field with _AMT/_AMOUNT. Example: Order Amount: ORDER_AMOUNT/ORDER_AMT.
        1. Please note: If the database limits field-name length, go with _AMT to keep the name meaningful.
      6. For any field that holds a count, suffix the field with _CNT/_COUNT. Example: Order Count: ORDER_CNT/ORDER_COUNT.
        1. Please note: If the database limits field-name length, go with _CNT to keep the name meaningful.
      7. For any field that serves as a composite primary key, suffix the field with _KEY. Example: Customer Key (a field formed by the combination of multiple fields such as Customer ID and Create Date): CUSTOMER_KEY.
      8. Avoid naming the primary key field just "ID"; it is neither meaningful nor self-explanatory.
      9. Always add comments to tables and fields as part of the design, so that anyone looking at the data dictionary from a tool can understand the context of the table and fields. Example (Oracle table):
CREATE TABLE CUSTOMERS
(
    CUSTOMER_ID NUMBER,
    CUSTOMER_NAME VARCHAR2(100 CHAR),
    CUSTOMER_ADDRESS_1 VARCHAR2(1000 CHAR),
    CUSTOMER_ADDRESS_2 VARCHAR2(1000 CHAR),
    CUSTOMER_ADDRESS_3 VARCHAR2(1000 CHAR),
    CUSTOMER_STATE VARCHAR2(10 CHAR),
    CUSTOMER_CITY VARCHAR2(100 CHAR),
    CUSTOMER_ZIP VARCHAR2(20 CHAR),
    CUSTOMER_PHONE VARCHAR2(20 CHAR),
    CUSTOMER_RELATIONSHIP_START_DATE DATE,
    CONSTRAINT PK_CUSTOMER_ID PRIMARY KEY (CUSTOMER_ID)
);

COMMENT ON TABLE CUSTOMERS IS 'This table holds Customer Information.';
COMMENT ON COLUMN CUSTOMERS.CUSTOMER_ID IS 'This holds the Customer ID, Unique Identifier of the Customer.';
COMMENT ON COLUMN CUSTOMERS.CUSTOMER_NAME IS 'Holds the Customer Name.';
COMMENT ON COLUMN CUSTOMERS.CUSTOMER_ADDRESS_1 IS 'Holds the Customer First line of Address.';
COMMENT ON COLUMN CUSTOMERS.CUSTOMER_ADDRESS_2 IS 'Holds the Customer Second line of Address.';
COMMENT ON COLUMN CUSTOMERS.CUSTOMER_ADDRESS_3 IS 'Holds the Customer Third line of Address.';
COMMENT ON COLUMN CUSTOMERS.CUSTOMER_STATE IS 'Holds the Customer State.';
COMMENT ON COLUMN CUSTOMERS.CUSTOMER_CITY IS 'Holds the Customer City.';
COMMENT ON COLUMN CUSTOMERS.CUSTOMER_ZIP IS 'Holds the Customer Address Zip Code.';
COMMENT ON COLUMN CUSTOMERS.CUSTOMER_PHONE IS 'Holds the Customer Phone Number.';
COMMENT ON COLUMN CUSTOMERS.CUSTOMER_RELATIONSHIP_START_DATE IS 'Holds Customer relation start date.';
  • Avoid breaking tables up in too granular a way; this causes more joins and unnecessary complexity. Example:
      • Do: If you want to hold organization information, create one table called ORGANIZATION, key it by a unique ID, and include the other organization details. The table would look like this:
CREATE TABLE ORGANIZATION
(
    ORGANIZATION_ID NUMBER,
    ORGANIZATION_NAME VARCHAR2(100 CHAR),
    <ANY OTHER FIELDS>,
    CONSTRAINT PK_ORGANIZATION_ID PRIMARY KEY (ORGANIZATION_ID)
);
      • DO NOT create one table that contains just the organization ID and name and another table that holds the organization ID and the other details; this causes unnecessary maintenance of tables and joins, as below:
CREATE TABLE ORGANIZATION
(
    ORGANIZATION_ID NUMBER,
    ORGANIZATION_NAME VARCHAR2(100 CHAR),
    CONSTRAINT PK_ORGANIZATION_ID PRIMARY KEY (ORGANIZATION_ID)
);

CREATE TABLE ORGANIZATION_DETAILS
(
   ORGANIZATION_ID NUMBER,
   ORGANIZATION_TYPE VARCHAR2(100 CHAR),
   ORGANIZATION_ADDRESS VARCHAR2(100 CHAR),
   <ANY OTHER FIELDS>,
   CONSTRAINT FK_ORGANIZATION_ID FOREIGN KEY (ORGANIZATION_ID) REFERENCES ORGANIZATION (ORGANIZATION_ID)
);
  • Compute summarizations into their own tables, partitioned to the desired grain, and avoid complex dynamic computation, for performance.
  • De-normalize tables as needed based on the business functionality.
  • Choose a data model and technology as per the business requirements. Example:
    • If the application is primarily OLTP, pick the desired normal form and avoid constant updates on indexed fields.
    • If it is more OLAP, look towards fact and dimension tables with pre-computed data for analytical purposes.
  • Where possible, and if materialized views are supported by the database, use them. They behave like fixed tables, which allows better performance, and they are typically used for summarization. They can serve both OLTP and OLAP workloads.
  • Use views sparingly, as they can be expensive; avoid computation logic in views and keep them as light as possible.
  • Do not add too many indexes on a table; this slows down inserts and updates, and there is quite a bit of maintenance behind the scenes.
  • Partition tables as necessary; this allows queries to run efficiently. For example, if we have a transaction table that receives data daily, partition the table by date (see the sketch after this list).
  • Create indexes with meaningful names, base them on the queries that access the table, and follow the guidance on the number of indexes listed above.
  • For a NoSQL database:
    • Form the key around how queries most commonly retrieve the data, so that reads hit a well-defined range of rows. Start with the most common values and end with the most granular, separating the components with #. Say you have a table that tracks flights (both arrivals and departures); define the key as Direction (Arrival/Departure), followed by City and Timestamp. This allows searching arrivals/departures for a given city and time window in a better manner.
    • Try to consolidate related data into a column family for better consolidation of the data.
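As referenced in the partitioning point above, here is a minimal sketch of a daily-partitioned transaction table in Oracle; the table and column names are illustrative.

-- Interval partitioning creates a new partition automatically for each day of data
CREATE TABLE TRANSACTIONS_DAILY
(
    TRANSACTION_ID   NUMBER,
    TRANSACTION_DATE DATE,
    TRANSACTION_AMT  NUMBER(18,2),
    CONSTRAINT PK_TRANSACTION_ID PRIMARY KEY (TRANSACTION_ID)
)
PARTITION BY RANGE (TRANSACTION_DATE)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(
    PARTITION P_INITIAL VALUES LESS THAN (DATE '2022-01-01')
);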

Query Writing Practices

In very simple terms, a query is a request to the database to return the data you asked for. If we ask a simple question, the answer comes back quickly and efficiently; when the question is complex, the answer takes time. In database terms, the simpler the query, the faster the response; the more complex the query, the slower the response.

Say you want to fetch all employees in a department; you can simply join the necessary tables (say Employee and Department) and return the result. But what if you over-complicate it by joining unnecessary tables? The query will take a while to return the data. Beyond that, there are some ground rules and best practices for writing better, more optimal queries.

Let's look at some of the best practices for writing queries.

  1. Avoid fetching all columns from the joined tables; fetch only those that are really needed. Example:
    1. DO:
      SELECT Tbl1.Field1, Tbl1.Field10, Tbl2.Field2 
        FROM SomeTable Tbl1 
       INNER JOIN SomeTable Tbl2 
          ON Tbl1.Key_Field = Tbl2.Key_Field 
       WHERE Tbl1.Field1 = <Some Condition>
    2. DON’T:
      SELECT * 
        FROM SomeTable Tbl1 
       INNER JOIN SomeTable Tbl2 
          ON Tbl1.Key_Field = Tbl2.Key_Field 
       WHERE Tbl1.Field1 = <Some Condition>
  2. Start with the table that is the primary focus (for example, the one we have an input for), then follow the trail to fetch the other data as desired.
  3. Join the tables in the order the data is accessed rather than joining them haphazardly.
  4. Use ANSI-standard joins to make the queries more compatible across relational databases and to minimize code changes during migration. Examples:
    1. INNER JOIN: Fetches only the rows that match on both sides (based on a key).
    2. LEFT OUTER JOIN: Fetches all records of the left-hand table that match the right, plus the left-table rows that have no match on the right.
    3. RIGHT OUTER JOIN: Fetches all records of the right-hand table that match the left, plus the right-table rows that have no match on the left.
    4. FULL OUTER JOIN: Fetches all records from both left and right, matching and non-matching.
  5. Join the tables following the index column ordering as far as possible so the proper index is used.
  6. Avoid user-defined functions in the WHERE clause as much as possible; they tend to slow down queries. If the need arises, ensure that the table used in the function is indexed, with no additional summarization or conditions.
  7. When joining two sets of data together, if there are no duplicates in either data set, use UNION ALL and avoid UNION, as UNION tries to remove duplicates and sort the data, which is costly in performance.
  8. Do not use a DISTINCT clause or a UNION clause in a sub-query, as they will be more expensive.
  9. Use an EXISTS clause rather than IN if the volume of your sub-query is high (see the sketch after this list).
  10. Use IN in your WHERE clause when your sub-query returns fewer than about 10 records.
  11. Use MERGE (upsert) for optimal performance when updating non-key values or inserting into the table.
  12. Use NVL (Oracle/Hive) or ISNULL (SQL Server) if your WHERE condition needs to match a field through an OR condition. Example:
    1. DO: WHERE NVL(Tbl1.Field1, Tbl1.Field2) = Tbl2.Field3. This gives better performance.
    2. DON'T: WHERE (Tbl1.Field1 = Tbl2.Field3 OR Tbl1.Field2 = Tbl2.Field3). This executes more slowly due to the OR condition.
  13. While writing queries, focus on the table volumes and how much data is kept versus thrown away by the filter conditions.
  14. If needed, use work tables to speed up the process (only possible in procedures or batch jobs) rather than querying a high-volume table repeatedly; load the filtered data once for later use.
  15. Always perform INSERT INTO with the columns specified. See the good/bad practices below.
    1. Good: Having the columns specified allows the table to grow horizontally without affecting the insert (as long as the new fields are nullable).
      INSERT INTO Customers (Customer_Id,
                             Customer_Name,
                             Customer_Address_1,
                             Customer_Address_2,
                             Customer_Address_3,
                             Customer_State,
                             Customer_City,
                             Customer_Zip,
                             Customer_Phone,
                             Customer_Relationship_Start_Date)
        VALUES (Seq_Customers.NextVal,
                'Krishna Vaddadi',
                '7231 Skiles River',
                'Apt. 929',
                Null,
                'MI',
                '29032',
                '789-456-2903',
                SYSDATE);
    2. Bad: Any new field added will cause the insert to fail and require the statement to be amended.
      INSERT INTO Customers 
        VALUES (Seq_Customers.NextVal,
                'Krishna Vaddadi',
                '7231 Skiles River',
                'Apt. 929',
                Null,
                'MI',
                '29032',
                '789-456-2903',
                SYSDATE);
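As referenced in points 9 and 10 above, here is a minimal sketch of the two patterns; the Orders table is hypothetical.

-- IN works well when the sub-query (or list) returns only a handful of values
SELECT Cust.Customer_Id, Cust.Customer_Name
  FROM Customers Cust
 WHERE Cust.Customer_State IN ('MI', 'OH');

-- EXISTS is preferred when the sub-query is high volume
SELECT Cust.Customer_Id, Cust.Customer_Name
  FROM Customers Cust
 WHERE EXISTS (SELECT 1
                 FROM Orders Ord
                WHERE Ord.Customer_Id = Cust.Customer_Id);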

That's it for today; hopefully this gives some insight into best practices. Have a good read!

Time Travel with Snowflake and BigQuery

Hello friends, I am back with another interesting topic called "time travel." I believe many of you wish you could go back in time and fix something. The idea of time travel was made possible in the fictional world of movies and books, including Aditya 369 (released during my teens), recent Marvel movies, and Tenet. Although we know we cannot time travel in the real world, it is possible within the technology world (with databases, for example).

Have you ever landed in a situation where you ran a bad update/insert/delete/drop and wanted to go back to the previous state? To a certain extent you can, with time travel. Specifically, you can go back in time and magically "restore" the previous state.

I have been using Oracle for over a couple of decades, and a common practice is to take a "back-up" of a table before performing any kind of operation. If everything goes fine, we either discard it or leave it for tracking purposes. With major databases like Oracle, if we had not taken this "back-up," we had to reach out to a database administrator (DBA) to find a way to restore the data from logs or tapes (of course, you could restore a dropped table from the recycle bin). Creating a "back-up" is not a perfect solution, however, because it comes with the hefty price of high disk usage and it can be time-consuming.

Newer technologies and databases are built with flexibility and simplicity in mind. Add in the low cost of data storage and the ubiquity of the cloud, and innovation has no limits.

The advantages of technology capable of time travel include:

  1. You can confidently perform the activities that you would like to do and compare them with historical versions (without incurring costs)
  2. It saves time
  3. It reduces disk or storage costs

So, let's jump back to the data warehouse tools that I recently wrote about, which have time travel built in. It does not require any external team's assistance and can be performed by a regular developer/tester with a few commands. They are none other than:

  1. Snowflake
  2. BigQuery

The time travel feature is backed by a "fail-safe" methodology that is completely managed behind the scenes by these vendors. I will explore the fail-safe measure in more detail in later blogs.

Please note that fail-safe does consume storage space, but it is a lot cheaper than traditional methods. You do not need an external team to add disk space, as these technologies use cloud storage, which is inexpensive.

Let’s now jump into the details of each data warehouse and also compare them next to each other.

Snowflake

Snowflake is a multi-cloud data lake/warehouse tool that is fully built on the cloud, for the cloud, and its architecture is a hybrid of shared-disk and shared-nothing database architectures. We looked at its architecture layers in my earlier blog.

Time travel in Snowflake is part of the "Continuous Data Protection Lifecycle," and you can preserve the data for a certain period of time. Using this feature, you can perform certain actions within the time window, including:

  • Query data that has since been updated or deleted
  • Create clones of entire tables, schemas, and databases at or before specific points in the past
  • Restore tables, schemas, and databases that have been dropped

After the time window elapses, the data moves to Snowflake's "fail-safe" zone.

The beauty of this comes with SQL statements, where you can restore (as sketched below):

  1. At a certain point in time (using a timestamp)
  2. At an offset (in seconds from the current time)
  3. Just before a specific SQL statement (query) ID
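A minimal sketch of what those three options look like in Snowflake SQL, using the Transactions table from the scenario below; the timestamp and query ID shown here are placeholders.

-- 1. At a certain point in time (timestamp)
SELECT * FROM Transactions AT(TIMESTAMP => '2021-11-24 10:00:00'::TIMESTAMP_LTZ);

-- 2. At an offset in seconds from the current time (here, 5 minutes ago)
SELECT * FROM Transactions AT(OFFSET => -60*5);

-- 3. Just before a specific statement (query) ID ran
SELECT * FROM Transactions BEFORE(STATEMENT => '<query_id>');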

Snowflake offers different flavors of time travel, and they differ from edition to edition. The Standard edition gives you 24 hours of time travel (like the well-known movie 24 by Surya). Editions above Standard range between 0 and 90 days, but by default the retention is set to 1 day for all account levels.

Now, let’s jump into the technicalities of this time travel.

First, let's check the retention set at our account level; it is set to the default of 1. (Please note I have a 30-day free-tier Snowflake Enterprise account.)

Data Retention Default

I created certain tables while this retention was 1, so all these tables have just 1 day.

Show Tables Retention 1

Now, I increase the retention to 90 days at the account level:

Time Travel Increase

As soon as I change it to 90, all my tables are increased to 90 days!

Time Travel Increase 90
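The screenshots above roughly correspond to commands like the following sketch (account settings and results will obviously differ):

-- Check the retention currently set at the account level
SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' IN ACCOUNT;

-- Raise the retention to 90 days (available on Enterprise edition and above)
ALTER ACCOUNT SET DATA_RETENTION_TIME_IN_DAYS = 90;

-- The retention_time column now shows 90 for the tables
SHOW TABLES;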

Now, let me throw in a scenario. Please note I am using banking and AML nomenclature for tables and conventions.

  1. Our Transactions table has data of different Transaction Types and the Originating Currency and Base currency.
  2. We found an issue that all credit wire transactions (CWTF) have both Originating Currency and Base currency as USD.
  3. This needs to be fixed for only CWTF types.

Let’s see the data spread:

Pre Fix Data

So, we must touch only Line item 3 and update CURRENCY_ORIGINATING from USD to say EUR (about 50K records).

Ideally what we do is:

  1. Take a back-up of Transactions and name it Transactions_20211124 (the date on which we create it)
  2. Update the Transactions table
  3. Finally drop the temp Transactions Table (or leave it for tracking purposes). The drop won’t cost anything, but if needed for audit purposes, it will consume space.

What if we forgot to take a back-up and hit commit (say my session has auto-commit on)? You are doomed and cannot even go back in time to fix it! So, let's run through a scenario where I ran an incorrect statement and updated both currencies to EUR!

Incorrect Update

Upon checking!

Incorrect Post Update Check

Now what? Well, I should not be that worried, as I can still go back in time and restore the table. That would look like this:

  1. Find the SQL ID from the history; for me it is "01a0812f-0000-5c7b-0000-00010fca53ad". Then view the data as of just before that statement:

Transaction Before Update

I can still see the old data, so all I have to do is take this data and re-insert it into the current table. (Although it sounds simple, in production we need to be cautious! If it is an OLTP system where data is constantly pouring in, we need to apply different methods.) For now, I am simply overwriting my original table.

Re Insert Back Old Data
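My restore boiled down to statements along these lines (a sketch; your query ID and table names will differ, and INSERT OVERWRITE is just one way to replace the current contents):

-- View the table as it looked just before the bad update ran
SELECT *
  FROM Transactions
BEFORE(STATEMENT => '01a0812f-0000-5c7b-0000-00010fca53ad');

-- Overwrite the current table with that pre-update state
INSERT OVERWRITE INTO Transactions
SELECT *
  FROM Transactions
BEFORE(STATEMENT => '01a0812f-0000-5c7b-0000-00010fca53ad');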

We see that the data has been restored. Now, I’ll run the correct update.

Pre Fix Data

The correct update updated 50K records.

Correct Update

The final data:

Final Correct Update

Now you have seen the power of time travel and how we can use it when in need.

Let's look at how we can achieve this in BigQuery and what its features are.

BigQuery

BigQuery is a cloud database by Google that also supports time travel. It does not have multiple "flavors" like Snowflake does, but it supports 7 days of time travel. This feature supports the following:

  1. Query data that was updated/deleted
  2. Restore a table that is deleted (aka Dropped)
  3. Restore a table that is expired

If we want to access data beyond 7 days, we can take regular snapshots of the tables (similar to what we might do in Oracle). We can talk about snapshots in later blogs, but for now let's stick to time travel.

Unlike Snowflake, BigQuery only lets us go back to a point in time using a timestamp.

I have a similar table in BigQuery, but with less volume.

Bigquery Dataset

Let's now do the same activity and try to update the row 4 CWTF transaction type's Currency_Originating from USD to EUR (but let's first run an incorrect statement)!

Bigquery Incorrect Update

Bigquer Incorrect Data

Now we have updated the data incorrectly. Let's use BigQuery's timestamp methodology to go back in time and fetch the data from before the update, view it, and restore it (I ran the update at 11:50 AM EST, so I need to go back to before then).

Bq Query Time

Bq Pre Bad Update
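The time-travel query itself looks roughly like this sketch (the project and dataset qualifiers are placeholders; note the explicit -05:00 EST offset, since BigQuery evaluates times in UTC):

-- Read the table as of a timestamp just before the bad update ran
SELECT *
FROM `my_project.my_dataset.TRANSACTIONS`
FOR SYSTEM_TIME AS OF TIMESTAMP('2021-11-26 11:50:37-05:00');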

Now let's restore this data back to the master table. Note that BigQuery does not have an insert-overwrite, so you may either use MERGE (if you know the keys) or truncate and insert. I took the latter approach for my POC, as I don't have a key on which I could simply update. Please avoid truncate-and-insert in any production environment.

Bq Insert Overwrite

The data post restoration:

Bq Restored Data

Now let's cautiously update the data and run the validations; we should see just 33,360 records updated with EUR.

Bq Correct Update

Behold, we did update the data correctly!

Bq Final Output

Now we have seen how to use the timestamp (or the other option of going back in intervals) to fetch data. Remember that BigQuery only holds the data for the past 7 days; after that period, you will lose access to it.

A few additional observations:

  1. You can only go back in time in intervals of hours, so if you specify a time for which the data does not exist, you will see an error: Invalid Time.
  2. It is a little tricky when you are using "FOR SYSTEM_TIME AS OF TIMESTAMP", even though you can get the timestamp from your job history, because you have to provide your time zone, as BigQuery evaluates all times in UTC. For me, I had to give the time as "TIMESTAMP('2021-11-26 11:50:37-05:00')", where -05:00 is the EST offset; provide your own time zone so BigQuery can accurately pull the data.

Synopsis

We have seen how we can safely bring back data from history with a few commands, without relying on DBAs or any other mechanism. Now, I would like to caution a few things:

  1. Time travel is a cool mechanism but it needs to be used with extreme caution.
  2. You need to look at the retention at the account level and table level to ensure you are within the bounds of the restore.
  3. You cannot simply overwrite the current data with data from history, as it may overwrite data that was inserted between the delete and now, so pay extra attention here.

Even though I described the traditional "back-up" as time-consuming and costly, you may still go for it if you need to prepare for an audit. Also remember that, for Snowflake Enterprise accounts and above, Time Travel and Fail-safe together have a maximum life span of 97 days: Fail-safe is 7 days (and out of your control) and Time Travel is 90 days at most.

In this particular use case, I found Snowflake more advantageous because of its ease of use and the span of time you can travel back (90 days versus 7).

In the next session, I will dig into how both systems might have implemented this from an architectural point of view, as I have not seen much documentation on the implementation. Until then, signing off!

Cloning with BigQuery and Snowflake

Today I write my first blog, with the hope of posting more in the future. I call myself a data enthusiast who likes to analyze and understand data in order to write my solutions, because ultimately it's the data that is critical for any application. Without further ado, let's jump into today's topic: cloning on two data warehouse systems:

  1. Snowflake
  2. BigQuery

I've watched numerous movies, and I remember in Marvel Heroes how Loki clones himself many times at the same time! Such cloning may not be possible in the real world, but it is possible in the world of technology. For example, cloning code to a local machine (git clone is a primary example).

Now what about databases (DBs)? Why not clone them? Well, we used to do that, but it was really a copy of the DB that consumed space. You might be wondering why we would need to clone a DB without consuming space. Well, there are many answers. Before we delve into that further, let's pause for a moment on why we clone (aka copy) in the first place. The primary reason is to resolve issues in production: when we move our application live and say, "Oops, I never encountered this kind of data or envisioned this data," "I wish I had such data in my development/test environment so that I could have covered this scenario," or "We don't have this volume of data in my development/test environment, so I could not unearth these issues!"

Cloning also helps with, to name a few:

  1. Enhancing the application and underlying data structures by using a production-like copy of the data
  2. Performance tuning, which is critical because dev/test environments usually lack realistic data volumes

We used to clone monolithic on-prem databases like Oracle, DB2, and SQL Server (aka copy production data to our dev and test environments and perform our tests there). We called this cloning, but if you observe, we were not actually cloning them, but rather copying them. Copying has multiple disadvantages, including:

  1. Duplication of data
  2. Manual effort (Or Some additional tool to perform activity)
  3. Days to replicate (If volume is high)
  4. Data is out of sync with Production
  5. Additional Storage

I personally dealt with copies of extra-large databases, which took anywhere between 2 and 4 days, excluding the time for the post-copy setup (analyzing, etc.).

I used to wish we had a simple command that would clone the DB without occupying space and get my job done without depending on my DBAs. Voila, this turned out to be true when I started studying for my Snowflake SnowPro Core certification and playing with Snowflake.

Snowflake

Let me step back and talk about Snowflake. Snowflake is a multi-cloud data lake/warehouse tool that is fully built on the cloud, for the cloud, and its architecture is a hybrid of shared-disk and shared-nothing database architectures. It has three architecture layers:

  1. Cloud Services: The coordinator and collection of Services.
  2. Query Processing: Brain of the system, where query execution is performed using “virtual warehouses”.
  3. Database Storage: Which physically stores the data in Columnar mode.

Snowflake

Source: Snowflake Documentation

This tool leverages the cloud providers' (AWS/Google/Azure) strengths by using:

  1. Compute Engine (which is extensively used at the Query Processing layer and they call it a Virtual Warehouse)
  2. Cloud Storage (GCS in case of Google, S3 in case of AWS, and Blob Storage in case of Azure)
  3. VPC (networking and other security layer for its different versions)

Snowflake has a wonderful utility called cloning (they call it Zero-Copy Cloning), which can clone a database/table seamlessly with a simple command (for example, create or replace database clone_sales_db clone sales;). This command clones an entire database (with certain restrictions; refer to the Snowflake documentation). The beauty of it comes from something we studied back in college: pointers/references! When you issue a command to clone a database, it simply does the following:

  1. Creates the new database (an object name in the cloud services layer, as normal)
  2. Creates all objects underneath the database (note there are some restrictions under which certain objects do not get cloned); again, these are object names in the services layer

One would expect the data to be copied and held as a "replica," but it is not. What Snowflake smartly does is create a pointer/reference from the cloned object to the source database/tables, which gives the huge advantage of not only (logically) replicating the data but also saving storage costs. When a user queries a table in the cloned object, the cloud services layer simply fetches the data from the source table's storage. In addition, creating the clone does not consume much time; it takes about as long as creating a table.
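A minimal sketch of the commands involved; the metadata query assumes the screenshots below come from the SNOWFLAKE.ACCOUNT_USAGE.TABLE_STORAGE_METRICS view, which exposes the ID, CLONE_GROUP_ID, and ACTIVE_BYTES columns discussed next.

-- Zero-copy clone of a single table
CREATE OR REPLACE TABLE Transactions_Clone CLONE Transactions;

-- Zero-copy clone of an entire database
CREATE OR REPLACE DATABASE Clone_Sales_Db CLONE Sales;

-- Check the storage metadata: the clone gets its own ID but points back to the
-- source through CLONE_GROUP_ID, and its ACTIVE_BYTES start at 0
SELECT Table_Name, Id, Clone_Group_Id, Active_Bytes
  FROM Snowflake.Account_Usage.Table_Storage_Metrics
 WHERE Table_Name IN ('TRANSACTIONS', 'TRANSACTIONS_CLONE');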

See this example:

  • Say I created a table called “Transactions” and loaded about 200K records, the total size of this table is 8.6 MB:

Transactions 1

  • Observe the metadata of this table:

Transactions 1 Metadata

 

    • ID: The unique ID for this table
    • CLONE_GROUP_ID: The ID of the object this table is cloned from (here it is the same as the ID, since it is not a clone)
  • Now, I will create a clone of Transactions and call it "Transactions_Clone". The table shows the same volume of records:

Transactions Clone

  • We might wonder whether it has "copied" the entire "Transactions" table and replicated it, but it has not. So how do we know it did not copy? See the metadata of the table below:

Transactions Clone Metadata

  • Observe the following:
    • ID: The new table got its own unique ID
    • CLONE_GROUP_ID: It points to the ID of the "Transactions" table. Isn't that smart?
  • So technically, when I query TRANSACTIONS_CLONE, it simply refers to the data location of "TRANSACTIONS".
  • When we add data to the master, you will not see it in TRANSACTIONS_CLONE (which is expected).
  • Now, let's say I delete some data from TRANSACTIONS_CLONE (60 rows). I will no longer see those 60, but the rest of the data is still referenced from its original source, "TRANSACTIONS".

Transactions Clone Delete

  • And my active bytes are still 0:

Transactions Clone Metadata

  • Now I loaded about 100K records into the clone; take a look at the number of records! It serves the new data from the newly added storage location and the rest from the source.

Transactions Clone Insert

If any user makes changes to the cloned table or object, all Snowflake does is track the additional (or deleted) data through new pointers/references. This incurs storage costs only for the added data. When you delete data, the original data is untouched, but the cloud services layer holds the necessary information about what was removed and added, making it a "true" zero copy.

Similarly, if you make any structural changes, Snowflake handles the changes in its Cloud services layer and brings the necessary data accordingly.

This gives you the real power of cloning without consuming any space, and you can replicate a table in a matter of minutes if not seconds, speeding up your development and testing with much better efficiency.

BigQuery

BigQuery is one of the most popular databases, or rather cloud data warehouses, offering rich features for storing data with extreme performance and scalability. It is fully managed, so you do not need to worry about storage or which compute engine to use. In Snowflake, by contrast, you have to pick a virtual warehouse (aka compute engine) to run your query, and if the query is not faring well, you can pick a higher-configured warehouse to let it run better.

In BigQuery you don't need to worry about any of this; simply focus on writing the query and executing it. You are only charged for the data that your query processes.

BigQuery is similar to Snowflake, but until now it did not have the ability to clone a table or dataset. During Google Next '21, it was announced that soon you will be able to:

  1. Take a Snapshot
  2. Clone the DB

Bigquery

Source: Google Next ’21 Session

As per the slide above, snapshots are immutable but a helpful feature that you can use to go back in time without incurring extra charges. This helps with savings, because copying a table incurs storage costs.
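For reference, table snapshots already have SQL syntax along these lines (a sketch; the project and dataset names are placeholders, and the expiration option is optional):

-- Create a snapshot of the transactions table and keep it for 30 days
CREATE SNAPSHOT TABLE `my_project.my_dataset.transactions_snapshot`
CLONE `my_project.my_dataset.transactions`
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
);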

Cloning has been announced (though it is not yet available), and the concept appears to work in a similar fashion to Snowflake's (we still need to wait for the official release; until then we can only assume), in which you only incur charges for new or modified data and not the rest. This saves a huge cost when your table runs into terabytes or petabytes.

This is a feature every developer/tester is waiting to see in BigQuery. Once cloning is in place, it will give teams a tremendous boost in analyzing production issues without going to production: you can simply clone and run your queries in the development area without any downtime for the production systems. Time is of the essence; everyone wants things done faster and smarter with minimal manual effort.

There is more to come on these features once they are officially released. We will also see how it fares against Snowflake cloning and Amazon RDS Aurora cloning.

In future blogs, I will do a deep dive into the technicalities of these tools and their features, and explore more solutions. Until then, signing off!
