sql Articles / Blogs / Perficient - https://blogs.perficient.com/tag/sql/

From Cloud to Local: Effortlessly Import Azure SQL Databases
https://blogs.perficient.com/2025/02/26/import-azure-sql-databases/
Wed, 26 Feb 2025

With most systems transitioning to cloud-based environments, databases are often hosted across various cloud platforms. However, during the development cycle, there are occasions when having access to a local database environment becomes crucial, particularly for analyzing and troubleshooting issues originating in the production environment.

Sometimes, it is necessary to restore the production database to a local environment to diagnose and resolve production-related issues effectively. This allows developers to replicate and investigate issues in a controlled setting, ensuring efficient debugging and resolution.

In an Azure cloud environment, database backups are often exported as .bacpac files. To work with these databases in a local environment, the .bacpac file must be imported and restored on a local SQL Server instance.

There are several methods to achieve this, including:

  1. Using SQL Server Management Studio (SSMS).
  2. Using the SqlPackage command-line.

This article will explore the steps to import a .bacpac file into a local environment, focusing on practical and straightforward approaches.

The first approach—using SQL Server Management Studio (SSMS)—is straightforward and user-friendly. However, challenges arise when dealing with large database sizes, as the import process may fail due to resource limitations or timeouts.

The second approach, using the SqlPackage command-line, is recommended in such cases. This method offers more control over the import process, allowing for better handling of larger .bacpac files.

Steps to Import a .bacpac File Using SqlPackage

1. Download SqlPackage

  • Navigate to the SqlPackage download page: SqlPackage Download.
  • Ensure you download the .NET 6 version of the tool, as the .NET Framework version may have issues processing databases with very large tables.

2. Install the Tool

  • Follow the instructions under the “Windows (.NET 6)” header to download and extract the tool.
  • After extracting, open a terminal in the directory where you extracted SqlPackage.

3. Run SqlPackage

  • Place the .bacpac file in the extracted SqlPackage folder (e.g., C:\sqlpackage-win7-x64-en-162.1.167.1).
  • Use the following example command in the terminal (PowerShell) to import the .bacpac file:

    SqlPackage /a:Import /tsn:"localhost" /tdn:"test" /tu:"sa" /tp:"Password1" /sf:"database-backup-filename.bacpac" /ttsc:True /p:DisableIndexesForDataPhase=False /p:PreserveIdentityLastValues=True

4. Adjust Parameters for Your Setup

  • /tsn: The server name (IP or hostname) of your SQL Server instance, optionally followed by a port (default: 1433).
  • /tdn: The name of the target database (must not already exist).
  • /tu: SQL Server username.
  • /tp: SQL Server password.
  • /sf: The path to your .bacpac file (use the full path or ensure the terminal is in the same directory).

5. Run and Wait

  • Let the tool process the import. The time taken will depend on the size of the database.

Important: Ensure the target database does not already exist, as .bacpac files can only be imported into a fresh database.
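
If you are unsure whether the name is free, a quick check from SSMS or sqlcmd can confirm it before you run the import. This is a small sketch; the database name below matches the example command above, so adjust it to your own target name.

-- Check whether the target database already exists
SELECT name FROM sys.databases WHERE name = 'test';

-- If it exists and you are certain it can be replaced, drop it first:
-- DROP DATABASE [test];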

The options /p:DisableIndexesForDataPhase and /p:PreserveIdentityLastValues optimize the import process for large databases and preserve identity column values. SqlPackage provides more reliability and flexibility than SSMS, especially when dealing with more extensive databases.

 

Reference:

https://learn.microsoft.com/en-us/azure/azure-sql/database/database-import?view=azuresql&tabs=azure-powershell

Snowflake: Master Real-Time Data Ingestion
https://blogs.perficient.com/2024/05/06/unlocking-real-time-analytics-harnessing-the-power-of-snowflakes-snowpipe-and-streams/
Mon, 06 May 2024

In this blog post, we’ll dive into two powerful features of Snowflake: Snowpipe and Streams. Both Snowpipe and Streams are crucial components for real-time data processing and analytics in Snowflake. We’ll explore what each feature entails and how they can be leveraged together to streamline data ingestion and analysis workflows, harnessing the Power of Snowflake’s Snowpipe and Streams.

 Snowpipe


Snowpipe is a vital feature of Snowflake that automates the process of loading data as soon as new files appear in a designated location, such as a cloud storage bucket. This eliminates the need for manual intervention and ensures that fresh data is readily available for analysis without delay. Snowpipe operates as a serverless function, managed by Snowflake itself, thus alleviating the burden of managing virtual warehouses. 

Streams in Snowflake


Streams in Snowflake provide a continuous, ordered flow of changes made to a table. Whenever a DML (Data Manipulation Language) operation is performed on a table, such as INSERT, UPDATE, or DELETE, the corresponding change data is captured and made available through the stream. Streams are invaluable for capturing real-time changes to data, enabling downstream processing and analytics in near real-time. 

Setting up Snowpipe and Streams

Creating the Table 

-- Create table to store employee data
CREATE OR REPLACE TABLE SNOW_DB.PUBLIC.employees (
id INT,
first_name STRING,
last_name STRING,
email STRING,
location STRING,
department STRING
);

Creating a Stream

Before setting up Snowpipe, we need to create a stream on the target table to capture the changes. 

-- Create a stream on the target table
CREATE OR REPLACE STREAM my_stream ON TABLE SNOW_DB.PUBLIC.employees;

Configuring Snowpipe

Now, let’s configure Snowpipe to automatically load data from an external stage into our target table whenever new files are added.

-- Create a stage object
CREATE OR REPLACE STAGE DATABASE.external_stages.csv_folder
URL = 's3://snowflakebucket/csv/snowpipe'
STORAGE_INTEGRATION = s3_int
FILE_FORMAT = DATABASE.file_formats.csv_fileformat;

-- Create a pipe to load data from the stage automatically as new files arrive
CREATE OR REPLACE PIPE DATABASE.pipes.my_pipe
AUTO_INGEST = TRUE -- requires event notifications to be configured on the cloud storage location
AS
COPY INTO SNOW_DB.PUBLIC.employees
FROM @DATABASE.external_stages.csv_folder;
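
Once the pipe exists, it is useful to confirm that it is picking up files. The statements below are a minimal sketch using the pipe name from the example above: SYSTEM$PIPE_STATUS reports the pipe's execution state, and ALTER PIPE ... REFRESH queues any files already sitting in the stage.

-- Check the pipe's current status (execution state, pending file count, etc.)
SELECT SYSTEM$PIPE_STATUS('DATABASE.pipes.my_pipe');

-- Load files that were already in the stage before the pipe was created
ALTER PIPE DATABASE.pipes.my_pipe REFRESH;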

 Example Scenario

Imagine we have a stream of customer transactions being continuously ingested into our Snowflake database. With Snowpipe and Streams configured, we can seamlessly capture these transactions in real time and analyze them for insights or trigger further actions, such as fraud detection or personalized marketing campaigns. 

 Output from Stream

The stream “my_stream” captures the changes made to the “employees” table. We can query the stream to see the change data using the SELECT statement: 

-- Query the stream for change data
SELECT * FROM my_stream;

This query will return the change data captured by the stream, including the operation type (INSERT, UPDATE, DELETE) and the corresponding data changes. 

Example Output

Suppose there have been some insertion operations on the “employees” table. The query to the stream might return something like this: 

| METADATA$ACTION | METADATA$ISUPDATE | METADATA$ROW_ID | ID  | FIRST_NAME | LAST_NAME | EMAIL                   | LOCATION    | DEPARTMENT  |
|-----------------|-------------------|-----------------|-----|------------|-----------|-------------------------|-------------|-------------|
| INSERT          | false             | 1               | 101 | John       | Doe       | john.doe@example.com    | New York    | Sales       |
| INSERT          | false             | 2               | 102 | Jane       | Smith     | jane.smith@example.com  | Los Angeles | Marketing   |
| INSERT          | false             | 3               | 103 | Bob        | Johnson   | bob.johnson@example.com | Chicago     | Engineering |

 

In this example, each row represents an insertion operation on the “employees” table. The “METADATA$ACTION” column indicates the action performed (in this case, INSERT), while the “ID”, “FIRST_NAME”, “LAST_NAME”, “EMAIL”, “LOCATION”, and “DEPARTMENT” columns contain the inserted employees’ data. 

 This stream output provides insight into the changes made to the “employees” table, enabling real-time monitoring of operations. 
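
To act on these changes rather than just view them, the stream can be consumed in a DML statement. The sketch below assumes a hypothetical history table with matching columns; reading a stream inside a DML transaction advances its offset, so the same changes are not processed twice.

-- Persist the captured changes into a (hypothetical) history table
INSERT INTO SNOW_DB.PUBLIC.employees_history
SELECT id, first_name, last_name, email, location, department, METADATA$ACTION
FROM my_stream;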

Additional use cases

Real-Time Monitoring of Financial Transactions 

Financial institutions can utilize Snowpipe and Streams to monitor financial transactions in real time. By setting up Streams on relevant tables, they can capture and analyze changes in transaction data, enabling them to detect potential fraudulent activities or financial risks promptly. For example, they can set up automated alerts to identify suspicious transaction patterns and take corrective actions immediately. 

Analysis of User Behavior in Web and Mobile Applications

Technology companies can leverage Snowpipe and Streams to analyze user behavior in their web and mobile applications. By capturing user events such as clicks, interactions, and purchases in real-time through Streams, they can gain valuable insights into user experience, identify areas for improvement, and dynamically personalize user experiences. This enables them to offer personalized recommendations, targeted marketing campaigns, and features tailored to individual user needs. 

Conclusion

By harnessing the power of Snowpipe and Streams in Snowflake, organizations can achieve efficient real-time data ingestion and analysis, enabling timely decision-making and unlocking valuable insights from streaming data sources. 

This blog post provides an overview of both Snowpipe and Streams, followed by a step-by-step guide on setting them up and an example scenario demonstrating their combined usage. 

Discoveries from Q&A with Enterprise Data using GenAI for Oracle Autonomous Database
https://blogs.perficient.com/2024/04/09/discoveries-from-qa-with-enterprise-data-using-genai-for-oracle-autonomous-database/
Tue, 09 Apr 2024

Natural language AI has proliferated into many of today’s applications and platforms. One of the most in-demand use cases is the ability to find quick answers to questions about what’s hidden within organizational data, such as operational, financial, or other enterprise data. Therefore, leveraging the latest advancements in the GenAI space together with enterprise data warehouses has valuable benefits. The SelectAI feature of the Oracle Autonomous Database (ADB) achieves this outcome. It eliminates the complexity of leveraging various large language models (LLMs) from within the database itself. From an end-user perspective, SelectAI is as easy as asking the question, without having to worry about GenAI prompt generation, data modeling, or LLM fine-tuning.

In this post, I will summarize my findings on implementing ADB SelectAI and share some tips on what worked best and what to look out for when planning your implementation.

Several GenAI Models: Which One to Use?

What I like about SelectAI is that switching the underlying GenAI model is simple. This is important over time to stay up to date and take advantage of the latest and greatest of what LLMs have to offer and at the most suitable cost. We can also set up SelectAI with multiple LLMs simultaneously, for example, to cater to different user groups, at varying levels of service. In the future, there will always be a better LLM model to use, but at this time these findings are based on trials of the Oracle Cloud Infrastructure (OCI) shared Cohere Command model, the OpenAI GPT-3.5-Turbo model and the OpenAI GPT-4 model. Here is a summary of how each worked out:

Cohere Command:

While this model worked well for simple questions that are well phrased with nouns that relate to the metadata, it didn’t work well when the question got more complex. Rather than giving a wrong answer, it returned a message apologizing for the inability to generate one: “Sorry, unfortunately a valid SELECT statement could not be generated…”. At the time of this writing, the Command R+ model had just been introduced and become generally available, but it wasn’t attempted as part of this exercise. It remains to be seen how effective the newer R+ model is in comparison to the other ones.

OpenAI GPT-4:

This LLM worked a lot better than Cohere Command in that it answered all the questions that Command couldn’t. However, it comes at a higher cost.

OpenAI GPT-3.5-Turbo:

This one is my favorite so far as it also answered all the questions that Command couldn’t and is roughly 50 times less expensive than GPT-4. It is also a lot faster to respond compared to the OCI shared Cohere Command. There were some differences though at times in how the answers are presented. Below is an example of what I mean:

Sample Question: Compare sales for package size P between the Direct and Indirect Channels

Responses Generated by Each Model:

  • Cohere command: Sorry, unfortunately, a valid SELECT statement could not be generated
  • OpenAI gpt-3.5-turbo: This was able to generate a good result set based on the following query, but the results weren’t automatically grouped in a concise manner.
SELECT s.PROD_ID, s.AMOUNT_SOLD, s.QUANTITY_SOLD, s.CHANNEL_ID, p.PROD_PACK_SIZE, c.CHANNEL_CLASS
FROM ADW_USER.SALES_V s
JOIN ADW_USER.CHANNELS_V c ON s.CHANNEL_ID = c.CHANNEL_ID
JOIN ADW_USER.PRODUCTS_V p ON s.PROD_ID = p.PROD_ID
WHERE p.PROD_PACK_SIZE = 'P' AND c.CHANNEL_CLASS IN ('Direct', 'Indirect');
  • OpenAI gpt-4: This provided the best answer, and the results were most suitable with the question as it grouped by Channel Class to easily compare sales.
SELECT c.CHANNEL_CLASS AS Channel_Class, SUM(s.AMOUNT_SOLD) AS Total_Sales 
FROM ADW_USER.SALES_V s  
JOIN ADW_USER.PRODUCTS_V p ON s.PROD_ID = p.PROD_ID 
JOIN ADW_USER.CHANNELS_V c ON s.CHANNEL_ID = c.CHANNEL_ID 
WHERE p.PROD_PACK_SIZE = 'P' AND c.CHANNEL_CLASS IN ('Direct', 'Indirect')
GROUP BY c.CHANNEL_CLASS;

Despite this difference, most of the answers were similar between GPT-4 and GPT-3.5-Turbo and that’s why I recommend to start with the 3.5-Turbo and experiment with your schemas at minimal cost.

Another great aspect of the OpenAI GPT models is that they support conversational type questions to follow up in a thread-like manner. So, after I ask for total sales by region, I can do a follow up question in the same conversation and say for example, “keep only Americas”. The query gets updated to restrict previous results to my new request.

Tips on Preparing the Schema for GenAI Questions

No matter how intelligent an LLM you pick, the experience of using GenAI won’t be pleasant unless the database schemas are well prepared for natural language. Thanks to Autonomous Database SelectAI, we don’t have to worry about the metadata every time we ask a question. It is an upfront setup that is done once and applies to all questions. Here are some schema prep tips that make a big difference in the overall data Q&A experience.

Selective Schema Objects:

Limit SelectAI to operate on the most relevant set of tables/views in your ADB. For example, exclude any intermediate, temporary, or irrelevant tables and enable SelectAI only on the reporting-ready set of objects. This is important because SelectAI automatically generates the prompt with the schema information to send over to the LLM together with the question. Sending metadata that excludes unnecessary database objects narrows down the focus for the LLM as it generates an answer.

Table/View Joins:

To produce correct joins between tables, give the join columns the same name. For example, SALES.CHANNEL_ID = CHANNELS.CHANNEL_ID. Foreign key and primary key constraints don’t affect how tables are joined, at least at the time of writing this post, so we need to rely on consistently naming join columns in the database objects.

Create Database Views:

Creating database views is very useful for SelectAI in several ways (a sample view is sketched after this list).

  1. Views allow us to reference tables in other schemas so we can setup SelectAI on one schema that references objects in several other schemas.
  2. We can easily rename columns with a view to make them more meaningful for natural language processing.
  3. When creating a view, we can exclude unnecessary columns that don’t add value to SelectAI and limit the size of the LLM prompt at the same time.
  4. Rename columns in views so the joins are on identical column names.
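
As a rough sketch, a view prepared for SelectAI might look like the one below. The view name matches the objects referenced in the earlier queries, while the underlying schema and table (SH.CHANNELS) and the column renames are assumptions for illustration only.

CREATE OR REPLACE VIEW ADW_USER.CHANNELS_V AS
SELECT CHANNEL_ID,                          -- same name as SALES_V.CHANNEL_ID so joins line up
       CHANNEL_CLASS,
       CHANNEL_DESC AS CHANNEL_DESCRIPTION  -- renamed to be friendlier for natural language
FROM   SH.CHANNELS;                         -- base table in another schema (assumed)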

Comments:

Adding comments makes a huge difference in how much more effective SelectAI is. Here are some tips on what to do with comments:

  • Comment on table/view level: Describe what type of information a table or view contains: For example, a view called “Demographics” may have a comment as follows: “Contains demographic information about customer education, household size, occupation, and years of residency”
  • Comment on column level: For security purposes SelectAI (in a non-Narrate mode) doesn’t send data over to the GenAI model. Only metadata is sent over. That means if a user asks a question about a specific data value, the LLM doesn’t have visibility where that exists in the database. To enhance the user experience where sending some data values to the LLM is not a security concern, include the important data values in the comment. This enables the LLM to know where that data is. For example, following is a comment on a column called COUNTRY_REGION: “region. some values are Asia, Africa, Oceania, Middle East, Europe, Americas”. Or for a channel column, a comment like the following can be useful by including channel values: “channel description. For example, tele sales, internet, catalog, partners”

Explain certain data values: Sometimes data values are coded and require translation. Following is an example of when this can be helpful: comment on column Products.VALID_FLAG: “indicates if a product is active. the value is A for active”
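
In Oracle, these tips translate into simple COMMENT statements. The examples below reuse the comment text suggested above against the view and column names that appear in the earlier queries; adapt them to your own objects.

COMMENT ON TABLE ADW_USER.CUSTOMER_DEMOGRAPHICS_V IS
  'Contains demographic information about customer education, household size, occupation, and years of residency';

COMMENT ON COLUMN ADW_USER.COUNTRIES_V.COUNTRY_REGION IS
  'region. some values are Asia, Africa, Oceania, Middle East, Europe, Americas';

COMMENT ON COLUMN ADW_USER.PRODUCTS_V.VALID_FLAG IS
  'indicates if a product is active. the value is A for active';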

Is There a Better Way of Asking a Question?

While the aforementioned guidance is tailored for the upfront administrative setup of SelectAI, here are some tips for the SelectAI end user.

  • Use double quotations for data values consisting of multiple words: This is useful for example when we want to filter data on particular values such as a customer or product name. The quotation marks also help pass the right case sensitivity of a word. For example: what are the total sales for “Tele Sales” in “New York City”.
  • Add the phrase “case insensitive” at the end of your question to help find an answer. For example: “calculate sales for the partners channel case insensitive”. The SQL query condition generated in this case is: WHERE UPPER(c.CHANNEL_CLASS) = ‘PARTNERS’, which simply means ignore case sensitivity when looking for information about partners.
  • If the results are filtered, add a statement like the following at the end of the question to avoid unnecessary filters: “Don’t apply any filter condition”. This was more applicable with the cohere command model than the OpenAI models.
  • Starting the question with “query” instead of “what is”, for instance, worked better with the cohere command model.
  • Be field specific when possible: Instead of just asking for information by customer or by product, be more field specific such as “customer name” or “product category”.
  • Add additional instructions to your question: You can follow the main question with specific requests for example to filter or return the information. Here is an example of how this can be done:

“what is the average total sales by customer name in northern america grouped by customer. Only consider Direct sales and customers with over 3 years of residency and in farming. case insensitive.”

Results are returned based on the following automatically generated SQL query:

SELECT c.CUST_FIRST_NAME || ' ' || c.CUST_LAST_NAME AS CUSTOMER_NAME, AVG(s.AMOUNT_SOLD)
FROM ADW_USER.SALES_V s JOIN ADW_USER.CUSTOMERS_V c ON s.CUST_ID = c.CUST_ID
JOIN ADW_USER.COUNTRIES_V co ON c.COUNTRY_ID = co.COUNTRY_ID
JOIN ADW_USER.CHANNELS_V ch ON s.CHANNEL_ID = ch.CHANNEL_ID
JOIN ADW_USER.CUSTOMER_DEMOGRAPHICS_V cd ON c.CUST_ID = cd.CUST_ID
WHERE UPPER(co.COUNTRY_SUBREGION) = 'NORTHERN AMERICA'
AND UPPER(ch.CHANNEL_CLASS) = 'DIRECT'
AND cd.YRS_RESIDENCE > 3
AND UPPER(cd.OCCUPATION) = 'FARMING'
GROUP BY c.CUST_FIRST_NAME, c.CUST_LAST_NAME;

It’s impressive to see how GenAI can take the burden off the business in finding quick and timely answers to questions that may come up throughout the day, all without data security risks. Contact us if you’re looking to unlock the power of GenAI for your enterprise data.

SQL: DML, DDL and DCL
https://blogs.perficient.com/2024/03/29/sql-dml-ddl-and-dcl/
Fri, 29 Mar 2024

In the realm of databases, SQL (Structured Query Language) serves as the lingua franca, enabling users to interact with data stored in various systems effectively. While SQL encompasses a wide array of commands, understanding the distinctions between Data Manipulation Language (DML), Data Definition Language (DDL), and Data Control Language (DCL) is fundamental for wielding this powerful tool with finesse.

DML, DDL, and DCL constitute the triad of SQL commands that form the backbone of database management. Each category serves distinct purposes, catering to different aspects of data manipulation, schema definition, and access control. Let’s delve deeper into each domain to grasp their significance and functionality within the SQL ecosystem.

DML:

DML stands for Data Manipulation Language. It is a subset of SQL used to perform operations on data stored in the database. DML commands enable users to retrieve, insert, update, and delete data from database tables. Here’s a breakdown of the operations:

INSERT: Used to insert new records into a table.

Syntax:

INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...);

Example:

INSERT INTO employees (emp_id, emp_name, emp_salary) VALUES (101, 'John Doe', 50000);

SELECT: Used to retrieve data from one or more tables.

Syntax:

SELECT column1, column2, ... FROM table_name WHERE condition;

Example:

SELECT emp_name, emp_salary FROM employees WHERE emp_id = 101;

UPDATE: Used to modify existing records in a table.

Syntax:

UPDATE table_name SET column1 = value1, column2 = value2, ... WHERE condition;

Example:

UPDATE employees SET emp_salary = 55000 WHERE emp_id = 101;

DELETE: Used to remove one or more records from a table.

Syntax:

DELETE FROM table_name WHERE condition;

Example:

DELETE FROM employees WHERE emp_id = 101;

MERGE: Used to perform an “upsert” operation, which is a combination of INSERT and UPDATE operations based on a condition.

Syntax:

MERGE INTO target_table USING source_table ON condition
WHEN MATCHED THEN UPDATE SET column1 = value1, column2 = value2, ...
WHEN NOT MATCHED THEN INSERT (column1, column2, ...) VALUES (value1, value2, ...);

Example:

MERGE INTO target_table USING source_table ON target_table.id = source_table.id
WHEN MATCHED THEN UPDATE SET target_table.value = source_table.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (source_table.id, source_table.value);
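
To make the syntax more concrete, here is a sketch of the same upsert against the employees table used elsewhere in this post; the employee_updates staging table is hypothetical.

MERGE INTO employees e
USING employee_updates u
ON (e.emp_id = u.emp_id)
WHEN MATCHED THEN UPDATE SET e.emp_salary = u.emp_salary
WHEN NOT MATCHED THEN INSERT (emp_id, emp_name, emp_salary)
VALUES (u.emp_id, u.emp_name, u.emp_salary);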

DML commands are essential for managing the contents of the database and are commonly used in conjunction with DDL and DCL commands to manipulate, define, and control the structure and access to the database.

DDL:

DDL stands for Data Definition Language. It is a subset of SQL used to define the structure and organization of the database objects. DDL commands are responsible for creating, modifying, and deleting database objects such as tables, indexes, views, and schemas.

CREATE TABLE: Used to create a new table in the database.

Syntax:

CREATE TABLE table_name (
         column1 datatype constraints,
         column2 datatype constraints,
         ...
     );

Example:

CREATE TABLE employees (
         emp_id INT PRIMARY KEY,
         emp_name VARCHAR(100),
         emp_salary DECIMAL(10, 2)
     );

ALTER TABLE: Used to modify an existing table’s structure.

Syntax:

ALTER TABLE table_name ADD column_name datatype constraints;
ALTER TABLE table_name MODIFY column_name datatype constraints;
ALTER TABLE table_name DROP COLUMN column_name;

Example:

ALTER TABLE employees ADD emp_department VARCHAR(50);
ALTER TABLE employees DROP COLUMN emp_salary;

DROP TABLE: Used to delete a table and all its data from the database.

Syntax:

DROP TABLE table_name;

Example:

DROP TABLE employees;

CREATE INDEX: Used to create an index on a table. Indexes improve the speed of data retrieval operations.

Syntax:

CREATE INDEX index_name ON table_name (column_name);

Example:

CREATE INDEX emp_name_index ON employees (emp_name);

DROP INDEX: Used to remove an index from a table.

Syntax:

DROP INDEX index_name;

Example:

DROP INDEX emp_name_index;

CREATE VIEW: Used to create a virtual table based on the result set of a SELECT query.

Syntax:

CREATE VIEW view_name AS
SELECT column1, column2, ...
FROM table_name
WHERE condition;

Example:

CREATE VIEW employee_view AS
SELECT emp_id, emp_name
FROM employees
WHERE emp_department = 'IT';

DDL commands are crucial for database administrators and developers to manage the structure of the database and ensure data integrity. They are often used in combination with DML and DCL commands to complete database management tasks effectively.

DCL:

DCL stands for Data Control Language. It is a subset of SQL used to control access to data within a database. DCL commands are primarily concerned with defining and managing user privileges and permissions, ensuring data security and integrity.

GRANT: Used to grant specific privileges to a user or role.

Syntax:

GRANT privilege1, privilege2, ... ON object_name TO user_or_role;

Example:

GRANT SELECT, INSERT ON employees TO user1;

REVOKE: Used to revoke previously granted privileges from a user or role.

Syntax:

REVOKE privilege1, privilege2, ... ON object_name FROM user_or_role;

Example:

REVOKE SELECT, INSERT ON employees FROM user1;

COMMIT: Used to make permanent all changes made in the current transaction since the last COMMIT or ROLLBACK statement.

Syntax:

COMMIT;

Example:

COMMIT;

ROLLBACK: Used to undo changes made in the current transaction and restore the database to its previous state.

Syntax:

ROLLBACK;

Example:

ROLLBACK;

SAVEPOINT: Used to set a named point within a transaction to which you can later roll back.

Syntax:

SAVEPOINT savepoint_name;

Example:

SAVEPOINT before_update;
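
Putting these statements together, a sketch of a transaction that uses a savepoint for partial rollback might look like this (the employee values are made up for illustration):

INSERT INTO employees (emp_id, emp_name, emp_salary) VALUES (104, 'Asha Rao', 48000);
SAVEPOINT before_update;
UPDATE employees SET emp_salary = 60000 WHERE emp_id = 104;
ROLLBACK TO before_update;  -- undoes only the UPDATE; the INSERT is still pending
COMMIT;                     -- makes the INSERT permanent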

SET TRANSACTION: Used to set properties for the current transaction.

Syntax:

SET TRANSACTION { READ WRITE | READ ONLY | ISOLATION LEVEL { SERIALIZABLE | READ COMMITTED } };

Example:

SET TRANSACTION READ WRITE;

DCL commands are essential for ensuring data security and controlling access to sensitive information within a database. Strictly speaking, COMMIT, ROLLBACK, SAVEPOINT, and SET TRANSACTION are transaction control statements (often called TCL) rather than DCL, but they are commonly discussed alongside GRANT and REVOKE. They are often used in conjunction with DDL and DML commands to manage the overall security and integrity of the database.

Official Documentation Reference: Types of SQL Statements

Introduction to Star and Snowflake schema
https://blogs.perficient.com/2024/03/29/introduction-to-star-and-snowflake-schema/
Fri, 29 Mar 2024

In the world of data warehousing and business intelligence, two key concepts are fundamental: Snowflake and Star Schema. These concepts play a pivotal role in designing effective data models for analyzing large volumes of data efficiently. Let’s delve into what Snowflake and Star Schema are and how they are used in the realm of data warehousing.

Snowflake Schema

The Snowflake Schema is a type of data warehouse schema that consists of a centralized fact table that is connected to multiple dimension tables in a hierarchical manner. The name “Snowflake” stems from its resemblance to a snowflake, where the fact table is at the center, and dimension tables branch out like snowflake arms. In this schema:

  • The fact table contains quantitative data or measures, typically numeric values, such as sales revenue, quantity sold, or profit.
  • Dimension tables represent descriptive attributes or perspectives by which data is analyzed, such as time, geography, product, or customer.


The key characteristics of a Snowflake Schema include:

  • Normalization: Dimension tables are normalized, meaning redundant data is minimized by breaking down the dimension into multiple related tables.
  • Complex Joins: Analytical queries may involve complex joins between the fact table and multiple dimension tables to retrieve the desired information.

Snowflake Schema is particularly useful when dealing with large and complex datasets. However, the downside is that it can introduce more complex query logic due to the need for multiple joins.

Star Schema

The Star Schema is another widely used schema for data warehousing that consists of a single fact table connected directly to multiple dimension tables. In this schema:

  • The fact table contains quantitative data or measures, similar to the Snowflake Schema.
  • Dimension tables represent descriptive attributes, similar to the Snowflake Schema.


The key characteristics of a Star Schema include:

  • Denormalization: Dimension tables are denormalized, meaning redundant data is included directly in the dimension tables, simplifying query logic.
  • Simpler Joins: Analytical queries typically involve simpler joins between the fact table and dimension tables compared to the Snowflake Schema.

Star Schema is known for its simplicity and ease of use. It is well-suited for simpler analytical queries and is often favored for its performance benefits in query execution.

Key Differences

The main difference between Star and Snowflake schemas lies in their approach to storing dimensional data. Star schemas are simpler, with denormalized dimension tables, making them well-suited for fast query performance and simpler analytical queries. On the other hand, Snowflake schemas prioritize data integrity and storage efficiency through normalization but may result in slightly slower query performance due to additional joins.
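
As a minimal sketch, the difference shows up in the DDL for a product dimension: the star version keeps category attributes in the dimension itself, while the snowflake version normalizes them into a separate table. Table and column names here are hypothetical.

-- Star schema: denormalized product dimension
CREATE TABLE dim_product_star (
    product_id       INT PRIMARY KEY,
    product_name     VARCHAR(100),
    category_name    VARCHAR(50),    -- category attributes stored directly in the dimension
    category_manager VARCHAR(100)
);

-- Snowflake schema: the same dimension normalized into two related tables
CREATE TABLE dim_category (
    category_id      INT PRIMARY KEY,
    category_name    VARCHAR(50),
    category_manager VARCHAR(100)
);

CREATE TABLE dim_product_snowflake (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100),
    category_id  INT REFERENCES dim_category (category_id)
);

A question like "sales by category manager" needs one join to the dimension in the star design and two in the snowflake design, which is exactly the trade-off described above.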

Conclusion

Both Snowflake and Star Schema are essential concepts in the field of data warehousing, each with its own set of advantages and use cases. Choosing between them depends on the specific requirements of your data analysis tasks, the complexity of your data, and the performance considerations of your analytical queries. By understanding these schemas, you can design effective data models that cater to the needs of your business intelligence initiatives, enabling you to derive valuable insights from your data efficiently.


Spark DataFrame: Writing into Files
https://blogs.perficient.com/2024/03/06/spark-dataframe-writing-into-files/
Thu, 07 Mar 2024

This blog post explores how to write Spark DataFrame into various file formats for saving data to external storage for further analysis or sharing.

Before diving into this blog, have a look at my other blog posts that discuss creating the DataFrame and manipulating the DataFrame, along with writing a DataFrame into tables and views.

Dataset:

Below is the dataset that we will be using to demonstrate writing a DataFrame into a file.

Dataset

Writing Spark DataFrame to File:

CSV Format:

Below is the syntax to write a Spark DataFrame into a CSV file.

df.write.csv("output_path")

Let's go over writing the DataFrame to a file using examples and scenarios.

Example:

The snapshot below shows a sample of writing a DataFrame into a file.

Spark DataFrame Write to File - Display From The Path

After writing the DataFrame into the path, the files in the path are displayed. The displayed part files are the ones where the data is loaded. Spark writes one part file per partition of the DataFrame, so multiple part files can be created. We can repartition the DataFrame to produce a single file.

DataFrame Repartition:

Spark DataFrame Write into csv - Display From The Path Repartition

After repartitioning, we observe that all the part files combine into a single file. We also notice other files besides the part files, which we can prevent from being created by using the Spark configurations below. These extra files are created regardless of the output file format, not just for CSV.

Removing _committed and _started Files:

We can use the below Spark configuration, which will stop the files starting with _committed and _started from being created.

spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

Spark DataFrame Write to File -Display From The Path Commit Protocol

Removing _SUCCESS File:

We can use the below spark configuration to stop the _SUCCESS file from getting generated.

spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

Spark DataFrame Write to File -Display From The Path Makesuccessfulljobs

Data in the File:

With all the additional files removed, we can see the data that is being loaded into the file. Notice that, by default, Spark doesn’t write a header into the files; we can change this behavior using option/options. Let’s also look at the other options that are available when writing a DataFrame into a file.

Output From The File

Header Option:

Dataframe Write With Header

By adding the header option, we observe that the header is populated in the file. Similarly, we have an option to change the delimiter.

Delimiter Option:

Dataframe Write With Delimiter

 

We can change the delimiter to our desired format by adding the additional option – delimiter or we can also use sep (syntax provided below).

df.write.option("header","true").option("sep","|").mode("overwrite").csv("dbfs:/FileStore/df_write/")

nullValue Option:

From the previous output, we can notice that the capital for Tonga is null in the DataFrame, though in the CSV it would have been written as an empty value. We can have it retained as null by using the nullValue option.

Dataframe Write With Nullvalue

With this option, we observe that null is retained.

emptyValue Option:

In some scenarios we may need to populate null for empty values; in that case, we can use the below option.

Dataframe Write With Emptyvalue

From the output above, we observe that Denmark previously had an empty value populated for its capital, but it is now being populated with null.

ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace Options:

If we need to retain the spaces before or after the value in a column, we can use the below options.

Dataframe Write With Ignoreleadtrail

Different Way to use Multiple Options:

We can have all the options for the file format in a common variable and then use it whenever needed if we have to use the same set of options for multiple files.

Dataframe Write With Multile Options

We have created a variable writeOptions of Map type which has the options stored within it and we can use it whenever we need that Output Option.

JSON Format:

We can use the below syntax and format to write into a JSON file from the DataFrame.

Dataframe Write With Json

Other Formats:

ORC Format:

Below is the syntax for writing the DataFrame in ORC Format:

df.write.mode("overwrite").orc("dbfs:/FileStore/df_write/")

Parquet Format:

Below is the syntax for writing the DataFrame in Parquet format:

df.write.mode("overwrite").parquet("dbfs:/FileStore/df_write/")

Similar to the above, there are several more formats, examples, and syntaxes, which you can reference in the official Spark documentation.

In this blog post, we covered the basics of writing Spark DataFrame into different file formats. Depending on your specific requirements and use cases, you can choose the appropriate file format and configuration options to optimize performance and compatibility.

Spark SQL Properties
https://blogs.perficient.com/2024/03/05/spark-sql-properties/
Wed, 06 Mar 2024

The spark.sql.* properties are a set of configuration options specific to Spark SQL, a module within Apache Spark designed for processing structured data using SQL queries, DataFrame API, and Datasets. These properties allow users to customize various aspects of Spark SQL’s behavior, optimization strategies, and execution environment. Here’s a brief introduction to some common spark.sql.* properties:

spark.sql.shuffle.partitions

The spark.sql.shuffle.partitions property in Apache Spark determines the number of partitions to use when shuffling data during operations like joins or aggregations in Spark SQL. Shuffling involves redistributing and grouping data across partitions based on certain criteria, and the number of partitions directly affects the parallelism and resource utilization during these operations. The default behavior splits DataFrames into 200 unique partitions when shuffling data.

Syntax:

// Setting the number of shuffle partitions to 200
spark.conf.set("spark.sql.shuffle.partitions", "200")

spark.sql.autoBroadcastJoinThreshold

The spark.sql.autoBroadcastJoinThreshold property in Apache Spark SQL determines the threshold size beyond which Spark SQL automatically broadcasts smaller tables for join operations. Broadcasting involves replicating a smaller DataFrame or table to all executor nodes to avoid costly shuffling during join operations.

Syntax:

// Setting the autoBroadcastJoinThreshold to 10MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")

spark.sql.execution.arrow.enabled

In Apache Spark SQL, the spark.sql.execution.arrow.enabled property determines whether Arrow-based columnar data transfers are enabled for DataFrame operations. Arrow is a columnar in-memory data format that can significantly improve the performance of data serialization and deserialization, leading to faster data processing.

Syntax:

// Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

spark.sql.sources.partitionOverwriteMode

The spark.sql.sources.partitionOverwriteMode property in Apache Spark SQL determines the mode for overwriting partitions when writing data into partitioned tables. This property is particularly relevant when updating existing data in partitioned tables, as it specifies how Spark should handle the overwriting of partition directories. By default, partitionOverwriteMode is set to static.

Syntax:

// Setting the partition overwrite mode to "dynamic"
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

spark.sql.statistics.histogram.enabled

The spark.sql.statistics.histogram.enabled property in Apache Spark SQL determines whether Spark SQL collects histograms for data statistics computation. Histograms provide additional insights into the distribution of data in columns, which can aid the query optimizer in making better execution decisions. By default, the config is set to false.

Syntax:

// Enable collection of histograms for data statistics computation
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")

spark.sql.streaming.schemaInference

The spark.sql.streaming.schemaInference property in Apache Spark SQL determines whether schema inference is enabled for streaming DataFrames. When enabled, Spark SQL automatically infers the schema of streaming data sources during runtime, simplifying the development process by eliminating the need to manually specify the schema.

Syntax:

// Enable schema inference for streaming DataFrames
spark.conf.set("spark.sql.streaming.schemaInference", "true")

spark.sql.adaptive.skewJoin.enabled

The spark.sql.adaptive.skewJoin.enabled property in Apache Spark SQL determines whether adaptive query execution is enabled for skew join optimization. When enabled, Spark SQL automatically detects and mitigates data skewness in join operations by dynamically adjusting the join strategy to handle skewed data distributions more efficiently. By default, skew join optimization is enabled (true).

Syntax:

// Enable adaptive query execution for skew join optimization
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

spark.sql.inMemoryColumnarStorage.batchSize

The spark.sql.inMemoryColumnarStorage.batchSize property in Apache Spark SQL configures the batch size for columnar caching. This property defines the number of rows that are processed and stored together in memory during columnar caching operations. By default, the batch size is 10000 rows.

Syntax:

// Setting the batch size for columnar caching to 1000 rows
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "1000")

spark.sql.adaptive.coalescePartitions.enabled

The spark.sql.adaptive.coalescePartitions.enabled property in Apache Spark SQL determines whether adaptive partition coalescing is enabled. When enabled, Spark SQL dynamically adjusts the number of partitions during query execution to optimize resource utilization and improve performance.

Syntax:

// Enable adaptive partition coalescing
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

Example

Here’s an example demonstrating the usage of all the mentioned Spark SQL properties along with a SQL query:

// Importing necessary Spark classes
import org.apache.spark.sql.{SparkSession, DataFrame}

// Creating a SparkSession (in notebooks such as Databricks, a `spark` session is already available)
val spark = SparkSession.builder().appName("SparkSqlPropertiesExample").getOrCreate()

// Setting Spark SQL properties
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760") // 10 MB
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")
spark.conf.set("spark.sql.streaming.schemaInference", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "1000")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

// Creating DataFrames for the tables
val employeesData = Seq((1, "Aarthii", 1000), (2, "Gowtham", 1500), (3, "Saranya", 1200))
val departmentsData = Seq((1000, "HR"), (1200, "Engineering"), (1500, "Finance"))
val employeesDF = spark.createDataFrame(employeesData).toDF("emp_id", "emp_name", "dept_id")
val departmentsDF = spark.createDataFrame(departmentsData).toDF("dept_id", "dept_name")

// Registering DataFrames as temporary views
employeesDF.createOrReplaceTempView("employees")
departmentsDF.createOrReplaceTempView("departments")

// Executing a SQL query using the configured properties
val result = spark.sql(
"SELECT emp_name, dept_name FROM employees e JOIN departments d ON e.dept_id = d.dept_id"
)

// Showing the result
result.show()

OUTPUT:

Spark Sql Properties

In this example:

  • We import the necessary Spark classes, including SparkSession and DataFrame.
  • We create a SparkSession object named spark.
  • We set various Spark SQL properties using the spark.conf.set() method.
  • We create DataFrames for two tables: “employees” and “departments”.
  • We register the DataFrames as temporary views using createOrReplaceTempView().
  • We execute a SQL join query between the “employees” and “departments” tables using spark.sql().
  • Finally, we display the result using show().

These properties provide fine-grained control over Spark SQL’s behavior and optimization techniques, enabling users to tailor the performance and functionality of Spark SQL applications to specific requirements and use cases.

Reference: https://spark.apache.org/docs/latest/configuration.html

Date and Timestamp in Spark SQL
https://blogs.perficient.com/2024/02/27/date-and-timestamp-in-spark-sql/
Tue, 27 Feb 2024

Spark SQL offers a set of built-in standard functions for handling dates and timestamps within the DataFrame API. These functions are valuable for performing operations involving date and time data. They accept inputs in various formats, including Date type, Timestamp type, or String. If the input is provided as a String, it must be in a format compatible with date (e.g., yyyy-MM-dd) or timestamp (e.g., yyyy-MM-dd HH:mm:ss.SSSS) representations. The functions return the corresponding date or timestamp values based on the input type. If the input string cannot be successfully converted to a date or timestamp, the functions return null.

Let’s see some Date and Timestamp syntax and examples in Spark SQL:

First, create a sample dataset and save it as a view, which we can use to explore the date and timestamp functions in SQL.

To learn how to create a DataFrame and its methods, check out: https://blogs.perficient.com/2024/01/10/spark-scala-approaches-toward-creating-dataframe/

To learn about writing into tables, look into: https://blogs.perficient.com/2024/02/25/spark-dataframe-writing-to-tables-and-creating-views/

  • Creating a temp view
val df = spark.createDataFrame(Seq(
  ("2024-02-27", "2024-02-27 15:30:45.123"),
  ("2024-01-15", "2024-01-15 08:45:30.555"),
  ("2023-11-20", "2023-11-20 12:00:00.000"),
  ("invalid date", "invalid timestamp")
)).toDF("date_string", "timestamp_string")
df.createOrReplaceTempView("my_table")
  • Displaying my_table

My Table

Date Data Type

The DateType represents a date without a time component. Dates can be created from strings using the DATE literal or the TO_DATE() function.

  • Example of DATE Literal

Date Literal

  • Example of TO_DATE() function

To Date Function
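
Since the screenshots are not reproduced here, the queries behind them would look roughly like the following, using the my_table view created above:

-- DATE literal
SELECT DATE '2024-02-27' AS date_value;

-- TO_DATE() on the string column; invalid strings return null
SELECT date_string, TO_DATE(date_string) AS date_value FROM my_table;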

Timestamp Data Type

The TimestampType represents a date and time with millisecond precision. Timestamps can be created using the TIMESTAMP literal or the TO_TIMESTAMP() function.

  • Example of TIMESTAMP Literal

Timestamp Literal

  • Example of TO_TIMESTAMP() function

To Timestamp Function
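
A sketch of the equivalent timestamp queries against my_table:

-- TIMESTAMP literal
SELECT TIMESTAMP '2024-02-27 15:30:45.123' AS ts_value;

-- TO_TIMESTAMP() on the string column; invalid strings return null
SELECT timestamp_string, TO_TIMESTAMP(timestamp_string) AS ts_value FROM my_table;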

Date and Timestamp Functions

Spark SQL provides various functions for working with dates and timestamps:

  • Date Functions: year(), month(), dayofmonth(), dayofweek(), dayofyear(), weekofyear(), etc.

Year Month Day

  • Timestamp Functions: hour(), minute(), second(), date_format(), etc.

Hour Minute Second
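
A sketch of how these functions can be applied to the columns of my_table:

-- Date parts
SELECT date_string,
       year(TO_DATE(date_string))       AS year_value,
       month(TO_DATE(date_string))      AS month_value,
       dayofmonth(TO_DATE(date_string)) AS day_of_month,
       weekofyear(TO_DATE(date_string)) AS week_of_year
FROM my_table;

-- Time parts
SELECT timestamp_string,
       hour(TO_TIMESTAMP(timestamp_string))   AS hour_value,
       minute(TO_TIMESTAMP(timestamp_string)) AS minute_value,
       second(TO_TIMESTAMP(timestamp_string)) AS second_value
FROM my_table;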

Formatting Dates and Timestamps

You can format dates and timestamps using the date_format() function.

  • date_format example with formating Date

Date Format

  • date_format example with formating timestamp

Date Format Timestamp
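
The underlying queries would be along these lines (the format patterns are illustrative):

-- Formatting a date
SELECT date_string, date_format(TO_DATE(date_string), 'dd/MM/yyyy') AS formatted_date FROM my_table;

-- Formatting a timestamp
SELECT timestamp_string, date_format(TO_TIMESTAMP(timestamp_string), 'yyyy-MM-dd HH:mm:ss') AS formatted_ts FROM my_table;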

Date and Timestamp Arithmetic

You can perform arithmetic operations on dates and timestamps using functions like date_add(), date_sub(), and arithmetic operators.

  • Example of adding 1 day to given dates

Date Add

  • Example of adding 1 hour to given timestamp

Timestamp Add
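
A sketch of the arithmetic shown in the screenshots:

-- Add 1 day to each date
SELECT date_string, date_add(TO_DATE(date_string), 1) AS next_day FROM my_table;

-- Add 1 hour to each timestamp
SELECT timestamp_string, TO_TIMESTAMP(timestamp_string) + INTERVAL 1 HOUR AS plus_one_hour FROM my_table;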

Filtering Dates and Timestamps

You can filter data based on dates and timestamps using comparison operators.

  • Example of filtering with date

Date Filter

  • Example of filtering with timestamp

Timestamp Filter
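
The filters behind these examples would look something like this (the cutoff values are illustrative):

-- Filter on a date
SELECT * FROM my_table WHERE TO_DATE(date_string) >= DATE '2024-01-01';

-- Filter on a timestamp
SELECT * FROM my_table WHERE TO_TIMESTAMP(timestamp_string) > TIMESTAMP '2024-01-15 00:00:00';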

These are some common operations you can perform with dates and timestamps in Spark SQL. They are essential for various analytical tasks, especially when dealing with time-series data.

References:

Spark SQL built-in functions: https://spark.apache.org/docs/2.3.0/api/sql/index.html

Spark Scala: Approaches toward creating Dataframe
https://blogs.perficient.com/2024/01/10/spark-scala-approaches-toward-creating-dataframe/
Wed, 10 Jan 2024

In Spark with Scala, creating DataFrames is fundamental for data manipulation and analysis. There are several approaches to creating DataFrames, each offering its unique advantages. You can create DataFrames from various data sources like CSV, JSON, or even from existing RDDs (Resilient Distributed Datasets). In this blog, we will see some approaches to creating DataFrames with examples.

Understanding Spark DataFrames:

Spark DataFrames are a fundamental component of Apache Spark, offering a higher-level abstraction built on top of RDDs. Their design aims to offer a more efficient processing mechanism for managing large-scale structured data.

Let’s explore the different methods for creating the dataframes:

Creating DataFrame from a Collection:

DataFrames intended for immediate testing can be built on top of a collection.

In this case the dataframe is built on top of a Sequence.

Sequence To DataFrame

DataFrame Creation using DataFrameReader API:

Shown below is the dataset that will be used to explore creating DataFrames using the DataFrameReader API.

csv file

Creating a Dataframe from CSV/TXT Files:

We can directly use the “spark.read.csv” method to read the file into a DataFrame. But in its current form the data cannot be loaded directly into a table or another downstream target. The headers and delimiters need to be passed as separate options to get the data properly loaded into the DataFrame. Let’s go through the options that are available and get the data into a proper format below.

blank csv read

Options:

  • header:

By default, the header option will be false in the Spark DataFrameReader API. Since our file has a header, we need to specify the option – option("header","true") – to get the header. If the header option is missed and the file has a header, there is a chance that the header might be treated as a data row and get stored in the table.

csv read with header

  • delimiter:

The delimiter option will be "," by default. The sample file that we have provided has "|" as its delimiter, so it needs to be explicitly called out using the option – option("delimiter","|") – to get the columns split correctly.

csv read with delimiter

  • multiline:

As you can see in the above snapshot, the data is split and assigned to individual columns and rows, though the capital for India, New Delhi, is split into two rows because the multiline option is set to false by default. If we have this kind of multi-line data coming from the file, we can enable it using the option – option("multiline", "true").

csv read with multiline option

  • schema:

The schema of the dataframe can be viewed by using “.printSchema” method.

schema without inferschema

From the above snapshot we can see that the datatype of the column "Id" is treated as string, though the datatype we see in the file is integer. This is because inferSchema is set to false by default when reading the file. The "inferSchema" option tells the DataFrameReader to go through the data and determine the datatype of each column. We can also enforce the schema of the DataFrame by creating a schema using StructType and passing it through the ".schema" method. Both methods are shown below.

  • inferSchema: We can set inferSchema to true by including the option – option("inferSchema","true") – which in turn makes the DataFrameReader go through the data and find the datatypes.

schema with inferschema

  • Defining Schema: We can enforce the schema by using the ".schema" method, for which we define the schema and pass it in when reading the file; this lets us control the datatype of each column. If there is a datatype mismatch when enforcing the schema on a column, null will be populated.

schema with enforced schema

  • Reading txt/txt.gz:

The "spark.read.csv" method with options can be used to read txt or txt.gz files, which will return a DataFrame. If we have a proper text file within the gzip archive, we can read it directly as a DataFrame without unzipping it.

txt.gz read

  • Reading csv with a different Format:

The "spark.read.format("csv").load" method can also be used instead of spark.read.csv. Both are the same function with a different syntax.

csv read different format

The above are some of the options available when reading a file into a DataFrame; a few more options not shown here are escapeQuotes, unescapedQuoteHandling, quote, escape, mode, nullValue, and lineSep.
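
As an aside, the same CSV options can also be expressed purely in Spark SQL by defining a temporary view over the file; this is a sketch rather than the DataFrameReader approach above, and the file path is hypothetical.

CREATE TEMPORARY VIEW countries_csv
USING csv
OPTIONS (path 'dbfs:/FileStore/countries.txt', header 'true', delimiter '|', inferSchema 'true');

SELECT * FROM countries_csv;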

Creating a Dataframe from json Files:

Below is a snapshot of the JSON file that will be used for the examples.

Json file

The DataFrameReader can be used for reading the json into a dataframe by using “spark.read.json()”.

json read

In the above snapshot we can see that the DataFrame columns are arranged in alphabetical order of the column names, which we can change to the desired order: Id first, followed by Country and Capital, as in the JSON.

json read with schema

With the schema defined, we can see that the columns are in the desired order and can be transformed if needed and stored into a table.

The DataFrameReader can also be used to read Parquet and ORC files, and we can connect to different databases using a JDBC connection and read them into a DataFrame.

Conclusion:

In conclusion, creating DataFrames in Spark using Scala involves various approaches, each tailored to specific requirements and preferences. The DataFrame API provides a flexible and intuitive interface for data manipulation and analysis, offering both functional and declarative programming paradigms.

The DataFrame creation process can include reading data from diverse sources, such as CSV files, JSON, Parquet, or even external databases. Once the DataFrame is created, you can use the powerful Spark SQL capabilities to execute SQL queries directly on your DataFrames and perform your transformations before using them downstream. Overall, the flexibility and scalability of Spark Scala’s DataFrame API empower data engineers and analysts to efficiently process and analyse large-scale datasets within the Spark ecosystem.

SQL Tuning
https://blogs.perficient.com/2023/02/25/sql-tuning/
Sat, 25 Feb 2023

In Data & Analytics projects, building efficient SQL queries is critical for extraction and load batch cycles to complete faster and meet the desired SLAs. The observations below describe approaches to writing SQL queries that follow best practices and facilitate performance improvements.

Tuning Approach

Pre-Requisite Checks

Before we get into subjecting a SQL Query against Performance Improvements, below are steps to be adopted:

  • Deep Dive into the current SQL Query
    • Complexity of the SQL (# of Tables/Joins/Functions)
    • Design of the SQL Query (Sub-Query/Correlated Sub-Query/Join/Filter Sequences)
    • Whether Best Practices followed: Is it modularized? When joined, does it contain functions and derivations?
  • Verify the As-Is Metrics of the SQL
    • Duration to return 1st record and first 100 records
    • Extract the Explain Plan Metrics
      • Cost (Resource Usage)
      • Cardinality (# of Rows returned per Task Operations)
      • Access Method (Full Table/ROWID/Index Unique/Full Index/Index Skip Scan)
      • Join Method (Hash/Nested-Loop/Sort-Merge/Outer Join)
      • Join Order (Multiple tables join sequence)
      • Partition
      • Parallel Processing (Exec on Multiple Nodes)

After ensuring the above prerequisites are taken care of and possible bottlenecks identified, tuning practices can be applied to the SQL Query for performance improvements.

Tuning Guidelines

Basic Guidelines are listed below:

  • Query Design Perspective
    • Extract only the required columns in the code via SELECT (instead of SELECT *)
    • Use Inner joins well ahead of Outer joins
    • Apply filters early, along with the inner joins, rather than at the end in the WHERE clause
    • Avoid Sub-queries and Correlated Sub-queries as much as possible
    • Create TEMP tables
      • to hold Sub-Query logic
      • to Modularize Complex Logic with related Columns and Derivations
      • to hold a reference list of values (used as Joins instead of IN clause)
      • to hold Functions, Calculations, and Derivations Attributes for later JOIN with Tables
      • to hold Complex Query Logic and subsequently apply RANK()/ROW_NUMBER()
    • Create Physical tables (instead of TEMP) if high volume
    • Drop the TEMP or Physical tables after intermediate processing completes
    • Complex Query with too many LEFT joins can be broken into parts and then JOINed
    • Avoid Duplicates as early as possible before subjecting the Derived tables to JOINs
    • On MPP DBs, do not use DISTRIBUTION for Smaller tables
    • On MPP DBs, DISTRIBUTION column-based joins provide faster results
  • Functions Perspective
    • Use EXISTS instead of IN if only presence needs to be checked
    • Instead of MINUS, use a LEFT JOIN with an IS NULL condition (see the sketch after this list)
    • If DISTINCT causes slowness, try ROW_NUMBER() to select one record out of multiples
    • Do not apply functions to join columns
  • DBA Perspective
    • Collect STATISTICS
    • Create Indexes (Single/Multiple) (on frequently used Joins/Predicates as required)
    • Create Partitions (for Optimized Scans)
  • Space and Computing Perspective
    • Increase the DB Server storage space
    • Increase the DB Server Computing Abilities
    • Multi-Node Processing of Queries
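
As a minimal sketch of the MINUS rewrite mentioned above, assuming hypothetical staging and target tables stg_orders and dim_orders:

-- Instead of:
--   SELECT order_id FROM stg_orders
--   MINUS
--   SELECT order_id FROM dim_orders;
-- prefer the anti-join form, which generally optimizes better:
SELECT s.order_id
FROM stg_orders s
LEFT JOIN dim_orders d
  ON d.order_id = s.order_id
WHERE d.order_id IS NULL;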

Conclusion

At a high level, these are the key inferences:

  • Check Explain Plan
  • Subject the Query to effective Design
  • Focus on DBA, Space, and Computing Abilities
  • Follow the Best Practices
SQL Best Practices and Performance Tuning https://blogs.perficient.com/2023/01/10/sql-best-practices-and-performance-tuning/ https://blogs.perficient.com/2023/01/10/sql-best-practices-and-performance-tuning/#respond Tue, 10 Jan 2023 09:55:04 +0000 https://blogs.perficient.com/?p=324971

The goal of performance tuning in SQL is to minimize the execution time of a query and reduce the resources used while processing it. Whenever we run a query, performance depends on the amount of data and the complexity of the calculations involved. So, by reducing the number of calculations and the amount of data processed, we can improve performance. To do that, there are some best practices and major factors which we are going to discuss in detail.

 

Data Types – Deciding on the right data type can reduce storage space and improve performance. We should always choose the smallest data type that will work for all the values in the column.

  • Choosing a specific data type helps to ensure that only valid values are stored in a particular column and reduces storage size.
  • Sometimes we need to convert one data type to another, which increases resource utilization and thereby reduces performance. To avoid that, care should be taken while creating tables to use the correct data type consistently across the tables in our data model; by doing so, we reduce the chances of changing them in the future, no implicit or explicit conversions are needed, and the query runs faster.
  • We should use current data types instead of deprecated ones.
  • Store the date and the time in separate columns. This helps when aggregating data by date or by time, and also when filtering the data.
  • When a column holds fixed-length values, go for a fixed-length data type, for example: gender, flag value, country code, mobile number, postal code, etc. (see the sketch after this list)
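
A minimal sketch of these choices, assuming a database with separate DATE and TIME types (for example, SQL Server); the table and column definitions are illustrative only:

-- Right-sized, fixed-length, and separate date/time columns
CREATE TABLE customer (
    customer_id   INT          NOT NULL,  -- integer key: compact and index-friendly
    gender        CHAR(1),                -- fixed-length value
    country_code  CHAR(2),                -- fixed-length value
    postal_code   CHAR(6),                -- fixed-length value
    full_name     VARCHAR(100),           -- variable-length value
    signup_date   DATE         NOT NULL,  -- date stored separately from the time
    signup_time   TIME         NOT NULL   -- time stored separately from the date
);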

Filtering Data – Query performance depends on how much data we are processing, so it is important to take only the data required for the query. The level at which we filter the data also matters.

Let’s see some of the scenarios –

  • For example, if we want to see an aggregation for the year 2022 and for the ABC department, we should filter the data before aggregation in the WHERE clause instead of in HAVING (see the sketch after this list).
  • If we want to join two tables on specific data, we should filter the required data before joining the tables.
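
A minimal sketch of the WHERE-versus-HAVING point, using a hypothetical sales table:

-- Filter in WHERE so only 2022 / ABC rows are scanned and aggregated
SELECT department, SUM(amount) AS total_amount
FROM sales
WHERE sale_year = 2022
  AND department = 'ABC'
GROUP BY department;

-- Less efficient: aggregate every group first, then discard most of them in HAVING
-- SELECT department, sale_year, SUM(amount) AS total_amount
-- FROM sales
-- GROUP BY department, sale_year
-- HAVING sale_year = 2022 AND department = 'ABC';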

 

Joins – The join is a very common and useful concept in databases and data warehouses. To improve performance, choosing the appropriate join for the requirement is very important. Below are some join best practices.

  • If we want only matching records from the joined tables, we should use an inner join. If we want all rows from one table, we should use a left or right outer join, and if we want all rows from both tables, we should use a full outer join. Always try to avoid cross joins.
  • Use ON instead of writing the join condition in the WHERE clause.
  • Use alias names for tables and columns.
  • Avoid OR in the join condition.
  • Always prefer a join over a correlated subquery; a correlated subquery performs poorly compared to a join (see the sketch after this list).
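
A minimal sketch of the join-versus-correlated-subquery point, using hypothetical customers and orders tables:

-- Preferred: join with ON and table aliases, then aggregate
SELECT c.customer_id, c.customer_name, SUM(o.order_total) AS order_total
FROM customers AS c
INNER JOIN orders AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name;

-- Avoid the correlated subquery form, which is re-evaluated for every customer row:
-- SELECT c.customer_id, c.customer_name,
--        (SELECT SUM(o.order_total)
--         FROM orders AS o
--         WHERE o.customer_id = c.customer_id) AS order_total
-- FROM customers AS c;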

EXISTS vs IN

  • We should use EXISTS instead of IN whenever the subquery returns a large amount of data (see the sketch after this list).
  • We should use IN when the subquery returns a small amount of data.
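
A minimal sketch of both cases, again with hypothetical tables:

-- EXISTS: the probe can stop as soon as one matching order is found
SELECT c.customer_id, c.customer_name
FROM customers AS c
WHERE EXISTS (SELECT 1
              FROM orders AS o
              WHERE o.customer_id = c.customer_id);

-- IN: reasonable when the subquery returns a small list of values
SELECT c.customer_id, c.customer_name
FROM customers AS c
WHERE c.country_code IN (SELECT country_code FROM priority_countries);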

Index – When we talk about performance tuning in SQL, indexes play a very important role. We can create an index either implicitly or explicitly. We have to use indexes carefully because, on one hand, they improve performance when searching, sorting, and grouping records, while on the other hand, they consume disk space and add time when inserting, updating, and deleting data. There are two types of indexes: clustered and non-clustered. We can have only one clustered index per table, and whenever we create a primary key on a table, the database creates a clustered index implicitly. We can have multiple non-clustered indexes on a table. Whenever we create a unique key on a table, the database creates a non-clustered index.

Below are some best practices for creating indexes.

  • It is always recommended to create the clustered index before creating non-clustered indexes.
  • The integer data type works faster with an index than a string because an integer requires less space. That is why it is recommended to create the primary key on an integer column.
  • Indexing in an OLTP database – Avoid having many indexes in an OLTP (online transaction processing) database, since data is frequently inserted and modified and multiple indexes might have a negative impact on performance.
  • Indexing in an OLAP database – An OLAP (online analytical processing) database is mostly used for analytical purposes, so we commonly use SELECT statements to get the data. In this scenario, we can use more indexes on multiple columns without affecting performance (see the sketch after this list).
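
A minimal sketch of these points, using SQL Server-style syntax; the table and index names are hypothetical:

-- The PRIMARY KEY on an integer column implicitly creates the clustered index
CREATE TABLE orders (
    order_id    INT           NOT NULL PRIMARY KEY,
    customer_id INT           NOT NULL,
    order_date  DATE          NOT NULL,
    order_total DECIMAL(10,2) NOT NULL
);

-- Additional non-clustered index on columns used in frequent joins and filters
CREATE NONCLUSTERED INDEX IX_orders_customer_date
    ON orders (customer_id, order_date);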

Union vs Union All – UNION ALL is faster than UNION because UNION checks for duplicates and returns distinct values, while UNION ALL returns all records from both inputs. If you know that the two result sets contain no overlapping records, go for UNION ALL for better performance, as in the sketch below.
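
A minimal sketch, assuming two hypothetical yearly archive tables that are known not to overlap:

-- UNION ALL simply appends the rows; UNION would add a duplicate-removal step
SELECT order_id, order_total FROM orders_2021
UNION ALL
SELECT order_id, order_total FROM orders_2022;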

These are some best practices that we can follow to improve SQL performance.

Sneak Peak of the SQL Order Of Execution https://blogs.perficient.com/2022/11/22/sneak-peak-of-sql-order-of-execution/ https://blogs.perficient.com/2022/11/22/sneak-peak-of-sql-order-of-execution/#comments Tue, 22 Nov 2022 19:28:28 +0000 https://blogs.perficient.com/?p=322288

After writing SQL code for a few years now, I noticed I used to make mistakes related to the order of execution of SQL queries. So here I am with a new blog on that topic.

Let’s jump in to understand the SQL order of execution and learn practical, correct ways to write SQL code.

Flowchart

Let’s take an example to understand this better.

To do that, I am going to use two simple unnormalized tables: Citizen and City. They are described as follows:

The citizen table contains data on distinguished citizens and the identification number of the city they live in, and City is the table with city names and their respective identification numbers.

Let’s say that we want to know the names of only two cities, excluding San Bruno, where two or more citizens live. We also want the result ordered alphabetically.

This is the query to get the required information.

SELECT city.city_name AS "City"
FROM citizen
JOIN city
ON citizen.city_id = city.city_id
WHERE city.city_name != 'San Bruno'
GROUP BY city.city_name
HAVING COUNT(*) >= 2
ORDER BY city.city_name ASC
LIMIT 2

Query Process Steps

1. Getting Data (From, Join)
2. Row Filter (Where)
3. Grouping (Group by)
4. Group Filter (Having)
5. Return Expressions (Select)
6. Order & Paging (Order by & Limit / Offset)

Step 1: Getting Data (From, Join)

FROM citizen
JOIN city

The first step in the process is the execution of the FROM clause, followed by the JOIN clause. The result of these operations is a Cartesian product of our two tables.

Cartesian Product

Once FROM and JOIN have been executed, the processor gets the qualified rows based on the ON condition.

ON citizen.city_id = city.city_id

Step 2: Row Filter (Where)

After the qualified rows are obtained, they are passed on to the WHERE clause, which evaluates every row using conditional expressions. Rows that do not evaluate to true are removed from the set.

WHERE city.city_name != 'San Bruno'

Step 3: Grouping (Group by)

The next step is to execute the GROUP BY clause; it groups rows that have the same values into summary rows. After this point, all SELECT expressions will be evaluated per group instead of per row.

GROUP BY city.city_name

Step 4: Group Filter (Having)

The Having clause consists of a logical predicate; it is processed after the Group by and can no longer refer to individual rows, only to groups of rows.

HAVING COUNT(*) >= 2

The result of executing this operation keeps the set from the previous step unchanged, because every group contains two or more elements.

Step 5: Return Expressions (Select)

During this step, the processor evaluates what will be returned as the result of the query and whether there are functions to run on the data, such as DISTINCT, MAX, SQRT, DATE, LOWER, etc. In this case, the SELECT clause just returns the city names and aliases the city_name column with the identifier “City.”

SELECT city.city_name AS "City"

Step 6: Order (Order by) and Paging (Limit / Offset)

The final processing steps of the query deal with presentation ordering and the ability to limit the size of the result set. In our example, it is required to present a maximum of two records ordered alphabetically.

ORDER BY city.city_name ASC
LIMIT 2

We got the desired output (the names of only two cities, excluding San Bruno, where two or more citizens live, ordered alphabetically).

Conclusion

Knowing this order helps in writing code the right way, and it also helps in understanding and troubleshooting errors. For example, it explains why a column alias defined in SELECT cannot be referenced in WHERE, as in the sketch below.
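
A minimal sketch of that gotcha, using a hypothetical sales table; most engines reject the first form because WHERE is evaluated before SELECT:

-- Fails: the commission alias is not yet defined when WHERE runs
-- SELECT amount * 0.1 AS commission
-- FROM sales
-- WHERE commission > 100;

-- Works: repeat the expression (or wrap the query), since WHERE runs first
SELECT amount * 0.1 AS commission
FROM sales
WHERE amount * 0.1 > 100;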

I wish you the best of luck in your Data science / Data analysis endeavors!
