Best Practices to Design a Database
Jorge Chavez, Perficient Blogs
Wed, 14 Sep 2022
https://blogs.perficient.com/2022/09/14/best-practices-to-design-a-database/

Building scalable, high-performance applications starts with a solid database design. A database is a structured set of data held in a computer, organized so that it can be accessed in diverse ways and searched efficiently. Everything else around it is just additional detail. If the database is well designed, applications can file data and metadata into it and retrieve them as needed.

Every small decision at the start of the database design process has a big impact on the result. In past projects, I have seen clients implement an unoptimized design suggested by a product vendor's engineers and pay the price years later: the accumulated data had to be fixed manually, and the database had to be stressfully reevaluated against new requirements. This can happen very quickly if you don't account for the future state of the data. To help you avoid common mistakes that can affect your applications, here are 10 best practices to consider when designing a database.

1. Involve Product Owners and Stakeholders Early

Get agreement on expectations early, and consider the database's usability and the effort required to train users from their perspective.

2. Choose A Database Type

Depending on how structured (or flexible) your data is, you may want to choose between SQL and NoSQL database types. SQL databases are the standard for highly structured data with strong integrity guarantees; the most common ones are Oracle, SQL Server, MySQL, and SQLite. NoSQL databases offer more speed, scalability, and flexibility with fluid requirements, which better suits applications like machine learning, web analytics, social networks, and the Internet of Things (IoT). Some examples of NoSQL databases are RavenDB, Couchbase, and MongoDB.

3. Data Normalization is Good

The more complex the structure and needs of the business, the more critical it becomes to normalize data to achieve minimum redundancy.

Multi-valued attributes are attributes of an entity that may hold more than one value for the same primary key. For example, a gamer may own several gaming platforms or consoles, so “console” (or “game platform”) should be its own entity to allow normalization of the data. Reduce or avoid multi-valued and duplicated attributes as much as possible.
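As a quick sketch of this step (table and column names are hypothetical, and SQLite stands in for whatever engine you use), the multi-valued attribute becomes its own table keyed back to the gamer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Instead of one gamer table with a comma-separated "consoles" column,
# the multi-valued attribute becomes its own entity keyed back to the gamer.
conn.executescript("""
    CREATE TABLE gamer (
        gamer_id INTEGER PRIMARY KEY,
        name     TEXT NOT NULL
    );
    CREATE TABLE console (
        console_id INTEGER PRIMARY KEY,
        gamer_id   INTEGER NOT NULL REFERENCES gamer(gamer_id),
        platform   TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO gamer VALUES (1, 'Alice')")
conn.executemany("INSERT INTO console (gamer_id, platform) VALUES (?, ?)",
                 [(1, "PS5"), (1, "Switch")])

# Each platform is one row: no string parsing, no duplicated gamer data.
rows = conn.execute("""
    SELECT g.name, c.platform
    FROM gamer AS g JOIN console AS c ON c.gamer_id = g.gamer_id
    ORDER BY c.platform
""").fetchall()
print(rows)  # [('Alice', 'PS5'), ('Alice', 'Switch')]
```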

4. Focus on Transparency

Think of future database users during the design process. Avoid abbreviations, plurals, or random words when naming database objects; it is better to stick to consistent naming conventions. Change is a constant and the design will likely need to evolve, so don't make life difficult for those who may need to extend it.

5. Define Constraints to Maintain Data Integrity

Constraints will help you keep data integrity and avoid inconsistency and bugs in the software. Software applications usually have a validation layer that prevents some bad data from reaching the database, but you should enforce protection at the database level as well. Some recommendations to ensure data integrity are:

  • Always validate input data
  • Implement access controls
  • Keep an audit trail
  • Always back up data
  • Adopt security best practices
  • Educate your workforce
  • Use change control (file versioning) tools
  • Plan for business continuity
  • Be accurate

Depending on the database implementation, you can apply constraints to different database objects. In Oracle, for example, the constraint types are NOT NULL, UNIQUE, PRIMARY KEY, FOREIGN KEY, and CHECK, and each applies to specific objects such as tables and views.
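As an illustration, here is a minimal sketch in SQLite (a stand-in for any engine; the schema and names are invented) showing NOT NULL, UNIQUE, CHECK, PRIMARY KEY, and FOREIGN KEY constraints rejecting bad data at the database level:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FOREIGN KEYs only when enabled

# Hypothetical schema: each constraint rejects bad data at the database level.
conn.executescript("""
    CREATE TABLE account (
        account_id INTEGER PRIMARY KEY,               -- PRIMARY KEY: unique row identity
        email      TEXT NOT NULL UNIQUE,              -- NOT NULL + UNIQUE
        balance    REAL NOT NULL CHECK (balance >= 0) -- CHECK: domain rule
    );
    CREATE TABLE payment (
        payment_id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL
                   REFERENCES account(account_id)     -- FOREIGN KEY: referential integrity
    );
""")
conn.execute("INSERT INTO account VALUES (1, 'a@example.com', 100.0)")

rejected = []
for label, stmt in [
    ("UNIQUE",      "INSERT INTO account VALUES (2, 'a@example.com', 50.0)"),
    ("CHECK",       "INSERT INTO account VALUES (3, 'b@example.com', -1.0)"),
    ("FOREIGN KEY", "INSERT INTO payment VALUES (1, 999)"),
]:
    try:
        conn.execute(stmt)
    except sqlite3.IntegrityError:
        rejected.append(label)  # the engine itself refused the bad row

print(rejected)  # ['UNIQUE', 'CHECK', 'FOREIGN KEY']
```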

6. Document as much as needed

Although it might seem tedious, documentation is an essential part of database design. Document the design, entity-relationship schemas, triggers, and other database objects for future users and engineers.

7. Plan for Disaster

Disaster recovery, whether from natural or man-made disasters, is an important consideration before delving too deeply into the design. Some strategies to put in place to ensure that the database remains intact are:

  • Failover clustering
  • Automatic backups
  • Replication
  • Disaster Recovery as a Service (DRaaS) using cloud technologies

8. Keep Privacy Primary

Privacy should be a priority for business users and customers alike. Encrypt passwords, and assign an administrator with privacy training and qualifications. Vulnerabilities have a severe impact on data integrity, which in turn affects everything else in the business, so the database should be as closed as possible.

9. Optimize for Speed

Optimize by creating indexes for frequently run queries. Use a database analyzer to determine whether an index or a clustered index is necessary. Here are some database optimization tools that might prove useful depending on your requirements:

  • EverSQL – Free, supports MySQL, PostgreSQL, MongoDB databases, Windows, Linux, macOS operating systems, and cloud platforms.
  • Paessler PRTG Network Monitor – Does not optimize queries itself, but provides details about the analyzed queries and how to optimize them.
  • dbForge Studio – May help to identify bottlenecks and analyze queries to optimize.
  • Plan Explorer – Free, but only available on Windows.
  • Holistic.dev – Only supports PostgreSQL databases.
  • Database Performance Analyzer – Uses cloud and machine learning techniques to identify optimization points in the database.
  • ApexSQL – Affordable subscription for business. Allows analysis of SQL Server execution plan.
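If you just want a quick look at whether an index changes a query's plan, most engines have a built-in analyzer. A minimal sketch using SQLite's EXPLAIN QUERY PLAN (table and index names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE security (security_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO security VALUES (?, ?)",
                 [(i, f"sec-{i}") for i in range(1000)])

query = "SELECT name FROM security WHERE security_id = 500"

def plan(sql):
    # EXPLAIN QUERY PLAN rows end with a human-readable detail string
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan(query)  # full table scan: no index to use yet
conn.execute("CREATE INDEX idx_security_id ON security(security_id)")
after = plan(query)   # index search instead of a scan

print(before)
print(after)
```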

10. Keep the Database on Its Own Server

Host the database server on its own hardware, separate from the web server, to lower CPU usage. This also allows an additional layer of access security and keeps the data out of the reach of unauthorized users.

Final Thoughts

These are some of the best practices to consider. They are not a rigid ruleset, and every business has restrictions or needs that may not fit them all, so keep flexibility in mind: refine, redirect, or break these recommendations to best accommodate your needs and stay in line with your business goals.

How to Improve SQL Performance – Antipatterns
Jorge Chavez, Perficient Blogs
Wed, 29 Sep 2021
https://blogs.perficient.com/2021/09/29/how-to-improve-sql-performance-antipatterns/

Ever come close to writing a perfectly valid-looking SQL query only to see it take hours to run? SQL optimization might be what you need. For businesses and customers, time is limited and precious, and as software developers, QA engineers, and DB developers we should strive to improve the efficiency of our queries to save their time, and ours.

To improve the performance of a SQL query, first consider that the query goes through several steps at execution. It is parsed into a “parse tree” and analyzed to validate its syntactic and semantic requirements; the parser creates an internal representation of the input query, which is then passed on to the rewrite engine.

After these steps, the SQL optimizer tries to find the optimal execution plan, which defines the algorithm used for each operation and how the execution of the operations is orchestrated. The optimizer enumerates all possible execution plans, determines the quality or cost of each plan, takes information about the current database state, and then chooses the best one as the final execution plan. Because query optimizers can be imperfect, database users and administrators sometimes need to manually examine and tune the plans produced by the optimizer to get better performance.

The number of disk I/Os required to evaluate the plan, the plan's CPU cost, and the overall response time observed by the database client are essential factors in determining the quality, or cost, of each execution plan.

Next, the chosen execution plan is run, evaluated by the system’s execution engine, and the results of your query are returned.

The main “pain” points where optimization issues might be present are:

  • The WHERE clause
  • Any INNER JOIN or LEFT JOIN keywords
  • The HAVING clause

In this post, I will be focusing on the anti-patterns to try to avoid and the aspects to consider for improving the execution time of your queries.

WHERE Clause

Context is important. Whether you need to rewrite your query or not and how you would do it depends on the amount of data, the database, and the number of times you need to execute the query, among other things.

1. Only SELECT the columns you need, LIKE responsibly, and look out for DISTINCT results

SELECT Statement

Always remove unnecessary columns from your SELECT statement; make it a habit to pull only the data that serves your query's goal.

If you have correlated subqueries that use EXISTS, try to use a constant in the SELECT statement of the subquery instead of selecting the value of an actual column.

For example:

    SELECT security_id, name
    FROM security
    WHERE EXISTS
        (SELECT '1'
         FROM rating
         WHERE rating.security_id = security.security_id);

Tip: You can always consider getting rid of correlated subqueries by rewriting them with an INNER JOIN:

    SELECT DISTINCT security.security_id, security.name
    FROM security
    INNER JOIN rating ON rating.security_id = security.security_id;

(DISTINCT keeps the join from returning a security once per rating.)
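A small check of the rewrite, using SQLite and invented sample data; note that DISTINCT is needed for the join to match the EXISTS semantics when a security has several ratings:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE security (security_id INTEGER, name TEXT);
    CREATE TABLE rating   (security_id INTEGER, alpha TEXT);
""")
conn.executemany("INSERT INTO security VALUES (?, ?)",
                 [(1, "ACME"), (2, "GLOBEX"), (3, "INITECH")])
# security 1 has two ratings, security 3 has none
conn.executemany("INSERT INTO rating VALUES (?, ?)",
                 [(1, "A"), (1, "AA"), (2, "B")])

exists_q = """
    SELECT security_id, name
    FROM security
    WHERE EXISTS (SELECT '1' FROM rating
                  WHERE rating.security_id = security.security_id)
"""
# DISTINCT keeps the join from returning a security once per rating
join_q = """
    SELECT DISTINCT security.security_id, security.name
    FROM security
    INNER JOIN rating ON rating.security_id = security.security_id
"""
a = sorted(conn.execute(exists_q).fetchall())
b = sorted(conn.execute(join_q).fetchall())
print(a == b, a)  # True [(1, 'ACME'), (2, 'GLOBEX')]
```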

 

DISTINCT Clause

The SELECT DISTINCT statement returns only unique values for the columns included in your query. Avoid DISTINCT where possible, since it increases execution time.

LIKE Operator

When the pattern in a LIKE predicate starts with % or _, the database cannot use an index on that column, even if one is available.

If not used correctly, it might cause too many records to be retrieved that don’t necessarily satisfy your query goal. Your knowledge of the data stored in the database can help you formulate a pattern that will filter correctly through all the data to find only the rows that matter for your query.
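The leading-wildcard effect is easy to see in an engine's plan output. A sketch in SQLite (invented data; case_sensitive_like is a SQLite-specific switch that lets LIKE use a plain index):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA case_sensitive_like = ON")  # lets LIKE use a plain (BINARY) index
conn.execute("CREATE TABLE security (name TEXT)")
conn.executemany("INSERT INTO security VALUES (?)",
                 [(f"SEC{i:04d}",) for i in range(1000)])
conn.execute("CREATE INDEX idx_name ON security(name)")

def plan(sql):
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

prefix  = plan("SELECT name FROM security WHERE name LIKE 'SEC01%'")  # index range search
leading = plan("SELECT name FROM security WHERE name LIKE '%C0100'")  # full table scan

print(prefix)
print(leading)
```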

2. Limit Your Results

TOP, LIMIT, And ROWNUM Clauses

You can add the LIMIT or TOP clauses to your queries to set a maximum number of rows for the result set. Here are some examples:

 

SQL Server:

    SELECT TOP 3 *
    FROM security;

MySQL:

    SELECT *
    FROM security
    LIMIT 3;

Oracle:

    SELECT *
    FROM security
    ORDER BY security_id
    FETCH FIRST 3 ROWS ONLY;
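A minimal runnable sketch of row limiting (SQLite's LIMIT syntax, invented data); pairing LIMIT with ORDER BY keeps the result deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE security (security_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO security VALUES (?, ?)",
                 [(i, f"sec-{i}") for i in range(100)])

# Cap the result set: only 3 rows come back, no matter the table size.
rows = conn.execute(
    "SELECT security_id FROM security ORDER BY security_id LIMIT 3"
).fetchall()
print(rows)  # [(0,), (1,), (2,)]
```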

In SQL Server you can also specify a percentage instead of a fixed number by changing the first line of the query to SELECT TOP 50 PERCENT *. Oracle supports FETCH FIRST 50 PERCENT ROWS ONLY, while MySQL has no direct equivalent; filtering on RAND() returns a random sample of roughly half the rows, not a deterministic top half:

SQL Server:

    SELECT TOP 50 PERCENT *
    FROM security;

MySQL (random sample):

    SELECT *
    FROM security
    WHERE RAND() < 0.5;

Oracle:

    SELECT *
    FROM security
    ORDER BY security_id
    FETCH FIRST 50 PERCENT ROWS ONLY;

Additionally, in Oracle you can filter on the ROWNUM pseudo-column, which is roughly equivalent to using LIMIT in your query:

    SELECT *
    FROM rating
    WHERE alpha = 'A' AND ROWNUM <= 3;

Data Type Conversions

Whenever possible, use the smallest data types possible for higher efficiency. However, consider that whenever you add data type conversion to your query, you increase the execution time.

The best alternative is to avoid data type conversions as much as possible. It is not always possible to remove them from your queries, but be deliberate when you include one, and test the effect of the addition before you run the query.

3. Keep it simple

Queries can become complex quickly. Be careful and consider the next tips to keep them simple.

OR Operator

When you use the OR operator in your query, the database likely cannot use an index.

Remember that an index is a data structure that improves the speed of the data retrieval in your database table, but it comes at a cost: there will be additional writes and additional storage space is needed to maintain the index data structure. Indexes are used to quickly locate or lookup data without having to search every row in a database every time the database table is accessed. Indexes can be created by using one or more columns in a database table.

If you don’t make use of the indexes that the database includes, your query will inevitably take longer to run. That’s why it’s best to look for alternatives to using the OR operator in your query.

Consider the following query:

    SELECT security_id, name
    FROM security
    WHERE security_id = 123456
       OR security_id = 678910
       OR security_id = 345678;

You can replace the operator with one of the following approaches:

A condition with the IN operator; or

    SELECT security_id, name
    FROM security
    WHERE security_id IN (123456, 678910, 345678);

Multiple SELECT statements combined with UNION.

    SELECT security_id, name
    FROM security
    WHERE security_id = 123456
    UNION
    SELECT security_id, name
    FROM security
    WHERE security_id = 678910
    UNION
    SELECT security_id, name
    FROM security
    WHERE security_id = 345678;
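All three forms return the same rows, which a small SQLite sketch (invented data) can confirm:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE security (security_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO security VALUES (?, ?)",
                 [(123456, "A"), (678910, "B"), (345678, "C"), (999999, "D")])

or_q = """SELECT security_id, name FROM security
          WHERE security_id = 123456 OR security_id = 678910
             OR security_id = 345678"""
in_q = """SELECT security_id, name FROM security
          WHERE security_id IN (123456, 678910, 345678)"""
union_q = """SELECT security_id, name FROM security WHERE security_id = 123456
             UNION
             SELECT security_id, name FROM security WHERE security_id = 678910
             UNION
             SELECT security_id, name FROM security WHERE security_id = 345678"""

# Sort each result so the three answer sets can be compared directly.
results = [sorted(conn.execute(q).fetchall()) for q in (or_q, in_q, union_q)]
print(results[0] == results[1] == results[2])  # True: all three are equivalent
```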

Tip: Be careful not to use UNION unnecessarily, because it passes over the same table multiple times and increases execution time. Alternatives are reformulating the query so that all conditions are placed in a single SELECT instruction, or using an OUTER JOIN instead of UNION.

NOT Operator

When your query contains the NOT operator, the index is likely not used, just like with the OR operator, which will inevitably slow your query down. Consider the following query:

    SELECT security_id, name
    FROM security
    WHERE NOT (creation_year > 2020);

This query will run slower than you might expect, mainly because it is formulated in a more complex way than it needs to be. In cases like this, look for an alternative: replace NOT with comparison operators such as >, <>, or <=. The example above can indeed be rewritten as:

    SELECT security_id, name
    FROM security
    WHERE creation_year <= 2020;
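A quick equivalence check of the rewrite in SQLite (invented data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE security (security_id INTEGER, creation_year INTEGER)")
conn.executemany("INSERT INTO security VALUES (?, ?)",
                 [(i, 2015 + i) for i in range(10)])  # years 2015..2024

negated = conn.execute(
    "SELECT security_id FROM security WHERE NOT (creation_year > 2020)").fetchall()
rewritten = conn.execute(
    "SELECT security_id FROM security WHERE creation_year <= 2020").fetchall()
print(negated == rewritten)  # True: the rewrite is logically equivalent
```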

AND Operator

The AND operator is another one that may prevent index use and slow your query down when used in an overly complex and inefficient way, as in the example below:

    SELECT security_id, name
    FROM security
    WHERE creation_year >= 2011 AND creation_year <= 2020;

It's better to rewrite this query using the BETWEEN operator:

    SELECT security_id, name
    FROM security
    WHERE creation_year BETWEEN 2011 AND 2020;
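The two predicates are interchangeable, since BETWEEN bounds are inclusive. A small SQLite check (invented data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE security (security_id INTEGER, creation_year INTEGER)")
conn.executemany("INSERT INTO security VALUES (?, ?)",
                 [(i, 2005 + i) for i in range(20)])  # years 2005..2024

and_q = ("SELECT security_id FROM security "
         "WHERE creation_year >= 2011 AND creation_year <= 2020")
between_q = ("SELECT security_id FROM security "
             "WHERE creation_year BETWEEN 2011 AND 2020")  # bounds are inclusive

a = conn.execute(and_q).fetchall()
b = conn.execute(between_q).fetchall()
print(a == b, len(a))  # True 10
```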

ANY and ALL Operators

Similar to the previously discussed OR, AND, and NOT operators, indexes configured in the database will not be used when using the ANY and ALL operators. Alternatives that will come in handy here are aggregation functions like MIN or MAX.

Tip: when you use the proposed alternatives, be aware that aggregation functions like SUM, AVG, MIN, and MAX over many rows can result in a long-running query. In such cases, try to minimize the number of rows to handle or pre-calculate those values. Once again, it is important to be aware of your environment and your query goal when deciding which query to use.

Isolate Columns in Conditions

When a column is used in a calculation or a scalar function, the index is not used either. A possible solution is to isolate the column so that it is no longer part of the calculation or the function. Consider the following example:

    SELECT security_id, name
    FROM security
    WHERE creation_date + 10 = 2020;

Instead, try reconsidering the calculation and rewriting the query to something like this:

    SELECT security_id, name
    FROM security
    WHERE creation_date = 2010;
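The plan difference is visible in any engine's analyzer output. A SQLite sketch (invented data; creation_date is stored as a plain year to mirror the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE security (security_id INTEGER, creation_date INTEGER)")
conn.executemany("INSERT INTO security VALUES (?, ?)",
                 [(i, 2000 + i % 25) for i in range(1000)])
conn.execute("CREATE INDEX idx_creation ON security(creation_date)")

def plan(sql):
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# The expression wraps the column, so the index on creation_date is unusable.
wrapped = plan("SELECT security_id FROM security WHERE creation_date + 10 = 2020")
# Isolating the column lets the optimizer search the index directly.
isolated = plan("SELECT security_id FROM security WHERE creation_date = 2010")

print(wrapped)   # SCAN ...
print(isolated)  # SEARCH ... USING INDEX idx_creation ...
```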

4. No Brute Force

This last tip means that you shouldn’t try to restrict the query too much because it can affect its performance. This is especially true for joins and the HAVING clause.

JOIN

The Order Of Tables

When you join two tables, it can be important to consider the order of the tables in your join. If you notice that one table is considerably larger than the other one, you might want to rewrite your query so that the biggest table is placed last in the join.

Redundant Conditions on Joins

 

When you add too many conditions to your joins, you force SQL to choose a particular path, which may not be the most efficient one.

HAVING Clause

HAVING is typically used with the GROUP BY clause to restrict the groups of returned rows to only those that meet certain conditions. However, if you use this clause in your query, the index is not used, which can result in longer execution times.

If you’re looking for an alternative, consider using the WHERE clause. Consider the following queries:

    SELECT type, COUNT(*)
    FROM security
    WHERE type IN ('FI', 'EQ')
    GROUP BY type
    ORDER BY type;

    SELECT type, COUNT(*)
    FROM security
    GROUP BY type
    HAVING type IN ('FI', 'EQ')
    ORDER BY type;

The first query uses the WHERE clause to restrict the number of rows that need to be summed, whereas the second query sums up all the rows in the table and then uses HAVING to throw away the sums it calculated. In these types of cases, the alternative with the WHERE clause is the better one, as you don’t waste any resources.

You will see that this is not about limiting the result set, but instead about limiting the intermediate number of records within a query.

Note that the difference between these two clauses lies in the fact that the WHERE clause introduces a condition on individual rows, while the HAVING clause introduces a condition on aggregations, that is, on results of a selection where a single value such as MIN, MAX, or SUM has been produced from multiple rows.
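A small SQLite sketch (invented data) confirming that the two queries return the same result while filtering at different stages:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE security (security_id INTEGER, type TEXT)")
conn.executemany("INSERT INTO security VALUES (?, ?)",
                 [(i, t) for i, t in enumerate(["FI", "EQ", "CASH"] * 10)])

# Filters rows before aggregation: only FI and EQ rows are ever counted.
where_q = """SELECT type, COUNT(*) FROM security
             WHERE type IN ('FI', 'EQ')
             GROUP BY type ORDER BY type"""
# Aggregates every group first, then throws away the unwanted groups.
having_q = """SELECT type, COUNT(*) FROM security
              GROUP BY type
              HAVING type IN ('FI', 'EQ') ORDER BY type"""

a = conn.execute(where_q).fetchall()
b = conn.execute(having_q).fetchall()
print(a == b, a)  # True [('EQ', 10), ('FI', 10)]
```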

Evaluating the cost of execution plans and writing and rewriting queries is not an easy job when they need to be as efficient as possible. Avoiding anti-patterns and considering alternatives is part of your responsibility when you write queries to run on databases in a professional environment. Hopefully, this post helps beginners improve their SQL skills. If you are interested in learning more about SQL and query optimization, I recommend the edX course Querying with Transact-SQL, which will give you more insight into SQL and why it is important for providing value to users.

 

Sources:

  1. https://courses.edx.org/courses/course-v1:Microsoft+DAT201x+2015_T4/courseware/ae9138f9a2d54963a0d1e99d4cf1c433/d11902daaf0c42319e5e00da854839e3/

  2. https://www.quest.com/community/blogs/b/database-management/posts/sql-query-optimization-best-practices-for-improved-performance

  3. https://www.sqlshack.com/query-optimization-techniques-in-sql-server-the-basics/
