8.1 Why is denormalization necessary?
The most computationally expensive operation between large tables is the join. Accordingly, if in one query it is necessary to "ventilate" several tables consisting of many millions of rows, then the DBMS will spend a lot of time on such processing.
The user at this time can move away to drink coffee. The interactivity of processing practically disappears and approaches that of batch processing. Even worse, in batch mode, the user receives all the data requested the day before in the morning and calmly works with them, preparing new requests for the evening.
To avoid the situation of heavy joins, tables are denormalized. But not anyhow. There are some rules that allow you to consider transactionally denormalized tables as "normalized" according to the rules for constructing tables for data warehouses.
There are two main schemes that are considered “normal” in analytical processing: “snowflake” and “star”. The names reflect the essence well and follow directly from the picture of the related tables.
In both cases, the so-called fact tables are the central element of the schema, containing the events, transactions, documents, and other interesting things of interest to the analyst. But if in a transactional database one document is “smeared” across several tables (at least two: headers and rows of content), then in the fact table one document, more precisely, each of its rows or a set of grouped rows, corresponds to one record.
This can be done by denormalizing the two tables above.
8.2 Denormalization example
Now you can evaluate how much easier it will be for the DBMS to execute a query, for example, of the following type: to determine the volume of flour sales to the clients of Pirozhki LLC and Vatrushki CJSC for the period.
In a normalized transactional database:
SELECT SUM(dl.qty) AS total qty, SUM(dl.price) AS total amount, c.name FROM docs d INNER JOIN doc lines dl ON d.id doc = dl.id doc INNER JOIN customers c ON d.id customer = c.id customer INNER JOIN products p ON dl.id product = p.id product WHERE c.name IN (’Pirozhki LLC’, ’Vatrushki CJSC’) AND p.name = ’Flour’ AND d.date BETWEEN ’2014-01-01’ AND ’2014-02-01’ GROUP BY c.name
In the analytical database:
SELECT SUM(s.qty) AS total_qty, SUM(s.amount) AS total_amount, c.name FROM sales s INNER JOIN customers c ON d.id_customer = c.id_customer INNER JOIN products p ON dl.id_product = p.id_product WHERE c.name IN ('Pirozhki LLC', 'Vatrushki CJSC') AND p.name = 'Flour' AND s.date BETWEEN '2014-01-01' AND '2014-02-01' GROUP BY c.name
Instead of a heavy join between two tables of documents and their composition with millions of rows, the DBMS gets direct work with the fact table and light joins with small auxiliary tables, which you can also do without, knowing the identifiers.
SELECT SUM(s.qty) AS total_qty, SUM(s.amount) AS total_amount, s.id_customer FROM sales s WHERE s.id_customer IN (1025, 20897) AND s.id_product = 67294 AND s.date BETWEEN '2014-01-01' AND '2014-02-01' GROUP BY s.id_customer
Let's return to the "star" and "snowflake" schemes. Behind the scenes of the first picture were tables of customers, their groups, stores, sellers and, in fact, goods. When denormalized, these tables, called dimensions, are also joined to the fact table. If the fact table refers to dimension tables that have links to other dimensions (dimensions of the second level and above), then such a schema is called a "snowflake".
As you can see, for queries that include filtering by client groups, you have to make an additional connection.
SELECT sum(amount) FROM sales s INNER JOIN customers c ON s.id_customer = c.id_customer WHERE c.id_customer_group IN (1, 2, 10, 55)
In this case, denormalization can continue and drop the second-level dimension to the first, making it easier to query the fact table.
A schema in which a fact table only refers to dimensions that do not have a second level is called a star schema. The number of measurement tables corresponds to the number of "rays" in the star.
The Star schema completely eliminates the hierarchy of dimensions and the need to join the corresponding tables in a single query.
SELECT sum(amount) FROM sales s WHERE s.id_customer_group IN (1, 2, 10, 55)
The downside of denormalization is always redundancy , which causes an increase in the size of the database in both transactional and analytical applications. Let's calculate an approximate delta on the above example of converting "snowflake" to "star".
In some DBMS, such as Oracle, there are no special integer types at the level of database schema definitions, you must use the generic boolean type
numeric(N), where N is the number of stored bits. The storage size of such a number is calculated using a special formula given in the documentation for physical data storage, and, as a rule, it exceeds that of low-level types like "16 bit integer" by 1-3 bytes.
Suppose the sales table does not use data compression and contains about 500 million rows, and the number of customer groups is about 1000. In this case, we can use a short integer (shortint, smallint) occupying 2 bytes as an identifier type
We will assume that our DBMS supports a two-byte integer type (for example, PostgreSQL, SQL Server, Sybase and others). Then adding the corresponding column
id_customer_groupto the sales table will increase its size by at least
500 000 000 * 2 = 1 000 000 000 byte ~ 1 GByte.
8.3 When is denormalization needed?
Let's look at some common situations where denormalization can be useful.
Large number of table joins
In queries to a fully normalized database, you often have to join up to a dozen or even more tables. And each connection is a very resource-intensive operation. As a result, such requests consume server resources and are performed slowly.
In such a situation, it can help:
- denormalization by reducing the number of tables. It is better to combine into one several tables that are small in size, containing rarely changed (as they often say, conditionally constant or reference) information, and information that is closely related in meaning.
- In general, if you need to join more than five or six tables in a large number of queries, you should consider denormalizing the database.
- Denormalization by adding an additional field to one of the tables. In this case, data redundancy appears, additional actions are required to maintain the integrity of the database.
Often, queries are slow and consume a lot of resources, in which some complex calculations are performed, especially when using groupings and aggregate functions (Sum, Max, etc.). Sometimes it makes sense to add 1-2 additional columns to the table containing frequently used (and difficult to calculate) calculated data.
Suppose you want to determine the total cost of each order. To do this, you must first determine the cost of each product (according to the formula "number of product units" * "unit price of the product" - discount). After that, you need to group the costs by orders.
Executing this query is quite complex and, if the database stores information about a large number of orders, can take a long time. Instead of executing such a query, you can determine its cost at the stage of placing an order and store it in a separate column of the orders table. In this case, to obtain the desired result, it is enough to extract the pre-calculated values from this column.
Creating a column that contains precalculated values saves a lot of time when running a query, but requires you to update the data in that column in a timely manner.
If we have large tables in the database that contain long fields (Blob, Long, etc.), then we can seriously speed up the execution of queries to such a table if we move the long fields to a separate table. We want, say, to create a catalog of photos in the database, including storing photos themselves in blob fields (professional quality, high resolution, and appropriate size). From the point of view of normalization, the following table structure would be absolutely correct:
- Photo ID
- Author ID
- Camera Model ID
- the photo itself (blob field)
And now let's imagine how long the query will run, counting the number of photos taken by any author ...
The correct solution (albeit violating the principles of normalization) in such a situation would be to create another table consisting of only two fields - the photo ID and a blob field with the photo itself. Then selections from the main table (in which there is no longer a huge blob field) will go instantly, but when we want to see the photo itself, well, let's wait ...
How to determine when denormalization is justified?
8.4 Pros and cons of denormalization
One way to determine whether certain steps are justified is to conduct an analysis in terms of costs and possible benefits. How much will a denormalized data model cost?
Determine the requirements (what we want to achieve) → determine the data requirements (what we need to follow) → find the minimum step that satisfies these requirements → calculate the implementation costs → implement.
Costs include physical aspects such as disk space, the resources required to manage this structure, and lost opportunities due to the time delays associated with maintaining this process. You have to pay for denormalization. A denormalized database increases data redundancy, which can improve performance but require more effort to control related data. The process of creating applications will become more difficult, as the data will be repeated and harder to track. In addition, the implementation of referential integrity is not easy - related data is divided into different tables.
Benefits include faster query performance and the ability to get a faster response. You can also reap other benefits, including increased throughput, customer satisfaction, and productivity, as well as more efficient use of external developer tools.
Request Rate and Performance Consistency
For example, 72% of the 1,000 queries generated daily by an enterprise are summary-level queries, not drill-down queries. When using a summary table, queries run in approximately 6 seconds instead of 4 minutes, resulting in 3,000 minutes less processing time. Even after adjusting for the 100 minutes that must be spent maintaining the pivot tables each week, that saves 2,500 minutes per week, which justifies the creation of the pivot table. Over time, it may happen that most of the queries will not be addressed to summary data, but to detailed data. The fewer queries that use the summary table, the easier it is to drop it without affecting other processes.
The criteria listed above are not the only ones to consider when deciding whether to take the next step in optimization. Other factors need to be considered, including business priorities and end user needs. Users must understand how, from a technical standpoint, the system architecture is affected by the requirement of users who want all requests to be completed in a few seconds. The easiest way to achieve this understanding is to outline the costs associated with creating and managing such tables.
8.5 How to competently implement denormalization.
Save detailed tables
In order not to limit the capabilities of the database that are important to the business, it is necessary to adopt a strategy of coexistence, not replacement, that is, keep detailed tables for deep analysis, adding denormalized structures to them. For example, the hit counter. For business, you need to know the number of visits to a web page. But for the analysis (by period, by country...) we will very likely need detailed data - a table with information about each visit.
It is possible to denormalize a database structure and still enjoy the benefits of normalization by using database triggers to preserve the
integrityintegrity of duplicate data.
For example, when adding a calculated field, each of the columns on which the calculated field depends, is hung up with a trigger that calls a single stored procedure (this is important!), Which writes the necessary data to the calculated field. It is only necessary not to skip any of the columns on which the calculated field depends.
If you do not use built-in triggers and stored procedures, then application developers should take care of ensuring the consistency of data in a denormalized database.
By analogy with triggers, there should be one function that updates all fields that depend on the field being changed.
When denormalizing, it is important to maintain a balance between increasing the speed of the database and increasing the risk of inconsistent data, between making life easier for programmers writing
Select-s, and complicating the task of those who provide database population and data updates. Therefore, it is necessary to denormalize the database very carefully, very selectively, only where it is indispensable.
If it is impossible to calculate the pros and cons of denormalization in advance, then initially it is necessary to implement a model with normalized tables, and only then, to optimize problematic queries, carry out denormalization.
It is important to introduce denormalization gradually and only for those cases where there are repeated fetches of related data from different tables. Remember, when duplicating data, the number of records will increase, but the number of reads will decrease. It is also convenient to store calculated data in columns to avoid unnecessary aggregate selections.