5.1 The issue of simultaneity

Let's start with a little distant theory.

Any information system (or simply, an application) that programmers create consists of several typical blocks, each of which provides a part of the necessary functionality. For example, the cache is used to remember the result of a resource-intensive operation to ensure faster reading of data by the client, stream processing tools allow you to send messages to other components for asynchronous processing, and batch processing tools are used to "rake" the accumulated volumes of data with some periodicity. .

And in almost every application, databases (DBs) are involved in one way or another, which usually perform two functions: store data when received from you and later provide them to you upon request. Rarely does anyone think of creating their own database, because there are already many ready-made solutions. But how do you choose the right one for your application?

So, let's imagine that you have written an application with a mobile interface that allows you to load a previously saved list of tasks around the house - that is, read from the database, and supplement it with new tasks, as well as prioritize each specific task - from 1 ( highest) to 3 (lowest). Let's say your mobile application is used by only one person at a time. But now you dared to tell your mother about your creation, and now she has become the second regular user. What happens if you decide at the same time, right in the same millisecond, to set some task - "wash the windows" - to a different degree of priority?

In professional terms, your and mother's database queries can be considered as 2 processes that made a query to the database. A process is an entity in a computer program that can run on one or more threads. Typically, a process has a machine code image, memory, context, and other resources. In other words, the process can be characterized as the execution of program instructions on the processor. When your application makes a request to the database, we are talking about the fact that your database processes the request received over the network from one process. If there are two users sitting in the application at the same time, then there can be two processes at any particular moment in time.

When some process makes a request to the database, it finds it in a certain state. A stateful system is a system that remembers previous events and stores some information, which is called "state". A variable declared as integercan have a state of 0, 1, 2, or say 42. Mutex (mutual exclusion) has two states: locked or unlocked , just like a binary semaphore ("required" vs. "released") and generally binary (binary) data types and variables that can have only two states - 1 or 0.

Based on the concept of state, several mathematical and engineering structures are based, such as a finite automaton - a model that has one input and one output and is in one of a finite set of states at each moment of time - and the “state” design pattern, in which an object changes behavior depending on the internal state (for example, depending on what value is assigned to one or another variable).

So, most objects in the machine world have some state that can change over time: our pipeline, which processes a large data packet, throws an error and becomes failed , or the Wallet object property, which stores the amount of money left in the user's account, changes after payroll receipts.

A transition (“transition”) from one state to another—say, from in progress to failed —is called an operation. Probably, everyone knows the CRUD operations - create, read, update, deleteor similar HTTP methods - POST, GET, PUT, DELETE. But programmers in their code often give other names to operations, because an operation can be more complex than just reading a certain value from the database - it can also check the data, and then our operation, which has taken the form of a function, will be called, for example, And validate()who performs these operations-functions? processes already described.

A little more, and you will understand why I describe the terms in such detail!

Any operation - be it a function, or, in distributed systems, sending a request to another server - has 2 properties: the invocation time and the completion time (completion time) , which will be strictly greater than the invocation time (researchers from Jepsen proceed from the theoretical assumptions that both of these timestamps will be given imaginary, fully synchronized, globally available clocks).

Let's imagine our to-do list application. You make a request to the database through the mobile interface in 14:00:00.014, and your mother in 13:59:59.678(that is, 336 milliseconds before) updated the to-do list through the same interface, adding washing dishes to it. Taking into account the network delay and the possible queue of tasks for your database, if, in addition to you and your mother, all your mother's friends also use your application, the database can execute mother's request after it processes yours. In other words, there is a chance that two of your requests, as well as requests from your mother's girlfriends, will be sent to the same data at the same time (concurrently).

So we have come to the most important term in the field of databases and distributed applications - concurrency. What exactly can the simultaneity of two operations mean? If some operation T1 and some operation T2 are given, then:

  • T1 can be started before the start time of execution T2, and finished between the start and end time of T2
  • T2 can be started before the start time of T1, and finished between the start and end of T1
  • T1 can be started and finished between the start and end time of T1 execution
  • and any other scenario where T1 and T2 have some common execution time

It is clear that within the framework of this lecture, we are talking primarily about queries entering the database and how the database management system perceives these queries, but the term concurrency is important, for example, in the context of operating systems. I won't deviate too far from the topic of this article, but I think it's important to mention that the concurrency we're talking about here is not related to the dilemma of concurrency and concurrency and their difference, which is discussed in the context of operating systems and high-performance. computing. Parallelism is one way to achieve concurrency in an environment with multiple cores, processors, or computers. We are talking about concurrency in the sense of simultaneous access of different processes to common data.

And what, in fact, can go wrong, purely theoretically?

When working on shared data, numerous problems related to concurrency, also called "race conditions", can occur. The first problem occurs when a process receives data that it shouldn't have received: incomplete, temporary, canceled, or otherwise "incorrect" data. The second problem is when the process receives stale data, that is, data that does not correspond to the last saved state of the database. Let's say some application has withdrawn money from a user's account with a zero balance, because the database returned the account status to the application, not taking into account the last withdrawal of money from it, which happened just a couple of milliseconds ago. The situation is so-so, isn't it?

5.2 Transactions Came to Save Us

In order to solve such problems, the concept of a transaction appeared - a certain group of sequential operations (state changes) with a database, which is a logically single operation. I will give an example with a bank again - and not by chance, because the concept of a transaction appeared, apparently, precisely in the context of working with money. The classic example of a transaction is the transfer of money from one bank account to another: you need to first withdraw the amount from the source account and then deposit it into the target account.

For this transaction to be carried out, the application will need to perform several actions in the database: checking the sender's balance, blocking the amount on the sender's account, adding the amount to the recipient's account, and deducting the amount from the sender. There will be several requirements for such a transaction. For example, the application cannot receive outdated or incorrect information about the balance - for example, if at the same time a parallel transaction ended in an error halfway through, and the funds were not debited from the account - and our application has already received information that the funds were written off.

To solve this problem, such a property of a transaction as “isolation” was called upon: our transaction is executed as if there were no other transactions being performed at the same moment. Our database performs concurrent operations as if it were executing them one after another, sequentially - in fact, the highest isolation level is called Strict Serializable . Yes, the highest, which means that there are several levels.

"Stop," you say. Hold your horses, sir.

Let's remember how I described that each operation has a call time and an execution time. For convenience, you can consider calling and executing as 2 actions. Then the sorted list of all call and execution actions can be called the history of the database. Then the transaction isolation level is a set of histories. We use isolation levels to determine which stories are "good". When we say that a story "breaks serializability" or "is not serializable", we mean that the story is not in the set of serializable stories.

To make it clear what kind of stories we are talking about, I will give examples. For example, there is such a kind of history - intermediate read . It occurs when transaction A is allowed to read data from a row that has been modified by another running transaction B and has not yet been committed ("not committed") - that is, in fact, the changes have not yet been finally committed by transaction B, and it can at any time cancel them. And, for example, aborted read is just our example with a canceled withdrawal transaction

There are several possible anomalies. That is, anomalies are some kind of undesirable data state that can occur during competitive access to the database. And in order to avoid certain unwanted states, databases use different levels of isolation - that is, different levels of data protection from unwanted states. These levels (4 pieces) were listed in the ANSI SQL-92 standard.

The description of these levels seems vague to some researchers, and they offer their own, more detailed, classifications. I advise you to pay attention to the already mentioned Jepsen, as well as the Hermitage project, which aims to clarify exactly what isolation levels are offered by specific DBMS, such as MySQL or PostgreSQL. If you open the files from this repository, you can see what sequence of SQL commands they use to test the database for certain anomalies, and you can do something similar for the databases you are interested in). Here's one example from the repository to keep you interested:

-- Database: MySQL

-- Setup before test
create table test (id int primary key, value int) engine=innodb;
insert into test (id, value) values (1, 10), (2, 20);

-- Test the "read uncommited" isolation level on the "Intermediate Reads" (G1b) anomaly
set session transaction isolation level read uncommitted; begin; -- T1
set session transaction isolation level read uncommitted; begin; -- T2
update test set value = 101 where id = 1; -- T1
select * from test; -- T2. Shows 1 => 101
update test set value = 11 where id = 1; -- T1
commit; -- T1
select * from test; -- T2. Now shows 1 => 11
commit; -- T2

-- Result: doesn't prevent G1b

It is important to understand that for the same database, as a rule, you can choose one of several types of isolation. Why not choose the strongest insulation? Because, like everything in computer science, the chosen isolation level should correspond to a trade-off that we are ready to make - in this case, a trade-off in execution speed: the stronger the isolation level, the slower the requests will be processed. To understand what level of isolation you need, you need to understand the requirements for your application, and to understand whether the database you have chosen offers this level, you will have to look into the documentation - for most applications this will be enough, but if you have some particularly tight requirements, it is better to arrange a test like what the guys from the Hermitage project do.

5.3 "I" and other letters in ACID

Isolation is basically what people mean when they talk about ACID in general. And it is for this reason that I began the analysis of this acronym with isolation, and did not go in order, as those who try to explain this concept usually do. Now let's look at the remaining three letters.

Recall again our example with a bank transfer. A transaction to transfer funds from one account to another includes a withdrawal operation from the first account and a replenishment operation on the second. If the replenishment operation of the second account failed, you probably do not want the withdrawal operation from the first account to occur. In other words, either the transaction succeeds completely, or it does not occur at all, but it cannot be made only for some part. This property is called "atomicity", and it's an "A" in ACID.

When our transaction is executed, then, like any operation, it transfers the database from one valid state to another. Some databases offer so-called constraints - that is, rules that apply to stored data, such as primary or secondary keys, indexes, default values, column types, etc. So, when making a transaction, we must be sure that all these constraints will be fulfilled.

This guarantee is called "consistency" and a letter Cin ACID (not to be confused with consistency from the world of distributed applications, which we will talk about later). I will give a clear example for consistency in the sense of ACID: an application for an online store wants to add ordersa row to the table, and the ID from the table product_idwill be indicated in the column - typical .productsforeign key

If the product, say, was removed from the assortment, and, accordingly, from the database, then the row insert operation should not happen, and we will get an error. This guarantee, compared to others, is a bit far-fetched, in my opinion - if only because the active use of constraints from the database means shifting responsibility for the data (as well as partial shifting of business logic, if we are talking about such a constraint as CHECK ) from the application to the database, which, as they say now, is just so.

And finally, it remains D- "resistance" (durability). A system failure or any other failure should not lead to loss of transaction results or database content. That is, if the database replied that the transaction was successful, then this means that the data was recorded in non-volatile memory - for example, on a hard disk. This, by the way, does not mean that you will immediately see the data on the next read request.

Just the other day, I was working with DynamoDB from AWS (Amazon Web Services), and sent some data for saving, and after receiving an answer HTTP 200(OK), or something like that, I decided to check it - and did not see this data in the database for the next 10 Seconds. That is, DynamoDB committed my data, but not all nodes instantly synchronized to get the latest copy of the data (although it may have been in the cache). Here we again climbed into the territory of consistency in the context of distributed systems, but the time to talk about it still has not come.

So now we know what ACID guarantees are. And we even know why they are useful. But do we really need them in every application? And if not, when exactly? Do all DBs offer these guarantees, and if not, what do they offer instead?