8.1 Transaction IDs

It is designated as XID or TxID (if there is a difference between the two, let me know). Timestamps can be used as TxIDs, which comes in handy if we want to restore all actions up to some point in time. A problem arises if the timestamp is not granular enough: two transactions can then end up with the same ID.

Therefore, the most reliable option is to generate UUIDs and use them as transaction IDs. In Python this is very easy:

>>> import uuid 
>>> str(uuid.uuid4()) 
'f50ec0b7-f960-400d-91f0-c42a6d44e3d0' 
>>> str(uuid.uuid4()) 
'd15bed89-c0a5-4a72-98d9-5507ea7bc0ba' 

There is also an option to hash a set of transaction-defining data and use this hash as the TxID.
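
As a rough sketch of the hashing option: the function name make_txid and the fields of the transfer below are my own assumptions for illustration, not part of any library. The idea is to serialize the transaction-defining data in a canonical form and hash it.

```python
import hashlib
import json


def make_txid(payload: dict) -> str:
    """Derive a deterministic TxID from the data that defines a transaction."""
    # Serialize with sorted keys so that key order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical transfer: the same data always yields the same TxID.
transfer = {"from": "alice", "to": "bob", "amount": "10.00", "request_id": 42}
txid = make_txid(transfer)
same_txid = make_txid({"request_id": 42, "amount": "10.00", "to": "bob", "from": "alice"})
other_txid = make_txid({**transfer, "amount": "10.01"})
```

Because the ID is derived from the data, a repeated request can be recognized as a duplicate. The flip side is that two genuinely different transactions with identical fields would collide, so the hashed data should include something that tells them apart (here, the hypothetical request_id).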

8.2 Retries

If we know that a certain function or program is idempotent, then we can and should repeat the call when it fails. We simply have to be prepared for some operations to fail: modern applications are distributed across networks and hardware, so an error should be treated not as an exception but as the norm. It can be caused by a server crash, a network failure, or an overloaded remote application. How should our application behave? That's right: retry the operation.

Since a piece of code can say more than a whole page of words, let's use one example to see how a basic retry mechanism should ideally work. I'll demonstrate it with the Tenacity library (it is so well designed that even if you don't plan to use it, the example should show you how a retry mechanism can be designed):

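import logging
import random
import sys
from tenacity import RetryError, retry, stop_after_attempt, stop_after_delay, wait_exponential, retry_if_exception_type, before_log

logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
logger = logging.getLogger(__name__)

@retry(
    # Give up after 10 seconds or 5 attempts, whichever comes first.
    stop=(stop_after_delay(10) | stop_after_attempt(5)),
    # Exponential backoff, with each wait clamped to between 4 and 10 seconds.
    wait=wait_exponential(multiplier=1, min=4, max=10),
    # Retry only on IOError; any other exception propagates immediately.
    retry=retry_if_exception_type(IOError),
    # Log a DEBUG message before every attempt.
    before=before_log(logger, logging.DEBUG),
)
def do_something_unreliable():
    if random.randint(0, 10) > 1:
        raise IOError("Broken sauce, everything is hosed!!!111one")
    else:
        return "Awesome sauce!"

try:
    print(do_something_unreliable())
except RetryError:
    # Raised when the stop condition is reached without a success.
    print("Gave up")

# Depending on the Tenacity version, the statistics may be exposed
# as do_something_unreliable.statistics instead.
print(do_something_unreliable.retry.statistics)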

> Just in case, I'll say: @retry(...) is special Python syntax called a "decorator". Here, retry(...) wraps another function and does something before or after that function is executed.

As we can see, retries can be designed creatively:

  • You can limit attempts by time (10 seconds) or number of attempts (5).
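  • The wait between attempts can grow exponentially (that is, in proportion to 2 ** n for an increasing attempt number n) or follow some other rule (for example, stay fixed). The exponential variant is called "exponential backoff"; it helps avoid congestion collapse, where retries pile up on a service that is already overloaded.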
  • You can retry only for certain types of errors (IOError).
  • Each retry attempt can be preceded or followed by a log entry.
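
To show that there is no magic behind these building blocks, here is a minimal hand-rolled sketch of the same ideas without any library. The names retry_call and flaky, as well as the tiny delays, are assumptions made for the example; this is not how Tenacity is implemented internally.

```python
import time


def retry_call(func, attempts=5, base_delay=0.01, max_delay=0.1,
               retry_on=(IOError,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            # Wait base_delay * 2 ** (attempt - 1), capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))


calls = {"count": 0}


def flaky():
    """Hypothetical operation that fails twice and then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("temporary failure")
    return "ok"


result = retry_call(flaky)
```

Exceptions that are not listed in retry_on are not caught, so they propagate immediately, which mirrors retrying only for certain types of errors.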

Now that we have completed the crash course and know the basic building blocks we need to work with transactions on the application side, let's get acquainted with two methods that allow us to implement transactions in distributed systems.

8.3 Advanced tools for transaction lovers

I will only give fairly general definitions, since this topic is worthy of a separate large article.

Two-phase commit (2PC). 2PC has two phases: a prepare phase and a commit phase. During the prepare phase, all microservices are asked to prepare for a data change that can be applied atomically. Once they are all ready, the commit phase makes the actual changes. The process needs a global coordinator, which locks the necessary objects: they stay unavailable for changes until the coordinator unlocks them. If a particular microservice is not ready for the change (for example, it does not respond), the coordinator aborts the transaction and starts the rollback.

Why is this protocol good? It provides atomicity. In addition, it guarantees isolation for reads and writes: the changes made by one transaction are not visible to others until the coordinator commits them. But these properties come at a price: since the protocol is synchronous (blocking), it slows the system down (and the RPC calls themselves are already quite slow). And again, there is a danger of deadlock.
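
The flow of the coordinator can be sketched in a few lines. This is a toy, in-memory illustration under my own assumptions: the Participant class and its state names are invented for the example, and a real implementation would also need timeouts, durable logs and recovery after a coordinator crash.

```python
class Participant:
    """Hypothetical in-memory stand-in for a microservice taking part in 2PC."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.state = "idle"

    def prepare(self):
        # Phase 1: lock resources and vote on whether the change can be applied.
        self.state = "prepared" if self.can_prepare else "refused"
        return self.can_prepare

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def two_phase_commit(participants):
    """Return True if everyone committed, False if the transaction was aborted."""
    prepared = []
    for p in participants:
        if p.prepare():
            prepared.append(p)
        else:
            # A single "no" vote aborts the transaction: undo the prepared ones.
            for q in prepared:
                q.rollback()
            return False
    # Phase 2: everyone voted "yes", so the change is applied everywhere.
    for p in participants:
        p.commit()
    return True


orders = Participant("orders")
payments = Participant("payments", can_prepare=False)
stock = Participant("stock")
aborted_ok = two_phase_commit([orders, payments, stock])
happy_ok = two_phase_commit([Participant("a"), Participant("b")])
```

In the first call the payments service refuses, so the already prepared orders service is rolled back and nothing is committed anywhere.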

Saga. In this pattern, a distributed transaction is executed as a series of asynchronous local transactions across all the microservices involved. The microservices communicate with each other via an event bus. If any microservice fails to complete its local transaction, the other microservices perform compensating transactions to roll back their changes.

The advantage of Saga is that no objects are blocked. But there are, of course, downsides.

Saga is hard to debug, especially when many microservices are involved. Another disadvantage of the Saga pattern is that it lacks read isolation. That is, if the ACID properties are important to us, Saga is not a good fit.
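
The compensation logic of a saga can be sketched as follows. To keep the example short it runs synchronously in one process and has no event bus, so it only illustrates the idea of compensating transactions; run_saga and the step names are assumptions made for the illustration.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; undo completed steps on failure."""
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # Compensate in reverse order for every step that already succeeded.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True


log = []


def fail_payment():
    """Hypothetical local transaction that fails."""
    raise RuntimeError("payment declined")


steps = [
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (lambda: log.append("create order"), lambda: log.append("cancel order")),
    (fail_payment, lambda: log.append("refund payment")),
]
saga_ok = run_saga(steps)
```

Note that between "create order" and "cancel order" the order really exists and can be read by anyone, which is exactly the missing read isolation mentioned above.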

What do the descriptions of these two techniques show us? That in distributed systems, the responsibility for atomicity and isolation lies with the application. The same thing happens when using databases that do not provide ACID guarantees. That is, things like conflict resolution, rollbacks, commits, and freeing up space fall on the shoulders of the developer.

8.4 How do I know when I need ACID guarantees?

When there is a high probability that a certain set of users or processes will simultaneously work on the same data.

Sorry for the banality, but a typical example is financial transactions.
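
To make the banal example concrete, here is a small sketch with SQLite from the Python standard library. The accounts table, the names and the amounts are invented for the illustration; the point is only that both updates of a transfer are applied together or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            # Credit first on purpose, to show that it is undone on failure.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected a negative balance


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


too_much = transfer(conn, "alice", "bob", 150)
after_failed = balances(conn)
fine = transfer(conn, "alice", "bob", 30)
after_ok = balances(conn)
```

The failed transfer leaves both balances untouched, even though the credit to bob had already been executed inside the transaction.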

When the order in which transactions are executed matters.

Imagine that your company is about to switch from the FunnyYellowChat messenger to the FunnyRedChat messenger, because FunnyRedChat lets you send gifs and FunnyYellowChat does not. But you are not just changing the messenger: you are migrating your company's correspondence from one messenger to the other. You do this because your programmers were too lazy to document programs and processes somewhere centrally, and instead published everything in various channels in the messenger. Your salespeople published the details of negotiations and agreements in the same place. In short, the whole life of your company is there, and since no one has time to move it all to a documentation service, and search in messengers works well, you decided not to clear the rubble but simply to copy all the messages to the new location. The order of the messages is important.

By the way, order generally matters in messenger conversations, but when two people write something in the same chat at the same moment, it does not really matter whose message appears first. So, for that particular scenario, ACID would not be needed.

Another possible example is bioinformatics. I don’t understand this at all, but I assume that order is important when deciphering the human genome. However, I have heard that bioinformaticians generally use their own tools for everything, so perhaps they have their own databases too.

When you can't give a user or process stale data.

And again - financial transactions. To be honest, I couldn't think of any other example.

When incomplete transactions are associated with significant costs.

Imagine the problems that can arise when a doctor and a nurse both update a patient record at the same time and erase each other's changes, because the database cannot isolate transactions. Healthcare is another area, besides finance, where ACID guarantees tend to be critical.
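
If the database does not isolate transactions, the application has to detect such conflicts itself. One common technique (my choice for this sketch, not something prescribed above) is optimistic locking with a version number: a write is accepted only if the record has not changed since it was read. The PatientRecords store below is a hypothetical in-memory stand-in.

```python
class StaleRecordError(Exception):
    """Raised when a record was changed by someone else since it was read."""


class PatientRecords:
    """Hypothetical in-memory store with a version number per record."""

    def __init__(self):
        self._data = {}  # record_id -> (version, fields)

    def read(self, record_id):
        version, fields = self._data.get(record_id, (0, {}))
        return version, dict(fields)

    def write(self, record_id, expected_version, fields):
        current_version, _ = self._data.get(record_id, (0, {}))
        if current_version != expected_version:
            raise StaleRecordError(record_id)
        self._data[record_id] = (current_version + 1, dict(fields))


store = PatientRecords()
store.write("patient-1", 0, {"allergies": "none"})

# The doctor and the nurse both read version 1 of the record.
doc_version, doc_fields = store.read("patient-1")
nurse_version, nurse_fields = store.read("patient-1")

doc_fields["diagnosis"] = "flu"
store.write("patient-1", doc_version, doc_fields)  # accepted, version becomes 2

nurse_fields["allergies"] = "penicillin"
try:
    store.write("patient-1", nurse_version, nurse_fields)
    nurse_saved = True
except StaleRecordError:
    nurse_saved = False  # the nurse must re-read the record and reapply the change
```

Without the version check, the nurse's write would silently overwrite the doctor's diagnosis.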

8.5 When do I not need ACID?

When users update only some of their private data.

For example, a user leaves comments or sticky notes on a web page. Or edits personal data in their account with some service provider.

When users do not update data at all, but only add new data (append).

For example, a running application that saves data about your runs: how far you ran, how long it took, the route, and so on. Each new run is new data, and the old runs are never edited. Perhaps you get analytics based on this data, and NoSQL databases are a good fit for exactly this scenario.

When business logic does not determine the need for a certain order in which transactions are performed.

Probably, for a YouTube blogger who collects donations for new material during a live broadcast, it is not so important who sent money, when, and in what order.

When users stay on the same web page or application window for several seconds or even minutes, and therefore will see stale data anyway.

Theoretically, this applies to any online news outlet, or to YouTube again. Or to "Habr".

When it doesn’t matter to you that incomplete transactions may be temporarily stored in the system, and you can ignore them without any damage.

If you aggregate data from many sources, and the data is updated at a high frequency (for example, occupancy data for parking spaces in a city, which changes at least every 5 minutes), then in theory it is not a big problem if a transaction for one of the parking lots occasionally fails to go through. Although, of course, it depends on what exactly you want to do with this data.