2.1 Emergence of the term NoSQL

The term "NoSQL" has recently become very fashionable, and all kinds of software products are being actively developed and promoted under this banner. NoSQL has become synonymous with huge data volumes, linear scalability, clusters, fault tolerance and non-relational storage. However, few people have a clear understanding of what a NoSQL store actually is, how the term appeared, and what characteristics such stores have in common. Let's try to fill this gap.

The most interesting thing about the term is that, although it was first used in the late 1990s, it acquired its current meaning only in mid-2009. Originally it was the name of an open-source database created by Carlo Strozzi, which stored all data in ASCII files and used shell scripts instead of SQL to access it. It had nothing to do with "NoSQL" in its current sense.

In June 2009 Johan Oskarsson organized a meeting in San Francisco to discuss new trends in the data storage and processing market. The main impetus for the meeting was the appearance of new open-source products modeled on Google's BigTable and Amazon's Dynamo. The meeting needed a catchy name: a short, memorable term that would also work well as a Twitter hashtag. One such term, "NoSQL", was proposed by Eric Evans of Rackspace. It was intended for that single meeting and carried no deep meaning, but it spread across the Internet like a viral advertisement and became the de facto name of a whole trend in the IT industry. Incidentally, the projects presented at the meeting were Voldemort (a clone of Amazon Dynamo), Cassandra and HBase (analogues of Google BigTable), Hypertable, CouchDB and MongoDB.

It is worth emphasizing once again that the term "NoSQL" arose completely spontaneously and has no generally accepted definition or scientific institution behind it. The name rather describes a direction of IT development away from relational databases. It is usually expanded as "Not Only SQL", although some prefer the literal reading "No SQL". Pramod Sadalage and Martin Fowler attempted to group and systematize knowledge about the NoSQL world in their recent book "NoSQL Distilled".

2.2 Basic characteristics of NoSQL databases

All NoSQL systems share only a few common characteristics, since many heterogeneous systems are now grouped under the NoSQL label (perhaps the most complete list can be found at http://nosql-database.org/). Many characteristics apply only to certain NoSQL databases; I will point this out as I list them.

1. No SQL is used

This refers to ANSI SQL DML: many databases offer query languages similar to the familiar, well-loved syntax, but none has managed to implement it fully, and none is likely to. There are, however, reportedly startups trying to implement SQL, for example on top of Hadoop (http://www.drawntoscalehq.com/ and http://www.hadapt.com/).

2. Unstructured (schemaless)

This means that in NoSQL databases, unlike relational ones, the data structure is not enforced (it is loosely typed, to draw an analogy with programming languages): you can add an arbitrary field to an individual row or document without first declaratively changing the structure of the whole table. Thus, if the data model needs to change, the only required action is to reflect the change in the application code.

For example, when renaming a field in MongoDB:

import com.mongodb.BasicDBObject;
BasicDBObject order = new BasicDBObject();
order.put("date", orderDate); // this field has existed for a long time
order.put("totalSum", total); // previously this field was called "sum"

If we change the application logic this way, the code will expect the new field when reading as well. But because there is no data schema, the totalSum field is missing from the Order documents that already exist. There are two ways to proceed in this situation.

The first is to go through all existing documents and update the field in each of them. Because of the data volume, this is done without locks comparable to those of an alter table rename column command, so other processes can still read not-yet-updated documents while the update is running. Therefore the second option, a check in the application code, is unavoidable:

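// order is a BasicDBObject that has been read from the database
Object value = order.get("totalSum"); // updated model; get() returns null if the field is absent
if (value == null)
    value = order.get("sum"); // old model, document not migrated yet
Double totalSum = (value == null) ? null : ((Number) value).doubleValue();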

When we next save the document, we write this field to the database in the new format.
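The read-time check and the write-back in the new format can be combined into two small helpers. This is only a minimal sketch: a plain java.util.Map stands in for the stored document so that it runs without a MongoDB driver, and the class and method names (OrderMigration, readTotal, upgrade) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; a plain Map stands in for a stored order document.
final class OrderMigration {

    // Returns the order total, accepting both the new ("totalSum") and the
    // old ("sum") field name; returns null if neither field is present.
    static Double readTotal(Map<String, Object> order) {
        Object value = order.get("totalSum");   // updated model
        if (value == null) {
            value = order.get("sum");           // old model
        }
        return value == null ? null : ((Number) value).doubleValue();
    }

    // Rewrites the document in the new format; call it before saving.
    static void upgrade(Map<String, Object> order) {
        Object old = order.remove("sum");
        if (old != null && !order.containsKey("totalSum")) {
            order.put("totalSum", old);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> legacy = new HashMap<>();
        legacy.put("sum", 99.5);                // document written by the old code
        System.out.println(readTotal(legacy));  // read works before migration
        upgrade(legacy);
        System.out.println(legacy);             // now holds only "totalSum"
    }
}
```

With this approach documents are migrated gradually, one at a time, whenever the application happens to touch them, so the old field name has to stay supported until the last old document is rewritten.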

A pleasant consequence of having no schema is efficient handling of sparse data. If one document has a date_published field and another does not, no empty date_published field is created for the second one. This is logical enough for documents; a less obvious example is column-family NoSQL databases, which use the familiar concepts of tables and columns. Because there is no schema, however, columns are not declared in advance and can be added or changed during a user's database session. In particular, this allows dynamic columns to be used to implement lists.
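To make the idea of dynamic columns more concrete, here is a minimal in-memory sketch. It is not the API of any real database: the SparseColumnFamily class, its methods and the "<list>:<index>" column naming scheme are assumptions made for illustration only. Each row holds only the columns that were actually written to it, and a list is stored as a run of columns sharing a common prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for one column family: row key -> (column name -> value).
// Columns inside a row are kept sorted by name (an assumption of this sketch).
final class SparseColumnFamily {
    private final Map<String, TreeMap<String, String>> rows = new TreeMap<>();

    // Writes a column; it comes into existence for this row only.
    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Appends a list element as a new dynamic column "<list>:<index>".
    void append(String rowKey, String list, String value) {
        int next = readList(rowKey, list).size();
        put(rowKey, String.format("%s:%06d", list, next), value);
    }

    // Reads the list back by scanning the columns that share the prefix.
    List<String> readList(String rowKey, String list) {
        TreeMap<String, String> row = rows.getOrDefault(rowKey, new TreeMap<>());
        // ';' is the character right after ':', so the range [list + ":", list + ";")
        // covers every column whose name starts with "<list>:".
        return new ArrayList<>(row.subMap(list + ":", list + ";").values());
    }

    // Number of columns physically present in the row.
    int columnCount(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>()).size();
    }
}
```

A row that never received a date_published column simply has no such entry, and appending to a "tags" list adds columns to that one row without affecting any other row.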

Having no schema has its drawbacks. Besides the overhead in application code when the data model changes, mentioned above, the database enforces no constraints (not null, unique, check constraint, etc.), and it becomes harder to understand and control the data structure when several projects work with the same database in parallel (the database side has no data dictionary). However, in a rapidly changing world such flexibility is still an advantage. An example is Twitter, which five years ago stored only a little additional information with each tweet (the time, the Twitter handle and a few more bytes of metadata), but now stores several kilobytes of metadata in addition to the message itself.

(From here on we are talking mainly about key-value, document and column-family databases; graph databases may not have these properties.)

3. Representation of data as aggregates

Unlike the relational model, which spreads an application's logical business entity across several physical tables for the sake of normalization, NoSQL stores operate on such entities as whole objects:

This example shows aggregates for the standard e-commerce conceptual relational model "Order - Order Items - Payments - Product". In both cases the order is combined with its items into one logical object, and each item stores a reference to the product along with some of its attributes, such as the name (this denormalization is needed to avoid fetching the product object when retrieving an order: the main rule of distributed systems is to minimize "joins" between objects). In one aggregate the payments are combined with the order and form an integral part of the object; in the other they are placed in a separate object. This illustrates the main rule of data structure design in NoSQL databases: the structure must follow the requirements of the application and be optimized as far as possible for the most frequent queries.
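The two aggregate layouts described above can be sketched as plain Java types. This is only an illustration, assuming Java 16 or later for records; all type and field names (OrderItem, Payment, OrderWithPayments and so on) are hypothetical and not taken from any particular database.

```java
import java.util.List;

final class Aggregates {
    // An order item keeps a reference to the product plus a denormalized copy
    // of its name, so reading an order never requires a second lookup.
    record OrderItem(String productId, String productName, int quantity, double price) {}

    record Payment(String method, double amount) {}

    // Variant 1: payments are embedded and are part of the order aggregate.
    record OrderWithPayments(String orderId, List<OrderItem> items, List<Payment> payments) {
        double total() {
            return items.stream().mapToDouble(i -> i.quantity() * i.price()).sum();
        }
    }

    // Variant 2: payments form their own aggregate and point back to the order.
    record Order(String orderId, List<OrderItem> items) {}
    record OrderPayment(String orderId, Payment payment) {}

    public static void main(String[] args) {
        OrderWithPayments order = new OrderWithPayments(
                "o-1",
                List.of(new OrderItem("p-1", "Book", 2, 10.0)),
                List.of(new Payment("card", 20.0)));
        System.out.println(order.total());
    }
}
```

Which variant fits better depends on how the application reads the data: if payments are always shown together with the order, embedding them saves a lookup; if they are processed on their own, a separate aggregate keeps the order object smaller.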

Many will object that working with large, often denormalized objects causes numerous problems with ad hoc queries that do not fit the structure of the aggregates. Suppose we use orders together with their line items and payments (this is how the application works), but the business asks us to count how many units of a particular product were sold last month. Instead of scanning the OrderItem table (as we would in a relational model), we have to retrieve entire orders from the NoSQL store, although we need only a small part of that information. Unfortunately, this is a compromise that has to be made in a distributed system: we cannot normalize data as in a conventional single-server system.
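The cost of such a query can be seen in a small sketch. It assumes Java 16 or later and an in-memory list standing in for the store; the types, the unitsSold method and the "yyyy-MM" month string are hypothetical simplifications.

```java
import java.util.List;

final class UnitsSold {
    record OrderItem(String productId, int quantity) {}
    record Order(String orderId, String month, List<OrderItem> items) {}

    // Counts units of one product sold in the given month. Every matching order
    // is loaded as a whole aggregate, even though only the productId and
    // quantity of its items are actually used.
    static int unitsSold(List<Order> orders, String productId, String month) {
        return orders.stream()
                .filter(o -> o.month().equals(month))
                .flatMap(o -> o.items().stream())
                .filter(i -> i.productId().equals(productId))
                .mapToInt(OrderItem::quantity)
                .sum();
    }
}
```

In a relational model the same question would be a scan of the OrderItem table alone; here the unit of retrieval is the whole order.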

I tried to group the pros and cons of both approaches in a table: