3.1 History of the emergence of the term Big Data

The term Big Data appeared relatively recently. Google Trends shows that active growth in the use of the phrase began in 2011:

At the same time, it seems that everyone uses the term these days, and marketers in particular often use it inappropriately. So what is Big Data really? Since I have decided to cover the subject systematically, I need to define the concept first.

In my practice, I have come across various definitions:

  • Big Data is when there is more than 100GB of data (500GB, 1TB, whatever you like).
  • Big Data is data that cannot be processed in Excel.
  • Big Data is data that cannot be processed on a single computer.

And even these:

  • Big Data is generally any data.
  • Big Data does not exist; it was invented by marketers.

I will stick to the definition from Wikipedia:

Big Data is a series of approaches, tools and methods for processing structured and unstructured data of huge volume and significant diversity in order to obtain results perceivable by humans. These approaches took shape in the late 2000s as an alternative to traditional database management systems and Business Intelligence class solutions, and they remain effective under conditions of continuous data growth and distribution across numerous nodes of a computer network.

Thus, by Big Data I will mean not some specific amount of data, and not even the data itself, but methods of processing it that allow information to be processed in a distributed way. These methods can be applied to large datasets (such as the content of all pages on the Internet) as well as to small ones (such as the content of this lecture).

Here are some examples of data sources that may require big data techniques:

  • Logs of user behavior on the Internet
  • GPS signals from cars for a transport company
  • Data taken from sensors in the Large Hadron Collider
  • Digitized books in the Russian State Library
  • Information about transactions of all bank customers
  • Information about all purchases in a large retail chain, etc.

The number of data sources is growing rapidly, which means that data processing technologies are becoming more and more in demand.

3.2 Big Data principles

Based on the definition of Big Data, we can formulate the basic principles of working with such data:

1. Horizontal scalability. Since the amount of data can be arbitrarily large, any system for processing big data must be extensible: if the volume of data doubles, you double the amount of hardware in the cluster and everything keeps working.

2. Fault tolerance. The principle of horizontal scalability implies that a cluster can contain many machines. For example, Yahoo's Hadoop cluster has over 42,000 machines (cluster sizes across organizations can be seen at this link). This means that some of these machines are guaranteed to fail. Big data techniques must anticipate such failures and survive them without any significant consequences.

3. Data locality. In large distributed systems, data is spread across a large number of machines. If the data is physically stored on one server and processed on another, the cost of transferring the data may exceed the cost of the processing itself. Therefore, one of the most important principles for designing Big Data solutions is data locality: whenever possible, data is processed on the same machine that stores it.

All modern big data tools follow these three principles in one way or another. To follow them, it is necessary to come up with methods and paradigms for developing data processing tools. I will analyze one of the most classic of these methods in today's lecture.

3.3 MapReduce

MapReduce is a distributed data processing model proposed by Google for handling large amounts of data on computer clusters. MapReduce is well illustrated by the following picture:

MapReduce assumes that the data is organized as a set of records. Data processing occurs in three stages:

1. Map stage. At this stage, the data is preprocessed using the map() function, which is defined by the user. The job of this stage is to preprocess and filter the data. The operation is very similar to the map operation in functional programming languages: a user-defined function is applied to each input record.

The map() function applied to a single input record produces a set of key-value pairs. A set, meaning that it may return just one pair, it may return nothing at all, or it may return several key-value pairs. What goes into the key and the value is up to the user, but the key is very important, since all data with the same key will later end up in a single call of the reduce function.

2. Shuffle stage. This stage goes unnoticed by the user. Here the output of the map function is distributed into "baskets": each basket corresponds to one output key of the map stage. These baskets then serve as the input for reduce.

3. Reduce stage. Each "basket" of values produced at the shuffle stage is fed to the input of the reduce() function.

The reduce function is defined by the user and computes the final result for a single "basket". The set of all values returned by the reduce() function is the final result of the MapReduce task.
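To make the three stages concrete, here is a minimal single-machine sketch of the model in Python. The function run_mapreduce() and its signature are my own illustration, not part of any real framework; an actual implementation would spread the map calls, the shuffle and the reduce calls across the machines of a cluster.

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map stage: apply the user-defined map function to every input record
    baskets = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            # Shuffle stage: group the emitted values into "baskets" by key
            baskets[key].append(value)
    # Reduce stage: apply the user-defined reduce function to each basket
    results = []
    for key, values in baskets.items():
        results.extend(reduce_fn(key, values))
    return results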

Some additional facts about MapReduce:

  1. All runs of the map function work independently and can run in parallel, including on different cluster machines (see the sketch after this list).
  2. All runs of the reduce function also work independently and can run in parallel, including on different cluster machines.
  3. Shuffle internally represents a parallel sort, so it can also run on different cluster machines. Points 1-3 make it possible to implement the principle of horizontal scalability.
  4. The map function is usually run on the same machine where the data is stored - this reduces the transmission of data over the network (the principle of data locality).
  5. MapReduce always performs a full scan of the data; there are no indexes. This means that MapReduce is poorly suited to cases where a response is required very quickly.
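To illustrate point 1, here is a toy demonstration that the independent map runs really can be executed in parallel; worker processes on one machine stand in for cluster nodes. The function first_letter_map() and the sample records are my own invented example, not part of any framework.

from concurrent.futures import ProcessPoolExecutor

def first_letter_map(record):
    # A hypothetical map() used only for this illustration: emit (first letter, 1);
    # it returns a list so the result can be pickled and sent back from a worker process
    return [(record[0], 1)]

if __name__ == "__main__":
    records = ["apple", "banana", "apricot"]
    # Every run of the map function is independent, so the runs can execute in parallel
    with ProcessPoolExecutor() as pool:
        mapped = list(pool.map(first_letter_map, records))
    print(mapped)  # [[('a', 1)], [('b', 1)], [('a', 1)]]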

3.4 Examples of tasks effectively solved with MapReduce

Word Count

Let's start with the classic task, Word Count. The task is formulated as follows: there is a large corpus of documents; for each word that occurs at least once in the corpus, count the total number of times it occurs.

Solution:

Since we have a large corpus of documents, let one document be one input record for the MapReduce task. In MapReduce, we can only define the user functions map() and reduce(), which is what we will do (using Python-like pseudocode):

def map(doc):
    # Emit a (word, 1) pair for every word; the document is treated as whitespace-separated text
    for word in doc.split():
        yield word, 1

def reduce(word, values):
    # Sum the ones emitted for this word
    yield word, sum(values)

The map function turns the input document into a set of (word, 1) pairs, shuffle transparently turns them into pairs of the form (word, [1, 1, 1, 1, 1]), and reduce sums those ones, returning the final answer for the word.
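As a sanity check, these functions can be run through the single-machine run_mapreduce() sketch from section 3.3 (the toy corpus below is my own illustration; the map and reduce defined above shadow Python's built-ins, which is fine for this sketch):

docs = ["big data is big", "data about data"]
print(run_mapreduce(docs, map, reduce))
# [('big', 2), ('data', 3), ('is', 1), ('about', 1)] - keys appear in order of first occurrence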

Processing advertising system logs

The second example is taken from the real practice of the Data-Centric Alliance.

Task: there is a CSV log of the advertising system of the form:

<user_id>,<country>,<city>,<campaign_id>,<creative_id>,<payment>
 
11111,RU,Moscow,2,4,0.3 
22222,RU,Voronezh,2,3,0.2 
13413,UA,Kyiv,4,11,0.7 
… 

It is necessary to calculate the average cost of an ad impression in each Russian city.

Solution:

def map(record):
    # Parse one CSV line of the log
    user_id, country, city, campaign_id, creative_id, payment = record.split(",")
    payment = float(payment)
    # Keep only impressions shown in Russia
    if country == "RU":
        yield city, payment

def reduce(city, payments):
    # Average payment over all impressions in the city
    yield city, sum(payments) / len(payments)

The map function checks whether we need the record and, if so, keeps only the necessary information (the city and the payment amount). The reduce function computes the final answer for a city given the list of all payments in that city.
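For completeness, here is how the same run_mapreduce() sketch from section 3.3 could be applied to the three sample log lines above (again only an illustration; it assumes the sketch and the map/reduce functions just defined are in scope):

log = [
    "11111,RU,Moscow,2,4,0.3",
    "22222,RU,Voronezh,2,3,0.2",
    "13413,UA,Kyiv,4,11,0.7",
]
print(run_mapreduce(log, map, reduce))
# [('Moscow', 0.3), ('Voronezh', 0.2)] - the Kyiv record is filtered out by map()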