Imagine you ordered a pizza through an app. You tap the "Order" button, and your request goes to a microservice that handles order placement. That microservice queries another service — the one that manages the pizza catalog. Then another service kicks in to calculate delivery, and another one for payment. So a seemingly simple request travels across many services, each leaving its own "trace".
Distributed tracing is the process of following your request's "journey" across all those services. It helps you see which components the request went through, how much time it spent at each step, and exactly where it might be slowing down.
Main goals of distributed tracing:
- Show the full execution path of a request.
- Detect bottlenecks and slow operations.
- Diagnose errors and failures.
- Improve performance and optimize the architecture.
In the microservices world, distributed tracing is the main tool that keeps you from getting lost in the complex web of service interactions.
Fundamental concepts
There are two key concepts in distributed tracing:
- Trace – the complete path of a request from start to finish. For example, an order request might start in the web app and end in the delivery service, and the Trace ties all those steps together.
- Span – an individual step inside a Trace. A call to another microservice, a database access, or an HTTP call each count as a Span.
In practice this looks like a tree: each Trace contains many Spans, split into parent and child actions.
Why use distributed tracing?
- Understand complex interactions: when a request "gets stuck", tracing helps pinpoint where things went wrong. For example, the delivery service might be slow because of a sluggish DB connection.
- Reduce downtime: tracing helps you quickly identify which components are failing and where the bottlenecks are.
- Optimize performance: timing measurements let you find code paths or interactions that need improvement.
- Provide transparency: when you're working in a team, tracing makes the system clearer for everyone. Everything happening between services gets documented.
Main components of distributed tracing
Span and Trace: the duo that explains everything.
A bit about Span:
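- A Span is a single action, e.g., a REST API call or a database write operation.
- Each Span has its own unique identifier. It also carries the identifier of its Trace and, if it has one, of its parent Span, which is what links it to the trace and to other spans.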
And a bit about Trace:
- Trace groups many Spans into a single unit.
- Trace also has its own unique identifier to show that all the Spans belong to the same request.
Example:
Trace ID: 987654321
├── Span ID: 1 (Customer Service - order processing)
├── Span ID: 2 (Pizza Service - checking pizza availability)
├── Span ID: 3 (Delivery Service - delivery calculation)
└── Span ID: 4 (Payment Service - payment processing)
Note: each step corresponds to one Span, and the time that step takes is the duration of that Span.
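To make the parent and child relationship concrete, here is a minimal sketch of the example above as a data structure. The `SpanNode` record and the `TraceTreeSketch` class are hypothetical names used only for illustration; real tracers such as Zipkin or OpenTelemetry define their own span types. The sketch also assumes that Span 1 is the parent of the other three, which the example does not state explicitly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical model for illustration only, not the type of any real tracing library.
// parentSpanId is null for the root span of the trace.
record SpanNode(String traceId, String spanId, String parentSpanId, String name) {}

class TraceTreeSketch {

    // Returns the spans whose parent is the given span ID (null selects the root).
    static List<SpanNode> childrenOf(List<SpanNode> spans, String parentSpanId) {
        List<SpanNode> result = new ArrayList<>();
        for (SpanNode span : spans) {
            if (Objects.equals(span.parentSpanId(), parentSpanId)) {
                result.add(span);
            }
        }
        return result;
    }

    // Renders the trace as an indented tree, one line per span.
    static String render(List<SpanNode> spans, String parentSpanId, int depth) {
        StringBuilder out = new StringBuilder();
        for (SpanNode span : childrenOf(spans, parentSpanId)) {
            out.append("  ".repeat(depth))
               .append(span.spanId()).append(" ").append(span.name()).append("\n");
            out.append(render(spans, span.spanId(), depth + 1));
        }
        return out.toString();
    }

    // The pizza example; assumes Span 1 is the root and calls the other services.
    static List<SpanNode> sampleTrace() {
        String traceId = "987654321";
        return List.of(
            new SpanNode(traceId, "1", null, "Customer Service - order processing"),
            new SpanNode(traceId, "2", "1", "Pizza Service - checking pizza availability"),
            new SpanNode(traceId, "3", "1", "Delivery Service - delivery calculation"),
            new SpanNode(traceId, "4", "1", "Payment Service - payment processing"));
    }

    public static void main(String[] args) {
        System.out.print(render(sampleTrace(), null, 0));
    }
}
```

Every span carries the same Trace ID, and the tree shape comes only from the parent reference. That is why a tracing backend can rebuild the tree even when spans arrive from different services in arbitrary order.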
Tools to implement distributed tracing
There are several tools that make adding tracing to microservices easier. Let's look at two of the most popular:
1. Zipkin
Zipkin is a tool for collecting and visualizing tracing data. It lets you store and analyze timing information for each request.
- Pros: Easy to set up, supports many programming languages.
- Cons: Limited metrics and monitoring capabilities (tracing only).
2. OpenTelemetry
OpenTelemetry is a more modern approach that combines tracing, metrics, and logs in one solution. You can think of it as a universal Observability toolkit.
- Pros: Support for multiple standards, flexible.
- Cons: More complex setup compared to Zipkin.
Official OpenTelemetry documentation
How does distributed tracing work?
The process works roughly like this:
- Trace ID generation: when a request enters the system, a unique identifier for the Trace is created.
- Span ID generation: each action within that Trace gets its own unique Span ID.
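- Context propagation: the Trace ID and the current Span ID are passed between services in HTTP headers, for example the W3C `traceparent` header or Zipkin's B3 headers (`X-B3-TraceId`, `X-B3-SpanId`).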
- Data collection: each service records timing data, errors, and other events.
- Analysis: all data is sent to a tool (for example, Zipkin), where you can visualize the whole flow.
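The first three steps can be sketched in a few lines of plain Java. This is only an illustration, assuming the W3C Trace Context header format (`00-<trace id>-<span id>-<flags>`, with a 16-byte Trace ID and an 8-byte Span ID written as lowercase hex). The class and method names are hypothetical, and in a real project a tracing library does this work for you.

```java
import java.security.SecureRandom;

// Hypothetical helper for illustration; tracing libraries normally handle this.
class TraceContextSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Produces a random ID of the given size as lowercase hex.
    static String randomHex(int bytes) {
        byte[] buffer = new byte[bytes];
        RANDOM.nextBytes(buffer);
        StringBuilder hex = new StringBuilder(bytes * 2);
        for (byte b : buffer) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    // Step 1: a Trace ID is created when the request enters the system.
    static String newTraceId() { return randomHex(16); }

    // Step 2: every action inside the trace gets its own Span ID.
    static String newSpanId() { return randomHex(8); }

    // Step 3 (caller side): pack the context into a traceparent header value.
    // "00" is the format version, "01" marks the request as sampled.
    static String toTraceparent(String traceId, String spanId) {
        return "00-" + traceId + "-" + spanId + "-01";
    }

    // Step 3 (callee side): unpack the header into {traceId, parentSpanId}.
    static String[] parseTraceparent(String header) {
        String[] parts = header.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        return new String[] { parts[1], parts[2] };
    }

    public static void main(String[] args) {
        // Upstream service: starts the trace and sends the header with its call.
        String traceId = newTraceId();
        String spanId = newSpanId();
        String header = toTraceparent(traceId, spanId);
        System.out.println("traceparent: " + header);

        // Downstream service: keeps the Trace ID, records the caller as parent,
        // and creates a new Span ID for its own work.
        String[] context = parseTraceparent(header);
        String childSpanId = newSpanId();
        System.out.println("trace=" + context[0] + " parent=" + context[1] + " span=" + childSpanId);
    }
}
```

The downstream service never generates a new Trace ID. It reuses the one from the header, which is what lets the tracing backend group spans from different services into one request.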
Example of microservice interaction with tracing
Imagine a flow like this:
Client -> Order Service -> Inventory Service -> Payment Service
First the Client sends a request to the Order Service (the order placement microservice). That service, in turn, calls:
- Inventory Service to check stock levels.
- Payment Service to confirm payment.
Distributed tracing will build a request tree that shows:
- How much time the Order Service took.
- Exactly where the request was delayed (for example, during inventory check).
- If an error occurred, which service it happened in.
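The questions above boil down to simple queries over the collected spans. Below is a rough sketch of that idea; the `TimedSpan` record, the method names, and the timing values are all made up for illustration and do not come from a real tracing tool. It looks only at the calls made by the Order Service, since the Order Service's own span would include the time of its children.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical span record for illustration: service name, timestamps, error flag.
record TimedSpan(String service, long startMillis, long endMillis, boolean error) {
    long durationMillis() {
        return endMillis - startMillis;
    }
}

class TraceAnalysisSketch {

    // Where was the request delayed? Pick the child span with the longest duration.
    static Optional<TimedSpan> slowest(List<TimedSpan> childSpans) {
        return childSpans.stream().max(Comparator.comparingLong(TimedSpan::durationMillis));
    }

    // If an error occurred, which services did it happen in?
    static List<String> failedServices(List<TimedSpan> childSpans) {
        return childSpans.stream()
                .filter(TimedSpan::error)
                .map(TimedSpan::service)
                .toList();
    }

    // Made-up timings for the calls that the Order Service makes.
    static List<TimedSpan> sampleChildSpans() {
        return List.of(
            new TimedSpan("Inventory Service", 20, 350, false),
            new TimedSpan("Payment Service", 360, 470, true));
    }

    public static void main(String[] args) {
        List<TimedSpan> spans = sampleChildSpans();
        slowest(spans).ifPresent(span ->
            System.out.println("Slowest call: " + span.service() + " (" + span.durationMillis() + " ms)"));
        System.out.println("Failed services: " + failedServices(spans));
    }
}
```

A tool like Zipkin answers the same questions visually: the longest bar in the timeline is the slowest span, and failed spans are highlighted.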
How does this help in real life?
Distributed tracing becomes especially useful when a system grows to dozens or hundreds of microservices. In such a system:
- Errors can happen anywhere.
- Manual debugging becomes practically impossible.
- Testing all interactions gets complicated.
Tracing takes on the job of "tagging" each request and gives the developer everything needed for fast analysis.
In the next lecture we'll talk about how to integrate Spring Boot with Zipkin and Sleuth. We'll take a few microservices, add tracing libraries, and learn how to actually "see" requests in the Zipkin UI. Get ready for hands-on work — it's going to be fun!