CodeGym /Courses /SQL SELF /Using pg_stat_statements to Analyze Index a...

Using pg_stat_statements to Analyze Index and Filter Usage

SQL SELF
Level 42 , Lesson 3
Available

Indexes are like bookmarks in a book. They help you quickly find the data you need. But what if we added a ton of bookmarks and nobody ever uses them? Or worse, poorly placed bookmarks make us flip through the whole book from start to finish? That’s when you really need to analyze how your indexes are being used.

Poorly written queries can ignore indexes, which leads to expensive sequential scans (Seq Scan). That slows down your queries and puts extra load on your server. Our goal is to figure out which queries aren’t using indexes and why.

How do you know if indexes are being used?

Let’s look at two key questions:

  1. Are the indexes we created actually being used?
  2. If they are, are they effective?

To answer these, we can analyze query stats in pg_stat_statements and pay attention to a few columns:

  • rows: number of rows processed by the query.
  • shared_blks_hit: number of pages read from memory (not from disk).
  • shared_blks_read: number of pages actually read from disk.

The fewer rows your query processes, and the higher the shared_blks_hit compared to total pages, the better your index is working.

Index Analysis Example

Let’s say we have a students table:

CREATE TABLE students (
    id SERIAL PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    birth_date DATE,
    grade_level INTEGER
);

-- Let’s add an index on grade_level
CREATE INDEX idx_grade_level ON students(grade_level);

Now let’s insert some data for our experiment:

INSERT INTO students (first_name, last_name, birth_date, grade_level)
SELECT 
    'Student ' || generate_series(1, 100000),
    'LastName',
    '2000-01-01'::DATE + (random() * 3650)::INT,
    floor(random() * 12)::INT
FROM generate_series(1, 100000);

Let’s run a query to find students with a certain grade level:

SELECT *
FROM students
WHERE grade_level = 10;

Checking in pg_stat_statements

After running the query a few times, we can check the stats:

SELECT query, calls, rows, shared_blks_hit, shared_blks_read
FROM pg_stat_statements
WHERE query LIKE '%grade_level = 10%';

How to interpret the results:

  • rows: If the query returns too many rows, is the index even worth it? Maybe you don’t need it for low-selectivity conditions.
  • shared_blks_hit and shared_blks_read: If a lot of pages are read from disk (shared_blks_read), either the index isn’t working or the data isn’t in the buffer pool.

Index Optimization

Creating an index is only half the job. It’s important that PostgreSQL actually uses it. Sometimes, despite your best efforts, the database picks a sequential scan instead of your index. Why does that happen? Let’s figure it out.

First, let’s see why an index might be ignored even when it seems obviously useful. Then, we’ll look at some tricks to make the database “remember” that your index exists and actually use it.

What if the index isn’t used?

Sometimes PostgreSQL ignores the index and does a sequential scan (Seq Scan). There are a few reasons for this:

  1. Low selectivity condition. If your query returns more than half the table, a sequential scan might actually be faster.
  2. Data types or functions. If you use a function on the indexed column in your query, the index might be ignored. For example:
   SELECT *
   FROM students
   WHERE grade_level + 1 = 11; -- Index not used
In these cases, you can rewrite the query:
   SELECT * 
   FROM students
   WHERE grade_level = 10; -- Uses index
  1. Wrong index type. For full-text search, you’re better off with GIN or GiST indexes, not B-TREE.

  2. Bad stats. If your stats are outdated, the optimizer might make bad decisions. Use ANALYZE:

    ANALYZE students;
    

Improving the Query

Back to our example. If the index isn’t being used, try these steps:

  1. Make sure your query uses filters that can use the index: don’t use functions, type casts, etc.
  2. If your filter returns a lot of values, think about whether you need the index. If it’s a frequent query, try changing your table structure or adding materialized views.
  3. If you’re getting Seq Scan because of a huge amount of data, try splitting the table into partitions (PARTITION BY).

Checking Index Efficiency

After optimizing, run the query again and check the stats:

SELECT query, calls, rows, shared_blks_hit, shared_blks_read
FROM pg_stat_statements
WHERE query LIKE '%grade_level%';

Compare the metrics before and after. You should see less disk reading (shared_blks_read) and more hits (shared_blks_hit).

Real-World Cases

  1. Incorrect Index Usage

We have a products table with a text description field:

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    description TEXT
);

-- Index for full-text search
CREATE INDEX idx_description ON products USING GIN (to_tsvector('english', description));

If we run this query:

SELECT *
FROM products
WHERE description ILIKE '%smartphone%';

The index won’t be used! That’s because ILIKE isn’t compatible with GIN. To use the index, rewrite the query like this:

SELECT *
FROM products
WHERE to_tsvector('english', description) @@ to_tsquery('smartphone');
  1. Missing Index Where You Need One

Let’s say this query:

SELECT *
FROM students
WHERE birth_date BETWEEN '2001-01-01' AND '2003-01-01';

does a sequential scan (Seq Scan). That might be because there’s no index on birth_date. If you create an index:

CREATE INDEX idx_birth_date ON students(birth_date);

and update the stats (ANALYZE students), you can speed up this query a lot.

2
Task
SQL SELF, level 42, lesson 3
Locked
Index Usage Analysis
Index Usage Analysis
Comments
TO VIEW ALL COMMENTS OR TO MAKE A COMMENT,
GO TO FULL VERSION