An Index Gone Rogue

Our investigation recipe for a loaded RDS PostgreSQL instance

Yotam Lender
Nanit Engineering

--

One of the most common ways to improve RDBMS query performance is to add an index based on the query’s selected fields and conditions. But sometimes an index can degrade database performance instead of improving it. In this blog post I’ll describe how deleting an index helped us improve performance and mitigate load on an RDS instance.

A few weeks ago we received a CloudWatch alert indicating that one of our RDS instances was suffering from a low burst balance. Before we dive into the details, let me first explain what this RDS instance is used for and what burst balance is.

The RDS instance is used by a service called Inbox. Inbox is an HTTP service that serves notifications for events like motion detection and high temperature to Nanit’s customers. Inbox uses RDS for storage, and the instance uses a General Purpose (gp2) SSD volume.

To explain burst balance, let’s first dive into RDS volume performance.

IOPS, IO Credits and Burst Balance

RDS volume performance is measured by the number of IO operations per second (IOPS).

For gp2 volumes the baseline performance is 3 IOPS per GiB of volume size, so for example the baseline performance of a 500 GiB volume is 1,500 IOPS.

When the baseline performance is insufficient, the volume can consume IO credits to burst above it. Credits accumulate at the baseline rate of 3 IOPS per GiB whenever actual volume performance is below the baseline, which means a 500 GiB volume accumulates credits at a rate of 1,500 per second (one credit pays for one I/O operation).

It is important to note that volumes smaller than 1 TiB can burst up to 3,000 IOPS when spending credits. Each volume starts with a balance of 5.4 million I/O credits, which is enough to sustain that maximum burst of 3,000 IOPS for 30 minutes.

Burst balance (the BurstBalance metric in CloudWatch) is the percentage of IO credits remaining in the bucket.

Mitigation and Application Side Investigation

Now, let’s get back to our discussion.

An immediate mitigation for the low burst balance alert was to expand the RDS instance’s volume size, which increases the baseline IOPS. Expanding a volume cannot be rolled back, but we decided to apply it anyway because the load was hurting responsiveness for our users and the resize involves minimal disruption to the instance.

Once the disk resize was done the DB had sufficient IOPS to handle the load, but the IOPS rate was still 5x the usual, so we embarked on a journey to understand the root cause of the high read IOPS.

Our first suspicion was that the mobile application was sending far more requests than needed, or that it had entered an infinite loop of requests creating some kind of DDoS effect on our servers. We allayed this suspicion by looking at our HTTP request rate graph, which showed no sign of an increase.

We moved on to our next hunch: a “limit” query parameter passed in the most used HTTP request in Inbox. The “limit” parameter represents the number of notifications that should be included in the response. If its value had become higher for some reason, we would have ended up reading more rows from the DB, which could explain the increased IOPS we observed. We looked in two directions: the mobile application and Inbox. We checked that the mobile application sends the “limit” parameter and that Inbox uses it and falls back to a default when no value is passed. The mobile app sent the parameter with a proper value and Inbox did use it.

We moved on to the next graph, which describes the rate of queries to the database, to rule out the possibility that Inbox was misbehaving and executing more queries than required. Inbox was behaving normally.

After dismissing all of the above, we turned to a different path — investigating the database.

Autovacuum

The first thing we did was check that the autovacuum daemon is running on the relevant table. If it isn’t running, table statistics are not updated, which might result in unoptimized plans from the query planner. We confirmed it was running with the following query, which shows when autovacuum last ran on the table:
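
(The query below is a sketch rather than the exact one from the post, which appears as a screenshot; the table name notifications is purely illustrative.)

-- Check when autovacuum and autoanalyze last ran on the table
SELECT relname, last_autovacuum, last_autoanalyze, autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'notifications';  -- hypothetical table name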

Slow Query Logs

The next step we took was examining the slow query logs. We use the log_min_duration_statement parameter on our RDS instance to log queries that take longer than 50ms. pgbadger is the tool we use to analyze the slow query logs; it generates a summarized report in HTML format. Once we had the report, we examined the Most Time Consuming Queries section.
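
(As a quick sanity check, not part of the original investigation, the effective threshold can be verified from a psql session:)

-- Queries slower than this value (in milliseconds) are written to the log
SHOW log_min_duration_statement;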

Most Time Consuming Queries from a pgbadger report

This section lists queries sorted from biggest to smallest by total consumption time, which is the average runtime multiplied by the number of executions. Each entry includes examples of actual queries from the log file. We tried executing examples of the top query, but surprisingly it returned immediately. We continued with examples from the second query and this time we found the smoking gun: a query that repeatedly returned very slowly. The query limits the result set to a single row using the LIMIT clause, so it seemed odd that it takes 10 seconds to retrieve 1 row.

EXPLAIN ANALYZE

To find out why the query execution was slow we used the EXPLAIN ANALYZE command. The EXPLAIN command shows the execution plan of a query without actually running it.

Using the ANALYZE option the query is executed and the result includes the actual run times.

Let’s see an example.

First we’ll create an orders table:
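
(The original DDL is shown as an image; a minimal sketch of such a table, with column names that are assumptions rather than the original schema, could be:)

CREATE TABLE orders (
    id          bigserial PRIMARY KEY,
    customer_id integer       NOT NULL,
    amount      numeric(10,2) NOT NULL,
    created_at  timestamptz   NOT NULL DEFAULT now()
);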

Then we’ll insert some random data:
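
(Again the original statement isn’t reproduced here; one common way to generate random rows is generate_series combined with random(), assuming the illustrative columns above:)

-- Insert 100,000 rows of random data
INSERT INTO orders (customer_id, amount, created_at)
SELECT (random() * 1000)::int + 1,
       round((random() * 100)::numeric, 2),
       now() - (random() * interval '90 days')
FROM generate_series(1, 100000);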

Now let’s run an EXPLAIN query:
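
(The exact query from the post is an image; a query of roughly this shape produces the Limit-over-Seq-Scan plan discussed below:)

-- EXPLAIN prints the planner's estimates without executing the query
EXPLAIN SELECT * FROM orders LIMIT 10;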

The output is a detailed query execution plan which includes an estimated cost and an estimated number of returned rows for each step. The cost estimation has two values: the startup cost (to return the first row) and the total cost (to return all rows). Costs are measured in arbitrary units anchored to the cost of a single sequential disk page fetch.

Let’s use the ANALYZE option now:
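
(With the same illustrative query:)

-- ANALYZE actually runs the query and reports real times and row counts
EXPLAIN ANALYZE SELECT * FROM orders LIMIT 10;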

This time the output also includes the actual run time and the number of rows returned by each step. The actual time part has two values: the startup time (time until the first row is returned) and the total time (time until all rows are returned).

In our example it took 0.084ms for the sequential scan and 0.058ms (0.142–0.084) for limiting the result.

Now back to our slow query. The image above shows the result of running EXPLAIN ANALYZE on it. We identified that the index scan takes 10 seconds (10444.278 ms to be precise) and that it returns 9019 rows, which are then sorted in-memory by the database. Returning 9019 rows from an index scan means a lot of redundant read IOPS when all we need is a single row, and we figured that might be the reason for the surge of read IOPS we had been seeing in the graphs.

Index Usage

The next thing was to understand why the query planner uses this index and how we can reduce the query time. We were interested in the index’s structure, to see whether it fits the query’s structure (selected columns, WHERE clause, limits), so we ran the \d command in the psql client, which lists all of a table’s indices and their definitions. Now we had the index’s structure, but surprisingly the list included another index whose definition matches the exact fields and criteria used in the query. So why didn’t the query planner use the proper index? The next thing we did was check whether the proper index is being used at all, using the following query:
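
(The query itself appears in the post as an image; a sketch that returns the same information from PostgreSQL’s statistics views, with an illustrative table name, is:)

-- idx_scan counts how many scans have been initiated on each index
SELECT indexrelname AS index_name, idx_scan AS number_of_scans
FROM pg_stat_user_indexes
WHERE relname = 'notifications'  -- hypothetical table name
ORDER BY idx_scan DESC;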

The query outputs the number_of_scans initiated on each index. By running it multiple times we can determine which index is being used by comparing number_of_scans between executions and seeing whether it increases.

We noticed that the proper index was not being used at all, i.e. its number of scans remained the same, while the number of scans of the less fitting index increased every time we executed the query. What surprised us is that the total number of scans of the less optimized index was still far lower than that of the proper index. This made us realize that index usage had recently shifted from the proper index to the less optimized one, which increased read IOPS.

Index Deletion

The next thing we did was delete the less optimized index with a DROP INDEX CONCURRENTLY statement. We used the CONCURRENTLY option to avoid locking the table against concurrent queries. Once the index was deleted, the query planner started using the proper index.
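
(For illustration; the index name here is hypothetical, not the actual index from our schema:)

-- CONCURRENTLY avoids the exclusive table lock a plain DROP INDEX would take;
-- note that it cannot be run inside a transaction block
DROP INDEX CONCURRENTLY idx_notifications_by_created_at;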

As you can see in the screenshot above, the total execution time is now 1ms. Moreover, the scan type changed from an Index Scan to an Index Only Scan. An Index Scan fetches rows from both the index and the table’s heap, while an Index Only Scan returns rows directly from the index without accessing the heap at all.

But the cherry on top is that the Index Only Scan returns only 1 row compared to 9019 rows returned using the previous index.

Now that the query planner was using the proper index, read IOPS dropped and Inbox fully recovered.

Summary

A rogue index sent us on a journey that started in the frontend application, visited the backend code and ended in the database itself. We learned a lot along the way, especially about reading the query planner’s execution output and debugging index usage.

The only question left is: why did the query planner switch its index usage?

We will keep investigating to understand the planner’s behavior and will share our conclusions as soon as we have them.
