Engineering

Monitoring and Optimizing Slow Database Queries: A Proactive Approach with Locale

Facing heavy database loads, we transformed our approach at Locale. This post traces our journey from frequent CPU spikes to proactive database query optimization, leveraging Locale's alerting system and SQL efficiency improvements to prevent downtime and enhance reliability for our customers.

May 23, 2023

Graphic introducing Monitoring And Optimizing Slow Database Queries

At Locale, we work extensively with data to gain customer insights and product metrics. Our reliance on data, coupled with our use of Hasura, a GraphQL layer over our Postgres database, means that we occasionally experience heavy query loads on our database instances. Initially, this was a manageable issue. However, as we scaled and the volume of data queried grew, we started experiencing CPU utilization spikes in our database instances. These spikes, if left unchecked, slow query performance, increase response times, and can even lead to downtime.

As these spikes became more frequent, we knew it was time to monitor and manage our database queries proactively.

Navigating Through Heavy Queries

Initially, we focused on frequent queries, but given the nature of the CPU utilization spikes, we quickly ruled them out. The spikes were abrupt, lasting 5-10 minutes, and did not correlate with the number of active connections. Our investigation then led us to query times.

We leveraged the pg_stat_statements extension, a valuable tool that tracks every query the server executes and stores useful metadata about it. The extension even normalizes the queries it records, so slight variations (such as different literal values) don't lead to multiple entries.
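For example, two statements that differ only in a literal value (using a hypothetical users table for illustration) collapse into a single normalized entry:


-- Two statements that differ only in their literal values...
SELECT * FROM users WHERE id = 42;
SELECT * FROM users WHERE id = 7;

-- ...appear as one normalized entry in pg_stat_statements:
-- SELECT * FROM users WHERE id = $1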

Don't worry if you don't have it enabled already; it can be enabled by running the following SQL statement:


CREATE EXTENSION pg_stat_statements;

If you’re using a cloud hosting service for your database, there's a good chance that it is already enabled for you.
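One caveat for self-managed instances: the extension must also be preloaded at server start before statistics are collected, which means adding one line to postgresql.conf and restarting Postgres:


# postgresql.conf -- pg_stat_statements must be preloaded to collect statistics
shared_preload_libraries = 'pg_stat_statements'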

Using this view, we focused on queries with a high mean execution time that were called more than 10 times.



SELECT
  query,
  calls,
  mean_exec_time,
  max_exec_time,
  stddev_exec_time
FROM
  pg_stat_statements
WHERE
  calls > 10
ORDER BY
  mean_exec_time DESC
LIMIT 100;

The idea was to disregard one-off heavy queries that wouldn’t significantly impact our instances in the future.

We discovered that these long-running queries shared common characteristics: multiple joins, complex WHERE clauses, and ORDER BY clauses on large tables. The solution seemed straightforward: create the right indexes. However, since every index adds write and storage overhead, we needed to balance creating indexes with improving the efficiency of the SQL queries themselves.
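As a rough sketch of what that looks like in practice (the table and column names below are illustrative, not our actual schema), a composite index matching a query's filter and sort order lets Postgres serve the query with an index scan instead of a sequential scan followed by a sort:


-- Hypothetical pattern: filter on account_id, sort on created_at.
-- CONCURRENTLY avoids holding a long write lock while the index builds.
CREATE INDEX CONCURRENTLY idx_events_account_created_at
  ON events (account_id, created_at DESC);

-- A query of this shape can now skip the explicit sort entirely:
-- SELECT * FROM events WHERE account_id = $1 ORDER BY created_at DESC LIMIT 50;

Running EXPLAIN ANALYZE before and after is the quickest way to confirm that the planner actually uses the new index.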

Keeping this in mind, we decided on a three-pronged approach: moving unsuitable GraphQL queries to hand-written SQL, rewriting analytics queries to make them more efficient, and creating the right indexes when the first two options did not help.

However, we discovered that most of these queries were not hand-written SQL in our backend. Instead, they were either generated by Hasura or were heavy, inefficient analytics queries. As our product and schemas are constantly evolving, we need to stay on top of the queries being run so we can modify them or create indexes in a timely manner. To proactively monitor our database instances and prevent slow product response times and potential downtime, we decided to use Locale, which lets us create alerts that monitor data and notify the right people.

Proactive Monitoring with Locale

With Locale, we set an alert for queries with a mean execution time greater than 0.5 seconds and a max execution time greater than 1 second.
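Under the hood, an alert like this amounts to periodically polling a query along these lines (a minimal sketch; pg_stat_statements reports execution times in milliseconds, so the thresholds become 500 and 1000):


-- Flag queries whose mean execution time exceeds 0.5s
-- and whose worst-case execution time exceeds 1s.
SELECT
  query,
  calls,
  mean_exec_time,
  max_exec_time
FROM
  pg_stat_statements
WHERE
  mean_exec_time > 500
  AND max_exec_time > 1000;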

We gave the incidents a helpful title and added the entire query as the incident description, so all the necessary information is in one place.

Screenshot of Locale's alert creation screen
A snapshot showing alert titles

We also configured the notification channels so that the engineering team is notified on Slack every time an incident is created.

A snapshot showing alert previews on Locale

Now, whenever Locale detects a new row that meets these criteria, an incident will be created. This alert, configured to run every 5 minutes, sends notifications to the right people in our engineering team via Slack, ensuring we can immediately work on modifying the queries and creating the right indexes.

This not only aids in query performance optimization but also prevents potential system downtime.

Snapshot of a running Incident on Locale's dashboard
Snapshot of a Slack notification on an incident


With Locale, we can now confidently monitor outlier queries and address CPU utilization spikes promptly. This proactive approach to database performance tuning has greatly improved our response time to potential issues and keeps our Postgres database performing well.

As we continue to scale and evolve, we look forward to exploring more innovative ways to leverage Locale in solving engineering problems and improving performance and reliability for our customers.