Recently we encountered some severe performance problems in one of our projects. Some SQL queries were very slow.
A certain query took over 6 seconds to finish in Azure’s Postgres service and about 350ms locally on my laptop. Both are unacceptable.
I don’t know much more than the Postgres basics, and the parts of the project relevant here were more or less new to me. But let me show you how I approach the unknown, what I learned, and how I solved it.
By the way: the issue is a prime example of not spending more time on optimisation than feels necessary at the time. So even though the initial code looks a bit naive in hindsight, I still consider it fine. The application was fast enough for over two years in production, and as we later learned, the issue was triggered by data unlike anything we had seen up to that point.
Now, let’s dive in!
I managed to find the slow query:
SELECT code FROM regions WHERE cast(principal as text) = '1';
Now this doesn’t look very evil. Though without knowing the rest of the application, the cast looks rather strange. So I tried a few things:
- getting rid of the cast -> no change
- checking the size of the table -> 35k entries so this should not be it
- checking if there is an index on the column from the where clause -> yes there was (and since removing the cast did nothing, I skipped creating an index that contained the cast)
- the column that is selected is of type TEXT and all entries of the table are rather short (less than 6 characters)
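For the index check, one quick way (assuming access to any db client; pg_indexes is a built-in Postgres view) is something like:

```sql
-- List all indexes on the regions table and their definitions
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'regions';
```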
So at this point I felt like I had checked all the obvious possible culprits and not found anything. Next thing to check: just run the same query in our db client. This was rather interesting, because there it only took about 16ms, which is much closer to what we would expect.
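If your client does not show timings by default, psql can (the \timing meta-command is built in):

```sql
-- In psql: toggle per-statement timing, then run the query
\timing on
SELECT code FROM regions WHERE cast(principal as text) = '1';
```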
I had a bit of a moment. Why. On earth. Is. It. Fast. In my db client!?!?
After a bit of frustration I had a closer look at all the queries actually being executed during the request in question. There were multiple queries, and among them I found one that set a particular Postgres role. Sure enough, the role being set was different from the one my db client was using. So I dug into that.
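To see which role a connection is running as, and to reproduce the slow case directly in a db client, something like this works (read_own_regions being the role that turns out to be relevant here):

```sql
-- Compare what the application connection and the db client run as
SELECT current_user, session_user;

-- Switch the db client session to the application's role
SET ROLE read_own_regions;
```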
I found out that the project has very complex requirements on who can access which data. So sometimes the data carries access control information itself, and this is enforced by using different Postgres roles and row-level security (RLS). The RLS setup for this table looks like this:
ALTER TABLE regions ENABLE ROW LEVEL SECURITY;
GRANT SELECT ON regions TO read_own_regions;
CREATE POLICY own_regions ON regions TO read_own_regions
  USING (principal = current_principal());
I hadn’t used these advanced concepts in a while, so I did not fully understand how they affect the queries. My very dumbed-down understanding was that the policy’s condition is more or less appended to the WHERE clause of all queries on that table. RLS has some potential to slow things down, but according to my online research, not nearly as badly as what we were experiencing.
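Under that mental model, the slow query effectively becomes something like this (a sketch of the rewritten query, not what Postgres literally executes):

```sql
SELECT code
FROM regions
WHERE cast(principal as text) = '1'
  AND principal = current_principal();  -- condition added by the RLS policy
```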
But now that I knew the performance differs a lot between database users, I could take a look at what Postgres does differently between them. Postgres has this nice feature of explaining the query plan when you prepend EXPLAIN ANALYSE VERBOSE to your query. So I ran the following for both the fast and the slow case:
EXPLAIN ANALYSE VERBOSE SELECT code FROM regions WHERE cast(principal as text) = '1';
For the fast case it yielded
Seq Scan on public.regions  (cost=0.00..903.85 rows=185 width=5) (actual time=0.020..16.848 rows=6 loops=1)
  Output: code
  Filter: ((regions.principal)::text = '1'::text)
  Rows Removed by Filter: 37014
Planning time: 0.084 ms
Execution time: 16.876 ms
And for the slow case it yielded
Subquery Scan on regions  (cost=0.00..10103.32 rows=37 width=5) (actual time=0.340..365.260 rows=6 loops=1)
  Output: regions.code
  Filter: ((regions.principal)::text = '1'::text)
  ->  Seq Scan on public.regions regions_1  (cost=0.00..9973.75 rows=7404 width=13) (actual time=0.336..365.252 rows=6 loops=1)
        Output: regions_1.code, regions_1.principal
        Filter: (regions_1.principal = current_principal())
        Rows Removed by Filter: 37014
Planning time: 0.115 ms
Execution time: 365.284 ms
Sequential scans are slow, and in the slow case there are two scans nested inside each other. So in the worst case that is 35k × 35k ≈ 1.2 billion entries to scan. Now I understand why it might be slow, but I don’t understand why Postgres doesn’t use any indexes, for example.
At this point I didn’t really know what to do next. As I often have to, I just followed intuition and the general smell of what feels unusual. I often encourage people to do that consciously when exploring the unknown. In this case, it is the condition of the RLS policy, with its function call, that feels strange.
If I replace the function call with a static value, our performance is good again. So we’re on the right track on the first try – yeahh 🙂 .
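For reference, the experiment was roughly this, swapping the function call in the policy for a literal (the value 1 is just the principal from the query above):

```sql
-- Temporarily replace the function call with a literal to test performance
ALTER POLICY own_regions ON regions USING (principal = 1);

-- ...and restore the original condition afterwards
ALTER POLICY own_regions ON regions USING (principal = current_principal());
```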
How often is this function actually called, and what does it look like?
CREATE OR REPLACE FUNCTION current_principal() RETURNS bigint AS $$
DECLARE
  result bigint;
BEGIN
  SELECT principal INTO result FROM rls_users
    WHERE id::TEXT = current_setting('acme.current_user');
  RETURN result;
END;
$$ LANGUAGE plpgsql;
I put some log output

RAISE INFO 'Being called';

into the function so that I can see in my db client when it is called. The result: this function is called very often (I skipped counting exactly how often). And since it performs another query on each call, that makes everything all the slower.
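In context, the instrumented function looked roughly like this (same body as above, with the debug line added):

```sql
CREATE OR REPLACE FUNCTION current_principal() RETURNS bigint AS $$
DECLARE
  result bigint;
BEGIN
  RAISE INFO 'Being called';  -- temporary debug output, removed afterwards
  SELECT principal INTO result FROM rls_users
    WHERE id::TEXT = current_setting('acme.current_user');
  RETURN result;
END;
$$ LANGUAGE plpgsql;
```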
The value returned by this function should not change during a query, yet it is being called multiple times. More research shows that Postgres knows three volatility categories for functions: VOLATILE, STABLE and IMMUTABLE.
IMMUTABLE functions must never change their value for the same parameters over the runtime of the database.
STABLE functions must never change their value over the runtime of a query.
VOLATILE functions have no restrictions.
Looking at the code, none of these is specified. The default turns out to be VOLATILE. This explains why Postgres is not able to optimise away repeated calls to this function. Now IMMUTABLE would be completely wrong here, because different requests run as different principals. But STABLE fits our bill perfectly.
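As an aside, the volatility can also be changed without repeating the whole function body:

```sql
-- Equivalent to redefining the function with the STABLE keyword
ALTER FUNCTION current_principal() STABLE;
```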
So I made this function STABLE:
CREATE OR REPLACE FUNCTION current_principal() RETURNS bigint STABLE AS $$
DECLARE
  result bigint;
BEGIN
  SELECT principal INTO result FROM rls_users
    WHERE id::TEXT = current_setting('acme.current_user');
  RETURN result;
END;
$$ LANGUAGE plpgsql;
And took a look at the new query plan:
Subquery Scan on regions  (cost=0.54..13.05 rows=1 width=5) (actual time=0.041..0.046 rows=6 loops=1)
  Output: regions.code
  Filter: ((regions.principal)::text = '1'::text)
  ->  Index Scan using regions_p_rescue_index on public.regions regions_1  (cost=0.54..12.96 rows=5 width=13) (actual time=0.038..0.040 rows=6 loops=1)
        Output: regions_1.code, regions_1.principal
        Index Cond: (regions_1.principal = current_principal())
Planning time: 0.357 ms
Execution time: 0.062 ms
Yeahh 🙂 ! Now we perform well – really well, actually. Problem solved, I would say.
One last thing: we now actually perform two orders of magnitude better than the initial fast query, and that query still performs exactly the same. I’ll leave it up to you to figure out why 😉 .
Have a good one 🙂 !