Performance issues with ORed conditions

An issue that comes up from time to time on the IRC channel is performance optimization of queries that contain ORed conditions. Here is a summary of the common problems and some solutions.

Why OR is a problem

When multiple conditions are ANDed together, the planner has considerable flexibility in planning the query, because it can generate a plan that satisfies only some of the conditions, and then apply any remaining ones as filters.

But with ORed conditions, this isn’t possible; the planner can’t (at present) generate a plan that produces a partial result satisfying some conditions, and then add more rows to it to satisfy others. This means that a group of ORed conditions has to be kept together and applied at a single point.

When OR can be fast

The one optimization that does exist for ORs is the BitmapOr operation; if every condition in the ORed set is an indexable restriction on the same single table, then the planner can generate Bitmap Index Scan for each condition, combine them with BitmapOr, and use a Bitmap Heap Scan to get the data (unordered).

This can only work if every condition in the OR’d set can be expressed as a bitmap operation itself. If even one of the conditions lacks a usable index, then the planner would have to generate a Seq Scan to locate the rows matching that condition, which defeats the entire point of using indexes for the other conditions. (The UNION fix described below is no use here either.)

ORs across tables, and how to fix them

A classic problem case is this:

SELECT ...
  FROM foo JOIN bar ON ...
 WHERE foo.x = 1 OR bar.y = 2;

This performs poorly because the entire join result must be generated and then filtered according to the ORed conditions; there’s no way to use indexes on foo.x or bar.y to speed things up. The fix is to rewrite the query to use UNION instead of OR:

SELECT ...
  FROM foo JOIN bar ON ...
 WHERE foo.x = 1
UNION
SELECT ...
  FROM foo JOIN bar ON ...
 WHERE bar.y = 2;

This can perform much better because the first branch of the query can use an index on foo.x to drive the join, and the second branch can use an index on bar.y.

ORs in join conditions

The worst case is where the join condition itself contains an OR that doesn’t meet the conditions for BitmapOr as given above. This forces the planner to generate a full cross join of the relevant tables and filter the output.

When considering using OR in a join clause, it’s worth thinking about whether it’s really the correct condition to use; should you perhaps instead be doing separate outer joins on the individual conditions?

Even in the case where the ORed conditions qualify for a BitmapOr, this still rules out the use of other possible plans (such as hash joins), and may constrain the join order if more than two tables are involved.

But there are still common cases that work well, for example:

SELECT ...
  FROM messages m
       JOIN users u ON m.sender_id=u.id OR m.recipient_id=u.id
 WHERE u.username='Fred';

which might produce a plan like:

 Nested Loop
   ->  Index Scan using users_username_idx on users u
         Index Cond: (username = 'Fred'::text)
   ->  Bitmap Heap Scan on messages m
         Recheck Cond: ((sender_id = u.id) OR (recipient_id = u.id))
         ->  BitmapOr
               ->  Bitmap Index Scan on messages_sender_id_idx
                     Index Cond: (sender_id = u.id)
               ->  Bitmap Index Scan on messages_recipient_id_idx
                     Index Cond: (recipient_id = u.id)

which should perform acceptably in most cases. But notice that if we only wanted to select the message ids, or only count the messages, then this might not be the best plan because we won’t have any chance of an index-only scan; consider this alternative query:

SELECT count(distinct s.id)
  FROM (SELECT m.id FROM messages m
                         JOIN users u ON m.sender_id=u.id
         WHERE u.username='Fred'
        UNION ALL
        SELECT m.id FROM messages m
                         JOIN users u ON m.recipient_id=u.id
         WHERE u.username='Fred'
       ) s;

which might get this plan:

 Aggregate
   ->  Append
         ->  Nested Loop
               ->  Index Scan using users_username_idx on users u
                     Index Cond: (username = 'Fred'::text)
               ->  Index Only Scan using messages_sender_id_id_idx on messages m
                     Index Cond: (sender_id = u.id)
         ->  Nested Loop
               ->  Index Scan using users_username_idx on users u_1
                     Index Cond: (username = 'Fred'::text)
               ->  Index Only Scan using messages_recipient_id_id_idx on messages m_1
                     Index Cond: (recipient_id = u_1.id)

which might perform much better if "messages" is large or poorly correlated, and thus the overhead of fetching the heap blocks is significant.

How to spot problematic ORs

To find cases where OR is causing performance issues, look first at the output of EXPLAIN ANALYZE for significant numbers given for "Rows removed by filter" or "Rows removed by join filter". This is a good general rule for spotting problems with index use, whether caused by OR usage or just missing indexes.

This entry was posted in PlanetPG, Postgres, SQL. Bookmark the permalink.

2 Responses to Performance issues with ORed conditions

  1. Jörg Sonnenberger says:

    It would be nice if the query planer would turn
    “foo.bar=1 OR foo.bar=2 OR foo.bar=15” into “foo.bar IN (1,2,15)” automatically, since the latter is often handled significantly better.

Comments are closed.