The question of how to select some random items from a table is one that comes up fairly often in the IRC channel (and as the subject of blog posts, such as this one from depesz).
While there is a simple solution of this form (let’s assume for now that we want to select 5 uniformly random rows from table “items”):
select * from items order by random() limit 5;
this is unfortunately slow if the table has more than a small number of rows. It is possible to do far better, though there aren’t really any perfect solutions that don’t resort to procedural logic in some way. Let’s start by looking at how to do better in plain SQL.
Obviously, to get a faster solution, we need to be able to write the query in a way that can make use of some index. Either the table already has a serial column, or one can be added simply for this purpose. Then we can just look up some random ids and return those rows….
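(If the column has to be added, something along these lines would do it; this is a sketch using the table and column names from the query below, and the constraint name is arbitrary. The unique constraint is what creates the index the lookups will rely on.)

alter table items add column item_id serial;
alter table items add constraint items_item_id_key unique (item_id);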
Well, it’s not quite that simple. The obvious obstacle is that a serial column is not likely to contain only contiguous values; there will be holes in the sequence both for deleted rows, and also for rows that never existed (rolled back during insert, sequence values skipped during recovery, etc.). However, as long as the column values are not too sparse, we can just generate more random values than we need, and hope we get enough that actually match:
select * from items
 where item_id in (select floor(random() * (max_id - min_id + 1))::integer + min_id
                     from generate_series(1,15),
                          (select max(item_id) as max_id, min(item_id) as min_id
                             from items) s1
                    limit 15)
 order by random() limit 5;
There are a number of subtleties here that aren’t immediately obvious.
- We use floor() rather than just a cast to integer, because casting rounds the value rather than truncating towards zero (a quick demonstration follows this list).
- We need to generate more values than we need, not only to account for missing values of item_id, but also to allow for the fact that the subquery may return duplicate values (which IN will remove).
- We use a LIMIT clause in the subselect as a way to tell the optimizer how many rows are involved; without that, it’s much more likely to choose a suboptimal plan due to expecting the generate_series() call to return 1000 rows.
- We still need the order by random() at the outer level; if we get more rows than we need, which is the normal case, we have to avoid bias in which ones we ultimately return. (Some of the possible plans for the IN join will return the result rows in a non-random order.)
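A quick way to see the floor() vs. cast difference (an illustrative check of my own, not part of the original query):

select 0.9::integer as with_cast, floor(0.9)::integer as with_floor;

This returns 1 and 0 respectively. Used in the id-generating expression, the rounding cast would pick min_id only half as often as other values, and would occasionally produce an id one past max_id.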
This query only works if the sparseness of the item_id column is within a reasonable range. In the example above we’re fetching up to 15 rows hoping to get 5; if half the item_id values between min_id and max_id are missing, then the query will come up short approximately 6% of the time (each of the 15 guesses independently hits an existing id with probability 1/2, so ignoring duplicates among the guesses, the chance of fewer than 5 hits is a binomial tail probability of roughly 5.9%), whereas if 2/3rds of the values are missing, the query will come up short about 40% of the time.
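To find out where a given table falls, the density of the id column can be measured directly (a simple one-off check, using the same table and column names as above):

select count(*)::float8 / (max(item_id) - min(item_id) + 1) as density
  from items;

A density near 1 means few holes; anything much below 1/2 and the oversampling factor in the inner subquery needs to grow accordingly.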
It’s not really feasible to try to make this query always work regardless of how sparse the data is; increasing the number of attempts in the inner subquery will eventually result in it using a plan that scans the whole items table anyway, at which point we might as well have used order by random() in the first place. But for the (apparently common) case where the ids are not too sparse, this method is extremely fast.
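For completeness, here is one way the procedural fallback mentioned at the start might look. This is my own sketch (the function name random_items and the probe-and-retry strategy are illustrative, not from the post): it keeps guessing ids until it has found the requested number of distinct existing rows, so it works at any sparseness, just with more probes as the column gets sparser.

create or replace function random_items(n integer)
  returns setof items
  language plpgsql volatile
as $$
declare
  lo integer;
  hi integer;
  picked integer[] := '{}';
  candidate integer;
begin
  select min(item_id), max(item_id) into lo, hi from items;
  while coalesce(array_length(picked, 1), 0) < n loop
    -- guess a random id in [lo, hi]; keep it only if it is new and exists
    candidate := lo + floor(random() * (hi - lo + 1))::integer;
    if candidate = any (picked) then
      continue;
    end if;
    if exists (select 1 from items where item_id = candidate) then
      picked := picked || candidate;
    end if;
  end loop;
  return query select * from items where item_id = any (picked);
end;
$$;

Each probe is a single index lookup, and every existing id is equally likely to be kept, so the sample is unbiased; usage is just select * from random_items(5);. The obvious caveat is that the loop never terminates if the table has fewer than n rows, so a real version would want a bail-out check.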