<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.2.1" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
	<title>Comments on: Selecting random rows from a table</title>
	<link>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/</link>
	<description>an experimental blog</description>
	<pubDate>Thu, 09 Sep 2010 05:59:37 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.1</generator>

	<item>
		<title>By: Andrew</title>
		<link>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-4221</link>
		<author>Andrew</author>
		<pubDate>Sat, 29 Aug 2009 20:13:08 +0000</pubDate>
		<guid>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-4221</guid>
		<description>That's not unbiased; some rows are more likely than others to be returned.</description>
		<content:encoded><![CDATA[<p>That&#8217;s not unbiased; some rows are more likely than others to be returned.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew</title>
		<link>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-4172</link>
		<author>Andrew</author>
		<pubDate>Thu, 27 Aug 2009 16:51:31 +0000</pubDate>
		<guid>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-4172</guid>
		<description>If you don't really care about insert performance, and you only need one (or a few) random rows at once you can do this:

ALTER TABLE x ADD COLUMN r DOUBLE PRECISION;
ALTER TABLE x ALTER COLUMN r SET DEFAULT random();
UPDATE x SET r = random() WHERE r IS NULL; -- this will be slow
ALTER TABLE x ALTER COLUMN r SET NOT NULL;
CREATE INDEX i ON x(r); -- also slow
ANALYZE x(r);

Then take a sample row quickly by running this:
SELECT * FROM x WHERE r &#62;= (SELECT random()) ORDER BY r LIMIT 1;
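
One gap worth noting: if the random value lands above the largest r, that query returns no row. A minimal sketch of a wrap-around fallback, assuming the same table x and indexed column r (the pass column and the alias s are only illustrative names, and x is assumed not to have a column called pass already):

```sql
-- Sketch only: pick the first row at or after a random point in r;
-- if there is none, wrap around to the row with the smallest r.
-- random() returns values in [0, 1), so BETWEEN ... AND 1 only
-- bounds r from below here.
SELECT * FROM (
  (SELECT 1 AS pass, x.* FROM x
    WHERE r BETWEEN (SELECT random()) AND 1
    ORDER BY r LIMIT 1)
  UNION ALL
  (SELECT 2 AS pass, x.* FROM x
    ORDER BY r LIMIT 1)
) AS s
ORDER BY pass
LIMIT 1;
```

This only avoids the empty result; rows that follow a large gap in r are still more likely to be picked than others.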

I'm not sure if asking for more than one row in the LIMIT clause would be statistically sound or not.  The "random" order is fixed, so whenever you land in an overlapping spot the sequence will be the same.

If you just need a few rows, you can UNION a few of those together, and that should be as random as you could care for.</description>
		<content:encoded><![CDATA[<p>If you don&#8217;t really care about insert performance, and you only need one (or a few) random rows at once you can do this:</p>
<p>ALTER TABLE x ADD COLUMN r DOUBLE PRECISION;<br />
ALTER TABLE x ALTER COLUMN r SET DEFAULT random();<br />
UPDATE x SET r = random() WHERE r IS NULL; -- this will be slow<br />
ALTER TABLE x ALTER COLUMN r SET NOT NULL;<br />
CREATE INDEX i ON x(r); -- also slow<br />
ANALYZE x(r);</p>
<p>Then take a sample row quickly by running this:<br />
SELECT * FROM x WHERE r &gt;= (SELECT random()) ORDER BY r LIMIT 1;</p>
<p>I&#8217;m not sure if asking for more than one row in the LIMIT clause would be statistically sound or not.  The &#8220;random&#8221; order is fixed, so whenever you land in an overlapping spot the sequence will be the same.</p>
<p>If you just need a few rows, you can UNION a few of those together, and that should be as random as you could care for.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joanmi</title>
		<link>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-1079</link>
		<author>Joanmi</author>
		<pubDate>Fri, 17 Apr 2009 23:18:11 +0000</pubDate>
		<guid>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-1079</guid>
		<description>Tell it to my boss.

Sorry for the noise.

Regards.</description>
		<content:encoded><![CDATA[<p>Tell it to my boss.</p>
<p>Sorry for the noise.</p>
<p>Regards.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew</title>
		<link>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-1076</link>
		<author>Andrew</author>
		<pubDate>Fri, 17 Apr 2009 15:28:50 +0000</pubDate>
		<guid>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-1076</guid>
		<description>I don't have much sympathy for people still using 7.4 (which is approaching EOL).

The optimization for min() and max() to use indexes was added in 8.1.</description>
		<content:encoded><![CDATA[<p>I don&#8217;t have much sympathy for people still using 7.4 (which is approaching EOL).</p>
<p>The optimization for min() and max() to use indexes was added in 8.1.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joanmi</title>
		<link>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-1074</link>
		<author>Joanmi</author>
		<pubDate>Fri, 17 Apr 2009 09:52:19 +0000</pubDate>
		<guid>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-1074</guid>
		<description>Dear Andrew,

We are in the middle of migrating to PostgreSQL 8.3, but I can say that, at least in Postgres 7.4, max() and min() cause a sequential scan, which can be quite expensive on tables with a few million rows.

I know that using the sequence's last_value is not a good idea, but it is the best way I have found (and tested).

Also, the generate_series() function doesn't exist in Postgres 7.4. I thought it was a user-defined function that the author had not reproduced, but I have now found it in the Postgres 8.3 documentation and, yes, it is a much better way to get some rows (but in Postgres 7 we don't have this option).

You are right that we need the outer-level order by random() to guarantee a random order. I apologize for that.


So I think that in Postgres 7 the best solution would be something like this:

select * from (
  select * from item
  where item_id in (
    select floor(random() * (
      select last_value
      from item_item_id_seq
   ))::bigint
   from item
   limit 100
   ) limit 10
) as foo
order by random();

Of course, if we have Postgres 8, we can use generate_series() to improve it.

For the limits, I suggest testing max() and min() in Postgres 8 (I will do it as soon as I can). I also think that an index ought to speed up finding maximum and minimum values, but, at least when we examine big subsets of wide tables, the Postgres planner uses a sequential scan because in those cases it is more efficient than an index scan.

I repeat: I think that finding the maximum or minimum could be more efficient using an index, but Postgres 7, at least, does not do it this way.

For this reason, I suggest trying "explain select max(item_id) from item" first. I understand your arguments, but many times I have discovered that things are not as they seem.


PS: I have just remembered a better way to implement max() and min() that really takes advantage of the index, which I used in the past. I also remembered the reason (I think) why max() doesn't take advantage of the index: we might need max() over a subset of the table or over a join result. Of course, it is possible (and desirable) that Postgres 8 implements a way to always take advantage of indexes.

Look at this:

calm=# explain select max(lid) from location;
                                QUERY PLAN
---------------------------------------------------------------------------
 Aggregate  (cost=214987.09..214987.09 rows=1 width=8)
   -&#62;  Seq Scan on "location"  (cost=0.00..206792.67 rows=3277767 width=8)
(2 rows)

calm=# explain select lid from location order by lid desc limit 1;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.79 rows=1 width=8)
   -&#62;  Index Scan Backward using location_pkey on "location"  (cost=0.00..12438139.10 rows=3277767 width=8)
(2 rows)


I don't really need to obtain random rows from a table. I simply found this post yesterday while searching for something else and found it interesting.

But, if you need to, you can use this trick to eliminate the use of the sequence's last_value in my query (PG7), or the max() and min() in the post author's query in Postgres 8 if max() and min() still don't take advantage of indexes.
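
A minimal sketch of that substitution, assuming the item table and bigint item_id key used earlier in this thread, and assuming the ids start at 1; it is one possible reading of the trick, not a tested recipe:

```sql
-- Hypothetical rewrite of the single-row lookup: the sequence's
-- last_value is replaced by the "order by ... desc limit 1" form,
-- which can walk the primary key index backward instead of
-- scanning the whole table.
select * from item
where item_id = (
  select 1 + floor(random() * (
    select item_id from item order by item_id desc limit 1
  ))::bigint
);
```

As with the sequence version, a hole in the ids can make this return zero rows, so the caller still has to be ready to retry.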

Regards.</description>
		<content:encoded><![CDATA[<p>Dear Andrew,</p>
<p>We are in the middle of migrating to PostgreSQL 8.3, but I can say that, at least in Postgres 7.4, max() and min() cause a sequential scan, which can be quite expensive on tables with a few million rows.</p>
<p>I know that using the sequence&#8217;s last_value is not a good idea, but it is the best way I have found (and tested).</p>
<p>Also, the generate_series() function doesn&#8217;t exist in Postgres 7.4. I thought it was a user-defined function that the author had not reproduced, but I have now found it in the Postgres 8.3 documentation and, yes, it is a much better way to get some rows (but in Postgres 7 we don&#8217;t have this option).</p>
<p>You are right that we need the outer-level order by random() to guarantee a random order. I apologize for that.</p>
<p>So I think that in Postgres 7 the best solution would be something like this:</p>
<p>select * from (<br />
  select * from item<br />
  where item_id in (<br />
    select floor(random() * (<br />
      select last_value<br />
      from item_item_id_seq<br />
   ))::bigint<br />
   from item<br />
   limit 100<br />
   ) limit 10<br />
) as foo<br />
order by random();</p>
<p>Of course, if we have Postgres 8, we can use generate_series() to improve it.</p>
<p>For the limits, I suggest testing max() and min() in Postgres 8 (I will do it as soon as I can). I also think that an index ought to speed up finding maximum and minimum values, but, at least when we examine big subsets of wide tables, the Postgres planner uses a sequential scan because in those cases it is more efficient than an index scan.</p>
<p>I repeat: I think that finding the maximum or minimum could be more efficient using an index, but Postgres 7, at least, does not do it this way.</p>
<p>For this reason, I suggest trying &#8220;explain select max(item_id) from item&#8221; first. I understand your arguments, but many times I have discovered that things are not as they seem.</p>
<p>PS: I have just remembered a better way to implement max() and min() that really takes advantage of the index, which I used in the past. I also remembered the reason (I think) why max() doesn&#8217;t take advantage of the index: we might need max() over a subset of the table or over a join result. Of course, it is possible (and desirable) that Postgres 8 implements a way to always take advantage of indexes.</p>
<p>Look at this:</p>
<p>calm=# explain select max(lid) from location;<br />
                                QUERY PLAN<br />
---------------------------------------------------------------------------<br />
 Aggregate  (cost=214987.09..214987.09 rows=1 width=8)<br />
   -&gt;  Seq Scan on "location"  (cost=0.00..206792.67 rows=3277767 width=8)<br />
(2 rows)</p>
<p>calm=# explain select lid from location order by lid desc limit 1;<br />
                                                 QUERY PLAN<br />
------------------------------------------------------------------------------------------------------------<br />
 Limit  (cost=0.00..3.79 rows=1 width=8)<br />
   -&gt;  Index Scan Backward using location_pkey on "location"  (cost=0.00..12438139.10 rows=3277767 width=8)<br />
(2 rows)</p>
<p>I don&#8217;t really need to obtain random rows from a table. I simply found this post yesterday while searching for something else and found it interesting.</p>
<p>But, if you need to, you can use this trick to eliminate the use of the sequence&#8217;s last_value in my query (PG7), or the max() and min() in the post author&#8217;s query in Postgres 8 if max() and min() still don&#8217;t take advantage of indexes.</p>
<p>Regards.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew</title>
		<link>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-1069</link>
		<author>Andrew</author>
		<pubDate>Fri, 17 Apr 2009 01:52:37 +0000</pubDate>
		<guid>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-1069</guid>
		<description>Iterating a single-row fetch in the client is potentially quite a lot slower than the approach given in the original post.

Using the sequence last_value is essentially always a bad idea; better to use min() and max() on the actual id column. Due to the non-transactional nature of sequences, it's possible (for example in the case where a large bulk insert is running in another session) for the sequence last_value to be a long way ahead of the visible maximum ID.

Using a real table rather than a generate_series() call to produce multiple rows in the IN subquery is just going to slow things down.
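
For illustration, a rough sketch of the shape being argued for here, assuming the item table and item_id column from the comments above (the original post's exact query may differ):

```sql
-- Sketch only: draw 100 candidate ids between min(item_id) and
-- max(item_id) using generate_series() as the row source, fetch
-- the ones that exist, then shuffle and keep 10.
SELECT * FROM item
 WHERE item_id IN (
   SELECT (SELECT min(item_id) FROM item)
          + floor(random() * ((SELECT max(item_id) FROM item)
                              - (SELECT min(item_id) FROM item) + 1))::bigint
     FROM generate_series(1, 100)
 )
 ORDER BY random()
 LIMIT 10;
```

The 100 and 10 are arbitrary: oversampling candidate ids leaves room for holes and duplicates, and fewer than 10 rows can still come back if the holes are dense.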

And finally, omitting the outer level ORDER BY random() clause means that the results will not be in a random order (even though it may look random).</description>
		<content:encoded><![CDATA[<p>Iterating a single-row fetch in the client is potentially quite a lot slower than the approach given in the original post.</p>
<p>Using the sequence last_value is essentially always a bad idea; better to use min() and max() on the actual id column. Due to the non-transactional nature of sequences, it&#8217;s possible (for example in the case where a large bulk insert is running in another session) for the sequence last_value to be a long way ahead of the visible maximum ID.</p>
<p>Using a real table rather than a generate_series() call to produce multiple rows in the IN subquery is just going to slow things down.</p>
<p>And finally, omitting the outer level ORDER BY random() clause means that the results will not be in a random order (even though it may look random).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joanmi</title>
		<link>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-1067</link>
		<author>Joanmi</author>
		<pubDate>Thu, 16 Apr 2009 19:26:25 +0000</pubDate>
		<guid>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-1067</guid>
		<description>Continuing from my previous post...

I just thought about a way to get multiple random rows in one query...

As a first approach, the basic trick consists of using some table to retrieve many random values:

select * from item
where item_id in (
  select floor(random() * (
    select last_value
    from item_item_id_seq
  ))::bigint
  from item
  limit 10
);

But, because of the holes, we may obtain 10 rows... or fewer.


Workaround: get more ids than we need and limit again:

select * from item
where item_id in (
  select floor(random() * (
    select last_value
    from item_item_id_seq
  ))::bigint
  from item
  limit 100
) limit 10;


This method is not rock-solid: we can still obtain fewer rows than expected, but the probability of that is much smaller, depending on the density of the holes and the gap between the first and second 'limit' values.

We can also increase the first 'limit' to drastically reduce the probability of not finding enough rows, at quite a small increase in processing cost.

For example, values up to 100000 for the first limit work quite fast for me.

But keep in mind: the probability of obtaining too few results can be drastically reduced, but NOT totally eliminated.

In conclusion: the best way, if we can do this, is to iterate in the client application until we obtain enough results. We can then use this multiple-row approach to get there faster and, at least most of the time, with only one or two database queries.

PS: If we want to obtain large lists of random rows, we can also use this query and, if we don't obtain enough results, adjust the limit values on each iteration depending on the number of misses.</description>
		<content:encoded><![CDATA[<p>Continuing from my previous post&#8230;</p>
<p>I just thought about a way to get multiple random rows in one query&#8230;</p>
<p>As a first approach, the basic trick consists of using some table to retrieve many random values:</p>
<p>select * from item<br />
where item_id in (<br />
  select floor(random() * (<br />
    select last_value<br />
    from item_item_id_seq<br />
  ))::bigint<br />
  from item<br />
  limit 10<br />
);</p>
<p>But, because of the holes, we may obtain 10 rows&#8230; or fewer.</p>
<p>Workaround: get more ids than we need and limit again:</p>
<p>select * from item<br />
where item_id in (<br />
  select floor(random() * (<br />
    select last_value<br />
    from item_item_id_seq<br />
  ))::bigint<br />
  from item<br />
  limit 100<br />
) limit 10;</p>
<p>This method is not rock-solid: we can still obtain fewer rows than expected, but the probability of that is much smaller, depending on the density of the holes and the gap between the first and second &#8216;limit&#8217; values.</p>
<p>We can also increase the first &#8216;limit&#8217; to drastically reduce the probability of not finding enough rows, at quite a small increase in processing cost.</p>
<p>For example, values up to 100000 for the first limit work quite fast for me.</p>
<p>But keep in mind: the probability of obtaining too few results can be drastically reduced, but NOT totally eliminated.</p>
<p>In conclusion: the best way, if we can do this, is to iterate in the client application until we obtain enough results. We can then use this multiple-row approach to get there faster and, at least most of the time, with only one or two database queries.</p>
<p>PS: If we want to obtain large lists of random rows, we can also use this query and, if we don&#8217;t obtain enough results, adjust the limit values on each iteration depending on the number of misses.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joanmi</title>
		<link>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-1066</link>
		<author>Joanmi</author>
		<pubDate>Thu, 16 Apr 2009 18:21:43 +0000</pubDate>
		<guid>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-1066</guid>
		<description>I think that, in large tables, if we have a serial column and the holes are not huge, the best solution can be to iterate a select on a random id (only one row per query, and occasionally zero, but quite fast), just like this:

select * from items
where item_id = (
   select floor(random() * (
     select last_value
     from items_item_id_seq
   ))::bigint
);

Note that I use a subselect to avoid per-row evaluation of random(), and cast the result to bigint (because, in my case, that is the type of the serial column); otherwise, the type mismatch would force a sequential scan.

Also note that I use the sequence to get an approximate maximum value of the id. This works quite well if the sequence's increment is positive (the default), min_value is 1 (the default) or at least positive, and (also by default) sequence cycling is not enabled or max_value is never reached.</description>
		<content:encoded><![CDATA[<p>I think that, in large tables, if we have a serial column and the holes are not huge, the best solution can be to iterate a select on a random id (only one row per query, and occasionally zero, but quite fast), just like this:</p>
<p>select * from items<br />
where item_id = (<br />
   select floor(random() * (<br />
     select last_value<br />
     from items_item_id_seq<br />
   ))::bigint<br />
);</p>
<p>Note that I use a subselect to avoid per-row evaluation of random(), and cast the result to bigint (because, in my case, that is the type of the serial column); otherwise, the type mismatch would force a sequential scan.</p>
<p>Also note that I use the sequence to get an approximate maximum value of the id. This works quite well if the sequence&#8217;s increment is positive (the default), min_value is 1 (the default) or at least positive, and (also by default) sequence cycling is not enabled or max_value is never reached.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andreas</title>
		<link>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-532</link>
		<author>Andreas</author>
		<pubDate>Wed, 18 Mar 2009 12:13:17 +0000</pubDate>
		<guid>http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/#comment-532</guid>
		<description>I actually googled for this just three days ago, but none of the solutions that came up was this neat. 

I had a go initially at doing a similar calculation on row numbers in an OFFSET (xxx) LIMIT 1, only to find out that OFFSET didn't like subqueries. My problem was of the run-once kind, so a simple ORDER BY random() was sufficient. But I'll add your solution to my mental toolbox; I can think of other complex queries where similar solutions might save the day.</description>
		<content:encoded><![CDATA[<p>I actually googled for this just three days ago, but none of the solutions that came up was this neat. </p>
<p>I had a go initially at doing a similar calculation on row numbers in an OFFSET (xxx) LIMIT 1, only to find out that OFFSET didn&#8217;t like subqueries. My problem was of the run-once kind, so a simple ORDER BY random() was sufficient. But I&#8217;ll add your solution to my mental toolbox; I can think of other complex queries where similar solutions might save the day.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
