Perfect hash function for weekday names

This is probably only of limited usefulness in practice, but it came as a spinoff from a discussion on #postgresql. The basic problem is, given a string which you already know is an English weekday name, return the day number (Sunday=0):

create function dayno(text)
  returns integer
  language sql immutable
  as $f$
    select (( ((ascii(substring($1 from 3)) & 22)*10)
              # (ascii($1) & 23) )*5 + 3) % 7;

I came up with several alternative functions, but this one has the advantage of only needing one substring() call (it makes use of the fact that ascii() looks only at the first character). It’s also case-insensitive (since it looks at only the low bits of the characters).

select dayno(d)
  from regexp_split_to_table('Sun,MON,tUe,WeD,tHU,FRi,saT',',') r(d);
(7 rows)

Posted in PlanetPG, Postgres, SQL | Leave a comment

Performance issues with ORed conditions

An issue that comes up from time to time on the IRC channel is performance optimization of queries that contain ORed conditions. Here is a summary of the common problems and some solutions.

Continue reading

Posted in PlanetPG, Postgres, SQL | 2 Comments

The Rule Challenge

I’ve made this challenge on IRC several times, and I think on the mailing lists too, but I’m putting it up here for reference.

I’m not a fan of rules. I’ve used them in production code in the past, though I’d avoid doing so again, and (unlike most people who end up using them) have spent a fair bit of time analyzing their behaviour and effects. I’ve come to the following conclusion:

All non-trivial rules are incorrect.

I define “non-trivial” as follows:

  1. Any rule with multiple action statements is non-trivial.
  2. Any rule using DO ALSO is non-trivial.
  3. Any rule with a WHERE clause is non-trivial.

This pretty much only leaves rules which rewrite into at most one simple statement, which I’ll concede can sometimes be written correctly (but often are not).

I define “incorrect” as “produces surprising or destructive behaviour in response to ordinary SQL statements”. In general this requires SQL statements not forseen by the rule writer; it’s easy to write rules that work only for certain statements, but doing this in real code amounts to laying a minefield for your future developers.

My challenge is simple: show me a counterexample to the above claims. Any takers?

[EDIT: to clarify, by “rule with a WHERE clause” I mean a conditional rule with WHERE used in the rule definition, not merely a rule whose action statement has a WHERE clause in it.]

Posted in Peeves, PlanetPG, Postgres | 5 Comments

Range aggregation with window functions

This article is also available on the PG wiki here

This problem is one I’ve been asked about a couple of times on IRC; the solution involves some tricks with window functions that have a fairly wide range of applications.

Assume you have a table of ranges expressed as “start” (s) and “end” (e) columns; we’ll assume that these are half-open intervals, i.e. each row represents the range [s,e). We also assume that the constraint (e > s) is enforced.

The problem is to aggregate all overlapping ranges and produce a result with one row for each disjoint collection of ranges.

For example (using integers for simplicity but this will often involve other types, such as timestamps, in practice):

test=# select * from intervals order by s,e;
 s  | e  
  1 |  3
  2 |  4
  5 |  6
  5 |  8
  6 |  9
  7 | 10
  8 | 10
 10 | 11
 10 | 15
 11 | 12
 12 | 13
(11 rows)

The desired output from this is:

  s  |  e 
   1 |   4
   5 |  10
  10 |  15
(3 rows)

Unusually for SQL, this problem can be solved most easily by thinking about it in purely procedural terms.

First, consider each range one at a time in ascending order of (s,e). For each range there are only two possibilities: either it overlaps with a range which we have already processed, or it begins a new disjoint range. Using PG 8.4 or later, we can express this idea using window functions as follows:

SELECT s, e,
       CASE WHEN s < max(le) OVER (ORDER BY s,e)
            THEN NULL
            ELSE s
       END AS new_start
  FROM (SELECT s, e, lag(e) OVER (ORDER BY s,e) AS le
          FROM intervals) s1;

The subquery here is because we really want max(lag(e)) in order to exclude the current row's value of "e" but window functions cannot be nested directly. (In 9.0 or later, it may be more convenient to simply change the window framing clause to exclude the current row; this feature is unavailable in 8.4.) The new column "new_start" contains a non-null value only if the current row represents the left edge of a new disjoint interval; all rows which merely extend existing intervals have null values. The output on our test data is therefore:

 s  | e  | new_start 
  1 |  3 |         1
  2 |  4 |          
  5 |  6 |         5
  5 |  8 |          
  6 |  9 |          
  7 | 10 |          
  8 | 10 |          
 10 | 11 |        10
 10 | 15 |          
 11 | 12 |          
 12 | 13 |          
(11 rows)

We now extend this result to label every row with the left edge of the aggregated range of which it is part. We do this by wrapping the above in another level of subquery, using max(new_start) as a window function to propagate the value from each "left edge" row to its successors:

SELECT s, e,
       max(new_start) OVER (ORDER BY s,e) AS left_edge
  FROM (SELECT s, e,
               CASE WHEN s < max(le) OVER (ORDER BY s,e)
                    THEN NULL
                    ELSE s
               END AS new_start
          FROM (SELECT s, e, lag(e) OVER (ORDER BY s,e) AS le
                  FROM intervals) s1) s2;

The output from our test data is now:

 s  | e  | left_edge 
  1 |  3 |         1
  2 |  4 |         1
  5 |  6 |         5
  5 |  8 |         5
  6 |  9 |         5
  7 | 10 |         5
  8 | 10 |         5
 10 | 11 |        10
 10 | 15 |        10
 11 | 12 |        10
 12 | 13 |        10
(11 rows)

We can now simply group by left_edge in order to aggregate all overlapping rows together:

SELECT min(s) AS s, max(e) AS e
  FROM (SELECT s, e,
               max(new_start) OVER (ORDER BY s,e) AS left_edge
          FROM (SELECT s, e,
                       CASE WHEN s < max(le) OVER (ORDER BY s,e)
                            THEN NULL
                            ELSE s
                       END AS new_start
                  FROM (SELECT s, e, lag(e) OVER (ORDER BY s,e) AS le
                         FROM intervals) s1) s2) s3
 GROUP BY left_edge;

producing the desired output:

 s  | e  
  1 |  4
  5 | 10
 10 | 15
(3 rows)
Posted in PlanetPG, Postgres, SQL | Leave a comment

Selecting random rows from a table

The question of how to select some random items from a table is one that comes up fairly often in the IRC channel (and as the subject of blog posts, such as this one from depesz).
While there is a simple solution of this form (let’s assume for now that we want to select 5 uniformly random rows from table “items”):

select * from items order by random() limit 5;

this is unfortunately slow if the table has more than a small number of rows. It is possible to do far better, though there aren’t really any perfect solutions that don’t resort to procedural logic in some way. Let’s start by looking at how to do better in plain SQL.
Continue reading

Posted in PlanetPG, Postgres | 11 Comments

Reading XML files into the database

This came up as a question on IRC: how to read an XML file on the server into a table column?

Continue reading

Posted in PlanetPG, Postgres | 3 Comments

UUID generation for PostgreSQL 8.3 on FreeBSD

A couple of weeks ago I whomped up a quick module in response to someone on IRC who was unable to get contrib/uuid-ossp working on FreeBSD (which turns out to be due to a nasty conflict between misc/ossp-uuid and libc).

I’ve now put it up on PGFoundry, and the initial release can be found here:

Posted in PlanetPG, Postgres | Leave a comment

On Normalization

People have funny ideas about database normalization.

What normalization is not:

  • creating an (id,value) table for every single piece of data is not normalization
  • using surrogate keys for everything is not normalization
  • a way to improve performance (though it sometimes does)

Equally, failing to do the above does not mean your data is “denormalized”.

What normalization is:

  • keeping your data consistent by ensuring that the relationships between values exist only in one place
  • applying the rules: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF … usually between 3NF and 4NF is enough

Obviously, though, there are times when one denormalizes for performance reasons, with appropriate care to ensure consistency isn’t lost in the process.

Posted in Peeves, PlanetPG, SQL | 1 Comment

Hello world!

I have no idea if I want to keep a blog.

So let’s see what happens, if anything.

Posted in Bloggery | 2 Comments