Archive for the SQL Category

Range aggregation with window functions

This article is also available on the PG wiki here

This problem is one I’ve been asked about a couple of times on IRC; the solution involves some tricks with window functions that have a fairly wide range of applications.

Assume you have a table of ranges expressed as “start” (s) and “end” (e) columns; we’ll assume that these are half-open intervals, i.e. each row represents the range [s,e). We also assume that the constraint (e > s) is enforced.

The problem is to aggregate all overlapping ranges and produce a result with one row for each disjoint collection of ranges.

For example (using integers for simplicity but this will often involve other types, such as timestamps, in practice):

test=# select * from intervals order by s,e;
 s  | e
----+----
  1 |  3
  2 |  4
  5 |  6
  5 |  8
  6 |  9
  7 | 10
  8 | 10
 10 | 11
 10 | 15
 11 | 12
 12 | 13
(11 rows)

The desired output from this is:

  s  |  e
-----+-----
   1 |   4
   5 |  10
  10 |  15
(3 rows)

Unusually for SQL, this problem can be solved most easily by thinking about it in purely procedural terms.

First, consider each range one at a time in ascending order of (s,e). For each range there are only two possibilities: either it overlaps with a range which we have already processed, or it begins a new disjoint range. Using PG 8.4 or later, we can express this idea using window functions as follows:

SELECT s, e,
       CASE WHEN s < max(le) OVER (ORDER BY s,e)
            THEN NULL
            ELSE s
       END AS new_start
  FROM (SELECT s, e, lag(e) OVER (ORDER BY s,e) AS le
          FROM intervals) s1;

The subquery here is because we really want max(lag(e)) in order to exclude the current row’s value of “e” but window functions cannot be nested directly. (In 9.0 or later, it may be more convenient to simply change the window framing clause to exclude the current row; this feature is unavailable in 8.4.) The new column “new_start” contains a non-null value only if the current row represents the left edge of a new disjoint interval; all rows which merely extend existing intervals have null values. The output on our test data is therefore:

 s  | e  | new_start
----+----+-----------
  1 |  3 |         1
  2 |  4 |
  5 |  6 |         5
  5 |  8 |
  6 |  9 |
  7 | 10 |
  8 | 10 |
 10 | 11 |        10
 10 | 15 |
 11 | 12 |
 12 | 13 |
(11 rows)

We now extend this result to label every row with the left edge of the aggregated range of which it is part. We do this by wrapping the above in another level of subquery, using max(new_start) as a window function to propagate the value from each “left edge” row to its successors:

SELECT s, e,
       max(new_start) OVER (ORDER BY s,e) AS left_edge
  FROM (SELECT s, e,
               CASE WHEN s < max(le) OVER (ORDER BY s,e)
                    THEN NULL
                    ELSE s
               END AS new_start
          FROM (SELECT s, e, lag(e) OVER (ORDER BY s,e) AS le
                  FROM intervals) s1) s2;

The output from our test data is now:

 s  | e  | left_edge
----+----+-----------
  1 |  3 |         1
  2 |  4 |         1
  5 |  6 |         5
  5 |  8 |         5
  6 |  9 |         5
  7 | 10 |         5
  8 | 10 |         5
 10 | 11 |        10
 10 | 15 |        10
 11 | 12 |        10
 12 | 13 |        10
(11 rows)

We can now simply group by left_edge in order to aggregate all overlapping rows together:

SELECT min(s) AS s, max(e) AS e
  FROM (SELECT s, e,
               max(new_start) OVER (ORDER BY s,e) AS left_edge
          FROM (SELECT s, e,
                       CASE WHEN s < max(le) OVER (ORDER BY s,e)
                            THEN NULL
                            ELSE s
                       END AS new_start
                  FROM (SELECT s, e, lag(e) OVER (ORDER BY s,e) AS le
                         FROM intervals) s1) s2) s3
 GROUP BY left_edge;

producing the desired output:

 s  | e
----+----
  1 |  4
  5 | 10
 10 | 15
(3 rows)

On Normalization

People have funny ideas about database normalization.

What normalization is not:

  • creating an (id,value) table for every single piece of data is not normalization
  • using surrogate keys for everything is not normalization
  • a way to improve performance (though it sometimes does)

Equally, failing to do the above does not mean your data is “denormalized”.

What normalization is:

  • keeping your data consistent by ensuring that the relationships between values exist only in one place
  • applying the rules: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF … usually between 3NF and 4NF is enough

Obviously, though, there are times when one denormalizes for performance reasons, with appropriate care to ensure consistency isn’t lost in the process.

|