EDIT: I had to add this, because it was cracking me up:
I’m reimplementing some expensive database queries, moving them from our middleware into materialized view tables. We make pretty extensive use of Common Table Expressions (CTEs). And we generate many reporting queries that calculate averages and percentages of a total. One way this could be done is with a CROSS JOIN
, which is a cartesian product of two tables, adding the total and then the percentage calculation to our original table that produces counts of events. For information about JOIN
types supported by PostgreSQL, see SELECT…FROM documentation.
Or we could use OVER()
, one of several Window Functions supported by Postgres. From the Postgres documentation, Window Functions are:
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
A few other folks have written about Window Functions: Postgres Guide to Window Functions, Window Functions: Postgres’s best kept secret, SQL Window Functions (examples in Ruby). “All built-in and user-defined aggregate functions — such as count, max, bit_or, or xmlagg” — are available for computation as a window function, quoting Open Logic, who phrased it well.
If you read about OVER(), you’ll see that all the examples specify a PARTITION
clause. But you don’t have to specify a PARTITION
to take advantage of the feature.
We had a query containing two CTEs and a final SELECT
that produced the matview data. Here’s an example of that query:
WITH crashes as (
SELECT
product_name as category
, version_string
, SUM(report_count) as report_count
FROM signature_summary_products
JOIN signatures USING (signature_id)
WHERE signatures.signature = 'libflashplayer.so@0x1f2a14'
AND report_date >= now()::date - '15 day'::interval
AND report_date < now()::date
GROUP BY product_name, version_string
),
totals as (
SELECT
category
, version_string
, report_count
, SUM(report_count) OVER () as total_count
FROM crashes
)
SELECT category
, version_string
, report_count
, round((report_count * 100::numeric)/total_count,3)::TEXT
as percentage
FROM totals
ORDER BY report_count DESC;
The part under consideration is:
, SUM(report_count) OVER () as total_count
So, you see that OVER()
has no PARTITION
defined, meaning that the SUM will be calculated over the entire result.
Here is that same query, implemented using SUM()
and a CROSS JOIN
:
WITH crashes as (
SELECT
product_name as category
, version_string
, SUM(report_count) as report_count
FROM signature_summary_products
JOIN signatures USING (signature_id)
WHERE signatures.signature = 'libflashplayer.so@0x1f2a14'
AND report_date >= now()::date - '15 day'::interval
AND report_date < now()::date
GROUP BY product_name, version_string
),
totals as (
SELECT
SUM(report_count) AS total_count
FROM crashes
)
SELECT category
, version_string
, report_count
, round((report_count * 100::numeric)/total_count,3)::TEXT
as percentage
FROM crashes CROSS JOIN totals
ORDER BY report_count DESC;
What’s the difference to Postgres between that and a SUM()
plus a CROSS JOIN
?
Here’s the EXPLAIN output from this query, pared down to the relevant section:
CTE totals
-> WindowAgg (cost=0.00..0.03 rows=1 width=72) (actual time=0.112..0.114 rows=3 loops=1)
-> CTE Scan on crashes (cost=0.00..0.02 rows=1 width=72) (actual time=0.097..0.100 rows=3 loops=1)
-> CTE Scan on totals (cost=0.00..0.04 rows=1 width=104) (actual time=0.121..0.129 rows=3 loops=1)
The important bit to have a look at is WindowAgg right after CTE totals.
Now compare to the EXPLAIN output from a SUM()
plus a CROSS JOIN
query:
CTE totals
-> Aggregate (cost=0.02..0.03 rows=1 width=8) (actual time=0.015..0.015 rows=1 loops=1)
-> CTE Scan on crashes (cost=0.00..0.02 rows=1 width=8) (actual time=0.001..0.004 rows=3 loops=1)
-> Nested Loop (cost=0.00..0.07 rows=1 width=104) (actual time=0.191..0.205 rows=3 loops=1)
-> CTE Scan on crashes (cost=0.00..0.02 rows=1 width=72) (actual time=0.162..0.164 rows=3 loops=1)
-> CTE Scan on totals (cost=0.00..0.02 rows=1 width=32) (actual time=0.006..0.006 rows=1 loops=3)
You can see here that we now have an Aggregate (for the SUM()
) plus a Nested Loop (for the CROSS JOIN
). This example query only has three results returned, but our more typical queries involve 10k or more rows returned.
Avoiding the Aggregate and Nested Loop will save us lots of memory and processing time on every run of a very expensive query.
Most of the time I use CROSS JOIN in order to do the following:
SELECT categories.category,
timeframes.timeframe,
COALESCE(count(*), 0) as num_in_cat,
FROM ( categories CROSS JOIN timeframes )
LEFT OUTER JOIN data
ON categories.category = data.category
and data.date <@ timeframes.period;
In this case, the purpose of the CROSS JOIN is to explode a set created by multiplying two reference sets together. I don’t see how you could replace that with Windowing queries. You could, however, use the windowing query to get the grand total from the above …
Right, I should have posted the alternative query. 🙂 To be clear – I was just explaining a particular use of a CROSS JOIN, rather than all uses of CROSS JOIN, or OVER() for that matter!
What does the plan for this look like?
SELECT
product_name as category
, version_string
, SUM(report_count) as report_count
, round(SUM(report_count) * 100::numeric/ sum(report_count) over (),3)::TEXT
FROM signature_summary_products
JOIN signatures USING (signature_id)
WHERE signatures.signature = 'libflashplayer.so@0x1f2a14'
AND report_date >= now()::date - '15 day'::interval
AND report_date < now()::date
GROUP BY product_name, version_string, report_count
order by 3 desc
You end up with a Nested Loop inside the WindowAgg:
-> WindowAgg (cost=128.66..128.72 rows=1 width=19)
-> HashAggregate (cost=128.66..128.69 rows=1 width=19)
-> Nested Loop (cost=3.97..128.65 rows=1 width=19)
-> Index Scan using signatures_signature_key on signatures (cost=0.00..10.06 rows=1 width=4)
Index Cond: (signature = 'libflashplayer.so@0x1f2a14'::text)
-> Append (cost=3.97..118.38 rows=21 width=49)
-> Bitmap Heap Scan on signature_summary_products (cost=3.97..37.80 rows=10 width=23)
Sorry, with only partial explain plans I can’t tell if doing it all in one step helped or hurt the situation.
p.s. too bad pgsql doesn’t have ratio_to_report(), as that’s basically what you’re doing.
Pingback: How I write queries using psql: Common Table Expressions