Have you ever wanted to perform aggregations across multiple columns/dimensions in SQL Server without licensing SSAS (SQL Server Analysis Services)? I am not talking about using the Developer edition in production, nor am I talking about installing a pirate copy of SQL Server.
Mission Impossible? Not really, because it’s possible through the so-called Grouping Sets in SQL Server. I want to give you in this blog posting an overview about grouping sets, what kind of queries you can achieve with them, and what are their performance advantages.
Aggregations with Grouping Sets
Imagine you have an orders table, and you want to perform T-SQL query aggregations across multiple groups. In the context of the Sales.SalesOrderHeader table of the AdventureWorks2012 database, these groupings can be something like the following:
- A grouping across “everything”
- GROUP BY SalesPersonID, YEAR(OrderDate)
- GROUP BY CustomerID, YEAR(OrderDate)
- GROUP BY CustomerID, SalesPersonID, YEAR(OrderDate)
When you want to perform these individual groupings with a traditional T-SQL query, you need multiple statements, where you perform a UNION ALL between the individual result sets. Let’s have a look at such a query:
SELECT * FROM ( -- 1st Grouping Set SELECT NULL AS 'CustomerID', NULL AS 'SalesPersonID', NULL AS 'OrderYear', SUM(TotalDue) AS 'TotalDue' FROM Sales.SalesOrderHeader WHERE SalesPersonID IS NOT NULL UNION ALL -- 2nd Grouping Set SELECT NULL AS 'CustomerID', SalesPersonID, YEAR(OrderDate) AS 'OrderYear', SUM(TotalDue) AS 'TotalDue' FROM Sales.SalesOrderHeader WHERE SalesPersonID IS NOT NULL GROUP BY SalesPersonID, YEAR(OrderDate) UNION ALL -- 3rd Grouping Set SELECT CustomerID, NULL AS 'SalesPersonID', YEAR(OrderDate) AS 'OrderYear', SUM(TotalDue) AS 'TotalDue' FROM Sales.SalesOrderHeader WHERE SalesPersonID IS NOT NULL GROUP BY CustomerID, YEAR(OrderDate) UNION ALL -- 4th Grouping Set SELECT CustomerID, SalesPersonID, YEAR(OrderDate) AS 'OrderYear', SUM(TotalDue) AS 'TotalDue' FROM Sales.SalesOrderHeader WHERE SalesPersonID IS NOT NULL GROUP BY CustomerID, SalesPersonID, YEAR(OrderDate) ) AS t ORDER BY CustomerID, SalesPersonID, OrderYear GO
The approach used by this T-SQL statement has multiple disadvantages:
- The T-SQL statement itself is huge, because every individual group is one distinct query.
- The table Sales.SalesOrderHeader has to be accessed 4 times – once for every distinct query.
- When you look at the execution plan, you can see that SQL Server performs an Index Seek (NonClustered) operation 4 times – once for every query.
You can dramatically simplify the T-SQL code that you need, if you use the grouping sets functionality introduced back in SQL Server 2008. The following code shows you the same query, but this time implemented with grouping sets.
SELECT CustomerID, SalesPersonID, YEAR(OrderDate) AS 'OrderYear', SUM(TotalDue) AS 'TotalDue' FROM Sales.SalesOrderHeader WHERE SalesPersonID IS NOT NULL GROUP BY GROUPING SETS ( -- Our 4 different grouping sets (CustomerID, SalesPersonID, YEAR(OrderDate)), (CustomerID, YEAR(OrderDate)), (SalesPersonID, YEAR(OrderDate)), () ) GO
As you can see from the code itself, you just specify the needed grouping sets inside the GROUP BY GROUPING SETS clause – everything else is performed transparently by SQL Server. The empty parentheses specify the so-called Empty Grouping Set, the aggregation across the whole table. When you also look at the output of STATISTICS IO, you can see that the table Sales.SalesOrderHeader was accessed only once! That’s a huge difference from the previous manual implementation that we have performed.
Within the execution plan SQL Server uses a Table Spool operator that stores the retrieved data temporarily in TempDb. The data from the work table created in TempDb is afterwards used in the second branch of the execution plan. Therefore the data isn’t rescanned for every group from the base table, which leads to a better performance of the whole execution plan.
When you look at the execution plan, you can also see that the query plan contains 3 Stream Aggregate operators (highlighted in red, blue, and green). These 3 operators are calculating the individual grouping sets:
- The blue highlighted operator calculates the grouping set for CustomerID, SalesPersonID, and YEAR(OrderDate).
- The red highlighted operator calculates the grouping set for SalesPersonID and YEAR(OrderDate). In addition it also calculates the grouping set across “everything”.
- The green highlighted operator calculates the grouping set for CustomerID and YEAR(OrderDate).
The idea behind the 2 subsequent Stream Aggregate operators is to calculate so-called Super Aggregates – aggregations of aggregations.
Summary
In today’s blog posting I have introduced Grouping Sets to you, a T-SQL enhancement introduced back in SQL Server 2008. As you have seen grouping sets have 2 huge advantages: they simplify your code, and they improve the performance of the query by accessing the base data only once.
I hope that you have now got some good ideas about grouping sets, and it would be really interesting for me, if you can leave a short comment if you are able to use this functionality in your own databases.
Thanks for your time!
-Klaus
5 thoughts on “The power of Grouping Sets in SQL Server”
Good and quick introduction to GROUPING SET
Very nicely explained , Now the grouping set concept is much clear to me
Oldie but a goodie, very nice explanation, thanks!
Another way of retrieving the same results I often see (and until just now, have been using) is the use of a CUBE() or ROLLUP() with various checks against Grouping_ID() or GROUPING() in the HAVING clause, for instance the following (ugly) code I’ve just been trying to fix:
…
Group by ROLLUP(ColA, ColB, ColC)
Having NOT (GROUPING(ColA) = 1
AND GROUPING(ColB) = 1
AND GROUPING(ColC) = 1)
I was simply going to replace the Groupings in the having clause with a single Grouping_ID(ColA, ColB, ColC) != 7, but correct use of Grouping sets is miles better!
Excellent ! Very well explained.
Thanks for great work.
I was able to use this to find user visit statistics by user, month, and total across everything SUPER easy. Thank you for the explanation!
SELECT
UserName,
VisitDate,
COUNT(*) AS Visits
FROM SiteVisits
GROUP BY GROUPING SETS (
(UserName, VisitDate),
(UserName),
(VisitDate),
()
)