In today’s blog posting I want to talk about the INTERSECT set operation in SQL Server. The INTERSECT set operation intersects two sets with each other and returns the rows that appear with identical column values in both sets. The following picture illustrates this concept.
INTERSECT vs. Inner Join
As you will realize, it is nearly the same as an INNER JOIN between 2 tables, but there are some important differences that I want to describe today. Let’s start by creating 2 simple tables, which we will use as the input tables:
-- Create the 1st table CREATE TABLE t1 ( Col1 INT, Col2 INT, Col3 INT ) GO -- Create the 2nd table CREATE TABLE t2 ( Col1 INT, Col2 INT ) GO -- Create a unique Clustered Index on both tables CREATE UNIQUE CLUSTERED INDEX idx_ci ON t1(col1) CREATE UNIQUE CLUSTERED INDEX idx_ci ON t2(col1) GO -- Insert some records into both tables INSERT INTO t1 VALUES (1, 1, 1), (2, 2, 2), (NULL, 3, 3) INSERT INTO t2 VALUES (2, 2), (NULL, 3) GO GO
As you can see from the T-SQL code, I have also created a unique Clustered Index on both tables, and inserted a few test records. Now let’s intersect the tables with each other:
SELECT Col1, Col2 FROM t1 INTERSECT SELECT Col1, Col2 FROM t2 GO
SQL Server returns 2 records: the record with the column values 2 and the record with the NULL value. That’s one of the first big differences to an INNER JOIN operation: records with NULL values are not eliminated if they appear in both tables. When you perform an INNER JOIN operation between both tables on column Col1 the record with the NULL value is not returned:
SELECT t1.col1, t1.col2 FROM t1 INNER JOIN t2 ON t2.col1 = t1.col1 GO
The following picture shows how the result sets differ between INTERSECT and the INNER JOIN approaches.
Now let’s analyze the execution plan of the INTERSECT set operation. Because you have a supporting index on column Col1, the Query Optimizer is able to translate the INTERSECT operation to a traditional logical INNER JOIN operation.
But here the operator Nested Loop (Inner Join) doesn’t really perform an INNER JOIN operation. Let’s see why not. When you look at the inner side of the Nested Loop operator, you can see that you have a residual predicate on the Clustered Index Seek (Clustered) operator.
The residual predicate is evaluated on the column Col2, because that column is not part of the navigation structure of the previously created Clustered Index. As I have said at the beginning, SQL Server has to find matching rows that have same values in ALL columns in BOTH tables. With the Clustered Index Seek (Clustered) operator and the residual predicate SQL Server just checks if we have a matching record with the same column values in the table t1. And the Nested Loop operator itself just returns the column values from one table – in our case from table t1.
So the INNER JOIN is just a Left Semi Join: SQL Server checks if we have matching records in the right table – if yes the matching record from the left table is returned. The residual predicate on the Clustered Index Seek (Clustered) operator can be removed by providing an index that contains all necessary columns in the navigation structure, as follows:
-- Create a supporting Non-Clustered Index CREATE NONCLUSTERED index id_nci ON t1(Col1, Col2) GO
Now when you look again at the execution plan of the INTERSECT operation you can see that SQL Server performs an Index Seek (NonClustered) operation on the previously created index, and that the residual predicate is not needed anymore.
Now let’s check what the generated execution plan looks like, when we drop all supporting index structures:
-- Drop all supporting indexes DROP INDEX id_nci ON t1 DROP INDEX idx_ci ON t1 DROP INDEX idx_ci ON t2 GO
When you perform the INTERSECT between both tables again, now you can see a Nested Loop (Left Semi Join) operation in the execution plan. SQL Server now has to perform the Left Semi Join physically in the execution plan by performing a Table Scan operation on the inner side and making a row-by-row comparison with a residual predicate inside the Nested Loop operator.
This execution plan isn’t really efficient, because the Table Scan operator on the inner side has to be performed over and over again – for every row that is returned from the outer table. Supporting indexes are very important if we want to perform the INTERSECT set operation as efficiently as possible.
Summary
The INTERSECT set operation is a really scary one, because almost nobody knows it well. When you use it, you have to be aware of the differences between it and an INNER JOIN. And as you have seen it is also very important to have an efficient indexing strategy, so that the Query Optimizer can generate an execution plan that performs well.
Thanks for your time,
-Klaus