Interval Queries in SQL Server Part 3

This is the third part of the solutions for interval queries in SQL Server. For an introduction, please refer to the blog post Interval Queries in SQL Server Part 1. You can find the second part of the solutions in the blog post Interval Queries in SQL Server Part 2. Note that you also need to read an excellent article by Itzik Ben-Gan wrote on interval queries in SQL Server (http://sqlmag.com/t-sql/sql-server-interval-queries) by using the Relational Interval Tree model. I am using the tables and data Itzik has prepared. In order to test the solutions, you can download the code from Itzik’s article by using the link in this paragraph.

Let me start with an introduction to the unpacked form of intervals. In order to make this introduction complete, I am also briefly introducing the packed form of intervals, although I do not need this form for querying for overlapping intervals. For more in-depth knowledge about intervals, interval operators, and interval forms, please refer to the Inside Microsoft® SQL Server® 2008: T-SQL Programming book by Itzik Ben-Gan , Dejan Sarka, Roger Wolter, Greg Low , Ed Katibah, and Isaac Kunen, Microsoft Press, 2009 (http://www.amazon.com/gp/product/0735626022/ref=s9_simh_gw_p14_d0_i1?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=1AEWTQ7GTY1XS1R68R2M&pf_rd_t=101&pf_rd_p=1389517282&pf_rd_i=507846). I explain a lot of details in chapter 12, Temporal Support in the Relational Model.

Any set of points in time can be represented by a collection of intervals, which I’ll call an interval set. Look at the following three interval sets:

{(2:5), (3:7), (10:12)}
{(2:2), (3:3), (4:4), (5:5), (6:6), (7:7), (10:10), (11:11), (12:12)}
{(2:7), (10:12)}

Each set represents the same set of time points. The first set contains overlapping intervals. The second set contains only intervals exactly one time point long, or unit intervals. The third set represents the time points using the smallest possible number of intervals.

An interval set is in expanded form (or is unpacked) if it contains unit intervals only. An interval set is in collapsed form (or is packed) if it contains no intervals that overlap or meet. Any set of points in time can be represented uniquely in either form, and you can require one or the other form without losing anything.

I can transform a relation that has an attribute—such as during—of interval data type into another relation similarly to the way I rearranged sets of intervals. Without further delay, I’ll define the unpacked form of a relation: a relation with full temporal support is unpacked if every value of interval type is a unit interval.

It would be very nice if T-SQL supported the UNPACK operator. However, there is no such operator in T-SQL yet. So how can you get an unpacked form of a relation? Again, the numbers auxiliary table or tabular function get handy here. I can join the original table with intervals with the auxiliary numbers table using a non-equijoin, using the BETWEEN operator to find all time points between the beginning and end of an interval, or simply use the APPLY operator to apply the auxiliary numbers tabular function to each row of the intervals table. From the time points, I can create unit intervals.

My first solution uses an indexed view. The APPLY operator is still not supported in an indexed view; therefore, to get the unpacked form, I have to rely on the JOIN operator. This means that instead of the tabular function, I have to create and populate an auxiliary table of numbers, which the following code does.

CREATE
TABLE
dbo.Nums

(n
INT NOT
NULL PRIMARY
KEY);
GO

DECLARE @max
AS
INT,
@rc AS
INT,
@d AS
DATE;
SET
@max
= 10000000;
SET
@rc
= 1;

INSERT
INTO
dbo.Nums
VALUES(1);
WHILE
@rc * 2 <= @max
BEGIN
INSERT
INTO
dbo.Nums

SELECT
n
+
@rc
FROM
dbo.Nums;

SET
@rc = @rc
* 2;
END

INSERT
INTO
dbo.Nums

SELECT n
+
@rc
FROM
dbo.Nums

WHERE
n
+
@rc
<=
@max;
GO

Here is the code that creates a new table with intervals, a view in the unpacked form on this table, and then indexes the view to persist the unpacked form.

CREATE
TABLE
dbo.IntervalsU
(

id
INT
NOT
NULL,
lower
INT
NOT
NULL,
upper
INT
NOT
NULL,
CONSTRAINT
CHK_IntervalsU_upper_gteq_lower
CHECK(upper >=
lower)
);
GO

ALTER
TABLE
dbo.IntervalsU
ADD
CONSTRAINT
PK_IntervalsU
PRIMARY
KEY(id);
GO

— Create view Intervals_Unpacked
IF
OBJECT_ID(‘dbo.Intervals_Unpacked’, ‘V’)
IS
NOT
NULL

DROP
VIEW
dbo.Intervals_Unpacked;
GO
CREATE
VIEW dbo.Intervals_Unpacked
WITH
SCHEMABINDING
AS
SELECT
i.id, n.n
FROM
dbo.IntervalsU
AS
i

INNER
JOIN dbo.Nums
AS
n

ON n.n BETWEEN i.lower AND i.upper;
GO
— Index the view
CREATE
UNIQUE
CLUSTERED
INDEX
PK_Intervals_Unpacked

ON
dbo.Intervals_Unpacked(n, id);
GO

[box type=”info”] Note that in the composite index on the view, I put the number n in the first place. This is what I am going to seek for in my queries. Let’s insert the data in the new table, and SQL Server will insert also the data in the view.[/box]

— SLOW! (17+ min!)
INSERT
INTO
dbo.IntervalsU
WITH(TABLOCK)
(id,
lower, upper)

SELECT id,
lower, upper

FROM
dbo.Stage;
GO

You can see from the comment that the indexed view slowed down the insert substantially. Still, the performance is acceptable. However, when I tried to insert the data in the table first, and then create the index on the view, the performance was not really acceptable. At least not for me – after two hours I lost my patience and cancelled the index creation. This is not a negligible issue; imagine what would happen if you would need to re-index the view in production.

Time for querying the table. I am showing here the query that selects the same rows from the middle of the data as the query in my previous articles and as the queries in Itzik’s article.

— query
SET
STATISTICS
IO
ON;
SET
STATISTICS
TIME
ON;
GO

— middle of data
DECLARE
@l
AS
INT
= 5000000,
@u
AS
INT
= 5000020;
DECLARE
@t
AS
TABLE(id
INT);
INSERT
INTO
@t(id)
SELECT id
FROM
dbo.Intervals_Unpacked
WHERE
n BETWEEN @l AND @u;
SELECT
DISTINCT id
FROM
@t;
— logical reads: 6 (5 + 1 from the table variable), CPU time: 0 ms
GO

I use a table variable to temporarily store the id column of the intervals that do include at least one point of the given interval, the interval I am checking for overlapping. The logic is simple – if an interval from the table includes at least one point that the given interval includes, than the two intervals overlap. I decided to store these id values in a table variable and then perform the DISTINCT operator on them in order to prevent SQL Server to expand the view to the base table and make the query less efficient. I could prevent this also with a query hint.

As you can see from the comment, the query is very efficient. You get the same efficiency also if you query the beginning or the end of the data. In addition, the query performance does not depend on the length distribution of the intervals. From the querying point of the view, this is practically an ideal solution. The queries are fast and simple, much simpler than the queries on the RI tree model that Itzik explained in his article.

Of course, you already know there is no free lunch. The drawback of this solution is the same as its advantage – the indexed view. Maintenance of an indexed view is a nightmare. In addition, there are many, many more rows in the unpacked form than in the original form. With this data, the original form has 10,000,000 rows, with maximal length of an interval 20, and average length around 10. I got more than 100,000,000 rows in the unpacked form. With longer intervals, there could potentially be tens of times more rows in the unpacked form.

Many of the problems could be mitigated if Microsoft would spend some time and improve the indexed views. However, like many times with Microsoft, it seems that they are simply forgetting on some features, and leave them half-baked. Then they are surprised when we don’t use these features in production. And instead of finalizing these features, they are simply adding new features in the next version, again half-baked. Like children that get bored with a toy very quickly and want a new toy much too soon.

Since SQL Server can’t help me here, I decided to create a table in the unpacked form. Re-indexing a table is not a problem, and therefore the maintenance is not such a nightmare. In addition, I can use data compression to save the space. However, SQL Server does not automatically update my table. In order to get it updated automatically, you need to create DML triggers on the original table. Below is the code that created and populates a table in the unpacked form.

CREATE
TABLE
dbo.Intervals_Unpacked_Table
(

id
INT
NOT
NULL,
n
INT
NOT
NULL
)
WITH (DATA_COMPRESSION = ROW);
GO

ALTER
TABLE
dbo.Intervals_Unpacked_Table

ADD
CONSTRAINT PK_Intervals_Unpacked_Table
PRIMARY
KEY(n, id);
GO

— 4 min with row compression
INSERT
INTO
dbo.Intervals_Unpacked_Table
WITH(TABLOCK)
(id, n)
SELECT id, n
FROM
dbo.Intervals

CROSS
APPLY dbo.GetNums(lower, upper);
GO

The population of this table was reasonably fast. Of course, there are still more than 100,000,000 rows in this table. What is the space consumption?

EXEC
sys.sp_spaceused
N’dbo.Intervals’,
TRUE;
EXEC
sys.sp_spaceused
N’dbo.Intervals_Unpacked’,
TRUE;
EXEC
sys.sp_spaceused
N’dbo.Intervals_Unpacked_Table’,
TRUE;
GO
— Original table reserverd:                 566,424 KB
— Indexed view reserved:                1,857,808 KB
— Table with row compressions reserved:        1,234,000 KB

From the space consumption point of view, this is not problematic. The unpacked table occupies a bit more than twice as much space as the original table. However, remember that the original table does not have any additional attribute besides the interval boundaries and the key. In production, a real table would have many more columns, and thus much longer rows, and the space consumption would be probably even bigger that the space consumption of the unpacked table. I also tried to use the page compression; however, I did not get much better compression rate with this data set and decided to stick with the row compression. So let’s query the unpacked table!

— query
SET
STATISTICS
IO
ON;
SET
STATISTICS
TIME
ON;
GO

— middle of data
DECLARE
@l
AS
INT
= 5000000,
@u
AS
INT
= 5000020;
SELECT
DISTINCT
id
FROM
dbo.Intervals_Unpacked_Table
WHERE
n BETWEEN @l
AND
@u;
GO
— logical reads: 7, CPU time: 0 ms

The query is very efficient again, and the performance also does not depend on the interval length distribution. However, with longer intervals, you could get many more rows, and still have issues with the maintenance. Still, I like this solution quite a lot, because the queries are simple, efficient, and have stable performance regarding the data distribution.

Interval Queries in SQL Server Part 3

Blog and Corporate News

CONVOCATORIA DE LA JUNTA GENERAL EXTRAORDINARIA DE LA MERCANTIL “LUCIENT DATA, S.A.” para el 28 de febrero de 2023 a las 19:00 CET

A Static Relational Interval Tree

Acuerdos corporativos de Lucient Data SA

Consulting

Solutions & Services

Resources

Company

Interval Queries in SQL Server Part 3

Blog and Corporate News

CONVOCATORIA DE LA JUNTA GENERAL EXTRAORDINARIA DE LA MERCANTIL “LUCIENT DATA, S.A.” para el 28 de febrero de 2023 a las 19:00 CET

A Static Relational Interval Tree

Acuerdos corporativos de Lucient Data SA

Join our mailing list for the latest insights

Consulting

Solutions & Services

Resources

Company