Packing Intervals

SQL Server doesn’t support a native temporal interval type. A temporal interval has a starting point in time and duration. Therefore, when you want to represent intervals in your database, you need to come up with your own solution. Most use two attributes representing the start and end points in time. But then you also need to implement your own solutions to fundamental operations on intervals like packing, unpacking, and others. Packing intervals involves merging intersecting intervals. So far, set-based solutions to packing intervals were very slow, and often people resorted to use of cursors. This article covers new set-based techniques to handle intervals in a highly efficient manner. The new techniques utilize window functions and are optimized with close to linear scaling.Packing of intervals is a fundamental operation in both the relational model and in SQL. Mostly treatment of intervals in SQL is concerned with temporal ones, where each time interval has start and end points in time; but the packing techniques covered in this article can be used with other types of intervals as well, like spatial ones on a line. Packing of intervals means grouping each set of contiguous intervals with which no other interval overlaps or is adjacent to (abutting), and returning the minimum start and maximum end for each group. Often, packing problems in SQL also involve a partitioning element (e.g. user, application, etc.), where the packing is done for each partition independently.
This article introduces new, highly efficient (good performance, close to linear scaling), setbased solutions using standard SQL constructs. Some of the techniques do make use of T-SQL specific constructs, but those constructs have standard parallels so adaptation to other platforms should be fairly straightforward. For example, one of the solutions makes use of the T-SQLspecific construct APPLY. Standard SQL has a similar construct called LATERAL correlation.

I provided the packing intervals problem as a challenge in my blog in SQL Server Magazine, and I do recommend looking at the different solutions people suggested. Here I’m going to focus and expand on my solutions. I’d like to thank a few people first. One of the techniques I used to address the interval packing problem was inspired by a technique used by Ben Flanaghan to address a different problem called Maximum Concurrent Sessions. I’ll tell you a little secret; when I saw Ben’s solution to that problem,
I nearly cried; and having seen lots of very creative SQL techniques in my life, that’s saying something. Thanks to Alejandro Mesa for helping to verify the validity of the solutions. And also thanks to Herbert Albert, Gianluca Hotz, and Dejan Sarka for editing the article and providing input.

Sample Data and Desired Result

The scenario I’ll use to demonstrate my solutions involves user sessions against some application or service. Use the code in Listing 1, page 41, (on Microsoft SQL Server) to create a table called Sessions, two indexes to support the solutions, and sample data to test the solution’s validity. Use the code in Listing 2, page 41, to create and populate the Sessions table on Oracle to test the portability of the solutions. The desired result for the small set of sample data is shown in Table 1, page 43.

CREATE TABLE dbo.Sessions
(
id INT NOT NULL IDENTITY,
username VARCHAR(14) NOT NULL,
starttime DATETIME2(3) NOT NULL,
endtime DATETIME2(3) NOT NULL,
CONSTRAINT PK_Sessions PRIMARY KEY(id),
CONSTRAINT CHK_endtime_gteq_starttime
CHECK (endtime >= starttime)
);
— indexes to support solutions
CREATE UNIQUE INDEX idx_user_start_id ON dbo.Sessions(username, starttime, id);
CREATE UNIQUE INDEX idx_user_end_id ON dbo.Sessions(username, endtime, id);
— sample data (small)
INSERT INTO dbo.Sessions VALUES
(‘User1’, ‘20111201 08:00:00.000’, ‘20111201 08:30:00.000’),
(‘User1’, ‘20111201 08:30:00.000’, ‘20111201 09:00:00.000’),
(‘User1’, ‘20111201 09:00:00.000’, ‘20111201 09:30:00.000’),
(‘User1’, ‘20111201 10:00:00.000’, ‘20111201 11:00:00.000’),
(‘User1’, ‘20111201 10:30:00.000’, ‘20111201 12:00:00.000’),
(‘User1’, ‘20111201 11:30:00.000’, ‘20111201 12:30:00.000’),
(‘User2’, ‘20111201 08:00:00.000’, ‘20111201 10:30:00.000’),
(‘User2’, ‘20111201 08:30:00.000’, ‘20111201 10:00:00.000’),
(‘User2’, ‘20111201 09:00:00.000’, ‘20111201 09:30:00.000’),
(‘User2’, ‘20111201 11:00:00.000’, ‘20111201 11:30:00.000’),
(‘User2’, ‘20111201 11:32:00.000’, ‘20111201 12:00:00.000’),
(‘User2’, ‘20111201 12:04:00.000’, ‘20111201 12:30:00.000’),
(‘User3’, ‘20111201 08:00:00.000’, ‘20111201 09:00:00.000’),
(‘User3’, ‘20111201 08:00:00.000’, ‘20111201 08:30:00.000’),
(‘User3’, ‘20111201 08:30:00.000’, ‘20111201 09:00:00.000’),
(‘User3’, ‘20111201 09:30:00.000’, ‘20111201 09:30:00.000’);

CREATE TABLE dbo.Sessions
(
id INT NOT NULL,
username VARCHAR2(14) NOT NULL,
starttime TIMESTAMP NOT NULL,
endtime TIMESTAMP NOT NULL,
CONSTRAINT PK_Sessions PRIMARY KEY(id),
CONSTRAINT CHK_endtime_gteq_starttime
CHECK (endtime >= starttime)
);
INDEX idx_user_end_id ON dbo.Sessions(username, endtime, id);
— sample data
INSERT INTO dbo.Sessions VALUES(1, ‘User1’,
TO_DATE(‘20111201 08:00:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 08:30:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(2, ‘User1’,
TO_DATE(‘20111201 08:30:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 09:00:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(3, ‘User1’,
TO_DATE(‘20111201 09:00:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 09:30:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(4, ‘User1’,
TO_DATE(‘20111201 10:00:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 11:00:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(5, ‘User1’,
TO_DATE(‘20111201 10:30:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 12:00:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(6, ‘User1’,
TO_DATE(‘20111201 11:30:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 12:30:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(7, ‘User2’,
TO_DATE(‘20111201 08:00:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 10:30:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(8, ‘User2’,
TO_DATE(‘20111201 08:30:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 10:00:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(9, ‘User2’,
TO_DATE(‘20111201 09:00:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 09:30:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(10, ‘User2’,
TO_DATE(‘20111201 11:00:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 11:30:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(11, ‘User2’,
TO_DATE(‘20111201 11:32:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 12:00:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(12, ‘User2’,
TO_DATE(‘20111201 12:04:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 12:30:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(13, ‘User3’,
TO_DATE(‘20111201 08:00:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 09:00:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(14, ‘User3’,
TO_DATE(‘20111201 08:00:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 08:30:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(15, ‘User3’,
TO_DATE(‘20111201 08:30:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 09:00:00’, ‘YYYYMMDD HH24:MI:SS’));
INSERT INTO dbo.Sessions VALUES(16, ‘User3’,
TO_DATE(‘20111201 09:30:00’, ‘YYYYMMDD HH24:MI:SS’),
TO_DATE(‘20111201 09:30:00’, ‘YYYYMMDD HH24:MI:SS’));

Table 1: Desired Result

Figure 1 – Unpacked and Packed Intervals

Figure 1 has a graphical depiction of both the original intervals from the Sessions table (green bars), as well as the packed intervals (red segments). Concerning the grain, or precision, used for the start and end points of intervals; I’ll first present solutions that assume that if the end of one interval is less than the start of another, the two do not overlap or abuts (are adjacent). I will later also provide a solution in case a gap of up to a certain length is to be ignored.

You can use the code in Listing 3, page 44, to populate the Sessions table in Microsoft SQL Server with a large set of sample data to test the performance of the solutions. The code in Listing 3 populates the Sessions table with 5,000,000 rows. I filled it with data for 2,000 users, each with 2,500 sessions, during a period of a week, with each session lasting up to one hour. But the code allows you to change any element that you like to test the scaling of the solutions. The performance measures I’ll mention in the
article are for the sample data with the current values in Listing 3. I tested the queries on a Dell Alienware M15x laptop, with a Core i7 processor (quad with hyper threading, namely 8 logical CPUs), 4 GB RAM 1333 MHz, against hot cache.

Note that if you’re not getting parallel plans where I do, it’s probably because the machine you’re using has fewer logical CPUs than in mine. To get similar plans to the ones I get, just for test purposes, you can use the SQL Server service startup option -P8, which will cause SQL Server to use 8 schedulers like in an environment with 8 logical CPUs. The -P startup parameter is not an officially documented one, so use it just for test purposes to mimic a machine with a desired number of CPUs, not for production purposes.

Solution 1, Using Row Numbers

The first solution that I’ll cover is a standard solution that uses the ROW_NUMBER function. It runs without any alteration on both SQL Server and Oracle. The complete solution is shown in Listing 4, page 45. It runs for 18 seconds on my laptop against the sample data provided in Listing 3.

— helper function GetNums
IF OBJECT_ID(‘dbo.GetNums’, ‘IF’) IS NOT NULL
DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE
AS
RETURN
WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT TOP (@n) n FROM Nums ORDER BY n;
GO
— sample data, 5,000,000 rows
DECLARE
@num_users AS INT = 2000,
@intervals_per_user AS INT = 2500,
@start_period AS DATETIME2(3) = ‘20110101’,
@end_period AS DATETIME2(3) = ‘20110107’,
@max_duration_in_ms AS INT = 3600000; — 60 minutes
TRUNCATE TABLE dbo.Sessions;
WITH C AS
(
SELECT ‘User’ + RIGHT(‘000000000’ + CAST(U.n AS VARCHAR(10)), 10) AS username,
DATEADD(ms, ABS(CHECKSUM(NEWID())) % 86400000,
DATEADD(day, ABS(CHECKSUM(NEWID())) % DATEDIFF(day, @start_period, @end_period), @start_period)) AS starttime
FROM dbo.GetNums(@num_users) AS U
CROSS JOIN dbo.GetNums(@intervals_per_user) AS I
)
INSERT INTO dbo.Sessions WITH (TABLOCK) (username, starttime, endtime)
SELECT username, starttime,
DATEADD(ms, ABS(CHECKSUM(NEWID())) % (@max_duration_in_ms + 1), starttime) AS endtime
FROM C;

WITH C1 AS
— let e = end ordinals, let s = start ordinals
(
SELECT id, username, starttime AS ts, +1 AS type, NULL AS e,
ROW_NUMBER() OVER(PARTITION BY username ORDER BY starttime, id) AS s
FROM dbo.Sessions
UNION ALL
SELECT id, username, endtime AS ts, -1 AS type,
ROW_NUMBER() OVER(PARTITION BY username ORDER BY endtime, id) AS e,
NULL AS s
FROM dbo.Sessions
),
C2 AS
— let se = start or end ordinal, namely, how many events (start or end) happened so far
(
SELECT C1.*,
ROW_NUMBER() OVER(PARTITION BY username ORDER BY ts, type DESC, id) AS se
FROM C1
),
C3 AS
— For start events, the expression s - (se - s) - 1 represents how many sessions were active
— just before the current (hence - 1)
—
— For end events, the expression (se - e) - e represents how many sessions are active
— right after this one
—
— The above two expressions are 0 exactly when a group of packed intervals
— either starts or ends, respectively
—
— After filtering only events when a group of packed intervals either starts or ends,
— group each pair of adjacent start/end events
(
SELECT username, ts,
FLOOR((ROW_NUMBER() OVER(PARTITION BY username ORDER BY ts) - 1) / 2 + 1)
AS grpnum
FROM C2
WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0
)
SELECT username, MIN(ts) AS starttime, max(ts) AS endtime
FROM C3
GROUP BY username, grpnum;

The code in the CTE called C1 unifies start events with end events into one chronological sequence of events (start or end). Start events are marked with a +1 event type since they increase the count of active sessions, and end events are marked with a -1 event type since they decrease the count of active sessions. Figure 2, page 46, shows the chronological sequence of unified events sorted by username, ts, type DESC, id, with green bars representing how many sessions are active before and after each event.

Observe that a packed interval always starts when the number of active sessions prior to a start event was zero, and ends when the number of active sessions after an end event is zero. Therefore, in respect to each start event you need to know how many sessions were active prior to it, and in respect to each end event you need to know how many sessions are active after it. This information is calculated in steps. Observe that the code in the CTE C1 calculates start ordinals for start events (attribute called s), with NULLs used as place holders in that attribute for end events, and end ordinals for end events (attribute called e), with NULLs used as place holders in that attribute for start events.

The code in the CTE C2 then simply adds an ordinal for start or end events (attribute called se), partitioned by username, sorted by ts, type DESC, id. Table 2 has the output of the code in C2, sorted by username, ts, type DESC, id. For readability, I marked the start event types as +1 instead of just 1 and replaced NULLs with blanks, Table 2, page 47.

Figure 2 – Start and End Events Ordered Chronologically

Then the code in the CTE C3 is where most of the magic is done. For each start event you know how many sessions started so far (s), and you know how many sessions either started or ended so far (se). Therefore, you can easily calculate how many sessions ended so far (se – s). Now that you know how many sessions started and how many sessions ended, you can calculate how many sessions are active after the start event: s – (se – s). Think of it just like calculating how many people are in a room if x people enter the room and y people leave the room. Finally, to tell how many sessions were active prior to the start event, simply subtract 1 from the calculation: s – (se – s) – 1.
In a very similar way you can calculate the number of active sessions after each end event. Having both the number of sessions that ended thus far (e) and the number of sessions that either started or ended (se), you can calculate how many sessions started as: se – e. Then the number of active sessions is (se – e) – e. Now, remember that you want to filter only start events where the number of active sessions prior to the event was zero, and end events where the number of active sessions after the event was zero. You can generalize the two filters into one:

WHERE COALESCE(s – (se – s) – 1, (se – e) – e) = 0

Table 2: Output of Code in Solution 1, C2

Table 3: Output of Code in Solution 1, C3

What you have left post filtering are pairs of adjacent start-end events, each representing the start and end of a packed interval. So you need to assign a group identifier to each pair in order to be able to later pivot each pair into one row. This can be achieved by assigning row numbers (call it n), and applying the calculation (n – 1) / 2 + 1,
where / represents integer division. For n values 1, 2, 3, 4, … you get a result of 1, 1, 2, 2, …

In SQL Server the arithmetic operator / represents integer division when the operands are integers, but in Oracle you get a decimal division, so I added a FLOOR function so that the code would run correctly on both platforms. So the code in the CTE C3 generates the output shown in Table 3.
What’s left to the outer query to do is group the rows from C3 by username and grpnum, and return the minimum ts as the packed interval’s start time, and the maximum ts as the end time.

The plan generated by SQL Server’s optimizer for this query is highly efficient, given that you create the aforementioned indexes: idx_user_start_id and idx_user_end_id. The plan is shown in Figure 3.

Figure 3: Plan for Solution in Listing 4

What’s amazing about this plan is that it applies two ordered scans of the indexes created to support this solution (idx_user_start_id and idx_user_end_id), and relies on the ordered scans to (take a deep breath now): calculate the row numbers for start ordinals (s), as well as row numbers for end ordinals (e), as well as performing a merge join to unify the results, as well as calculating the start or end ordinals (se) on the unified sets, as well as calculating the row numbers that are used to produce grpnum post filtering; and all this without requiring even one sort operator! That’s truly remarkable to see an optimizer that so beautifully understands the concept of order preservation.
That’s a big Kudos to the SQL Server optimization team for this design. Finally a hash aggregate is used to group the data by grpnum (only the remaining rows post filtering). Since most of the operations used in this plan have linear complexity, this solution should scale close to linearly. So in total, this plan performs only two scans
of the data (one of each index), in index order. As mentioned, this solution runs on my laptop for 18 seconds. The one thing that this solution doesn’t exploit well is parallelism. That’s where the second solution excels…

IF OBJECT_ID(‘dbo.Users’) IS NOT NULL DROP TABLE dbo.Users;
CREATE TABLE dbo.Users
(
username VARCHAR(14) NOT NULL,
CONSTRAINT PK_Users PRIMARY KEY(username)
);
DECLARE @num_users AS INT = 2000;
INSERT INTO dbo.Users(username)
SELECT ‘User’ + RIGHT(‘000000000’ + CAST(U.n AS VARCHAR(10)), 10) AS username
FROM dbo.GetNums(@num_users) AS U;

Solution 2, Using Row Numbers and APPLY to Exploit Parallelism

To exploit parallelism well, what you want is to encapsulate the logic from the solution in Listing 4 in a table expression that operates on a single customer, and apply the table expression to each user. I’m assuming here that you have a table holding the distinct users, which is a fair assumption to make. The code in Listing 5, page 48, creates
a table called Users and fills it with the same username values as the ones that appear in the Sessions table. It is then convenient to encapsulate the logic from the solution in Listing 4 for a single user in an inline table function, as shown in Listing 6. And then finally, use the CROSS APPLY operator (similar to LATERAL correlation in the standard) to apply the function to each user from the Users table as shown in Listing 7.

SQL Server generates a very efficient parallel plan for this query. The plan for this solution is shown in Figure 4.

IF OBJECT_ID(‘dbo.UserIntervals’, ‘IF’) IS NOT NULL DROP FUNCTION dbo.UserIntervals;
GO
CREATE FUNCTION dbo.UserIntervals(@user AS VARCHAR(14)) RETURNS TABLE
AS
RETURN
WITH C1 AS
(
SELECT id, starttime AS ts, +1 AS type, NULL AS e,
ROW_NUMBER() OVER(ORDER BY starttime, id) AS s
FROM dbo.Sessions
WHERE username = @user
UNION ALL
SELECT id, endtime AS ts, -1 AS type,
ROW_NUMBER() OVER(ORDER BY endtime, id) AS e,
NULL AS s
FROM dbo.Sessions
WHERE username = @user
),
C2 AS
(
SELECT C1.*, ROW_NUMBER() OVER(ORDER BY ts, type DESC, id) AS se
FROM C1
),
C3 AS
(
SELECT ts,
FLOOR((ROW_NUMBER() OVER(ORDER BY ts) - 1) / 2 + 1) AS grpnum
FROM C2
WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0
)
SELECT MIN(ts) AS starttime, max(ts) AS endtime
FROM C3
GROUP BY grpnum;
GO
SELECT U.username, A.starttime, A.endtime
FROM dbo.Users AS U
CROSS APPLY dbo.UserIntervals(U.username) AS A;

Figure 4: Plan for Query in Listing 7

As you can see, the plan uses a parallel scan of the clustered index on the Users table, and then performs the work for each user in the inner branch of the Nested Loops join. The work done in this inner branch should look familiar; it’s very similar to the work done in the plan shown in Figure 3, only this time for the data associated with one user. This inner branch is, of course, executed in parallel by multiple threads. This solution runs for only 2 seconds on my laptop.

Ignoring Gaps of Up To a Certain Length

In case you want the solution to ignore gaps of up to a certain length (e.g. 10 seconds), you can use the following approach:

Add a computed column for endtime + ; call that column endtime2
Create an index on username, endtime2
Use one of the aforementioned solutions with a couple of revisions:
- Operate on endtime2 instead of endtime
- In the outer query subtract from the result endtime

So, suppose you want to ignore gaps of up to 10 seconds. Use the following code to add the computed column endtime2 and the index:

ALTER TABLE dbo.Sessions ADD endtime2 AS DATEADD(second, 10, endtime)
CREATE INDEX idx_user_end2_id ON dbo.Sessions(username, endtime2, id);

The code in Listing 8, page 51, then shows the revised solution from Listing 4 with the two aforementioned revisions. The performance is similar to that of the solution
in Listing 4. And naturally, also here you can use the APPLY approach for efficient use of parallelism.

WITH C1 AS
(
SELECT id, username, starttime AS ts, +1 AS type, NULL AS e,
ROW_NUMBER() OVER(PARTITION BY username ORDER BY starttime, id) AS s
FROM dbo.Sessions
UNION ALL
SELECT id, username, endtime2 AS ts, -1 AS type,
ROW_NUMBER() OVER(PARTITION BY username ORDER BY endtime2, id) AS e,
NULL AS s
FROM dbo.Sessions
),
C2 AS
(
SELECT C1.*, ROW_NUMBER() OVER(PARTITION BY username ORDER BY ts, type DESC, id) AS se
FROM C1
),
C3 AS
(
SELECT username, ts,
FLOOR((ROW_NUMBER() OVER(PARTITION BY username ORDER BY ts) - 1) / 2 + 1)
AS grpnum
FROM C2
WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0
)
SELECT username, MIN(ts) AS starttime, DATEADD(second, -10, MAX(ts)) AS endtime
FROM C3
GROUP BY username, grpnum;

Solution 3, Using a Window Aggregate

Solution 3 is shown in Listing 9, page 52. It relies on a window aggregate function with elements that are standard but not supported by SQL Server as of version 2008 R2. The solution is supported by Oracle, and hopefully the missing functionality will be added to SQL Server in a future version.

WITH C1 AS
(
SELECT username, starttime AS ts, +1 AS type, 1 AS sub
FROM dbo.Sessions
UNION ALL
SELECT username, endtime AS ts, -1 AS type, 0 AS sub
FROM dbo.Sessions
),
C2 AS
(
SELECT C1.*,
SUM(type) OVER(PARTITION BY username ORDER BY ts, type DESC
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) - sub AS cnt
FROM C1
),
C3 AS
(
SELECT username, ts,
FLOOR((ROW_NUMBER() OVER(PARTITION BY username ORDER BY ts) - 1) / 2 + 1)
AS grpnum
FROM C2
WHERE cnt = 0
)
SELECT username, MIN(ts) AS starttime, max(ts) AS endtime
FROM C3
GROUP BY username, grpnum;

The solution in Listing 9 uses very similar principals to those used by Solution 1, only instead of using row numbers to calculate the number of active sessions at any given point, it uses a window SUM aggregate. A running sum of the type (recall +1 represents a start event and -1 an end event), partitioned by username, in chronological order, is the number of active sessions at any given point. Now, remember that for start events you need the number of active sessions prior to the event, and for end events you need the number post the event. Therefore you need to subtract 1 from the count with start events, and subtract nothing with end events. So the solution generates an attribute called sub with 1 for start events and 0 for end events, and then subtracts that value from the running total, using the following expression:

SUM(type) OVER(PARTITION BY username ORDER BY ts, type DESC ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) - sub AS cnt

The rest is very similar to the logic in Solution 1.

Conclusion

In this article I presented new solutions to the fundamental interval packing problem. The first solution was based mainly on use of row numbers to calculate ordinals for interval start and/or end events and based on those, the count of active sessions. The second solution implemented the same logic but for a single user, and applied it to each user using the APPLY operator, and this way utilized parallelism better. I also showed how to apply the solutions when a gap of up to a certain length is to be ignored. Finally I showed a third, more concise, solution that is based on a standard window aggregate function with elements that at the moment are not supported by Microsoft SQL Server (as of version 2008 R2), but hopefully will be added in the future.

Thanks for reading!

Stay tuned for more news on our blog and subscribe to our newsletter if you want to receive our new posts in your mail, get course discounts…

Packing Intervals

Sample Data and Desired Result

Solution 1, Using Row Numbers

Solution 2, Using Row Numbers and APPLY to Exploit Parallelism

Ignoring Gaps of Up To a Certain Length

Solution 3, Using a Window Aggregate

Conclusion

Blog and Corporate News

CONVOCATORIA DE LA JUNTA GENERAL EXTRAORDINARIA DE LA MERCANTIL “LUCIENT DATA, S.A.” para el 28 de febrero de 2023 a las 19:00 CET

A Static Relational Interval Tree

Acuerdos corporativos de Lucient Data SA

Consulting

Solutions & Services

Resources

Company

Packing Intervals

Sample Data and Desired Result

Solution 1, Using Row Numbers

Solution 2, Using Row Numbers and APPLY to Exploit Parallelism

Ignoring Gaps of Up To a Certain Length

Solution 3, Using a Window Aggregate

Conclusion

Blog and Corporate News

CONVOCATORIA DE LA JUNTA GENERAL EXTRAORDINARIA DE LA MERCANTIL “LUCIENT DATA, S.A.” para el 28 de febrero de 2023 a las 19:00 CET

A Static Relational Interval Tree

Acuerdos corporativos de Lucient Data SA

Join our mailing list for the latest insights

Consulting

Solutions & Services

Resources

Company