I was recently looking at some blog posts and noticed a new one from the SQL Server CAT team titled Top 10 Best Practices for Building a Large Scale Relational Data Warehouse. It is a short piece with several interesting items, and one of the ten best practices is devoted to deleting data from the data warehouse.

If you read the piece, the deleting-data section lists several different ways of removing large amounts of data from a table in a data warehouse, and one of them caught my attention. The suggestion is to create a second table, insert all the data you do not want to delete into it, drop the original table, and rename the new one. In effect you "delete" the old data simply by not including it in the INSERT statement. My first thought was that this would be a hassle, since you would have to redo all the foreign key relationships, indexes, constraints, etc. on the new table. But to tell you the truth, I had never thought about doing a large delete this way, so I decided to play with the statement and see whether it is any better than the "normal" ways of deleting large amounts of data.
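In outline, the suggested approach looks something like this (just a rough sketch with made-up table and column names; the real, measured versions are Tests 4 and 5 below):

--Keep only the rows you want, in a brand-new table
SELECT col1, col2
INTO sales_keep
FROM sales
WHERE sale_date >= '20070101'   --rows you want to keep
GO

--Swap the tables
DROP TABLE sales
GO
EXEC sp_rename 'sales_keep', 'sales'
GO

--Then recreate the indexes, constraints, foreign keys, triggers, permissions, etc. on the new table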

I thought about a few basic tests I could run to establish baselines to compare against the "INSERT" delete method: deleting data with a single DELETE statement, deleting with a batched DELETE statement, partition switching, creating a new table with SELECT INTO, and creating a new table and populating it with INSERT ... SELECT. These are basic, simple tests with no indexes on the table and no relationships, triggers, etc. attached to it. You will have to expand these tests with more realistic tables to reflect a real-world environment.

Note: I am not considering TRUNCATE TABLE, as I only want to delete some of the data, not all of it.

Let's create a master table with 10,000,000 rows of data that we can use for each of our tests.

SET NOCOUNT ON
GO

/*
Prep for the tests by creating a master table with 10,000,000 rows of data
*/

--Create master table for demo
IF OBJECT_ID ('drop_test_table') IS NOT NULL
DROP TABLE drop_test_table
GO

CREATE TABLE drop_test_table
(
col1 INT
,col2 INT
)
GO

--Add data
DECLARE @loop INT
SET @loop = 1

WHILE @loop < 10000001
BEGIN
      INSERT INTO drop_test_table VALUES (@loop, @loop+1)
      SET @loop = @loop + 1
END
GO

--Clear memory
CHECKPOINT
DBCC DROPCLEANBUFFERS
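The WHILE loop above is the easiest thing to read, but 10,000,000 single-row inserts are slow. If you just want the master table built quickly, a set-based alternative (my own shortcut, not part of any of the measurements below) is something like this:

--Set-based alternative: generate 10,000,000 sequential numbers in one statement
INSERT INTO drop_test_table (col1, col2)
SELECT n, n + 1
FROM (
      SELECT TOP (10000000)
             ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_objects a
      CROSS JOIN sys.all_objects b
      CROSS JOIN sys.all_objects c
     ) AS numbers
GO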

 

The first test is to create a table with 10,000,000 rows of data from our master table and then delete 1,000,000 rows using a DELETE statement with a WHERE clause. This is the normal DELETE statement used in thousands of operations around the world.

/*
Test 1: Delete data with a DELETE statement
*/

--Create table for test 1
IF OBJECT_ID ('drop_test_table_delete') IS NOT NULL
DROP TABLE drop_test_table_delete
GO

CREATE TABLE drop_test_table_delete
(
col1 INT
,col2 INT
)
GO

--Insert data
INSERT INTO drop_test_table_delete (col1, col2)
SELECT col1, col2 FROM drop_test_table

--Clear memory
CHECKPOINT
DBCC DROPCLEANBUFFERS

 

To understand how the DELETE statement affects our environment, let's trap the TotalSubtreeCost, the amount of time it takes, and the number of IO operations it takes to complete the statement.

 

SET STATISTICS PROFILE ON
SET STATISTICS TIME ON
SET STATISTICS IO ON

DELETE FROM drop_test_table_delete
WHERE col1 < 1000001

/*
TotalSubtreeCost: 64.93308

Table 'drop_test_table_delete'. Scan count 3, logical reads 1038315, physical reads 0, read-ahead reads 36190, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 9610 ms,  elapsed time = 11226 ms.
*/

SET STATISTICS PROFILE OFF
SET STATISTICS TIME OFF
SET STATISTICS IO OFF

--Clean up
IF OBJECT_ID ('drop_test_table_delete') IS NOT NULL
DROP TABLE drop_test_table_delete
GO

 

/*
Test 2: Delete data with batched DELETE statements
*/

 

The second test is to create a table with 10,000,000 rows of data from our master table and then delete 1,000,000 rows using a batched DELETE statement with a WHERE clause. This is another common approach when the amount of data to be deleted is large and the database administrator does not want a single large transaction that may hold locks, grow the transaction log, or take a long time to roll back if the operation fails. (A more general batching pattern that loops on @@ROWCOUNT is sketched after this test.)

--Create table for test 2
IF OBJECT_ID ('drop_test_table_delete') IS NOT NULL
DROP TABLE drop_test_table_delete
GO

CREATE TABLE drop_test_table_delete
(
col1 INT
,col2 INT
)
GO

--Insert data
INSERT INTO drop_test_table_delete (col1, col2)
SELECT col1, col2 FROM drop_test_table

--Clear memory
CHECKPOINT
DBCC DROPCLEANBUFFERS

 

To understand how the batched DELETE statement affects our environment, let's trap the TotalSubtreeCost, the amount of time it takes, and the number of IO operations it takes to complete the statement.

 

SET STATISTICS PROFILE ON
SET STATISTICS TIME ON
SET STATISTICS IO ON

--Delete in batches of 10000
--You can adjust the batch size
DELETE TOP (10000) FROM drop_test_table_delete
WHERE col1 < 1000001
GO 100

/*
You will see a different execution plan for each of the 100 executions of the DELETE statement.

TotalSubtreeCost: each execution is approximately 0.6871661
The TotalSubtreeCost after adding the costs from all 100 executions is around 68.71661

IO
Each execution had a different IO pattern, as would be expected.
The total IO after adding the IO from all 100 executions is around 3,050,000

Time
The total time after adding the time from all 100 executions is around 112 seconds
*/

SET STATISTICS PROFILE OFF
SET STATISTICS TIME OFF
SET STATISTICS IO OFF

IF OBJECT_ID ('drop_test_table_delete') IS NOT NULL
DROP TABLE drop_test_table_delete
GO
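The GO 100 trick works here because I know the batch size and the number of batches up front. When you do not, the more general pattern is to loop until @@ROWCOUNT reports that the last batch deleted nothing. This is only a sketch of that alternative (it is not what was measured above), written against the same test table:

--Batched delete that stops when there is nothing left to delete
DECLARE @rows INT
SET @rows = 1

WHILE @rows > 0
BEGIN
      DELETE TOP (10000) FROM drop_test_table_delete
      WHERE col1 < 1000001

      SET @rows = @@ROWCOUNT
END
GO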

 

/*
Test 3: Delete data through partition switching
*/

 

The third test uses a newer way to delete large amounts of data from a table, introduced in SQL Server 2005. This method involves creating the table as a partitioned table and then switching the partition that contains the unwanted data to a sacrifice table, which is then dropped with a DROP TABLE command. The switch is a metadata-only operation, so it does not create a large transaction or put a lot of data into the transaction log.

 

IF EXISTS (SELECT * FROM sys.partition_schemes WHERE [name] = 'psPartition')
DROP PARTITION SCHEME psPartition
GO

IF EXISTS (SELECT * FROM sys.partition_functions WHERE [name] = 'pfPartition')
DROP PARTITION FUNCTION pfPartition
GO

--Create partition function
CREATE PARTITION FUNCTION [pfPartition] (INT)
AS RANGE LEFT FOR VALUES (1000000, 2000000, 3000000, 4000000, 5000000, 6000000, 7000000, 8000000, 9000000, 10000000)
GO

--Create partition scheme
CREATE PARTITION SCHEME psPartition
AS PARTITION [pfPartition]
TO ([PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY]
, [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY])
GO

--Create partitioned table
IF OBJECT_ID ('drop_test_table_partition') IS NOT NULL
DROP TABLE drop_test_table_partition
GO

CREATE TABLE drop_test_table_partition
(
col1 INT
,col2 INT
) ON psPartition(col1)
GO

--Create table for the switch
IF OBJECT_ID ('drop_test_table_partition_new') IS NOT NULL
DROP TABLE drop_test_table_partition_new
GO

CREATE TABLE drop_test_table_partition_new
(
col1 INT
,col2 INT
)
GO

--Insert data
INSERT INTO drop_test_table_partition (col1, col2)
SELECT col1, col2 FROM drop_test_table

--Clear memory
CHECKPOINT
DBCC DROPCLEANBUFFERS
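Before switching, I like to double-check which partition actually holds the rows I am about to throw away. A quick sanity check (not part of the measured statements) is:

--Count rows per partition to confirm partition 1 holds the 1,000,000 rows to be removed
SELECT $PARTITION.pfPartition(col1) AS partition_number
     , COUNT(*) AS row_count
FROM drop_test_table_partition
GROUP BY $PARTITION.pfPartition(col1)
ORDER BY partition_number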

 

To understand how the partition switching operation affects our environment, let's trap the TotalSubtreeCost, the amount of time it takes, and the number of IO operations it takes to complete the statement.

 

SET STATISTICS PROFILE ON
SET STATISTICS TIME ON
SET STATISTICS IO ON

--Switch the partition
ALTER TABLE drop_test_table_partition SWITCH PARTITION 1 TO drop_test_table_partition_new

/*
   CPU time = 0 ms,  elapsed time = 147 ms
*/

--Drop the table
DROP TABLE drop_test_table_partition_new

/*
   CPU time = 0 ms,  elapsed time = 79 ms.
*/

SET STATISTICS PROFILE OFF
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
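One follow-up step I did not measure: in a real warehouse you would normally also merge the now-empty boundary out of the partition function so the empty partition does not hang around. Something along these lines, using the boundary value from the function above (a sketch only):

--Remove the now-empty partition by merging away its boundary value
ALTER PARTITION FUNCTION pfPartition() MERGE RANGE (1000000)
GO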

 

--Clean up database
IF OBJECT_ID ('drop_test_table_partition') IS NOT NULL
DROP TABLE drop_test_table_partition
GO

IF OBJECT_ID ('drop_test_table_partition_new') IS NOT NULL
DROP TABLE drop_test_table_partition_new
GO

IF EXISTS (SELECT * FROM sys.partition_schemes WHERE [name] = 'psPartition')
DROP PARTITION SCHEME psPartition
GO

IF EXISTS (SELECT * FROM sys.partition_functions WHERE [name] = 'pfPartition')
DROP PARTITION FUNCTION pfPartition
GO

 

/*
Test 4: Delete data through a new table (SELECT INTO)
*/

 

The fourth test is based on the best practice I read in the Microsoft article. In this test I wanted to use SELECT INTO to create the new table, since SELECT INTO usually takes fewer system resources than populating a table with an INSERT INTO statement.
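One caveat of my own, not from the article: how cheap SELECT INTO really is depends partly on the recovery model, since it can be minimally logged under the SIMPLE or BULK_LOGGED recovery model but not under FULL. It is worth checking what the test database is running under:

--Check the current recovery model of the database you are testing in
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = DB_NAME()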

 

--Create table for test 4
IF OBJECT_ID ('drop_test_table_new') IS NOT NULL
DROP TABLE drop_test_table_new
GO

CREATE TABLE drop_test_table_new
(
col1 INT
,col2 INT
)
GO

--Insert data
INSERT INTO drop_test_table_new (col1, col2)
SELECT col1, col2 FROM drop_test_table

--Clear memory
CHECKPOINT
DBCC DROPCLEANBUFFERS

 

To understand how this delete method affects our environment, let's trap the TotalSubtreeCost, the amount of time it takes, and the number of IO operations it takes to complete the statements.

 

SET STATISTICS PROFILE ON
SET STATISTICS TIME ON
SET STATISTICS IO ON

IF OBJECT_ID ('new_table_for_select_into') IS NOT NULL
DROP TABLE new_table_for_select_into
GO

SELECT * INTO new_table_for_select_into FROM drop_test_table_new
WHERE col1 > 1000000

/*
TotalSubtreeCost: 121.9906

Table 'drop_test_table_new'. Scan count 1, logical reads 38315, physical reads 0, read-ahead reads 35105, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

CPU time = 8689 ms,  elapsed time = 9965 ms.
*/

DROP TABLE drop_test_table_new

/*
   CPU time = 16 ms,  elapsed time = 11 ms.
*/

EXEC sp_rename 'new_table_for_select_into' , 'drop_test_table_new'

/*
TotalSubtreeCost values reported for the statements run by sp_rename:
0.00328328, 0.00328328, 0.00328338, 0.00657236, 0.003291317, 0.009868517, 0.006578877

logical reads: around 10
total time: around 1000 ms
*/

SET STATISTICS PROFILE OFF
SET STATISTICS TIME OFF
SET STATISTICS IO OFF

 

--Clean up database
IF OBJECT_ID ('drop_test_table_new') IS NOT NULL
DROP TABLE drop_test_table_new
GO

 

/*
Test 5: Delete data through a new table (INSERT INTO)
*/

 

The last test is the INSERT INTO test. This is the straightforward way to implement the best practice listed by Microsoft.
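A side note of my own before the test: on SQL Server 2008 and later, an INSERT ... SELECT into a heap can also be minimally logged if you add a TABLOCK hint and run under the SIMPLE or BULK_LOGGED recovery model. I did not test that variation here; as a sketch it would look like this:

--Untested variation: TABLOCK hint to allow minimal logging on SQL Server 2008 and later
INSERT INTO drop_test_table_insert_into WITH (TABLOCK) (col1, col2)
SELECT col1, col2 FROM drop_test_table_new WHERE col1 > 1000000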

 

--Create table for test 5
IF OBJECT_ID ('drop_test_table_new') IS NOT NULL
DROP TABLE drop_test_table_new
GO

CREATE TABLE drop_test_table_new
(
col1 INT
,col2 INT
)
GO

--Insert data
INSERT INTO drop_test_table_new (col1, col2)
SELECT col1, col2 FROM drop_test_table

--Clear memory
CHECKPOINT
DBCC DROPCLEANBUFFERS

 

To understand how this delete method affects our environment, let's trap the TotalSubtreeCost, the amount of time it takes, and the number of IO operations it takes to complete the statements.

 

SET STATISTICS PROFILE ON
SET STATISTICS TIME ON
SET STATISTICS IO ON

--Create the replacement table for test 5
IF OBJECT_ID ('drop_test_table_insert_into') IS NOT NULL
DROP TABLE drop_test_table_insert_into
GO

CREATE TABLE drop_test_table_insert_into
(
col1 INT
,col2 INT
)
GO

INSERT INTO drop_test_table_insert_into (col1, col2)
SELECT col1, col2 FROM drop_test_table_new WHERE col1 > 1000000

/*
TotalSubtreeCost: 120.7378

Table 'drop_test_table_insert_into'. Scan count 0, logical reads 9034482, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'drop_test_table_new'. Scan count 1, logical reads 38315, physical reads 0, read-ahead reads 35086, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

CPU time = 32417 ms,  elapsed time = 38687 ms.
*/

DROP TABLE drop_test_table_new

/*
   CPU time = 16 ms,  elapsed time = 11 ms
*/

EXEC sp_rename 'drop_test_table_insert_into' , 'drop_test_table_new'

/*
TotalSubtreeCost values reported for the statements run by sp_rename:
0.00328328, 0.00328328, 0.00328338, 0.00657236, 0.003291317, 0.009868517, 0.006578877

logical reads: around 10
total time: around 1000 ms
*/

SET STATISTICS PROFILE OFF
SET STATISTICS TIME OFF
SET STATISTICS IO OFF

 

--Clean up database
IF OBJECT_ID ('drop_test_table_new') IS NOT NULL
DROP TABLE drop_test_table_new
GO

/*
--When you are done, clean up your database
IF OBJECT_ID ('drop_test_table') IS NOT NULL
DROP TABLE drop_test_table
GO
*/

Findings

If you review the findings below, you will see that the partition switch method is the clear winner when it comes to dropping large amounts of data. Of course, your table needs to be partitioned in the first place and you cannot have any non-aligned indexes on the partitioned table, but once you work through that, switching is not only easy, it is fast and takes very little in the way of system resources. After that, in my opinion on what counts as the best resource usage, comes Test 4, which used SELECT INTO.

The next three in order were Test 1, the regular DELETE, Test 2, the batched DELETE, and then Test 5, the INSERT INTO.

Another thing you will notice about the findings and tests is that I did not take into account the resources it would take to recreate the indexes, relationships, triggers, constraints, etc. that are usually found on real-world tables. You can create your own tests in your environment to see which method is best for you.

Test 1
TotalSubtreeCost: 64
IO: 1,074,000 logical reads
Time: 11 seconds

 

Test 2
TotalSubtreeCost: 69
IO: 3,050,000 logical reads
Time: 112 seconds

Test 3
TotalSubtreeCost: 0
IO: 0 logical reads
Time: 200 milliseconds

Test 4
TotalSubtreeCost: 122
IO: 73,000 logical reads
Time: 10 seconds

Test 5
TotalSubtreeCost: 121
IO: 9,072,000 logical reads
Time: 39 seconds