I was recently looking at some blog posts and noticed a new one from the SQL Server CAT team titled Top 10 Best Practices for Building a Large Scale Relational Data Warehouse. This short piece has several interesting items in it, but one of the top 10 best practices is devoted to deleting data from the data warehouse.
If you read the piece, the deleting data section lists several different ways of deleting large amounts of data from a table in a data warehouse, and one of them caught my attention. One of the best practices listed is to create a second table, insert all the data you do not want to delete into it, and then drop the old table and rename the new one. This lets you basically “delete” the old data through its non-inclusion in the INSERT statement. My first thought was that this would be a hassle, as you would have to redo all foreign key relationships, indexes, constraints, etc. on the new table, but to tell you the truth, I had never thought about doing a large delete this way, so I decided to play with the statement to see if it is any better than the “normal” ways of deleting large amounts of data.
I came up with a few basic tests to establish baselines I could compare against the “INSERT” delete method: deleting data with a DELETE statement, a batched DELETE statement, partition switching, creating a new table with SELECT INTO, and creating a new table and populating it with INSERT SELECT. These are simple tests with no indexes, relationships, or triggers attached to the table. You will have to expand these tests to create more realistic tables in a more real-world environment.
Note: I am not considering TRUNCATE TABLE as I only want to delete some of the data and not all of it.
Let’s create a master table with 10,000,000 rows of data that we can use for each of our tests.
SET NOCOUNT ON
GO
/*
Prep for the tests by creating a master table with 10,000,000 rows of data
*/
--Create master table for demo
IF OBJECT_ID ('drop_test_table') IS NOT NULL
DROP TABLE drop_test_table
GO
CREATE TABLE drop_test_table
(
col1 INT
,col2 INT
)
GO
--Add data
DECLARE @loop INT
SET @loop = 1
WHILE @loop < 10000001
BEGIN
INSERT INTO drop_test_table VALUES (@loop, @loop+1)
SET @loop = @loop + 1
END
GO
--Clear memory
CHECKPOINT
DBCC DROPCLEANBUFFERS
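As an aside, the row-by-row loop above takes quite a while to run. If you just want the test data in place quickly, a set-based insert is usually much faster. Here is a minimal sketch using ROW_NUMBER() over a cross join; it assumes SQL Server 2005 or later and that crossing sys.all_columns with itself yields at least 10,000,000 rows on your instance:
--Optional set-based alternative to the WHILE loop above
--Generates 10,000,000 sequential values in a single statement
INSERT INTO drop_test_table (col1, col2)
SELECT n, n + 1
FROM (SELECT TOP (10000000)
             ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_columns AS a
      CROSS JOIN sys.all_columns AS b) AS nums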
The first test creates a table with 10,000,000 rows of data from our master table and then deletes 1,000,000 rows of data using a DELETE statement with a WHERE clause: a normal delete statement, used in thousands of operations around the world.
/*
Test 1: Delete data with a DELETE statement
*/
--Create table for test 1
IF OBJECT_ID ('drop_test_table_delete') IS NOT NULL
DROP TABLE drop_test_table_delete
GO
CREATE TABLE drop_test_table_delete
(
col1 INT
,col2 INT
)
GO
--Insert data
INSERT INTO drop_test_table_delete (col1, col2)
SELECT col1, col2 FROM drop_test_table
--Clear memory
CHECKPOINT
DBCC DROPCLEANBUFFERS
To understand how the DELETE statement affects our environment, let’s capture the TotalSubtreeCost, the elapsed time, and the number of IO operations it takes to complete the statement.
SET STATISTICS PROFILE ON
SET STATISTICS TIME ON
SET STATISTICS IO ON
DELETE FROM drop_test_table_delete
WHERE col1 < 1000001
/*
TotalSubtreeCost: 64.93308
Table 'drop_test_table_delete'. Scan count 3, logical reads 1038315, physical reads 0, read-ahead reads 36190, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 9610 ms, elapsed time = 11226 ms.
*/
SET STATISTICS PROFILE OFF
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
--Clean up
IF OBJECT_ID ('drop_test_table_delete') IS NOT NULL
DROP TABLE drop_test_table_delete
GO
/*
Test 2: Delete data with batched DELETE statements
*/
The second test creates a table with 10,000,000 rows of data from our master table and then deletes 1,000,000 rows of data using a batched DELETE statement with a WHERE clause. This is another common approach, used when the amount of data to be deleted is large and the database administrator does not want a single large transaction that may hold locks, grow the transaction log, or take a long time to roll back if the operation fails.
--Create table for test 2
IF OBJECT_ID ('drop_test_table_delete') IS NOT NULL
DROP TABLE drop_test_table_delete
GO
CREATE TABLE drop_test_table_delete
(
col1 INT
,col2 INT
)
GO
--Insert data
INSERT INTO drop_test_table_delete (col1, col2)
SELECT col1, col2 FROM drop_test_table
--Clear memory
CHECKPOINT
DBCC DROPCLEANBUFFERS
To understand how the DELETE statement affects our environment, let’s capture the TotalSubtreeCost, the elapsed time, and the number of IO operations it takes to complete the statement.
SET STATISTICS PROFILE ON
SET STATISTICS TIME ON
SET STATISTICS IO ON
--Delete in batches of 10,000 rows
--You can adjust the batch size
DELETE TOP (10000) FROM drop_test_table_delete
WHERE col1 < 1000001
GO 100
/*
You will see a different execution plan for each of the 100 executions of the DELETE statement
TotalSubtreeCost: Each execution is approximately 0.6871661
The TotalSubtreeCost after adding TotalSubtreeCost from all 100 executions is around:
68.71661
IO
Each execution had different IO patterns as would be expected
The total IO after adding IO from all 100 executions is around:
3,050,000
Time
The total time after adding time from all 100 executions is around:
112 seconds
*/
SET STATISTICS PROFILE OFF
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
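As an aside, the GO 100 trick works here only because we know in advance exactly how many batches are needed. In production you usually do not, so a common variation (a minimal sketch, same batch size) loops until the DELETE affects no more rows:
--Alternative batching pattern: loop until no rows match the predicate
WHILE 1 = 1
BEGIN
    DELETE TOP (10000) FROM drop_test_table_delete
    WHERE col1 < 1000001

    IF @@ROWCOUNT = 0
        BREAK
END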
IF OBJECT_ID ('drop_test_table_delete') IS NOT NULL
DROP TABLE drop_test_table_delete
GO
/*
Test 3: Delete data through partition switching
*/
The third test uses a way to delete large amounts of data from a table that is new in SQL Server 2005. This method involves creating the table as a partitioned table and then switching the partition that contains the unwanted data into a sacrificial table, which is then removed with a DROP TABLE command; this does not create a large transaction or put a lot of data into the transaction log.
IF EXISTS (SELECT * FROM sys.partition_schemes WHERE [name] = 'psPartition')
DROP PARTITION SCHEME psPartition
GO
IF EXISTS (SELECT * FROM sys.partition_functions WHERE [name] = 'pfPartition')
DROP PARTITION FUNCTION pfPartition
GO
--Create partition function on col1
CREATE PARTITION FUNCTION [pfPartition] (INT)
AS RANGE LEFT FOR VALUES ( 1000000, 2000000, 3000000, 4000000, 5000000, 6000000, 7000000, 8000000, 9000000, 10000000)
GO
--Create partition scheme
CREATE PARTITION SCHEME psPartition
AS PARTITION [pfPartition]
TO ([PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY],[PRIMARY],[PRIMARY],[PRIMARY]
,[PRIMARY],[PRIMARY],[PRIMARY],[PRIMARY])
GO
--Create partitioned table
IF OBJECT_ID ('drop_test_table_partition') IS NOT NULL
DROP TABLE drop_test_table_partition
GO
CREATE TABLE drop_test_table_partition
(
col1 INT
,col2 INT
) ON psPartition(col1)
GO
--Create table for switch
IF OBJECT_ID ('drop_test_table_partition_new') IS NOT NULL
DROP TABLE drop_test_table_partition_new
GO
CREATE TABLE drop_test_table_partition_new
(
col1 INT
,col2 INT
)
GO
--Insert data
INSERT INTO drop_test_table_partition (col1, col2)
SELECT col1, col2 FROM drop_test_table
--Clear memory
CHECKPOINT
DBCC DROPCLEANBUFFERS
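Before switching, it is worth confirming which partition actually holds the rows you want to remove. The $PARTITION function maps a value to its partition number, and sys.partitions shows the row counts; a quick sanity check using the objects created above:
--Verify that col1 values up to 1,000,000 map to partition 1 (RANGE LEFT)
SELECT $PARTITION.pfPartition(1000000) AS partition_number
--Row counts per partition for the test table
SELECT partition_number, rows
FROM sys.partitions
WHERE object_id = OBJECT_ID('drop_test_table_partition')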
To understand how the partition switching operation affects our environment, let’s capture the TotalSubtreeCost, the elapsed time, and the number of IO operations it takes to complete the statement.
SET STATISTICS PROFILE ON
SET STATISTICS TIME ON
SET STATISTICS IO ON
--Switch the partition
ALTER TABLE drop_test_table_partition SWITCH PARTITION 1 TO drop_test_table_partition_new
/*
CPU time = 0 ms, elapsed time = 147 ms
*/
--Drop the table
DROP TABLE drop_test_table_partition_new
/*
CPU time = 0 ms, elapsed time = 79 ms.
*/
SET STATISTICS PROFILE OFF
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
--Clean up database
IF OBJECT_ID ('drop_test_table_partition') IS NOT NULL
DROP TABLE drop_test_table_partition
GO
IF OBJECT_ID ('drop_test_table_partition_new') IS NOT NULL
DROP TABLE drop_test_table_partition_new
GO
IF EXISTS (SELECT * FROM sys.partition_schemes WHERE [name] = 'psPartition')
DROP PARTITION SCHEME psPartition
GO
IF EXISTS (SELECT * FROM sys.partition_functions WHERE [name] = 'pfPartition')
DROP PARTITION FUNCTION pfPartition
GO
GO
/*
Test 4: Delete data through new table (SELECT INTO)
*/
The fourth test is based on the best practice I read in the Microsoft article. In this test, I wanted to use SELECT INTO to create the new table, since SELECT INTO usually takes fewer system resources than an INSERT INTO statement.
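One caveat worth checking before relying on that: SELECT INTO is only minimally logged when the database uses the SIMPLE or BULK_LOGGED recovery model; under the FULL recovery model it is fully logged. A quick way to check the current setting:
--Check the current recovery model before relying on minimal logging
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = DB_NAME()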
--Create table for test 4
IF OBJECT_ID ('drop_test_table_new') IS NOT NULL
DROP TABLE drop_test_table_new
GO
CREATE TABLE drop_test_table_new
(
col1 INT
,col2 INT
)
GO
--Insert data
INSERT INTO drop_test_table_new (col1, col2)
SELECT col1, col2 FROM drop_test_table
--Clear memory
CHECKPOINT
DBCC DROPCLEANBUFFERS
To understand how the delete operation affects our environment, let’s capture the TotalSubtreeCost, the elapsed time, and the number of IO operations it takes to complete the statement.
SET STATISTICS PROFILE ON
SET STATISTICS TIME ON
SET STATISTICS IO ON
IF OBJECT_ID ('new_table_for_select_into') IS NOT NULL
DROP TABLE new_table_for_select_into
GO
SELECT * INTO new_table_for_select_into FROM drop_test_table_new
WHERE col1 > 1000000
/*
TotalSubtreeCost: 121.9906
Table 'drop_test_table_new'. Scan count 1, logical reads 38315, physical reads 0, read-ahead reads 35105, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
CPU time = 8689 ms, elapsed time = 9965 ms.
*/
DROP TABLE drop_test_table_new
/*
CPU time = 16 ms, elapsed time = 11 ms.
*/
EXEC sp_rename 'new_table_for_select_into', 'drop_test_table_new'
/*
TotalSubtreeCost for the statements executed by sp_rename:
0.00328328, 0.00328328, 0.00328338, 0.00657236, 0.003291317, 0.009868517, 0.006578877
Logical reads: around 10
Total time: around 1,000 ms
*/
SET STATISTICS PROFILE OFF
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
--Clean up database
IF OBJECT_ID ('drop_test_table_new') IS NOT NULL
DROP TABLE drop_test_table_new
GO
/*
Test 5: Delete data through new table (INSERT INTO)
*/
The last test is the INSERT INTO test. This would be the normal method of implementing the best practice listed by Microsoft.
--Create table for test 5
IF OBJECT_ID ('drop_test_table_new') IS NOT NULL
DROP TABLE drop_test_table_new
GO
CREATE TABLE drop_test_table_new
(
col1 INT
,col2 INT
)
GO
--Insert data
INSERT INTO drop_test_table_new (col1, col2)
SELECT col1, col2 FROM drop_test_table
--Clear memory
CHECKPOINT
DBCC DROPCLEANBUFFERS
To understand how the delete operation affects our environment, let’s capture the TotalSubtreeCost, the elapsed time, and the number of IO operations it takes to complete the statement.
SET STATISTICS PROFILE ON
SET STATISTICS TIME ON
SET STATISTICS IO ON
--Create table for test 5
IF OBJECT_ID ('drop_test_table_insert_into') IS NOT NULL
DROP TABLE drop_test_table_insert_into
GO
CREATE TABLE drop_test_table_insert_into
(
col1 INT
,col2 INT
)
GO
INSERT INTO drop_test_table_insert_into (col1, col2)
SELECT col1, col2 FROM drop_test_table_new WHERE col1 > 1000000
/*
TotalSubtreeCost: 120.7378
Table 'drop_test_table_insert_into'. Scan count 0, logical reads 9034482, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'drop_test_table_new'. Scan count 1, logical reads 38315, physical reads 0, read-ahead reads 35086, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
CPU time = 32417 ms, elapsed time = 38687 ms.
*/
DROP TABLE drop_test_table_new
/*
CPU time = 16 ms, elapsed time = 11 ms
*/
EXEC sp_rename 'drop_test_table_insert_into', 'drop_test_table_new'
/*
TotalSubtreeCost for the statements executed by sp_rename:
0.00328328, 0.00328328, 0.00328338, 0.00657236, 0.003291317, 0.009868517, 0.006578877
Logical reads: around 10
Total time: around 1,000 ms
*/
*/
SET STATISTICS PROFILE OFF
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
--Clean up database
IF OBJECT_ID ('drop_test_table_new') IS NOT NULL
DROP TABLE drop_test_table_new
GO
/*
--When you are done, clean up your database
IF OBJECT_ID ('drop_test_table') IS NOT NULL
DROP TABLE drop_test_table
GO
*/
Findings
If you review the findings below, you will see that the partition switch method is the clear winner when it comes to deleting large amounts of data. Of course, your tables need to be partitioned in the first place and you cannot have any non-aligned indexes on the partitioned table, but once you work through that, switching is not only easy, it is fast and takes very few system resources. After that, based on my opinion of what constitutes the best resource usage, comes test 4, which used SELECT INTO.
The next three, in order, were test 1 (the regular DELETE), test 2 (the batched DELETE), and test 5 (the INSERT INTO).
Another thing you will notice about the findings and tests is that I did not take into account the resources it would take to recreate the indexes, relationships, triggers, constraints, etc. that are usually found on real-world tables; the sketch after the findings illustrates the kind of work involved. You can create your own tests in your environment to see which method is best for you.
Test 1
TotalSubtreeCost: 64
IO: 1,074,000 logical reads
Time: 11 seconds
Test 2
TotalSubtreeCost: 69
IO: 3,050,000 logical reads
Time: 112 seconds
Test 3
TotalSubtreeCost: 0
IO: 0 logical reads
Time: 200 milliseconds
Test 4
TotalSubtreeCost: 122
IO: 73,000 logical reads
Time: 10 seconds
Test 5
TotalSubtreeCost: 121
IO: 9,072,000 logical reads
Time: 39 seconds
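To illustrate the point about recreating objects: if the original table had, say, a clustered primary key and a nonclustered index, the drop-and-rename methods (tests 4 and 5) would also need something like the following after the rename. This is a sketch only; none of the test tables above actually had these objects:
--Hypothetical example: recreate objects lost by the drop-and-rename approach
ALTER TABLE drop_test_table_new
ADD CONSTRAINT PK_drop_test_table_new PRIMARY KEY CLUSTERED (col1)
GO
CREATE NONCLUSTERED INDEX IX_drop_test_table_new_col2
ON drop_test_table_new (col2)
GO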