SSIS (SQL Server Integration Services) is a powerful technology for most things data related. This includes moving data in and out of SQL Server and many other things like copying files or in my case today profiling data.The following table shows a problem we run into all the time with many clients. In this case we have a lookup table that lists product categories. As you can see there is a lot of bad data there where the category names are not consistent. For instance Belt is also shown as belt, blt, and bt. And it’s not usually just the database issue necessarily its either the database is not designed right and/or the applications are not handling data correctly.
We saw this recently where a database had a global option table for all types of fields and the data was really bogus just like this figure shows.
The only way you can really clean up your data is to profile it and monitor it. And there are good tools coming out like the new features in SQL Server 2012 which I will get into in another post.
This discussion brings me to the SSIS Data Profiler for this post. This SSIS task is really cool and will let us do things like analyze the key strength of columns, look at the column lengths for the current data, analyze the data in those columns and so forth. It’s nice because this is right out of the box in SQL Server. Let’s dig in and see how this works.
The following figure shows the Data Profiling Task on the design panel of SSIS 1012. This task is the only thing in the current project.
Once you have the task in the design surface you can configure it by double clicking it or right clicking and selecting Edit. The next figure shows the edit dialog. When the Data Profiling Task runs it outputs its results to a file. Notice how the DestinationType property is set to FileConnection and the Destination property to the DataProfile.xml connection. This is where the results will end up when the task runs.
Now that you have an output location, you can create a profile of your data. The easiest way to do this is to click the Quick Profile button shown in the last figure. This will open the following dialog:
You can see that I have already selected the Localhost.db connection and the ProductCategory table. Then I checked all of the boxes to create a complete profile. Next I clicked Ok and the profile was done.
Then I ran the package in debug mode.
Now the really cool thing is the Data Profile Viewer. You can find this in the Start menu and also it’s in in Data Profiling Tasks property page in the General area where you can just click Open Profile Viewer (see the 2nd figure) to run it.
This brings up the following dialog:
This tool allows us to dig into the results of our data profile and see what’s going on with our data. There is lots of good info in this output depending upon what you are looking at. For instance, you can find out the key strength of columns, look at a profile of the data, check column length against data and more.
In the case I mentioned early on, we really want to look at the data and its consistency. You can click on the Column Pattern Profiles item to dig into what your data looks like. The next figure shows this view after I also clicked on the Category column:
You can also click the Drill Down button to the right of All Rows to display all the data that a row represents. For instance if I were to click Drill Down, the display would change to this:
You can see the two discrete types of data that make up PANTS is Pants and pants. To drill into another field, double click it and it will change the drill down details.
As you can see from this image there’s lots of fields that are not correct. The next image shows the Pattern Distribution after I clicked the Pattern header to sort it. Now its clear that Belt and Blt are the same.
Now that I have identified this problem, I can run this query in SSMS:
select * from [dbo].[ProductCategory] where description = ‘Belts’
This shows me that we have 3 entries for Belts where we should have 1 entry.
15 Belt Belts
16 belt Belts
17 blt Belts
This type of cleanup should be done on any database that’s been in production for any length of time. When you find these problems, you can look at related tables and figure out where to make other data changes. And you should look at your applications and make sure no one has put in hard references or is feeding a list with hard coded values such as Belt or Blt and such.
It’s also worth looking at cleaning up your data layout and making lookups normalized if they are not already.