Poor data is bad for business. Here are some of the reasons companies start data quality initiatives:
- Projects can be delayed (or declared failures) because business users saw “poor” data in Business Intelligence deployments. In fact, poor data quality has been reported as the single biggest problem hindering BI deployments.
- Companies risk losing money because of poor-quality data. The 2002 TDWI report “Data Quality and the Bottom Line” estimated losses of as much as $10 million per year for some companies.
To that end, in this article I’ll show you how to use SQL Server 2012’s Data Quality Services (DQS) to solve common data quality problems.
I’ll cover the following topics in this two-part series:
A. Data Standardization
B. Identifying and correcting unrealistic or invalid values
C. Validating and correcting records using Regular Expressions
D. Validating and correcting a value in a field based on the value in another field (or fields) in the same row
E. Reducing data duplication
F. Cleaning records using reference data sets from the Windows Azure Data Market
Before we begin, note that real-world requirements will rarely be as straightforward as the ones discussed in this tutorial. The article is written to serve as a reference for IT professionals and data stewards who are looking for examples of how to solve common problems with Data Quality Services, and I have kept the demos and technical discussion at level 200.
I am also assuming that you know the basic features and concepts of Data Quality Services; the demos in this article extend that knowledge to an intermediate level so that you are empowered to build solutions with DQS in your organization. I encourage you to follow along as we discuss each technique; the data files required for the tutorials can be found here.
Also, you’ll need SQL Server 2012 with Data Quality Services installed to try the examples on your machine. If you have not installed Data Quality Services, you can find the instructions here: http://msdn.microsoft.com/en-us/library/gg492277.aspx
With that, here are the solutions to the common problems that I listed before:
Data Standardization
We have customer data coming in from two different source systems. In one system, the Gender column contains the values Male and Female, while in the other it contains M and F. We want to correct this inconsistency and standardize the values in the Gender column.
Let’s see how SQL Server 2012’s Data Quality Services helps us solve this problem:
1. Open DQS client, Connect to DQS server
2. Under Create or Maintain Data Quality knowledge base, click on New Knowledge Base
3. Enter the name, Description
4. Under Select Activity, choose Knowledge Discovery
5. Select the Data Source as Excel File, browse to and open the CustomerInformation.xls file, and leave the worksheet value at the default. Make sure that Use first row as header is checked.
Note: The Full Names in the data set that I shared are imported from the AdventureWorks Sample Database provided by Microsoft.
6. Select Full Name and Gender under Source Column
7. Create a domain with domain name Full Name and leave all the values to default
8. Map the Full Name under source column to domain that you created in previous step named Full Name
9. Create a domain with domain name Gender and leave all the values to default
10. Map the Gender under source column to domain that you created in previous step named Gender
11. Click on Next
12. Click on Start
13. In the profiler statistics displayed, note that the Gender column has 4 unique values while we were expecting 2 (Male and Female). The profiler statistics have hinted that we should investigate this column further.
14. Click on Next
15. Select the Domain Gender to see the values in the source column
16. Change the type of the values F and M to Invalid
17. Correct the values F and M to Female and Male, respectively
18. Click on Finish and publish the Knowledge Base (KB)
19. We are going to use the KB that we created in previous steps in a Data Quality Project
20. Under Create or maintain Data Quality Project, click on New Data Quality Project
21. Enter the Name and select the KB that we created in previous steps
22. Under Select Activity, select Cleansing
23. Click Next.
24. Select Excel File as the Data Source, browse to the CustomerInformation.xls file, and leave the worksheet value at Sheet1$
25. Map Source Column Full Name to Domain Named Full Name, and map Source Column named Gender to Domain named Gender, click next
26. Click on start
27. Once the Analysis of Data source has been completed successfully, click on next
28. Select the Gender Domain from the left column
29. Switch to the corrected tab, note that the values F and M were corrected to Female and Male respectively:
30. Click next
31. Under the Destination Type, select Excel; Browse to the Destination location and enter Excel File name (file will be created if it does not exist)
32. Select data only for output format and click on Export
33. You’ll get a notification when the export of the file completes:
34. Now we want to use the KB that we built earlier in a SQL Server Integration Services (SSIS) project. This will help you understand how Data Quality Services (DQS) integrates with SSIS. For the official resource on the DQS Cleansing transformation in SSIS, read: http://msdn.microsoft.com/en-us/library/ee677619.aspx
35. Open SQL Server Data Tools (SSDT) and start a New Integration Services Project
36. Drag and drop the Data Flow Task to the Control Flow in the SSIS designer
37. Rename the Data Flow Task to Receive and Standardize data using DQS
38. Double click the Data Flow Task to switch to the Data Flow Tab
39. Drag the Excel Source Task to the Designer and configure it to connect to CustomerDataFromSourceSystem1. Select Sheet1$ as the Name of the Excel sheet.
40. Rename this Excel Source Task to CustomerDataFromSourceSystem1
41. Drag one more Excel Source Task to the Designer and configure it to connect to CustomerDataFromSourceSystem2. Select Sheet1$ as the Name of the Excel sheet.
42. Rename this Excel Source Task to CustomerDataFromSourceSystem2
43. Drag a Union All task to the designer and rename it to Combine Data From SourceSystem1 and SourceSystem2
44. Drag the blue connectors from the two Excel Source tasks to the Union All task
45. Drag the DQS Cleansing task to the designer. Rename it to Standardize Data using DQS
46. Right click the DQS Cleansing Task, click on Edit to open DQS cleansing Transformation Editor
47. Configure the Data Quality Connection Manager and Data Quality Knowledge Base. Select the knowledge Base that was created in Previous Steps
48. After successfully connecting to the DQS server and selecting the KB, you’ll see the available domains
49. Switch to the “mapping” Tab in the DQS cleansing Transformation Editor
50. Check Gender from the Available Input Columns
51. Under the Domain select Gender for the Input Column Gender
52. Click OK to switch back to Designer
53. Drag an Excel Destination task to the designer and rename it to CustomerDataDestination
54. Drag the blue connector from Standardize Data using DQS to the CustomerDataDestination
55. Configure the Excel Destination task. Connect it to an empty Excel file that has two columns named Full Name and Gender. After establishing the connection to this file, map the destination column Gender to the input column Gender_Output, and map the destination column Full Name to the input column Full Name
56. Start Debugging (F5) to Test the package and see the results:
57. You can deploy the package after successful testing.
In this section, we saw how to standardize data coming from multiple source systems and took a step closer to achieving a “single version of the truth”.
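Conceptually, the cleansing the KB and SSIS package perform can be sketched in a few lines of Python: combine the records from the two source systems (the “Union All” step) and map the inconsistent gender codes to the standard values. The sample records and the standardize() helper below are illustrative assumptions, not DQS APIs.

```python
# Corrections we taught the knowledge base: M -> Male, F -> Female.
GENDER_CORRECTIONS = {"M": "Male", "F": "Female"}

def standardize(record):
    """Return a copy of the record with Gender mapped to the standard value."""
    gender = record["Gender"]
    return {**record, "Gender": GENDER_CORRECTIONS.get(gender, gender)}

# Two source systems with inconsistent Gender encodings (sample data).
source_system_1 = [{"Full Name": "Ken Sanchez", "Gender": "Male"}]
source_system_2 = [{"Full Name": "Terri Duffy", "Gender": "F"}]

# "Union All" the two sources, then cleanse each row.
combined = source_system_1 + source_system_2
cleansed = [standardize(r) for r in combined]
```

Of course, DQS does far more than a static lookup (spelling variants, synonyms, confidence scores), but the mapping above captures the essence of this particular standardization.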
Identifying and correcting unrealistic or invalid values
With Data Quality Services, data stewards can define rules that correct invalid values in the data. Let’s take an example: if you get a data set where a person’s height is 120 inches (10 ft.), you know something is probably wrong. For such cases, DQS lets you define rules that detect outliers, and by defining such rules in DQS you automate the process of identifying and correcting invalid values in the data.
With that, here’s an example of creating a domain rule that flags height values that are not between 5 ft. and 7 ft. 6 in. Here are the steps:
1. We will start with the Domain Management activity of the knowledge base. In a real-world scenario, the Knowledge Discovery activity is typically performed before Domain Management, but since we already know the problem in our data, we are jumping straight to Domain Management
2. Open DQS client > connect to DQS server > click on New Knowledge Base
3. Enter the name of the knowledge base and select Domain Management as the activity
4. Create a domain named Player and leave all values at their defaults
5. Create another domain named Height and change the Data Type to Decimal
6. Switch to the Domain Rules tab of the Height Domain
7. Here, based on the business requirement, we are going to create a domain rule that flags for review any player height less than 5 ft. or greater than 7 ft. 6 in.
Name the domain rule: Unrealistic Values in Height Domain.
Description of the domain rule: check for heights less than 5 ft. or greater than 7 ft. 6 in.
Note: 1 ft. = 12 inches, so 5 ft. = 60 inches and 7 ft. 6 in. = 90 inches.
Here is the domain rule:
Value is greater than 60
Value is less than 90
8. Click Finish and Publish the KB
9. Click on New Data Quality Project
10. Enter name, select the knowledge Base created in previous steps and select Cleansing as the activity. Click Next
11. Choose the data source to cleanse against the mapped domains:
Select Excel File as the data source
Select Dallas Mavericks Player Rooster.xlsx as the Excel File
Map Source Column named height in Inches to Height Domain
Map Source Column named Player to Player Domain
12. Click next
13. Click on start
14. Notice that DQS provides statistics via Profiler. Under source statistics, it says that 3 values are invalid:
15. Click next
16. Select the Height Domain
17. Switch to the Invalid tab
18. You can enter the valid values in the Correct to field
19. After entering the values in the Correct to field, check Approve for all three invalid values
20. Switch to the Corrected tab to see the changes that we just made
21. Click next
22. Note that the column Height_in_inches_output contains the cleaned records
23. Here you can choose to export the results to the desired destination
Note that you can reuse the knowledge base (KB) that you created in the above steps in other data quality projects too.
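The domain rule above boils down to a simple predicate. The Python sketch below mirrors the two rule clauses (value is greater than 60, value is less than 90); it is an illustrative assumption for clarity, not DQS code, and the sample heights are made up.

```python
def is_valid_height(height_in_inches):
    """Mirror the domain rule clauses: greater than 60 AND less than 90."""
    return 60 < height_in_inches < 90

# Sample heights in inches; values outside the range get flagged for review,
# just as DQS marks them invalid in the cleansing project.
heights = [78.5, 120, 81, 48]
invalid = [h for h in heights if not is_valid_height(h)]
```

Note that, as written, the clauses are strict inequalities, so a height of exactly 60 or 90 inches would also be flagged; if the boundaries should be valid, the rule conditions would need to use greater than or equal to / less than or equal to instead.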
Validating and correcting records using Regular Expressions
In the previous section, we used domain rules to check for unrealistic values with the greater than and less than operators. But domain-rule conditions for other data types, such as strings, are not limited to greater than, less than, and equal to; there are also conditions like value begins with, value is date/time, value contains, and value matches regular expression, among others. The official reference can be read here. In this section we’ll explore one of these conditions and see how to clean records using a regular expression in a domain rule. For demo purposes, we’ll perform a preliminary cleansing activity on email records.
Before we begin, here’s a primer on Regular Expressions:
Regular expressions (also known as regex or regexp) are a language for matching strings against patterns. Here’s a visual:
So, if we defined a pattern that asks “is this a valid email address?”, the regular expression would return PASS or FAIL. Let’s try creating a regular expression that performs a preliminary email address check:
1. Open DQS client, connect to DQS Server, click on New Knowledge Base
2. Enter Name and select Domain Management from the activity
3. Create Domain, name it Email Address and select String as the Data Type
4. Switch to the Domain Rules tab
5. Add a new Domain Rule
6. Name the Domain Rule: Is this a valid email address?
7. Under the Build a Rule section, select value matches regular Expression
8. Use this regular expression for the purpose of the demo: [A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]+
Note that the results you get will only be as good as your regular expression. After all, regular expressions are like “programs”, so the quality of your regex depends on how well you write it.
9. You can test the domain rules while you are creating them. To do so, click on Run the selected domain rule on test data
10. Click on Add a new testing term for the domain rule
11. Enter a few records to validate them against the domain rule, for this demo, I am going to add:
12. After adding the terms, click on Test the domain rule on all terms
13. Click on Close
14. Click on Finish and Publish the KB.
15. You can use this KB to validate email addresses in your data sets. We’re not going to start a Data Quality Project for this section.
16. This was just a basic demo showing how you can use regular expressions to detect values that do not match a pattern. In real-world scenarios there are other options for cleaning email records, such as third-party data sets or a more advanced regular expression. The regex I shared was meant only for the demo; if you want to read more about regular expressions used to clean emails, I strongly recommend reading email regular expressions.
Note: We’re going to see how to use third party reference datasets in part 2 of this series.
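If you want to experiment with this kind of preliminary pattern check outside DQS, the same idea can be sketched in Python with the standard re module. The pattern below is a lightly tightened variant in the spirit of the demo regex (character class case and the escaped dot are my assumptions); as noted above, it is nowhere near a complete email validator.

```python
import re

# Preliminary email pattern: local part, "@", domain, dot, letters-only TLD.
EMAIL_RE = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]+")

def is_probably_email(value):
    # fullmatch() requires the WHOLE value to fit the pattern, mirroring
    # how a DQS domain rule validates the entire domain value.
    return EMAIL_RE.fullmatch(value) is not None
```

Using fullmatch() rather than search() matters here: with search(), junk such as "xx a@b.co xx" would slip through because the pattern matches a substring.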
In this article, we saw how to solve some common data quality problems using SQL Server 2012 Data Quality Services. We learned:
1. How to standardize data
2. How to use the DQS Cleansing transformation in SSIS
3. How to build domain values and domain rules
In the next article, we will continue our journey and learn about:
1. Validating and correcting a value in a field based on the value in another field (or fields) in the same row
2. Reducing data duplication
3. Cleaning records using a reference data set from the Windows Azure Data Market
References:
Survey: Poor data quality most common Business Intelligence problem
Data Quality and the Bottom Line
Data Quality Services concepts
Create Composite Domain
Create a Cross-Domain Rule
Using DQS: Cleansing complex data using composite domains
Official resource on DQS cleansing Transformation in SSIS
Using DataMarket with SQL Server Data Quality Services
Create a Domain Rule
How to Find or Validate Email Address