Cleansing Customer Data Using Data Quality Services (DQS)

This post has been republished via RSS; it originally appeared at: SQL Server Integration Services (SSIS) articles.

First published on MSDN on Jun 14, 2013

SQL Server 2012 Data Quality Services (DQS) enables you to cleanse data using your own Knowledge Base. In this blog article, I am going to show how to create a simple Knowledge Base which you can use to cleanse your customer data.


The steps for using DQS to cleanse data is as follows:


A. Create DQS Knowledge Base
B. Create a DQS project to cleanse your data using the Knowledge Base


For this example, I am using the following sample data - you can run the following script on your SQL Server database :


CREATE TABLE MyCustomers
(
CustomerID INT,
CustomerName NVARCHAR(255),
City NVARCHAR(32),
Province NVARCHAR(32),
LastUpdate DATETIME
)

INSERT INTO MyCustomers
VALUES      (1, 'Consolidate Co Ltd', 'Miami', 'FL','2013-01-01'),
(2, 'Consolidation Company Ltd', 'New York', 'NY','2013-01-01'),
(3, N'什锦的件', 'LA', 'CA','2013-01-01'),
(4, 'Chop-suey Chinese', 'Los Angeles', 'CA', '2013-03-03'),
(5, 'Big Cheese, The', 'Redmond', 'WA', '2013-02-02'),
(6, 'THE BIG CHEESE', 'Chicago', 'Il','2013-02-02'),
(7, 'To Be Filled Later', 'Redmond', 'Wash.', '2013-01-01')



A. Create DQS Knowledge Base



  1. Start Data Quality Client. For information about this step, refer to Run the Data Quality Client Application.

  2. In the Data Quality Client home screen, under Knowledge Base Management, click New Knowledge Base.

  3. Name your new Knowledge Base (e.g. MyCustomerKB), make sure Domain Management activity is selected and click Next




  1. Click Create Domain Icon.




  1. From the pop-up window, name your domain (e.g. CustomerName). As an option, you can select to change the format of the output to Capitalize.




  1. Repeat the steps and create domains for City and State.

  2. Select City domain and click on domain values tab. In this screen, you can enter all the list of valid and invalid values for the domain.


  • Click on Add new domain value icon. Enter a correct value (e.g. Los Angeles) then press enter.

  • Click on Add new domain value icon. Enter an invalid value for the city (e.g. United States), click the down arrow in the Type column and select the invalid type (yellow triangle) then press enter. (There are 3 supported type: correct, invalid, and error; invalid suggests that the value is not valid for the given domain, but may be valid in a different domain. For example, United States is not a valid value for city domain but is a valid value for country domain; error suggests that the value is globally incorrect). For invalid or error value, you can optionally specify correct to value for DQS to automatically correct. Otherwise, records with invalid or error value will be flag as invalid during cleansing.

  • Click on Add new domain value icon. Enter a synonym value for the city (e.g. LA) then press enter. Highlight both "LA" and "Los Angeles" values, right click then select Set as Synonyms.




Note: by default, DQS include DQS_NULL as valid value for the domain, you can change the type to invalid if you would like the record with missing value to be flagged




    • Update the leading value by right clicking Los Angeles and select Set as Leading during cleansing.






    • The final city domain values should look as follows:





  1. Select State domain and click on domain values tab. In this step, we will import the values from a spreadsheet.


  • Obtain the list of US states, its standard abbreviation and postal abbreviation from here . Copy the first 3 columns into excel and save as csv file.

  • Click the down arrow next to import values icon on the state domain management and select import valid values from excel. From the pop-up window, browse to the location of the csv file (make sure to select file type as csv) and update Use First row as header checkbox according to your file (if you do not include header when you prepare the file, then you can leave the default option unchecked.






    • You should see the State domain values screen populated as below - notice that when you import list of values with more than one column, DQS automatically set the first column as leading value and the subsequent column value as its synonyms.





  1. Select CustomerName domain and click on Term-Based Relations tab. DQS allows you to define terms within your domain value and standardize them into standard term. For example, business name often include abbreviation such as "corp", etc. When there are variation for the use of the terms (e.g. one record may use abbreviation such as "Microsoft Corp" while another record may have values for "Microsoft Corporation"), you can use DQS to standardize the use of terms within your domain values. In this step, we will define 2 term based relations for our CustomerName domain:


  • Click Add new relation icon

  • Add "co" as value and "Company" as correct to. Press enter to continue.

  • Add "ltd" as value and "Limited" as correct to.

  • You should see the CompanyName Term-Based Relations screen to be populated as below. Click finish to continue.





  1. Click Publish Knowledge Base button to continue.



Creating your own DQS Knowledge Base sometimes requires a lot of effort. For things such as address cleansing, phone number cleansing, creating your own complete list of all valid and invalid values can be a huge effort. DQS supports integration with third party service provider to cleanse your data through DQS. Refer to my blog article on how to cleanse Customer Data using Dun & Bradstreet for more information.


B. Create a DQS project to cleanse your data using the Knowledge Base



  1. In the Data Quality Client home screen, under Data Quality Projects, click New Data Quality Project.

  2. Name your new Project (e.g. MyCustomer Cleansing Project), make sure you select Knowledge Base created in the previous step (e.g. MyCustomerKB) then click Next to continue.

  3. In the Map screen:


  • Select data source, database, and table for your data (e.g. MyCustomers table and data generated from the script at the beginning of this article)

  • Select columns to be cleansed and map to the domains in your knowledge base

  • You should have your datasource columns mapped as below. Click Next to continue.





  1. In the Cleanse screen, click Start to begin the process. Click Next when the process finishes.

  2. In the Manage and View results screen:


  • Select City domain and click Corrected tab. You should see "LA" corrected to "Los Angeles"






  • Select City domain and click New tab. You should see the list of city not defined in the knowledge base domain value definition listed here. Click Approve all terms icon to continue (Refer to this msdn article on how do import all approved project values into DQS knowledge base). The approved values are now appear under Correct tab.






  • Select CustomerName domain and click New tab. You should see Customer Names are updated to use standardized terms. Click Next to continue.





  1. In the Manage and View results screen, you can export the results to SQL Server table or Excel.



In this blog article, I discussed how to create a Knowledge Base to cleanse customer data, including use of domain values and term based relations. You can then refer to the Knowledge Base to create Data Quality Projects to cleanse your data. You can use the same knowledge base to perform cleansing on many Data Quality Projects. You can also automate the cleansing using SQL Server 2012 Integration Services. Matt Mason wrote a nice article : Overview of DQS Transform that describes the SSIS DQS Cleansing transform.


You may also notice that there appears to be duplicate records in the sample data I used. In the next article , I will describe on how to enhance your knowledge base by adding matching policy and identify duplicate and related records in your dataset.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.