My client has a dictionary table. They're defining a list of standard naming convention. Now, they are requiring my team to provide us a mechanism how to match new incoming data with existing data in their system.
Fuzzy Lookup Transformation by Dung Dinh
Scenario
My client has a dictionary table that they're defining list of standard TV Program name. Currently, their data team must compare and match new Program with existing standard Program manually. It takes much time and efforts of the team. So they want us to provide a mechanism to help us reduce this work.
For example: there is a standard Program named it as "The Simpsons". New raw data are coming to their system and they have three versions of the program name for "The Simpsons" like
"Simpsons, the"
"The Simpsons"
"Simpsons"
We need to compare three versions with the standard Program, which was defined before, and then determine what is matched, partial matched or un-matched?
Solutions
In SSIS, Microsoft provides an useful component. There is Fuzzy Lookup Transformation component. The component's mechanism is to use Fuzzy matching and compare set of input data with set of reference table. Basically, it returns one or more close matches in the reference table.
This requirement is reaching out concepts of FUZZY logic. In scope of this article, I don't intend to explain what FUZZY logic is, but FUZZY has been being applied in our life widely. Many algorithms are been developing based on this concept. One of them is approximate string matching.
Basically, it returns similarity rate of two strings when we use the approximate string matching algorithms. You need to remember that it is the technique of finding strings that match a pattern approximately (rather than exactly). It does not work like some comparison operators such as LIKE, CONTAINS,.. only return TRUE / FALSE values.
Step 1 - Create dictionary table
This table contains list of standard Program name data. Insert The Simpsons as example value.
CREATE TABLE ProgramReference(ProgramID int,ProgramName nvarchar(100))GOINSERT INTO ProgramReferenceVALUES(1,'The Simpsons')
Create new project and named it as FuzzyLookup. Then create new package ProgramFuzzyLookup.dtsx
4.1 Create new variables
I need two variables
FileConnectionString: path of raw file
Value: D:\SSIS-Practice\FuzzyLookup\ProgramData.txt
OLEDBConnectionString: connection string to SQL Server Database
Value: Value: Data Source=HSSSC1PCL01198\SQLSERVER2014; Initial Catalog=Demo;Provider=SQLNCLI11.1;Integrated Security=SSPI;
Data Source: database server name
Initial Catalog: database name
Provider: the driver to connect to SQL Database Server. You need to change the value to meet your environment. I use SQLNCLI11.1 because SQL Server Engine is 2014. If your version is lower than SQL 2012 version, it should be 10.0 / 10.1
Integrated Security: means that I use Window Authentication mode while connection to SQL Database Server.
4.2 Create connection configurations
We need to create OLEDBConnection and FileConnection configurations. SSIS use these configurations to connect to data source and database server.
Create OLEDB Connection and assign OLEDBConnectionString variable to ConnectionString property.
Create Flatfile Connection , right-click on Connection Manager area and select Create Flat File Connection
Connection Manager name: FileConnection
In General tab:
File name: D:\SSIS-Practice\FuzzyLookup\ProgramData.txt
Locale: English (United States)
Code Page: 1252 (ANSI - Latin I)
Format: Delimited
Text Qualifier: "
Header row delimiter: {CR}{LF}
Column names in first data row: checked
In Column tab:
Row delimiter: {CR}{LF}
Column delimiter: Vertical Bar {|}
Assign FileConnectionString variable to Connection String property.
4.3 Design work-flow
Drag and drop Execute SQL Task component into work-flow area to truncate table Program_FuzzyLookup. Then continue drag and drop Data Flow Task component.
4.4 Design data-flow
Drag and drop Flat File Source component.
Drag and drop Fuzzy Lookup and double click to open Fuzzy Lookup Transformation Editor
In Reference Table tab:
OLE DB connection manager: OLEDBConnectionString
Check Generate new index
Reference table name: ProgramReference
Store new index and Use existing index: if reference table is static, you can leave those options. Otherwise, you should consider to configure them for optimization when we have large data.
In Columns tab
Select ProgramName on both Available Input Columns and Available Lookup Columns. Fuzzy Lookup compares ProgramName of raw file with ProgramName of ProgramReference table and return similarity score and confidence score.
Output Alias: rename to ProgramNameReference. We need this output column to insert it into the final table Program_FuzzyLookup
In Advance tab:
Maximum number of matches per lookup: 1. This value indicates that maximum matches we want the transformation to return. The transformation returns zero or more matches up to the number of matches specified. Specifying a maximum number of matches does not guarantee that the transformation returns the maximum number of matches; it only guarantees that the transformation returns at most that number of matches. If you set the maximum number of matches to a value greater than 1, the output of the transformation may include more than one row per lookup and some of the rows may be duplicates.
Similarity threshold: 0.00 by default. The transformation returns a similarity score per input. If it is larger than Similarity threshold, it returns the result.
Token delimiters: tokenize the data, it defines the units within the data that are compared to each other.
The output result of this task returns two columns _Similarity and _Confidence. They are very important, they help us determine what data is matched, partial matched or un-matched.
_Similarity: describes the similarity between values in the input and reference columns. It means that how much percentage of similarity
_Confidence: describes the quality of the match.
Drag and drop Condition Split component and use below logic:
If _Similarity >=0.9 and _Confidence > =0.85 Then Matched
If _Similarity >=0.7 and _Confidence > =0.65 Then Partial Matched
Else Un-Matched
Drag and drop Derived Column component to add new column and name it as Result.
Double click on Derived Column to add new column Result and enter "Matched" value
We repeat the same steps for Partial Matched and Un-matched.
Now, we combine all data flow by using Union All component.
Then drag and drop OLE DB Destination
In Connection Manger tab:
Connection manager: OLEDBConnectionString
Use a table or view: Program_FuzzyLookup
In Mapping tab: map as below Finally, we have the data flow as below
Step 5 - Execute package
Press F5 to execute package. And you can see that there are two matched record, one partial matched record and one un-matched.
Query data from Program_FuzzyLookup table
As I said above, Similarity and Confidence are very important. Data team can base on them to make decision. You can see line number 2, comparison between "Simpsons, the" and "The Simpsons" having Similarity = 0,98 and Confidence = 0.9875. Obviously, we can understand they are the same. However, if you use comparison operators such as LIKE, CONTAINS, SUBSTRING,... I'm sure that you can not get the result that those two strings are similar.
Conclusions
By using Fuzzy Lookup Transformation, I have a flexible approach to help data team reduce their work-load significantly. We have the mechanism to map new incoming data with standard data automatically instead data team must do this. In near future, we will try to build a process, which can run fuzzy matching and learn from standard dictionary table to solve special cases. Obviously, there are still several cases but not much, FUZZY lookup returns unexpected results and data team might verify them manually. Nothing is perfect.
One of critical issues is performance. If your reference table is large and change data frequently, you should consider the options which I talked above. Using index, caching mechanism... I did a quick testing, my reference table had 125,000 rows and input data around 7,000 rows. The execution time of the package was running 2 minutes completely. For understanding Fuzzy Lookup Transformation click here.
Comments (1)
Commented: