How to apply FUZZY algorithm in SSIS?

Published:
My client has a dictionary table. They're defining a list of standard naming convention. Now, they are requiring my team to provide us a mechanism how to match new incoming data with existing data in their system.

Scenario

My client has a dictionary table that they're defining list of standard TV Program name. Currently, their data team must compare and match new Program with existing standard Program manually. It takes much time and efforts of the team. So they want us to provide a mechanism to help us reduce this work.

For example: there is a standard Program named it as "The Simpsons". New raw data are coming to their system and they have three versions of the program name for "The Simpsons" like

1. "Simpsons, the"
2. "The Simpsons"
3. "Simpsons"
We need to compare three versions with the standard Program, which was defined before, and then determine what is matched, partial matched or un-matched?

Solutions

In SSIS, Microsoft provides an useful component. There is Fuzzy Lookup Transformation component. The component's mechanism is to use Fuzzy matching and compare set of input data with set of reference table. Basically,  it returns one or more close matches in the reference table.

This requirement is reaching out concepts of FUZZY logic. In scope of this article, I don't intend to explain what FUZZY logic is, but FUZZY has been being applied in our life widely. Many algorithms are been developing based on this concept. One of them is approximate string matching.

Basically, it returns similarity rate of two strings when we use the approximate string matching algorithms. You need to remember that it is the technique of finding strings that match a pattern approximately (rather than exactly). It does not work like some comparison operators such as LIKE, CONTAINS,.. only return TRUE / FALSE values.

Step 1 - Create dictionary table

This table contains list of standard Program name data. Insert The Simpsons as example value.

``````CREATE TABLE ProgramReference
(
ProgramID int,
ProgramName nvarchar(100)
)
GO
INSERT INTO ProgramReference
VALUES(1,'The Simpsons')
``````

Step 2 - Create example raw file

This file contains new Program data what sent by third party partner. You should save it in Unicode UTF-8 encoding.

``````ProgramName
"Simpsons, the"
"The Simpsons"
"Simpsons"
"Sianame"
``````

Step 3 - Create destination table

This table contains output result after applying FUZZY lookup transformation in SSIS

``````CREATE TABLE Program_FuzzyLookup
(
ProgramID int IDENTITY(1,1),
ProgramSource nvarchar(100),
ProgramRefernce nvarchar(100),
Result nvarchar(30), -- Matched Partial Matched Un-matched
Similarity real,
Confidence real
)
``````

Step 4 - Create SSIS loader

Create new project and named it as FuzzyLookup. Then create new package ProgramFuzzyLookup.dtsx

4.1 Create new variables

I need two variables
FileConnectionString: path of raw file
Value: D:\SSIS-Practice\FuzzyLookup\ProgramData.txt
OLEDBConnectionString: connection string to SQL Server Database
Value: Value: Data Source=HSSSC1PCL01198\SQLSERVER2014; Initial Catalog=Demo;Provider=SQLNCLI11.1;Integrated Security=SSPI;
Data Source: database server name
Initial Catalog: database name
Provider: the driver to connect to SQL Database Server. You need to change the value to meet your environment. I use SQLNCLI11.1 because SQL Server Engine is 2014. If your version is lower than SQL 2012 version, it should be 10.0 / 10.1
Integrated Security: means that I use Window Authentication mode while connection to SQL Database Server.

4.2 Create connection configurations

We need to create OLEDBConnection and FileConnection configurations. SSIS use these configurations to connect to data source and database server.
Create OLEDB Connection and assign OLEDBConnectionString variable to ConnectionString property.
Create Flatfile Connection , right-click on Connection Manager area and select Create Flat File Connection
Connection Manager name: FileConnection
In General tab:

File name: D:\SSIS-Practice\FuzzyLookup\ProgramData.txt
Locale: English (United States)
Code Page: 1252 (ANSI - Latin I)
Format: Delimited
Text Qualifier: "
Column names in first data row:  checked
In Column tab:

Row delimiter: {CR}{LF}
Column delimiter: Vertical Bar {|}
Assign FileConnectionString variable to Connection String property.

4.3 Design work-flow

Drag and drop Execute SQL Task component into work-flow area to truncate table Program_FuzzyLookup. Then continue drag and drop Data Flow Task component.

4.4 Design data-flow

Drag and drop Flat File Source component.

Drag and drop Fuzzy Lookup and double click to open Fuzzy Lookup Transformation Editor
In Reference Table tab:

• OLE DB connection manager: OLEDBConnectionString
• Check Generate new index
• Reference table name: ProgramReference
• Store new index and Use existing index:  if reference table is static, you can leave those options. Otherwise, you should consider to configure them for optimization when we have large data.
In Columns tab
• Select ProgramName on both Available Input Columns and Available Lookup Columns. Fuzzy Lookup compares ProgramName of raw file with ProgramName of ProgramReference table and return similarity score and confidence score.
• Output Alias: rename to ProgramNameReference. We need this output column to insert it into the final table Program_FuzzyLookup
• Maximum number of matches per lookup: 1. This value indicates that maximum matches we want the transformation to return. The transformation returns zero or more matches up to the number of matches specified. Specifying a maximum number of matches does not guarantee that the transformation returns the maximum number of matches; it only guarantees that the transformation returns at most that number of matches. If you set the maximum number of matches to a value greater than 1, the output of the transformation may include more than one row per lookup and some of the rows may be duplicates.
• Similarity threshold: 0.00 by default. The transformation returns a similarity score per input. If it is larger than Similarity threshold, it returns the result.
• Token delimiters: tokenize the data, it defines the units within the data that are compared to each other.

The output result of this task returns two columns _Similarity and _Confidence. They are very important, they help us determine what data is matched, partial matched or un-matched.
• _Similarity: describes the similarity between values in the input and reference columns. It means that how much percentage of similarity
• _Confidence: describes the quality of the match.

Drag and drop Condition Split component and use below logic:
If _Similarity >=0.9 and _Confidence > =0.85 Then Matched
If _Similarity >=0.7 and _Confidence > =0.65 Then Partial Matched
Else Un-Matched
Drag and drop Derived Column component to add new column and name it as Result.
Double click on Derived Column to add new column Result and enter "Matched" value
We repeat the same steps for Partial Matched and Un-matched.
Now, we combine all data flow by using Union All component.

Then drag and drop OLE DB Destination
In Connection Manger tab:
• Connection manager: OLEDBConnectionString
• Use a table or view: Program_FuzzyLookup
In Mapping tab: map as below
Finally, we have the data flow as below

Step 5 - Execute package

Press F5 to execute package. And you can see that there are two matched record, one partial matched record and one un-matched.
Query data from Program_FuzzyLookup table

As I said above, Similarity and Confidence are very important. Data team can base on them to make decision. You can see line number 2, comparison between "Simpsons, the" and "The Simpsons" having Similarity = 0,98 and Confidence = 0.9875. Obviously, we can understand they are the same. However, if you use comparison operators such as LIKE, CONTAINS, SUBSTRING,... I'm sure that you can not get the result that those two strings are similar.

Conclusions

By using Fuzzy Lookup Transformation, I have a flexible approach to help data team reduce their work-load significantly. We have the mechanism to map new incoming data with standard data automatically instead data team must do this. In near future, we will try to build a process, which can run fuzzy matching and learn from standard dictionary table to solve special cases.  Obviously, there are still several cases but not much, FUZZY lookup returns unexpected results and data team might verify them manually. Nothing is perfect.

One of critical issues is performance. If your reference table is large and change data frequently, you should  consider the options which I talked above. Using index, caching mechanism... I did a quick testing, my reference table had 125,000 rows and input data around 7,000 rows. The execution time of the package was running 2 minutes completely. For understanding Fuzzy Lookup Transformation click here.

2
5,548 Views

IT Engineer
CERTIFIED EXPERT
Distinguished Expert 2017

Commented:
Very nice