Analytics as a field grows in relevance and importance with each passing day. Creating and leveraging knowledge from raw information is one of the most popular fields of study today. While analytics involves studying information broadly, data mining is the process of extracting that information and finding patterns that might not be obvious at first glance.
Data Mining can be useful in a number of different applications, but the focus of this article is on classification.
Things you will need:
Microsoft SQL Server 2008
Microsoft Visual Studio 2005 or later
So what kind of classification are we talking about? This could be the classification of documents at a library based on their content, or the classification of items being procured by a company based on their descriptions.
Performing any kind of classification follows the general steps below:
a) Select a mining algorithm
b) Select an appropriate structure for your model
c) Create the Mining Structure
d) Create the Mining Model
e) Populate the model/train the model
f) Test the model
Before we proceed, let's have a look at the information available to us. Assume we are classifying items based on their descriptions.
Consider two tables:
1) [Item] - Contains the columns [Item_Id], [Item_Description], and so on.
2) [Item_Terms] - Contains the columns [Term] and [Item_Id]
The [Item] table is fairly straightforward. Let's say we have about 5,000 items here, each belonging to one of twelve classes.
[Item_Terms], however, takes a little explaining. Every entry for [Term] in the [Item_Terms] table is one word of the [Item_Description] corresponding to a particular [Item_Id]. This means that if the [Item] table had an entry with [Item_Id] 1001 and [Item_Description] "Electric hand held drill", the term table would have four entries where [Term] takes the values "Electric", "hand", "held" and "drill". Each of these entries would have the same [Item_Id] as the entry in the [Item] table.
A simple program to split the description and insert the words into the [Item_Terms] table should be enough to generate this; writing it is not covered in depth in this article. The idea is to tell our model: "These terms describe an item that belongs to class X. So if you see these terms again, there's a decent chance the item also belongs to this class."
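As a rough sketch of what such a splitter might look like (plain Python; the function name and punctuation handling are my own assumptions, not part of the article's setup):

```python
def split_description(item_id, description):
    """Split an item description into (Term, Item_Id) rows for [Item_Terms].

    Hypothetical helper: lowercasing keeps 'Electric' and 'electric' from
    becoming distinct terms, and duplicates are dropped because [Term]
    acts as a key within each item.
    """
    seen = set()
    rows = []
    for word in description.lower().split():
        term = word.strip(".,;:()")  # drop stray punctuation
        if term and term not in seen:
            seen.add(term)
            rows.append((term, item_id))
    return rows

# Example: item 1001, "Electric hand held drill" yields one row per word,
# each tagged with the item's id, ready to insert into [Item_Terms].
rows = split_description(1001, "Electric hand held drill")
# rows == [('electric', 1001), ('hand', 1001), ('held', 1001), ('drill', 1001)]
```

Each returned tuple would then be inserted as one row of [Item_Terms].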
The phrase "decent chance" is used because the same term might appear in different classes - for example, "book" might appear under a "Furniture" class if the item was "book shelf" or under "Paper Products" if it was from "note book".
Now that we know the data we are working with, let's see how we can apply the steps a-f listed above.
a) Selecting a mining algorithm:
It's hard to label any one step as "the" critical step, but this one would be high on the list. SQL Server 2008 offers a decent variety of algorithms to choose from: Microsoft Naive Bayes, Microsoft Decision Trees, Microsoft Logistic Regression and Microsoft Neural Network. Their pros and cons are beyond the scope of this article, but it's worth noting that Microsoft Naive Bayes is the fastest though not the most accurate (still good enough for this example), while Microsoft Neural Network is probably the most processor-intensive; it is more accurate, but quite slow.
We will work with the Microsoft Naive Bayes algorithm which, as mentioned, is the simplest and fastest way to get a basic model up and running.
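To build some intuition for what Naive Bayes does with our terms, here is a toy illustration in Python (this is not what SSAS runs internally, and the classes and items are made up for the example):

```python
from collections import Counter, defaultdict
from math import log

# Toy training data: (class, terms) pairs, standing in for the
# [Item_Class] / [Item_Terms] rows described above.
training = [
    ("Tools", ["electric", "hand", "held", "drill"]),
    ("Tools", ["cordless", "drill", "kit"]),
    ("Furniture", ["book", "shelf", "oak"]),
    ("Paper Products", ["note", "book", "ruled"]),
]

class_counts = Counter(cls for cls, _ in training)
term_counts = defaultdict(Counter)
vocab = set()
for cls, terms in training:
    term_counts[cls].update(terms)
    vocab.update(terms)

def predict(terms):
    """Pick the class maximizing log P(class) + sum of log P(term|class),
    with add-one (Laplace) smoothing so unseen terms don't zero out a class."""
    best_cls, best_score = None, float("-inf")
    for cls in class_counts:
        score = log(class_counts[cls] / len(training))
        total = sum(term_counts[cls].values())
        for t in terms:
            score += log((term_counts[cls][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

print(predict(["cordless", "drill"]))  # prints "Tools" on this toy data
```

Note how "book" contributes probability to both "Furniture" and "Paper Products" here, which is exactly the "decent chance" behaviour described earlier.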
b) Selecting an appropriate structure for the mining model:
You need to decide how you will store information in your model. This is information that will be used to 'train' the model.
First of all, we know that our model needs data which is already classified. This is called the training data. In this case, our training data should consist of an [Item_Id], an [Item_Class], as well as the different terms belonging to each item.
Since we know we need the [Item_Id] and [Item_Class] let's fix this as our current table structure.
Now, we need to map each set of terms to the item it belongs to. Since our current structure has an [Item_Id], and every entry in the term table also has an [Item_Id], we can think of every row in our current structure as containing a nested table of terms belonging to that particular [Item_Id].
In essence, our final structure consists of two tables, one nested in the other: an outer table containing [Item_Id] and [Item_Class], where each row also holds a nested term table containing [Term] and [Item_Id]. The [Item_Id] of the inner and outer tables is the same for a particular row.
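As a rough mental model of one training case in this structure (plain Python, not DMX; the class name is a made-up example):

```python
# One training case: the outer row carries the id and class, and nests
# the term rows that share its [Item_Id].
case = {
    "Item_Id": 1001,
    "Item_Class": "Tools",  # hypothetical class for illustration
    "Term_Table": [
        {"Term": "Electric", "Item_Id": 1001},
        {"Term": "hand",     "Item_Id": 1001},
        {"Term": "held",     "Item_Id": 1001},
        {"Term": "drill",    "Item_Id": 1001},
    ],
}

# The inner rows' Item_Id always matches the outer row's.
assert all(row["Item_Id"] == case["Item_Id"] for row in case["Term_Table"])
```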
A few things before proceeding further:
1) I assume that you know how to create a data source and a data source view in your Analysis Server and have done so.
2) Substitute the name of your data source where mentioned.
c) Create the mining structure
A mining structure is like a package. It can contain many models, or just one model as desired. For example, I only need to populate my mining structure with training data once, and I can then create ten models inside it with each using a different algorithm.
Hence, we create the mining structure first, and then a mining model within this mining structure.
Create a new DMX query in your SQL Server Analysis Services window in SSMS and create the mining structure as shown:
CREATE MINING STRUCTURE [ItemClassification](
    [Item_Id] LONG KEY,
    [Item_Class] TEXT DISCRETE,
    [Term_Table] TABLE(
        [Term] TEXT KEY
    )
)
Notice that we've defined [Term_Table] as a nested table within every row of the outer structure - exactly the shape we settled on when we chose our structure earlier.
d) Creating the mining model:
Once we have created the mining structure, it's time to define our mining model as part of the mining structure.
ALTER MINING STRUCTURE [ItemClassification]
ADD MINING MODEL [Item_Classification_NaiveBayes](
    [Item_Id],
    [Item_Class] PREDICT,
    [Term_Table](
        [Term]
    )
) USING Microsoft_Naive_Bayes
Notice that we've added the keyword 'PREDICT' to [Item_Class], which marks it as the attribute we're going to try and predict based on the input. Our inputs, in this case, are the terms.
e) Train the model:
Now that we've structured our model, we can go ahead and 'train' it. Training involves 'feeding' the model existing, known information so that the model can 'learn' how terms relate to classes. We will use the 5,000-odd items currently in our [Item] table to train this model.
The model can be trained using the code below:
INSERT INTO [Item_Classification_NaiveBayes]
(
    [Item_Id], [Item_Class],
    [Term_Table](SKIP, [Term])
)
SHAPE
{
    OPENQUERY([DataSourceName], // Your data source here
        'SELECT [Item_Id], [Item_Class] FROM [Item] ORDER BY [Item_Id]')
}
APPEND
(
    {
        OPENQUERY([DataSourceName], // Your data source here
            'SELECT [Item_Id], [Term] FROM [Item_Terms] ORDER BY [Item_Id]')
    }
    RELATE [Item_Id] TO [Item_Id]
)
AS [Term_Table]
Let's clarify some things here. The 'SKIP' keyword marks source columns the model doesn't use directly - here, the nested query's [Item_Id], which is only needed to relate the two tables.
Also, as you may notice, the 'SHAPE' statement obtains data for the outer table and the 'APPEND' statement obtains the data for the inner/nested table (the term table).
An important thing to note here is that both SELECT queries have an ORDER BY clause. This is compulsory when relating nested tables in a DMX query to train a model. Also, the 'RELATE' statement tells the model which column is common to the outer and nested tables.
And we're done!
f) Testing the model:
Now that we've trained the model, it's time to put it to the test! You will have figured out by now that we're going to try and predict the class of an item based on its description terms. Any query you make to test this model will take the following form:
SELECT Predict([Item_Class]) FROM [Item_Classification_NaiveBayes]
NATURAL PREDICTION JOIN
(SELECT
    (SELECT 'Term1' AS [Term] UNION
     SELECT 'Term2' AS [Term] UNION
     SELECT 'Term3' AS [Term]) AS [Term_Table]
) AS T
This query will return the predicted item class for the given terms. Note that the nested subquery is aliased [Term_Table] so that the NATURAL PREDICTION JOIN can match it to the model's nested table by name.
This completes the basic tutorial in creating a model for classification. You can always build on this using other algorithms, or changing the mining structure/model as you see fit.
Hope this article has been of some help. Until next time.