ADFB

How to Use C5.0 Classifier to Categorize Text Files

I have a whole bunch of text files organized into categories. I want to use the C5.0 classifier on them to learn the classification so that it can be applied to new data.

How would I do this? I don't understand how to use C5.0.

http://www.rulequest.com/see5-info.html
TommySzalapski

What have you done so far? Have you downloaded the source code for Linux or Windows? Have you looked at the sample programs or read the tutorials?
ADFB

ASKER

I have attempted to look at the samples and tutorials, but they already assume you know what you're doing. I admit that I don't understand them.

I was previously using MALLET, another classifier toolkit (it supports C4.5, but that appears to be broken). It has a lot of very technical docs and unfriendly tutorials, and after days I finally figured out that it takes just one simple command to import the data, one to build the classifier, and one to apply the classifier to new data.

So I was hoping C5.0 would be similar.

I'll be running it on a Linux server.
ASKER CERTIFIED SOLUTION
TommySzalapski

This solution is only available to members.
ADFB

ASKER

OK, thanks. That's a start. Yes, I mean the GPL code, though I thought it was the same except that it supports only single-threading?

I'm just trying to use it for the simple purpose of categorizing text files.

So if I understand correctly, to build a classifier I use:
c5.0 -f mydata > mydata.output

And the app will look for trainingdata.names & trainingdata.data and build a classifier from this saving it to classifier.output? Or does it save the classifier somewhere hidden? (It seems to suggest that.)

Then to apply this to new data, I use:
predict -f mydata

So where is the new data to classify coming from if I haven't input it? Also it says that predict is an "interactive interpreter", whereas I just want it to give me the percentage match for each category without any frills (just the default settings are fine with me).

Also I have no idea how to convert a series of txt files into the names and data format... is this a standard format? Is there some kind of converter from txt files to the C5.0 format?

C4.5 can be used for text categorization... is C5.0 not made for this, or am I just confused?
ADFB

ASKER

Sorry, I meant:
And the app will look for mydata.names & mydata.data and build a classifier from this saving it to mydata.output? Or does it save the classifier somewhere hidden? (It seems to suggest that.)

Well, I think See5 also includes a GUI where you can click buttons instead of typing shell commands.

The first part of the tutorial shows how to format the names file. That file just tells C5.0 what type of data to expect in each column:
http://storm.cis.fordham.edu/~gweiss/dmrg/c5-tutorial.html

The redirected output is just the report on the classifier. The classifier itself goes into the mydata.tree file (decision tree) or the mydata.rules file (ruleset). It looks like mydata.out also contains the report. All the files involved are listed in the tutorial linked above, right before the specification of the names file format.
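
To make the names file format a bit more concrete, here is a minimal Python sketch that builds the text of a .names file and one line of a .data file. It assumes the layout where the first entry lists the classes, each attribute is either "continuous" or a list of discrete values, and the class label comes last on each data line; check the tutorial for the exact rules. The function names and the spam/ham example are made up for illustration, and the sketch does not escape special characters such as commas or colons inside names.

```python
def names_file_text(classes, attributes):
    """Build the text of a .names file.

    classes:    list of class labels, e.g. ["spam", "ham"]
    attributes: list of (name, spec) pairs, where spec is the string
                "continuous" or a list of allowed discrete values
    Assumption: names contain no commas, colons, or periods.
    """
    lines = [", ".join(classes) + ".", ""]  # first entry: the classes
    for name, spec in attributes:
        if spec == "continuous":
            lines.append(f"{name}: continuous.")
        else:
            lines.append(f"{name}: {', '.join(spec)}.")
    return "\n".join(lines) + "\n"


def data_line(values, label):
    """Build one comma-separated case, with the class label last."""
    return ",".join(str(v) for v in values) + "," + label


if __name__ == "__main__":
    print(names_file_text(
        ["spam", "ham"],
        [("length", "continuous"), ("has_link", ["yes", "no"])],
    ))
    print(data_line([120, "yes"], "spam"))
```

Saving the first string as mydata.names and the case lines as mydata.data gives the pair of files that c5.0 -f mydata looks for.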
ADFB

ASKER

OK, thanks. Hmmmm..

It has to be command line, since I'm doing this on a remote server.

How about how to tell "predict" what unlabeled data needs to be classified?

And there must be a standard method for converting a series of text files into the names & data format required by C5.0. I have no idea how that would be done but I expect many people would require it. Do you know anything about that?

Converting the data should be easy. Just open your text file in Excel (or OpenOffice) and save it as a .csv (comma separated). Then change the extension to .data and you should be good to go.

You could write scripts to build the .names file (and some may already exist), but you'd probably be better off just writing it by hand. If you need to extract the possible values for some text fields, open the file back up in Excel and remove duplicates in each column; what remains is the list of values.
If there are a lot, you could select them, hit copy, hit paste special, select transpose, and save that as a .csv to get the commas.
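
If the input is a folder of free-text documents rather than a table, a spreadsheet export may not be enough, because each case in the .data file has to be a fixed list of attribute values. One common workaround, which is an assumption here and not something from the C5.0 docs, is to turn each document into word counts. The Python sketch below does that. The folder layout (one subfolder per category, holding .txt files), the "w_" attribute prefix, the vocabulary size, and all function names are hypothetical, and category folder names are assumed to contain no commas, colons, or periods.

```python
import re
from collections import Counter
from pathlib import Path

TOKEN = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and keep runs of the letters a-z as words."""
    return TOKEN.findall(text.lower())


def build_vocabulary(texts, size=500):
    """Pick the `size` words that occur in the most documents."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(tokenize(text)))
    # Sort by document frequency, then alphabetically, so ties are stable.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:size]]


def to_c50(labelled_texts, size=500):
    """Convert (category, text) pairs into (.names text, .data text)."""
    categories = sorted({category for category, _ in labelled_texts})
    vocab = build_vocabulary([text for _, text in labelled_texts], size)
    # Classes first, then one continuous attribute per vocabulary word.
    names = [", ".join(categories) + ".", ""]
    names += [f"w_{word}: continuous." for word in vocab]
    rows = []
    for category, text in labelled_texts:
        counts = Counter(tokenize(text))
        # One case per document: word counts, then the category last.
        rows.append(",".join([str(counts[w]) for w in vocab] + [category]))
    return "\n".join(names) + "\n", "\n".join(rows) + "\n"


def convert_directory(root, stem="mydata", size=500):
    """Read root/<category>/*.txt and write <stem>.names and <stem>.data."""
    labelled = [
        (path.parent.name, path.read_text(errors="ignore"))
        for path in sorted(Path(root).glob("*/*.txt"))
    ]
    names_text, data_text = to_c50(labelled, size)
    Path(stem + ".names").write_text(names_text)
    Path(stem + ".data").write_text(data_text)
```

Calling convert_directory("corpus") on such a folder would leave mydata.names and mydata.data in the current directory. New, unlabeled documents would need to be counted against the same vocabulary so that the columns line up.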
ADFB

ASKER

OK. I think that's enough information to get me going. Thanks!