How to Use C5.0 Classifier to Categorize Text Files

I have a whole bunch of text files organized into categories. I want to use the C5.0 classifier on them to learn the classification so that it can be applied to new data.

How would I do this? I don't understand how to use C5.0.
I'm assuming you have the GPL code, not the full See5? Poor documentation is one of the few real downsides of open source code.

See if this helps. It looks pretty simple.
Just skip the stuff about logging into storm.
You may have to build the C5.0 executable first.
What have you done so far? Have you downloaded the source code for Linux or Windows? Have you looked at the sample programs or read the tutorials?
ADFBAuthor Commented:
I have attempted to look at the samples and tutorials, but they already assume you know what you're doing. I admit that I don't understand them.

I was previously using MALLET, another classifier (it supports C4.5, but that appears to be broken). It has a lot of very technical docs and unfriendly tutorials, and after days I finally figured out that it takes just one simple command to import the data, one simple command to build the classifier, and one simple command to apply the classifier to new data.

So I was hoping C5.0 would be similar.

I'll be running it on a Linux server.
ADFBAuthor Commented:
OK, thanks. That's a start. Yes, I mean the GPL code, though I thought it was the same as See5 except that it is single-threaded only?

I'm just trying to use it for the simple purpose of categorizing text files.

So if I understand correctly, to build a classifier I use:
c5.0  -f  mydata > mydata.output

And the app will look for trainingdata.names and build a classifier from it, saving it to classifier.output? Or does it save the classifier somewhere hidden? (It seems to suggest that.)

Then to apply this to new data, I use:
predict -f mydata

So where is the new data to classify coming from if I haven't input it? Also it says that predict is an "interactive interpreter", whereas I just want it to give me the percentage match for each category without any frills (just the default settings are fine with me).

Also I have no idea how to convert a series of txt files into the names and data format... is this a standard format? Is there some kind of converter from txt files to the C5.0 format?

C4.5 can be used for text categorization... is C5.0 not made for this, or am I just confused?
ADFBAuthor Commented:
Sorry, I meant:
And the app will look for mydata.names and build a classifier from it, saving it to mydata.output? Or does it save the classifier somewhere hidden? (It seems to suggest that.)
Well, I think See5 also includes a GUI where you can click buttons instead of typing shell commands.

The first little part of the tutorial shows how to format the names file. It just tells C5.0 what type of data to expect in each column.

The output is just the results of the classifier. The actual classification rules go into mydata.tree and mydata.rules files. It looks like mydata.out also contains the report. All the files involved in the process are listed in the above link right before the specification of the names file format.
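To make the names file a bit more concrete, here is a minimal Python sketch that writes one out. It assumes the layout shown in the See5/C5.0 tutorial: a first entry naming the target attribute, then one "name: type." entry per attribute, where the type is either "continuous" or a comma-separated list of allowed values. The function name format_names and the attribute names in the example are made up for illustration, and the sketch does not escape special characters, so names and values should be plain words.

```python
def format_names(target, attributes):
    """Return the text of a .names file.

    target     -- name of the attribute to predict (assumed to be listed
                  in `attributes` as well)
    attributes -- dict mapping attribute name to either the string
                  "continuous" or a list of allowed discrete values
    """
    lines = [f"{target}.", ""]
    for name, spec in attributes.items():
        # A string is used as the type keyword; a list becomes "a, b, c".
        value = spec if isinstance(spec, str) else ", ".join(spec)
        lines.append(f"{name}: {value}.")
    return "\n".join(lines) + "\n"


# Hypothetical example: two numeric columns plus the category to learn.
names_text = format_names(
    "category",
    {
        "word_price": "continuous",
        "word_goal": "continuous",
        "category": ["business", "sport"],
    },
)
print(names_text)
```

Writing names_text to mydata.names would give C5.0 the column descriptions it looks for when run with -f mydata. The columns in mydata.data would then have to appear in the same order as the attributes listed here.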
ADFBAuthor Commented:
OK, thanks. Hmmmm..

It has to be command line, since I'm doing this on a remote server.

How about how to tell "predict" what unlabeled data needs to be classified?

And there must be a standard method for converting a series of text files into the names & data format required by C5.0. I have no idea how that would be done but I expect many people would require it. Do you know anything about that?
Converting the data should be easy. Just open your text file in Excel (or OpenOffice) and save it as a .csv (comma separated). Then change the extension to .data and you should be good to go.

You could write scripts to build the .names file (and some may already exist), but you'd probably be better off just writing it by hand. If you need to extract the possible values for some text fields, open the file back up in Excel and use Remove Duplicates on each column, and you should have the values.
If there are a lot of values, you could select them, hit Copy, hit Paste Special, select Transpose, and save that as a .csv to get the commas.
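If the text files are free-form documents rather than tables, the spreadsheet route does not apply directly. One common alternative is to turn each file into a row of word counts (a "bag of words"). The following is a rough Python sketch of that idea, not something taken from the C5.0 documentation. It assumes one subdirectory per category containing .txt files. The helper names load_docs and build_rows and the directory layout are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path


def load_docs(root):
    """Read root/<category>/*.txt into a list of (category, text) pairs.

    Assumes directory names are plain words usable as class names.
    """
    docs = []
    for cat_dir in sorted(Path(root).iterdir()):
        if cat_dir.is_dir():
            for path in sorted(cat_dir.glob("*.txt")):
                docs.append((cat_dir.name, path.read_text(errors="ignore")))
    return docs


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_rows(docs, vocab_size=1000):
    """Return (vocab, rows): the most frequent words, and one
    comma-separated line per document with its count for each
    vocabulary word followed by the category."""
    counted = [(cat, Counter(tokenize(text))) for cat, text in docs]
    totals = Counter()
    for _, counts in counted:
        totals.update(counts)
    # Most frequent first; ties broken alphabetically so output is stable.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [word for word, _ in ranked[:vocab_size]]
    rows = [
        ",".join([str(counts[word]) for word in vocab] + [cat])
        for cat, counts in counted
    ]
    return vocab, rows


# Small in-memory example instead of reading from disk.
docs = [
    ("sport", "The goal won the match"),
    ("business", "The price of the stock rose"),
]
vocab, rows = build_rows(docs, vocab_size=3)
print(vocab)
print("\n".join(rows))
```

Each line in rows would go into mydata.data. The .names file would then need one continuous attribute per word in vocab, in the same order, followed by the category attribute. Raw counts and a fixed vocabulary size are simplifications; dropping very common words or using presence/absence instead of counts may work better for real data.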
ADFBAuthor Commented:
OK. I think that's enough information to get me going. Thanks!
Question has a verified solution.