Bayesian filters: buckets and relevance

jhanna777 used Ask the Experts™
I'm testing for a condition that happens about 25% of the time in the wild. I have 26 factors which can easily be measured and are relevant to the condition. The 26 factors (Fa through Fz) are distributed over their respective ranges (Famin -> Famax, Fbmin -> Fbmax...). I partition these ranges into buckets (115 has experimentally proven to be an optimal number of buckets) and I measure the probability that each bucket results in the condition of interest. Then I sample Fa-z for an unknown sample and use Bayes' theorem to combine the probabilities for each bucket to forecast whether the unknown sample will have the condition of interest. I'm getting about 70% accuracy in my forecasts, but I'd like to improve that, if possible.
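For readers following along, the setup described above can be sketched roughly as follows. This is only an illustration, not the poster's actual program: it assumes equal-width buckets, assumes the factors are treated as independent (the usual "naive" Bayes simplification), and assumes every per-bucket probability lies strictly between 0 and 1. The names `bucket_index` and `combine_bucket_probabilities` are made up for this example.

```python
def bucket_index(value, lo, hi, n_buckets=115):
    """Map a measurement to an equal-width bucket number in [0, n_buckets - 1]."""
    if hi <= lo:
        return 0  # degenerate range: everything falls in one bucket
    idx = int((value - lo) / (hi - lo) * n_buckets)
    return min(max(idx, 0), n_buckets - 1)  # clamp values at or beyond the range ends


def combine_bucket_probabilities(bucket_probs, prior=0.25):
    """Combine per-factor P(condition | bucket) values with Bayes' theorem.

    Assumes the factors are independent given the outcome, and that each
    probability is strictly between 0 and 1 (e.g. smoothed counts).
    """
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for p in bucket_probs:
        # Each factor multiplies the odds by its likelihood ratio,
        # i.e. its bucket odds divided by the prior odds.
        odds *= (p / (1.0 - p)) / prior_odds
    return odds / (1.0 + odds)
```

With a 25% base rate, two factors whose buckets each show a 50% rate combine to about 0.75, while buckets that sit exactly at the base rate leave the forecast at 0.25.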

I have two questions.

First, currently I break every factor into 115 buckets. Is there a mathematical way to find the optimal number of buckets, possibly based on standard deviation or some other statistical analysis? I have the feeling that overall results would improve if Fa has a different number of buckets than Fb, etc.

Secondly, intuitively I expect that some factors have a higher correlation to the forecast than others, or (better yet) that if Fa is in some condition, then Fb is more relevant, but if Fa is in another condition Fb could be ignored entirely. How could one determine these relationships? How does Bayes' theorem allow for "relevance"? Is it done with linear coefficients or with exponential ones?


You asked many questions.

* Is there a mathematical way to find the optimal number of buckets?

Although you can certainly build one, in practice I almost always see that a test was run and a particular value turned out to be optimal. I recommend that you simply try different partition sizes on your samples and then run your test data through to see whether accuracy improves.

Automate your tests and then come back later and see what the numbers say.
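One way such an automated test could look is sketched below. It is a rough illustration under several assumptions: equal-width buckets, the same bucket count for every factor, a simple train/holdout split with no shuffling, labels given as 0 or 1, and smoothing of each bucket's rate toward the base rate so that empty buckets stay neutral. All function names are hypothetical.

```python
def bucket_index(value, lo, hi, n_buckets):
    """Equal-width bucket number for a value, clamped to the valid range."""
    if hi <= lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_buckets)
    return min(max(idx, 0), n_buckets - 1)


def train_tables(rows, labels, ranges, n_buckets):
    """Per factor and bucket, count [positives, total] over the training rows."""
    tables = [[[0, 0] for _ in range(n_buckets)] for _ in ranges]
    for row, label in zip(rows, labels):
        for f, (value, (lo, hi)) in enumerate(zip(row, ranges)):
            cell = tables[f][bucket_index(value, lo, hi, n_buckets)]
            cell[0] += label
            cell[1] += 1
    return tables


def predict(row, tables, ranges, n_buckets, prior):
    """Naive Bayes forecast: True when the combined odds exceed 1."""
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for f, (value, (lo, hi)) in enumerate(zip(row, ranges)):
        pos, total = tables[f][bucket_index(value, lo, hi, n_buckets)]
        # Smooth toward the prior so p stays strictly inside (0, 1)
        # and an empty bucket contributes a likelihood ratio of 1.
        p = (pos + prior) / (total + 1.0)
        odds *= (p / (1.0 - p)) / prior_odds
    return odds > 1.0


def holdout_accuracy(rows, labels, ranges, n_buckets, train_fraction=0.7):
    """Train on the first part of the data, score on the rest."""
    cut = int(len(rows) * train_fraction)
    prior = sum(labels[:cut]) / cut  # assumes both outcomes occur in training
    tables = train_tables(rows[:cut], labels[:cut], ranges, n_buckets)
    hits = sum(
        predict(row, tables, ranges, n_buckets, prior) == bool(label)
        for row, label in zip(rows[cut:], labels[cut:])
    )
    return hits / (len(rows) - cut)


def sweep_bucket_counts(rows, labels, ranges, candidates):
    """Holdout accuracy for each candidate bucket count."""
    return {n: holdout_accuracy(rows, labels, ranges, n) for n in candidates}
```

Calling `sweep_bucket_counts(rows, labels, ranges, [25, 50, 115, 200])` would return one accuracy per candidate, which can then be compared at leisure. Scoring on held-out rows rather than the training rows matters here, because a very large bucket count will otherwise look better than it is.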

* How can I determine which condition is the better predictor?

Same answer as above. Run the tests and see.

I do not know offhand how to set relevance....



That's what I've ended up doing. I wrote a program to do something like a binary search to optimise bucket size and a relevance exponent for each factor. I believe it increases accuracy, but it takes 8 hours to run, and I had some factors wrong last night, so I'll try again tonight.
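A per-factor relevance exponent of the kind mentioned above might be applied as in the sketch below. This is an assumption about how such an exponent could work, not the poster's actual code: each factor's likelihood ratio is raised to its exponent before being multiplied into the odds. An exponent of 1 gives the plain naive Bayes combination and an exponent of 0 ignores the factor entirely. Because a power on a ratio becomes a multiplier on its logarithm, the same exponent acts as a linear coefficient if the sum is done in log-odds instead. The function name is hypothetical.

```python
def combine_with_relevance(bucket_probs, exponents, prior=0.25):
    """Naive Bayes combination with one relevance exponent per factor.

    Assumes each probability is strictly between 0 and 1.
    """
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for p, w in zip(bucket_probs, exponents):
        likelihood_ratio = (p / (1.0 - p)) / prior_odds
        # Exponent on the ratio here; equivalently w * log(ratio) in log space.
        odds *= likelihood_ratio ** w
    return odds / (1.0 + odds)
```

For example, with two factors that each report 0.5, exponents of `[1, 1]` give about 0.75, `[1, 0]` gives about 0.5, and `[0, 0]` falls back to the 0.25 base rate.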

It's causing a new problem -- probabilities peg out at 1 or -1 and doubles don't have enough precision to differentiate results...

I hope that when it "pegs out at 1 or -1" it is doing the correct thing. I simply chop at 1 and -1....
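A common way around this kind of saturation is to keep the running total as log-odds and only convert to a probability at the very end, if at all. The sketch below assumes the pegged value is a probability-like score built from per-bucket probabilities strictly between 0 and 1; the function names are illustrative. Two samples whose probabilities both round to exactly 1.0 in a double still have clearly different log-odds, so they can be ranked or compared against a threshold without clamping.

```python
import math


def log_odds(bucket_probs, prior=0.25):
    """Sum of log likelihood ratios plus the prior log-odds."""
    prior_logit = math.log(prior / (1.0 - prior))
    total = prior_logit
    for p in bucket_probs:
        total += math.log(p / (1.0 - p)) - prior_logit
    return total


def to_probability(z):
    """Logistic function, written so that math.exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# 26 strongly positive factors: both probabilities round to 1.0,
# but the log-odds still tell the two samples apart.
strong = log_odds([0.99] * 26)
stronger = log_odds([0.999] * 26)
```

A forecast is then simply "condition present" when the log-odds are above 0, which corresponds to a probability above 0.5.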

