How exactly do I use Bayesian spam filtering to detect spam in a contact form?

I am trying to create a script that checks a message submitted through a website contact form and decides whether it is spam. I already have around 300 spam emails that I can use to detect keywords, and I'm trying to use probabilities to determine whether a message is spam.
I tried to use this as a reference: http://en.wikipedia.org/wiki/Bayesian_spam_filtering

but I'm still confused about exactly which formula to use here.

Would I need to detect ham email as well?

Nura111 Asked:

Ray Paseur Commented:
Why not just add a CAPTCHA to the contact form page?  And let the contact form send the email messages to a GMail account.  Google is pretty good at filtering spam.

The Wikipedia article explains it pretty well.  What's the question?
Nura111 Author Commented:
The article explains it very well?? I added a link to a way of calculating probability, not to CAPTCHA.
Again, the client doesn't want a CAPTCHA. The reason is that people are lazy and will not get in touch. The client thinks most of the spam we get is entered manually, so a CAPTCHA is not going to stop it. The question is more about probability theory and Bayes' law: how do I apply Bayesian spam filtering to the case I'm describing?

I have keywords that I'm detecting in spam emails, and I want to keep learning them automatically. What formulas will I need to use to decide, in my case, that a message is spam based on the keywords?
Ray Paseur Commented:
If you really think that writing your own spam filter is better than using pre-configured tools (like SpamAssassin) then this is the way you can get started.

First, add a hidden field to the form input and give it a name like "zip_code."  Since it is hidden, it will only get filled in by a 'bot.
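A server-side check for that hidden field could look like the minimal sketch below. It is written in Python purely for illustration (the thread's own context is PHP), and the function name and the dictionary of submitted fields are assumptions, not part of any particular framework.

```python
def looks_like_bot(form, honeypot_field="zip_code"):
    """Return True when the hidden honeypot field came back non-empty.

    `form` is assumed to be a dict of submitted field names to values.
    A human never sees the field, so any content in it suggests a bot.
    """
    value = form.get(honeypot_field) or ""
    return bool(str(value).strip())
```

A bot that inspects the page and skips hidden fields will pass this check, so it is better treated as one signal among several than as a verdict.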

Next, execute a "handshake" email with everyone who sends you a message.  The design of that is in this article.  If a sender fails to respond within a day or two, consider them a possible spammer.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_3939-Registration-and-Email-Confirmation-in-PHP.html

Read every email, upload the ones that are spam to your spam data base.  You will need to continue to do that for some time.  In the action script that sends the email message, tag the messages with the IP address of the sender's computer ($_SERVER["REMOTE_ADDR"]).  Take note of the points of origin.   Begin building a list of IP addresses that send spam.  Eventually you will want to automatically discard messages from those IP addresses, but for now you want to continue to collect spam messages.  You will probably have a reliable data base of spam messages when you get to about 10,000 known spam messages.

If you do not understand the Wikipedia article, you might want to consider hiring a consulting firm with expertise in pattern recognition.

The quick-and-dirty easy way for general processing to heuristically identify spam goes something like this.  Break the body of the email up into words (there is a REGEX anchor for word boundaries).  Count the number of words.  Then count the number of unique words (the vocabulary).  Divide the number of unique words by the number of words to get a ratio of vocabulary to content.  Count the number of words that appear in your "spam words" list.  Divide the number of spam words by the number of unique words to get a ratio of spam words to vocabulary.  As you look at these ratios for each message in your spam data base a pattern will emerge and you will be able to begin scoring the messages on various things.  Did it come from a suspected spammer?  Did it have a high ratio of spam words to vocabulary?  Was "zip_code" filled in? Etc.  Each component of the scoring will contribute something to the overall score.
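The two ratios described above can be sketched as follows. This is an illustration in Python rather than a definitive implementation: the function name is hypothetical, and counting each distinct spam word once (rather than every occurrence) is one reading of the description.

```python
import re


def message_ratios(body, spam_words):
    """Compute vocabulary/content and spam-word/vocabulary ratios.

    `spam_words` is assumed to be a set of lowercase words taken from
    your own spam database.
    """
    # \b is the regex anchor for word boundaries mentioned above.
    words = re.findall(r"\b\w+\b", body.lower())
    if not words:
        return {"vocab_ratio": 0.0, "spam_ratio": 0.0}
    vocabulary = set(words)
    spam_hits = vocabulary & set(spam_words)
    return {
        "vocab_ratio": len(vocabulary) / len(words),
        "spam_ratio": len(spam_hits) / len(vocabulary),
    }
```

Running this over every message in the spam database and over some known good messages would show which ranges of each ratio separate the two groups.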

I know you don't want to use a CAPTCHA, but why not consider a checkbox right beside the submit button?  Or maybe have two submit buttons.  One says "I am a human" and the other says, "Use the 'human' button."  Or consider a JavaScript alert/confirm button that injects something into the form input.  There are any number of creative and simple ways to get one extra check on the idea of whether the form is being submitted by a human.  Granted, humans can still make bogus entries, but if you can eliminate the 'bots you will have eliminated most of the noise and preserved most of the signal.

Sidebar note: if the client's value proposition is so poor that people will not check a box to say they are human, then I doubt if there is enough value to expect much response to the web site.

Best regards, and best of luck with it, ~Ray

Nura111 Author Commented:
Thank you for the information.
I just don't understand how to use SpamAssassin for my needs here. If you do, I would be happy to hear it, or to get a good reference besides their website.

>>If you do not understand the Wikipedia article, you might want to consider hiring a consulting firm with expertise in pattern recognition.
I do understand most of the article. I don't understand some of the statistics in it, and that is what my question is about.

>>Did it have a high ratio of spam words to vocabulary?
That's again what my question was about; I wanted to hear another opinion on the best way to do it. So you are suggesting that I break every email into words and then, for example, if 90% of those words are spam words, mark it as spam.
But the problem here, I believe, is that you are treating every spam word as if it has the same value.

What I wanted to do is search the new email for every known spam word and then take the probability of that word into consideration (if the word appeared in 200 out of 400 known spam emails, the probability is 0.5). How should I look for the words, and what should the threshold be? That is probably the basic idea behind Bayesian spam filtering and SpamAssassin, but what is it more specifically?
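For readers with the same question, here is a minimal sketch of the per-word calculation from the Wikipedia article, in Python for illustration. The 200-out-of-400 figure is P(word | spam); the article's formula also needs P(word | ham), so a collection of legitimate messages is required as well. With equal prior odds, the "spamicity" of a word is P(word | spam) / (P(word | spam) + P(word | ham)), and the per-word values p_i combine as p = 1 / (1 + e^η), where η is the sum of ln(1 − p_i) − ln(p_i). The function names, the add-one smoothing, and the equal-prior assumption are choices made for this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the set of distinct lowercase words in a message."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def train(spam_messages, ham_messages):
    """Count how many spam and ham messages contain each word."""
    spam_counts, ham_counts = Counter(), Counter()
    for message in spam_messages:
        spam_counts.update(tokenize(message))
    for message in ham_messages:
        ham_counts.update(tokenize(message))
    return spam_counts, ham_counts, len(spam_messages), len(ham_messages)


def spamicity(word, model):
    """P(spam | word), assuming spam and ham are equally likely beforehand."""
    spam_counts, ham_counts, n_spam, n_ham = model
    # Add-one smoothing (an assumption of this sketch) keeps the result
    # strictly between 0 and 1 for words seen in only one class.
    p_word_given_spam = (spam_counts[word] + 1) / (n_spam + 2)
    p_word_given_ham = (ham_counts[word] + 1) / (n_ham + 2)
    return p_word_given_spam / (p_word_given_spam + p_word_given_ham)


def spam_probability(text, model):
    """Combine the spamicities of all known words in the message."""
    spam_counts, ham_counts, _, _ = model
    known = [w for w in tokenize(text) if w in spam_counts or w in ham_counts]
    if not known:
        return 0.5  # no evidence either way
    # Sum in log space to avoid underflow on long messages.
    eta = sum(math.log(1 - p) - math.log(p)
              for p in (spamicity(w, model) for w in known))
    eta = max(min(eta, 700.0), -700.0)  # keep math.exp within float range
    return 1 / (1 + math.exp(eta))
```

A message would then be flagged when `spam_probability(message, model)` exceeds a cutoff such as 0.9; that cutoff is a tuning choice, not something the formula dictates.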

I will look into all of your other ideas as well

Thank you!!
doninja Commented:
As an extra, think about using link recognition in your program, since spam form submissions often try to send you some sort of link.

There are a number of public link category collections that can be used and are updated regularly, e.g. http://urlblacklist.com/

Filters such as SpamAssassin generally combine rule-based filtering and Bayesian filtering to get a good signal-to-noise ratio. If your forms are supposed to collect only addresses and names, then you know, as a simple example, that no field should contain http://.
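A simple rule of that kind might be sketched as below, in Python for illustration. The pattern and the function name are assumptions; a real filter would probably also compare the extracted links against a blacklist.

```python
import re

# Matches http://, https:// and bare www. links up to the next whitespace.
LINK_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def find_links(form_fields):
    """Return {field_name: [links]} for every field that contains a link.

    `form_fields` is assumed to be a dict of submitted names to values.
    """
    found = {}
    for name, value in form_fields.items():
        links = LINK_PATTERN.findall(str(value))
        if links:
            found[name] = links
    return found
```

Any link in a field such as a name or postal address could then count heavily toward a spam score, while a link in the message body might count for less.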

On your question about thresholds and starting ratios: that all comes down to the exact algorithm you are using, the quantity of sample forms used to create the initial statistics, and how much risk of good forms being blocked you are willing to accept.
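One way to turn "what chance you want of good forms being blocked" into a number is to score a set of known good messages and pick the cutoff from those scores. The sketch below, in Python for illustration, assumes that a message is blocked when its score is strictly greater than the threshold; the function name and the default rate are hypothetical.

```python
def pick_threshold(ham_scores, max_false_positive_rate=0.01):
    """Return a cutoff that blocks at most the given share of good messages.

    `ham_scores` are spam scores computed for known legitimate messages.
    A message is assumed to be blocked when its score > threshold.
    """
    if not ham_scores:
        raise ValueError("need at least one ham score")
    ranked = sorted(ham_scores, reverse=True)
    # Number of good messages we are willing to see blocked.
    allowed = min(int(len(ranked) * max_false_positive_rate), len(ranked) - 1)
    # Only the `allowed` highest-scoring good messages exceed this value.
    return ranked[allowed]
```

With only a few hundred samples the estimate is coarse, which is one reason the quantity of sample forms matters.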


crazedsanity Commented:
There's another question that probably has most of what you need: [http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_27415060.html].
Nura111 Author Commented:
OK, I read it, but I'm still not even beginning to understand how to use SpamAssassin. Can someone please explain?
doninja Commented:
For SpamAssassin you need the server and client elements.
Assuming you have a standard setup on a server with Bayesian filtering and rules enabled, you then need a client function that sends the message (in your case, the form contents) to the server in MIME format so it can be scanned.

The client side can be accomplished using the spamc client or a library, depending on what language you are writing your contact form in and what server you have.
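As a rough illustration of the client side, the sketch below wraps the form contents in a MIME message and pipes it to the `spamc` command-line client. It is written in Python for illustration and assumes that `spamc` is installed and that a `spamd` server is reachable with its default settings. The function names and the recipient address are hypothetical. In check mode (`-c`), `spamc` prints the score and threshold and exits with status 1 when the message is judged to be spam.

```python
import subprocess
from email.message import EmailMessage


def form_to_mime(name, sender, body, recipient="contact@example.com"):
    """Wrap contact-form fields in a minimal MIME message (as bytes)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient  # placeholder address, an assumption
    msg["Subject"] = f"Contact form message from {name}"
    msg.set_content(body)
    return msg.as_bytes()


def check_with_spamc(raw_message):
    """Return (is_spam, score_text) as reported by `spamc -c`."""
    result = subprocess.run(
        ["spamc", "-c"],
        input=raw_message,
        capture_output=True,
        check=False,  # exit status 1 means "spam", not a failure
    )
    score_text = result.stdout.decode().strip()  # "score/threshold"
    return result.returncode == 1, score_text
```

Because `spamc -c` also exits with status 0 when processing fails, a production version would want to validate the score text rather than rely on the exit status alone.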

Or, as a simple solution, have your contact form automatically mailed to a specific mail domain/address that goes through the SpamAssassin server you have set up. This is standard functionality that is documented in loads of places, depending on what mail server you choose.
Nura111 Author Commented:
>>Or as a simple solution have your contact form be auto mailed to a specific mail domain/address that goes through the spamassassin server you have setup. This is standard functionality that is documented in loads of places depending what mail server you choose.

Do you have any reference on how to do that?
Nura111 Author Commented:
Ray: a quick question about your comment.
>>First, add a hidden field to the form input and give it a name like "zip_code."  Since it is hidden, it will only get filled in by a 'bot.

Can't a bot check whether the field is hidden and then skip it as well? And if I have a drop-down menu, for instance, and it gets filled in, does that mean it was a human, or does it not tell me anything?
doninja Commented:
I don't have a specific reference, but if you're using PHP there are a lot of form processors available that you point the form action at and that send the submission to a specific email address.
http://apptools.com/phptools/forms/

Just have all mail sent to a specific mail domain/address and have that domain routed through a SpamAssassin server.
If a message passes through to the specified address, it can then be passed on to its end destination, be it a database, another address, etc.

I also found this blog post on a PHP API that talks directly to SpamAssassin without using mail. I have never used it, but it may be a good source or starting point for you.
http://ppadron.blog.br/2010/05/04/php-api-to-spamassassin-spamd-protocol/


Nura111 Author Commented:
Thank you! I'm trying to write my own Bayesian spam filter, but if that is not going to be enough I'll definitely go back to this.
Question has a verified solution.