SqueezeOJ


How to Spam Filter Text in Document

Hello,

I have been handed a comma-delimited file containing 50,000 tweets, many of which are spam.

Each row is an individual tweet. I need to clean this file of all suspicious rows.

Can you please recommend a method to identify the spam lines in this file? Any examples or source code would be greatly appreciated.

Thanks,
Jason
rd707

Filters are never perfect. They'll either knock out kosher data or some rogue data will get through.

I think you'll have to take a 1-in-100 sample, write a first-pass filter, see what falls into the keep and reject piles, and gradually refine the algorithm.
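
As a rough sketch of that sample-and-refine loop in Python: this assumes the tweet text sits in the first column of each row, and the patterns in `SPAM_PATTERNS` and all the helper names are made up for illustration, not a recommended rule set.

```python
import csv
import re

# Hypothetical starting rules; adjust them after reviewing each sample.
SPAM_PATTERNS = [
    re.compile(r"https?://\S+.*https?://\S+", re.I),         # two or more links
    re.compile(r"\b(free|win|click here|buy now)\b", re.I),  # sales bait
    re.compile(r"(#\w+\s*){4,}"),                            # four or more hashtags in a row
]

def looks_like_spam(text):
    """Return True if any rule matches the tweet text."""
    return any(p.search(text) for p in SPAM_PATTERNS)

def load_rows(lines):
    """Parse comma-delimited lines (a file handle works), skipping blank rows."""
    return [row for row in csv.reader(lines) if row]

def sample_rows(rows, step=100):
    """Take every step-th row, i.e. a 1-in-100 sample by default."""
    return rows[::step]

def split_piles(rows, text_column=0):
    """Partition rows into (keep, reject) using the current rules."""
    keep, reject = [], []
    for row in rows:
        (reject if looks_like_spam(row[text_column]) else keep).append(row)
    return keep, reject
```

You would load the file with something like `load_rows(open("tweets.csv", newline="", encoding="utf-8"))` (the filename is a placeholder), run `split_piles(sample_rows(rows))`, read both piles by eye, tweak the patterns, and repeat until the piles look right before running the rules over all the rows.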

The rules are going to depend very much on what the tweets are about. Tweets about, say, football are likely to contain more colourful language than a set of tweets on a company website.

If you've got the usernames of whoever posted the tweets, that would help, as serial spammers could easily be filtered out.
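
If the usernames are available, one simple way to do that might look like the sketch below. It assumes each row is a list with the username in the first column; the `max_tweets` cutoff and the function names are hypothetical and would need tuning against the real data, since a busy legitimate account can also post a lot.

```python
from collections import Counter

def find_serial_posters(rows, user_column=0, max_tweets=20):
    """Return the usernames that appear more than max_tweets times."""
    counts = Counter(row[user_column].strip().lower() for row in rows)
    return {user for user, n in counts.items() if n > max_tweets}

def drop_users(rows, blocked, user_column=0):
    """Keep only the rows whose author is not in the blocked set."""
    return [r for r in rows if r[user_column].strip().lower() not in blocked]
```

It is worth eyeballing the flagged usernames before dropping their rows, rather than trusting the count alone.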
SqueezeOJ

ASKER

Hi rd707,

Good points, and a nice insight that the subject of the tweets will influence what counts as spam.

I guess what I'm trying to do is build a generic spam filter that checks a file full of text rather than incoming email. I'd be happy to use a COTS (commercial off-the-shelf) spam filter IF it allowed me to process text rather than email.

Any ideas?

Thanks!
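
For reference, a generic content-based filter of this kind is often built as a naive Bayes classifier, a technique long used in email spam filtering that works on any plain text. The sketch below is a minimal, from-scratch illustration and not a tested product: the class and method names are invented here, and it assumes you have hand-labelled at least one spam and one non-spam ("ham") tweet to train on.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9#@']+", text.lower())

class NaiveBayesFilter:
    """Two-class naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one labelled example; label must be 'spam' or 'ham'."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def _log_score(self, tokens, label):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.word_counts[label].values())
        # Log prior: the share of training examples carrying this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for tok in tokens:
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((self.word_counts[label][tok] + 1) / (total + len(vocab)))
        return score

    def is_spam(self, text):
        """Requires at least one trained example of each label."""
        tokens = tokenize(text)
        return self._log_score(tokens, "spam") > self._log_score(tokens, "ham")
```

Training on a few hundred labelled rows from a sample and then calling `is_spam` on every row of the file is the usual pattern; how well it works depends heavily on the quality and size of the labelled sample.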
ASKER CERTIFIED SOLUTION
rd707

Thanks for working on this question.

I was really hoping for a more technical solution, but the comments provided were correct and theoretically helpful.

Thanks!