• Status: Solved

How to Spam Filter Text in Document

Hello,

I have been handed a comma-delimited file containing 50,000 tweets, many of which are spam.

Each row is an individual tweet. I need to clean this file of all suspicious rows.

Can you please recommend a method to identify the spam rows in this file? Any examples or source code would be greatly appreciated.

Thanks,
Jason
Asked by: SqueezeOJ

1 Solution
 
rd707 Commented:
Filters are never perfect. They'll either knock out kosher data or some rogue data will get through.

I think you'll have to take a 1-in-100 sample, write a first-pass filter, see what falls into the keep and reject piles, and gradually refine the algorithm.
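A minimal sketch of that sample-and-refine loop, in Python. The phrase list, the function names and the assumption that the tweet text sits in a known column are all illustrative and not taken from the actual file; the starter rules are placeholders to be replaced after reviewing each sample.

```python
import csv
import io
import random

# Hypothetical starter rules; refine them after reviewing each sample.
SUSPICIOUS_PHRASES = ("free followers", "click here", "win a prize")


def load_rows(handle):
    """Read every row of a comma-delimited file-like object into a list."""
    return list(csv.reader(handle))


def sample_rows(rows, rate=0.01, seed=0):
    """Pick roughly `rate` of the rows at random (about 1 in 100 by default)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < rate]


def looks_like_spam(text):
    """Return True if the tweet text contains any starter phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)


def split_piles(rows, text_column):
    """Sort rows into keep and reject piles for manual review."""
    keep, reject = [], []
    for row in rows:
        if looks_like_spam(row[text_column]):
            reject.append(row)
        else:
            keep.append(row)
    return keep, reject
```

A typical round would load the file, call `sample_rows`, run `split_piles` on the sample, read both piles by eye, and adjust `SUSPICIOUS_PHRASES` before trying the next sample.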

The rules are going to depend very much on what the tweets are about. Tweets about, say, football are likely to contain more colourful language than a set of tweets on a company website.

If you've got the usernames of whoever posted the tweets, that would help, as serial spammers could easily be filtered out.
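If usernames are available, one way to act on this is to count how often each user appears in the reject pile and drop every row from the repeat offenders. The sketch below assumes a username column exists; the function names and the threshold of three rejected tweets are hypothetical choices, not something the file dictates.

```python
from collections import Counter


def serial_spammers(rejected_usernames, min_rejected=3):
    """Return users with at least `min_rejected` tweets in the reject pile."""
    counts = Counter(rejected_usernames)
    return {user for user, n in counts.items() if n >= min_rejected}


def drop_users(rows, user_column, blocked):
    """Keep only the rows whose username is not in the blocked set."""
    return [row for row in rows if row[user_column] not in blocked]
```

The threshold trades false positives against missed spammers, so it is worth checking a few flagged accounts by hand before dropping their rows.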
 
SqueezeOJ (Author) Commented:
Hi rd707,

Good points and nice insight into the idea that the kind of tweets will influence the content that will be considered spam.

I guess what I'm trying to do is build a generic spam filter that checks a file full of text rather than incoming email.  I'd be happy to use a COTS spam filter IF it allowed me to process text rather than email.

Any ideas?

Thanks!
 
rd707 Commented:
I'm not sure an email spam filter would suffice, even if it could be pointed at text files as opposed to incoming email.

I used to work for a marketing company and we'd have to be careful on a number of fronts to make sure our marketing emails didn't bounce or get stored in junk email folders due to over-zealous spam filters.

Some criteria I'm aware of that various filters use (or have used) to block emails:

- Media : Some image types may be blocked (e.g. JPG) or content consisting purely of images may be blocked
- Links : Emails pointing to known phishing URLs may be blocked
- Content : Emails containing certain combinations of words may be blocked
- Names : Emails where the recipient forenames or surnames are all the same may be blocked
- Sender : Emails from known spurious addresses or domains may be blocked
- Attachments : Emails with executable attachments (e.g. .EXE or script files) may be blocked
- User reported : Emails where the sender has been repeatedly reported as a junk mail sender (we had particular problems with this due to the Hotmail reporting feature)

As you can see, very few of these relate directly to Twitter, probably only content and links. Filtering links is complicated further by forwarding-address services such as tinyurl, which can mask the target site. These links would need to be followed through by the software to see what they ultimately point to.
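As a rough sketch of the link check, the Python below pulls URLs out of a tweet and sorts the tweet into one of three buckets by host name. The host lists are illustrative assumptions: `SHORTENER_HOSTS` is incomplete, and `phish.example` is a placeholder standing in for a real blocklist. Following a shortened link to its final destination needs a network request and is not shown here.

```python
import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)

# Illustrative lists only; a real filter would load maintained lists.
SHORTENER_HOSTS = {"tinyurl.com", "bit.ly", "t.co"}
BLOCKED_HOSTS = {"phish.example"}  # hypothetical placeholder domain


def extract_hosts(text):
    """Return the lowercased host of every http(s) URL found in the text."""
    hosts = []
    for match in URL_PATTERN.findall(text):
        # Strip punctuation that commonly trails a URL in running text.
        host = urlparse(match.rstrip(".,;:!?)")).hostname
        if host:
            if host.startswith("www."):
                host = host[4:]
            hosts.append(host)
    return hosts


def classify_links(text):
    """Return 'blocked', 'needs_resolving' or 'ok' for a tweet's links."""
    hosts = extract_hosts(text)
    if any(host in BLOCKED_HOSTS for host in hosts):
        return "blocked"
    if any(host in SHORTENER_HOSTS for host in hosts):
        return "needs_resolving"
    return "ok"
```

Tweets that come back as `needs_resolving` are the ones whose real target is masked, so they would be queued for the follow-through step rather than rejected outright.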

Hope this helps a little.
 
SqueezeOJ (Author) Commented:
Thanks for working on this question.

I was really hoping for a more technical solution, but the comments provided were correct and theoretically helpful.

Thanks!