Solved

How to Spam Filter Text in Document

Posted on 2011-03-08
Last Modified: 2012-05-11
Hello,

I have been handed a comma-delimited file containing 50,000 tweets, and many of them are spam.

So, each row is an individual tweet.  I need to clean this file of all suspicious rows.

Can you please recommend a method to identify the spam lines in this file? Any examples or source code would be greatly appreciated.

Thanks,
Jason
Question by:SqueezeOJ

Expert Comment

by:rd707
ID: 35074829
Filters are never perfect. They'll either knock out kosher data or some rogue data will get through.

I think you'll have to take a sample (say, 1 row in 100), write a first-pass filter, see what falls into the keep and reject piles, and gradually refine the algorithm.
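The sample-and-refine loop described above could be sketched roughly as follows. This is only a minimal sketch: the patterns in SUSPICIOUS_PATTERNS, the function names, and the assumption that the tweet text sits in the first CSV column are all hypothetical and would need adjusting to the real file.

```python
import csv
import re

# Hypothetical starting rules; refine them after reading each pile.
SUSPICIOUS_PATTERNS = [
    re.compile(r"\bfree\b.*\bfollowers\b", re.IGNORECASE),
    re.compile(r"\bclick here\b", re.IGNORECASE),
]


def looks_like_spam(text):
    """Return True if any hand-written pattern matches the tweet text."""
    return any(p.search(text) for p in SUSPICIOUS_PATTERNS)


def sample_rows(rows, step=100):
    """Yield every `step`-th row (a simple 1-in-`step` systematic sample)."""
    for index, row in enumerate(rows):
        if index % step == 0:
            yield row


def split_piles(rows, text_column=0):
    """Sort rows into (keep, reject) lists using looks_like_spam."""
    keep, reject = [], []
    for row in rows:
        if len(row) <= text_column:
            continue  # skip blank or short rows
        (reject if looks_like_spam(row[text_column]) else keep).append(row)
    return keep, reject


def review_sample(csv_lines, step=100, text_column=0):
    """Parse CSV lines (e.g. an open file) and split a 1-in-`step` sample."""
    reader = csv.reader(csv_lines)
    return split_piles(sample_rows(reader, step), text_column)
```

To try it on a real file, something like `with open("tweets.csv", newline="", encoding="utf-8") as f: keep, reject = review_sample(f)` should work (the file name is a placeholder). Read both piles by hand, adjust the patterns, and repeat.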

The rules are going to depend very much on what the tweets are about. Tweets about, say, football are likely to contain more colourful language than a set of tweets on a company website.

If you've got the usernames of who posted the tweets, that would help, as serial spammers could easily be filtered out.
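If usernames are available, filtering out serial spammers might look something like the sketch below. The column layout (username first, text second), the threshold of 3 flagged tweets, and both function names are assumptions for illustration, and `is_spam` stands for whatever rule you have settled on.

```python
from collections import Counter


def find_serial_spammers(rows, is_spam, min_flagged=3,
                         user_column=0, text_column=1):
    """Return the set of users with at least `min_flagged` flagged tweets."""
    flagged = Counter(
        row[user_column] for row in rows if is_spam(row[text_column])
    )
    return {user for user, count in flagged.items() if count >= min_flagged}


def drop_users(rows, blocked, user_column=0):
    """Return only the rows whose author is not in the blocked set."""
    return [row for row in rows if row[user_column] not in blocked]
```

Note that this removes every tweet from a blocked user, including ones the rule did not flag, so the threshold is worth tuning against a hand-checked sample.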

Author Comment

by:SqueezeOJ
ID: 35076488
Hi rd707,

Good points and nice insight into the idea that the kind of tweets will influence the content that will be considered spam.

I guess what I'm trying to do is build a generic spam filter that checks a file full of text rather than incoming email.  I'd be happy to use a COTS spam filter IF it allowed me to process text rather than email.
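One widely used technique for a generic text filter of this kind is a naive Bayes classifier trained on hand-labelled examples. Nobody in this thread proposed it, so treat the following as a rough sketch under that assumption rather than a recommended product; the class name, tokenizer, and labels are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    """Tiny two-class (spam/ham) naive Bayes text classifier."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Add one labelled example; label must be 'spam' or 'ham'."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_score(self, text):
        """Return log-odds of spam vs ham; a value above 0 leans spam."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        score = 0.0
        for label, sign in (("spam", 1), ("ham", -1)):
            total_words = sum(self.word_counts[label].values())
            # Smoothed class prior.
            log_prob = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing, with one extra slot for unseen words.
                log_prob += math.log(
                    (count + 1) / (total_words + len(vocab) + 1)
                )
            score += sign * log_prob
        return score
```

You would label a few hundred sampled rows by hand, call `train` on each, and then reject rows whose `spam_score` exceeds a cut-off chosen by inspecting the results.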

Any ideas?

Thanks!

Accepted Solution

by:rd707 (earned 500 total points)
ID: 35082216
I'm not sure an email spam filter would suffice, even if it could be pointed at text files as opposed to incoming email.

I used to work for a marketing company and we'd have to be careful on a number of fronts to make sure our marketing emails didn't bounce or get stored in junk email folders due to over-zealous spam filters.

Some criteria I'm aware of that various filters use (or have used) to block emails:

- Media : Some image types may be blocked (e.g. JPG) or content consisting purely of images may be blocked
- Links : Emails pointing to known phishing URLs may be blocked
- Content : Emails containing certain combinations of words may be blocked
- Names : Emails where the recipients' forenames or surnames are all the same may be blocked
- Sender : Emails from known spurious addresses or domains may be blocked
- Attachments : Emails with executable attachments (e.g. .EXE files or scripts) may be blocked
- User reported : Emails where the sender has been repeatedly reported as a junk mail sender (we had particular problems with this due to the Hotmail reporting feature)

As you can see, very few of these relate directly to Twitter, probably only content and links. Filtering links is complicated further by forwarding-address services such as tinyurl, which can mask the target site. The software would need to follow these links through to see what they ultimately point to.
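As a rough illustration of the link handling described above, the sketch below pulls URLs out of a tweet, flags ones hosted on a shortening service, and follows redirects to find the final address. The host list is only an example, the function names are hypothetical, and `resolve_url` needs network access and may raise `urllib.error.URLError` on failure.

```python
import re
import urllib.request
from urllib.parse import urlparse

# Illustrative list only; extend it with the shorteners seen in your data.
SHORTENER_HOSTS = {"tinyurl.com", "bit.ly", "t.co"}

# Simple pattern: a URL runs until whitespace or a comma.
URL_PATTERN = re.compile(r"https?://[^\s,]+", re.IGNORECASE)


def extract_urls(text):
    """Return all http/https URLs found in the tweet text."""
    return URL_PATTERN.findall(text)


def is_shortened(url):
    """Return True if the URL's host is a known shortening service."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENER_HOSTS


def resolve_url(url, timeout=5):
    """Follow redirects and return the final URL (needs network access)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()
```

The resolved address can then be checked against whatever list of known bad domains you maintain. Resolving 50,000 rows' worth of links is slow, so it may be worth resolving only the URLs that `is_shortened` flags.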

Hope this helps a little.

Author Closing Comment

by:SqueezeOJ
ID: 35234285
Thanks for working on this question.

I was really hoping for a more technical solution, but the comments provided were correct and theoretically helpful.

Thanks!