Solved

How to Spam Filter Text in Document

Posted on 2011-03-08
Last Modified: 2012-05-11
Hello,

I have been handed a comma-delimited file containing 50,000 Twitter tweets - and many of the tweets are Spam tweets.

So, each row is an individual tweet.  I need to clean this file of all suspicious rows.

Can you please recommend a method to identify the Spam lines in this file? Any examples or source-code is greatly appreciated.

Thanks,
Jason
Question by:SqueezeOJ
4 Comments
 
LVL 4

Expert Comment

by:rd707
ID: 35074829
Filters are never perfect. They'll either knock out kosher data or some rogue data will get through.

I think you'll have to take a 1-in-100 sample, write a first-pass filter, see what falls into the keep and reject piles, and gradually refine the algorithm.
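A minimal sketch of that sample-and-refine loop, assuming Python and a CSV whose tweet text sits in a known column. The phrase list, the function names, and the column index are illustrative assumptions, not something taken from the file in question.

```python
import csv
import random

# Hypothetical starter rules; refine them after reading both piles.
SUSPICIOUS_PHRASES = ("free followers", "click here", "work from home")

def looks_like_spam(text):
    """Return True if the tweet text contains any starter phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)

def read_rows(path):
    """Read every row of a comma-delimited file into a list of lists."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))

def sample_rows(rows, rate=0.01, seed=0):
    """Keep roughly `rate` of the rows; seeded so a review is repeatable."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < rate]

def split_piles(rows, text_column):
    """Sort rows into (keep, reject) lists using looks_like_spam."""
    keep, reject = [], []
    for row in rows:
        (reject if looks_like_spam(row[text_column]) else keep).append(row)
    return keep, reject
```

Typical use would be `keep, reject = split_piles(sample_rows(read_rows("tweets.csv")), text_column=1)`, then reading both piles by hand and adjusting `SUSPICIOUS_PHRASES` before running against the full file.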

The rules are going to depend very much on what the tweets are about. Tweets about, say, football are likely to contain more colourful language than tweets on a company website.

If you've got the usernames of whoever posted the tweets, that would help, as serial spammers could easily be filtered out.
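One way the username idea could look in Python, as a rough sketch: count how many of each user's tweets trip a spam check and drop every row from users over a threshold. The column positions, the threshold of 3, and the `is_spam` callable are assumptions for illustration.

```python
from collections import Counter

def serial_spammers(rows, is_spam, user_column=0, text_column=1, threshold=3):
    """Return the set of usernames with at least `threshold` flagged tweets."""
    hits = Counter(
        row[user_column] for row in rows if is_spam(row[text_column])
    )
    return {user for user, count in hits.items() if count >= threshold}

def drop_users(rows, users, user_column=0):
    """Remove every row posted by one of the given usernames."""
    return [row for row in rows if row[user_column] not in users]
```

Dropping all of a flagged user's rows also removes their innocent-looking tweets, so the threshold is worth tuning against a hand-checked sample.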
 

Author Comment

by:SqueezeOJ
ID: 35076488
Hi rd707,

Good points and nice insight into the idea that the kind of tweets will influence the content that will be considered spam.

I guess what I'm trying to do is build a generic spam filter that checks a file full of text rather than incoming email.  I'd be happy to use a COTS spam filter IF it allowed me to process text rather than email.

Any ideas?

Thanks!
 
LVL 4

Accepted Solution

by:
rd707 earned 1000 total points
ID: 35082216
I'm not sure an email spam filter would suffice, even if it could be pointed at text files as opposed to incoming email.

I used to work for a marketing company and we'd have to be careful on a number of fronts to make sure our marketing emails didn't bounce or get stored in junk email folders due to over-zealous spam filters.

Some themes I'm aware that various filters use (or have used) to block emails:

- Media : Some image types may be blocked (e.g. JPG) or content consisting purely of images may be blocked
- Links : Emails pointing to known phishing URLs may be blocked
- Content : Emails containing certain combinations of words may be blocked
- Names : Emails where the recipient forename or surname are all the same may be blocked
- Sender : Emails from known spurious addresses or domains may be blocked
- Attachments : Emails with executable attachments (e.g. .EXE or script) may be blocked
- User reported : Emails where the sender has been repeatedly reported as a junk mail sender (we had particular problems with this due to the Hotmail reporting feature)

As you can see, very few of these relate directly to Twitter - probably only content and links. Filtering links is complicated even more by the fact that there are now forwarding address services, e.g. tinyurl, that can mask the target site. These would need to be followed through by the software to see what they ultimately point to.
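The link check could be sketched in Python roughly as follows: pull URLs out of the tweet text, follow any that use a forwarding service, and compare the final host against a blocklist. The shortener list is a small illustrative sample, `phish.example` is a made-up blocklist entry, and `resolve` needs network access, so it is passed in as a parameter to allow a stub when testing.

```python
import re
import urllib.request
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://\S+")
# Illustrative sample of forwarding services, not a complete list.
SHORTENER_HOSTS = {"tinyurl.com", "bit.ly", "t.co", "goo.gl"}
# Hypothetical blocklist; a real one would come from a maintained feed.
BLOCKED_HOSTS = {"phish.example"}

def extract_urls(text):
    """Find http(s) URLs and trim trailing sentence punctuation."""
    return [match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(text)]

def host_of(url):
    """Return the lowercased host name of a URL, or '' if there is none."""
    return (urlparse(url).hostname or "").lower()

def is_shortened(url):
    """True if the URL points at one of the listed forwarding services."""
    return host_of(url) in SHORTENER_HOSTS

def resolve(url, timeout=5.0):
    """Follow redirects with a HEAD request and return the final URL."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()

def has_blocked_link(text, resolver=resolve):
    """True if any link in the text ends up on a blocked host."""
    for url in extract_urls(text):
        target = resolver(url) if is_shortened(url) else url
        if host_of(target) in BLOCKED_HOSTS:
            return True
    return False
```

Resolving 50,000 tweets' worth of links means contacting possibly hostile sites, so caching results per short URL and running the resolver from an isolated machine would be sensible precautions.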

Hope this helps a little.
 

Author Closing Comment

by:SqueezeOJ
ID: 35234285
Thanks for working on this question.

I was really hoping for a more technical solution, but the comments provided were correct and theoretically helpful.

Thanks!
Question has a verified solution.
