Solved

How to Spam Filter Text in Document

Posted on 2011-03-08
Last Modified: 2012-05-11
Hello,

I have been handed a comma-delimited file containing 50,000 tweets, many of which are spam.

So, each row is an individual tweet.  I need to clean this file of all suspicious rows.

Can you please recommend a method to identify the spam lines in this file? Any examples or source code would be greatly appreciated.

Thanks,
Jason
Question by:SqueezeOJ
LVL 4

Expert Comment

by:rd707
ID: 35074829
Filters are never perfect. They'll either knock out kosher data or some rogue data will get through.

I think you'll have to take a 1-in-100 sample, write a first-pass filter, see what falls into the keep and reject piles, and gradually refine the algorithm.
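As a rough illustration of that sample-and-refine loop, here is a minimal Python sketch. The patterns, the threshold, and the assumption that the tweet text is in the first column are all placeholders of mine, not something taken from the actual file; the idea is to eyeball both piles and adjust the rules after each pass.

```python
import csv
import re

# Hypothetical starting rules; adjust after reviewing each sample.
SUSPICIOUS_PATTERNS = [
    re.compile(r"https?://", re.IGNORECASE),                  # contains a link
    re.compile(r"\b(free|win|click here)\b", re.IGNORECASE),  # sales-style wording
]


def read_rows(lines):
    """Parse comma-delimited lines (an open file or a list of strings)."""
    return list(csv.reader(lines))


def sample_rows(rows, step=100):
    """Return every `step`-th row, i.e. a 1-in-`step` systematic sample."""
    return [row for i, row in enumerate(rows) if i % step == 0]


def looks_like_spam(text, threshold=2):
    """Flag the text when at least `threshold` patterns match."""
    hits = sum(1 for pattern in SUSPICIOUS_PATTERNS if pattern.search(text))
    return hits >= threshold


def split_piles(rows, text_column=0):
    """Split rows into (keep, reject) lists using the current rules."""
    keep, reject = [], []
    for row in rows:
        if looks_like_spam(row[text_column]):
            reject.append(row)
        else:
            keep.append(row)
    return keep, reject
```

To try it on a real file, something like `with open("tweets.csv", newline="", encoding="utf-8") as f: rows = read_rows(f)` followed by `split_piles(sample_rows(rows))` should do, where `tweets.csv` is a made-up file name.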

The rules are going to depend very much on what the tweets are about. Tweets about, say, football are likely to contain more colourful language than tweets on a company website.

If you've got the usernames of whoever posted the tweets, that would help, as serial spammers could easily be filtered out.
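If the file does include usernames, the serial-spammer idea could be sketched roughly as below. The column position and the cut-off of three rejected tweets are assumptions for illustration only.

```python
from collections import Counter


def serial_spammers(rejected_rows, user_column=0, min_rejects=3):
    """Return usernames with at least `min_rejects` rows in the reject pile."""
    counts = Counter(row[user_column] for row in rejected_rows)
    return {user for user, n in counts.items() if n >= min_rejects}


def drop_users(rows, users, user_column=0):
    """Remove every row posted by one of the given usernames."""
    return [row for row in rows if row[user_column] not in users]
```

The point of the second function is that once an account is identified as a repeat offender, all of its rows can go, including any that slipped past the content rules.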
 

Author Comment

by:SqueezeOJ
ID: 35076488
Hi rd707,

Good points and nice insight into the idea that the kind of tweets will influence the content that will be considered spam.

I guess what I'm trying to do is build a generic spam filter that checks a file full of text rather than incoming email.  I'd be happy to use a COTS spam filter IF it allowed me to process text rather than email.

Any ideas?

Thanks!
 
LVL 4

Accepted Solution

by:rd707 (earned 500 total points)
ID: 35082216
I'm not sure an email spam filter would suffice, even if it could be pointed at text files as opposed to incoming email.

I used to work for a marketing company and we'd have to be careful on a number of fronts to make sure our marketing emails didn't bounce or get stored in junk email folders due to over-zealous spam filters.

Some themes I'm aware that various filters use (or have used) to block emails:

- Media : Some image types may be blocked (e.g. JPG) or content consisting purely of images may be blocked
- Links : Emails pointing to known phishing URLs may be blocked
- Content : Emails containing certain combinations of words may be blocked
- Names : Emails where the recipient forename or surname are all the same may be blocked
- Sender : Emails from known spurious addresses or domains may be blocked
- Attachments : Emails with executable attachments (e.g. .EXE or script files) may be blocked
- User reported : Emails where the sender has been repeatedly reported as a junk mail sender (we had particular problems with this due to the Hotmail reporting feature)

As you can see, very few of these relate directly to Twitter, probably only content and links. Filtering links is complicated further by forwarding address services, e.g. tinyurl, that can mask the target site. These links would need to be followed through by the software to see what they ultimately point to.
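A minimal sketch of the link check, in Python, might look like the following. The shortener and blocked host lists are invented placeholders (a real list of phishing hosts would have to come from elsewhere), and `resolve_url` needs network access, so it is passed in as a parameter that can be swapped out.

```python
import re
from urllib.parse import urlparse
from urllib.request import Request, urlopen

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)

# Illustrative lists only; both are assumptions, not real data.
SHORTENER_HOSTS = {"tinyurl.com", "bit.ly"}
BLOCKED_HOSTS = {"phish.example"}


def resolve_url(url, timeout=5):
    """Follow redirects and return the final URL (requires network access)."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.geturl()


def host_of(url):
    """Return the lower-cased host name of a URL, or '' if there is none."""
    return (urlparse(url).hostname or "").lower()


def has_blocked_link(text, resolver=resolve_url):
    """True when any link in the text ends up at a blocked host."""
    for url in URL_PATTERN.findall(text):
        url = url.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if host_of(url) in SHORTENER_HOSTS:
            try:
                url = resolver(url)  # see where the short link really goes
            except OSError:
                continue  # unresolvable; could instead be treated as suspicious
        if host_of(url) in BLOCKED_HOSTS:
            return True
    return False
```

Resolving 50,000 rows' worth of links one at a time will be slow, so caching resolved URLs is probably worth considering.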

Hope this helps a little.
 

Author Closing Comment

by:SqueezeOJ
ID: 35234285
Thanks for working on this question.

I was really hoping for a more technical solution, but the comments provided were correct and theoretically helpful.

Thanks!
