Solved

How to Spam Filter Text in Document

Posted on 2011-03-08
Last Modified: 2012-05-11
Hello,

I have been handed a comma-delimited file containing 50,000 Twitter tweets - and many of the tweets are Spam tweets.

So, each row is an individual tweet.  I need to clean this file of all suspicious rows.

Can you please recommend a method to identify the Spam lines in this file? Any examples or source code would be greatly appreciated.

Thanks,
Jason
Question by:SqueezeOJ
4 Comments
 
LVL 4

Expert Comment

by:rd707
ID: 35074829
Filters are never perfect. They'll either knock out kosher data or some rogue data will get through.

I think you're going to have to take a 1-in-100 sample, write a first-pass filter, see what falls into the keep and reject piles, and gradually refine the algorithm.
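A rough sketch of that sample-and-refine loop in Python, in case it helps. The keyword list, the two-link rule, and the assumption that the tweet text sits in the first column are all placeholders of mine, not something taken from your file; the idea is just to look at both piles and adjust the rules each round.

```python
import random
import re

# Hypothetical starter rules; adjust them after reviewing each sample.
SPAM_WORDS = re.compile(r"\b(free|win|winner|click here|cheap)\b", re.IGNORECASE)
URL_RE = re.compile(r"https?://\S+")

def looks_like_spam(text):
    """Flag a tweet that uses a spammy phrase or carries two or more links."""
    if SPAM_WORDS.search(text):
        return True
    return len(URL_RE.findall(text)) >= 2

def sample_rows(rows, rate=0.01, seed=0):
    """Keep roughly `rate` of the rows (1 in 100 by default), reproducibly."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < rate]

def split_piles(rows, text_column=0):
    """Sort rows into (keep, reject) piles using the current rules."""
    keep, reject = [], []
    for row in rows:
        pile = reject if looks_like_spam(row[text_column]) else keep
        pile.append(row)
    return keep, reject
```

To try it on the real file you would read the rows with `csv.reader`, pass them through `sample_rows`, then eyeball what `split_piles` puts in each pile before changing the rules.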

The rules are going to depend very much on what the tweets are about. Tweets about, say, football are likely to contain more colourful language than a set of tweets on a company website.

If you've got the usernames of whoever posted the tweets, that would help, as serial spammers could easily be filtered out.
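If the usernames are available, something like the following could work. This is only a sketch: it assumes each row is a (username, text) pair, that you supply your own `is_spam` check, and that three flagged tweets is a sensible cut-off, none of which I know to be true for your data.

```python
from collections import Counter

def find_serial_spammers(rows, is_spam, min_flagged=3):
    """Return usernames with at least `min_flagged` tweets flagged by `is_spam`."""
    flagged = Counter(user for user, text in rows if is_spam(text))
    return {user for user, count in flagged.items() if count >= min_flagged}

def drop_users(rows, blocked):
    """Remove every tweet posted by a blocked username."""
    return [(user, text) for user, text in rows if user not in blocked]
```

Dropping everything from a repeat offender also catches their tweets that slipped past the content rules, at the risk of losing the odd legitimate one.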
 

Author Comment

by:SqueezeOJ
ID: 35076488
Hi rd707,

Good points and nice insight into the idea that the kind of tweets will influence the content that will be considered spam.

I guess what I'm trying to do is build a generic spam filter that checks a file full of text rather than incoming email.  I'd be happy to use a COTS spam filter IF it allowed me to process text rather than email.

Any ideas?

Thanks!
 
LVL 4

Accepted Solution

by:
rd707 earned 500 total points
ID: 35082216
I'm not sure an email spam filter would suffice, even if it could be pointed at text files as opposed to incoming email.

I used to work for a marketing company and we'd have to be careful on a number of fronts to make sure our marketing emails didn't bounce or get stored in junk email folders due to over-zealous spam filters.

Some criteria I'm aware of that various filters use (or have used) to block emails:

- Media : Some image types may be blocked (e.g. JPG) or content consisting purely of images may be blocked
- Links : Emails pointing to known phishing URLs may be blocked
- Content : Emails containing certain combinations of words may be blocked
- Names : Emails where the recipient's forename and surname are the same may be blocked
- Sender : Emails from known spurious addresses or domains may be blocked
- Attachments : Emails with executable attachments (e.g. .EXE or script files) may be blocked
- User reported : Emails where the sender has been repeatedly reported as a junk mail sender (we had particular problems with this due to the Hotmail reporting feature)

As you can see, very few of these relate directly to Twitter - probably only content and links. Filtering links is complicated even more by the fact that there are now forwarding address services, e.g. tinyurl, that can mask the target site. These would need to be followed through by the software to see what they ultimately point to.
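For the link side, here is a minimal Python sketch of following a shortened URL before checking it. The shortener list and the blocked host are made-up placeholders; you would need a real blocklist. `resolve_url` uses the standard library's `urllib.request.urlopen`, which follows redirects, but some servers refuse HEAD requests, so treat that part as a starting point only.

```python
import re
import urllib.request
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")
SHORTENER_HOSTS = {"tinyurl.com", "bit.ly", "t.co"}  # illustrative, not complete
BLOCKED_HOSTS = {"phishing.example"}  # placeholder; supply a real blocklist

def extract_urls(text):
    """Return every http(s) link found in the tweet text."""
    return URL_RE.findall(text)

def host_of(url):
    """Return the lower-cased host name of a URL, or '' if there is none."""
    return (urlparse(url).hostname or "").lower()

def resolve_url(url, timeout=5):
    """Follow redirects and return the final URL (needs network access)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()

def has_blocked_link(text, resolver=resolve_url):
    """True if any link, after following shorteners, lands on a blocked host."""
    for url in extract_urls(text):
        if host_of(url) in SHORTENER_HOSTS:
            try:
                url = resolver(url)
            except OSError:
                # Assumption: a short link that cannot be resolved is suspicious.
                return True
        if host_of(url) in BLOCKED_HOSTS:
            return True
    return False
```

The `resolver` parameter is there so the check can be exercised without hitting the network, e.g. by passing a function that returns a canned target URL.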

Hope this helps a little.
 

Author Closing Comment

by:SqueezeOJ
ID: 35234285
Thanks for working on this question.

I was really hoping for a more technical solution, but the comments provided were correct and theoretically helpful.

Thanks!