Disclaimer
I am not advocating you go out and "screen-scrape" anyone's profile or any message boards. The intent of my article is to demonstrate the inherent vulnerability of certain obfuscation patterns, not to instigate some "script-kiddie" to go address fishing. If you are a script-kiddie, or just a genuine degenerate, please don't hold me responsible for your actions. You have been warned!
Prerequisites
Before going any further, you may want to become familiar with regular expressions, if you are not already. Regular expressions are a type of programming language (more like a meta-language) that can be used to identify certain patterns within a target string. If you are unfamiliar with regular expressions, there are hundreds of articles and sites all over the web dedicated to the topic. BatuhanCetin has written an introductory article on them, and my site of preference is Regular-Expressions.info.
Email Address Structure
Everyone reading this should be familiar with email address structure. The one thing an email address cannot live without is the "at" symbol ( @ ). It is the character that separates the "who" from the "where" in an email address. Generally speaking, an email address will have only one @ symbol. (It is possible, per RFCs 5321 and 5322, to have an @ within a quoted-string. For the sake of simplicity, I will assume only one @ symbol occurs within any example address shown.) In addition to the @ symbol, both the "who" and the "where" are required. To extract a simple, non-obfuscated address, we might use the following regular expression:
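A minimal sketch of such a pattern in Python, assuming it is simply `\S+@\S+` (a run of non-whitespace characters on each side of the @). The function name, the sample text, and the `example.com` address are hypothetical and only there for illustration.

```python
import re

# Assumed pattern: non-whitespace run, "@", non-whitespace run.
SIMPLE_EMAIL = re.compile(r"\S+@\S+")

def extract_plain(text):
    """Return every whitespace-delimited token with text on both sides of an @."""
    return SIMPLE_EMAIL.findall(text)

# Hypothetical page text, for illustration only.
sample = "Questions? Write to jsmith@example.com for details."
print(extract_plain(sample))
```

A pattern this loose also captures any punctuation that touches the address, so a real scraper would probably tighten it, but it is enough to show how little effort a plain address requires.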
The pattern above will search for one or more non-whitespace characters, followed by the @ symbol, and then one or more non-whitespace characters. It should be easy to see that if we had the address jsmith@flibberty-flabber-
Enter Obfuscation
When one realizes how easy it is to extract one's email address from a web page's HTML code, one might then start to ponder how to disguise said address so that a colleague can still decipher the address, but the spammer cannot. One of the more popular ways to disguise an address is by simply obfuscating the dot and the @ symbol. Taking this approach against my previous address example, I could propose any of the following:
I could keep going, but hopefully you get the idea. So how could I create a regular expression to extract such variations in the address format? Quite simply in fact.
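One way such a pattern might look in Python is sketched below. The bracket styles, the spelled-out words "at" and "dot", the helper names, and the `example.com` samples are all assumptions chosen for illustration; it also ignores dots in the "who" part to stay short.

```python
import re

# Characters assumed to wrap the words "at" and "dot"; more could be added.
OPEN = r"[\[({<]"
CLOSE = r"[\])}>]"

def separator(symbol, word):
    """Sub-pattern for a literal symbol, a bracketed word, or a spaced word."""
    return (
        rf"(?:\s*{re.escape(symbol)}\s*"      # "@" or " @ "
        rf"|\s*{OPEN}\s*{word}\s*{CLOSE}\s*"  # "[at]" or " ( at ) "
        rf"|\s+{word}\s+)"                    # " at "
    )

OBFUSCATED_EMAIL = re.compile(
    r"(?P<user>[\w+-]+)" + separator("@", "at")
    + r"(?P<domain>[\w-]+)" + separator(".", "dot")
    + r"(?P<tld>[a-z]{2,})",
    re.IGNORECASE,
)

def extract_obfuscated(text):
    """Return the addresses found in text, rewritten in plain user@domain.tld form."""
    return [
        f"{m['user']}@{m['domain']}.{m['tld']}"
        for m in OBFUSCATED_EMAIL.finditer(text)
    ]

# Hypothetical variations of one address.
variants = [
    "jsmith@example.com",
    "jsmith [at] example [dot] com",
    "jsmith(at)example(dot)com",
    "jsmith at example dot com",
]
for v in variants:
    print(v, "->", extract_obfuscated(v))
```

Supporting another bracket style is a one-character change to `OPEN` and `CLOSE`, which is why swapping symbols buys so little protection.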
The pattern above will match any of the examples listed above. I could easily include other characters in the special characters list. What to do, what to do?
Improving Obfuscation
I'm sure the above raises the question, "How can I improve my obfuscation techniques?" In order to accomplish that, we must think like a regex engine. So what does said engine do? Well, it processes each character in some input string to see if it conforms to part of some pattern. "Ok. How do I break it, then?" Excellent question. Simply changing characters is not enough, because a regex engine is very adept at inspecting characters. What you should be aiming for is semantic obfuscation. What do I mean by this? Instead of changing the value of a character, change what it means. Replace or insert some character that does not exist in your original email address and provide a note explaining this quirk. For example, I might post the following as my address in some forum:
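To make the idea concrete, here is a minimal Python sketch of a filler-letter scheme. The helper names, the `example.com` address, the insertion positions, and the wording of the note are hypothetical; the only firm requirement is that the filler character does not occur in the real address.

```python
import re

def add_filler(address, filler, positions):
    """Insert `filler` before the characters at the given indexes of `address`."""
    if filler in address:
        raise ValueError("filler must not occur in the real address")
    chars = list(address)
    # Work from the right so earlier indexes are not shifted by insertions.
    for pos in sorted(positions, reverse=True):
        chars.insert(pos, filler)
    return "".join(chars)

def remove_filler(obfuscated, filler):
    """What a human reader does after reading the note: drop every filler."""
    return obfuscated.replace(filler, "")

real = "jsmith@example.com"                         # hypothetical address
posted = add_filler(real, "z", [1, 4, 9, 11, 16])   # uneven gaps between insertions
note = "Remove the zees to get my real address."

print(posted, "-", note)
# A plain address pattern still finds something, but it is the padded string.
print(re.findall(r"\S+@\S+", posted))
print(remove_filler(posted, "z"))
```

The scraper's pattern still matches, yet what it harvests is not a deliverable address; only someone who reads and understands the note can recover the real one.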
Above, I have used the letter "z" as filler for my address. I staggered the insertions of this letter to prevent a pattern from emerging in my obfuscation (e.g. every other letter is a "z"). I have also provided a note explaining to anyone reading my address how to decipher it. You might ask why I used the mnemonic "zees" instead of just saying the letter "z". This is, again, to make it harder for spammers to extract the address. If the note named the letter outright, I could easily write a regex pattern that would extract the letter to remove from the note, and then include other logic that would filter the address based on the found letter. Something like this:
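A rough Python sketch of that attack follows. It assumes the note names the letter literally, in a form such as `remove the letter "z"`; the note wording, the function name, and the `example.com` address are hypothetical.

```python
import re

QUOTE = "[\"']?"  # optional single or double quote before the letter
# Assumed note wording: remove/drop/delete ... letter(s) <single letter>
NOTE = re.compile(
    r"(?:remove|drop|delete)\b.*?\bletters?\s+" + QUOTE + r"([a-z])(?![a-z])",
    re.IGNORECASE,
)
ADDRESS = re.compile(r"\S+@\S+")

def recover(text):
    """Find the filler letter in the note, then strip it from each address."""
    note = NOTE.search(text)
    if note is None:
        return []  # no literal letter to key on
    filler = note.group(1)
    return [addr.replace(filler, "") for addr in ADDRESS.findall(text)]

literal = 'jzsmizth@exzamzple.czom (remove the letter "z" to email me)'
mnemonic = "jzsmizth@exzamzple.czom (remove the zees to email me)"

print(recover(literal))   # the note gives the filler away
print(recover(mnemonic))  # nothing for the pattern to latch onto
```

The mnemonic can of course be targeted too once a spammer knows about it, but it forces a pattern written specifically for that wording.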
Again, is this foolproof? No. Given a reason and time, a spammer can still grab your address. The point is to make it more difficult for them to do so. I once had a boss who said to me, "Does the security system make our store safer? No. But if Johnny Snatchalot sees that we have a security system, but the store next door doesn't, which target will be easier for him to score from?" I believe the same concept holds here.
Another technique you could use, but might be less feasible, would be to encrypt your address with some weak cipher. For example, you could use a Caesar cipher with a shift of 1. Again, you could include a note that the displayed address is incorrect and it should be deciphered using method "X". While this won't prevent the address from being extracted, per se, it would cause a spammer to have to go through another layer to determine your email address.
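As a small illustration, a shift-by-one Caesar cipher could be sketched in Python as follows. The function name and the `example.com` address are hypothetical, and this version shifts ASCII letters only, leaving the @ and the dot untouched.

```python
import string

def caesar(text, shift):
    """Shift ASCII letters by `shift`, wrapping around; leave everything else alone."""
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    k = shift % 26
    table = str.maketrans(
        lower + upper,
        lower[k:] + lower[:k] + upper[k:] + upper[:k],
    )
    return text.translate(table)

real = "jsmith@example.com"    # hypothetical address
posted = caesar(real, 1)       # what appears on the page
print(posted)
print(caesar(posted, -1))      # what a reader recovers by shifting back
```

Because the @ survives, a scraper will still pick up the posted string; it is simply the wrong address until someone applies the reverse shift.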
Good-Intentioned Obfuscations
I have seen this one from time to time: posting an email address as an image. I'm sure most of you reading this are familiar with Captcha. If not, it's the picture with the swirled letters that you have to look at and then type what you see in the accompanying text box. If you're familiar with Captcha, then you may be familiar with why it came about. Long ago, some clever individual figured out that you could programmatically create web requests to certain pages in order to "work" a web site to your advantage (think pinging Ticketmaster to grab all the high-dollar seats for later scalping). Captcha was invented to prevent this. Optical character recognition (OCR) is a way of extracting text from an image, and subsequently another set of clever individuals realized you could perform OCR against a Captcha image to defeat it. This is the reason you now see swirly, sometimes indecipherable, letters inside Captcha windows. Why did I tell you this? Because all of the image-based email addresses I have seen to date have been simple, unobfuscated text in an image. This is ripe for an OCR attack. While I am not suggesting you cannot use an image to post your email address, try to take the Captcha route and obfuscate the image a bit.
Summary
Communication across the globe has increased exponentially over the past few decades. With the advent of the internet and email, spam has increased notably as well. In an electronic world, we often desire instant communication, and email has granted us that. However, there is no reason for us to make it easy for spammers to turn our in-boxes into electronic dumpsters. I'm a firm believer in "if it can be made, it can be broken." While the methods above are not foolproof and won't prevent all attacks against your in-box, they should ward off the simpleton spammers. The most foolproof way to protect your address is not to post it in the first place, but when you do need to post it, it makes sense to use some form of obfuscation. Don't be afraid to have some fun with your obfuscation techniques; just make sure you don't go overboard and lose the meaning of what you set out to convey!