Regular Expressions

A regular expression ("regex") is a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. Regular expression processors are found in several search engines, in the search and replace dialogs of several word processors and text editors, and in the command lines of text processing utilities, such as sed and AWK. Many programming languages provide regular expression capabilities, some built-in, for example Perl, JavaScript, Ruby, AWK, and Tcl, and others via a standard library, for example .NET languages, Java, Python and C++ (since C++11). Most other languages offer regular expressions via a library.


We are witnesses that everyone is saying that our children shouldn't "play" with a technology because it is dangerous. This article is going to prove that they are wrong.

Do you hate spam? I do, and I am willing to bet you do as well. I often wonder, though, "if people hate spam so much, why do they still post their email addresses on the web?" I'm not talking about a plain-text posting here. I am referring to the fancy obfuscations of said addresses that I see posted on profiles (and message boards). Don't get me wrong, I think security-through-obscurity is a good approach to enabling communication with another party via your established email address. What gets me is how simple the obfuscations appear to be. In this article I am going to demonstrate how simple it is to de-obfuscate some of the simpler patterns I have seen using regular expressions (regex). I will also offer some alternative obfuscation methods. Though I am going to focus on email, the ideas presented here could be applied to any online moniker that could be used for spamming purposes (e.g. Twitter, Facebook, etc.). Although some of the discussion will be technical, the audience of this article is anyone who actively posts their email to public web sites.



Disclaimer

I am not advocating you go out and "screen-scrape" anyone's profile or any message boards. The intent of my article is to demonstrate the inherent vulnerability of certain obfuscation patterns, not to instigate some "script-kiddie" to go address fishing. If you are a script-kiddie, or just a genuine degenerate, please don't hold me responsible for your actions. You have been warned!



Prerequisites

Before going any further, you may want to become familiar with regular expressions, if you are not already. Regular expressions are a type of programming language (more like a meta-language) that can be used to identify certain patterns within a target string. If you are unfamiliar with regular expressions, there are hundreds of articles and sites all over the web dedicated to the topic. BatuhanCetin has written an introductory article on them, and my site of preference is Regular-Expressions.info. Covering the ins and outs of regular expressions is outside the scope of this article.



Email Address Structure

Everyone reading this should be familiar with email address structure. The one thing an email address cannot live without is the "at" symbol ( @ ). It is the character that separates the "who" from the "where" in an email address. Generally speaking, an email address will only have one @ symbol. (It is possible, per RFCs 5321 and 5322 to have an @ within a quoted-string. For the sake of simplicity, I will assume only one @ symbol occurs within any example address shown.) In addition to the @ symbol, both the "who" and the "where" are required. To extract a simple, non-obfuscated address, we might use the following regular expression:

\S+@\S+

The pattern above will search for one or more non-whitespace characters, followed by the @ symbol, and then one or more non-whitespace characters. It should be easy to see that if we had the address jsmith@flibberty-flabber-jab.com, the regular expression would have no trouble extracting that address.
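As a minimal sketch of that extraction, here is the same pattern run through Python's `re` module. The function name `find_plain_addresses` and the sample sentence are made up for illustration; only the pattern comes from the text above.

```python
import re

# The pattern from the article: non-whitespace, an @, more non-whitespace.
SIMPLE_ADDRESS = re.compile(r"\S+@\S+")

def find_plain_addresses(text):
    """Return every run of non-whitespace characters that contains an @."""
    return SIMPLE_ADDRESS.findall(text)

sample = "Write to jsmith@flibberty-flabber-jab.com for details"
print(find_plain_addresses(sample))
```

Because `\S` accepts any non-whitespace character, trailing punctuation such as a comma or a closing parenthesis would be swept into the match as well; a real scraper would likely tighten the character classes.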



Enter Obfuscation

When one realizes how easy it is to extract one's email address from a web page's HTML code, one might then start to ponder how to disguise said address so that a colleague can still decipher the address, but the spammer cannot. One of the more popular ways to disguise an address is by simply obfuscating the dot and the @ symbol. Taking this approach against my previous address example, I could propose any of the following:

jsmith [at] flibberty-flabber-jab [dot] com
jsmith {at} flibberty-flabber-jab {dot} com
jsmith __AT__ flibberty-flabber-jab __DOT__ com


I could keep going, but hopefully you get the idea. So how could I create a regular expression to extract such variations in the address format? Quite simply in fact.

\S+\s*[[{_-]*[aA][tT][]}_-]*\s*\S+\s*[[{_-]*[dD][oO][tT][]}_-]*\s*\S+

The pattern above will match any of the examples listed above. I could easily include other characters in the special characters list. What to do, what to do?
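To make the point concrete, the sketch below applies that pattern in Python and rebuilds a normal address from the pieces. Two things are assumptions of this sketch and not part of the original pattern: the brackets inside the character classes are backslash-escaped (so Python's `re` module does not warn about nested sets), and three capture groups are added around the `\S+` parts. The helper name `deobfuscate` is hypothetical.

```python
import re

# Escaped variant of the pattern above, with three capture groups added.
OBFUSCATED = re.compile(
    r"(\S+)\s*[\[{_-]*[aA][tT][\]}_-]*\s*"      # who, then the disguised "at"
    r"(\S+)\s*[\[{_-]*[dD][oO][tT][\]}_-]*\s*"  # host, then the disguised "dot"
    r"(\S+)"                                    # top-level domain
)

def deobfuscate(text):
    """Return the first disguised address in text as who@host.tld, or None."""
    match = OBFUSCATED.search(text)
    if match is None:
        return None
    who, host, tld = match.groups()
    return f"{who}@{host}.{tld}"

for posted in (
    "jsmith [at] flibberty-flabber-jab [dot] com",
    "jsmith {at} flibberty-flabber-jab {dot} com",
    "jsmith __AT__ flibberty-flabber-jab __DOT__ com",
):
    print(deobfuscate(posted))
```

All three sample postings collapse to the same plain address, which is the weakness being described: swapping characters changes the surface, not the structure.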



Improving Obfuscation

I'm sure the above raises the question, "How can I improve my obfuscation techniques?" In order to accomplish that, we must think like a regex engine. So what does said engine do? Well, it processes each character in some input string to see if it conforms to part of some pattern. "Ok. How do I break it, then?" Excellent question. Simply changing characters is not enough, because a regex engine is very adept at inspecting characters. What you should be aiming for is semantic obfuscation. What do I mean by this? Instead of changing the value of a character, change what it means. Replace or insert some character that does not exist in your original email address and provide a note explaining this quirk. For example, I might post the following as my address in some forum:


jzsmithz@flibbertyz-flabberz-jazb.cozm (remove all zees)


Above, I have used the letter "z" as filler for my address. I staggered the insertions of this letter to prevent a pattern from emerging in my obfuscation (e.g. every other letter is a "z"). I have also provided a note explaining to anyone reading my address how to decipher it. You might ask why I used the mnemonic "zees" instead of just saying the letter "z". This is again, to make it harder for spammers to extract the address. I could easily write a regex pattern that would extract the letter to remove from the note, and then include other logic that would filter the address based on the found letter. Something like this:


remove(?: all)?(?: the)? [a-z]
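The "other logic" mentioned above might look something like the following Python sketch. The capture group and the trailing `\b` are additions made for this sketch: the group exposes the filler letter, and the word boundary requires the letter to stand alone. Without that boundary, `[a-z]` would also match the first letter of a spelled-out word such as "zees". The function name `strip_filler` is hypothetical.

```python
import re

ADDRESS = re.compile(r"\S+@\S+")
# Pattern from the article plus a capture group and a word boundary (assumptions).
NOTE = re.compile(r"remove(?: all)?(?: the)? ([a-z])\b")

def strip_filler(posted):
    """Find an address and a 'remove all X' note, then drop every X from the address."""
    address = ADDRESS.search(posted)
    note = NOTE.search(posted)
    if address is None or note is None:
        return None
    return address.group().replace(note.group(1), "")

# A note naming the bare letter gives the filler away...
print(strip_filler("jzsmithz@flibbertyz-flabberz-jazb.cozm (remove all z)"))
# ...while the spelled-out mnemonic is not recognized by this particular pattern.
print(strip_filler("jzsmithz@flibbertyz-flabberz-jazb.cozm (remove all zees)"))
```

The second call returns `None` only because this sketch insists on a stand-alone letter; a scraper written with spelled-out letter names in mind would get through, which is why the approach raises the cost without being foolproof.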

Again, is this foolproof? No. Given a reason and time, a spammer can still grab your address. The point is to make it more difficult for them to do so. I once had a boss who said to me, "Does the security system make our store safer? No. But if Johnny Snatchalot sees that we have a security system, but the store next door doesn't, which target will be easier for him to score from?" I believe the same concept holds here.


Another technique you could use, but might be less feasible, would be to encrypt your address with some weak cipher. For example, you could use a Caesar cipher with a shift of 1. Again, you could include a note that the displayed address is incorrect and it should be deciphered using method "X". While this won't prevent the address from being extracted, per se, it would cause a spammer to have to go through another layer to determine your email address.
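For readers who have not met the Caesar cipher, here is a small Python sketch of a shift-by-one encoding and its reversal. The function name `caesar_shift` is made up for this example, and leaving digits and punctuation untouched is a choice of the sketch, not something the text above specifies.

```python
import string

def caesar_shift(text, shift=1):
    """Rotate ASCII letters by `shift` places; leave every other character alone."""
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    s = shift % 26
    table = str.maketrans(
        lower + upper,
        lower[s:] + lower[:s] + upper[s:] + upper[:s],
    )
    return text.translate(table)

posted = caesar_shift("jsmith@flibberty-flabber-jab.com", 1)   # what you would publish
original = caesar_shift(posted, -1)                            # what a reader recovers
print(posted)
print(original)
```

Note that the @ and the dot survive the shift, so a simple `\S+@\S+` scraper still finds something that looks like an address; it is just the wrong one until the shift is undone.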



Good-Intentioned Obfuscations

I have seen this one from time to time: posting an email address as an image. I'm sure most of you reading this are familiar with Captcha. If not, it's the picture with the swirled letters that you have to look at and then type what you see in the accompanying text box. If you're familiar with Captcha, then you may be familiar with why it came about. Optical character recognition (OCR) is a way of extracting text from an image. Long ago, some clever individual figured out that you could programmatically create web requests to certain pages in order to "work" a web site to your advantage (think pinging Ticketmaster to grab all the high-dollar seats for later scalping). Captcha was invented to prevent this. Subsequently, another set of clever individuals realized you could perform OCR against a Captcha image to defeat it. This is the reason you now see swirly, sometimes indecipherable, letters inside Captcha windows. Why did I tell you this? Because all of the image-based email addresses I have seen to date have been simple, unobfuscated text in an image. This is ripe for an OCR attack. While I am not suggesting you cannot use an image to post your email address, try to take the Captcha route and obfuscate the image a bit.



Summary

Communication across the globe has increased exponentially over the past few decades. With the advent of the internet and email, spam has increased notably as well. In an electronic world, we often desire instant communication, and email has granted us that. However, there is no reason for us to make it easy for spammers to turn our in-boxes into electronic dumpsters. I'm a firm believer in "if it can be made, it can be broken." While the methods above are not foolproof and won't prevent all attacks against your in-box, they should ward off the simpleton spammers. While the most foolproof plan to protect against that is not to post in the first place, it makes sense to use some form of obfuscation. Don't be afraid to have some fun with your obfuscation techniques; just make sure you don't go overboard and lose the meaning of what you set out to convey!


Author Comment

by:kaufmed
tfewster,

Interesting angle. Not foolproof, at least not to someone actively reviewing the addresses returned by their scraper, but a good approach. I just hope you don't have to change your address too often  ; )

Expert Comment

by:mbizzzup
Nice article!

Voted "yes" above.

As most anyone who uses or has come across them can attest, regular expressions (regex) are a complicated bit of magic. Packed so succinctly within their cryptic syntax lies a great deal of power. It's not the "take over the world" kind of power, at least not to the average programmer, but it is the kind of power that can be used to save numerous lines of code. One of the more complicated regex tools I'd like to describe to you is lookaround. When executed properly, lookaround can supercharge your patterns to provide you pattern-matching capabilities otherwise achieved through numerous procedures and even more numerous lines of code.

Regular expression lookaround is not a glaringly simple concept when you first see it. For this reason, readers of this article should at least be familiar with regular expressions in general. EE contributor BatuhanCetin has written a nice introduction to regular expressions here: Regular Expressions Starter Guide.

Outside of its complexity, another thing to be mindful of is that not every regex engine supports lookaround. If you plan on experimenting with any of the patterns demonstrated in this article, you should confirm that your editor or language supports lookaround. As described in the section Types of Lookaround, the two directions of lookaround are lookahead and lookbehind.  Regex engines can implement none, one, or both directions. Be sure …

Expert Comment

by:footech
Nice article!  One point, near the start, in the "Lookahead" section, you have a regex which is
^(?=.*[0-9])[a-zA-Z0-9]+$
and you talk about the dot-star being non-greedy...  Shouldn't this then be like the following?
^(?=.*?[0-9])[a-zA-Z0-9]+$

Hah!  I'm getting a headache and dizzy trying to work through this... though I'm pretty sure I came up with the same thing once when banging my head against the keyboard. :)
\s+|(?<=\w)(?=\W)(?!(?<=\d)(?=([-/])\d\d?\1(?:\d\d){1,2}))(?!(?<=\d([-/])\d\d?)(?=\2(?:\d\d){1,2}))|(?<=\W)(?=\W)|(?<=\W)(?=\w)(?!(?<=\d([-/]))(?=\d\d?\3(?:\d\d){1,2}))(?!(?<=\d\4\d\d?([-/]))(?=(?:\d\d){1,2}))


Author Comment

by:kaufmed
@footech

Ah the difference a single character can make  = )

Yes, you are correct. I have put in for the correction to be made.

(Sorry for the delay; I'm terrible about checking my email [for EE notices]!)

Whatever the reason, if you work on the web development side, you will need day-to-day validation code, such as email, date, IP address, and phone validation, on any edit page or, at a minimum, at registration time.

You can do this in JavaScript (client side) or in a back-end language such as PHP or Perl. As I am working in Perl right now, my validation code will be in Perl only.

Date Validation using Regular expression in Perl
A date consists of three main parts: the day of the month, the month of the year, and the year itself. One more thing is required: the separator (/ or -). You may choose any other separator. Combining all of these, we get a date like MM/DD/YYYY, DD/MM/YYYY, or YYYY/MM/DD (you can use - or anything else instead of the / that I used here), with DD being the day, MM the month, and YYYY the year.

So, to accomplish the task, we have to check that:
a. DD should be between 1 and 31,
b. MM should be between 1 and 12,
c. YYYY should be between 1900 and the current date (instead of today's date you can use any boundary date),
d. if MM is 02, i.e. February, then DD should be between 1 and 28, or up to 29 in a leap year.
So, here is the one line of magic that does all of the validation itself, using a regular expression:

It would match dd-mm-yyyy or dd/mm/yyyy pattern for rest of the patterns you have to …
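The Perl one-liner itself is not reproduced in this excerpt. As a rough sketch of checks (a) through (d), written in Python and not Perl, one option is to let a regular expression confirm only the dd-mm-yyyy or dd/mm/yyyy shape and hand the calendar rules to `datetime.date`. The names `DATE_SHAPE`, `is_valid_date`, and `earliest_year` are assumptions of this sketch, and splitting the work this way is a different technique from the single expression the article describes.

```python
import re
from datetime import date

# Shape only: 1-2 digit day, a separator, 1-2 digit month, the SAME separator, 4-digit year.
DATE_SHAPE = re.compile(r"(\d{1,2})([-/])(\d{1,2})\2(\d{4})")

def is_valid_date(text, earliest_year=1900):
    """Accept dd-mm-yyyy or dd/mm/yyyy between earliest_year and today."""
    match = DATE_SHAPE.fullmatch(text)
    if match is None:
        return False
    day, month, year = int(match[1]), int(match[3]), int(match[4])
    try:
        candidate = date(year, month, day)  # rejects day 31 in April, Feb 29 in non-leap years
    except ValueError:
        return False
    return date(earliest_year, 1, 1) <= candidate <= date.today()

for sample in ("29/02/2000", "29-02-1900", "31/04/2010", "12-05/2010"):
    print(sample, is_valid_date(sample))
```

The backreference `\2` is what forces both separators to be the same character; the leap-year and days-per-month rules live in the standard library instead of in the pattern.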

Author Comment

by:Sanjeev Jaiswal
Yes you are right. Thanks for your review.
I just tried to keep it as simple as I can. Otherwise, validating in a single regex would make it more complex and less preferable.

Expert Comment

by:kaufmed
I see time and time again posts from people asking how to validate dates and other strange values that would be better served by full logic, yet people still want to use Regex to do the validation. I'm guessing it's because they don't fully understand what Regex is or is useful for. I think the article is well intentioned and useful to those looking for that sort of thing. Keep at it  :)

I have been reconstructing a PHP-based application that has grown into a full-blown interface system over the last ten years, built by a developer who has since gone into business for himself building websites. I am not incredibly fond of writing PHP code on a daily basis, but I have been working on getting the system migrated to an up-to-date implementation of PHP 5.3.1, and I've run across some issues in the migration that I thought warranted documenting.

Problem 1
The application as it stands is currently on a Linux box running PHP 4.4.2, which allows you to use variables without pre-defining them. So if you want to write a conditional loop that walks through a query and adds things to a variable named $var, you don't need to pre-define the variable. You just put $var .= "new conditions" in the loop and the new string gets appended to the variable.

The problem is security of course and the most recent implementations do not allow a variable to be appended unless it pre-exists. So I needed to devise a way to find every existence of the
   $var .=
no matter what the variable was called and then append the code so that it now says
    if (!isset($variable)) { $variable = ""; } $variable .= "

Solution:
Adobe Dreamweaver (or any other editor whose find and replace supports regular expressions). I just like using Dreamweaver because I've been using it for so long. I am sure you can do the same thing in many other editors like …
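The excerpt stops before the actual search and replace expressions are shown. As a hedged illustration of the idea only, the same rewrite can be sketched with Python's `re.sub`; the pattern, the replacement template, and the name `guard_appends` are assumptions of this sketch and not the expressions used in Dreamweaver.

```python
import re

# A PHP variable name followed by the append operator, e.g. "$sql .=".
APPEND = re.compile(r"(\$[A-Za-z_]\w*)\s*\.=")

def guard_appends(php_source):
    """Prefix every `$name .=` with an isset() guard that initializes $name to ""."""
    return APPEND.sub(r'if (!isset(\1)) { \1 = ""; } \1 .=', php_source)

before = '$sql .= " AND active = 1";'
print(guard_appends(before))
```

This is deliberately naive: it guards every append, even ones whose variable is already defined, and it would also rewrite an append that appears inside a string literal or a comment, so the result would still need a review pass.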
0
by Batuhan Cetin

A regular expression is a language that we use to edit a string or to retrieve sub-strings that meet specific rules from a text. A regular expression can be applied to a set of string variables.

There are many RegEx engines available, and these engines differ in syntax and compilation. Perl 5 is the most popular syntax, and it runs on an NFA engine. There are three main types of engines: NFA, POSIX and DFA. Please see the references section at the end of the article for detailed information.

Regular expressions are hard to explain in words and look frightening. But if you have the patience and courage to jump in, this is one of the most useful and fun languages you may ever learn. So, here are the most used special characters, with examples.

Special Characters Used in Regular Expressions

"()" character

Matches the pattern between the parentheses, or is used to logically group patterns or characters together.

RegEx: (exchange)
Match: exchange in expertsexchange

"." character

The "dot" matches a single character. Note that it does not match line breaks unless the engine is operating in single line mode.

RegEx: experts.
Match: experts1, expertsa, ... (the dot must consume one character, so experts on its own does not match)

"*" character

This returns a result with zero or more occurrences of the character or group before it. For example:

RegEx: experts*
Match: expert, experts, expertss, expertsss, ...

RegEx: exper(ts)*
Match: exper, experts, expertsts, expertststs ...
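The examples so far can be tried out with a few lines of Python. The helper `fully_matches` is a made-up name for this sketch; it uses `re.fullmatch`, so a candidate counts only when the whole string fits the pattern, and `re.DOTALL` stands in for the "single line mode" mentioned in the dot section.

```python
import re

def fully_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# "()" groups a sub-pattern and captures what it matched.
print(re.search(r"(exchange)", "expertsexchange").group(1))

# "." needs exactly one character after "experts".
print(fully_matches(r"experts.", ["experts", "experts1", "expertsa"]))

# "*" repeats the item before it zero or more times: a single "s" or the group "(ts)".
print(fully_matches(r"experts*", ["exper", "expert", "experts", "expertss"]))
print(fully_matches(r"exper(ts)*", ["exper", "expert", "experts", "expertsts"]))

# By default "." skips line breaks; re.DOTALL switches that restriction off.
print(bool(re.fullmatch(r"experts.", "experts\n")))
print(bool(re.fullmatch(r"experts.", "experts\n", re.DOTALL)))
```

Other engines spell the single-line switch differently (for example as an `s` flag), so the flag name here applies to Python only.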

"?" character

Expert Comment

by:kaufmed
Hello BatuhanCetin,

It seems I'm now the one who has been away for some time! Thanks. I'm finishing up one now  :)

Expert Comment

by:xenium
Thanks a lot this guide is proving useful having come from google docs complete lack of help on the topic. I hadn't even heard of "Regular expression" which must be one of the biggest misnomers in programming!

I've still a way to go...if anyone can help i've got a question open on the topic..
http://www.experts-exchange.com/Programming/Languages/Regular_Expressions/Q_28531374.html

Thanks a lot