RegEx: how to extend this Regex

Posted on 2004-10-26
Last Modified: 2010-07-27
Hi all,
I have the following RegEx to find and replace URLs in a string.
Regex re = new Regex(@"(\[URL=|\[url=)*((?<!\[img=|\[IMG=)(http|ftp|https)://[\w-]+(\.[\w-]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)(\])*");
I want to extend this regex: it should ONLY replace a URL if that URL is NOT inside an area surrounded by [html] and [/html].

Examples (this already works):
.... [url=....] ....      ==>     do not change this
.... http://www..../ ....     ==>     make it: .... [url=http://www....] ....

Examples (what I want additionally):
.... [html] ... http://www... [/html] ....    ==> do NOT change this!

Any ideas?

Question by:Smoerble

    Expert Comment

    What you need is called "lookaround", in other words the lookahead and lookbehind features. Many modern regex flavours support this. Which language are you using? It looks like Perl or .NET: in either case you're in luck.

    The gist is that lookaround expressions match positions rather than characters, in the same way that a word boundary (\b) doesn't consume any characters. Usually the syntax is (?=XXX) for a lookahead and (?<=XXX) for a lookbehind. Maybe an example is useful.

    If you had the string:


    then a regex of


    would match the first b, but not the rest.

    To match all the b's you could write something like


    In other words, match characters as long as what comes before the first one is [html] and what comes after the last one is [/html].

    You might have to look up the documentation for the specific language you're using to get the details.
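
    As a rough C# sketch of the lookaround idea described above: the sample string, the class name and both patterns below are assumptions chosen for illustration, not the ones from the original comment. The first pattern uses only a lookbehind, so it stops after one b; the second adds a lookahead and takes the whole run.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Assumed sample input (hypothetical, for illustration only).
    public const string Sample = "b [html]bbb[/html] b";

    // Lookbehind only: a single 'b' that directly follows "[html]".
    public static Match FirstB() =>
        Regex.Match(Sample, @"(?<=\[html\])b");

    // Lookbehind plus lookahead: the run of b's that starts right after
    // "[html]" and ends right before "[/html]".
    public static Match AllBs() =>
        Regex.Match(Sample, @"(?<=\[html\])b+(?=\[/html\])");
}
```

    Neither match includes the [html] or [/html] text itself, because lookarounds only test the position and consume nothing.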

    Author Comment

    I think there's a small misunderstanding:
    I want everything EXCEPT the stuff between [html] and [/html].
    And this regex needs to include the regex above.

    About the language: it's C#, yes.

    Author Comment

    Hmm... maybe a different approach: do it in several steps:

    1) get all strings between [html] and [/html] (there might be more than one block), save it somehow
    2) replace all URLs with the regEx from above
    3) get all [html][/html] blocks in the modified string and replace them with the original strings saved in step 1.

    Any idea what the code would have to look like? Is it possible? Is it clever?
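
    The three steps above could be sketched in C# roughly as follows. The class name HtmlBlockStash and the Url pattern are hypothetical: Url is a simplified stand-in for the regex from the question, and the sketch assumes [html] blocks are properly closed and not nested.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HtmlBlockStash
{
    // One [html]...[/html] block, matched lazily and across line breaks.
    static readonly Regex HtmlBlock = new Regex(
        @"\[html\].*?\[/html\]",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Simplified stand-in for the URL regex in the question (an assumption).
    // It never consumes '[' or ']', so the [html] tags themselves stay intact.
    static readonly Regex Url = new Regex(
        @"(?<!\[url=|\[img=)(?:https?|ftp)://[^\s\[\]]+",
        RegexOptions.IgnoreCase);

    public static string Convert(string input)
    {
        // Step 1: remember every original block, in order of appearance.
        var saved = new List<string>();
        foreach (Match m in HtmlBlock.Matches(input))
            saved.Add(m.Value);

        // Step 2: tag every bare URL, including the ones inside blocks.
        string linked = Url.Replace(input, m => "[url=" + m.Value + "]");

        // Step 3: the k-th block of the modified text is replaced by the
        // k-th saved original, which undoes step 2 inside the blocks.
        int next = 0;
        return HtmlBlock.Replace(linked, m => saved[next++]);
    }
}
```

    Step 3 relies on the URL replacement leaving the number and order of [html] blocks unchanged, which holds here only because the stand-in URL pattern cannot match across a square bracket.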

    Accepted Solution

    You might be able to do something with negative lookaround, using a ! instead of =, that is (?!XXX) for a negative lookahead and (?<!XXX) for a negative lookbehind.
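
    One way a negative lookahead could be applied to this problem, as an untested-in-production sketch rather than the pattern from the original comment: reject a URL when the next tag after it is a closing [/html] with no opening [html] in between. The class name and the simplified URL pattern are assumptions, and the approach only works if every [html] has a matching [/html].

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegativeLookaheadDemo
{
    static readonly Regex UrlOutsideHtml = new Regex(
        // Simplified stand-in for the URL regex in the question.
        @"(?<!\[url=|\[img=)(?:https?|ftp)://[^\s\[\]]+" +
        // Fail if we can reach "[/html]" without first passing "[html]",
        // because that means the URL sits inside an open block.
        @"(?!(?:(?!\[html\]).)*\[/html\])",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Convert(string input) =>
        UrlOutsideHtml.Replace(input, m => "[url=" + m.Value + "]");
}
```

    The lookahead rescans the rest of the text for each candidate URL, so this can get slow on long inputs.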


    The reference for this behaviour in dotnet is here:

    but in fact, that's probably not the way to go. The point is that if the [html]...[/html] blocks are matched by one part of your regex, they can no longer be matched by the other parts of it. This means you can do something like:


    and only emit the text matched by the first group. (I haven't tested this, and with a memory like mine, it's bound to be buggy, but hopefully it's a pointer in the right direction.)
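
    A sketch of that idea in C#, under stated assumptions: the class name, the group name url and the simplified URL pattern are all hypothetical, and the group order differs from whatever the original snippet used. The first alternative swallows a whole [html] block so the URLs inside it are never seen by the second alternative; the replacement callback then rewrites only the matches where the url group took part.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SkipHtmlBlocks
{
    static readonly Regex BlockOrUrl = new Regex(
        // Alternative 1: an entire [html]...[/html] block.
        @"\[html\].*?\[/html\]" +
        // Alternative 2: a bare URL (simplified stand-in pattern).
        @"|(?<url>(?<!\[url=|\[img=)(?:https?|ftp)://[^\s\[\]]+)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Convert(string input) =>
        BlockOrUrl.Replace(input, m =>
            m.Groups["url"].Success
                ? "[url=" + m.Value + "]"   // bare URL: wrap it
                : m.Value);                 // html block: emit unchanged
}
```

    This needs only one pass over the input and no lookahead over the rest of the text.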

    I would also suggest that you look at the \G assertion, which forces the match to begin where the last match finished.
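
    A tiny illustration of what \G does in .NET, using a made-up input rather than anything from this thread: with \G each match has to start exactly where the previous one ended, so the scan stops at the first character that breaks the chain.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContiguousMatchDemo
{
    // Counts the digits at the very start of the input. \G pins every match
    // to the end of the previous one, so nothing after a gap is matched.
    public static int LeadingDigits(string input) =>
        Regex.Matches(input, @"\G\d").Count;

    // Same pattern without \G, for comparison: counts every digit.
    public static int AllDigits(string input) =>
        Regex.Matches(input, @"\d").Count;
}
```

    For the hypothetical input "123a45", LeadingDigits gives 3 while AllDigits gives 5.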


    Have a look at the examples in

    Hope this helps.


    Author Comment

    Sorry, I think I only understand parts of your post. Can you please give me some pseudo-code?

    Expert Comment

    If you post the code you are using now, I'll see if I can show you how to fit this in.

    Expert Comment

    I don't think this should be deleted. Although the questioner didn't understand me, I think the explanations I've given and the links to relevant references should be quite adequate to help some other people facing this sort of problem.

    Author Comment

    Oh sorry, I totally missed that one.
    I took a completely different approach (a checkbox that says "do not translate URLs"), as I missed your question about my code.
    So I will grant you the points anyway. Sorry for the delay.
