Solved

Regular Expression

Posted on 2004-10-11
Last Modified: 2010-07-27
I need a regular expression that will match a specific word in a string of words, but not when the word has one of the characters = / \ . immediately before or after it.

Example string
This is a link to the <a href="www.abc.com">abc</a> website.


Word to match
abc

I want the second abc to be matched but not the abc in www.abc.com



Question by:CUTTHEMUSIC
19 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 12283900
qr((?<![=/\\.])(abc)(?![=/\\.]))
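
A minimal sketch for checking this pattern against the example string. It uses Python's re module, which is an assumption made for brevity since the thread itself mixes Perl and .NET; the lookbehind and lookahead syntax is the same in all three.

```python
import re

# "abc" not preceded and not followed by any of the characters = / \ .
pattern = re.compile(r"(?<![=/\\.])(abc)(?![=/\\.])")

text = 'This is a link to the <a href="www.abc.com">abc</a> website.'

# Collect (start offset, matched text) for every match
matches = [(m.start(), m.group(1)) for m in pattern.finditer(text)]
print(matches)
```

Only the link text should be reported: the abc inside www.abc.com has a dot on both sides, so both lookarounds reject it.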
0
 
LVL 6

Expert Comment

by:etmendz
ID: 12284411
You'll normally extract a string bounded by delimiters by first isolating or removing the delimiters from the string. A simple trick is to create a pattern to match the delimiters. When you parse the string, match the opening delimiter and skip it. Read the content that follows until the closing delimiter is matched.

To match an HTML, XML or SGML opening tag (and similar mark-up languages), the following works:

/<[^>]+>/

You use this to signal that an opening tag is matched. Parse through the string and extract the content until the closing tag is matched:

/<\/[^>]+>/
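
The delimiter idea above can be sketched as follows, in Python for brevity (an assumption; the code elsewhere in this thread is .NET). The names TAG and split_tags_and_text are hypothetical. Splitting on a capturing tag pattern keeps the tags and the text between them apart.

```python
import re

# Any run of "<", non-">" characters, ">" is treated as a tag
TAG = re.compile(r"(<[^>]+>)")

def split_tags_and_text(html):
    """Return (tags, text_chunks) for a simple HTML fragment."""
    parts = TAG.split(html)
    # With one capturing group, re.split puts the tags at odd indexes
    tags = parts[1::2]
    text_chunks = [p for p in parts[0::2] if p]
    return tags, text_chunks

tags, chunks = split_tags_and_text(
    'This is a link to the <a href="www.abc.com">abc</a> website.'
)
print(tags)
print(chunks)
```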

Have fun.
0
 
LVL 11

Expert Comment

by:pratap_r
ID: 12285605
<([\w][\w\d]*)[^>]*>(.*?)<\/\1> this will get you the correct text considering the html tags, attributes etc

so your
<a href="www.abc.com">abc</a> will give you abc
and so will <a>abc</a>
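
A small check of that pattern, sketched in Python (an assumption; the same pattern syntax works in .NET). Note that it captures whatever sits between a matching pair of tags, not a particular search word.

```python
import re

# Group 1 is the tag name, group 2 the content;
# \1 requires the closing tag to carry the same name.
pattern = re.compile(r"<([\w][\w\d]*)[^>]*>(.*?)</\1>", re.IGNORECASE)

samples = ['<a href="www.abc.com">abc</a>', "<a>abc</a>"]
contents = [pattern.search(s).group(2) for s in samples]
print(contents)
```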

Pratap
0
 
LVL 11

Expert Comment

by:pratap_r
ID: 12285618
and if it is just the <A> tag you are looking for then this is a simpler one..
<A[^>]*>(.*?)</A>

make sure you turn off case sensitivity

Pratap
0
 
LVL 2

Author Comment

by:CUTTHEMUSIC
ID: 12286541
OK, let me explain more. I am trying to create an application that searches through text and highlights a certain word. Here is what I am using:

Private Function findAndHighlight(ByVal Search_Str As String, ByVal InputTxt As String, ByVal StartTag As String, ByVal EndTag As String) As String

        Return Regex.Replace(InputTxt, "\b(" & Regex.Escape(Search_Str) & ")\b", StartTag & "$1" & EndTag, RegexOptions.IgnoreCase)

End Function

I would call the function like this:
findAndHighlight("abc", "This is a link to the <a href=www.abc.com>abc</a> website.", "<B>", "</B>")

The output that my current code produces is
<a href=www.<B>abc</B>.com><B>abc</B></a> website.
This would obviously cause problems when the user clicked the link.

I also have links that look like this
<a href=www.xyz.com?id=abc>abc</a>

My current code returns this
<a href=www.xyz.com?id=<B>abc</B>><B>abc</B></a>

What should be returned is
<a href=www.xyz.com?id=abc><B>abc</B></a>
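
The over-highlighting can be reproduced with a rough Python port of the function above (a sketch only; find_and_highlight is a hypothetical name and Python is used here as an assumption, not the poster's language). The cause is that \b matches at any switch between a word character and a non-word character, and . = and > are all non-word characters, so the abc inside the URL counts as a whole word too.

```python
import re

def find_and_highlight(search, text, start_tag, end_tag):
    # Same idea as the VB function: whole-word, case-insensitive replace
    pattern = r"\b(" + re.escape(search) + r")\b"
    return re.sub(
        pattern,
        lambda m: start_tag + m.group(1) + end_tag,
        text,
        flags=re.IGNORECASE,
    )

print(find_and_highlight(
    "abc", "<a href=www.xyz.com?id=abc>abc</a>", "<B>", "</B>"
))
```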



0
 
LVL 84

Expert Comment

by:ozo
ID: 12287216
(?<![=\/\\.])(abc)(?![=\/\\.])
fulfills your original specification,
but it now sounds like you want to ignore strings in <tags>
that can get tricky with things like:
<IMG SRC = "foo.gif"
         ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<# Just data #>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>


and what would you want to do with
<a href="www.xxxabcyyy.com">xxxabcyyy</a> website.
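
To illustrate why such input is tricky, here is a minimal Python sketch (Python is an assumption, chosen for brevity) that runs the simple tag pattern <[^>]+> over the IMG example, written on one line. The pattern stops at the first > it sees, which here sits inside the ALT attribute value.

```python
import re

TAG = re.compile(r"<[^>]+>")

html = '<IMG SRC = "foo.gif" ALT = "A > B">'

# The match ends at the ">" inside the quoted attribute, not at the real tag end
first = TAG.search(html).group(0)
print(first)
```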
0
 
LVL 2

Author Comment

by:CUTTHEMUSIC
ID: 12287446
ozo,
I'm not sure how to implement your original solution into my code.
I am increasing the points because of all of the revisions that I have made.
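
One possible way to fold those lookarounds into a find-and-highlight function, sketched in Python (an assumption; the same pattern string should work with .NET's Regex.Replace). The name highlight_with_lookarounds is hypothetical. The search term is escaped and kept inside \b...\b so that only whole words match.

```python
import re

def highlight_with_lookarounds(search, text, start_tag="<B>", end_tag="</B>"):
    # Whole word only, and not touching = / \ . on either side
    pattern = r"(?<![=/\\.])\b(" + re.escape(search) + r")\b(?![=/\\.])"
    return re.sub(
        pattern,
        lambda m: start_tag + m.group(1) + end_tag,
        text,
        flags=re.IGNORECASE,
    )

print(highlight_with_lookarounds("abc", "<a href=www.xyz.com?id=abc>abc</a>"))
```

A known limitation of this approach: a word directly followed by a full stop, as in "See abc.", is also skipped, because the lookahead cannot tell a sentence end from a dot in a URL.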

If I was searching for abc and I had the following string
<a href="www.xxxabcyyy.com">xxxabcyyy</a> website.

I would want to get this
<a href="www.xxxabcyyy.com">xxxabcyyy</a> website.

but if I were searching for xxxabcyyy then I would want this

<a href="www.xxxabcyyy.com"><B>xxxabcyyy</B></a> website.

Again, what I am trying to do is search data that comes in the form of a string. The string is from a database that I did a full text search on. I need a function that will take the string and highlight the searched words that produced the output from the full text search. But in the string there may be links, I don't want the code to highlight the searched words if it exists in the link. Hope this helps.
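
A rough sketch of one way to meet this requirement for simple markup: split the string on tags, then highlight whole words only in the text between them. It is written in Python (an assumption) and highlight_outside_tags is a hypothetical name. It relies on the simple tag pattern <[^>]+>, so it will not cope with the harder cases ozo lists above (a > inside an attribute value, comments, script blocks).

```python
import re

TAG = re.compile(r"(<[^>]+>)")

def highlight_outside_tags(search, html, start_tag="<B>", end_tag="</B>"):
    word = re.compile(r"\b(" + re.escape(search) + r")\b", re.IGNORECASE)
    parts = TAG.split(html)
    # Even indexes are text between tags; odd indexes are the tags themselves
    for i in range(0, len(parts), 2):
        parts[i] = word.sub(
            lambda m: start_tag + m.group(1) + end_tag, parts[i]
        )
    return "".join(parts)

link = '<a href="www.xxxabcyyy.com">xxxabcyyy</a> website.'
print(highlight_outside_tags("abc", link))        # no whole-word match
print(highlight_outside_tags("xxxabcyyy", link))  # only the link text changes
```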
0
 
LVL 11

Expert Comment

by:pratap_r
ID: 12289274
CUTTHEMUSIC, the regex i mentioned in my previous post would work for you.

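Here's the code in C#:
// Requires: using System.Text.RegularExpressions;
// Rebuilds each match as <opening tag><b>content</b></closing tag>.
string MyFunc(Match m)
{
    return m.Groups[1].ToString() + "<b>" + m.Groups[3].ToString() + "</b></" + m.Groups[2].ToString() + ">";
}

private void button2_Click(object sender, System.EventArgs e)
{
    // Group 1: whole opening tag, group 2: tag name, group 3: content up to the matching closing tag
    Regex r = new Regex(@"(<([\w][\w\d]*)[^>]*>)?(.*?)</\2>");
    MessageBox.Show(r.Replace("<a href=www.xyz.com?id=abc>abc</a>", new MatchEvaluator(MyFunc)));
}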


Enjoy!
Pratap
0
 
LVL 11

Expert Comment

by:pratap_r
ID: 12289301
you may change the MyFunc in my code to do the proper formatting as required..

The messagebox for the above code displays this

<a href=www.xyz.com?id=abc><b>abc</b></a>


Pratap
0
 
LVL 6

Expert Comment

by:etmendz
ID: 12294281
You have tags that you need to ignore in order to extract the content. This is usually not easy so it is not a one line solution. The best way is to be able to isolate the tags and then grab only the text inside the tag. You can perform a recursive loop if needed to isolate even the tags within tags within tags and extract only the content you want. You can do this manually or you can use (in C#):

//Create the XmlDocument.
XmlDocument doc = new XmlDocument();
//Create a document fragment.
XmlDocumentFragment docFrag = doc.CreateDocumentFragment();
//Set the contents of the document fragment.
docFrag.InnerXml ="<a href='www.abc.com'>abc</a>";
//Display the document fragment.
Console.WriteLine(docFrag.InnerXml);
Console.WriteLine(docFrag.InnerText); // <-- THIS IS THE TRICK ;-)
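
For comparison, the same inner-text idea sketched with Python's standard xml.etree.ElementTree module (Python is an assumption here, not the poster's language). An XML parser only accepts well-formed input, so the unquoted attributes used elsewhere in this thread (href=www.abc.com) would be rejected.

```python
import xml.etree.ElementTree as ET

fragment = "<a href='www.abc.com'>abc</a>"

# Parse the fragment and join every text node under the root element
root = ET.fromstring(fragment)
inner_text = "".join(root.itertext())
print(inner_text)

# Unquoted attribute values are not well-formed XML
try:
    ET.fromstring("<a href=www.abc.com>abc</a>")
    parsed_unquoted = True
except ET.ParseError:
    parsed_unquoted = False
```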

Have fun...
0
 
LVL 11

Expert Comment

by:pratap_r
ID: 12295707
Using XmlDocument just to extract text might be a performance overhead, since it involves creating the DOM object, validation, etc. A regex, on the other hand, is a one-line solution: one pattern matches all your requirements.

A single regex replace will replace all occurrences, no loops required. So an input of
<a href=www.xyz.com?id=abc>abc</a><a href=www.xyz.com?id=def>def</a>

for my function will provide
<a href=www.xyz.com?id=abc><b>abc</b></a><a href=www.xyz.com?id=def><b>def</b></a>

you just have to write the pattern properly

Pratap
0
 
LVL 11

Expert Comment

by:pratap_r
ID: 12827326
ozo's solution centered around the text being hardcoded.. (i.e, abc being static)

my post solves the problem, both my first one and the 3rd from the last one.

Have Fun!
Pratap
0
 
LVL 84

Expert Comment

by:ozo
ID: 12855720
The original question specifies a static abc

Regex(@"(<([\w][\w\d]*)[^>]*>)?(.*?)</\2>");
will match any pair of matching tags, and would change
"<body> xxx <a href="www.xxxabcyyy.com">xxxabcyyy</a> website. <a href="www.xxxabcyyy.com">xxxabcyyy</a> yyy </body>"
into
"<body><b> xxx <a href="www.xxxabcyyy.com">xxxabcyyy</a> website. <a href="www.xxxabcyyy.com">xxxabcyyy</a> yyy </b></body>"
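
This behaviour can be checked with a short Python sketch (Python is an assumption; wrap_in_bold is a hypothetical stand-in for MyFunc from the C# post). The input ends with a closing </body> tag. Because \2 must equal the name of the first opening tag, the lazy (.*?) skips past both </a> tags and runs all the way to </body>.

```python
import re

pattern = re.compile(r"(<([\w][\w\d]*)[^>]*>)?(.*?)</\2>")

def wrap_in_bold(m):
    # Opening tag, then the content wrapped in <b>, then the closing tag
    return m.group(1) + "<b>" + m.group(3) + "</b></" + m.group(2) + ">"

html = ('<body> xxx <a href="www.xxxabcyyy.com">xxxabcyyy</a> website. '
        '<a href="www.xxxabcyyy.com">xxxabcyyy</a> yyy </body>')

result = pattern.sub(wrap_in_bold, html)
print(result)
```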
0
 
LVL 11

Expert Comment

by:pratap_r
ID: 12862617
It specifies the static abc as an example, not as part of the requirement.

You are right about the regex you mentioned; that's why I had answered it with
<([\w][\w\d]*)[^>]*>(.*?)<\/\1> in my post above (3rd from the top).
0
 
LVL 84

Expert Comment

by:ozo
ID: 12862978
CUTTHEMUSIC also clarified later that when searching for abc,
<a href="www.xxxabcyyy.com">xxxabcyyy</a> website.
should remain unchanged.

<([\w][\w\d]*)[^>]*>(.*?)<\/\1>
would also change
<body> xxx <a href="www.xxxabcyyy.com">xxxabcyyy</a> website. <a href="www.xxxabcyyy.com">xxxabcyyy</a> yyy </body>
into
<body><b> xxx <a href="www.xxxabcyyy.com">xxxabcyyy</a> website. <a href="www.xxxabcyyy.com">xxxabcyyy</a> yyy </b></body>
0
 
LVL 84

Accepted Solution

by:
ozo earned 150 total points
ID: 12866263
Neither of us solved the full problem as revised in #12286541,
but that version is problematic to solve with a regular expression alone.
0
 
LVL 11

Assisted Solution

by:pratap_r
pratap_r earned 150 total points
ID: 12870164
yeah i guess the requirement got skewed in #12286541
0
