Solved

Regular Expression - using regex...

Posted on 2003-03-27
Last Modified: 2011-09-20
I'm having a difficult time implementing my regular expressions with the regex package. Hopefully someone here can help me out.

I am interested in parsing an HTML string to find each <img src=".."> tag and replace the variable directory portion of the src attribute with a constant value. Note that the src attribute can appear anywhere in the img tag, and that the path may use forward slashes, as in <img src="/tmp/test/image.gif">, or backslashes, as in <img height="8" src="C:\tmp\test.gif">.

Given the following HTML example, I've come up with three substitution expressions that work in an editor:
<img height="9" src="/test/this/junk.gif" width="3">
<img src="c:\test\this\junk.gif" alt="helpme">

This first expression removes all the directories from the src value if the path contains a forward slash:
$s/<img\([^>]*\)src="[^"]*\/\([^\/]*\)"/<img\1src="\2"/g

results in:
<img height="9" src="junk.gif" width="3">
<img src="c:\test\this\junk.gif" alt="helpme">


The second expression, applied to that result, removes all the directories from the src value if the path contains a backslash:
$s/<img\([^>]*\)src="[^"]*\\\([^\/]*\)"/<img\1src="\2"/g

results in:
<img height="9" src="junk.gif" width="3">
<img src="junk.gif" alt="helpme">

And the last expression takes the modified HTML and inserts the new constant directory:
$s/<img\([^>]*\)src="\([^"]*\)"/<img\1src="newDirectory\/\2"/

results in:
<img height="9" src="newDirectory/junk.gif" width="3">
<img src="newDirectory/junk.gif" alt="helpme">

I can't get this to work using the regex package, and I can't find many tutorials that cover complex regular expressions. Is anyone able to assist?

Question by:sapientconceptions
 
LVL 15

Expert Comment

by:ozymandias
ID: 8221269
You don't really have the sed-style s///g (substitute, global) syntax in the java.util.regex implementation; Matcher.replaceAll is the nearest equivalent.

The following class does what you want.
Let me know if it's not clear how it does it or why.


import java.util.regex.*;

public class RegExTest{

     static String[] tags = new String[]{
          "<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">",
          "<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">"
     };

     static String replacement = "newDirectory";

     public static void main(String[] args){

          Pattern p1 = Pattern.compile("<img([^>]*) src=\"([^\"]*)/.*",Pattern.CASE_INSENSITIVE);
          Pattern p2 = Pattern.compile("<img([^>]*) src=\"([^\"]*\\\\).*",Pattern.CASE_INSENSITIVE);

          for (int j = 0; j < tags.length; j++){
               Matcher m1 = p1.matcher(tags[j]);
               if (m1.matches()){
                    String update = tags[j].substring(0,m1.start(2)) + replacement + tags[j].substring(m1.end(2));
                    System.out.println(tags[j] + " was changed to " + update);
                    continue;
               }
               Matcher m2 = p2.matcher(tags[j]);
               if (m2.matches()){
                    String update = tags[j].substring(0,m2.start(2)) + replacement + "/" + tags[j].substring(m2.end(2));
                    System.out.println(tags[j] + " was changed to " + update);
               }
          }
     }
}
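
For comparison, the three editor substitutions from the question can probably be carried over almost literally with String.replaceAll, which replaces every match much like the g flag does. The sketch below is only an illustration under assumptions: the class name ImgSrcSubstitute and the method rewrite are hypothetical, and the patterns are adapted from the question (capture groups are unescaped parentheses in Java, and the file-name group excludes the closing quote).

```java
import java.util.regex.Matcher;

public class ImgSrcSubstitute {

    // Hypothetical helper: applies the three substitutions from the question in order.
    static String rewrite(String html, String newDirectory) {
        // 1. Drop everything up to the last forward slash inside the src value.
        String step1 = html.replaceAll(
                "<img([^>]*)src=\"[^\"]*/([^/\"]*)\"", "<img$1src=\"$2\"");
        // 2. Drop everything up to the last backslash inside the src value.
        //    "\\\\" in a Java string literal is the regex \\, i.e. one literal backslash.
        String step2 = step1.replaceAll(
                "<img([^>]*)src=\"[^\"]*\\\\([^\\\\\"]*)\"", "<img$1src=\"$2\"");
        // 3. Prefix the remaining file name with the constant directory.
        //    quoteReplacement protects any '$' or '\' in the directory name.
        String dir = Matcher.quoteReplacement(newDirectory);
        return step2.replaceAll(
                "<img([^>]*)src=\"([^\"]*)\"", "<img$1src=\"" + dir + "/$2\"");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">", "newDirectory"));
        System.out.println(rewrite("<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">", "newDirectory"));
    }
}
```

Note that step 3 prefixes every img src it sees, including one that had no directory to begin with, which mirrors the behaviour of the third editor expression.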
 
LVL 15

Expert Comment

by:ozymandias
ID: 8221297
Actually, by rearranging the capturing group brackets you can get a slightly more elegant solution that avoids having to chop the string up using substring and instead you can just re-assemble the groups that matcher has found.

import java.util.regex.*;

public class RegExTest{

     static String[] tags = new String[]{
          "<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">",
          "<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">"
     };

     static String replacement = "newDirectory";

     public static void main(String[] args){

          Pattern p1 = Pattern.compile("(<img[^>]* src=\")([^\"]*)(/.*)",Pattern.CASE_INSENSITIVE);
          Pattern p2 = Pattern.compile("(<img[^>]* src=\")([^\"]*\\\\)(.*)",Pattern.CASE_INSENSITIVE);

          for (int j = 0; j < tags.length; j++){
               Matcher m1 = p1.matcher(tags[j]);
               if (m1.matches()){
                    String update = m1.group(1) + replacement + m1.group(3);
                    System.out.println(tags[j] + " was changed to " + update);
                    continue;
               }
               Matcher m2 = p2.matcher(tags[j]);
               if (m2.matches()){
                    String update = m2.group(1) + replacement + "/" + m2.group(3);
                    System.out.println(tags[j] + " was changed to " + update);
               }
          }
     }
}
 

Author Comment

by:sapientconceptions
ID: 8221509
ozymandias,

That was great! Thanks for your help. But even though you answered what I asked, I still need some additional assistance. The 'string' of HTML will actually be one complete blob (a text field, in database lingo), not an array of strings that each hold an image tag. I tried modifying the code to work on a regular blob, but of course it didn't match anything anymore. [Consider reading in an HTML file; in my case the data just comes from a backend instead of from the user.]

Also, what if I want to insert something immediately after the '<' and before the last '>' while still updating the src information as you did above? So given <img src="...">, it would then become <test:img src="..." />?

thanks again...
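
A likely reason the blob no longer matched: Matcher.matches() only succeeds when the entire input matches the pattern, whereas Matcher.find() scans for the next occurrence anywhere in the input. The sketch below illustrates the difference; the class MatchesVsFind and its simplified pattern are assumptions made for the illustration, not code from this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesVsFind {

    // Simplified, hypothetical pattern: one complete img tag that has a src attribute.
    static final Pattern IMG = Pattern.compile(
            "<img[^>]* src=\"[^\"]*\"[^>]*>", Pattern.CASE_INSENSITIVE);

    // True only when the whole input is exactly one img tag.
    static boolean wholeInputIsTag(String s) {
        return IMG.matcher(s).matches();
    }

    // Counts img tags anywhere in the input by calling find() repeatedly.
    static int countTags(String s) {
        Matcher m = IMG.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String tag = "<img src=\"/a/b.gif\">";
        String blob = "<html><body>" + tag + "<img src=\"c:\\a\\b.gif\"></body></html>";
        System.out.println(wholeInputIsTag(tag));
        System.out.println(wholeInputIsTag(blob));
        System.out.println(countTags(blob));
    }
}
```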
 
LVL 15

Expert Comment

by:ozymandias
ID: 8221670
OK.

Part 1 : you have a big long string of html instead of an array of tags. This is the code modified to do the same replace but on all the img tags in a big long string.

import java.util.regex.*;

public class RegExTest2{

     static String html = new String("<html><head></head><body><img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\"><img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\"></body></html>");

     static String replacement = "newDirectory";

     public static void main(String[] args){

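          // No leading or trailing ".*" is needed: find() scans the input itself,
          // and [^>]* keeps group 3 inside the current tag.
          Pattern p1 = Pattern.compile("(<img[^>]* src=\")([^\"]*)(/[^>]*>)",Pattern.CASE_INSENSITIVE);
          Pattern p2 = Pattern.compile("(<img[^>]* src=\")([^\"]*\\\\)([^>]*>)",Pattern.CASE_INSENSITIVE);

          String tmphtml = html;
          String newhtml = "";

          Matcher m1 = p1.matcher(tmphtml);

          while (m1.find()){
               newhtml += tmphtml.substring(0,m1.start(2)) + replacement + m1.group(3);
               // Continue after the rewritten tag so the offsets always refer to tmphtml.
               tmphtml = tmphtml.substring(m1.end(3));
               m1 = p1.matcher(tmphtml);
          }

          // Keep whatever followed the last match, then run the second pass on the result.
          tmphtml = newhtml + tmphtml;

          newhtml = "";

          Matcher m2 = p2.matcher(tmphtml);

          while (m2.find()){
               newhtml += tmphtml.substring(0,m2.start(2)) + replacement + "/" + m2.group(3);
               tmphtml = tmphtml.substring(m2.end(3));
               m2 = p2.matcher(tmphtml);
          }

          // Append the text after the last match (or everything, if nothing matched).
          newhtml += tmphtml;

          System.out.println(newhtml);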

     }
}
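
As an alternative to cutting the string up by hand, java.util.regex offers Matcher.appendReplacement and Matcher.appendTail for exactly this kind of scan-and-rebuild loop. The following is a minimal sketch under assumptions: the class ImgSrcBlobRewriter, the method rewrite and the combined separator class [/\\] are choices made for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcBlobRewriter {

    // Group 1: the tag up to and including src="
    // Group 2: the directory part of the path (discarded)
    // Group 3: the file name plus the closing quote
    private static final Pattern IMG_SRC = Pattern.compile(
            "(<img[^>]* src=\")([^\"]*)[/\\\\]([^\"]*\")", Pattern.CASE_INSENSITIVE);

    static String rewrite(String html, String newDirectory) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replaced = m.group(1) + newDirectory + "/" + m.group(3);
            // Copies the text between matches, then the (quoted) replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(replaced));
        }
        // Copies everything after the last match; with no match, the whole input.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body><img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">"
                + "<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\"></body></html>";
        System.out.println(rewrite(html, "newDirectory"));
    }
}
```

Because appendTail always runs, input without any matching img tag comes back unchanged rather than empty.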
 
LVL 15

Expert Comment

by:ozymandias
ID: 8221851
Part2 : so given: <img src="..."> it would then become <test:img src="..." />?

OK. We have to add some more capture groups here and fiddle with the patterns a bit, but this should do the trick :

import java.util.regex.*;

public class RegExTest2{

     static String html = new String("<html>\n<head>\n</head>\n<body>\n<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">\n<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">\n<img height=\"9\" src=\"/another/test/picture.gif\" width=\"3\" alt=\"text\">\n</body>\n</html>");

     static String replacement = "newDirectory";

     public static void main(String[] args){

          Pattern p1 = Pattern.compile("(<)(img[^>]* src=\")([^\"]*)(/[^>]*)(>)",Pattern.CASE_INSENSITIVE);
          Pattern p2 = Pattern.compile("(<)(img[^>]* src=\")([^\"]*\\\\)([^>]*)(>)",Pattern.CASE_INSENSITIVE);

          String tmphtml = html;
          String newhtml = "";
          boolean match = false;

          Matcher m1 = p1.matcher(tmphtml);

          while (m1.find()){
               newhtml += tmphtml.substring(0,m1.end(1)) + "test:" + m1.group(2) + replacement + m1.group(4) + "/" + m1.group(5);
               tmphtml = tmphtml.substring(m1.end(5));
               m1 = p1.matcher(tmphtml);
               match = true;
          }

          if (match){
               newhtml += tmphtml;
               tmphtml = newhtml;
               newhtml = "";
               match = false;
          }

          Matcher m2 = p2.matcher(tmphtml);

          while (m2.find()){
               newhtml += tmphtml.substring(0,m2.end(1)) + "test:" + m2.group(2) + replacement + "/" + m2.group(4) + "/" + m2.group(5);
               tmphtml = tmphtml.substring(m2.end(5));
               m2 = p2.matcher(tmphtml);
               match = true;
          }

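          // Always append the remainder: if only the first pattern matched,
          // tmphtml already holds the complete result of that pass.
          newhtml += tmphtml;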

          System.out.println(html + "\n\n was changed to\n\n" + newhtml );

     }
}
 
LVL 15

Accepted Solution

by:
ozymandias earned 2000 total points
ID: 8221911
Finally, here is an optimized version of the code.
I have found one pattern that matches either of the previous two, which makes things much simpler and neater.

import java.util.regex.*;

public class RegExTest2{

     static String html = new String("<html>\n<head>\n</head>\n<body>\n<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">\n<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">\n<img height=\"9\" src=\"/another/test/picture.gif\" width=\"3\" alt=\"text\">\n</body>\n</html>");

     static String replacement = "newDirectory";

     public static void main(String[] args){

          Pattern p1 = Pattern.compile("(<)(img[^>]* src=\")([^\"]*)[/\\\\]([^>]*)(>)",Pattern.CASE_INSENSITIVE);

          String tmphtml = html;
          String newhtml = "";
          boolean match = false;

          Matcher m1 = p1.matcher(tmphtml);

          while (m1.find()){
               newhtml += tmphtml.substring(0,m1.end(1)) + "test:" + m1.group(2) + replacement + "/" + m1.group(4) + "/" + m1.group(5);
               tmphtml = tmphtml.substring(m1.end(5));
               m1 = p1.matcher(tmphtml);
               match = true;
          }

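          // Append the remainder unconditionally so that newhtml equals the
          // input when nothing matched, instead of being left empty.
          newhtml += tmphtml;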

          System.out.println(match + "\n" + html + "\n\n was changed to\n\n" + newhtml );

     }
}
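
With a single combined pattern, the whole rewrite can arguably be reduced to one Matcher.replaceAll call that uses $1 and $2 to refer back to the captured groups. This is a sketch, not the accepted code: the class ImgTagOneLiner, the method rewrite and the reduced two-group pattern are assumptions made for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgTagOneLiner {

    // Group 1: "img ... src=" up to the opening quote of the src value
    // Group 2: the file name and the rest of the tag, without the closing '>'
    // The directory part and the last separator are matched but not captured.
    private static final Pattern IMG = Pattern.compile(
            "<(img[^>]* src=\")[^\"]*[/\\\\]([^>]*)>", Pattern.CASE_INSENSITIVE);

    static String rewrite(String html, String prefix, String newDirectory) {
        // quoteReplacement keeps '$' and '\' in the inserted text literal.
        String replacement = "<" + Matcher.quoteReplacement(prefix) + "$1"
                + Matcher.quoteReplacement(newDirectory) + "/$2/>";
        return IMG.matcher(html).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String html = "<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">\n"
                + "<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">";
        System.out.println(rewrite(html, "test:", "newDirectory"));
    }
}
```

As with the pattern above, an img tag whose src contains no slash or backslash at all is left untouched, so it would not receive the prefix either.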
 

Author Comment

by:sapientconceptions
ID: 8222024
wow.  The condensed regular expression alone would have taken me weeks to figure out.  

Thanks for all your help.

(added another 250 for great help quickly)...
 

Author Comment

by:sapientconceptions
ID: 8222025
Terrific helper!
 

Author Comment

by:sapientconceptions
ID: 8222077
Ok, lol, sorry.  But one last question for the road.

Is it possible to grab the value that's being replaced before it's replaced and maybe add those values to a string array?

soo...
String[] replacedValues = new String[];
<img src="c:\temp\junk.gif"...> would make replacedValues[0]="c:\temp\junk.gif" ?
 
LVL 15

Expert Comment

by:ozymandias
ID: 8223454
Yes.

Basically, in the code above, we break the tag into 5 capture groups. Group 3 is the directory part we are replacing, and group 4 is the file name plus the rest of the tag up to the closing '>'. So, if you wanted a record of every replaced value you would do something like this :

import java.util.regex.*;
import java.util.Vector;

public class RegExTest2{

     static String html = new String("<html>\n<head>\n</head>\n<body>\n<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">\n<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">\n<img height=\"9\" src=\"/another/test/picture.gif\" width=\"3\" alt=\"text\">\n</body>\n</html>");

     static String replacement = "newDirectory";

     public static void main(String[] args){

          Pattern p1 = Pattern.compile("(<)(img[^>]* src=\")([^\"]*)[/\\\\]([^>]*)(>)",Pattern.CASE_INSENSITIVE);

          Vector replacedValues = new Vector();
          String tmphtml = html;
          String newhtml = "";
          boolean match = false;

          Matcher m1 = p1.matcher(tmphtml);

          while (m1.find()){
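               // The separator matched by [/\\] sits between groups 3 and 4 and is not
               // captured, so take directory plus separator straight from the input,
               // then append the file name (group 4 up to the closing quote).
               String rest = m1.group(4);
               replacedValues.add(tmphtml.substring(m1.start(3), m1.start(4)) + rest.substring(0, rest.indexOf("\"")));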
               newhtml += tmphtml.substring(0,m1.end(1)) + "test:" + m1.group(2) + replacement + "/" + m1.group(4) + "/" + m1.group(5);
               tmphtml = tmphtml.substring(m1.end(5));
               m1 = p1.matcher(tmphtml);
               match = true;
          }

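          // Append the remainder unconditionally so that newhtml equals the
          // input when nothing matched, instead of being left empty.
          newhtml += tmphtml;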

          System.out.println(match + "\n" + html + "\n\n was changed to\n\n" + newhtml );
          System.out.println("\n" + replacedValues.size() + " replacements were made.\n\nReplaced Values : \n");
          for (int v = 0; v < replacedValues.size(); v++){
               System.out.println(replacedValues.elementAt(v).toString());
          }
     }
}
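
If a plain String[] is wanted, as in the question, one option is to collect the original src values into a List and convert it at the end. The sketch below is an assumption-laden illustration: the class ImgSrcCollector and the method originalSources are hypothetical, and the pattern captures every src value, including ones without a directory that the rewrite above would skip.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcCollector {

    // Group 1 is the complete, unmodified src value of an img tag.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img[^>]* src=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static String[] originalSources(String html) {
        List<String> found = new ArrayList<String>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        // Convert to an array only once the number of matches is known.
        return found.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String html = "<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">\n"
                + "<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">";
        String[] values = originalSources(html);
        for (int i = 0; i < values.length; i++) {
            System.out.println("replacedValues[" + i + "]=" + values[i]);
        }
    }
}
```

Running this before the rewrite keeps the two concerns separate, at the cost of scanning the input twice.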

Question has a verified solution.
