Regular Expression - using regex...

Posted on 2003-03-27
Last Modified: 2011-09-20
I'm having a difficult time applying my regular expressions with the Java regex package. Hopefully someone here can help me out.

I am interested in parsing an HTML string to find each <img src=".."> tag and replace the variable directory portion of the src attribute with a constant value. Note that the src attribute can appear anywhere in the img tag, and that the path can use forward slashes, as in <img src="/tmp/test/image.gif">, or backslashes, as in <img height="8" src="C:\tmp\test.gif">.

Given the following html example, I've come up with three expressions that work in an editor.
<img height="9" src="/test/this/junk.gif" width="3">
<img src="c:\test\this\junk.gif" alt="helpme">

This first expression removes all the directories from the img tag if it has a forward slash in it:
$s/<img\([^>]*\)src="[^"]*\/\([^\/]*\)"/<img\1src="\2"/g

results in:
<img height="9" src="junk.gif" width="3">
<img src="c:\test\this\junk.gif" alt="helpme">


The second expression removes all the directories from the img tag if it has a backward slash in it:
$s/<img\([^>]*\)src="[^"]*\\\([^\/]*\)"/<img\1src="\2"/g

results in:
<img height="9" src="junk.gif" width="3">
<img src="junk.gif" alt="helpme">

The last expression takes the modified HTML and inserts the new constant directory:
$s/<img\([^>]*\)src="\([^"]*\)"/<img\1src="newDirectory\/\2"/

results in:
<img height="9" src="newDirectory/junk.gif" width="3">
<img src="newDirectory/junk.gif" alt="helpme">

I can't get this to work using the regex package and can't find much reference to tutorials that use complex regex.  Anyone able to assist?
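For comparison, the three editor substitutions above map fairly directly onto String.replaceAll in Java. The following is a minimal sketch rather than a definitive port: the class name SedStyleRewrite and the method rewrite are made up for illustration, and the file-name groups are narrowed so they cannot run past the closing quote (an assumption, since a Java string may hold many lines and tags at once, unlike a single editor line).

```java
class SedStyleRewrite {

    // Applies the three substitutions in order and returns the rewritten HTML.
    static String rewrite(String html) {
        // 1. Strip everything up to the last forward slash inside src="...".
        String s = html.replaceAll("(<img[^>]*src=\")[^\"]*/([^\"/]*\")", "$1$2");
        // 2. Strip everything up to the last backslash inside src="...".
        s = s.replaceAll("(<img[^>]*src=\")[^\"]*\\\\([^\"\\\\]*\")", "$1$2");
        // 3. Prefix the bare file name with the constant directory.
        return s.replaceAll("(<img[^>]*src=\")([^\"]*\")", "$1newDirectory/$2");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">"));
        System.out.println(rewrite("<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">"));
    }
}
```

In a replacement string, $1 and $2 refer to the capture groups, much like \1 and \2 in the editor expressions. Each backslash in the regex has to be doubled once for the regex engine and once more for the Java string literal, which is why a single literal backslash is written as four.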

Question by:sapientconceptions

Expert Comment

by:ozymandias
ID: 8221269
java.util.regex has no s//g (substitute and global) command syntax as such; the nearest equivalent is the replaceAll method on Matcher and String.

The following class does what you want.
Let me know if it's not clear how it does it or why.


import java.util.regex.*;

public class RegExTest{

     static String[] tags = new String[]{
          "<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">",
          "<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">"
     };

     static String replacement = "newDirectory";

     public static void main(String[] args){

          Pattern p1 = Pattern.compile("<img([^>]*) src=\"([^\"]*)/.*",Pattern.CASE_INSENSITIVE);
          Pattern p2 = Pattern.compile("<img([^>]*) src=\"([^\"]*\\\\).*",Pattern.CASE_INSENSITIVE);

          for (int j = 0; j < tags.length; j++){
               Matcher m1 = p1.matcher(tags[j]);
               if (m1.matches()){
                    String update = tags[j].substring(0,m1.start(2)) + replacement + tags[j].substring(m1.end(2));
                    System.out.println(tags[j] + " was changed to " + update);
                    continue;
               }
               Matcher m2 = p2.matcher(tags[j]);
               if (m2.matches()){
                    String update = tags[j].substring(0,m2.start(2)) + replacement + "/" + tags[j].substring(m2.end(2));
                    System.out.println(tags[j] + " was changed to " + update);
               }
          }
     }
}
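To see why the substring splice above works, it may help to look at what the matcher reports for group 2. This is a small sketch that reuses the first pattern; the class GroupOffsets and the method describe are hypothetical names used only for this illustration. Matcher.start(n) and Matcher.end(n) give the offsets of capture group n in the input, with the end offset being exclusive.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupOffsets {

    static final Pattern P1 = Pattern.compile(
            "<img([^>]*) src=\"([^\"]*)/.*", Pattern.CASE_INSENSITIVE);

    // Returns group 2 (the directory part) with its start and end offsets,
    // or "no match" when the whole tag does not match the pattern.
    static String describe(String tag) {
        Matcher m = P1.matcher(tag);
        if (!m.matches()) {
            return "no match";
        }
        return m.group(2) + " [" + m.start(2) + "," + m.end(2) + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe("<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">"));
        System.out.println(describe("<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">"));
    }
}
```

Everything before start(2) and everything from end(2) onward is kept, so only the directory is swapped out. The backslash tag does not match this pattern at all, which is why the class needs the second pattern.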

Expert Comment

by:ozymandias
ID: 8221297
Actually, by rearranging the capturing group brackets you can get a slightly more elegant solution that avoids having to chop the string up using substring and instead you can just re-assemble the groups that matcher has found.

import java.util.regex.*;

public class RegExTest{

     static String[] tags = new String[]{
          "<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">",
          "<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">"
     };

     static String replacement = "newDirectory";

     public static void main(String[] args){

          Pattern p1 = Pattern.compile("(<img[^>]* src=\")([^\"]*)(/.*)",Pattern.CASE_INSENSITIVE);
          Pattern p2 = Pattern.compile("(<img[^>]* src=\")([^\"]*\\\\)(.*)",Pattern.CASE_INSENSITIVE);

          for (int j = 0; j < tags.length; j++){
               Matcher m1 = p1.matcher(tags[j]);
               if (m1.matches()){
                    String update = m1.group(1) + replacement + m1.group(3);
                    System.out.println(tags[j] + " was changed to " + update);
                    continue;
               }
               Matcher m2 = p2.matcher(tags[j]);
               if (m2.matches()){
                    String update = m2.group(1) + replacement + "/" + m2.group(3);
                    System.out.println(tags[j] + " was changed to " + update);
               }
          }
     }
}

Author Comment

by:sapientconceptions
ID: 8221509
ozymandias,

that was great! thanks for your help.  But even though you satisfied what I was looking for, I'm still in need of additional assistance.  The 'string' of html will actually be a complete blob (text field in database lingo), not an array of strings that are images.  I tried modifying the code to reflect a regular blob, but of course, it didn't match anything anymore.  [consider reading in an html file...just in my example the data is coming from a backend instead of from the user]

Also, what if I want to insert something immediately after the '<' and before the last '>' while still updating the src information as you did above? So given <img src="...">, it would become <test:img src="..." />?

thanks again...
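A likely reason the earlier code stops matching on a whole document: Matcher.matches() only succeeds when the entire input matches the pattern, while Matcher.find() looks for the next matching subsequence anywhere in the input. The sketch below illustrates the difference with a simplified img pattern; the pattern and the names MatchesVersusFind, matchesWhole and countTags are assumptions for illustration, not part of the code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVersusFind {

    // Simplified pattern: one complete img tag that has a src attribute.
    static final Pattern IMG = Pattern.compile(
            "<img[^>]* src=\"[^\"]*\"[^>]*>", Pattern.CASE_INSENSITIVE);

    // True only when the input is exactly one img tag and nothing else.
    static boolean matchesWhole(String input) {
        return IMG.matcher(input).matches();
    }

    // Counts img tags anywhere in the input by calling find() repeatedly.
    static int countTags(String input) {
        Matcher m = IMG.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String tag = "<img src=\"a.gif\">";
        String doc = "<body><img src=\"a.gif\"><img height=\"1\" src=\"b.gif\"></body>";
        System.out.println(matchesWhole(tag));
        System.out.println(matchesWhole(doc));
        System.out.println(countTags(doc));
    }
}
```

For a block of HTML, a find() loop over a pattern with no leading or trailing .* is usually the simpler route.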

Expert Comment

by:ozymandias
ID: 8221670
OK.

Part 1 : you have a big long string of html instead of an array of tags. This is the code modified to do the same replace but on all the img tags in a big long string.

import java.util.regex.*;

public class RegExTest2{

     static String html = new String("<html><head></head><body><img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\"><img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\"></body></html>");

     static String replacement = "newDirectory";

     public static void main(String[] args){

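          // No leading or trailing .* here: find() already scans the whole string,
          // and [^>]* keeps each match inside a single tag.
          Pattern p1 = Pattern.compile("(<img[^>]* src=\")([^\"]*)(/[^>]*>)",Pattern.CASE_INSENSITIVE);
          Pattern p2 = Pattern.compile("(<img[^>]* src=\")([^\"]*\\\\)([^>]*>)",Pattern.CASE_INSENSITIVE);

          String tmphtml = html;
          String newhtml = "";

          Matcher m1 = p1.matcher(tmphtml);

          while (m1.find()){
               // copy everything up to the directory, then the replacement and the rest of the tag
               newhtml += tmphtml.substring(0,m1.start(2)) + replacement + m1.group(3);
               // keep searching in the text that follows this tag
               tmphtml = tmphtml.substring(m1.end(3));
               m1 = p1.matcher(tmphtml);
          }

          // append whatever follows the last match, then run the second pass on the result
          tmphtml = newhtml + tmphtml;
          newhtml = "";

          Matcher m2 = p2.matcher(tmphtml);

          while (m2.find()){
               newhtml += tmphtml.substring(0,m2.start(2)) + replacement + "/" + m2.group(3);
               tmphtml = tmphtml.substring(m2.end(3));
               m2 = p2.matcher(tmphtml);
          }

          newhtml += tmphtml;

          System.out.println(newhtml);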

     }
}

Expert Comment

by:ozymandias
ID: 8221851
Part2 : so given: <img src="..."> it would then become <test:img src="..." />?

OK. We have to add some more capture groups here and fiddle with the patterns a bit, but this should do the trick :

import java.util.regex.*;

public class RegExTest2{

     static String html = new String("<html>\n<head>\n</head>\n<body>\n<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">\n<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">\n<img height=\"9\" src=\"/another/test/picture.gif\" width=\"3\" alt=\"text\">\n</body>\n</html>");

     static String replacement = "newDirectory";

     public static void main(String[] args){

          Pattern p1 = Pattern.compile("(<)(img[^>]* src=\")([^\"]*)(/[^>]*)(>)",Pattern.CASE_INSENSITIVE);
          Pattern p2 = Pattern.compile("(<)(img[^>]* src=\")([^\"]*\\\\)([^>]*)(>)",Pattern.CASE_INSENSITIVE);

          String tmphtml = html;
          String newhtml = "";
          boolean match = false;

          Matcher m1 = p1.matcher(tmphtml);

          while (m1.find()){
               newhtml += tmphtml.substring(0,m1.end(1)) + "test:" + m1.group(2) + replacement + m1.group(4) + "/" + m1.group(5);
               tmphtml = tmphtml.substring(m1.end(5));
               m1 = p1.matcher(tmphtml);
               match = true;
          }

          if (match){
               newhtml += tmphtml;
               tmphtml = newhtml;
               newhtml = "";
               match = false;
          }

          Matcher m2 = p2.matcher(tmphtml);

          while (m2.find()){
               newhtml += tmphtml.substring(0,m2.end(1)) + "test:" + m2.group(2) + replacement + "/" + m2.group(4) + "/" + m2.group(5);
               tmphtml = tmphtml.substring(m2.end(5));
               m2 = p2.matcher(tmphtml);
               match = true;
          }

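          // Append the remaining text unconditionally; if this were skipped when the
          // second pattern finds nothing, the result of the first pass would be lost.
          newhtml += tmphtml;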

          System.out.println(html + "\n\n was changed to\n\n" + newhtml );

     }
}

Accepted Solution

by:
ozymandias earned 2000 total points
ID: 8221911
Finally, here is an optimized version of the code.
I have found one pattern that matches either of the previous two, which makes things much simpler and neater.

import java.util.regex.*;

public class RegExTest2{

     static String html = new String("<html>\n<head>\n</head>\n<body>\n<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">\n<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">\n<img height=\"9\" src=\"/another/test/picture.gif\" width=\"3\" alt=\"text\">\n</body>\n</html>");

     static String replacement = "newDirectory";

     public static void main(String[] args){

          Pattern p1 = Pattern.compile("(<)(img[^>]* src=\")([^\"]*)[/\\\\]([^>]*)(>)",Pattern.CASE_INSENSITIVE);

          String tmphtml = html;
          String newhtml = "";
          boolean match = false;

          Matcher m1 = p1.matcher(tmphtml);

          while (m1.find()){
               newhtml += tmphtml.substring(0,m1.end(1)) + "test:" + m1.group(2) + replacement + "/" + m1.group(4) + "/" + m1.group(5);
               tmphtml = tmphtml.substring(m1.end(5));
               m1 = p1.matcher(tmphtml);
               match = true;
          }

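          // Append the remaining text unconditionally, so newhtml still holds the
          // full document when no img tag matches.
          newhtml += tmphtml;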

          System.out.println(match + "\n" + html + "\n\n was changed to\n\n" + newhtml );

     }
}
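The same rewrite can also be expressed with Matcher.appendReplacement and Matcher.appendTail, which take care of copying the unmatched text between and after the tags. This is a sketch under a few assumptions rather than a drop-in replacement: the class and method names are hypothetical, the pattern is a slightly regrouped variant of the one above, and StringBuffer is used because it is the buffer type these methods accept on every Java version that has them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementRewrite {

    // Group 1: img ... src="   Group 2: file name and the rest of the tag.
    // The directory and its final slash or backslash are matched but not captured.
    static final Pattern IMG = Pattern.compile(
            "<(img[^>]* src=\")[^\"]*[/\\\\]([^>]*)>", Pattern.CASE_INSENSITIVE);

    static String rewrite(String html, String directory) {
        Matcher m = IMG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String tag = "<test:" + m.group(1) + directory + "/" + m.group(2) + "/>";
            // quoteReplacement stops '$' and '\' in the tag from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        m.appendTail(out); // copy the text after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<body>\n<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">\n"
                + "<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">\n</body>";
        System.out.println(rewrite(html, "newDirectory"));
    }
}
```

Because appendTail always runs, input without any matching img tag comes back unchanged, and the matcher never has to be re-created on a shortened string.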

Author Comment

by:sapientconceptions
ID: 8222024
wow.  The condensed regular expression alone would have taken me weeks to figure out.  

Thanks for all your help.

(added another 250 for great help quickly)...

Author Comment

by:sapientconceptions
ID: 8222025
Terrific helper!

Author Comment

by:sapientconceptions
ID: 8222077
Ok, lol, sorry.  But one last question for the road.

Is it possible to grab the value that's being replaced before it's replaced and maybe add those values to a string array?

soo...
String[] replacedValues = new String[];
<img src="c:\temp\junk.gif"...> would make replacedValues[0]="c:\temp\junk.gif" ?

Expert Comment

by:ozymandias
ID: 8223454
Yes.

Basically, in the code above, we break the tag/string into 5 capture groups. Group 3 is the directory we are replacing, and group 4 is the rest of the tag, starting with the image name. The slash or backslash between them is matched but not captured. So, if you wanted a record of every replaced value you would do something like this:

import java.util.regex.*;
import java.util.Vector;

public class RegExTest2{

     static String html = new String("<html>\n<head>\n</head>\n<body>\n<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">\n<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">\n<img height=\"9\" src=\"/another/test/picture.gif\" width=\"3\" alt=\"text\">\n</body>\n</html>");

     static String replacement = "newDirectory";

     public static void main(String[] args){

          Pattern p1 = Pattern.compile("(<)(img[^>]* src=\")([^\"]*)[/\\\\]([^>]*)(>)",Pattern.CASE_INSENSITIVE);

          Vector replacedValues = new Vector();
          String tmphtml = html;
          String newhtml = "";
          boolean match = false;

          Matcher m1 = p1.matcher(tmphtml);

          while (m1.find()){
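               // The separator matched by [/\\] is not part of any group, so take the original
               // src value straight from tmphtml: from the start of group 3 to the closing quote.
               replacedValues.add(tmphtml.substring(m1.start(3), m1.start(4) + m1.group(4).indexOf("\"")));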
               newhtml += tmphtml.substring(0,m1.end(1)) + "test:" + m1.group(2) + replacement + "/" + m1.group(4) + "/" + m1.group(5);
               tmphtml = tmphtml.substring(m1.end(5));
               m1 = p1.matcher(tmphtml);
               match = true;
          }

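          // Append the remaining text unconditionally, so newhtml still holds the
          // full document when no img tag matches.
          newhtml += tmphtml;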

          System.out.println(match + "\n" + html + "\n\n was changed to\n\n" + newhtml );
          System.out.println("\n" + replacedValues.size() + " replacements were made.\n\nReplaced Values : \n");
          for (int v = 0; v < replacedValues.size(); v++){
               System.out.println(replacedValues.elementAt(v).toString());
          }
     }
}
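If only the list of original src values is needed, a typed list and a pattern that captures the whole attribute value keep the bookkeeping small. The following is a sketch assuming Java 7 or later (for generics and the diamond operator); the names OriginalSrcCollector and collect are hypothetical, and the pattern is an assumed variant that puts the complete src value, separator included, into group 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OriginalSrcCollector {

    // Group 1 is the complete src value; it must contain at least one slash or backslash.
    static final Pattern IMG_SRC = Pattern.compile(
            "<img[^>]* src=\"([^\"]*[/\\\\][^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the src value of every img tag whose path includes a directory.
    static List<String> collect(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<img height=\"9\" src=\"/test/this/junk.gif\" width=\"3\">\n"
                + "<img src=\"c:\\test\\this\\junk.gif\" alt=\"helpme\">";
        for (String src : collect(html)) {
            System.out.println(src);
        }
    }
}
```

A List<String> grows as matches are found, which avoids having to size a String[] before the number of img tags is known.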