regex for <body> tag

I'm working on Java code that reads in an HTML file, checks for the <body> tag, and then inserts text after it.  The body tag may vary from looking like this <body> to
<body lang=EN-US
style='tab-interval:.5in'>

How do I check for the second version of the body tag?  I currently have this as my regex string:
<body[a-zA-Z0-9]*>
but that does not work, as it never finds a match.  Any ideas what the regex needs to look like?
newbieal Asked:

ddrudik Commented:
<body[^>]*>

newbieal (Author) Commented:
Thanks, but that doesn't seem to work:
The <body> tag is split over two lines (that may not always be the case but I have to account for that in the regex):

<body lang=EN-US
style='tab-interval:.5in'>

ddrudik Commented:
That works for me; please show your code.
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1{
  public static void main(String[] asd){
    String sourcestring = "source string to match with pattern";
    Pattern re = Pattern.compile("<body[^>]*>",Pattern.CASE_INSENSITIVE);
    Matcher m = re.matcher(sourcestring);
    if(m.find()){
      // group 0 is the whole match; groupCount() does not count it, hence <=
      for( int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++ ){
        System.out.println( "[" + groupIdx + "] = " + m.group(groupIdx));
      }
    }
  }
}

Peter Kwan (Analyst Programmer) Commented:
You may try replacing the carriage returns in your string before matching it against the regular expression.

ddrudik Commented:
Any carriage returns would be matched by the character set [^>]*, so I would need to see the code used to understand the issue. I assume the pattern used is not the one shown in my code post.
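
To illustrate the point about the character set, here is a small sketch comparing `[^>]*` with `.*` on a tag that spans two lines. The class name `NegatedClassDemo`, the helper `found`, and the sample string are made up for this illustration. In Java, `.` does not match line terminators unless `Pattern.DOTALL` is set, whereas a negated class such as `[^>]` matches any character other than `>`, line breaks included.

```java
import java.util.regex.Pattern;

class NegatedClassDemo {
    // Hypothetical sample: a body tag split across two lines
    static final String TAG = "<body lang=EN-US\r\nstyle='tab-interval:.5in'>";

    // True when the regex, compiled with the given flags, finds a match in TAG
    static boolean found(String regex, int flags) {
        return Pattern.compile(regex, flags).matcher(TAG).find();
    }

    public static void main(String[] args) {
        System.out.println(found("<body[^>]*>", 0));           // true: negated class spans the line break
        System.out.println(found("<body.*>", 0));              // false: '.' stops at the line break
        System.out.println(found("<body.*>", Pattern.DOTALL)); // true: DOTALL lets '.' cross it
    }
}
```

So if a pattern fails on a multi-line tag held in a single string, a `.`-based pattern is one likely cause; the `[^>]*` form should not be affected.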

newbieal (Author) Commented:
Here is what I have:

String thisLine = "";
String regex = "<body[^>]*>";
Pattern p = Pattern.compile(regex);
Matcher m;
 
while ((thisLine = in.readLine()) != null) 
		{	
			
			m = p.matcher(thisLine.toLowerCase());
			//save each line read to new html file
			out.println(thisLine);
			while(m.find()){
				//add new content at this string position
				out.println(lineToBeInserted);
			}
			
		}

ddrudik Commented:
I would need to see what was in thisLine to see why it is not working; however, the pattern given is correct and matches a body tag regardless of the content in the tag.

newbieal (Author) Commented:
thisLine contains the first line of this:

<body lang=EN-US
style='tab-interval:.5in'>

ddrudik Commented:

import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1{
  public static void main(String[] asd){
  String sourcestring = "<body lang=EN-US \r\n"+"style='tab-interval:.5in'>";
  System.out.println(sourcestring);
  Pattern re = Pattern.compile("<body[^>]*>",Pattern.CASE_INSENSITIVE);
  Matcher m = re.matcher(sourcestring);
    if(m.find()){
      System.out.println("[0] = " + m.group(0));
    }
  }
}

newbieal (Author) Commented:
Thanks, but I have to keep it more generic than that, as the tag will not always look like this:

<body lang=EN-US
style='tab-interval:.5in'>

Some of the docs might just have this:
<body>

Or this:
<body style="">

and so on....

ddrudik Commented:
newbieal, read line 7 in 22886439 again (the Pattern.compile call) and see that it will match "<body" followed by anything up to the first ">"; the code shown would match all of your examples.

Feel free to change the sourcestring in my code example to any of your desired body tags and test it to see the sourcestring used and the match found.
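
As a rough way to run that check over all three tags at once, the sketch below tries both the suggested pattern and the pattern from the original question against each example. The class name `BodyTagVariants` and the helper `matches` are hypothetical names chosen for this illustration.

```java
import java.util.regex.Pattern;

class BodyTagVariants {
    // Pattern suggested in this thread
    static final Pattern SUGGESTED = Pattern.compile("<body[^>]*>", Pattern.CASE_INSENSITIVE);
    // Pattern from the original question: only letters and digits allowed after "<body"
    static final Pattern ORIGINAL = Pattern.compile("<body[a-zA-Z0-9]*>");

    static boolean matches(Pattern p, String tag) {
        return p.matcher(tag).find();
    }

    public static void main(String[] args) {
        String[] tags = {
            "<body>",
            "<body style=\"\">",
            "<body lang=EN-US\nstyle='tab-interval:.5in'>"
        };
        for (String tag : tags) {
            // prints "suggested / original" for each tag
            System.out.println(matches(SUGGESTED, tag) + " / " + matches(ORIGINAL, tag));
        }
    }
}
```

The original pattern only matches the bare `<body>` because spaces, `=`, `-`, quotes and line breaks are all outside `[a-zA-Z0-9]`.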

Peter Kwan (Analyst Programmer) Commented:
Of course that does not work, since readLine() gives you

thisLine = "<body lang=EN-US";

and

thisLine = "style='tab-interval:.5in'> "

in two separate iterations of the loop, so the pattern never sees the whole tag at once. You may consider the following:


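			while ((thisLine = in.readLine()) != null)
			{
				out.println(thisLine); // copy each line to the new html file
				int start = thisLine.toLowerCase().indexOf("<body");
				if (start >= 0) {
					// the tag may close on a later line, so keep copying until '>' appears
					boolean closed = thisLine.indexOf('>', start) >= 0;
					while (!closed && (thisLine = in.readLine()) != null) {
						out.println(thisLine);
						closed = thisLine.indexOf('>') >= 0;
					}
					// add your content
					if (closed) out.println(lineToBeInserted);
				}
			}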
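
An alternative, sketched here under the assumption that the file is small enough to hold in memory, is to collect the whole file into one string first and run the regex from this thread over it, so a tag split across lines is seen in one piece. The class `BodyInserter` and the method `insertAfterBody` are hypothetical names; how the string is built (for example by appending each `readLine()` result plus a newline to a `StringBuilder`) is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BodyInserter {
    private static final Pattern BODY_TAG =
        Pattern.compile("<body[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns html with insertion placed right after the first body tag,
    // or html unchanged when no body tag is found.
    static String insertAfterBody(String html, String insertion) {
        Matcher m = BODY_TAG.matcher(html);
        if (!m.find()) {
            return html;
        }
        int end = m.end(); // index just past the closing '>'
        return html.substring(0, end) + insertion + html.substring(end);
    }

    public static void main(String[] args) {
        // Made-up sample document with the tag split over two lines
        String html = "<html>\n<body lang=EN-US\nstyle='tab-interval:.5in'>\n<p>Hi</p>\n</body>\n</html>";
        System.out.println(insertAfterBody(html, "\n<p>inserted</p>"));
    }
}
```

This trades memory for simplicity: there is no line-by-line state to track, but the whole document has to fit in a single string.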
