newbieal asked:

regex for <body> tag

I'm working on Java code that reads in an HTML file, checks for the <body> tag, and then inserts text after it. The body tag may vary from looking like this: <body> to this:
<body lang=EN-US
style='tab-interval:.5in'>

How do I check for the second version of the body tag? I currently have this as my regex string:
<body[a-zA-Z0-9]*>
but it does not work, as it never finds a match. Any ideas what the regex needs to look like?
ddrudik

<body[^>]*>
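
To see how the two patterns differ, here is a minimal sketch that runs both against the tag variants mentioned in this thread. The class and method names (BodyTagPatterns, found) are made up for illustration. The original pattern only allows letters and digits after "<body", so a space, "=" or a quote stops it, while [^>]* accepts any character except ">".

```java
import java.util.regex.Pattern;

class BodyTagPatterns {
    // Pattern from the question: only letters and digits may follow "<body".
    static final Pattern ORIGINAL = Pattern.compile("<body[a-zA-Z0-9]*>");
    // Suggested pattern: any character except '>' may follow "<body".
    static final Pattern SUGGESTED = Pattern.compile("<body[^>]*>");

    // Returns true if the pattern occurs anywhere in the given text.
    static boolean found(Pattern p, String html) {
        return p.matcher(html).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<body>",
            "<body style=\"\">",
            "<body lang=EN-US style='tab-interval:.5in'>"
        };
        for (String s : samples) {
            System.out.println(s + "  original=" + found(ORIGINAL, s)
                    + "  suggested=" + found(SUGGESTED, s));
        }
    }
}
```

Only the bare <body> sample satisfies the original pattern; the suggested one matches all three.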
newbieal (ASKER)

Thanks, but that doesn't seem to work.
The <body> tag is split over two lines (that may not always be the case, but the regex has to account for it):

<body lang=EN-US
style='tab-interval:.5in'>
SOLUTION
ddrudik
You could try removing the carriage returns from your string before matching it against the regular expression.
Any carriage returns would be matched by the character class [^>]*, so I would need to see the code you are using to understand the issue. I assume the pattern in use is not the one shown in my code post.
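
The point about carriage returns can be checked with a small sketch, assuming the whole tag is present in a single string. The names MultiLineBodyTag and firstMatch are hypothetical. A negated character class such as [^>] matches line terminators without any extra flags, whereas "." does not match them by default.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiLineBodyTag {
    // Returns the first substring matching the regex, or null if there is none.
    static String firstMatch(String regex, String html) {
        Matcher m = Pattern.compile(regex).matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The tag is split over two lines, but the string holds both lines.
        String html = "<html>\r\n<body lang=EN-US\r\nstyle='tab-interval:.5in'>\r\n<p>Hi</p>";

        // [^>] also matches \r and \n, so the whole tag is found.
        System.out.println(firstMatch("<body[^>]*>", html));

        // '.' stops at line terminators unless Pattern.DOTALL is set,
        // so this prints null for the same input.
        System.out.println(firstMatch("<body.*>", html));
    }
}
```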
Here is what I have:

String thisLine = "";
String regex = "<body[^>]*>";
Pattern p = Pattern.compile(regex);
Matcher m;

while ((thisLine = in.readLine()) != null) {
    m = p.matcher(thisLine.toLowerCase());
    // save each line read to new html file
    out.println(thisLine);
    while (m.find()) {
        // add new content at this string position
        out.println(lineToBeInserted);
    }
}


I would need to see what was in thisLine to see why it is not working. However, the pattern given is correct and matches a body tag regardless of the content inside the tag.
thisLine contains the first line of this:

<body lang=EN-US
style='tab-interval:.5in'>
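
That detail likely explains the failure: readLine() returns one line at a time without its line terminator, so when the tag spans two lines, no single thisLine ever contains both "<body" and the closing ">". One possible approach, sketched below under the assumption that the file is small enough to hold in memory, is to match against the whole document as one string and insert the new text right after the end of the match. The names BodyTagInsert and insertAfterBody are hypothetical, and loading the file (for example with Files.readString on Java 11 or later) is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BodyTagInsert {
    // CASE_INSENSITIVE replaces the toLowerCase() call, so <BODY ...> also matches
    // and the original text is left untouched.
    private static final Pattern BODY_TAG =
            Pattern.compile("<body[^>]*>", Pattern.CASE_INSENSITIVE);

    // Inserts the given text on a new line after the first body tag.
    // Returns the input unchanged if no body tag is found.
    static String insertAfterBody(String html, String insertion) {
        Matcher m = BODY_TAG.matcher(html);
        if (!m.find()) {
            return html;
        }
        StringBuilder sb = new StringBuilder(html);
        sb.insert(m.end(), "\n" + insertion);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the file contents, read as a single string.
        String html = "<html>\n<body lang=EN-US\nstyle='tab-interval:.5in'>\n<p>Hi</p>\n</body>\n</html>";
        System.out.println(insertAfterBody(html, "<p>inserted text</p>"));
    }
}
```

The same method handles <body>, <body style=""> and the two-line variant, because the match no longer depends on where the line breaks fall.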
SOLUTION
Thanks, but I have to keep it more generic than that, as the tag will not always look like this:

<body lang=EN-US
style='tab-interval:.5in'>

Some of the docs might just have this:
<body>

Or this:
<body style="">

and so on....
newbieal, read line 7 in 22886439 again: it matches "<body" followed by anything up to the closing ">", so the code shown would match all of your examples.

Feel free to change the sourcestring in my code example to any of your desired body tags and test it to see the sourcestring used and the match found.
ASKER CERTIFIED SOLUTION