Solved

Separation of tags from text in HTML file

Posted on 2007-12-04
Last Modified: 2010-05-18
Hi,

I'm trying to find a regex (in Ruby) to split up an HTML document so that tags are on their own lines, separate from the text. See the examples below.

There is a complication in that I'm evaluating the file line by line, but sometimes a tag covers more than one line.
<div align="center"><i>All Rights Reserved</i> </div>
 
<!-- should become: -->
 
<div align="center">
<i>
All Rights Reserved
</i>
</div>
 
 
<!-- the regex I've developed is:
    /(<[^>]*>?)*([^<]*)?/
    but this fails to note some tags after text.
    To exemplify, the above example becomes: -->
 
<div align="center">
<i>
All Rights Reserved
</div>
 
<!-- missing the </i> -->

Question by:Synthetics
LVL 60

Expert Comment

by:Geert Bormans
ID: 20402653
I would assume that this regex is OK:
/(<[^>]*>?)*([^<]*)?/
but you have to use a technique that allows multiple replacements in a single line.

Do you actually want to remove the line breaks inside a tag?

I would go for this regex:
/(<[^<>]+(>?))/

The extra pair of parentheses lets you test whether a tag that starts on this line is still open at the end of the line (the inner group is empty when the closing > is missing).
Then I would put a newline before the tag opener and a newline after the tag closer, if it is there.
There is no need to match the text in between tags.
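
The replacement idea above could be sketched roughly as follows. This is only an illustration, not the poster's own code: the helper name split_tags is hypothetical, and it assumes every tag opens and closes on the same line, so the optional (>?) group is dropped.

```ruby
# Hypothetical helper: put every complete tag on its own line.
# Assumes no tag spans more than one line.
def split_tags(line)
  line
    .gsub(/<[^<>]+>/) { |tag| "\n#{tag}\n" } # newline before "<" and after ">"
    .split("\n")                             # break into candidate lines
    .reject { |piece| piece.strip.empty? }   # drop blank or whitespace-only pieces
end

puts split_tags(%(<div align="center"><i>All Rights Reserved</i> </div>))
```

Because gsub replaces every match, several tags on one line are all handled in a single pass.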
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 20402663
PS: to make this work on all occurrences of a tag, use gsub().

cheers

Geert
 
LVL 5

Author Comment

by:Synthetics
ID: 20402672
That only seemed to capture the <!DOCTYPE ... > for some reason. My output code is below.
f = File.open("input.htm")
out = File.open("output.htm","w")
f.each do |line|
	tags = line.scan(/(<[^<>]+(>?))/)
	newlines = tags*"\n"
	newlines.each do |newline|
		if newline.chomp!.length > 0 then out.puts(newline) end
	end
end #f.each
out.close
f.close
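
A possible explanation for the output stopping after the first tag, offered as a guess rather than a confirmed diagnosis: when the pattern contains capture groups, scan returns one array per match instead of plain strings, and chomp! returns nil when there is nothing to remove, so calling .length on its result can raise. The small sketch below illustrates both behaviours on a made-up input line (String#each, used above, also no longer exists in current Ruby; each_line is the replacement).

```ruby
line = %(<p><b>x</b></p>)

# With capture groups, scan yields [whole_tag, closing_bracket] pairs.
pairs = line.scan(/(<[^<>]+(>?))/)

# Keep only the whole tag from each pair.
tags = pairs.map { |whole, _closer| whole }

# chomp! returns nil when the string has no trailing newline to remove.
unchanged = ">".chomp!

p pairs.first
p tags
p unchanged
```

Using the non-destructive chomp (without the bang) avoids the nil result.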

 
LVL 60

Expert Comment

by:Geert Bormans
ID: 20402780
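Have you tried something along these lines?
# Surround every tag with newlines so it lands on its own line.
# Adjacent tags will leave blank lines in between.
File.open("an-html-file.htm") do |f|
  f.each_line do |line|
    puts line.gsub(/(<[^<>]+(>?))/) { "\n#{$1}\n" }
  end
end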
 
LVL 5

Accepted Solution

by:
dberner9 earned 2000 total points
ID: 20833469
I've just done the following in irb:

line = %(<div align="center"><i>All Rights Reserved</i> </div>)
divided = line.scan(/(<[^>]*>|[^<]*)/).flatten.join("\n")
puts divided

which yields:

<div align="center">
<i>
All Rights Reserved
</i>
 
</div>

Does that work for you?
line.scan(/<[^>]*>|[^<]*/).join("\n")
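
For tags that span more than one line, one option is to scan the whole document as a single string instead of line by line, since [^>] also matches newlines. The sketch below is an assumption-laden variation on the scan above, not part of the accepted answer: the method name tags_on_own_lines is hypothetical, it uses [^<]+ so no empty matches are produced, and it collapses whitespace, which would alter whitespace-sensitive content such as pre blocks. It also does not handle a > inside comments or attribute values.

```ruby
# Hypothetical helper: split a whole HTML string into tags and text,
# one per line, dropping whitespace-only pieces.
def tags_on_own_lines(html)
  html
    .scan(/<[^>]*>|[^<]+/)                         # a tag, or a run of text
    .map { |piece| piece.gsub(/\s+/, " ").strip }  # fold line breaks inside a piece
    .reject(&:empty?)                              # skip blank pieces
    .join("\n")
end

# In practice the string could come from File.read("input.htm").
html = %(<div\n   align="center"><i>All Rights Reserved</i> </div>\n)
puts tags_on_own_lines(html)
```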
