  • Status: Solved
  • Priority: Medium
  • Security: Public

Regexp instead of XSLT

I need to process HTML that is not well-formed and that tools like Tidy *cannot* make well-formed, so I decided to apply some regexps to do the job.

My input is HTML with some extra custom tags that I need to extract, e.g.:
...
<table border="0" cellspacing="0" cellpadding="0" width="100%">
  <tr>
    <my-contenttype name="foo">
    <td>
      <table border="0" cellspacing="0" cellpadding="0" width="100%">
        <tr>
          <my-attribute name="bar">
          <td class="headline_01">
             FooBar
          </td>
          </my-attribute>    
        </tr>
      </table>
    </td>
    </my-contenttype>
  </tr>
</table>
...

I need to extract every opening and closing my-* tag, and also all text between the my-attribute tags.
The result of the regexp should be:

<my-contenttype name="foo">
  <my-attribute name="bar">
    FooBar
  </my-attribute>
</my-contenttype>

Can anybody help?

Asked by: Smoerble
1 Solution
 
Ortokobold commented:
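Here is a Perl one-liner that does the trick. It works line by line: lines holding a my-attribute or my-contenttype tag are kept, lines holding any other tag are dropped, and plain text lines pass through unchanged: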
perl -pe 's/(\s*<\/?(?:my-attribute|my-contenttype).*\/?>\s*)|(?:\s*<\/?\w+.*\/?>\s*)/$1/e' test.html
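
For comparison, here is a rough Python sketch of the same idea. It is not the one-liner above translated literally: it scans tag by tag rather than line by line, keeps text only while inside a my-attribute element, and re-indents the output by nesting depth. The function name extract_my_tags and the two-space indent are my own assumptions, and it assumes no attribute value contains a ">" character and that the my-* tags are never self-closing.

```python
import re

# Group 1 captures the custom tags to keep; the second alternative
# matches any other tag so it can be skipped.
TAG = re.compile(r'(</?my-(?:contenttype|attribute)\b[^>]*>)|<[^>]*>')


def extract_my_tags(html: str) -> str:
    """Return only the my-* tags plus the text inside my-attribute elements."""
    out = []
    depth = 0         # nesting depth of kept tags, used for indentation
    in_attribute = 0  # greater than 0 while inside <my-attribute> ... </my-attribute>
    pos = 0
    for m in TAG.finditer(html):
        # Text between the previous tag and this one.
        text = html[pos:m.start()].strip()
        if text and in_attribute:
            out.append("  " * depth + text)
        pos = m.end()

        tag = m.group(1)
        if tag is None:
            continue  # ordinary HTML tag: drop it
        if tag.startswith("</"):
            depth = max(depth - 1, 0)
            out.append("  " * depth + tag)
            if tag.startswith("</my-attribute"):
                in_attribute = max(in_attribute - 1, 0)
        else:
            out.append("  " * depth + tag)
            depth += 1
            if tag.startswith("<my-attribute"):
                in_attribute += 1
    return "\n".join(out)
```

To try it, read the file into a string and print the result, e.g. print(extract_my_tags(open("test.html").read())). Because it never builds a tree, it does not care whether the surrounding HTML is well-formed, but like any regexp approach it can be fooled by comments or scripts that contain tag-like text.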