CarlosScheidecker


Write a string manipulation algorithm for html source code


Say that I have a string object which holds the source code for an HTML page.

I want to remove any tables it contains.

If I take everything from the first <table to the last </table>, I can remove the whole section that includes the tables.

However, I need to remove just the strings of those tables themselves, and I do not want to use DOM or SAX because of their time cost.

The problem is very easy if I have the following situation:

<table>
  <tr>
     <th>One</th>
     <th>Two</th>
  </tr>
</table>
<table>
  <tr>
    <th>Three</th>
    <th>Four</th>
  </tr>
</table>

Here I have 2 tables in sequence, so I can easily copy those substrings into a string list (List<String>).

Now, when the tables are nested, such as:

<table>
  <tr>
    <th>One</th>
    <th>Two</th>
  </tr>
  <tr>
     <td>
       <table>
         <tr>
           <th>Three</th>
           <th>Four</th>
         </tr>
       </table>
    </td>
  </tr>
</table>


Then, if I want to separate the 2 tables into a string list, the first entry on the list would be the parent table as it is:


<table>
  <tr>
    <th>One</th>
    <th>Two</th>
  </tr>
  <tr>
     <td>
       <table>
         <tr>
           <th>Three</th>
           <th>Four</th>
         </tr>
       </table>
    </td>
  </tr>
</table>


And the second entry on the list would be the nested table:

<table>
  <tr>
    <th>Three</th>
    <th>Four</th>
  </tr>
</table>

Now, to determine that tables are nested, all I have to do is verify that after the first <table there is no following </table> but another <table.
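A minimal sketch of that nesting check, using only String.indexOf. The class and method names are made up for illustration, and it assumes lowercase tags written exactly as <table and </table>:

```java
final class NestingCheck {
    /**
     * Returns true when another "<table" appears after the first "<table"
     * and before the first "</table>", i.e. the first table contains a table.
     */
    static boolean firstTableIsNested(String html) {
        int firstOpen = html.indexOf("<table");
        if (firstOpen < 0) {
            return false; // no table at all
        }
        // Search for the next opening tag strictly after the first one.
        int nextOpen = html.indexOf("<table", firstOpen + 1);
        int nextClose = html.indexOf("</table>", firstOpen);
        return nextOpen >= 0 && nextClose >= 0 && nextOpen < nextClose;
    }
}
```

This only answers the yes/no question for the first table; it does not say how deep the nesting goes.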

I need to do this via string manipulation without involving any parsers like DOM or SAX. Any ideas how to make it easier?

Thanks.


for_yan

Well, if you don't want to use SAX, you can rather simply
write your own portion of it: just read the lines one by one, check indexOf("<table>")
and indexOf("</table>"), and increment or decrement a level parameter so that you know at which
level of table nesting you are currently parsing. It will not be as general as SAX,
as some people may write, say, <  table> with a space, or spread the tag across two lines, but this is not probable.
Consider using something like this: http://htmlparser.sourceforge.net/
CarlosScheidecker

ASKER

Mwochnick, I have tried that before and it is not efficient. It creates objects, and that takes time. As I stated in my question, string manipulation is what I need because it is much faster.
But SAX is rather efficient. If you write your own string manipulations,
they will certainly be less general, and where is the guarantee that they
would be faster?

Are you dealing with really huge files?
for_yan, not fast enough. I wrote a parser with SAX. I need to be able to do this in tenths of a second, not in seconds.
for_yan, your first comment is what I will be exploring. That is what I need to do. So I will create a small project to do that and see how it goes.
I'd consider using ByteArrayInputStream and ByteArrayOutputStream, and using arrays as pointers to your <table> and </table> tags.
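One possible reading of this suggestion, sketched under assumptions: scan the raw bytes, keep int offsets as the "pointers" to the outermost <table and </table> tags, and copy everything outside them to a ByteArrayOutputStream. The class and method names are invented for the example; it skips ByteArrayInputStream because the bytes are already in memory, and it assumes lowercase tags in an ASCII-compatible encoding such as UTF-8:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class TableStripper {
    private static final byte[] OPEN = "<table".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] CLOSE = "</table>".getBytes(StandardCharsets.US_ASCII);

    /** Returns a copy of html with every top-level table (and anything nested in it) removed. */
    static byte[] stripTables(byte[] html) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(html.length);
        int depth = 0;      // current table nesting level
        int copyFrom = 0;   // start of the pending run of bytes outside any table
        int tableStart = 0; // offset of the outermost open table, valid while depth > 0
        int i = 0;
        while (i < html.length) {
            if (matches(html, i, OPEN)) {
                if (depth == 0) {
                    tableStart = i;
                }
                depth++;
                i += OPEN.length;
            } else if (depth > 0 && matches(html, i, CLOSE)) {
                depth--;
                i += CLOSE.length;
                if (depth == 0) {
                    // Outermost table just closed: keep the bytes before it, skip the table itself.
                    out.write(html, copyFrom, tableStart - copyFrom);
                    copyFrom = i;
                }
            } else {
                i++;
            }
        }
        // Copy the tail; an unclosed table is left in place rather than silently dropped.
        out.write(html, copyFrom, html.length - copyFrom);
        return out.toByteArray();
    }

    private static boolean matches(byte[] data, int offset, byte[] pattern) {
        if (offset + pattern.length > data.length) {
            return false;
        }
        for (int k = 0; k < pattern.length; k++) {
            if (data[offset + k] != pattern[k]) {
                return false;
            }
        }
        return true;
    }
}
```

Working on bytes avoids creating intermediate String objects, which seems to be the concern in this thread, but the same caveats apply as with any hand-written scan: uppercase tags or a space after < would be missed.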
Yes, there is always a trade-off between speed and generality. Wish you success.
Mwochnick, that is what I am doing to parse XML: pointers. As for the rest, I want to yank the tables out first.
ASKER CERTIFIED SOLUTION
CarlosScheidecker

This is the correct answer, done in an easier way.