Solved

Write a string manipulation algorithm for HTML source code

Posted on 2011-03-02
Last Modified: 2012-05-11

Say that I have a String object that holds the source code of an HTML page.

From it, I want to remove the tables, if there are any.

If I find the first <table and the last </table>, I can remove the whole section that contains the tables.

However, I need to remove only the substrings of the tables themselves, and I do not want to use DOM or SAX because of their time cost.

The problem is very easy in the following situation:

<table>
  <tr>
     <th>One</th>
     <th>Two</th>
  </tr>
</table>
<table>
  <tr>
    <th>Three</th>
    <th>Four</th>
  </tr>
</table>

Here I have 2 tables in sequence, so I can easily copy those substrings into a string list (List<String>).
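
For the sequential case, a minimal sketch could look like the following. The class and method names are hypothetical, and it assumes lowercase tags, balanced markup, and no nesting:

```java
import java.util.ArrayList;
import java.util.List;

public class SequentialTables {
    // Collects each <table ...>...</table> substring.
    // Assumption: tables are NOT nested and tag names are lowercase.
    static List<String> extract(String html) {
        List<String> tables = new ArrayList<>();
        final String close = "</table>";
        int from = 0;
        while (true) {
            int start = html.indexOf("<table", from);
            if (start < 0) {
                break; // no more tables
            }
            int end = html.indexOf(close, start);
            if (end < 0) {
                break; // unbalanced markup: stop rather than guess
            }
            end += close.length();
            tables.add(html.substring(start, end));
            from = end; // continue scanning after this table
        }
        return tables;
    }
}
```

With nested tables this would cut the parent at the first </table>, which is exactly the problem described next.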

Now, when the tables are nested, such as:

<table>
  <tr>
    <th>One</th>
    <th>Two</th>
  </tr>
  <tr>
     <td>
       <table>
         <tr>
           <th>Three</th>
           <th>Four</th>
         </tr>
       </table>
    </td>
  </tr>
</table>


Then, if I want to separate the 2 tables into a string list, I would have the parent table, exactly as it is, as the first entry in the list:


<table>
  <tr>
    <th>One</th>
    <th>Two</th>
  </tr>
  <tr>
     <td>
       <table>
         <tr>
           <th>Three</th>
           <th>Four</th>
         </tr>
       </table>
    </td>
  </tr>
</table>


And then the nested table as the second entry in the list:

<table>
  <tr>
    <th>Three</th>
    <th>Four</th>
  </tr>
</table>

Now, to determine that the tables are nested, all I have to do is verify that the first <table is followed not by a </table> but by another <table.
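
That check can be sketched with two indexOf calls. This is only an illustration under the same assumptions (lowercase tags, balanced markup); the class and method names are made up for the example:

```java
public class NestingCheck {
    // Returns true if the first table found at or after 'from' contains
    // another "<table" before the first "</table>" that follows it.
    static boolean startsNested(String html, int from) {
        int open = html.indexOf("<table", from);
        if (open < 0) {
            return false; // no table at all
        }
        int nextOpen = html.indexOf("<table", open + 1);
        int nextClose = html.indexOf("</table>", open + 1);
        // Nested only if another opening tag comes before the first closing tag.
        return nextOpen >= 0 && nextClose >= 0 && nextOpen < nextClose;
    }
}
```

Note that searching for "<table" never matches "</table>", because the closing tag has a "/" right after the "<".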

I need to do this via string manipulation, without involving any parsers like DOM or SAX. Any ideas on how to make it easier?

Thanks.


Question by:CarlosScheidecker
11 Comments
 
LVL 47

Expert Comment

by:for_yan
ID: 35019997
Well, if you don't want to use SAX, you can rather simply
write your own small piece of it: read the lines one by one, check indexOf("<table>")
and indexOf("</table>"), and increment or decrement a level parameter so that you know at which
table nesting level you are currently parsing. It will not be as general as SAX,
as some people may write, say, <  table> with a space, or spread the tag across two lines, but this is not probable.
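
A minimal sketch of this level-tracking idea, scanning the whole string instead of line by line. The class and method names are hypothetical, and it assumes lowercase tags with no whitespace inside them. A stack of open tables stands in for the level counter (its size is the current level), and a slot is reserved in the result list when a table opens so that parents come before their nested tables:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TableExtractor {
    private static final String OPEN = "<table";
    private static final String CLOSE = "</table>";

    static List<String> extractTables(String html) {
        List<String> tables = new ArrayList<>();
        // Each entry: {start offset of the table, its slot in the result list}.
        Deque<int[]> openTables = new ArrayDeque<>();
        int pos = 0;
        while (true) {
            int nextOpen = html.indexOf(OPEN, pos);
            int nextClose = html.indexOf(CLOSE, pos);
            if (nextClose < 0) {
                break; // no closing tag left: finished, or markup is unbalanced
            }
            if (nextOpen >= 0 && nextOpen < nextClose) {
                // One level deeper: remember where this table starts.
                openTables.push(new int[] {nextOpen, tables.size()});
                tables.add(null); // placeholder, filled when the table closes
                pos = nextOpen + OPEN.length();
            } else {
                int end = nextClose + CLOSE.length();
                if (!openTables.isEmpty()) {
                    int[] t = openTables.pop(); // one level up
                    tables.set(t[1], html.substring(t[0], end));
                }
                pos = end;
            }
        }
        tables.removeIf(t -> t == null); // drop tables that were never closed
        return tables;
    }
}
```

For the nested example in the question this should give the parent table (with the inner table still inside it) as the first entry and the inner table as the second. Whether it is actually faster than SAX for a given page would need to be measured.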
 
LVL 12

Expert Comment

by:mwochnick
ID: 35020031
Consider using something like this: http://htmlparser.sourceforge.net/
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 35020109
Mwochnick, I have tried that before and it is not efficient. It creates objects, and that takes time. As I stated in my question, string manipulation is what I need because it is much faster.
LVL 47

Expert Comment

by:for_yan
ID: 35020110
But SAX is rather efficient. If you write your own string manipulations,
they will certainly be less general, and where is the guarantee that they
would be faster?

Are you dealing with really huge files?
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 35020125
for_yan, it is not fast enough. I wrote a parser on SAX. I need to be able to do this in tenths of a second, not in seconds.
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 35020138
for_yan, your first comment is what I will be exploring. That is what I need to do. So I will create a small project to do that and see how it goes.
 
LVL 12

Expert Comment

by:mwochnick
ID: 35020540
I'd consider using ByteArrayInputStream and ByteArrayOutputStream, with arrays acting as pointers to your <table> and </table> tags.
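
One way to read the "arrays as pointers" part is to record the offset of every tag occurrence once and then work only with those numbers. A rough sketch on a plain String (not on the byte streams mentioned above); the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class TagOffsets {
    // Returns the offset of every occurrence of 'tag' in 'html', in order.
    static int[] offsetsOf(String html, String tag) {
        List<Integer> found = new ArrayList<>();
        for (int i = html.indexOf(tag); i >= 0; i = html.indexOf(tag, i + tag.length())) {
            found.add(i);
        }
        int[] result = new int[found.size()];
        for (int k = 0; k < result.length; k++) {
            result[k] = found.get(k);
        }
        return result;
    }
}
```

Calling offsetsOf(html, "<table") and offsetsOf(html, "</table>") gives two sorted arrays that can be walked together to pair each opening tag with its closing tag.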
 
LVL 47

Expert Comment

by:for_yan
ID: 35020641
Yes, there is always a trade-off between speed and generality. Wish you success.
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 35021239
Mwochnick, that is what I am doing to parse XML: pointers. As for the rest, I want to yank the tables out first.
 
LVL 1

Accepted Solution

by:CarlosScheidecker
ID: 35031263
I was able to do it in an even simpler way than that, starting from the code in my last post:

http://www.experts-exchange.com/Programming/Languages/Java/Q_26856859.html 

I changed the code so that it gets the table tags instead.

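HTMLElement[] tbls = resp.getElementsByTagName("table");

for (int i = 0; i < tbls.length; i++) {
    // Wrap each table's markup in an XML header and a <collections> root element.
    aux = setXMLHeader(encoding);
    aux += "<collections>";
    aux += print(tbls[i].getNode()); // index the array: tbls[i], not tbls
    aux += "</collections>";
    strTables.add(aux);
}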

And that is the answer to it.
 
LVL 1

Author Closing Comment

by:CarlosScheidecker
ID: 35067778
This is the correct answer, done in an easier way.