Billy Ma

How to extract data from HTML table

I have the HTML below and want to do the following in Java. Can anyone help?
Each <table></table> should select row index i of an array, and the text of each <td></td> in that table should be stored at column index j.
In the end I will have a 2D array, e.g. array[i][j]:

array[0][0] = Index
array[0][1] = CM MAC
array[0][2] = Uid
array[0][3] = IP
array[0][4] = RX
array[0][5] = TX
array[0][6] = SNR

array[1][0] = 0003
array[1][1] = 000CE52A7788
array[1][2] = Upstream 1
array[1][3] = 10.113.184.32
array[1][4] = -7.1
array[1][5] = 59.3
array[1][6] = 36.1

etc
<HTML>
<HEAD> <TITLE>CMTS ABE0C0508BSR Monitor</TITLE> </HEAD>
<H1><CENTER>ABE0C0508BSR</H1></CENTER><HR>
<PRE>
<center>
<table border='0' width='70%'>
<tr bgcolor='#F7EDA6' FONT="courier">
   <td width='8%' align='center'>Index</td>
   <td width='20%' align='center'>CM MAC</td>
   <td width='20%' align='center'>Uid</td>
   <td width='22%' align='center'>IP</td>
   <td width='10%' align='center'>RX</td>
   <td width='10%' align='center'>TX</td>
   <td width='10%' align='center'>SNR</td>
</tr>
</table>
<table border='0' width='70%'>
<tr bgcolor='#b4b4b4'>
   <td width='8%' align='center'>0003</td>
   <td width='20%' align='center'>000CE52A7788</td>
   <td width='20%' align='center'>Upstream 1</td>
   <td width='22%' align='center'>10.113.184.32</td>
   <td width='10%' align='center'>-7.1</td>
   <td width='10%' align='center'>59.3</td>
   <td width='10%' align='center'>36.1</td>
</tr>
</table>
<table border='0' width='70%'>
<tr bgcolor='#b4b4b4'>
   <td width='8%' align='center'>0005</td>
   <td width='20%' align='center'>000E5CE42AF2</td>
   <td width='20%' align='center'>Upstream 0</td>
   <td width='22%' align='center'>10.113.184.86</td>
   <td width='10%' align='center'>-2.1</td>
   <td width='10%' align='center'>58.5</td>
   <td width='10%' align='center'>36.2</td>
</tr>
</table>


ASKER CERTIFIED SOLUTION
cmalakar

ASKER

If you remove all the closing tr tags and then replace the first tr with "", there are no tr tags left, so how can you use test.split("<tr>")?
>> If you remove all the ending tr tags, then replace First tr with "", you now have no more tr tags

No, the middle tr tags will still be there; they mark the end of each 1-D array.

For example, print the value of test after line 31 and check whether any tr tags remain.
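
As a rough sketch of the split idea discussed above (not the original answer's code): strip the attributes from the opening tr/td tags, drop every other tag, then split on <tr> for rows and on <td> for cells. The class name TableSplitSketch and the method toArray are invented for illustration, and the sketch assumes lowercase tr/td tags, no nested tables, and no text after the last table.

```java
class TableSplitSketch {

    // Hypothetical helper: returns one String[] per <tr>, one element per <td>.
    static String[][] toArray(String html) {
        int start = html.indexOf("<tr");
        if (start == -1) {
            return new String[0][];
        }
        // Skip everything before the first row (title, headings, ...).
        String test = html.substring(start);
        // Strip attributes from opening tr/td tags: <td width='8%'> becomes <td>.
        test = test.replaceAll("<(tr|td)[^>]*>", "<$1>");
        // Remove every remaining tag except the bare <tr> and <td> markers.
        test = test.replaceAll("<(?!(tr|td)>)[^>]*>", "");
        // Drop the first <tr> so split() does not yield an empty leading row.
        test = test.replaceFirst("<tr>", "");

        String[] rows = test.split("<tr>");
        String[][] result = new String[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            // cells[0] is whatever precedes the first <td> (whitespace), so skip it.
            String[] cells = rows[i].split("<td>", -1);
            result[i] = new String[cells.length - 1];
            for (int j = 1; j < cells.length; j++) {
                result[i][j - 1] = cells[j].trim();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<table border='0'>\n<tr bgcolor='#F7EDA6'>\n"
                + "   <td width='8%'>Index</td>\n   <td width='20%'>CM MAC</td>\n</tr>\n</table>\n"
                + "<table border='0'>\n<tr bgcolor='#b4b4b4'>\n"
                + "   <td width='8%'>0003</td>\n   <td width='20%'>000CE52A7788</td>\n</tr>\n</table>\n";
        for (String[] row : toArray(html)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

The trim() call on each cell is what removes the line breaks and indentation left over between the tags.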
I see what you mean. However, I turn each <table> into one line instead:

int first_pos = 0;
int last_pos = 0;
ArrayList<String> result = new ArrayList<String>();

while (true) {
      // Find the next <table ...> ... </table> pair after the previous one
      first_pos = data.indexOf("<table", last_pos);
      if (first_pos == -1) {
            break;
      }
      last_pos = data.indexOf("</table>", first_pos);
      if (last_pos == -1) {
            break;
      }

      // "</table>" is 8 characters long
      String t1 = data.substring(first_pos, last_pos + 8);
      // Drop the table tags and strip the attributes from tr and td
      t1 = t1.replaceAll("<[/]{0,1}table[^>]*>", "");
      t1 = t1.replaceAll("<(tr|td)[^>]*>", "<$1>");
      System.out.println(t1 + "\n");

      // Store the cleaned-up table rather than the raw substring
      result.add(t1);
}

return result;
I want to remove all tags except <td> and </td>. How can I do that?
Do you mean you want to remove the table and tr tags?

Then:

test = test.replaceAll("<[/]{0,1}(tr|table)[^>]*>", "");
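
If the goal is instead to strip every tag except <td> and </td>, a negative lookahead is one option. This is a minimal sketch, assuming lowercase td tags and no '>' character inside attribute values; the class and method names are invented for illustration.

```java
class KeepTdSketch {

    // Hypothetical helper: removes all tags except <td ...> and </td>.
    static String keepOnlyTd(String html) {
        // (?!/?td\b) rejects a match when the tag name is td or /td.
        String onlyTd = html.replaceAll("<(?!/?td\\b)[^>]*>", "");
        // Optional: drop the attributes so every opening tag is a plain <td>.
        return onlyTd.replaceAll("<td[^>]*>", "<td>");
    }

    public static void main(String[] args) {
        String row = "<table border='0'><tr bgcolor='#b4b4b4'>"
                + "<td width='8%' align='center'>0003</td>"
                + "<td width='10%' align='center'>-7.1</td></tr></table>";
        System.out.println(keepOnlyTd(row));
    }
}
```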
Mick Barry
You should be able to adapt the following:

http://www.exampledepot.com/egs/javax.swing.text.html/GetText.html

Or use HttpUnit or HttpClient.
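
A parser-based variant along those lines could use the HTML parser that ships with Swing (javax.swing.text.html). The sketch below is an illustration only, not the code from the linked page: the class SwingTableSketch and the method parse are invented names, and how the parser copes with malformed markup has not been checked here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingTableSketch {

    // Hypothetical helper: one inner list per <tr>, one entry per <td>.
    static List<List<String>> parse(String html) {
        final List<List<String>> rows = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private List<String> row;   // row being collected, or null
            private StringBuilder cell; // cell being collected, or null

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.TR) {
                    row = new ArrayList<>();
                } else if (t == HTML.Tag.TD) {
                    cell = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Only text that appears inside a <td> is kept.
                if (cell != null) {
                    cell.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == HTML.Tag.TD && cell != null) {
                    if (row != null) {
                        row.add(cell.toString().trim());
                    }
                    cell = null;
                } else if (t == HTML.Tag.TR && row != null) {
                    rows.add(row);
                    row = null;
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

Compared with the regex approach, this does not care about attribute order or quoting, at the cost of pulling in the Swing classes.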
cmalakar,
I want to remove all tags so that only the td tags remain.
I think I should use your first solution...
>> I think I should use your first solution...

Got it?
I think we didn't remove the whitespace beforehand, which is why the final result has some unwanted spaces.
How can I remove them?
>> How to remove that?

Did you solve it, i.e. removing the spaces?
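
For reference, one way to drop that whitespace before splitting is to collapse everything between a closing '>' and the next '<'. A small sketch with invented names; spaces inside the cell text (such as "Upstream 1") are left alone.

```java
class WhitespaceSketch {

    // Hypothetical helper: removes whitespace between tags and at both ends.
    static String stripInterTagWhitespace(String html) {
        return html.replaceAll(">\\s+<", "><").trim();
    }

    public static void main(String[] args) {
        String row = "\n<tr>\n   <td>Upstream 1</td>\n   <td>-7.1</td>\n</tr>\n";
        System.out.println(stripInterTagWhitespace(row));
    }
}
```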
Yes, solved.

One more question.

Is it possible to get everything between the first table tag and the last table tag (including the table tags themselves) using replaceAll rather than:

data = data.substring(data.indexOf("<table"), data.lastIndexOf("</table>") + 8);
Is this what you are asking?
String test = "<table><a><b></a></b><table></table>last</table>";
System.out.println(test.replaceFirst("<table>(.*)</table>", "$1"));


Hmm... not quite.
Given the String in the code snippet below,
I want to end up with "
<table border='0' width='70%'>
<tr bgcolor='#F7EDA6' FONT="courier">
   <td width='8%' align='center'>Index</td>
   <td width='20%' align='center'>CM MAC</td>
   <td width='20%' align='center'>Uid</td>
   <td width='22%' align='center'>IP</td>
   <td width='10%' align='center'>RX</td>
   <td width='10%' align='center'>TX</td>
   <td width='10%' align='center'>SNR</td>
</tr>
</table>

<table border='0' width='70%'>
<tr bgcolor='#b4b4b4'>
   <td width='8%' align='center'>0003</td>
   <td width='20%' align='center'>000CE52A7788</td>
   <td width='20%' align='center'>Upstream 1</td>
   <td width='22%' align='center'>10.113.184.32</td>
   <td width='10%' align='center'>-7.1</td>
   <td width='10%' align='center'>59.3</td>
   <td width='10%' align='center'>36.1</td>
</tr>
</table>
"

String test = "
<html>
<head>
</head>
 
<table border='0' width='70%'>
<tr bgcolor='#F7EDA6' FONT="courier">
   <td width='8%' align='center'>Index</td>
   <td width='20%' align='center'>CM MAC</td>
   <td width='20%' align='center'>Uid</td>
   <td width='22%' align='center'>IP</td>
   <td width='10%' align='center'>RX</td>
   <td width='10%' align='center'>TX</td>
   <td width='10%' align='center'>SNR</td>
</tr>
</table>
 
<table border='0' width='70%'>
<tr bgcolor='#b4b4b4'>
   <td width='8%' align='center'>0003</td>
   <td width='20%' align='center'>000CE52A7788</td>
   <td width='20%' align='center'>Upstream 1</td>
   <td width='22%' align='center'>10.113.184.32</td>
   <td width='10%' align='center'>-7.1</td>
   <td width='10%' align='center'>59.3</td>
   <td width='10%' align='center'>36.1</td>
</tr>
</table>
 
</html>"


I don't think you can do that with replaceFirst.

But you can use Pattern and Matcher here. Note that the pattern has to allow attributes on the opening tag and let . match line breaks:

Matcher matcher = Pattern.compile("(<table[^>]*>.*</table>)", Pattern.DOTALL).matcher(test);
if(matcher.find())
  System.out.println(matcher.group());

The replaceFirst function also uses Pattern and Matcher internally.
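
A self-contained sketch of the Pattern/Matcher idea, assuming the opening tags carry attributes and the markup spans several lines, so the pattern accepts <table ...> and is compiled with Pattern.DOTALL. The class name TableBlockSketch and the method extractTables are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableBlockSketch {

    // Greedy .* runs from the first <table to the last </table>;
    // DOTALL lets . match line terminators as well.
    private static final Pattern TABLES =
            Pattern.compile("<table\\b.*</table>", Pattern.DOTALL);

    // Hypothetical helper: returns the first-to-last table block, or "" if none.
    static String extractTables(String html) {
        Matcher matcher = TABLES.matcher(html);
        return matcher.find() ? matcher.group() : "";
    }

    public static void main(String[] args) {
        String test = "<html>\n<head>\n</head>\n"
                + "<table border='0' width='70%'>\n<tr><td>Index</td></tr>\n</table>\n\n"
                + "<table border='0' width='70%'>\n<tr><td>0003</td></tr>\n</table>\n"
                + "</html>";
        System.out.println(extractTables(test));
    }
}
```

For input that contains at least one complete table, this returns the same text as the indexOf/lastIndexOf substring shown earlier in the thread.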