Billy Ma

How to extract data from HTML table

I have the HTML code below and want to do the following in Java. Can anyone help?
Each <table></table> advances the array's i index, and the data from each <td></td> is stored at the j index.
In the end I get a 2D array, e.g. array[i][j]:

array[0][0] = Index
array[0][1] = CM MAC
array[0][2] = Uid
array[0][3] = IP
array[0][4] = RX
array[0][5] = TX
array[0][6] = SNR

array[1][0] = 0003
array[1][1] = 000CE52A7788
array[1][2] = Upstream 1
array[1][3] = 10.113.184.32
array[1][4] = -7.1
array[1][5] = 59.3
array[1][6] = 36.1

etc
<HTML>
<HEAD> <TITLE>CMTS ABE0C0508BSR Monitor</TITLE> </HEAD>
<H1><CENTER>ABE0C0508BSR</H1></CENTER><HR>
<PRE>
<center>
<table border='0' width='70%'>
<tr bgcolor='#F7EDA6' FONT="courier">
   <td width='8%' align='center'>Index</td>
   <td width='20%' align='center'>CM MAC</td>
   <td width='20%' align='center'>Uid</td>
   <td width='22%' align='center'>IP</td>
   <td width='10%' align='center'>RX</td>
   <td width='10%' align='center'>TX</td>
   <td width='10%' align='center'>SNR</td>
</tr>
</table>
<table border='0' width='70%'>
<tr bgcolor='#b4b4b4'>
   <td width='8%' align='center'>0003</td>
   <td width='20%' align='center'>000CE52A7788</td>
   <td width='20%' align='center'>Upstream 1</td>
   <td width='22%' align='center'>10.113.184.32</td>
   <td width='10%' align='center'>-7.1</td>
   <td width='10%' align='center'>59.3</td>
   <td width='10%' align='center'>36.1</td>
</tr>
</table>
<table border='0' width='70%'>
<tr bgcolor='#b4b4b4'>
   <td width='8%' align='center'>0005</td>
   <td width='20%' align='center'>000E5CE42AF2</td>
   <td width='20%' align='center'>Upstream 0</td>
   <td width='22%' align='center'>10.113.184.86</td>
   <td width='10%' align='center'>-2.1</td>
   <td width='10%' align='center'>58.5</td>
   <td width='10%' align='center'>36.2</td>
</tr>
</table>

Java, HTML

Last Comment
cmalakar

8/22/2022 - Mon
ASKER CERTIFIED SOLUTION
cmalakar

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
Billy Ma

ASKER
If you remove all the closing tr tags and then replace the first tr with "", there are no tr tags left, so how can you use test.split("<tr>")?
cmalakar

>> If you remove all the ending tr tags, then replace First tr with "", you now have no more tr tags

No, the middle tr tags will still be there, and they mark the end of each 1-D array.

For example, print the value of test after line 31 and check whether any tr tags remain.
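The split idea described above can be sketched roughly as follows. This is only an illustration, not the accepted solution (which is not shown in this thread). The class name TrSplitSketch, the method toArray and the shortened sample markup are assumptions made for the example, and the input is assumed to contain nothing but table markup.

```java
import java.util.ArrayList;
import java.util.List;

class TrSplitSketch {

    // Hypothetical helper: turns table-only markup into rows of cell text.
    static String[][] toArray(String html) {
        String s = html
                // drop <table ...>, </table>, </tr> and </td> entirely
                .replaceAll("(?i)</?table[^>]*>|</tr>|</td>", "")
                // strip attributes so the opening tags become plain split markers
                .replaceAll("(?i)<tr\\b[^>]*>", "<tr>")
                .replaceAll("(?i)<td\\b[^>]*>", "<td>");

        List<String[]> rows = new ArrayList<>();
        // element 0 is whatever precedes the first <tr>, so start at 1
        String[] rowParts = s.split("<tr>", -1);
        for (int i = 1; i < rowParts.length; i++) {
            // limit -1 keeps a trailing empty cell instead of dropping it
            String[] cellParts = rowParts[i].split("<td>", -1);
            String[] cells = new String[cellParts.length - 1];
            for (int j = 1; j < cellParts.length; j++) {
                cells[j - 1] = cellParts[j].trim(); // trim removes layout whitespace
            }
            rows.add(cells);
        }
        return rows.toArray(new String[0][]);
    }

    public static void main(String[] args) {
        String html =
              "<table border='0'><tr bgcolor='#F7EDA6'>\n"
            + "   <td width='8%'>Index</td>\n"
            + "   <td width='20%'>CM MAC</td>\n"
            + "</tr></table>\n"
            + "<table border='0'><tr bgcolor='#b4b4b4'>\n"
            + "   <td width='8%'>0003</td>\n"
            + "   <td width='20%'>000CE52A7788</td>\n"
            + "</tr></table>";
        String[][] array = toArray(html);
        System.out.println(array[0][1]); // CM MAC
        System.out.println(array[1][0]); // 0003
    }
}
```

Because each table in the question holds exactly one row, the row index here coincides with the table index i. The trim() call on each cell is also one way to get rid of the stray whitespace that the line breaks and indentation leave behind.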
Billy Ma

ASKER
I see what you mean; however, I turn each <table> into one line:

int first_pos = 0;
int last_pos = 0;
ArrayList<String> result = new ArrayList<String>();

while (true) {
      first_pos = data.indexOf("<table", last_pos);
      if (first_pos == -1) {
            break;      // no further <table> tag
      }
      last_pos = data.indexOf("</table>", first_pos);
      if (last_pos == -1) {
            break;      // unterminated table
      }
      last_pos += 8;    // step past "</table>" (8 characters)

      String t1 = data.substring(first_pos, last_pos);
      t1 = t1.replaceAll("<[/]{0,1}table[^>]*>", "");   // drop the table tags
      t1 = t1.replaceAll("<(tr|td)[^>]*>", "<$1>");     // strip tr/td attributes
      System.out.println(t1 + "\n");

      result.add(t1);   // store the cleaned table, not the raw substring
}

return result;
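Since the goal is a 2D array rather than a list of table strings, the same walk over the tables could also be written with Pattern and Matcher. The following is a minimal sketch under the assumption that the tables are not nested and that every td has an explicit closing tag; the class name TableCells and the method extract are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableCells {

    // One match per <table ...> ... </table>; DOTALL lets '.' cross line breaks,
    // and the reluctant '.*?' stops at the nearest closing tag.
    private static final Pattern TABLE = Pattern.compile(
            "<table\\b[^>]*>(.*?)</table>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // One match per <td ...> ... </td>; group 1 is the cell content.
    private static final Pattern CELL = Pattern.compile(
            "<td\\b[^>]*>(.*?)</td>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: index i is the table, index j is the cell.
    static String[][] extract(String data) {
        List<String[]> rows = new ArrayList<>();
        Matcher table = TABLE.matcher(data);
        while (table.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(table.group(1));
            while (cell.find()) {
                cells.add(cell.group(1).trim());
            }
            rows.add(cells.toArray(new String[0]));
        }
        return rows.toArray(new String[0][]);
    }

    public static void main(String[] args) {
        String data =
              "<table border='0' width='70%'>\n<tr bgcolor='#F7EDA6'>\n"
            + "   <td width='8%' align='center'>Index</td>\n"
            + "   <td width='20%' align='center'>CM MAC</td>\n</tr>\n</table>\n"
            + "<table border='0' width='70%'>\n<tr bgcolor='#b4b4b4'>\n"
            + "   <td width='8%' align='center'>0003</td>\n"
            + "   <td width='20%' align='center'>000CE52A7788</td>\n</tr>\n</table>";
        for (String[] row : extract(data)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Text outside the tables (the title, the heading, the closing html tag) never enters a match, so it does not need to be cut away first.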
Billy Ma

ASKER
I want to remove all tags except <td> and </td>. How can I do that?
cmalakar

Do you mean you want to remove the table and tr tags?

Then:

test = test.replaceAll("<[/]{0,1}(tr|table)[^>]*>", "");
Mick Barry

You should be able to adapt the following:

http://www.exampledepot.com/egs/javax.swing.text.html/GetText.html

Or use HttpUnit or HttpClient.
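As a rough idea of what the javax.swing.text.html route might look like, the sketch below uses the JDK's ParserDelegator with an HTMLEditorKit.ParserCallback and collects text only while inside a td. It is not the code from the linked page; the class name SwingTableReader and the method read are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingTableReader {

    // Hypothetical helper: one inner list per <tr>, one string per <td>.
    static List<List<String>> read(String html) {
        final List<List<String>> rows = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private StringBuilder cell; // non-null only while inside a <td>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TR) {
                    rows.add(new ArrayList<String>()); // a new row begins
                } else if (tag == HTML.Tag.TD) {
                    cell = new StringBuilder();        // a new cell begins
                }
            }

            @Override
            public void handleText(char[] text, int pos) {
                if (cell != null) {
                    cell.append(text); // ignore text outside of cells
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TD && cell != null && !rows.isEmpty()) {
                    rows.get(rows.size() - 1).add(cell.toString().trim());
                    cell = null;
                }
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declaration in the page
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<html><body><table><tr><td>Index</td><td>CM MAC</td></tr></table>"
                + "<table><tr><td>0003</td><td>000CE52A7788</td></tr></table></body></html>";
        System.out.println(read(html));
    }
}
```

The appeal of a parser over regular expressions is that attribute order, tag case and line breaks stop mattering; the cost is a dependency on the Swing HTML classes, which implement an older HTML dialect.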
Billy Ma

ASKER
cmalakar,
I want to remove all tags and keep only the td tags.
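One way this could be done with a single replaceAll is a negative lookahead that refuses to match td and /td. The sketch below is an illustration only; the class name KeepTdOnly and both method names are invented for the example, and it assumes the text contains no stray '<' characters or HTML comments.

```java
class KeepTdOnly {

    // Removes every tag except <td ...> and </td>.
    // (?!/?td\b) fails the match when the tag name is exactly "td",
    // with or without a leading slash, so those tags are left alone.
    static String stripAllButTd(String html) {
        return html.replaceAll("(?i)<(?!/?td\\b)[^>]*>", "");
    }

    // Optional second step: drop the attributes from the remaining <td> tags.
    static String plainTd(String html) {
        return stripAllButTd(html).replaceAll("(?i)<td\\b[^>]*>", "<td>");
    }

    public static void main(String[] args) {
        String html = "<table border='0'><tr bgcolor='#b4b4b4'>"
                + "<td width='8%' align='center'>0003</td></tr></table>";
        System.out.println(stripAllButTd(html)); // <td width='8%' align='center'>0003</td>
        System.out.println(plainTd(html));       // <td>0003</td>
    }
}
```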
Billy Ma

ASKER
I think I should use your first solution...
cmalakar

>> I think I should use your first solution...

Got it?
Billy Ma

ASKER
I think we didn't remove the spaces in advance, which is why the final result has some unwanted spaces.
How can I remove them?
cmalakar

>> How to remove that?

Did you solve it, i.e. removing the spaces?
Billy Ma

ASKER
Yes, solved.

One more question.

Is it possible to get everything between the first table tag and the last table tag (including the table tags themselves) using replaceAll rather than:

data = data.substring(data.indexOf("<table"), data.lastIndexOf("</table>") + 8);
cmalakar

Is this what you are asking?
String test = "<table><a><b></a></b><table></table>last</table>";
System.out.println(test.replaceFirst("<table>(.*)</table>", "$1"));

Billy Ma

ASKER
Hmm, not quite.
Given the String in the code snippet below,
I want to end up with "
<table border='0' width='70%'>
<tr bgcolor='#F7EDA6' FONT="courier">
   <td width='8%' align='center'>Index</td>
   <td width='20%' align='center'>CM MAC</td>
   <td width='20%' align='center'>Uid</td>
   <td width='22%' align='center'>IP</td>
   <td width='10%' align='center'>RX</td>
   <td width='10%' align='center'>TX</td>
   <td width='10%' align='center'>SNR</td>
</tr>
</table>

<table border='0' width='70%'>
<tr bgcolor='#b4b4b4'>
   <td width='8%' align='center'>0003</td>
   <td width='20%' align='center'>000CE52A7788</td>
   <td width='20%' align='center'>Upstream 1</td>
   <td width='22%' align='center'>10.113.184.32</td>
   <td width='10%' align='center'>-7.1</td>
   <td width='10%' align='center'>59.3</td>
   <td width='10%' align='center'>36.1</td>
</tr>
</table>
"

String test = "
<html>
<head>
</head>
 
<table border='0' width='70%'>
<tr bgcolor='#F7EDA6' FONT="courier">
   <td width='8%' align='center'>Index</td>
   <td width='20%' align='center'>CM MAC</td>
   <td width='20%' align='center'>Uid</td>
   <td width='22%' align='center'>IP</td>
   <td width='10%' align='center'>RX</td>
   <td width='10%' align='center'>TX</td>
   <td width='10%' align='center'>SNR</td>
</tr>
</table>
 
<table border='0' width='70%'>
<tr bgcolor='#b4b4b4'>
   <td width='8%' align='center'>0003</td>
   <td width='20%' align='center'>000CE52A7788</td>
   <td width='20%' align='center'>Upstream 1</td>
   <td width='22%' align='center'>10.113.184.32</td>
   <td width='10%' align='center'>-7.1</td>
   <td width='10%' align='center'>59.3</td>
   <td width='10%' align='center'>36.1</td>
</tr>
</table>
 
</html>"

cmalakar

I don't think you can do that with replaceFirst.

But you can use Pattern and Matcher here (DOTALL lets . match line breaks, and [^>]* allows attributes on the opening tag):

Matcher matcher = Pattern.compile("(<table[^>]*>.*</table>)", Pattern.DOTALL).matcher(test);
if (matcher.find())
  System.out.println(matcher.group());

The replaceFirst function also uses Pattern and Matcher internally.
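A self-contained sketch of the Pattern and Matcher approach is given below, written so that it copes with attributes on the opening tag and with input that spans several lines. The class name TableSpan and both method names are assumptions made for the example. A replaceAll variant is included for comparison, since a capturing group appears to make that form workable as well; treat it as an alternative to try rather than as the recommended route.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableSpan {

    // The greedy '.*' runs to the LAST </table>; DOTALL lets it cross line breaks.
    private static final Pattern SPAN = Pattern.compile(
            "<table\\b[^>]*>.*</table>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns everything from the first <table ...> to the last </table>,
    // or an empty string when the input has no complete table.
    static String firstToLastTable(String html) {
        Matcher m = SPAN.matcher(html);
        return m.find() ? m.group() : "";
    }

    // Same span via replaceAll: capture it and discard the text around it.
    // Note: if no table is present, the input comes back unchanged.
    static String viaReplaceAll(String html) {
        return html.replaceAll("(?is).*?(<table\\b[^>]*>.*</table>).*", "$1");
    }

    public static void main(String[] args) {
        String test = "<html>\n<head>\n</head>\n"
                + "<table border='0' width='70%'>\n<tr><td>Index</td></tr>\n</table>\n\n"
                + "<table border='0' width='70%'>\n<tr><td>0003</td></tr>\n</table>\n"
                + "</html>";
        System.out.println(firstToLastTable(test));
    }
}
```

For this particular task, the substring call with indexOf and lastIndexOf shown earlier does the same job without any regular expression and is probably the easiest of the three to read.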