
HTML Parsing

jazk0 asked:
Hello everyone, I have a function that makes a POST request and returns the response HTML as a string:

String sourceUrlString="http://www.url.com";
String result = postData(sourceUrlString);

public String postData(String url) {

    // Create a new HttpClient and the POST request
    HttpClient httpclient = new DefaultHttpClient();
    HttpPost httppost = new HttpPost(url);

    try {
        // Add the form fields
        List<NameValuePair> nameValuePairs = new ArrayList<NameValuePair>(6);
        nameValuePairs.add(new BasicNameValuePair("p", "bm02new"));
        nameValuePairs.add(new BasicNameValuePair("town", "¿¿¿¿¿"));
        nameValuePairs.add(new BasicNameValuePair("addr", ""));
        nameValuePairs.add(new BasicNameValuePair("code", "0"));
        nameValuePairs.add(new BasicNameValuePair("bank", "0"));
        nameValuePairs.add(new BasicNameValuePair("submit", "ok"));
        // Pass an explicit charset so the non-ASCII "town" value is not mangled.
        // UTF-8 is an assumption here; use whatever encoding the server expects.
        httppost.setEntity(new UrlEncodedFormEntity(nameValuePairs, "UTF-8"));

        // Execute the HTTP POST request
        HttpResponse response = httpclient.execute(httppost);

        // Read the whole response body into a String. EntityUtils
        // (org.apache.http.util.EntityUtils) honours the charset declared
        // in the response's Content-Type header.
        return EntityUtils.toString(response.getEntity());

    } catch (ClientProtocolException e) {
        // Do not swallow the error silently
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // The request failed
    return null;
}



The HTML looks like this:

<html>
<head>
......
</head>
<body>
.......
.......
<table cellspacing="0" cellpadding="2" border="0">
<tbody>
<tr bgcolor="#3493e0">
 <td width="90">
<img height="1" width="90" border="0" src="space.gif"><br>
<span class="smallw"><b>¿¿¿¿¿¿¿¿<br> ¿¿¿¿¿</b></span>
</td>
<td><img height="1" width="10" border="0" src="space.gif"></td>
 <td width="160"><img height="1" width="160" border="0" src="space.gif"><br>
<span class="smallw"><b>¿¿¿¿¿</b></span>
</td>
<td><img height="1" width="10" border="0" src="space.gif"></td>
<td width="145"><img height="1" width="145" border="0" src="space.gif"><br>
<span class="smallw"><b>¿¿¿¿¿</b></span></td>
<td><img height="1" width="10" border="0" src="space.gif">
</td>
<td width="125"><img height="1" width="125" border="0" src="space.gif"><br>
<span class="smallw"><b>¿¿¿¿¿¿, ¿¿¿¿¿¿¿¿¿<br> 14.06.2010, 18:00</b></span>
</td>
</tr>
<tr height="42" bgcolor="#eeeeee">
<td><span class="normal">[b]DATA1[/b]</span></td>
<td></td>
<td><span class="normal">[b]DATA2[/b]</span></td>
<td></td>
<td><span class="normal">[b]DATA3[/b]</span></td>
<td></td>
<td align="center"> 
<span class="small"><a href="[b]URLDATA[/b]" onclick="window.open('','bmDetails056118','toolbar=no,location=no,menubar=no,directories=no,status=no,resizable=no,scrollbars=no,width=400,height=600, titlebar=no');" target="bmDetails056118"><img border="0" src="/magnifier.gif" alt="¿¿¿¿¿¿"></a></span></td>
</tr>
<tr height="42" bgcolor="">
<td><span class="normal">[b]DATA1[/b]</span></td>
<td></td>
<td><span class="normal">[b]DATA2[/b]</span></td>
<td></td>
<td><span class="normal">[b]DATA3[/b]</span></td>
<td></td>
<td align="center"> <span class="small"><a href="[b]URLDATA[/b]" onclick="window.open('','bmDetails643001','toolbar=no,location=no,menubar=no,directories=no,status=no,resizable=no,scrollbars=no,width=400,height=600, titlebar=no');" target="bmDetails643001"><img border="0" src="/magnifier.gif" alt="¿¿¿¿¿¿"></a></span></td>
</tr>
...............
...............
...............
</tbody>
</table>




I need a way to get DATA1, DATA2, DATA3 and URLDATA from each row into a Vector.

The code is for an Android app, so it has to be Java.

Commented:
Try getElementsByName, getElementById, or getElementsByClassName to grab any data you want from the HTML.
For example:

document refers to the whole HTML document
body refers to the HTML <body> tag

a = document.body.getElementsByClassName("normal").item(0)
b = document.body.getElementsByClassName("normal").item(1)

Hope it will help.
a_b
Top Expert 2009
Commented:
Since you already have the required data in your Java class, have you tried converting it into a DOM object and then retrieving the data from that?
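
For what it's worth, a DOM attempt with the JDK's built-in parser could look like the sketch below. The class and method names (DomExtractor, normalSpans) are made up for illustration, and it assumes the markup is well-formed XML. The page as posted has unclosed <img> and <br> tags, so it would need cleaning up first or this parser will reject it.

```java
import java.io.StringReader;
import java.util.Vector;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original question's code.
class DomExtractor {

    // Returns the trimmed text of every <span class="normal"> in the input.
    // Assumes the input is well-formed XML; otherwise parsing fails.
    static Vector<String> normalSpans(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        Vector<String> out = new Vector<String>();
        NodeList spans = doc.getElementsByTagName("span");
        for (int i = 0; i < spans.getLength(); i++) {
            Element span = (Element) spans.item(i);
            if ("normal".equals(span.getAttribute("class"))) {
                out.add(span.getTextContent().trim());
            }
        }
        return out;
    }
}
```

Calling DomExtractor.normalSpans(result) on the string returned by postData would only work once the HTML has been made well-formed.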
a_b
Top Expert 2009

Commented:
If you need only these fields, it will be better to use a SAX parser and pick out just the fields you are interested in. There are two things you have to keep in mind:
1. Your HTML must be well formed.
2. You have to be able to recognize the fields you are interested in somehow. In your case, maybe the element "span" with the attribute "class" set to "normal"? Otherwise you may have to add your own id to the span in order to recognize your fields.
Here is an example of how to trace attributes:
http://www.brics.dk/~amoeller/XML/programming/saxexample.html

Author

Commented:
Hello Valeri, <span class="small"> surrounds all the data that I need, but how can I get it into a vector array? Can you give me a piece of code?

In order to get the content of the tag you have to implement the characters() method. You have to use the endElement() method as well. Here is a piece of code:
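Vector<Object> myData = new Vector<Object>();
boolean addToVector = false;

public void startElement(String namespaceURI, String localName, String qName, Attributes atts) {
    // Depending on the parser's namespace settings, the element name
    // arrives in localName or in qName, so check both
    String name = (localName != null && localName.length() > 0) ? localName : qName;
    if ("span".equals(name)) {
        String n = atts.getValue("", "class");
        if (n == null) n = atts.getValue("class");
        // Constant-first equals avoids a NullPointerException
        // when the span has no class attribute
        if ("small".equals(n)) {
            addToVector = true;
        }
    }
}

public void characters(char[] ch, int start, int length) {
    String value = new String(ch, start, length).trim();
    if (addToVector) myData.add(value);
}

public void endElement(String namespaceUri, String localName, String qName) {
    // Note: this stops collecting at the first closing tag of any element
    addToVector = false;
}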
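
To see the handler idea end to end, here is a minimal self-contained sketch that runs against the JDK's default SAX parser. SpanCollector and collect are names invented for this example. It assumes well-formed input, gathers the text of every <span class="normal"> (which is where DATA1 to DATA3 sit in the posted table) and the href of every <a> it meets. The attribute lookup by plain name is what the JDK's default, non-namespace-aware parser expects; other parsers may need the two-argument getValue.

```java
import java.io.StringReader;
import java.util.Vector;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only.
class SpanCollector extends DefaultHandler {
    private final Vector<String> data = new Vector<String>();
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // > 0 while inside a <span class="normal">

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The element name arrives in qName or localName depending on parser settings
        String name = qName.length() > 0 ? qName : localName;
        if (depth > 0) {
            depth++; // nested element inside the span, e.g. <b>
        } else if ("span".equals(name) && "normal".equals(atts.getValue("class"))) {
            depth = 1;
            text.setLength(0);
        } else if ("a".equals(name) && atts.getValue("href") != null) {
            data.add(atts.getValue("href"));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node, so accumulate
        if (depth > 0) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Store the text only when the span itself closes
        if (depth > 0 && --depth == 0) {
            data.add(text.toString().trim());
        }
    }

    static Vector<String> collect(String xml) {
        SpanCollector handler = new SpanCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return handler.data;
    }
}
```

Counting the nesting depth keeps the handler collecting until the span itself closes, instead of stopping at the first inner closing tag such as </b>.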
Well, I kinda missed the fact that this was asked 3 months ago, but I'll still post my answer and see if anybody still cares :-)

So here we go: This is "normal" HTML, not XHTML, so you'll have a hard time parsing it as a DOM - there are some libraries available, but it can still be a big mess sometimes.
And why would you? "Getting something from a long, semi-structured text" can be achieved nicely by using regular expressions, like this:

The code is tested. Have fun...
Pattern rowPattern = Pattern.compile("<tr height=\"42\".*?</tr>", Pattern.DOTALL);
Pattern dataPattern = Pattern.compile("<span class=\"normal\">\\[b\\](.*?)\\[/b\\]</span>");
Pattern urlDataPattern = Pattern.compile("<span class=\"small\"><a href=\"\\[b\\](.*?)\\[/b\\]");
Matcher rowMatcher = rowPattern.matcher(text);
while (rowMatcher.find()) {
	String row = rowMatcher.group();
	Matcher dataMatcher = dataPattern.matcher(row);
	while (dataMatcher.find()) {
		System.out.println("data: " + dataMatcher.group(1));
	}
	Matcher urldataMatcher = urlDataPattern.matcher(row);
	while (urldataMatcher.find()) {
		System.out.println("url:  " + urldataMatcher.group(1));
	}
}
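
The loop above only prints the matches. Since the question asks for a Vector, here is a sketch that wraps the same three patterns in a method and returns one String[] per table row. RowExtractor and extract are hypothetical names. The patterns still expect the literal [b]...[/b] markers shown in the posted sample; if the real response does not contain them, they would have to be dropped from the patterns.

```java
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the patterns from the answer above.
class RowExtractor {
    private static final Pattern ROW =
            Pattern.compile("<tr height=\"42\".*?</tr>", Pattern.DOTALL);
    private static final Pattern DATA =
            Pattern.compile("<span class=\"normal\">\\[b\\](.*?)\\[/b\\]</span>");
    private static final Pattern URL =
            Pattern.compile("<span class=\"small\"><a href=\"\\[b\\](.*?)\\[/b\\]");

    // Returns one array per data row: the data cells first, then the URL if present.
    static Vector<String[]> extract(String html) {
        Vector<String[]> rows = new Vector<String[]>();
        Matcher rowMatcher = ROW.matcher(html);
        while (rowMatcher.find()) {
            String row = rowMatcher.group();
            Vector<String> fields = new Vector<String>();
            Matcher dataMatcher = DATA.matcher(row);
            while (dataMatcher.find()) {
                fields.add(dataMatcher.group(1));
            }
            Matcher urlMatcher = URL.matcher(row);
            if (urlMatcher.find()) {
                fields.add(urlMatcher.group(1));
            }
            rows.add(fields.toArray(new String[0]));
        }
        return rows;
    }
}
```

With the string returned by postData, RowExtractor.extract(result) would give a Vector whose elements hold DATA1, DATA2, DATA3 and URLDATA for each row, in that order.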


Mick Barry, Java Developer
Top Expert 2010
Commented:
You could adapt something like this:

http://helpdesk.objects.com.au/java/how-do-i-extract-just-the-text-form-a-html-document-ie-strip-out-all-the-html-tags

or use a third-party HTML parser such as HttpUnit.
Kevin Cross, Chief Technology Officer
Most Valuable Expert 2011

Commented:
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.