cutie_smily

How to parse an HTML page



I am trying to learn how to parse an HTML page. So far I have connected to a web page and read its contents into a char array / string.

Now how do I get the data I need? I need an idea of how to proceed from here.

For example, I connect to the page http://www.ussearch.com/consumer/index.jsp, type in fn, ln, m and age, and get the result in a table. I would like to grab that. I have the resulting page in str. How do I go on from here?



Mick Barry

cutie_smily

ASKER

I do not want the above links. I already have the whole page text in a string and need to use string methods. So how do I get to a particular line and extract the text I want?

Example: part of my string is shown below.

From it I need to get first name, last name, city and age.
How do I grab these from the string below?

I need to grab values such as:
searchCity=NEW+YORK
searchState=NY
-----------------------------------------------------------
 Preliminary Search Results for:
"Twinky R Winky"
displayDisplayName('1', "http://www.ussearch.com/consumer/cwf?adID=10002101&action=browseproduct&searchtab=people&pid=3064&searchPerson=ENH1078249456&searchFName=TEXTILES&searchMName=&searchLName=WINKY&searchCity=NEW+YORK&searchState=NY&searchApproxAge=29&searchStateJurisdiction=NY&searchGender=&searchZip=&vid=cfc&searchAgentNotes=PREVIEW-CFC", 'TEXTILES','','WINKY', '0', 'ENH1078249456', 'off'); displayAgeCityState('-', 'NEW YORK', 'NY'); displayPremiumUrls('&searchFName=TEXTILES&searchMName=&searchLName=WINKY&searchCity=NEW+YORK&searchState=NY&searchApproxAge=29&searchStateJurisdiction=NY&searchGender=&searchZip=&vid=cfc&searchAgentNotes=PREVIEW-CFC', 'ENH1078249456', '**/**/00', '51540c04140a03510'); displayL2Result('0', 'ENH1078249456', 'off');
displayDisplayName('2', "http://www.ussearch.com/consumer/cwf?adID=10002101&action=browseproduct&searchtab=people&pid=3064&searchPerson=ENH1078249457&searchFName=TIMOTHY&searchMName=J&searchLName=WINKY&searchCity=NEW+LENOX&searchState=IL&searchApproxAge=29&searchStateJurisdiction=IL&searchGender=&searchZip=&vid=cfc&searchAgentNotes=PREVIEW-CFC", 'TIMOTHY','J','WINKY', '1', 'ENH1078249457', 'off'); displayAgeCityState('-', 'NEW LENOX', 'IL'); displayPremiumUrls('&searchFName=TIMOTHY&searchMName=J&searchLName=WINKY&searchCity=NEW+LENOX&searchState=IL&searchApproxAge=29&searchStateJurisdiction=IL&searchGender=&searchZip=&vid=cfc&searchAgentNotes=PREVIEW-CFC', 'ENH1078249457', '**/**/00', '65050a021d5c4d4a8'); displayL2Result('1', 'ENH1078249457', 'off');
import java.util.regex.*;

public class P
{
      public static void main(String st[])
      {
                                     String str = "Twinky R Winky" +
                        "displayDisplayName('1', \"http://www.ussearch.com/consumer/cwf?adID=10002101&action=browseproduct&searchtab=people&pid=3064&searchPerson=ENH1078249456&searchFName=TEXTILES&searchMName=&searchLName=WINKY&searchCity=NEW+YORK&searchState=NY&searchApproxAge=29&searchStateJurisdiction=NY&searchGender=&searchZip=&vid=cfc&searchAgentNotes=PREVIEW-CFC\", 'TEXTILES','','WINKY', '0', 'ENH1078249456', 'off'); displayAgeCityState('-', 'NEW YORK', 'NY'); displayPremiumUrls('&searchFName=TEXTILES&searchMName=&searchLName=WINKY&searchCity=NEW+YORK&searchState=NY&searchApproxAge=29&searchStateJurisdiction=NY&searchGender=&searchZip=&vid=cfc&searchAgentNotes=PREVIEW-CFC', 'ENH1078249456', '**/**/00', '51540c04140a03510'); displayL2Result('0', 'ENH1078249456', 'off');" +
                             "displayDisplayName('2', \"http://www.ussearch.com/consumer/cwf?adID=10002101&action=browseproduct&searchtab=people&pid=3064&searchPerson=ENH1078249457&searchFName=TIMOTHY&searchMName=J&searchLName=WINKY&searchCity=NEW+LENOX&searchState=IL&searchApproxAge=29&searchStateJurisdiction=IL&searchGender=&searchZip=&vid=cfc&searchAgentNotes=PREVIEW-CFC\", 'TIMOTHY','J','WINKY', '1', 'ENH1078249457', 'off'); displayAgeCityState('-', 'NEW LENOX', 'IL'); displayPremiumUrls('&searchFName=TIMOTHY&searchMName=J&searchLName=WINKY&searchCity=NEW+LENOX&searchState=IL&searchApproxAge=29&searchStateJurisdiction=IL&searchGender=&searchZip=&vid=cfc&searchAgentNotes=PREVIEW-CFC', 'ENH1078249457', '**/**/00', '65050a021d5c4d4a8'); displayL2Result('1', 'ENH1078249457', 'off');";

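            // Compile the regular expression once into a reusable Pattern.
            // Group 1 is the parameter name (City, State or Person); group 2 is its value up to the next '&'.
            Pattern pattern = Pattern.compile("search(City|State|Person)=([^&]*)?");
            Matcher matcher = pattern.matcher(str);

            // Each call to find() advances to the next match in str.
            while (matcher.find())
                  System.out.println(matcher.group(1) + "=" + matcher.group(2));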
      }
}
Forgot the output:
G:\java-temp>java P
Person=ENH1078249456
City=NEW+YORK
State=NY
City=NEW+YORK
State=NY
Person=ENH1078249457
City=NEW+LENOX
State=IL
City=NEW+LENOX
State=IL
Can you explain this to me in detail? What is compile doing, and what does the pattern represent here?

And how are you getting the output, i.e. Person, City, State, and then the same values repeated?

Thanks
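
One possible reading of the two questions above, as a minimal sketch: Pattern.compile turns the regex string into a reusable Pattern object, pattern.matcher(str) binds it to the input, and each call to find() moves on to the next match. City and State repeat because every record in the page text carries the same query string twice, once inside the displayDisplayName URL and once inside displayPremiumUrls, so the loop finds those pairs twice. If only distinct pairs are wanted, collecting them in a set is one option. The class name UniquePairs, the extract helper and the shortened record string are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UniquePairs {
    // Same pattern as in class P above.
    private static final Pattern PAIR =
            Pattern.compile("search(City|State|Person)=([^&]*)?");

    // Collects each distinct name=value pair once, in first-seen order.
    static Set<String> extract(String text) {
        Set<String> pairs = new LinkedHashSet<>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            pairs.add(m.group(1) + "=" + m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Abbreviated stand-in for one record of the page text (assumption: not the full string).
        String record =
                "displayDisplayName('1', \"http://www.ussearch.com/consumer/cwf?searchPerson=ENH1078249456"
                + "&searchCity=NEW+YORK&searchState=NY&searchApproxAge=29\", 'TEXTILES','','WINKY'); "
                + "displayPremiumUrls('&searchCity=NEW+YORK&searchState=NY&searchApproxAge=29', 'ENH1078249456');";

        // Prints Person=ENH1078249456, City=NEW+YORK, State=NY, each exactly once.
        for (String pair : extract(record)) {
            System.out.println(pair);
        }
    }
}
```

A LinkedHashSet keeps the first-seen order, so the output still reads Person, City, State. With several records in one string, the same city or state shared by two people would also be collapsed, so grouping by record may be the better fit there.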
ASKER CERTIFIED SOLUTION
aozarov

thanks.
"search(City|State|Person)=([^&]*)?"

If I want to get the Age value as well, should my pattern be

"search(City|State|Person|Age)=([^&]*)?"

Is search a function? It doesn't look like one.

What is search? Can you tell me?

( -- stands for?
[ -- stands for?
The caret ^ matches the position before the first character in the string
& -- ?
* is repetitive


I know it is very hard to explain. I would like to know what pattern you are looking for and how you came up with it.

Thanks
>> if i would get Age value so my pattern should be "search(City|State|Person|Age)=([^&]*)?"
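Yes, that is how you add another name to the list of alternatives. One caveat: in your text the parameter is written searchApproxAge, not searchAge, so the alternative has to be ApproxAge for it to match.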
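
A small check of this, as a sketch only: in the sample text the age parameter is written searchApproxAge, so "Age" never follows "search" directly and the Age alternative finds nothing, while ApproxAge does. The class name AgeDemo and the matches helper are invented for this example; the text is a fragment of the query string posted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AgeDemo {
    // Returns every "name=value" pair that the given regex finds in text.
    static List<String> matches(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group(1) + "=" + matcher.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Fragment of the query string from the thread.
        String text = "searchCity=NEW+YORK&searchState=NY&searchApproxAge=29&searchStateJurisdiction=NY";

        // "Age" as an alternative: the age is not picked up.
        System.out.println(matches("search(City|State|Person|Age)=([^&]*)?", text));
        // prints [City=NEW+YORK, State=NY]

        // "ApproxAge" as an alternative: the age is picked up.
        System.out.println(matches("search(City|State|Person|ApproxAge)=([^&]*)?", text));
        // prints [City=NEW+YORK, State=NY, ApproxAge=29]
    }
}
```

Note that searchStateJurisdiction is skipped in both cases, because after "searchState" the pattern requires "=" straight away.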

>> what is search. Can u tell me
search is the prefix for City, State, Person, ... -> in your text they are written as searchCity, searchState, ...

(....) -> "captures" the match inside the parentheses so you can later retrieve it via group(index)
[...] -> matches any single character listed inside the square brackets, e.g. [abc] matches any character which is either a or b or c.
^ -> this actually has two meanings. In our case (inside [..]) it negates the set: it matches any character which is NOT one of the characters that come after it. Hence [^a] matches any character which is not a.
& is just & (which is the delimiter between your name=value pairs)
* -> is repetition (right) -> zero or more occurrences of what precedes it.
[^&] means any single character which is not &, and [^&]* means zero or more characters which are not &.
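
To see the pieces above at work, here is a minimal sketch that prints what each group holds for the first match. The class name GroupDemo and the firstMatch helper are invented for illustration; the input is a shortened piece of the query string from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns {whole match, group 1, group 2} for the first match, or an empty array if none.
    static String[] firstMatch(String text) {
        Matcher m = Pattern.compile("search(City|State|Person)=([^&]*)?").matcher(text);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(0), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = firstMatch("pid=3064&searchCity=NEW+LENOX&searchState=IL");

        // group(0): everything the pattern matched, i.e. "search" + name + "=" + value
        System.out.println("whole match: " + parts[0]);   // searchCity=NEW+LENOX
        // group(1): the first (...) pair, one of City|State|Person
        System.out.println("name:        " + parts[1]);   // City
        // group(2): the second (...) pair, [^&]* = everything up to the next '&'
        System.out.println("value:       " + parts[2]);   // NEW+LENOX
    }
}
```

Groups are numbered by their opening parenthesis from left to right, and group(0) is always the whole match.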

For short regular expression tutorials check this: http://www.regular-expressions.info/quickstart.html
thanks
:-)