sudama
asked on
Need to extract company name, address, phone numbers, etc. from webpages
I need to extract some data from the pages linked on this URL:
http://www.ahd.com/freelist.php3?mname=&mcity=&mstate=TX&mzip=&mphone=&submitted=Submit+Query
I suppose a simple Linux shell script would be able to do the trick, but I'm not an expert programmer. What I would like it to do is retrieve each URL listed on that page, extract the data from the table shown below, and write it to a CSV or similar file. In other words, I want a spreadsheet with the fields:
HOSPITAL_NAME
HOSPITAL_ADDRESS
CITY
STATE
ZIP
PHONE_NUMBER
WEBSITE
MEDICARE_NUMBER
CONTROL_TYPE
NO_OF_BEDS
Need to extract the data from this part of the webpage:
<table height="119" width="100%">
<tr valign="top">
<td align="right" height="50"><font face="verdana,arial,helvetica" size="2">Name
and address:</font></td>
<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">
Abilene Regional Medical Ctr<BR>6250 Highway 83/84<BR>ABILENE, TX 79606-5299 </font></strong></td>
<td align="left" height="138" rowspan="6" width="200"> <a href="EnhancedListings.html"><img border="0" src="enhanced/report_images/Default-banner.gif" width="200" height="200"></a></td>
</tr>
<tr>
<td align="right" height="17"><font face="verdana,arial,helvetica" size="2">Telephone:</font> </td>
<td align="left" height="17"><font face="verdana,arial,helvetica" size="2">
(915) 695-9900 </font></td>
</tr>
<tr>
<td align="right" height="17"><font face="verdana,arial,helvetica" size="2">Hospital
Web site:</font></td>
<td align="left" height="17"><u><font face="verdana,arial,helvetica" size="2" >
<A href="http://www.armc.info" style="text-decoration: none; color: #006699;">www.armc.info</A> </font></u></td>
</tr>
<tr>
<td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Medicare
Provider Number:</font></td>
<td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
450558 </font></td>
</tr>
<tr>
<td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Type
of Control:</font></td>
<td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
FORPROFIT CORPORATION </font></td>
</tr>
<tr>
<td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Total
Staffed Beds:</font></td>
<td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
187 </font></td>
</tr>
</table>
ASKER
Thanks for the comment. Basically what I would like to do is 'snip' out the relevant portions of each webpage. I thought maybe it could be done by searching for matching strings before and after each data field.
>I thought maybe it could be done by searching for matching strings before and after each data field.
Those strings are what I want you to specify.
ASKER
Well, in the first example it would be:
<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">
That is right before the address:
Abilene Regional Medical Ctr<br>6250 Highway 83/84<BR>ABILENE, TX 79606-5299
Then there's this after the address:
</font></strong></td>
The same would work for the phone number: just identify what's before and after, and snip out anything in between.
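The "what's before, what's after" idea described here can be sketched as a small shell function. This is only a sketch under some assumptions: `snip_between` is a made-up name, the two markers are treated as literal strings (not patterns), and the page is flattened to one line first because the markers and the data sit on different lines in the source. It also assumes the `font` tag reads `helvetica` with no stray space.

```shell
#!/bin/sh
# snip_between BEFORE AFTER   (hypothetical helper, not a standard tool)
# Reads HTML on stdin and prints the text between the first occurrence of
# BEFORE and the next occurrence of AFTER. Both markers are literal strings.
snip_between() {
    # Join all lines into one, then add a final newline back for awk.
    { tr -d '\r' | tr '\n' ' '; echo; } |
    BEFORE="$1" AFTER="$2" awk '
    {
        before = ENVIRON["BEFORE"]; after = ENVIRON["AFTER"]
        start = index($0, before)
        if (start == 0) exit 1            # opening marker not found
        rest = substr($0, start + length(before))
        stop = index(rest, after)
        if (stop == 0) exit 1             # closing marker not found
        field = substr(rest, 1, stop - 1)
        gsub(/^[ \t]+|[ \t]+$/, "", field)  # trim surrounding whitespace
        print field
    }'
}
```

With a saved page (the file name `page.html` is just a placeholder), the address block would then come out with something like `snip_between '<strong> <font face="verdana,arial,helvetica" size="2">' '</font></strong></td>' < page.html`. The function returns a non-zero status when either marker is missing, so a calling script can tell that a page did not match.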
<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2"> Abilene Regional Medical
^ snip 1
Ctr<br>6250 Highway 83/84<BR>ABILENE, TX 79606-5299 </font></strong></td>
^ ^ snip2 ^
sed 's:<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">\([^<]*\)<br>\([^<]*\).*:\1 \2:' input_file
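Once the address block has been snipped out, it still has to be split into the separate spreadsheet columns. A rough sketch of that step follows. It assumes each block has exactly the shape `Name<BR>Street<BR>CITY, ST ZIP` seen in the sample (pages with extra address lines would need more handling), and `address_to_csv` is a made-up name.

```shell
#!/bin/sh
# address_to_csv   (hypothetical helper)
# Reads one address block per line on stdin, in the form
#   Name<BR>Street<BR>CITY, ST ZIP
# and prints one CSV row per block: name,street,city,state,zip
address_to_csv() {
    awk '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    # Quote a CSV field, doubling any embedded double quotes.
    function csv(s) { gsub(/"/, "\"\"", s); return "\"" s "\"" }
    {
        # The samples contain both <br> and <BR>, so match either case.
        gsub(/<[Bb][Rr] *\/?>/, "\t")
        n = split($0, part, "\t")
        if (n != 3) { bad = 1; next }      # not the assumed three-part layout
        comma = index(part[3], ",")
        if (comma == 0) { bad = 1; next }  # no "CITY, ST ZIP" separator
        city = substr(part[3], 1, comma - 1)
        split(substr(part[3], comma + 1), sz, " ")  # sz[1] = state, sz[2] = zip
        print csv(trim(part[1])) "," csv(trim(part[2])) "," \
              csv(trim(city)) "," csv(sz[1]) "," csv(sz[2])
    }
    END { exit bad }'
}
```

Blocks that do not fit the assumed layout are skipped and make the function return a non-zero status, which seems safer than writing a misaligned row into the spreadsheet.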
ASKER CERTIFIED SOLUTION
sudama, can you please explain why pbhj's answer was accepted?
Which strings (tags or other fixed characteristics) should be used to identify each field? Is it based on line number, formatting, or something else that does not change between pages? The script has to identify these regions and extract the relevant portion, but first you need to give the input format and the information above.
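One possible answer to the "which strings" question is to key on the visible labels in the sample table ("Telephone:", "Hospital Web site:", and so on) rather than on tags or line numbers. The sketch below strips all tags and then takes the text between one label and the next. It assumes every page uses the same labels in the same order as the sample, which has not been confirmed, and `field_by_label` is a made-up name.

```shell
#!/bin/sh
# field_by_label LABEL NEXT_LABEL   (hypothetical helper)
# Reads HTML on stdin, removes the tags, and prints the text that follows
# LABEL up to NEXT_LABEL. Pass an empty NEXT_LABEL for the last field.
field_by_label() {
    # Flatten to one line, replace every tag with a space, squeeze whitespace.
    { tr -d '\r' | tr '\n' ' '; echo; } |
    sed -e 's/<[^>]*>/ /g' -e 's/[[:space:]][[:space:]]*/ /g' |
    LABEL="$1" NEXT="$2" awk '
    {
        label = ENVIRON["LABEL"]; nxt = ENVIRON["NEXT"]
        start = index($0, label)
        if (start == 0) exit 1            # label not present on this page
        rest = substr($0, start + length(label))
        if (nxt != "") {
            stop = index(rest, nxt)
            if (stop > 0) rest = substr(rest, 1, stop - 1)
        }
        gsub(/^ +| +$/, "", rest)         # trim surrounding spaces
        print rest
    }'
}
```

For example, `field_by_label 'Telephone:' 'Hospital Web site:' < page.html` (with `page.html` standing in for a saved page) should print just the phone number. Because the tags are removed, the `<BR>` separators inside the name and address cell are lost, so that one field is better taken from the raw HTML.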