Need to extract company name, address, phone numbers, etc. from webpages

I need to extract some data from the pages linked on this URL:

http://www.ahd.com/freelist.php3?mname=&mcity=&mstate=TX&mzip=&mphone=&submitted=Submit+Query

I suppose a simple Linux shell script would be able to do the trick, but I'm not an expert programmer. What I would like it to do is retrieve each URL listed on that page, extract the data from the table shown below, and plop it into a CSV or similar file. In other words, I want a spreadsheet with the fields:

HOSPITAL_NAME
HOSPITAL_ADDRESS
CITY
STATE
ZIP
PHONE NUMBER
WEBSITE
MEDICARE_NUMBER
CONTROL_TYPE
NO_OF_BEDS

Need to extract the data from this part of the webpage:

<table height="119" width="100%">
<tr valign="top">
<td align="right" height="50">Name
and address:</td>
<td align="left" height="50"> 
Abilene Regional Medical Ctr 6250 Highway 83/84 ABILENE, TX 79606-5299 </td>
<td align="left" height="138" rowspan="6" width="200"> <a href="EnhancedListings.html"><img border="0" src="enhanced/report_images/Default-banner.gif" width="200" height="200"></a></td>

</tr>
<tr>
<td align="right" height="17">Telephone:</td>
<td align="left" height="17">
(915) 695-9900 </td>
</tr>
<tr>
<td align="right" height="17">Hospital
Web site:</td>
<td align="left" height="17">
<A href="http://www.armc.info" style="text-decoration: none; color: #006699;">www.armc.info</A> </td>
</tr>
<tr>
<td align="right" height="18">Medicare
Provider Number:</td>
<td align="left" height="18">
450558 </td>
</tr>
<tr>
<td align="right" height="18">Type
of Control:</td>
<td align="left" height="18">
FORPROFIT CORPORATION </td>
</tr>
<tr>
<td align="right" height="18">Total
Staffed Beds:</td>
<td align="left" height="18">
187 </td>
</tr>
</table>

sunnycoder

sudama, can you refine your question a bit:

What strings (tags or some fixed characteristics) should be used to identify the text to extract?

Is it based on line number, formatting, etc. (something that does not change between pages)? The script has to identify these regions and extract the relevant portion, but first you need to give the input format and the information above.

sudama

ASKER

Thanks for the comment. Basically what I would like to do is 'snip' out the relevant portions of each webpage. I thought maybe it could be done by searching for matching strings before and after each data field.

sunnycoder

>I thought maybe it could be done by searching for matching strings before and after each data field.
these strings are what I want you to specify

sudama

ASKER

Well, in the first example it would be:

<td align="left" height="50"> 

That is right before the address:

Abilene Regional Medical Ctr 6250 Highway 83/84 ABILENE, TX 79606-5299

Then there's this after the address:

</td>

The same approach would separate out the phone number: just identify what's before and after, and snip out anything in between.
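
The "before and after" idea described here can be sketched as a small shell function. This is only an illustration, not a tested solution for the whole site: `snip_between` is a made-up name, and it assumes each marker appears literally in the page and that the first occurrence is the one wanted. Newlines are flattened first, because in the sample the value sits on the line after the opening tag.

```shell
# snip_between START END
# Reads HTML on stdin and prints the text between the first START marker
# and the next END marker, with surrounding whitespace trimmed.
# (Hypothetical helper; markers are matched literally, not as regexes.)
snip_between() {
  { tr '\n' ' '; echo; } | awk -v s="$1" -v e="$2" '{
    i = index($0, s)                  # position of the start marker
    if (!i) exit 1
    rest = substr($0, i + length(s))  # everything after the start marker
    j = index(rest, e)                # position of the end marker
    if (!j) exit 1
    out = substr(rest, 1, j - 1)
    gsub(/^[ \t]+|[ \t]+$/, "", out)  # trim leading/trailing blanks
    print out
  }'
}
```

For example, `snip_between '<td align="left" height="17">' '</td>' < page.html` would print the first value found in a cell with those attributes (`page.html` is a placeholder for a saved copy of one hospital page).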

sunnycoder

<td align="left" height="50"> Abilene Regional Medical
^ snip 1
Ctr 6250 Highway 83/84 ABILENE, TX 79606-5299 </td>
^ ^ snip2 ^

sed 's:<td align="left" height="50"> \([^<]*\) \([^<]*\).*:\1 \2:' input_file
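
Since several cells in the sample share the same attributes (for example `height="17"`), another option is to key on the label text in each row rather than on the tag. The sketch below is one possible way to do that, under some assumptions: `extract_fields` is a hypothetical name, rows are assumed to end with a lowercase `</tr>`, and each row is assumed to hold a label ending in a colon followed by its value. It prints one `label|value` line per row. The combined name-and-address string is left as a single field, since the excerpt does not show how its parts are separated.

```shell
# extract_fields
# Reads the HTML of one hospital page on stdin and prints "label|value"
# for every table row that contains a "Label: value" pair.
# (Hypothetical helper; assumes lowercase </tr> closes each row.)
extract_fields() {
  { tr '\n' ' '; echo; } | awk '{
    n = split($0, rows, /<\/tr>/)      # one array element per table row
    for (i = 1; i <= n; i++) {
      row = rows[i]
      gsub(/<[^>]*>/, " ", row)        # replace every tag with a space
      gsub(/[ \t]+/, " ", row)         # collapse runs of whitespace
      sub(/^ /, "", row); sub(/ $/, "", row)
      p = index(row, ":")              # first colon ends the label
      if (p) print substr(row, 1, p - 1) "|" substr(row, p + 2)
    }
  }'
}
```

Run as `extract_fields < page.html`, where `page.html` is a placeholder for a saved copy of one page. Turning the output into a CSV row would still need quoting for values that contain commas, such as the address.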

ASKER CERTIFIED SOLUTION

pbhj

This solution is only available to members.

sunnycoder

sudama, can you please explain why pbhj's answer was accepted?