sudama

Need to extract company name, address, phone numbers, etc. from webpages

I need to extract some data from the pages linked on this URL:

http://www.ahd.com/freelist.php3?mname=&mcity=&mstate=TX&mzip=&mphone=&submitted=Submit+Query

I suppose a simple Linux shell script could do the trick, but I'm not an expert programmer. I would like it to retrieve each URL listed on that page, extract the data from the table shown below, and write it to a CSV or similar file. In other words, I want a spreadsheet with the fields:

HOSPITAL_NAME
HOSPITAL_ADDRESS
CITY
STATE
ZIP
PHONE_NUMBER
WEBSITE
MEDICARE_NUMBER
CONTROL_TYPE
NO_OF_BEDS
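
For the "retrieve each URL listed on that page" step, here is a rough sketch of how a shell script might collect the links and download the pages. The thread never shows the listing page's markup, so the link pattern passed to fetch_pages is a placeholder to adapt, and the function names list_links and fetch_pages are made up for this example.

```shell
# Print every href value found in the HTML on stdin, one per line.
list_links() {
  grep -io 'href="[^"]*"' | sed -e 's/^[^"]*"//' -e 's/"$//'
}

# Download the listing page, then each linked page whose URL matches a pattern.
# $1 = listing URL
# $2 = base URL to prepend to relative links
# $3 = grep pattern selecting the detail links (placeholder, not from the thread)
# $4 = output directory
fetch_pages() {
  mkdir -p "$4" || return 1
  n=0
  curl -s "$1" | list_links | grep -- "$3" | sort -u |
  while IFS= read -r link; do
    n=$((n + 1))
    case $link in
      http://*|https://*) url=$link ;;   # already absolute
      *) url=$2$link ;;                  # relative link: add the base URL
    esac
    curl -s -o "$4/page_$n.html" "$url"
    sleep 1                              # pause between requests
  done
}
```

Something like `fetch_pages "$LISTING_URL" "http://www.ahd.com/" "$PATTERN" pages` would then leave one HTML file per hospital in the pages directory, ready for the extraction step.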

The data I need is in this part of each page:

<table height="119" width="100%">
  <tr valign="top">
    <td align="right" height="50"><font face="verdana,arial,helvetica" size="2">Name
      and address:</font></td>
    <td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">
      Abilene Regional Medical Ctr<BR>6250 Highway 83/84<BR>ABILENE, TX 79606-5299      </font></strong></td>
      <td align="left" height="138" rowspan="6" width="200"> <a href="EnhancedListings.html"><img border="0" src="enhanced/report_images/Default-banner.gif" width="200" height="200"></a></td>

  </tr>
  <tr>
    <td align="right" height="17"><font face="verdana,arial,helvetica" size="2">Telephone:</font></td>
    <td align="left" height="17"><font face="verdana,arial,helvetica" size="2">
      (915) 695-9900      </font></td>
  </tr>
  <tr>
    <td align="right" height="17"><font face="verdana,arial,helvetica" size="2">Hospital
      Web site:</font></td>
    <td align="left" height="17"><u><font face="verdana,arial,helvetica" size="2" >
      <A href="http://www.armc.info" style="text-decoration: none; color: #006699;">www.armc.info</A>      </font></u></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Medicare
      Provider Number:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      450558      </font></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Type
      of Control:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      FORPROFIT CORPORATION      </font></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Total
      Staffed Beds:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      187      </font></td>
  </tr>
</table>
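
One possible way to pull the label/value pairs out of a table like the one above, sketched with tr and awk. It assumes the label sits in the first cell of each row and the value in the second, and that neither contains a "|" or ";" character. The function name rows_to_pairs is invented for this example.

```shell
# Read one HTML page on stdin and print "label<TAB>value" for each table row.
rows_to_pairs() {
  # Join the page into one line so a cell that spans lines stays together.
  { tr '\r\n' '  '; echo; } | awk '
  function trim(s) { sub(/^ +/, "", s); sub(/ +$/, "", s); return s }
  {
    n = split($0, rows, /<\/[Tt][Rr]>/)       # one chunk per table row
    for (i = 1; i <= n; i++) {
      row = rows[i]
      gsub(/<[Bb][Rr]>/, ";", row)            # keep address line breaks as ";"
      gsub(/<\/[Tt][Dd]>/, "|", row)          # mark the end of every cell
      gsub(/<[^>]*>/, "", row)                # drop all remaining tags
      gsub(/[ \t]+/, " ", row)                # squeeze runs of whitespace
      split(row, cell, "|")
      label = trim(cell[1]); sub(/:$/, "", label)
      value = trim(cell[2])
      if (label != "" && value != "") print label "\t" value
    }
  }'
}
```

Run as `rows_to_pairs < page.html`, this should print lines such as "Telephone", a tab, then "(915) 695-9900". A full page will contain other tables too, so the output would still need filtering by label.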
sunnycoder

sudama, can you refine your question a bit?

Which strings (tags or other fixed characteristics) should be used to identify each field?

Is the position based on line number, formatting, or something else that does not change between pages? The script has to locate these regions and extract the relevant portion, but first you need to describe the input format and give the information above.
sudama

ASKER

Thanks for the comment.  Basically what I would like to do is 'snip' out the relevant portions of each webpage.  I thought maybe it could be done by searching for matching strings before and after each data field.
>I thought maybe it could be done by searching for matching strings before and after each data field.
These strings are what I want you to specify.
sudama

ASKER

Well, in the first example it would be:

<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">

That is right before the address:

Abilene Regional Medical Ctr<br>6250 Highway 83/84<BR>ABILENE, TX 79606-5299      

Then there's this after the address:

      </font></strong></td>

The same would work to separate out the phone number: just identify what's before and after, and snip out anything in between.
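
The "identify what's before and after" idea could be sketched as a small helper that takes the two marker strings literally (no regex escaping needed). This is only an illustration: the name snip_between is invented here, it returns the first match only, and it relies on the "before" marker being unique on the page.

```shell
# Print the text found between two literal marker strings in the HTML on stdin.
# $1 = text right before the field, $2 = text right after it.
snip_between() {
  # Join the page into one line first; newlines inside the page become spaces.
  { tr '\r\n' '  '; echo; } |
  BEFORE="$1" AFTER="$2" awk '
  {
    before = ENVIRON["BEFORE"]; after = ENVIRON["AFTER"]
    start = index($0, before)                 # first occurrence of the marker
    if (start == 0) exit 1
    rest = substr($0, start + length(before))
    stop = index(rest, after)
    if (stop == 0) exit 1
    field = substr(rest, 1, stop - 1)
    sub(/^ +/, "", field); sub(/ +$/, "", field)
    print field
  }'
}
```

With the two address markers quoted above as arguments, this should print the name and address still joined by their BR tags. A marker such as `size="2">` on its own would not work for the phone number, because it also occurs earlier on the page.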
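<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2"> Abilene Regional Medical Ctr<br>6250 Highway 83/84<BR>ABILENE, TX 79606-5299 </font></strong></td>

snip 1: everything up to and including the opening <font ...> tag
snip 2: everything from the closing </font></strong></td> onward
The <br> tags in between separate the name, the street address, and the city line.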
sed 's:<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">\([^<]*\)<[Bb][Rr]>\([^<]*\).*:\1  \2:' input_file
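
The sed command works line by line, so it only matches when the opening tags and the address sit on the same line of input_file. It also leaves the city, state, and ZIP unsplit. As a hedged follow-up, here is one way the snipped address could be broken into the CSV fields the question asks for. It assumes the last BR-separated part always has the form "CITY, ST ZIP". The function name address_to_csv is made up for this sketch.

```shell
# Turn "Name<BR>Street<BR>CITY, ST ZIP" lines on stdin into quoted CSV rows:
# name, street, city, state, zip.
address_to_csv() {
  awk '
  function trim(s) { sub(/^[ \t]+/, "", s); sub(/[ \t]+$/, "", s); return s }
  function q(s) { gsub(/"/, "\"\"", s); return "\"" s "\"" }   # CSV quoting
  {
    n = split($0, part, /<[Bb][Rr]>/)
    if (n < 3) next                       # not in the expected three-part shape
    last = trim(part[n])                  # e.g. ABILENE, TX 79606-5299
    comma = index(last, ",")
    if (comma == 0) next
    street = trim(part[2])
    for (i = 3; i < n; i++) street = street ", " trim(part[i])
    split(substr(last, comma + 1), sz, " ")   # sz[1] = state, sz[2] = ZIP
    print q(trim(part[1])) "," q(street) "," q(trim(substr(last, 1, comma - 1))) "," q(sz[1]) "," q(sz[2])
  }'
}
```

Lines that do not fit the assumed shape are skipped silently, so it is worth comparing the number of output rows against the number of pages processed.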
ASKER CERTIFIED SOLUTION
pbhj

This solution is only available to members.
sudama, can you please explain why pbhj's answer was accepted?