Need to extract company name, address, phone numbers, etc. from webpages

I need to extract some data from the pages linked on this URL:

http://www.ahd.com/freelist.php3?mname=&mcity=&mstate=TX&mzip=&mphone=&submitted=Submit+Query

I suppose a simple Linux shell script would do the trick, but I'm not an expert programmer.  I would like it to retrieve each URL listed on that page, extract the data from the table shown below, and put it into a CSV or similar file.  In other words, I want a spreadsheet with the fields:

HOSPITAL_NAME
HOSPITAL_ADDRESS
CITY
STATE
ZIP
PHONE NUMBER
WEBSITE
MEDICARE_NUMBER
CONTROL_TYPE
NO_OF_BEDS
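
The field list above maps directly onto a CSV header row. Below is a minimal sketch of writing such rows from a shell script. The helper names csv_quote and csv_row are hypothetical, and the data row simply reuses the sample values from the table in this question. Every value is wrapped in double quotes so that commas inside a value do not split the column.

```shell
#!/bin/sh
# Quote one value for CSV: wrap it in double quotes and double any
# embedded double quotes.
csv_quote() {
    printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# Print all arguments as a single comma-separated CSV row.
csv_row() {
    sep=''
    for value in "$@"; do
        printf '%s' "$sep"
        csv_quote "$value"
        sep=','
    done
    printf '\n'
}

# Header row, using the field names listed in the question.
csv_row HOSPITAL_NAME HOSPITAL_ADDRESS CITY STATE ZIP 'PHONE NUMBER' \
        WEBSITE MEDICARE_NUMBER CONTROL_TYPE NO_OF_BEDS

# One data row, using the sample values from the table in the question.
csv_row 'Abilene Regional Medical Ctr' '6250 Highway 83/84' 'ABILENE' 'TX' \
        '79606-5299' '(915) 695-9900' 'www.armc.info' '450558' \
        'FORPROFIT CORPORATION' '187'
```

Redirecting the script's output to a file (for example hospitals.csv, a name chosen here only for illustration) should give something a spreadsheet program can open.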

Need to extract the data from this part of the webpage:

<table height="119" width="100%">
  <tr valign="top">
    <td align="right" height="50"><font face="verdana,arial,helvetica" size="2">Name
      and address:</font></td>
    <td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">
      Abilene Regional Medical Ctr<BR>6250 Highway 83/84<BR>ABILENE, TX 79606-5299      </font></strong></td>
      <td align="left" height="138" rowspan="6" width="200"> <a href="EnhancedListings.html"><img border="0" src="enhanced/report_images/Default-banner.gif" width="200" height="200"></a></td>

  </tr>
  <tr>
    <td align="right" height="17"><font face="verdana,arial,helvetica" size="2">Telephone:</font></td>
    <td align="left" height="17"><font face="verdana,arial,helvetica" size="2">
      (915) 695-9900      </font></td>
  </tr>
  <tr>
    <td align="right" height="17"><font face="verdana,arial,helvetica" size="2">Hospital
      Web site:</font></td>
    <td align="left" height="17"><u><font face="verdana,arial,helvetica" size="2" >
      <A href="http://www.armc.info" style="text-decoration: none; color: #006699;">www.armc.info</A>      </font></u></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Medicare
      Provider Number:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      450558      </font></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Type
      of Control:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      FORPROFIT CORPORATION      </font></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Total
      Staffed Beds:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      187      </font></td>
  </tr>
</table>
sudamaAsked:

sunnycoderCommented:
sudama, can you refine your question a bit?

Which strings (tags or other fixed characteristics) should be used to identify each field?

Or is it based on line number, formatting, etc. (something that does not change between pages)? The script has to identify these regions and extract the relevant portion, but first you need to specify the input format and the information above.
sudamaAuthor Commented:
Thanks for the comment.  Basically what I would like to do is 'snip' out the relevant portions of each webpage.  I thought maybe it could be done by searching for matching strings before and after each data field.
sunnycoderCommented:
>I thought maybe it could be done by searching for matching strings before and after each data field.
These strings are what I want you to specify.
sudamaAuthor Commented:
Well, in the first example it would be:

<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">

That is right before the address:

Abilene Regional Medical Ctr<br>6250 Highway 83/84<BR>ABILENE, TX 79606-5299      

Then there's this after the address:

      </font></strong></td>

The same would apply to the phone number: just identify what's before and after, and snip out anything in between.
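
A minimal sketch of this "snip between two markers" idea as a shell function follows. The function name between is hypothetical. It assumes the page has been saved to a file, and that it is acceptable to flatten the page to a single line first, so that a marker and its data may sit on different lines (as they do in the table above).

```shell
#!/bin/sh
# between START END
# Read HTML on stdin, flatten it to one line with single spaces, and print
# the text between the first START marker and the next END marker.
between() {
    tr '\r\n\t' '   ' | tr -s ' ' |
    SNIP_START="$1" SNIP_END="$2" awk '{
        s = ENVIRON["SNIP_START"]; e = ENVIRON["SNIP_END"]
        i = index($0, s)
        if (i == 0) exit 1              # start marker not found
        rest = substr($0, i + length(s))
        j = index(rest, e)
        if (j == 0) exit 1              # end marker not found
        out = substr(rest, 1, j - 1)
        gsub(/^ +| +$/, "", out)        # trim leading and trailing spaces
        print out
    }'
}
```

For example, with a page saved as page.html (a file name assumed here), the address cell could be snipped with:

between '<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">' '</font></strong></td>' < page.html

Because whitespace is squeezed before matching, the markers need single spaces wherever the page has runs of spaces or line breaks.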
sunnycoderCommented:
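Taking your example cell on one line:

<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2"> Abilene Regional Medical Ctr<br>6250 Highway 83/84<BR>ABILENE, TX 79606-5299 </font></strong></td>

snip 1 is the text from the end of the font tag up to the first <br>, and snip 2 is the text from there up to the next tag. In the command below, \1 captures snip 1 and \2 captures snip 2:

sed -n 's:.*<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2"> *\([^<]*\)<[Bb][Rr]>\([^<]*\).*:\1  \2:p'    input_file

Note that sed works line by line, so this only matches when the opening tags and the text are on the same line. In the page source you posted, the text starts on the line after the font tag, so the lines would need to be joined first.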
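
Going one step further, the name and address cell can be split into all five fields the question asks for. This is only a sketch under assumptions: the cell contents have already been isolated and joined onto a single line, the layout is always NAME<BR>STREET<BR>CITY, ST ZIP as in the sample, and the function name split_address and the | output separator are chosen here for illustration.

```shell
#!/bin/sh
# split_address
# Read lines of the form "NAME<BR>STREET<BR>CITY, ST ZIP" on stdin and
# print "NAME|STREET|CITY|ST|ZIP". The <BR> tag may be upper or lower case.
# Lines that do not fit this layout print nothing (sed -n with the p flag).
split_address() {
    sed -n 's/^ *\([^<]*\)<[Bb][Rr]>\([^<]*\)<[Bb][Rr]>\([^,<]*\), *\([A-Z][A-Z]\)  *\([0-9][0-9-]*\) *$/\1|\2|\3|\4|\5/p'
}

# Example with the sample cell from the question:
printf '%s\n' 'Abilene Regional Medical Ctr<br>6250 Highway 83/84<BR>ABILENE, TX 79606-5299' |
    split_address
```

The | separator is used because it does not appear in the sample data; the result can be split into variables with IFS='|' and read. Hospitals whose address has more than two <BR>-separated lines before the city would not match and would need a looser pattern.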
pbhjCommented:
This is probably illegal, I'm afraid. There are intellectual property rights in databases, and copying the information without consent (i.e. circumventing the provided method, which is human viewing via a webpage) is likely illegal. The data on hospitals may be available from an alternative source though, e.g. the local health authority.

That said, if you have gained the permission of the site maintainers, just ask them for a database dump and use some SQL to produce your table.

You may also like to search for information on screen scraping (using Perl, I think) for getting television listings information ... I've seen tutorials out there about this.

HTH, keep it legal.
sunnycoderCommented:
sudama, can you please explain why pbhj's answer was accepted?