• Status: Solved

Need to extract company name, address, phone numbers, etc. from webpages

I need to extract some data from the pages linked on this URL:

http://www.ahd.com/freelist.php3?mname=&mcity=&mstate=TX&mzip=&mphone=&submitted=Submit+Query

I suppose a simple Linux shell script would be able to do the trick, but I'm not an expert programmer.  What I would like it to do is retrieve each URL listed on that page, extract the data from the table shown below, and plop it into a CSV or similar file.  In other words, I want a spreadsheet with the fields:

HOSPITAL_NAME
HOSPITAL_ADDRESS
CITY
STATE
ZIP
PHONE_NUMBER
WEBSITE
MEDICARE_NUMBER
CONTROL_TYPE
NO_OF_BEDS

Need to extract the data from this part of the webpage:

<table height="119" width="100%">
  <tr valign="top">
    <td align="right" height="50"><font face="verdana,arial,helvetica" size="2">Name
      and address:</font></td>
    <td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">
      Abilene Regional Medical Ctr<BR>6250 Highway 83/84<BR>ABILENE, TX 79606-5299      </font></strong></td>
      <td align="left" height="138" rowspan="6" width="200"> <a href="EnhancedListings.html"><img border="0" src="enhanced/report_images/Default-banner.gif" width="200" height="200"></a></td>

  </tr>
  <tr>
    <td align="right" height="17"><font face="verdana,arial,helvetica" size="2">Telephone:</font></td>
    <td align="left" height="17"><font face="verdana,arial,helvetica" size="2">
      (915) 695-9900      </font></td>
  </tr>
  <tr>
    <td align="right" height="17"><font face="verdana,arial,helvetica" size="2">Hospital
      Web site:</font></td>
    <td align="left" height="17"><u><font face="verdana,arial,helvetica" size="2" >
      <A href="http://www.armc.info" style="text-decoration: none; color: #006699;">www.armc.info</A>      </font></u></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Medicare
      Provider Number:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      450558      </font></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Type
      of Control:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      FORPROFIT CORPORATION      </font></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Total
      Staffed Beds:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      187      </font></td>
  </tr>
</table>
Asked by: sudama
 
sunnycoder Commented:
sudama, can you refine your question a bit?

What strings (tags or some other fixed characteristics) should be used to identify each piece of data to extract?

Is it based on line number, formatting, or something else that does not change between pages? The script has to identify these regions and extract the relevant portions, but first you need to specify the input format and the information above.
 
sudama (Author) Commented:
Thanks for the comment.  Basically what I would like to do is 'snip' out the relevant portions of each webpage.  I thought maybe it could be done by searching for matching strings before and after each data field.
 
sunnycoder Commented:
>I thought maybe it could be done by searching for matching strings before and after each data field.
Those strings are what I want you to specify.
 
sudama (Author) Commented:
Well, in the first example it would be:

<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">

That is right before the address:

Abilene Regional Medical Ctr<BR>6250 Highway 83/84<BR>ABILENE, TX 79606-5299

Then there's this after the address:

      </font></strong></td>

The same approach would work for the phone number: just identify what's before and after it and snip out anything in between.
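
The "identify what's before and after" idea can be sketched in a few lines of Python. This is only an illustration, not something posted in the thread: the helper name `snip_between` is made up for the example, and it assumes the marker strings appear in the page exactly as quoted above.

```python
def snip_between(text, before, after):
    """Return the text between the first `before` marker and the next
    `after` marker, or None if either marker is missing.
    (Hypothetical helper, named here only for illustration.)"""
    start = text.find(before)
    if start == -1:
        return None
    start += len(before)
    end = text.find(after, start)
    if end == -1:
        return None
    # Collapse runs of whitespace (including newlines) left by the HTML layout.
    return " ".join(text[start:end].split())


# Sample fragment copied from the page quoted in the question.
page = '''<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">
      Abilene Regional Medical Ctr<BR>6250 Highway 83/84<BR>ABILENE, TX 79606-5299      </font></strong></td>'''

before = '<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">'
after = '</font></strong></td>'

address_block = snip_between(page, before, after)
print(address_block)
```

Because it searches the whole page text rather than one line at a time, this works even though the address sits on the line after the opening tags. It will break if the site changes any attribute in the marker strings.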
 
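sunnycoder Commented:
Assuming the whole cell is on a single line:

<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2"> Abilene Regional Medical Ctr<br>6250 Highway 83/84<BR>ABILENE, TX 79606-5299 </font></strong></td>

snip 1: the text between the opening <font ...> tag and the first <br> (the hospital name)
snip 2: the text between the first <br> and the next tag (the street address)

sed 's:<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">\([^<]*\)<[Bb][Rr]>\([^<]*\).*:\1  \2:'    input_file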
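
For comparison, here is a rough Python sketch of the same capture idea, extended to split the city, state and ZIP as well and to write one CSV row. It is an assumption-laden example rather than a tested scraper: the names `ADDRESS_RE` and `split_address` are invented for illustration, and the pattern assumes every page uses the "name<BR>street<BR>CITY, ST ZIP" layout of the sample.

```python
import csv
import io
import re

# Matches "name<BR>street<BR>CITY, ST ZIP"; <br> is matched case-insensitively.
ADDRESS_RE = re.compile(
    r"\s*(?P<name>[^<]*?)\s*<br>\s*(?P<street>[^<]*?)\s*<br>\s*"
    r"(?P<city>[^,<]*),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)",
    re.IGNORECASE,
)


def split_address(block):
    """Split the name/address cell into five fields, or return None
    if the block does not follow the assumed layout."""
    m = ADDRESS_RE.search(block)
    if m is None:
        return None
    return [m.group("name"), m.group("street"),
            m.group("city").strip(), m.group("state"), m.group("zip")]


# Cell contents as quoted in the question.
block = "Abilene Regional Medical Ctr<BR>6250 Highway 83/84<BR>ABILENE, TX 79606-5299"
row = split_address(block)

# Write a header and one row to an in-memory CSV; a real script would
# open a file instead and append one row per hospital page.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["HOSPITAL_NAME", "HOSPITAL_ADDRESS", "CITY", "STATE", "ZIP"])
writer.writerow(row)
csv_text = out.getvalue()
print(csv_text, end="")
```

Using the csv module rather than joining with commas by hand means a hospital name that itself contains a comma would still be quoted correctly.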
 
pbhj Commented:
This is probably illegal - yeah, 'fraid so. There are intellectual property rights in databases, and copying the information without consent (i.e. circumventing the provided method, human viewing via a webpage) is likely illegal. The data (on hospitals) may be available from an alternative source though, e.g. a local health authority.

That said, assuming you wanted to copy and have gained the permission of the site maintainers, just ask for a database dump and do some SQL to produce your table.

You may also like to google for information on screen scraping (using Perl, I think) for getting television listings information ... I've seen tutorials out there about this.

HTH, keep it legal.
 
sunnycoder Commented:
sudama, can you please explain why pbhj's answer was accepted?