Solved

Need to extract company name, address, phone numbers, etc. from webpages

Posted on 2003-11-26
Last Modified: 2010-04-22
I need to extract some data from the pages linked on this URL:

http://www.ahd.com/freelist.php3?mname=&mcity=&mstate=TX&mzip=&mphone=&submitted=Submit+Query

I suppose a simple Linux shell script could do the trick, but I'm not an expert programmer.  What I would like it to do is retrieve each URL listed on that page, extract the data from the table shown below, and put it into a CSV or similar file.  In other words, I want a spreadsheet with the fields:

HOSPITAL_NAME
HOSPITAL_ADDRESS
CITY
STATE
ZIP
PHONE NUMBER
WEBSITE
MEDICARE_NUMBER
CONTROL_TYPE
NO_OF_BEDS

Need to extract the data from this part of the webpage:

<table height="119" width="100%">
  <tr valign="top">
    <td align="right" height="50"><font face="verdana,arial,helvetica" size="2">Name
      and address:</font></td>
    <td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">
      Abilene Regional Medical Ctr<BR>6250 Highway 83/84<BR>ABILENE, TX 79606-5299      </font></strong></td>
      <td align="left" height="138" rowspan="6" width="200"> <a href="EnhancedListings.html"><img border="0" src="enhanced/report_images/Default-banner.gif" width="200" height="200"></a></td>

  </tr>
  <tr>
    <td align="right" height="17"><font face="verdana,arial,helvetica" size="2">Telephone:</font></td>
    <td align="left" height="17"><font face="verdana,arial,helvetica" size="2">
      (915) 695-9900      </font></td>
  </tr>
  <tr>
    <td align="right" height="17"><font face="verdana,arial,helvetica" size="2">Hospital
      Web site:</font></td>
    <td align="left" height="17"><u><font face="verdana,arial,helvetica" size="2" >
      <A href="http://www.armc.info" style="text-decoration: none; color: #006699;">www.armc.info</A>      </font></u></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Medicare
      Provider Number:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      450558      </font></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Type
      of Control:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      FORPROFIT CORPORATION      </font></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Total
      Staffed Beds:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      187      </font></td>
  </tr>
</table>
Question by:sudama
7 Comments
 
LVL 45

Expert Comment

by:sunnycoder
ID: 9830784
sudama, can you refine your question a bit?

What strings (tags or other fixed characteristics) should be used to identify each piece of data to extract?

Is it based on line number, formatting, or something else that does not change between pages? The script has to identify these regions and extract the relevant portion, but first you need to describe the input format and give the information above.
 

Author Comment

by:sudama
ID: 9832295
Thanks for the comment.  Basically what I would like to do is 'snip' out the relevant portions of each webpage.  I thought maybe it could be done by searching for matching strings before and after each data field.
 
LVL 45

Expert Comment

by:sunnycoder
ID: 9835175
>I thought maybe it could be done by searching for matching strings before and after each data field.
Those strings are exactly what I want you to specify.
 

Author Comment

by:sudama
ID: 9883167
Well, in the first example it would be:

<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">

That is right before the address:

Abilene Regional Medical Ctr<br>6250 Highway 83/84<BR>ABILENE, TX 79606-5299      

Then there's this after the address:

      </font></strong></td>

The same would work to separate out the phone number: just identify what's before and after, and snip out anything in between.
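
The "before and after" idea described here can be sketched in a few lines of Python. This is only an illustration, not something posted in the thread: `snip_between` is a hypothetical helper name, and the two marker strings are copied from the sample page above. It assumes the markers appear exactly as shown on every page.

```python
from typing import Optional


def snip_between(page: str, before: str, after: str) -> Optional[str]:
    """Return the text between the first `before` marker and the next `after` marker."""
    start = page.find(before)
    if start == -1:
        return None  # opening marker not found
    start += len(before)
    end = page.find(after, start)
    if end == -1:
        return None  # closing marker not found
    return page[start:end].strip()


# Fragment copied from the sample page; the value sits on the line after the tag.
sample = '''<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">
      Abilene Regional Medical Ctr<BR>6250 Highway 83/84<BR>ABILENE, TX 79606-5299      </font></strong></td>'''

address = snip_between(
    sample,
    '<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">',
    '</font></strong></td>',
)
print(address)
```

Because this searches the whole page as one string, it does not matter that the value starts on a different line from the opening tag. The result still contains the `<BR>` separators, which would need a further split to get name, street, and city apart.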
 
LVL 45

Expert Comment

by:sunnycoder
ID: 9887948
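<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2"> Abilene Regional Medical Ctr<br>6250 Highway 83/84<BR>ABILENE, TX 79606-5299 </font></strong></td>

snip 1: the text between the opening <font ...> tag and the first <br> (the hospital name)
snip 2: the text between the first <br> and the next tag (the street address)

sed 's:<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">\([^<]*\)<[Bb][Rr]>\([^<]*\).*:\1  \2:'    input_file

Note that sed works line by line, so this only matches when the opening tag and the text are on the same line, as written above.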
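
For the full set of fields, one alternative to a per-field sed expression is to key on the label cell of each table row. The following Python sketch does that with the standard library only. It was not posted in the thread: `parse_listing`, `clean`, `LABELS`, and `FIELDS` are hypothetical names, the sample is a shortened copy of the table from the question, and it assumes every listing page uses the same labels and a "CITY, ST ZIP" last address line. Downloading the pages is left out.

```python
import csv
import html
import re
import sys

# Labels as they appear in the sample table, mapped to the requested column names.
LABELS = {
    "Telephone:": "PHONE NUMBER",
    "Hospital Web site:": "WEBSITE",
    "Medicare Provider Number:": "MEDICARE_NUMBER",
    "Type of Control:": "CONTROL_TYPE",
    "Total Staffed Beds:": "NO_OF_BEDS",
}
FIELDS = ["HOSPITAL_NAME", "HOSPITAL_ADDRESS", "CITY", "STATE", "ZIP", *LABELS.values()]


def clean(fragment):
    """Strip tags, decode entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", fragment)
    return " ".join(html.unescape(text).split())


def parse_listing(page):
    """Turn one listing table into a dict keyed by FIELDS (hypothetical helper)."""
    row = dict.fromkeys(FIELDS, "")
    for tr in re.findall(r"<tr\b.*?</tr>", page, re.S | re.I):
        cells = re.findall(r"<td\b[^>]*>(.*?)</td>", tr, re.S | re.I)
        if len(cells) < 2:
            continue
        label, value = clean(cells[0]), cells[1]
        if label == "Name and address:":
            # Lines are separated by <BR>: name, street line(s), "CITY, ST ZIP".
            parts = [clean(p) for p in re.split(r"<br\s*/?>", value, flags=re.I)]
            row["HOSPITAL_NAME"] = parts[0]
            row["HOSPITAL_ADDRESS"] = ", ".join(parts[1:-1])
            m = re.match(r"(.*),\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$", parts[-1])
            if m:
                row["CITY"], row["STATE"], row["ZIP"] = m.groups()
        elif label in LABELS:
            row[LABELS[label]] = clean(value)
    return row


# Shortened copy of the table from the question (attributes trimmed).
sample = """<table>
  <tr valign="top">
    <td align="right"><font size="2">Name
      and address:</font></td>
    <td align="left"> <strong> <font size="2">
      Abilene Regional Medical Ctr<BR>6250 Highway 83/84<BR>ABILENE, TX 79606-5299      </font></strong></td>
  </tr>
  <tr>
    <td align="right"><font size="2">Telephone:</font></td>
    <td align="left"><font size="2">
      (915) 695-9900      </font></td>
  </tr>
  <tr>
    <td align="right"><font size="2">Total
      Staffed Beds:</font></td>
    <td align="left"><font size="2">
      187      </font></td>
  </tr>
</table>"""

row = parse_listing(sample)
writer = csv.DictWriter(sys.stdout, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
```

Matching on labels rather than on exact attribute strings means a change in `height` or `font` attributes would not break the extraction, though a change in label wording would. Fields whose row is missing are left empty in the CSV.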
 
LVL 5

Accepted Solution

by:
pbhj earned 500 total points
ID: 10096057
This is probably illegal - yeah, 'fraid so. There are intellectual property rights in databases, and copying the information without consent (i.e. circumventing the provided method, which is human viewing via a webpage) is likely illegal. The data (on hospitals) may be available from an alternate source though, e.g. a local health authority.

That said, assuming you want to copy the data and have gained the permission of the site maintainers, just ask for a database dump and use some SQL to produce your table.

You may also like to Google for information on screen scraping (using Perl, I think) to get television listings information; I've seen tutorials out there about this.
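
As a rough idea of what the first step of such a screen scraper looks like (collecting the links from the index page before visiting each one), here is a minimal Python sketch using only the standard library. It is an illustration under assumptions: `LinkCollector` is a hypothetical class name, and the `detail.php3?id=...` links in the sample are invented, since the thread does not show the index page's real link format.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches too.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


# Hypothetical index markup; the real page's link format is not shown in the thread.
index_html = (
    '<a href="detail.php3?id=1">Hospital A</a> '
    '<A HREF="/detail.php3?id=2">Hospital B</A>'
)

collector = LinkCollector("http://www.ahd.com/freelist.php3")
collector.feed(index_html)
print(collector.links)
```

In practice the collected list would need filtering, since an index page also links to navigation and advertising, not only to the listings.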

HTH, keep it legal.
 
LVL 45

Expert Comment

by:sunnycoder
ID: 10101746
sudama, can you please explain why pbhj's answer was accepted?