Solved

Need to extract company name, address, phone numbers, etc. from webpages

Posted on 2003-11-26
Last Modified: 2010-04-22
I need to extract some data from the pages linked on this URL:

http://www.ahd.com/freelist.php3?mname=&mcity=&mstate=TX&mzip=&mphone=&submitted=Submit+Query

I suppose a simple Linux shell script would do the trick, but I'm not an expert programmer.  What I would like it to do is retrieve each URL listed on that page, extract the data from the table shown below, and plop it into a CSV or similar file.  In other words, I want a spreadsheet with the fields:

HOSPITAL_NAME
HOSPITAL_ADDRESS
CITY
STATE
ZIP
PHONE NUMBER
WEBSITE
MEDICARE_NUMBER
CONTROL_TYPE
NO_OF_BEDS

Need to extract the data from this part of the webpage:

<table height="119" width="100%">
  <tr valign="top">
    <td align="right" height="50"><font face="verdana,arial,helvetica" size="2">Name
      and address:</font></td>
    <td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">
      Abilene Regional Medical Ctr<BR>6250 Highway 83/84<BR>ABILENE, TX 79606-5299      </font></strong></td>
      <td align="left" height="138" rowspan="6" width="200"> <a href="EnhancedListings.html"><img border="0" src="enhanced/report_images/Default-banner.gif" width="200" height="200"></a></td>

  </tr>
  <tr>
    <td align="right" height="17"><font face="verdana,arial,helvetica" size="2">Telephone:</font></td>
    <td align="left" height="17"><font face="verdana,arial,helvetica" size="2">
      (915) 695-9900      </font></td>
  </tr>
  <tr>
    <td align="right" height="17"><font face="verdana,arial,helvetica" size="2">Hospital
      Web site:</font></td>
    <td align="left" height="17"><u><font face="verdana,arial,helvetica" size="2" >
      <A href="http://www.armc.info" style="text-decoration: none; color: #006699;">www.armc.info</A>      </font></u></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Medicare
      Provider Number:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      450558      </font></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Type
      of Control:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      FORPROFIT CORPORATION      </font></td>
  </tr>
  <tr>
    <td align="right" height="18"><font face="verdana,arial,helvetica" size="2">Total
      Staffed Beds:</font></td>
    <td align="left" height="18"><font face="verdana,arial,helvetica" size="2">
      187      </font></td>
  </tr>
</table>
Question by:sudama
 
LVL 45

Expert Comment

by:sunnycoder
ID: 9830784
sudama, can you refine your question a bit?

What strings (tags or other fixed characteristics) should be used to identify each field?

Is it based on line numbers, formatting, or something else that does not change between pages? The script has to identify these regions and extract the relevant portions, but first you need to give the input format and the information above.
 

Author Comment

by:sudama
ID: 9832295
Thanks for the comment.  Basically what I would like to do is 'snip' out the relevant portions of each webpage.  I thought maybe it could be done by searching for matching strings before and after each data field.
 
LVL 45

Expert Comment

by:sunnycoder
ID: 9835175
>I thought maybe it could be done by searching for matching strings before and after each data field.
these strings are what I want you to specify

 

Author Comment

by:sudama
ID: 9883167
Well, in the first example it would be:

<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">

That is right before the address:

Abilene Regional Medical Ctr<br>6250 Highway 83/84<BR>ABILENE, TX 79606-5299      

Then there's this after the address:

      </font></strong></td>

The same approach would separate out the phone number: identify what comes before and after it, and snip out anything in between.
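The "before and after" idea described here can be sketched in a few lines of Python. This is only an illustration: the helper name `snip_between` is made up for this example, and the markers are copied from the page fragment quoted in the question.

```python
from typing import Optional


def snip_between(text: str, before: str, after: str) -> Optional[str]:
    """Return the text between the first `before` marker and the next `after` marker.

    Returns None when either marker is missing.
    """
    start = text.find(before)
    if start == -1:
        return None
    start += len(before)
    end = text.find(after, start)
    if end == -1:
        return None
    # The page pads values with spaces and newlines, so trim them.
    return text[start:end].strip()


# Telephone row copied from the page fragment in the question.
page = """<td align="left" height="17"><font face="verdana,arial,helvetica" size="2">
      (915) 695-9900      </font></td>"""

phone = snip_between(page, '<font face="verdana,arial,helvetica" size="2">', "</font>")
print(phone)  # prints: (915) 695-9900
```

Note that the same font tag precedes every value on the full page, so each "before" marker needs something unique to the field, such as the height="50" cell for the address or the label text "Telephone:" for the phone number.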
 
LVL 45

Expert Comment

by:sunnycoder
ID: 9887948
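<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2"> Abilene Regional Medical Ctr<br>6250 Highway 83/84<BR>ABILENE, TX 79606-5299 </font></strong></td>

snip 1: the text between the opening <font ...> tag and the first <br> (the hospital name)
snip 2: the text between the first <br> and the next tag (the street address)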

sed 's:<td align="left" height="50"> <strong> <font face="verdana,arial,helvetica" size="2">\([^<]*\)<[Bb][Rr]>\([^<]*\).*:\1  \2:'    input_file
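One caveat with a line-based tool: in the page as posted, the opening font tag and the hospital name sit on separate lines, so a pattern that expects them on one line will not match without first joining lines. As an alternative, here is a rough Python sketch that reads the whole table at once, looks up each value by the label in the cell to its left, and writes one CSV row. The function names and the trimmed sample markup are assumptions for illustration; only the labels and values come from the fragment in the question.

```python
import csv
import io
import re

FIELDS = ["HOSPITAL_NAME", "HOSPITAL_ADDRESS", "CITY", "STATE", "ZIP",
          "PHONE NUMBER", "WEBSITE", "MEDICARE_NUMBER", "CONTROL_TYPE", "NO_OF_BEDS"]

# Label text on the page -> output column, for cells that hold a single value.
SIMPLE_LABELS = {
    "telephone": "PHONE NUMBER",
    "medicare provider number": "MEDICARE_NUMBER",
    "type of control": "CONTROL_TYPE",
    "total staffed beds": "NO_OF_BEDS",
}


def cell_text(fragment):
    """Drop tags and collapse runs of whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", fragment).split())


def parse_hospital(html):
    """Build one record from a detail table laid out as label cell, value cell."""
    record = dict.fromkeys(FIELDS, "")
    for row in re.findall(r"<tr\b.*?</tr>", html, flags=re.I | re.S):
        cells = re.findall(r"<td\b[^>]*>(.*?)</td>", row, flags=re.I | re.S)
        if len(cells) < 2:
            continue
        label = cell_text(cells[0]).rstrip(":").lower()
        value = cells[1]
        if label == "name and address":
            # Name, street and "CITY, ST ZIP" are separated by <br> tags.
            lines = [cell_text(p) for p in re.split(r"<br\s*/?>", value, flags=re.I)]
            lines = [line for line in lines if line]
            if lines:
                record["HOSPITAL_NAME"] = lines[0]
                record["HOSPITAL_ADDRESS"] = ", ".join(lines[1:-1])
            if len(lines) >= 2:
                m = re.match(r"(.*),\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$", lines[-1])
                if m:
                    record["CITY"], record["STATE"], record["ZIP"] = m.groups()
        elif label == "hospital web site":
            link = re.search(r'href="([^"]+)"', value, flags=re.I)
            record["WEBSITE"] = link.group(1) if link else cell_text(value)
        elif label in SIMPLE_LABELS:
            record[SIMPLE_LABELS[label]] = cell_text(value)
    return record


# Trimmed copy of the table from the question (attributes shortened).
SAMPLE = """<table>
  <tr valign="top">
    <td align="right"><font size="2">Name
      and address:</font></td>
    <td align="left"> <strong> <font size="2">
      Abilene Regional Medical Ctr<BR>6250 Highway 83/84<BR>ABILENE, TX 79606-5299      </font></strong></td>
  </tr>
  <tr>
    <td align="right"><font size="2">Telephone:</font></td>
    <td align="left"><font size="2">
      (915) 695-9900      </font></td>
  </tr>
  <tr>
    <td align="right"><font size="2">Hospital
      Web site:</font></td>
    <td align="left"><u><font size="2">
      <A href="http://www.armc.info">www.armc.info</A>      </font></u></td>
  </tr>
  <tr>
    <td align="right"><font size="2">Total
      Staffed Beds:</font></td>
    <td align="left"><font size="2">
      187      </font></td>
  </tr>
</table>"""

record = parse_hospital(SAMPLE)
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(record)
print(out.getvalue())
```

Regular expressions are enough for a table this regular, but they would break on nested tables or unclosed cells; a real HTML parser would be the safer choice if the pages vary.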
 
LVL 5

Accepted Solution

by:
pbhj earned 500 total points
ID: 10096057
This is probably illegal, I'm afraid. There are intellectual property rights in databases, and copying the information without consent (i.e. circumventing the provided method, which is human viewing via a webpage) is likely illegal. The data (on hospitals) may be available from an alternate source though, e.g. a local health authority.

That said, assuming you want to copy the data and have gained the permission of the site maintainers, just ask for a database dump and use some SQL to produce your table.

You may also like to search for tutorials on screen scraping (using Perl, I think) for getting television listings information ... I've seen some out there.
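For the first step of any such scraper, collecting the URLs listed on the index page, a minimal Python sketch using only the standard library might look like this. The listing markup and the `detail.php3?id=1` link are hypothetical, since the thread never shows the index page's HTML; only the base URL comes from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical listing markup; the real index page may look different.
listing = '<a href="detail.php3?id=1">Hospital A</a> <A HREF="/about.html">About</A>'
for url in collect_links(listing, "http://www.ahd.com/freelist.php3"):
    print(url)
```

In practice the collected list would need filtering so that only the per-hospital detail links are followed, for example by matching on a path pattern once the real link format is known.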

HTH, keep it legal.
 
LVL 45

Expert Comment

by:sunnycoder
ID: 10101746
sudama, can you please explain why pbhj's answer was accepted?
