
03.10.2008 at 07:06AM PDT, ID: 23228394

How to extract data (screen scrape) from webpages?

Tags: PHP MYSQL PERL


Hello,

I have a bunch of webpages (offline and online) from ancient history that I wish to extract data from and automatically insert into a MySQL database.

For example...

CONTENTS ON WEBPAGE:

<b>Name:</b> LARGE TRUCK INC.
<b>Address:</b> 123 White Horse Road, White Horse, CA 12345, USA
<b>Telephone:</b> 123 1234567
<b>Website:</b> http://www.largetruck.com

<b>Name:</b> SMALL TRUCK INC.
<b>Address:</b> 321 Black Rabbit Road, Black Rabbit, TX 54321, USA
<b>Telephone:</b> 999 7654321
<b>Website:</b> http://www.smalltruck.com

...

MYSQL TABLE FIELDS:
co_name
co_address1
co_address2
co_city
co_state
co_postcode
co_country
co_telephone
co_website


How do I extract the data and insert it into a MySQL database?

Thanks in advance.
Question Stats
Zone: Web Development
Question Asked By: gingera
Solution Provided By: adrpo
Participating Experts: 2
Solution Grade: A
Views: 185
03.10.2008 at 10:17AM PDT, ID: 21088257
You should use cURL:

http://uk3.php.net/curl

Store the data in an array or in variables, then insert it into the database as usual.
Assisted Solution
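
A minimal sketch of the "store the data in an array" step, written in Perl to match the script later in this thread rather than PHP. It assumes the page source is already in a string (fetched with cURL or LWP, or read from disk) and that every record follows the `<b>Label:</b> value` layout from the question. The helper name `parse_records` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn "<b>Label:</b> value" lines into a list of
# hash references, one per company. A new record starts at each "Name:".
sub parse_records {
    my ($html) = @_;
    my @records;
    while ($html =~ m{<b>\s*([^<:]+):\s*</b>\s*([^<\r\n]*)}gi) {
        my ($label, $value) = (lc $1, $2);
        $value =~ s/\s+$//;                      # trim trailing whitespace
        push @records, {} if $label eq 'name';   # "Name:" opens a new record
        next unless @records;                    # skip fields before the first name
        $records[-1]{$label} = $value;
    }
    return @records;
}

# Sample input copied from the question.
my $html = <<'HTML';
<b>Name:</b> LARGE TRUCK INC.
<b>Address:</b> 123 White Horse Road, White Horse, CA 12345, USA
<b>Telephone:</b> 123 1234567
<b>Website:</b> http://www.largetruck.com

<b>Name:</b> SMALL TRUCK INC.
<b>Address:</b> 321 Black Rabbit Road, Black Rabbit, TX 54321, USA
<b>Telephone:</b> 999 7654321
<b>Website:</b> http://www.smalltruck.com
HTML

my @records = parse_records($html);
printf "%s | %s\n", $_->{name}, $_->{telephone} for @records;
```

Each hash in `@records` can then be passed to a prepared INSERT statement. A regular expression like this only suits pages as regular as the sample; for messier markup an HTML parser is the safer choice.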
 
03.10.2008 at 02:25PM PDT, ID: 21090607

Rank: Master


Hi,

Here it goes. Check the Perl script below.

I hope you know a bit about Perl regular expressions, as you
might need to change the way the Address is split into pieces.

Cheers,
za-k/
#!/usr/bin/perl
use DBI;
 
# I HAD ONLY PostgreSQL, so you need to test this!
my $conn = DBI->connect("dbi:Pg:dbname=assets;host=localhost;port=5432", "postgres", "YOUWISH");
# my $conn = DBI->connect("dbi:mysql:dbname=YOUR_DATABASE;host=localhost", "YOUR_USER", "YOUR_PASS");
 
# delete everything from the history table
my $queryDelete = $conn->prepare("delete from history;");
$queryDelete->execute();
 
 
# print HTML from a URL
use LWP;
use HTML::TreeBuilder;
 
# the URL that we need to fetch and parse
my $url = "file:C:/bin/cygwin/home/adrpo/webpage-html/webpage.html";
# my $url = "http://your-host.org/file.html";
 
my $browser = LWP::UserAgent->new;
 
$response = $browser->get($url); 
 die "Can't get $url -- ", $response->status_line
   unless $response->is_success;
 
 die "Was expecting HTML, not ", $response->content_type
   unless $response->content_type eq 'text/html';
     # or whatever content-type you're equipped to deal with
 
my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse($response->content);
# dump the tree
# $tree->dump;
 
#get the body
@bodies = $tree->look_down('_tag' => 'body');
 
my ($co_name, $co_address, $co_telephone, $co_website, 
    $co_address1, $co_address2, $co_city, $co_state, 
    $co_postcode, $co_country);
 
foreach $body (@bodies) {
    my @tags = $body->content_list();
    my $i = 0;
    $size = @tags;
    while($i < $size) {
       # get the Name: tag
       if (ref($tags[$i]) eq "HTML::Element" and $tags[$i]->as_trimmed_text eq 'Name:')
       {	
	   $i++;
	   $co_name = $tags[$i]; 
       }
       $i++; # advance to Address: tag
       if (ref($tags[$i]) eq "HTML::Element" and $tags[$i]->as_trimmed_text eq 'Address:')
       {	
	   $i++;
	   $co_address = $tags[$i];
	   if ($co_address =~ /(.+)\,\s(.+)\,\s([a-zA-Z]*)\s([0-9]*)\,\s(.+)/)      
	   {     
	       $co_address1 = $1;
	       $co_address2 = "";
	       $co_city     = $2;
	       $co_state    = $3;
	       $co_postcode = $4;
	       $co_country  = $5;
	   }
	   else
	   {
	       print("Address: $co_address DID NOT MATCH!\n");
	   }
       }
       $i++; # advance to Telephone: tag
       if (ref($tags[$i]) eq "HTML::Element" and $tags[$i]->as_trimmed_text eq 'Telephone:')
       {	
	   $i++;
	   $co_telephone = $tags[$i];
       }
       $i++; # advance to Website: tag
       if (ref($tags[$i]) eq "HTML::Element" and $tags[$i]->as_trimmed_text eq 'Website:')
       {	
	   $i++;
	   $co_website = $tags[$i];
       }
       print "Inserting: ";
       print "\t" . $co_name . "\n";
       print "\t\t" . $co_address1 . " " . $co_address2 . " " . $co_postcode . ", " . $co_city . " ";
       print $co_state . " " .  $co_country . "\n";
       print "\t\t" . $co_telephone . "\n";
       print "\t\t" . $co_website . "\n";
       my $queryInsert = $conn->prepare("insert into history(co_name, co_address1, co_address2, co_city, co_state, co_postcode, co_country, co_telephone, co_website) values(?, ?, ?, ?, ?, ?, ?, ?, ?)");
       $queryInsert->execute($co_name, $co_address1, $co_address2, $co_city, $co_state, $co_postcode, $co_country, $co_telephone, $co_website);
       $i++;
    }
}
 
 
# check the inserted data
my $query = $conn->prepare("select * from history");
$query->execute();
 
while (@data = $query->fetchrow_array()) {
    print "Name: $data[0]\n";
    print "\tAddress 1: $data[1]\n";
    print "\tAddress 2: $data[2]\n";
    print "\tCity:      $data[3]\n";
    print "\tState:     $data[4]\n";
    print "\tPostcode:  $data[5]\n";
    print "\tCountry:   $data[6]\n";
    print "\tPhone:     $data[7]\n";
    print "\tWebsite:   $data[8]\n";
}
 
print ("******* Done with the processing! ********\n");
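
The address regular expression in the script above is the part most likely to need adjusting, so here it is in isolation as a small sketch. The sub name `split_address` is hypothetical, and the pattern assumes the `street, city, ST 12345, country` layout from the question. It also trims surrounding whitespace first, since text taken from the page may carry a leading space.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split "street, city, ST 12345, country" into the
# table's address columns. Returns an empty list when nothing matches.
sub split_address {
    my ($address) = @_;
    $address =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    if ($address =~ /^(.+),\s*(.+),\s*([A-Za-z]+)\s+([0-9]+),\s*(.+)$/) {
        return (
            co_address1 => $1,
            co_address2 => '',     # the sample data has no second address line
            co_city     => $2,
            co_state    => $3,
            co_postcode => $4,
            co_country  => $5,
        );
    }
    return;
}

my %parts = split_address(' 123 White Horse Road, White Horse, CA 12345, USA');
print "$_ = $parts{$_}\n" for sort keys %parts;
```

Addresses that do not fit this shape (for example, no state code or a non-numeric postcode) come back as an empty list, so those rows can be logged for manual review instead of being inserted with blank fields.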
 
Here is a trace of the execution.
 
Accepted Solution
 
03.12.2008 at 02:51PM PDT, ID: 21111169
Hi adrpo,

Thanks very much. I am a total idiot in Perl, so I am trying to figure out how it works. I will let you know when I am successful or get totally stuck.
 
 