Solved

move data from html to a pipe delimited file

Posted on 2009-07-09
4
224 Views
Last Modified: 2012-05-07
Hi all,

I have a bunch single html files that I need to look through an extract some data and put it into a pipe delimited file. Im looking for any suggestion whether its using perl, dos batch scripts etc.

The html for each page looks something like this (ugly yes I know)


The data is not in individual cells and rows which makes it a bit more difficult. In the code example I would want to pull out all the store listings and put it into a pipe delimited file like so:

Store|Address|City|State|Zip
Joes Bakery|In The Albertson's Center|San Diego|CA|92101
Jim's Cafe|100 Main St.|Wheatland,|CA|95692

or if this is easier
Store|Address
Joes Bakery|In The Albertson's Center San Diego, CA 92101
Jim's Cafe|100 Main St. Wheatland, CA 95692
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>
<table border="0" cellpadding="0" cellspacing="0" width="100%" height="90%" align="center">
  <tr>
    <td align="center">
      <table border="0" cellpadding="0" cellspacing="0" width="590" align="center">
        <tr>
          <td>
            <p />
            <font size=2><a href="http://www.site.com"><b>Joes Bakery</b></a><br />
            Walmart Center<br />
            San Diego, CA 92101<br />
            <p />
            <font size=2><a href="http://www.site.com"><b>Jim's Cafe</b></a><br />
            100 Main St.<br />
            Wheatland, CA 95692<br />
            <p />
            </font><br clear="all" />
            <hr size="1" width="100%" color="#000000" noshade />
            <p /></td>
        </tr>
      </table></td>
  </tr>
</table>
</body>
</html>

Open in new window

0
Comment
Question by:binovpd
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 3
4 Comments
 
LVL 25

Accepted Solution

by:
lwadwell earned 125 total points
ID: 24818892
Hi binovpd,

I would choose perl.  I have attached a small sample script below that uses the perl package HTML::TokeParser to better scan the HTML.


lwadwell
use strict;
use HTML::TokeParser; 
## READ the file into a variable ...
## HTML::TokeParser can easily read from a file so this 
## can be removed ... this was done purely for testing.
my $html;
while ( my $l = <DATA> ) {
	$html .= $l;
} 
## Parse the HTML ... replace "\$html" with the filename
my $root = HTML::TokeParser->new( \$html );
$root->empty_element_tags(1);    # configure its behaviour 
## This is all a bit ugly, determining the tags to anchor
## the retrieval off was difficult given the poor html structure
while ( my $t    = $root->get_tag("b") ) {
	my $name  = $root->get_text("/b");
	$name     =~ s/^\s+|\s+$//g;
	my $addr  = $root->get_text("a");
	$addr     =~ s/^\s+|\s+$/|/gm;
	$addr     =~ s/^\||\|$//g;
	$addr     =~ s/\|\|/|/g;
	my ($f1, $f2, $f3, $f4) = ($addr, "", "", "");
	if ( $addr =~ /^(.+)\|(.+),\s+(.+)\s+(\d+)$/ ) {
		$f1 = $1;
		$f2 = $2;
		$f3 = $3;
		$f4 = $4;
	} 
	print "$name|$f1|$f2|$f3|$f4\n";
} 
__DATA__
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>
<table border="0" cellpadding="0" cellspacing="0" width="100%" height="90%" align="center">
  <tr>
    <td align="center">
      <table border="0" cellpadding="0" cellspacing="0" width="590" align="center">
        <tr>
          <td>
            <p />
            <font size=2><a href="http://www.site.com"><b>Joes Bakery</b></a> 
            Walmart Center 
            San Diego, CA 92101 
            <p />
            <font size=2><a href="http://www.site.com"><b>Jim's Cafe</b></a> 
            100 Main St. 
            Wheatland, CA 95692 
            <p />
            </font><br clear="all" />
            <hr size="1" width="100%" color="#000000" noshade />
            <p /></td>
        </tr>
      </table></td>
  </tr>
</table>
</body>
</html>

Open in new window

0
 

Author Comment

by:binovpd
ID: 24819434
Thank you lwadwell that did the trick. I appreciate the sample although Ill probably load the files since I have to look through a directory of these html files.
0
 

Author Closing Comment

by:binovpd
ID: 31601856
Did the trick thanks lwadwell.
0
 

Author Comment

by:binovpd
ID: 24837855
Here is the slight changes to output and append results to a file (test2.dat).

use strict;
use HTML::TokeParser;
## Parse the HTML ... replace "\$html" with the filename
my $root = HTML::TokeParser->new( "test2.html" );
$root->empty_element_tags(1);    # configure its behaviour
## This is all a bit ugly, determining the tags to anchor
## the retrieval off was difficult given the poor html structure
while ( my $t    = $root->get_tag("b") ) {
	my $name  = $root->get_text("/b");
	$name     =~ s/^\s+|\s+$//g;
	my $addr  = $root->get_text("a");
	$addr     =~ s/\(.*?\)//g;
	$addr     =~ s/\(//g;
	$addr     =~ s/\)//g;
	$addr     =~ s/^\s+|\s+$/|/gm;
	$addr     =~ s/^\||\|$//g;
	$addr     =~ s/\|\|/|/g;
	my ($f1, $f2, $f3, $f4) = ($addr, "", "", "");
	if ( $addr =~ /^(.+)\|(.+),\s+(.+)\s+(\d+)$/ ) {
		$f1 = $1;
		$f2 = $2;
		$f3 = $3;
		$f4 = $4;
	}
	$f1 =~ s/\(</b>\//;
	$f1 =~ s/\)</b>\//;
	print "$name|$f1|$f2|$f3|$f4\n";
 
  open (MYFILE, '>>test2.dat');
  print MYFILE "$name|$f1|$f2|$f3|$f4\n";
  close (MYFILE);
 
}

Open in new window

0

Featured Post

The Ultimate Checklist to Optimize Your Website

Websites are getting bigger and complicated by the day. Video, images, custom fonts are all great for showcasing your product/service. But the price to pay in terms of reduced page load times and ultimately, decreased sales, can lead to some difficult decisions about what to cut.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

It is becoming increasingly popular to have a front-page slider on a web site. Nearly every TV website,  magazine or online news has one on their site, and even some e-commerce sites have one. Today you can use sliders with Joomla, WordPress or …
The Windows functions GetTickCount and timeGetTime retrieve the number of milliseconds since the system was started. However, the value is stored in a DWORD, which means that it wraps around to zero every 49.7 days. This article shows how to solve t…
The viewer will learn how to dynamically set the form action using jQuery.
This tutorial will teach you the core code needed to finalize the addition of a watermark to your image. The viewer will use a small PHP class to learn and create a watermark.

695 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question