Solved

Move data from HTML to a pipe-delimited file

Posted on 2009-07-09
Last Modified: 2012-05-07
Hi all,

I have a bunch of single HTML files that I need to look through, extract some data from, and put that data into a pipe-delimited file. I'm looking for any suggestions, whether using Perl, DOS batch scripts, etc.

The HTML for each page looks something like the sample below (ugly, yes, I know).

The data is not in individual cells and rows, which makes it a bit more difficult. In the code example I would want to pull out all the store listings and put them into a pipe-delimited file like so:

Store|Address|City|State|Zip
Joes Bakery|In The Albertson's Center|San Diego|CA|92101
Jim's Cafe|100 Main St.|Wheatland,|CA|95692

or, if this is easier:
Store|Address
Joes Bakery|In The Albertson's Center San Diego, CA 92101
Jim's Cafe|100 Main St. Wheatland, CA 95692
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>
<table border="0" cellpadding="0" cellspacing="0" width="100%" height="90%" align="center">
  <tr>
    <td align="center">
      <table border="0" cellpadding="0" cellspacing="0" width="590" align="center">
        <tr>
          <td>
            <p />
            <font size=2><a href="http://www.site.com"><b>Joes Bakery</b></a><br />
            Walmart Center<br />
            San Diego, CA 92101<br />
            <p />
            <font size=2><a href="http://www.site.com"><b>Jim's Cafe</b></a><br />
            100 Main St.<br />
            Wheatland, CA 95692<br />
            <p />
            </font><br clear="all" />
            <hr size="1" width="100%" color="#000000" noshade />
            <p /></td>
        </tr>
      </table></td>
  </tr>
</table>
</body>
</html>

Question by:binovpd

Accepted Solution

by:
lwadwell earned 125 total points
ID: 24818892
Hi binovpd,

I would choose Perl. I have attached a small sample script below that uses the Perl module HTML::TokeParser to scan the HTML more reliably.


lwadwell
use strict;
use HTML::TokeParser; 
## READ the file into a variable ...
## HTML::TokeParser can easily read from a file so this 
## can be removed ... this was done purely for testing.
my $html;
while ( my $l = <DATA> ) {
	$html .= $l;
} 
## Parse the HTML ... replace "\$html" with the filename
my $root = HTML::TokeParser->new( \$html );
$root->empty_element_tags(1);    # configure its behaviour 
## This is all a bit ugly, determining the tags to anchor
## the retrieval off was difficult given the poor html structure
while ( my $t    = $root->get_tag("b") ) {
	my $name  = $root->get_text("/b");
	$name     =~ s/^\s+|\s+$//g;
	my $addr  = $root->get_text("a");
	$addr     =~ s/^\s+|\s+$/|/gm;
	$addr     =~ s/^\||\|$//g;
	$addr     =~ s/\|(?:\s*\|)+/|/g;    # collapse separators, including ones split by a line break
	my ($f1, $f2, $f3, $f4) = ($addr, "", "", "");
	if ( $addr =~ /^(.+)\|(.+),\s+(.+)\s+(\d+)$/ ) {
		$f1 = $1;
		$f2 = $2;
		$f3 = $3;
		$f4 = $4;
	} 
	print "$name|$f1|$f2|$f3|$f4\n";
} 
__DATA__
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>
<table border="0" cellpadding="0" cellspacing="0" width="100%" height="90%" align="center">
  <tr>
    <td align="center">
      <table border="0" cellpadding="0" cellspacing="0" width="590" align="center">
        <tr>
          <td>
            <p />
            <font size=2><a href="http://www.site.com"><b>Joes Bakery</b></a> 
            Walmart Center 
            San Diego, CA 92101 
            <p />
            <font size=2><a href="http://www.site.com"><b>Jim's Cafe</b></a> 
            100 Main St. 
            Wheatland, CA 95692 
            <p />
            </font><br clear="all" />
            <hr size="1" width="100%" color="#000000" noshade />
            <p /></td>
        </tr>
      </table></td>
  </tr>
</table>
</body>
</html>
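
The address handling in the script above can also be looked at in isolation. The following is a minimal sketch, not lwadwell's exact code: `split_address` is a hypothetical helper that splits the text on line breaks instead of using the multi-line substitutions, then applies the same final pattern. It assumes the last line of each listing has the form `City, ST 12345` and uses only core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: turn the block of
# text that follows a store name into (address, city, state, zip).
sub split_address {
    my ($text) = @_;

    # Trim each line and keep only the non-empty ones.
    my @lines;
    for my $line ( split /\n/, $text ) {
        $line =~ s/^\s+|\s+$//g;
        push @lines, $line if length $line;
    }
    my $addr = join '|', @lines;

    # Same final pattern as the script: "street|City, ST 12345".
    if ( $addr =~ /^(.+)\|(.+),\s+(.+)\s+(\d+)$/ ) {
        return ( $1, $2, $3, $4 );
    }

    # Anything that does not fit is returned whole in the first field.
    return ( $addr, '', '', '' );
}

my @fields = split_address("\n  100 Main St.\n  Wheatland, CA 95692\n\n");
print join( '|', "Jim's Cafe", @fields ), "\n";
```

For the sample listing this prints `Jim's Cafe|100 Main St.|Wheatland|CA|95692`. A listing that does not end in a city, state and zip line falls through to the last `return`, so nothing is silently dropped.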


Author Comment

by:binovpd
ID: 24819434
Thank you lwadwell, that did the trick. I appreciate the sample, although I'll probably load the files directly since I have to look through a directory of these HTML files.
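
Looping over a directory of these pages could look roughly like the sketch below. It is an illustration under assumptions, not code from the thread: `extract_stores` and the driver are hypothetical names, the directory is taken from the command line, and the address text is split on line breaks rather than with the multi-line substitutions from the accepted answer. It needs HTML::TokeParser installed.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical wrapper around the accepted answer's approach: parse one HTML
# file and return one pipe-delimited row per store.
sub extract_stores {
    my ($file) = @_;
    my $root = HTML::TokeParser->new($file)
        or die "Cannot open $file: $!";
    $root->empty_element_tags(1);

    my @rows;
    while ( $root->get_tag("b") ) {
        my $name = $root->get_text("/b");
        $name =~ s/^\s+|\s+$//g;

        # Text up to the next <a>, one trimmed entry per non-empty line.
        my @lines;
        for my $line ( split /\n/, $root->get_text("a") ) {
            $line =~ s/^\s+|\s+$//g;
            push @lines, $line if length $line;
        }
        my $addr = join '|', @lines;

        my @f = ( $addr, '', '', '' );
        if ( $addr =~ /^(.+)\|(.+),\s+(.+)\s+(\d+)$/ ) {
            @f = ( $1, $2, $3, $4 );
        }
        push @rows, join '|', $name, @f;
    }
    return @rows;
}

# Hypothetical driver: every .htm/.html file in the directory named on the
# command line; rows go to STDOUT so they can be redirected to a file.
if (@ARGV) {
    my $dir = $ARGV[0];
    opendir( my $dh, $dir ) or die "Cannot read $dir: $!";
    my @files = sort grep { /\.html?$/i } readdir $dh;
    closedir $dh;

    print "Store|Address|City|State|Zip\n";
    for my $file (@files) {
        print "$_\n" for extract_stores("$dir/$file");
    }
}
```

Saved under a name of your choosing (say `extract_dir.pl`), it would be run as `perl extract_dir.pl pages > stores.dat`, where `pages` is the directory holding the HTML files.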

Author Closing Comment

by:binovpd
ID: 31601856
Did the trick, thanks lwadwell.

Author Comment

by:binovpd
ID: 24837855
Here are the slight changes to output and append the results to a file (test2.dat).

use strict;
use warnings;
use HTML::TokeParser;
## Parse the HTML straight from the file
my $root = HTML::TokeParser->new( "test2.html" )
	or die "Cannot open test2.html: $!";
$root->empty_element_tags(1);    # configure its behaviour
## Open the output file once, in append mode, rather than once per record
open( my $out, '>>', 'test2.dat' ) or die "Cannot open test2.dat: $!";
## This is all a bit ugly, determining the tags to anchor
## the retrieval off was difficult given the poor html structure
while ( my $t    = $root->get_tag("b") ) {
	my $name  = $root->get_text("/b");
	$name     =~ s/^\s+|\s+$//g;
	my $addr  = $root->get_text("a");
	$addr     =~ s/\(.*?\)//g;           # drop parenthesised notes
	$addr     =~ s/[()]//g;              # drop any unbalanced parentheses
	$addr     =~ s/^\s+|\s+$/|/gm;       # turn line edges into separators
	$addr     =~ s/^\||\|$//g;           # no separator at either end
	$addr     =~ s/\|(?:\s*\|)+/|/g;     # collapse runs of separators
	my ($f1, $f2, $f3, $f4) = ($addr, "", "", "");
	if ( $addr =~ /^(.+)\|(.+),\s+(.+)\s+(\d+)$/ ) {
		($f1, $f2, $f3, $f4) = ($1, $2, $3, $4);
	}
	print "$name|$f1|$f2|$f3|$f4\n";          # echo to the screen
	print {$out} "$name|$f1|$f2|$f3|$f4\n";   # append to test2.dat
}
close($out) or die "Cannot close test2.dat: $!";
