Solved

Move data from HTML to a pipe-delimited file

Posted on 2009-07-09
Last Modified: 2012-05-07
Hi all,

I have a bunch of individual HTML files that I need to look through, extract some data from, and write into a pipe-delimited file. I'm looking for any suggestions, whether using Perl, DOS batch scripts, etc.

The HTML for each page looks something like the sample at the end of this post (ugly, yes, I know).

The data is not in individual cells and rows, which makes it a bit more difficult. From the sample, I want to pull out all the store listings and put them into a pipe-delimited file like so:

Store|Address|City|State|Zip
Joes Bakery|Walmart Center|San Diego|CA|92101
Jim's Cafe|100 Main St.|Wheatland|CA|95692

or, if this is easier:
Store|Address
Joes Bakery|Walmart Center San Diego, CA 92101
Jim's Cafe|100 Main St. Wheatland, CA 95692

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>
<table border="0" cellpadding="0" cellspacing="0" width="100%" height="90%" align="center">
  <tr>
    <td align="center">
      <table border="0" cellpadding="0" cellspacing="0" width="590" align="center">
        <tr>
          <td>
            <p />
            <font size=2><a href="http://www.site.com"><b>Joes Bakery</b></a><br />
            Walmart Center<br />
            San Diego, CA 92101<br />
            <p />
            <font size=2><a href="http://www.site.com"><b>Jim's Cafe</b></a><br />
            100 Main St.<br />
            Wheatland, CA 95692<br />
            <p />
            </font><br clear="all" />
            <hr size="1" width="100%" color="#000000" noshade />
            <p /></td>
        </tr>
      </table></td>
  </tr>
</table>
</body>
</html>

Question by:binovpd
Accepted Solution

by:lwadwell
Hi binovpd,

I would choose Perl. I have attached a small sample script below that uses the Perl module HTML::TokeParser to scan the HTML more reliably.


lwadwell
use strict;
use HTML::TokeParser; 
## READ the file into a variable ...
## HTML::TokeParser can easily read from a file so this 
## can be removed ... this was done purely for testing.
my $html;
while ( my $l = <DATA> ) {
	$html .= $l;
} 
## Parse the HTML ... replace "\$html" with the filename
my $root = HTML::TokeParser->new( \$html );
$root->empty_element_tags(1);    # configure its behaviour 
## This is all a bit ugly, determining the tags to anchor
## the retrieval off was difficult given the poor html structure
while ( my $t    = $root->get_tag("b") ) {
	my $name  = $root->get_text("/b");
	$name     =~ s/^\s+|\s+$//g;
	my $addr  = $root->get_text("a");
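	## Trim the block, then turn each line break (with any surrounding
	## whitespace or blank lines) into a single "|" separator, so the
	## result never keeps a stray newline
	$addr     =~ s/^\s+|\s+$//g;
	$addr     =~ s/\s*\n\s*/|/g;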
	my ($f1, $f2, $f3, $f4) = ($addr, "", "", "");
	if ( $addr =~ /^(.+)\|(.+),\s+(.+)\s+(\d+)$/ ) {
		$f1 = $1;
		$f2 = $2;
		$f3 = $3;
		$f4 = $4;
	} 
	print "$name|$f1|$f2|$f3|$f4\n";
} 
__DATA__
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>
<table border="0" cellpadding="0" cellspacing="0" width="100%" height="90%" align="center">
  <tr>
    <td align="center">
      <table border="0" cellpadding="0" cellspacing="0" width="590" align="center">
        <tr>
          <td>
            <p />
            <font size=2><a href="http://www.site.com"><b>Joes Bakery</b></a><br />
            Walmart Center<br />
            San Diego, CA 92101<br />
            <p />
            <font size=2><a href="http://www.site.com"><b>Jim's Cafe</b></a><br />
            100 Main St.<br />
            Wheatland, CA 95692<br />
            <p />
            </font><br clear="all" />
            <hr size="1" width="100%" color="#000000" noshade />
            <p /></td>
        </tr>
      </table></td>
  </tr>
</table>
</body>
</html>
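
The address-splitting step of the script above can be tried on its own, without HTML::TokeParser. The following is a minimal sketch rather than part of the posted script: the helper name `split_address` is hypothetical, and it assumes the last non-empty line of each listing has the form "City, ST 12345". Anything that does not fit that shape is returned whole in the address field.

```perl
use strict;
use warnings;

# Turn the free text that follows a store name into four fields.
# Input:  the raw text between one store link and the next (may span lines).
# Output: (address, city, state, zip). If the last line does not look like
#         "City, ST 12345", the whole text is returned as the address.
sub split_address {
    my ($text) = @_;

    # Keep only non-empty lines, trimmed of surrounding whitespace.
    my @lines = grep { length }
                map  { my $l = $_; $l =~ s/^\s+|\s+$//g; $l }
                split /\n/, $text;
    return ( '', '', '', '' ) unless @lines;

    # Assumption: the last line holds "City, ST ZIP"; everything before it
    # is the street address (joined with a space if there are several lines).
    if ( $lines[-1] =~ /^(.+?),\s*([A-Za-z]{2})\s+(\d{5}(?:-\d{4})?)$/ ) {
        my ( $city, $state, $zip ) = ( $1, $2, $3 );
        my $street = join ' ', @lines[ 0 .. $#lines - 1 ];
        return ( $street, $city, $state, $zip );
    }
    return ( join( ' ', @lines ), '', '', '' );
}

my $sample = "\n   100 Main St.\n   Wheatland, CA 95692\n   \n";
print join( '|', "Jim's Cafe", split_address($sample) ), "\n";
# prints: Jim's Cafe|100 Main St.|Wheatland|CA|95692
```

Working line by line keeps the pattern for the city line independent of how many blank lines or trailing spaces the HTML happens to contain.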


Author Comment

by:binovpd
Thank you, lwadwell, that did the trick. I appreciate the sample, although I'll probably load the files directly, since I have to work through a directory of these HTML files.
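
For a whole directory of files, one option is to keep the file handling separate from the per-file extraction. The following is a minimal sketch under stated assumptions: the input files end in .html or .htm, and a caller-supplied callback returns the rows for one file as array references. The name `convert_directory` and its parameters are hypothetical and do not come from the thread.

```perl
use strict;
use warnings;

# Run $extract on every .html/.htm file in $dir and append the rows it
# returns to $out_file, one pipe-delimited line per row.
# $extract receives a file path and returns a list of array refs.
# Returns the number of rows written.
sub convert_directory {
    my ( $dir, $out_file, $extract ) = @_;

    opendir my $dh, $dir or die "Cannot read directory $dir: $!";
    my @names = sort grep { /\.html?$/i } readdir $dh;
    closedir $dh;

    # Open the output once, in append mode, instead of once per row.
    open my $out, '>>', $out_file or die "Cannot open $out_file: $!";
    my $count = 0;
    for my $name (@names) {
        for my $row ( $extract->("$dir/$name") ) {
            print {$out} join( '|', @$row ), "\n";
            $count++;
        }
    }
    close $out or die "Cannot close $out_file: $!";
    return $count;
}
```

The callback would wrap the HTML::TokeParser loop from the accepted solution: create the parser with `HTML::TokeParser->new($file)` and return one array reference of fields per store.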

Author Closing Comment

by:binovpd
Did the trick, thanks lwadwell.

Author Comment

by:binovpd
Here are the slight changes to read from a file and append the results to an output file (test2.dat).

use strict;
use HTML::TokeParser;

## Parse the HTML file directly
my $root = HTML::TokeParser->new( "test2.html" )
	or die "Cannot open test2.html: $!";
$root->empty_element_tags(1);    # configure its behaviour

## Open the output file once, in append mode
open( my $out, '>>', 'test2.dat' ) or die "Cannot open test2.dat: $!";

## This is all a bit ugly, determining the tags to anchor
## the retrieval off was difficult given the poor html structure
while ( my $t    = $root->get_tag("b") ) {
	my $name  = $root->get_text("/b");
	$name     =~ s/^\s+|\s+$//g;
	my $addr  = $root->get_text("a");
	$addr     =~ s/\(.*?\)//g;       # drop parenthesised notes ...
	$addr     =~ s/[()]//g;          # ... and any stray parentheses
	$addr     =~ s/^\s+|\s+$//g;     # trim the whole block
	$addr     =~ s/\s*\n\s*/|/g;     # line breaks become "|" separators
	my ($f1, $f2, $f3, $f4) = ($addr, "", "", "");
	if ( $addr =~ /^(.+)\|(.+),\s+(.+)\s+(\d+)$/ ) {
		$f1 = $1;
		$f2 = $2;
		$f3 = $3;
		$f4 = $4;
	}
	print "$name|$f1|$f2|$f3|$f4\n";
	print {$out} "$name|$f1|$f2|$f3|$f4\n";
}
close($out);
