Solved

Move data from HTML to a pipe-delimited file

Posted on 2009-07-09
Last Modified: 2012-05-07
Hi all,

I have a bunch of single HTML files that I need to look through, extract some data from, and put that data into a pipe-delimited file. I'm looking for any suggestions, whether it's using Perl, DOS batch scripts, etc.

The HTML for each page looks something like the snippet at the end of this post (ugly, yes, I know).

The data is not in individual cells and rows, which makes it a bit more difficult. From the code example I want to pull out all the store listings and put them into a pipe-delimited file like so:

Store|Address|City|State|Zip
Joes Bakery|Walmart Center|San Diego|CA|92101
Jim's Cafe|100 Main St.|Wheatland|CA|95692

or, if this is easier:
Store|Address
Joes Bakery|Walmart Center San Diego, CA 92101
Jim's Cafe|100 Main St. Wheatland, CA 95692

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

<head>

</head>

<body>

<table border="0" cellpadding="0" cellspacing="0" width="100%" height="90%" align="center">

  <tr>

    <td align="center">

      <table border="0" cellpadding="0" cellspacing="0" width="590" align="center">

        <tr>

          <td>

            <p />

            <font size=2><a href="http://www.site.com"><b>Joes Bakery</b></a><br />

            Walmart Center<br />

            San Diego, CA 92101<br />

            <p />

            <font size=2><a href="http://www.site.com"><b>Jim's Cafe</b></a><br />

            100 Main St.<br />

            Wheatland, CA 95692<br />

            <p />

            </font><br clear="all" />

            <hr size="1" width="100%" color="#000000" noshade />

            <p /></td>

        </tr>

      </table></td>

  </tr>

</table>

</body>

</html>

Question by:binovpd
LVL 25

Accepted Solution

by:lwadwell (earned 125 total points)
ID: 24818892
Hi binovpd,

I would choose Perl. I have attached a small sample script below that uses the Perl module HTML::TokeParser to scan the HTML.


lwadwell
use strict;
use warnings;
use HTML::TokeParser;

## READ the sample HTML (after __DATA__) into a variable ...
## HTML::TokeParser can easily read from a file so this
## can be removed ... this was done purely for testing.
my $html;
while ( my $l = <DATA> ) {
	$html .= $l;
}

## Parse the HTML ... replace "\$html" with the filename
my $root = HTML::TokeParser->new( \$html );
$root->empty_element_tags(1);    # recognise XML-style empty tags such as <p /> and <br />

## This is all a bit ugly; determining the tags to anchor
## the retrieval on was difficult given the poor HTML structure.
## Each <b>...</b> holds a store name; the text from there up to the
## next <a> start tag is that store's address.
while ( my $t    = $root->get_tag("b") ) {
	my $name  = $root->get_text("/b");
	$name     =~ s/^\s+|\s+$//g;          # trim the name
	my $addr  = $root->get_text("a");
	$addr     =~ s/^\s+|\s+$/|/gm;        # whitespace at line edges -> "|"
	$addr     =~ s/^\||\|$//g;            # drop the leading and trailing "|"
	$addr     =~ s/\|\|/|/g;              # collapse "||" left by blank lines
	## Default: whole address in field 1, unless it ends in "City, ST 12345"
	my ($f1, $f2, $f3, $f4) = ($addr, "", "", "");
	if ( $addr =~ /^(.+)\|(.+),\s+(.+)\s+(\d+)$/ ) {
		$f1 = $1;    # street / location
		$f2 = $2;    # city
		$f3 = $3;    # state
		$f4 = $4;    # zip
	}
	print "$name|$f1|$f2|$f3|$f4\n";
}
__DATA__
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

<head>

</head>

<body>

<table border="0" cellpadding="0" cellspacing="0" width="100%" height="90%" align="center">

  <tr>

    <td align="center">

      <table border="0" cellpadding="0" cellspacing="0" width="590" align="center">

        <tr>

          <td>

            <p />

            <font size=2><a href="http://www.site.com"><b>Joes Bakery</b></a><br />

            Walmart Center<br />

            San Diego, CA 92101<br />

            <p />

            <font size=2><a href="http://www.site.com"><b>Jim's Cafe</b></a><br />

            100 Main St.<br />

            Wheatland, CA 95692<br />

            <p />

            </font><br clear="all" />

            <hr size="1" width="100%" color="#000000" noshade />

            <p /></td>

        </tr>

      </table></td>

  </tr>

</table>

</body>

</html>
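
A note on the address clean-up in the script above: the three substitutions that turn whitespace into "|" appear to rely on the blank lines between the source lines of these pages (that is where the "||" pairs come from). On single-spaced HTML a newline would stay inside the address and the City/State/Zip match would not fire. Below is a minimal sketch of a variant that splits on line breaks instead, so either spacing should work. It is not part of lwadwell's script: `split_address` is a hypothetical helper name, and the "City, ST 12345" pattern assumes US-style addresses like the two in the sample.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): turn the raw text that
# follows a store name into (address, city, state, zip).
sub split_address {
    my ($raw) = @_;
    $raw =~ s/^\s+|\s+$//g;                  # trim both ends
    my @lines = split /\s*\n\s*/, $raw;      # one element per non-blank line
    my $last  = pop @lines;
    $last = '' unless defined $last;

    # Assumption: the last line looks like "City, ST 12345" (or ZIP+4)
    if ( $last =~ /^(.+?),\s+(\S+)\s+(\d{5}(?:-\d{4})?)$/ ) {
        return ( join( ' ', @lines ), $1, $2, $3 );
    }

    # Fallback: no such line found, keep everything in the first field
    return ( join( ' ', @lines, $last ), '', '', '' );
}

# Blank lines between source lines, or not: both give the same fields
for my $raw ( " \n\n   Walmart Center \n\n   San Diego, CA 92101 \n\n ",
              "\n  100 Main St.\n  Wheatland, CA 95692\n" ) {
    print join( '|', split_address($raw) ), "\n";
}
# prints:
#   Walmart Center|San Diego|CA|92101
#   100 Main St.|Wheatland|CA|95692
```

Inside the loop of the script above, this could stand in for the substitutions on `$addr`, for example `my ($f1, $f2, $f3, $f4) = split_address( $root->get_text("a") );`.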


Author Comment

by:binovpd
ID: 24819434
Thank you lwadwell, that did the trick. I appreciate the sample, although I'll probably load the files directly since I have to go through a directory of these HTML files.
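
Since the files live in one directory, the parsing can be wrapped in a function and called once per file. The following is a rough sketch, assuming the files end in .html or .htm and sit directly in a directory passed on the command line. `stores_in_file` is a hypothetical name, and the output here is the simpler Store|Address form from the question.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return [name, raw address text] pairs from one file,
# using the same anchors as the sample script (<b> for the name, the next
# <a> start tag as the end of the address).
sub stores_in_file {
    my ($file) = @_;
    my $p = HTML::TokeParser->new($file)
        or die "Cannot open $file: $!";
    $p->empty_element_tags(1);
    my @stores;
    while ( $p->get_tag("b") ) {
        my $name = $p->get_text("/b");
        $name =~ s/^\s+|\s+$//g;
        push @stores, [ $name, $p->get_text("a") ];
    }
    return @stores;
}

my $dir = shift @ARGV;    # directory holding the HTML files (assumed layout)
if ( defined $dir ) {
    opendir( my $dh, $dir ) or die "Cannot open directory $dir: $!";
    my @files = sort grep { /\.html?$/i } readdir $dh;
    closedir $dh;

    for my $file (@files) {
        for my $store ( stores_in_file("$dir/$file") ) {
            my ( $name, $addr ) = @$store;
            $addr =~ s/^\s+|\s+$//g;     # trim
            $addr =~ s/\s*\n\s*/ /g;     # join the address lines with a space
            print "$name|$addr\n";
        }
    }
}
```

Redirecting the output (`perl script.pl somedir > stores.dat`, where both names are placeholders) would collect every file's rows in one pipe-delimited file.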

Author Closing Comment

by:binovpd
ID: 31601856
Did the trick, thanks lwadwell.

Author Comment

by:binovpd
ID: 24837855
Here are the slight changes to print the results and also append them to a file (test2.dat).

use strict;
use warnings;
use HTML::TokeParser;

## Parse the HTML straight from the file
my $root = HTML::TokeParser->new( "test2.html" )
	or die "Cannot open test2.html: $!";
$root->empty_element_tags(1);    # configure its behaviour

## Open the output file once, in append mode
open( my $out, '>>', 'test2.dat' ) or die "Cannot open test2.dat: $!";

## This is all a bit ugly, determining the tags to anchor
## the retrieval off was difficult given the poor html structure
while ( my $t    = $root->get_tag("b") ) {
	my $name  = $root->get_text("/b");
	$name     =~ s/^\s+|\s+$//g;
	my $addr  = $root->get_text("a");
	$addr     =~ s/\(.*?\)//g;            # drop parenthesised text
	$addr     =~ s/\(//g;                 # drop any stray "("
	$addr     =~ s/\)//g;                 # drop any stray ")"
	$addr     =~ s/^\s+|\s+$/|/gm;
	$addr     =~ s/^\||\|$//g;
	$addr     =~ s/\|\|/|/g;
	my ($f1, $f2, $f3, $f4) = ($addr, "", "", "");
	if ( $addr =~ /^(.+)\|(.+),\s+(.+)\s+(\d+)$/ ) {
		$f1 = $1;
		$f2 = $2;
		$f3 = $3;
		$f4 = $4;
	}
	## no effect here: $addr has no parentheses left at this point
	$f1 =~ s/\(</b>\//;
	$f1 =~ s/\)</b>\//;
	print "$name|$f1|$f2|$f3|$f4\n";            # to the screen
	print {$out} "$name|$f1|$f2|$f3|$f4\n";     # and appended to test2.dat
}

close($out) or die "Cannot close test2.dat: $!";
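
The output format in the question starts with a header row, which neither version of the script prints. Below is a minimal sketch of one way to add it when appending: write the header only if the file is missing or empty. `append_rows` is a hypothetical helper, and the column names are taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: append rows (array refs of fields) to a pipe-delimited
# file, writing the header line first when the file is new or empty.
sub append_rows {
    my ( $file, @rows ) = @_;
    my $need_header = !( -s $file );    # true if missing or zero length
    open( my $out, '>>', $file ) or die "Cannot open $file: $!";
    print {$out} "Store|Address|City|State|Zip\n" if $need_header;
    print {$out} join( '|', @$_ ), "\n" for @rows;
    close($out) or die "Cannot close $file: $!";
    return;
}

# Example call (commented out so that loading this file writes nothing):
# append_rows( 'test2.dat',
#     [ 'Joes Bakery', 'Walmart Center', 'San Diego', 'CA', '92101' ] );
```

Collecting the rows in an array inside the loop and calling this once at the end would also keep the file from being reopened for every store.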
