Move data from HTML to a pipe-delimited file

Posted on 2009-07-09
Last Modified: 2012-05-07
Hi all,

I have a bunch of single HTML files that I need to look through, extract some data from, and put into a pipe-delimited file. I'm looking for any suggestions, whether it's using Perl, DOS batch scripts, etc.

The HTML for each page looks something like the sample at the end of this post (ugly, yes, I know).


The data is not in individual cells and rows, which makes it a bit more difficult. In the code example I would want to pull out all the store listings and put them into a pipe-delimited file like so:

Store|Address|City|State|Zip
Joes Bakery|Walmart Center|San Diego|CA|92101
Jim's Cafe|100 Main St.|Wheatland|CA|95692

or if this is easier
Store|Address
Joes Bakery|Walmart Center San Diego, CA 92101
Jim's Cafe|100 Main St. Wheatland, CA 95692
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>
<table border="0" cellpadding="0" cellspacing="0" width="100%" height="90%" align="center">
  <tr>
    <td align="center">
      <table border="0" cellpadding="0" cellspacing="0" width="590" align="center">
        <tr>
          <td>
            <p />
            <font size=2><a href="http://www.site.com"><b>Joes Bakery</b></a><br />
            Walmart Center<br />
            San Diego, CA 92101<br />
            <p />
            <font size=2><a href="http://www.site.com"><b>Jim's Cafe</b></a><br />
            100 Main St.<br />
            Wheatland, CA 95692<br />
            <p />
            </font><br clear="all" />
            <hr size="1" width="100%" color="#000000" noshade />
            <p /></td>
        </tr>
      </table></td>
  </tr>
</table>
</body>
</html>

Question by:binovpd
Accepted Solution

by:
lwadwell earned 500 total points
ID: 24818892
Hi binovpd,

I would choose Perl. I have attached a small sample script below that uses the Perl module HTML::TokeParser to scan the HTML more reliably.


lwadwell
use strict;
use warnings;
use HTML::TokeParser;
## Read the sample HTML into a variable ...
## HTML::TokeParser can read straight from a file, so this
## step can be removed; it is here purely for testing.
my $html;
while ( my $l = <DATA> ) {
	$html .= $l;
} 
## Parse the HTML ... replace "\$html" with the filename
my $root = HTML::TokeParser->new( \$html );
$root->empty_element_tags(1);    # configure its behaviour 
## This is all a bit ugly, determining the tags to anchor
## the retrieval off was difficult given the poor html structure
while ( my $t    = $root->get_tag("b") ) {
	my $name  = $root->get_text("/b");
	$name     =~ s/^\s+|\s+$//g;
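	my $addr  = $root->get_text("a");
	$addr     =~ s/\s*\n\s*/|/g;          # each line break becomes a field separator
	$addr     =~ s/^[\s|]+|[\s|]+$//g;    # drop leading/trailing separators and whitespace
	$addr     =~ s/\|{2,}/|/g;            # collapse any doubled separators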
	my ($f1, $f2, $f3, $f4) = ($addr, "", "", "");
	if ( $addr =~ /^(.+)\|(.+),\s+(.+)\s+(\d+)$/ ) {
		$f1 = $1;
		$f2 = $2;
		$f3 = $3;
		$f4 = $4;
	} 
	print "$name|$f1|$f2|$f3|$f4\n";
} 
__DATA__
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>
<table border="0" cellpadding="0" cellspacing="0" width="100%" height="90%" align="center">
  <tr>
    <td align="center">
      <table border="0" cellpadding="0" cellspacing="0" width="590" align="center">
        <tr>
          <td>
            <p />
            <font size=2><a href="http://www.site.com"><b>Joes Bakery</b></a><br />
            Walmart Center<br />
            San Diego, CA 92101<br />
            <p />
            <font size=2><a href="http://www.site.com"><b>Jim's Cafe</b></a><br />
            100 Main St.<br />
            Wheatland, CA 95692<br />
            <p />
            </font><br clear="all" />
            <hr size="1" width="100%" color="#000000" noshade />
            <p /></td>
        </tr>
      </table></td>
  </tr>
</table>
</body>
</html>
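
The field splitting is the fiddly part, so here is a minimal standalone sketch of just that step, using core Perl only. The sub name `split_address` is made up for illustration and is not part of the script above. It assumes the last non-empty line of each listing looks like "City, ST 12345" and that any earlier lines belong to the street address; anything that does not fit is left whole in the address field.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): turn the raw text
# that follows a store name into ($address, $city, $state, $zip).
sub split_address {
    my ($text) = @_;

    # Trim every line and keep only the ones that contain something.
    my @lines;
    for my $line ( split /\n/, $text ) {
        $line =~ s/^\s+|\s+$//g;
        push @lines, $line if length $line;
    }
    return ( "", "", "", "" ) unless @lines;

    # Assumption: the last line has the form "City, ST 12345" (or ZIP+4).
    my $last = pop @lines;
    if ( $last =~ /^(.+?),\s*([A-Za-z]{2})\s+(\d{5}(?:-\d{4})?)$/ ) {
        return ( join( " ", @lines ), $1, $2, $3 );
    }

    # Fallback: no recognisable city/state/zip, keep everything as the address.
    return ( join( " ", @lines, $last ), "", "", "" );
}

my $raw = "\n   100 Main St.\n   Wheatland, CA 95692\n   \n";
print join( "|", "Jim's Cafe", split_address($raw) ), "\n";
```

Joining extra address lines with a space rather than a pipe keeps every record at exactly five fields, which is usually what you want when the file is loaded elsewhere.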


Author Comment

by:binovpd
ID: 24819434
Thank you lwadwell, that did the trick. I appreciate the sample, although I'll probably load the files directly since I have to loop through a directory of these HTML files.
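
For the directory case, something along these lines might work as a starting point. It is only a sketch under a few assumptions: `process_dir` is a made-up name, the files are assumed to end in .html or .htm, and the per-file parsing is assumed to be wrapped in a sub that takes a filename and returns one array ref per store instead of printing.

```perl
use strict;
use warnings;

# Hypothetical driver: run an extractor over every .html/.htm file in a
# directory and append the records it returns to one pipe-delimited file.
# $extract is a code ref that takes a filename and returns a list of
# array refs, one per store: [ $name, $address, $city, $state, $zip ].
sub process_dir {
    my ( $dir, $out_file, $extract ) = @_;

    # Collect the file names first so they are processed in a stable order.
    opendir( my $dh, $dir ) or die "Cannot read directory $dir: $!";
    my @names = sort grep { /\.html?$/i } readdir($dh);
    closedir($dh);

    # Open the output once, in append mode, for the whole run.
    open( my $out, '>>', $out_file ) or die "Cannot open $out_file: $!";
    my $count = 0;
    for my $name (@names) {
        for my $rec ( $extract->("$dir/$name") ) {
            print {$out} join( "|", @$rec ), "\n";
            $count++;
        }
    }
    close($out) or die "Cannot close $out_file: $!";

    return $count;    # number of records written
}
```

With the while loop from the accepted answer moved into a sub (say `extract_stores`, again a made-up name) that builds its parser with HTML::TokeParser->new($file) and pushes [ $name, $f1, $f2, $f3, $f4 ] onto a list, the call would be process_dir( "pages", "stores.dat", \&extract_stores ).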

Author Closing Comment

by:binovpd
ID: 31601856
Did the trick, thanks lwadwell.

Author Comment

by:binovpd
ID: 24837855
Here are the slight changes to output the results and append them to a file (test2.dat).

use strict;
use warnings;
use HTML::TokeParser;
## Parse the HTML straight from the file
my $root = HTML::TokeParser->new( "test2.html" )
	or die "Cannot open test2.html: $!";
$root->empty_element_tags(1);    # configure its behaviour
## Open the output file once, in append mode, rather than once per record
open( my $out, '>>', 'test2.dat' ) or die "Cannot open test2.dat: $!";
## This is all a bit ugly, determining the tags to anchor
## the retrieval off was difficult given the poor html structure
while ( my $t    = $root->get_tag("b") ) {
	my $name  = $root->get_text("/b");
	$name     =~ s/^\s+|\s+$//g;
	my $addr  = $root->get_text("a");
	$addr     =~ s/\(.*?\)//g;            # drop parenthesised notes
	$addr     =~ s/[()]//g;               # drop any unmatched parentheses
	$addr     =~ s/\s*\n\s*/|/g;          # each line break becomes a field separator
	$addr     =~ s/^[\s|]+|[\s|]+$//g;    # drop leading/trailing separators and whitespace
	$addr     =~ s/\|{2,}/|/g;            # collapse any doubled separators
	my ($f1, $f2, $f3, $f4) = ($addr, "", "", "");
	if ( $addr =~ /^(.+)\|(.+),\s+(.+)\s+(\d+)$/ ) {
		$f1 = $1;
		$f2 = $2;
		$f3 = $3;
		$f4 = $4;
	}
	print "$name|$f1|$f2|$f3|$f4\n";
	print {$out} "$name|$f1|$f2|$f3|$f4\n";
}
close($out) or die "Cannot close test2.dat: $!";
