move data from html to a pipe delimited file

Posted on 2009-07-09
Last Modified: 2012-05-07
Hi all,

I have a bunch of single HTML files that I need to look through to extract some data and put it into a pipe-delimited file. I'm looking for any suggestions, whether it's using Perl, DOS batch scripts, etc.

The HTML for each page looks something like the sample at the end of this post (ugly, yes, I know).

The data is not in individual cells and rows, which makes it a bit more difficult. In the code example I would want to pull out all the store listings and put them into a pipe-delimited file like so:

Store|Address|City|State|Zip
Joes Bakery|Walmart Center|San Diego|CA|92101
Jim's Cafe|100 Main St.|Wheatland|CA|95692

or, if this is easier:
Store|Address
Joes Bakery|Walmart Center San Diego, CA 92101
Jim's Cafe|100 Main St. Wheatland, CA 95692
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>
<table border="0" cellpadding="0" cellspacing="0" width="100%" height="90%" align="center">
  <tr>
    <td align="center">
      <table border="0" cellpadding="0" cellspacing="0" width="590" align="center">
        <tr>
          <td>
            <p />
            <font size=2><a href="http://www.site.com"><b>Joes Bakery</b></a><br />
            Walmart Center<br />
            San Diego, CA 92101<br />
            <p />
            <font size=2><a href="http://www.site.com"><b>Jim's Cafe</b></a><br />
            100 Main St.<br />
            Wheatland, CA 95692<br />
            <p />
            </font><br clear="all" />
            <hr size="1" width="100%" color="#000000" noshade />
            <p /></td>
        </tr>
      </table></td>
  </tr>
</table>
</body>
</html>
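
The five-column layout requested above can be sketched in plain Perl once the store name, the street line, and the "City, ST 12345" line have been pulled out of the page. This is only an illustration: `format_record` is a hypothetical helper name, and the two-letter state and five-digit ZIP pattern is an assumption based on the two sample listings, not something the real pages are known to guarantee.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build one pipe-delimited
# record from a store name, a street line, and a "City, ST 12345" line.
sub format_record {
    my ( $name, $street, $locality ) = @_;
    my ( $city, $state, $zip ) = ( '', '', '' );

    # Assumed layout: city, comma, two-letter state, 5-digit ZIP (or ZIP+4).
    if ( $locality =~ /^\s*(.+?)\s*,\s*([A-Za-z]{2})\s+(\d{5}(?:-\d{4})?)\s*$/ ) {
        ( $city, $state, $zip ) = ( $1, uc $2, $3 );
    }
    else {
        # Unrecognised layout: keep the raw text in the city column
        # so nothing is silently lost.
        $city = $locality;
    }
    return join '|', $name, $street, $city, $state, $zip;
}

print "Store|Address|City|State|Zip\n";
print format_record( "Joes Bakery", "Walmart Center", "San Diego, CA 92101" ), "\n";
print format_record( "Jim's Cafe",  "100 Main St.",   "Wheatland, CA 95692" ), "\n";
```

Falling back to the raw text when the pattern does not match keeps odd addresses visible in the output, so they can be fixed by hand rather than dropped.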

Question by:binovpd

Accepted Solution

by:
lwadwell earned 500 total points
ID: 24818892
Hi binovpd,

I would choose Perl.  I have attached a small sample script below that uses the Perl module HTML::TokeParser to scan the HTML more reliably.


lwadwell
use strict;
use HTML::TokeParser; 
## READ the file into a variable ...
## HTML::TokeParser can easily read from a file so this 
## can be removed ... this was done purely for testing.
my $html;
while ( my $l = <DATA> ) {
	$html .= $l;
} 
## Parse the HTML ... replace "\$html" with the filename
my $root = HTML::TokeParser->new( \$html );
$root->empty_element_tags(1);    # configure its behaviour 
## This is all a bit ugly, determining the tags to anchor
## the retrieval off was difficult given the poor html structure
while ( my $t    = $root->get_tag("b") ) {
	my $name  = $root->get_text("/b");
	$name     =~ s/^\s+|\s+$//g;
	my $addr  = $root->get_text("a");
	$addr     =~ s/^\s+|\s+$/|/gm;
	$addr     =~ s/^\||\|$//g;
	$addr     =~ s/\|\|/|/g;
	my ($f1, $f2, $f3, $f4) = ($addr, "", "", "");
	if ( $addr =~ /^(.+)\|(.+),\s+(.+)\s+(\d+)$/ ) {
		$f1 = $1;
		$f2 = $2;
		$f3 = $3;
		$f4 = $4;
	} 
	print "$name|$f1|$f2|$f3|$f4\n";
} 
__DATA__
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
</head>
<body>
<table border="0" cellpadding="0" cellspacing="0" width="100%" height="90%" align="center">
  <tr>
    <td align="center">
      <table border="0" cellpadding="0" cellspacing="0" width="590" align="center">
        <tr>
          <td>
            <p />
            <font size=2><a href="http://www.site.com"><b>Joes Bakery</b></a> 
            Walmart Center 
            San Diego, CA 92101 
            <p />
            <font size=2><a href="http://www.site.com"><b>Jim's Cafe</b></a> 
            100 Main St. 
            Wheatland, CA 95692 
            <p />
            </font><br clear="all" />
            <hr size="1" width="100%" color="#000000" noshade />
            <p /></td>
        </tr>
      </table></td>
  </tr>
</table>
</body>
</html>


Author Comment

by:binovpd
ID: 24819434
Thank you lwadwell, that did the trick. I appreciate the sample, although I'll probably load the files directly since I have to loop through a directory of these HTML files.
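
Looping over a directory of HTML files, as mentioned above, could be sketched along these lines. The names `html_files` and `process_directory` are hypothetical, and the per-file extraction is passed in as a code reference so that the parsing logic from the accepted answer can be plugged in; the sketch itself uses core Perl only.

```perl
use strict;
use warnings;

# Hypothetical helper: return the .htm/.html files in a directory,
# sorted by name, as paths that include the directory.
sub html_files {
    my ($dir) = @_;
    opendir( my $dh, $dir ) or die "Cannot open directory $dir: $!";
    my @files = sort map { "$dir/$_" } grep { /\.html?$/i } readdir $dh;
    closedir $dh;
    return @files;
}

# Hypothetical driver: call $extract on each file and append every
# record it returns to one output file, which is opened only once.
sub process_directory {
    my ( $dir, $out_path, $extract ) = @_;
    open( my $out, '>>', $out_path ) or die "Cannot open $out_path: $!";
    my $count = 0;
    for my $file ( html_files($dir) ) {
        for my $record ( $extract->($file) ) {
            print {$out} "$record\n";
            $count++;
        }
    }
    close($out) or die "Cannot close $out_path: $!";
    return $count;    # number of records written
}

# Example call (the directory, output file, and extract_stores() are
# placeholders for your own names):
# process_directory( 'pages', 'stores.dat', \&extract_stores );
```

Here the extraction routine is expected to take a filename and return a list of ready-made pipe-delimited lines, which keeps the directory handling separate from the HTML parsing.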

Author Closing Comment

by:binovpd
ID: 31601856
Did the trick, thanks lwadwell.

Author Comment

by:binovpd
ID: 24837855
Here are the slight changes I made to output and append the results to a file (test2.dat).

use strict;
use warnings;
use HTML::TokeParser;
## Parse the HTML directly from the file
my $root = HTML::TokeParser->new( "test2.html" )
	or die "Cannot open test2.html: $!";
$root->empty_element_tags(1);    # configure its behaviour
## Open the output file once, in append mode, rather than for every record
open( my $out, '>>', 'test2.dat' ) or die "Cannot open test2.dat: $!";
## This is all a bit ugly, determining the tags to anchor
## the retrieval off was difficult given the poor html structure
while ( my $t    = $root->get_tag("b") ) {
	my $name  = $root->get_text("/b");
	$name     =~ s/^\s+|\s+$//g;
	my $addr  = $root->get_text("a");
	$addr     =~ s/\(.*?\)//g;       # drop parenthesised notes
	$addr     =~ s/[()]//g;          # drop any stray parentheses
	$addr     =~ s/^\s+|\s+$/|/gm;
	$addr     =~ s/^\||\|$//g;
	$addr     =~ s/\|\|/|/g;
	my ($f1, $f2, $f3, $f4) = ($addr, "", "", "");
	if ( $addr =~ /^(.+)\|(.+),\s+(.+)\s+(\d+)$/ ) {
		$f1 = $1;
		$f2 = $2;
		$f3 = $3;
		$f4 = $4;
	}
	print "$name|$f1|$f2|$f3|$f4\n";           # echo to the screen
	print {$out} "$name|$f1|$f2|$f3|$f4\n";    # append to test2.dat
}
close($out) or die "Cannot close test2.dat: $!";
