convert HTML table to .csv
Posted on 2002-04-10
I need to write a script to convert the contents of an HTML table to a .csv file, which will then be imported into an Excel spreadsheet. So far, thanks to some help from the good folks at this site, I have a way to extract the table from the rest of the content on the web page. What I need now is a way to convert this to comma-separated values.
My plan has been to use regular expressions to convert all of the </TD> tags to commas, all of the </TR> tags to newlines, and strip away all other tags, leaving a .csv file. I started writing regular expressions to remove one tag at a time, sort of like this:
# turn the </TR> tag into a newline...
($in_file =~ s/\<\/TR\>/\n/gi);
# turn the </TD> tag into a comma...
($in_file =~ s/\<\/TD\>/,/gi);
# strip out the <TR> tag...
($in_file =~ s/\<TR\>//gi);
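# strip out the <TD> tags, with or without attributes...
# ([^>]* stops at the first '>', so it cannot swallow the cell data that follows)
($in_file =~ s/\<TD\b[^>]*\>//gi);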
# strip out the <P> tags...
($in_file =~ s/\<P\>//gi);
but to me this just seems awkward and wrong. So, is there a granddaddy of all regular expressions to parse out all HTML tags, leaving only the data? It seems that if I keep the first two substitutions above, all I have to do is strip out all of the other tags.
I'm looking for suggestions as to how I can best accomplish this. One qualification: I expect there is a module to help parse HTML, but I can only use modules which come with a standard perl5.005_03 build, as I don't have sysadmin rights here at work to add modules to perl.
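For illustration, here is a minimal sketch of the plan described above, using only core Perl and no extra modules. It is a variation on the substitutions shown earlier rather than a definitive answer: instead of rewriting the closing tags in place, it pulls out each row and cell, then applies one catch-all substitution, s/<[^>]*>//g, to remove whatever tags remain inside a cell. The sub name table_to_csv is hypothetical. The sketch assumes every <TR> and <TD> has a matching closing tag, that there are no nested tables, and that no attribute value contains a '>' character. Only the &nbsp; and &amp; entities are decoded.

```perl
use strict;

# Hypothetical helper: takes the HTML of one table, returns CSV text.
sub table_to_csv {
    my ($html) = @_;
    my @rows;

    # Walk through each <TR>...</TR> block (case-insensitive, may span lines).
    while ($html =~ m{<TR\b[^>]*>(.*?)</TR\s*>}gis) {
        my $row = $1;
        my @cells;

        # Within a row, pick up each <TD> or <TH> cell.
        while ($row =~ m{<T[DH]\b[^>]*>(.*?)</T[DH]\s*>}gis) {
            my $cell = $1;
            $cell =~ s/<[^>]*>//g;       # catch-all: strip any remaining tags
            $cell =~ s/&nbsp;/ /gi;      # decode the two most common entities
            $cell =~ s/&amp;/&/gi;
            $cell =~ s/\s+/ /g;          # collapse runs of whitespace
            $cell =~ s/^ //;
            $cell =~ s/ $//;

            # Quote cells containing a comma or a double quote,
            # doubling any embedded quotes, so Excel keeps them as one field.
            if ($cell =~ /[",]/) {
                $cell =~ s/"/""/g;
                $cell = qq{"$cell"};
            }
            push @cells, $cell;
        }
        push @rows, join(',', @cells);
    }
    return join('', map { "$_\n" } @rows);
}
```

Assuming $in_file already holds the extracted table as a single string, as in the snippets above, usage would be along the lines of print table_to_csv($in_file); with the output redirected to a .csv file. Quoting the cells is the main reason for going cell by cell: a plain </TD>-to-comma substitution would split a value such as "Widget, large" into two columns.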