LMuadDIb
asked on
Dealing with HTML tags in a delimited text file
Hello all,
I'm trying to parse HTML text into a delimited TStringList, which can be imported into a database or other applications, and I'm wondering what the best way to handle it is. I cannot use quotes since I want to keep the HTML tags intact, but I have been looking into escaping/unescaping the HTML tags.
Example of "escaping": turn "<td class="no">" into "&lt;td class=&quot;no&quot;&gt;"
Or is it better to leave the HTML intact and just use different characters for the TStringList Delimiter and QuoteChar?
Will I run into trouble with databases (MySQL and/or MSSQL)?
Once the data is in a database and is later put into an HTML page, will there be problems displaying the escaped HTML?
Or will it have to be unescaped before it's put back into an HTML page?
Sorry this problem doesn't have an exact answer.
I will up the points if needed, especially if I get a good explanation of the whys and wherefores :)
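To make the escaping question concrete, here is a minimal sketch of an escape/unescape round trip. It uses Python's standard html module rather than Delphi, purely as an illustration; the variable names are made up for the example.

```python
import html

original = '<td class="no">'

# Escape <, > and &; with quote=True the quote characters are escaped too.
escaped = html.escape(original, quote=True)
print(escaped)

# Unescaping restores the original markup exactly.
restored = html.unescape(escaped)
assert restored == original
```

If the escaped string is written into a page as-is, the browser shows the tag as literal text instead of rendering it, so it would need to be unescaped first whenever it is meant to act as markup.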
ASKER
Well, I was thinking about a delimited text file for output...
Then I can easily import the data into different databases or an XML database from the delimited text file.
Think of an HTML table: each table row (tr) would be a line in the delimited text file, and each table cell (td) would be delimited and quoted on that line.
If I want table rows 2-4 and table cells 3, 7 and 8, I can easily parse the table by looping over the text file lines and using delimited text. Is there a better way to go about this?
My text file will not just hold tables though, but practically any HTML tag: DIV tags, list tags, form tags, etc.
I'm going to check out Base64 encoding/decoding.
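The row-per-line, cell-per-field idea can be sketched as follows. This uses Python's csv module as a stand-in for the delimited TStringList, and the sample rows are invented. It is only meant to show that quoted fields can still carry HTML containing quotes and commas, because the CSV convention doubles any embedded quote character.

```python
import csv
import io

# Hypothetical sample data: one inner list per table row, one string per cell.
rows = [
    ['<td class="no">1</td>', '<td>alpha, beta</td>', '<td>x</td>'],
    ['<td class="no">2</td>', '<td>gamma</td>', '<td>y</td>'],
    ['<td class="no">3</td>', '<td>delta</td>', '<td>z</td>'],
]

# Write every cell quoted; embedded quotes are doubled by the writer.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
writer.writerows(rows)
text = buffer.getvalue()

# Read the delimited text back and check nothing was lost.
parsed = list(csv.reader(io.StringIO(text), delimiter=',', quotechar='"'))
assert parsed == rows

# Pick rows 2-3 and cells 1 and 3 (1-based), as in the table example.
wanted_rows = parsed[1:3]
wanted_cells = [[row[i] for i in (0, 2)] for row in wanted_rows]
print(wanted_cells)
```

Whether a given importer (MySQL, MSSQL, or another tool) follows the same quote-doubling rule is something to check per tool.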
ASKER
I've got a question about Base64 encoding.
Is this a standard encoding/decoding across platforms?
If I encode the strings, will someone using .NET or a Linux system be able to decode them without knowing how I encoded them?
This might be a stupid question on my part, so try not to laugh at me lol =)
I am not laughing. It is perfectly normal to ask questions about something you don't know ;)
Base64 encoding/decoding is a standard (RFC 4648; the MIME variant is defined in RFC 2045). You can read more general info here: http://en.wikipedia.org/wiki/Base64
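As a rough illustration of the cross-platform point, here is a round trip in Python using only its standard base64 module. The sample string is made up, and UTF-8 is an assumption of the example, not something the thread specifies.

```python
import base64

html_text = '<td class="no">caf\u00e9</td>'

# Base64 works on bytes, so pick a text encoding explicitly (UTF-8 assumed here).
encoded = base64.b64encode(html_text.encode('utf-8')).decode('ascii')
print(encoded)

# Any standard Base64 decoder can reverse this; no key or secret is involved.
decoded = base64.b64decode(encoded).decode('utf-8')
assert decoded == html_text
```

The one thing the receiving side does need to know is which text encoding the bytes were in before they were Base64-encoded. The encoded output uses only letters, digits, "+", "/" and "=", so it cannot collide with quote characters or common delimiters.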
Why do you have to encode the HTML text at all? I see no reason to do this as it
won't make a difference to the database, or the TStringList, what the HTML Text is.
If I were you, I'd use the freeware FastHTML Parser at:
http://www.jazarsoft.com/main.php
This way, you can catch each tag in the OnFoundTag event
and place it in the TStringList. Be aware, though, that this control
removes the < and > from the tags.
Since his site is having problems, I suggest downloading it from Torry's:
http://www.torry.net/vcl/internet/html/jshtmpsr.zip
ASKER
The encoding is needed because I will use the HTML text in an XML file as well as in HTML web pages.
The data will be stored in a database, but at times in an XML file directly.
I built my own HTML parser component, but it's XML based.
It allows me a lot more control in parsing the HTML tags than a standard HTML parser, so instead of dealing with strings I parse by nodes.
I know it will not be the fastest, but the ease of use makes up for it.
And I'm working on a subcomponent that will provide basic output for it (a delimited text file).
I don't think he meant including formatted data in XML, but using it AS XML :) At least that is what I understood.
[quote]I will use the html text in a xml file as well as html web pages[/quote]
I read differently...
my bad :)
ASKER
Actually both :)
I will use the HTML as XML, but primarily the HTML will be inserted into an XML node.
Thanks for your time.
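For the case of putting HTML into an XML node, a small sketch with Python's xml.etree.ElementTree shows that an XML serializer typically escapes the text for you, and a parser hands back the original string. The element names "record" and "cell" are invented for the example.

```python
import xml.etree.ElementTree as ET

html_fragment = '<td class="no">1 & 2</td>'

# Store the HTML as the text content of a node; no manual escaping needed.
root = ET.Element('record')
cell = ET.SubElement(root, 'cell')
cell.text = html_fragment

# The serializer escapes <, > and & when writing the XML out.
xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)

# Parsing the XML again returns the HTML fragment unchanged.
parsed = ET.fromstring(xml_text)
assert parsed.find('cell').text == html_fragment
```

Escaping by hand before assigning the text would lead to double escaping here, so it is usually one or the other, not both.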
ASKER
.
Regarding loading the HTML text into a TStringList: do you really need to have delimited and quoted text in it? If so, you can try using characters that will not appear in the HTML, like #1 and #2 or any other non-printable character ;) (you can define them as constants so you do not hardcode them throughout your code).
Give us more details on why you need the quotes and delimiters in the TStringList; maybe there are better alternatives.
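The non-printable delimiter suggestion could look roughly like this. It is sketched in Python rather than Delphi, with "\x01" standing in for #1; the helper names join_cells and split_cells are hypothetical.

```python
FIELD_DELIMITER = "\x01"  # same character as Delphi's #1, kept in one constant


def join_cells(cells):
    """Join cells into one line, refusing values that would break the format."""
    for cell in cells:
        if FIELD_DELIMITER in cell or "\n" in cell or "\r" in cell:
            raise ValueError("cell contains the delimiter or a line break")
    return FIELD_DELIMITER.join(cells)


def split_cells(line):
    """Split one line back into its cells."""
    return line.split(FIELD_DELIMITER)


# Quotes, commas and semicolons in the HTML need no special handling.
cells = ['<td class="no">1</td>', "<td>a, b; c</td>", "<td>it's</td>"]
line = join_cells(cells)
assert split_cells(line) == cells
```

The trade-off is that the file is no longer a conventional CSV, so every consumer has to be told which delimiter character is in use, and HTML containing line breaks would still need separate handling.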